Isolating Cluster Jobs for Performance and Predictability
Brooks Davis
Abstract
The Aerospace Corporation operates a federally
funded research and development center in support
of national-security, civil and commercial space programs.
Many of our 2400+ engineers use a variety of
computing technologies to support their work. Applications
range from small models which are easily
handled by desktops to parameter studies involving
thousands of CPU hours and traditional, large-scale
parallel codes such as computational fluid dynamics
and molecular modeling applications. Our primary
resources used to support these large applications are
computing clusters. Our current primary cluster, the
Fellowship cluster, consists of 352 dual-processor nodes
with a total of 14xx cores. Two additional clusters,
beginning at 150 dual-processor nodes each, are being
constructed to augment Fellowship.
As in any multiuser computing environment with
limited resources, user competition for resources is a
significant burden. Users want everything they need
to do their job, right now. Unfortunately, other users
may need those resources at the same time. Thus,
systems to arbitrate this resource contention are necessary.
On Fellowship we have deployed the Sun Grid
Engine scheduler, which schedules batch jobs across
the nodes.
In the next section we discuss the performance problems
that can occur when sharing resources in a
high-performance computing cluster. We then discuss a
range of possibilities to address these problems,
explain the solutions we are investigating, and describe
our experiments with them. We conclude
with a discussion of future work.