Isolating Cluster Jobs for Performance and Predictability

Brooks Davis

Abstract

The Aerospace Corporation operates a federally funded research and development center in support of national-security, civil, and commercial space programs. Many of our 2400+ engineers use a variety of computing technologies to support their work. Applications range from small models that are easily handled by desktops, to parameter studies involving thousands of CPU hours, to traditional large-scale parallel codes such as computational fluid dynamics and molecular modeling applications. Our primary resources for supporting these large applications are computing clusters. Our current primary cluster, the Fellowship cluster, consists of 352 dual-processor nodes with a total of 14xx cores. Two additional clusters, beginning at 150 dual-processor nodes each, are being constructed to augment Fellowship.

As in any multiuser computing environment with limited resources, user competition for resources is a significant burden. Users want everything they need to do their job, right now. Unfortunately, other users may need those resources at the same time. Thus, systems to arbitrate this resource contention are necessary. On Fellowship we have deployed the Sun Grid Engine scheduler, which schedules batch jobs across the nodes.

In the next section we discuss the performance problems that can occur when sharing resources in a high-performance computing cluster. We then discuss a range of possibilities for addressing these problems, explain the solutions we are investigating, and describe our experiments with them. We conclude with a discussion of future work.