Abstract
High-performance computing (HPC) workloads increasingly rely on loosely coupled, large-scale simulations. Unfortunately, most large-scale HPC platforms, including Cray/ALPS environments, are designed to execute long-running jobs using coarse-grained launch capabilities (e.g., one MPI rank per core on all allocated compute nodes). This design assumption limits capability-class workload campaigns that require large numbers of discrete or loosely coupled simulations, and for which time-to-solution becomes an untenable pacing issue. This paper describes the challenges of supporting the fine-grained launch capabilities needed to execute loosely coupled, large-scale simulations on Cray/ALPS platforms. More precisely, we present the details of an enhanced runtime system that supports this use case, and report initial results from early testing on systems at Oak Ridge National Laboratory.