
Optimizing Center Performance through Coordinated Data Staging, Scheduling and Recovery

Publication Type
Conference Paper
Publication Date
Conference Name
SuperComputing 2007
Conference Location
Reno, Nevada, United States of America

Procurement and optimized utilization of Petascale supercomputers and centers is a renewed national priority. Sustained
performance and availability of such large centers are key technical challenges that significantly impact their usability.
As recent research shows, storage systems can be a primary fault source leading to the unavailability of even today's
supercomputers. When job input data is unavailable, jobs are frequently resubmitted, reducing compute-center performance;
the problem is compounded by the lack of coordination between I/O activities and job scheduling.
In this work, we explore two mechanisms, namely the coordination of job scheduling with data staging/offloading, and
on-demand reconstruction of job input data, to address the availability of job input/output data and to improve center-wide
performance. Fundamental to both mechanisms is the efficient management of transient data: how it is scheduled
and recovered. Collectively, from a center standpoint, these techniques optimize resource usage and increase the center's
data/service availability. From a user job standpoint, they reduce job turnaround time and optimize the usage of allocated
time. We have implemented our approaches within commonly used supercomputer software tools such as the PBS
scheduler and the Lustre parallel file system. We gathered reconstruction performance data from a production supercomputer
environment using multiple data sources, and we conducted simulations based on the measured data recovery performance,
job traces, and staged-data logs from leadership-class supercomputer centers. Our results indicate that the average
waiting time of jobs is reduced, and the improvement grows significantly for larger jobs and as data is striped over more I/O
nodes.
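
To give a flavor of the first mechanism, the coordination of job scheduling with data staging/offloading, the following is a minimal, hypothetical sketch. It is not the paper's PBS implementation; the class and function names, times, and the two turnaround formulas are illustrative assumptions meant only to show why overlapping staging with a job's queue wait, rather than charging it to the compute allocation, shortens turnaround time and frees compute nodes.

```python
"""Illustrative sketch (not the paper's PBS-based implementation):
coordinating job scheduling with input-data staging.  A job is dispatched
to compute nodes only once its input data is resident on scratch, so
staging overlaps the job's queue wait instead of consuming its compute
allocation.  All names and numbers here are hypothetical."""

from dataclasses import dataclass


@dataclass
class Job:
    name: str
    stage_in_time: float   # hours to stage input data to scratch
    compute_time: float    # hours on compute nodes


def uncoordinated_turnaround(job: Job, queue_wait: float) -> float:
    """Staging begins only after the allocation starts: compute nodes sit
    occupied while the job's input data is still being transferred."""
    return queue_wait + job.stage_in_time + job.compute_time


def coordinated_turnaround(job: Job, queue_wait: float) -> float:
    """Staging is scheduled alongside the job and overlapped with its
    queue wait; compute nodes are held only for the compute phase."""
    return max(queue_wait, job.stage_in_time) + job.compute_time


if __name__ == "__main__":
    job = Job(name="climate-run", stage_in_time=1.5, compute_time=6.0)
    wait = 4.0  # hours spent waiting in the batch queue
    print("uncoordinated turnaround:", uncoordinated_turnaround(job, wait), "h")
    print("coordinated turnaround:  ", coordinated_turnaround(job, wait), "h")
```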
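The second mechanism, on-demand reconstruction of job input data, can be sketched in a similarly simplified way. The toy code below assumes a staged input file held as fixed-size stripes; if a stripe is lost before the job runs (for instance, because its I/O node fails), only that stripe is re-fetched from the original remote source instead of resubmitting the whole job. The stripe size, `fetch_from_source` helper, and in-memory representation are hypothetical stand-ins, not the paper's Lustre-level mechanism.

```python
"""Illustrative sketch (not the paper's Lustre-level mechanism): on-demand
reconstruction of staged job input data from its remote source copy."""

from typing import Optional

STRIPE_SIZE = 4  # bytes per stripe in this toy example


def fetch_from_source(source: bytes, index: int) -> bytes:
    """Stand-in for a partial re-transfer from the job's remote data source."""
    start = index * STRIPE_SIZE
    return source[start:start + STRIPE_SIZE]


def reconstruct(staged: list[Optional[bytes]], source: bytes) -> bytes:
    """Fill in any missing stripes from the source copy, then reassemble."""
    for i, stripe in enumerate(staged):
        if stripe is None:                      # stripe lost to a storage fault
            staged[i] = fetch_from_source(source, i)
    return b"".join(staged)


if __name__ == "__main__":
    source = b"ABCDEFGHIJKLMNOP"
    staged = [source[i:i + STRIPE_SIZE] for i in range(0, len(source), STRIPE_SIZE)]
    staged[2] = None                            # simulate a failed I/O node
    assert reconstruct(staged, source) == source
    print("input data reconstructed without resubmitting the job")
```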