Publication

Recovering Transient Data: Automated On-demand Data Reconstruction and Offloading for Supercomputers...

by Sudharshan S Vazhkudai, Xiaosong Ma
Publication Type
Journal
Journal Name
ACM SIGOPS Operating Systems Review
Publication Date
Page Numbers
14 to 18
Volume
41
Issue
1

It has become a national priority to build and use petaflop
supercomputers. The dependability of such large systems
has been recognized as a key issue that can impact their
usability. Even with smaller, existing machines, failures are
the norm rather than the exception. Research has shown
that storage systems are the primary source of faults leading
to supercomputer unavailability. In this paper, we envision
two mechanisms, namely on-demand data reconstruction
and eager data offloading, to address the availability of
job input/output data. These two techniques aim to allow
parallel jobs and post-job processing tools to continue execution
despite storage system failures in supercomputers. Fundamental
to both approaches is the definition and acquisition
of recovery-related parallel file system metadata, which
is then coupled with transparent remote data accesses. Our
approach attempts to maximize the utilization of precious
supercomputer resources by improving the accessibility of
transient job data. Further, the proposed methods are best-effort in nature and complement existing file system recovery
schemes, which are designed for persistent data. Several of
our previous studies help demonstrate the feasibility of
the proposed approaches.