Research Highlight

Mini-Ckpts: Surviving OS Failures in Persistent Memory

The scientific application runtime overhead stemming from the mini-ckpts recovery and rejuvenation process scales linearly with the number of injected operating system failures (kernel panics).

Achievement: Developed a novel warm-reboot capability for the operating system (OS) in extreme-scale high-performance computing (HPC) systems to enable recovery from OS failures with minimal impact on running scientific applications.

Significance and Impact: The mini-ckpts framework offers a significantly more efficient way to recover from operating system failures than traditional checkpoint/restart. It improves the efficiency and productivity of DOE’s extreme-scale HPC systems.

Research Details:

  • Developed a capability to make scientific applications transparently memory resident despite OS warm reboots
  • Developed a facility that enables continued operation of scientific applications after an OS warm reboot

Sponsor/Facility: Work was performed at North Carolina State University, Sandia National Laboratories, and Oak Ridge National Laboratory. This work was sponsored by the US Department of Energy's Office of Advanced Scientific Computing Research.

PI and affiliation: Ron Brightwell, Sandia National Laboratories

Overview:

Concern is growing in the high-performance computing (HPC) community about the reliability of future extreme-scale systems. Current efforts have focused on application fault tolerance rather than the operating system (OS), even though recent studies have suggested that failures in OS memory are more likely. The OS is critical to the correct and efficient operation of the node and the processes it governs, and in HPC also to any other nodes that a parallelized application runs on and communicates with: because communication in HPC is tightly coupled, any single node failure generally forces all processes of the application to terminate. Therefore, the OS itself must be capable of tolerating failures.

In this work, we introduce mini-ckpts, a framework that enables an application to survive a fatal OS failure or crash. Mini-ckpts achieves this tolerance by ensuring that the critical data describing a process is preserved in persistent memory prior to the failure. Following the failure, the OS is rejuvenated via a warm reboot and the application continues execution, effectively making the failure and restart transparent. The mini-ckpts rejuvenation and recovery process is measured to take between three and six seconds and has a failure-free overhead of between 3% and 5% for a number of key HPC workloads.

In contrast to current fault-tolerance methods, this work ensures that the operating and runtime system can continue in the presence of faults. This is a much finer-grained and more dynamic method of fault tolerance than the current coarse-grained, application-centric methods. Handling faults at this level has the potential to greatly reduce overheads and enables mitigation of additional fault scenarios.
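To make the reported numbers concrete, the following is a minimal back-of-the-envelope sketch in Python, not part of the mini-ckpts framework. It assumes a simple linear model in which total runtime equals the failure-free runtime (base runtime plus the 3% to 5% overhead) plus one recovery interval of three to six seconds per OS failure. The function name `estimated_runtime`, its parameters, and the one-hour base runtime are hypothetical and chosen only for illustration.

```python
def estimated_runtime(base_seconds, failure_free_overhead, recovery_seconds, num_failures):
    """Estimate total runtime under an assumed linear cost model.

    base_seconds:          runtime without mini-ckpts and without failures
    failure_free_overhead: fractional overhead with no failures (e.g. 0.03 for 3%)
    recovery_seconds:      time for one rejuvenation and recovery cycle
    num_failures:          number of OS failures survived during the run
    """
    if base_seconds < 0 or num_failures < 0:
        raise ValueError("base_seconds and num_failures must be non-negative")
    # Assumption: each failure adds one fixed recovery interval, so the
    # added cost grows linearly with the number of failures.
    return base_seconds * (1.0 + failure_free_overhead) + num_failures * recovery_seconds


if __name__ == "__main__":
    base = 3600.0  # hypothetical one-hour application run
    for failures in (0, 1, 5, 10):
        best = estimated_runtime(base, 0.03, 3.0, failures)   # low end of reported ranges
        worst = estimated_runtime(base, 0.05, 6.0, failures)  # high end of reported ranges
        print(f"{failures:2d} failures: {best:.0f} s to {worst:.0f} s")
```

Under this assumed model, each additional kernel panic adds only a few seconds to the run, whereas without OS-level recovery a single node failure would generally terminate every process of the application.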