Mini-Ckpts: Surviving OS Failures in Persistent Memory
by Christian Engelmann, et al.
Concern is growing in the high-performance computing (HPC) community on the reliability of future extreme-scale systems. Current efforts have focused on application fault-tolerance rather than the operating system (OS), despite the fact that recent studies have suggested that failures in OS memory are more likely. The OS is critical to a system's correct and efficient operation of the node and processes it governs – and in HPC also for any other nodes a parallelized application runs on and communicates with: Any single node failure generally forces all processes of this application to terminate due to tight communication in HPC. Therefore, the OS itself must be capable of tolerating failures. In this work, we introduce mini-ckpts, a framework that enables application survival despite the occurrence of a fatal OS failure or crash. Mini-ckpts achieves this tolerance by ensuring that the critical data describing a process is preserved in persistent memory prior to the failure. Following the failure, the OS is rejuvenated via a warm reboot and the application continues execution effectively making the failure and restart transparent. The mini-ckpts rejuvenation and recovery process is measured to take between three to six seconds and has a failure-free overhead of between 3-5% for a number of key HPC workloads. In contrast to current fault-tolerance methods, this work ensures that the operating and runtime system can continue in the presence of faults. This is a much finer-grained and dynamic method of fault-tolerance than the current, coarse-grained, application-centric methods. Handling faults at this level has the potential to greatly reduce overheads and enables mitigation of additional fault scenarios.