
Resilience


Contact: Christian Engelmann

Hardware and software faults are an unavoidable aspect of any computer system, and managing them accounts for a great deal of effort at every level of the system. While faults occur continuously, they are significant only if they interrupt work or produce a wrong answer. Resilience is about keeping a computer system running and producing correct output in a timely manner.

A supercomputer consists of millions of individual components, including hundreds of thousands of processor and memory chips, so the probability of faults is much higher than in a consumer product. Supercomputers also constantly push the envelope of what is achievable, relying on the latest advances in processor and memory technology, which further increases the potential for faults.

The goal of resilient high-performance computing (HPC) is to provide efficiency and correctness in the presence of faults, errors, and failures through avoidance, masking, and recovery across all levels of HPC hardware and software. Our research and development in HPC resilience focuses on fault characterization, prevention, detection, notification, and handling as part of HPC hardware/software co-design that considers the cost/benefit trade-off between the key system design factors: performance, resilience, and power consumption. Our efforts target:

(1) fault injection tools to study the vulnerability, propagation properties, and handling coverage of processors, memory, system software, and science applications (see the first sketch after this list),
(2) fault detection and notification software frameworks for communicating information across all levels of the system,
(3) reactive software mechanisms, like checkpoint/restart and message logging (see the second sketch after this list),
(4) proactive software approaches, such as migration of work in anticipation of faults, reliability-aware scheduling, and rejuvenation,
(5) programming model approaches, like the fault-tolerant Message Passing Interface,
(6) algorithm-based fault tolerance with recovery or fault-oblivious approaches embedded in science applications (see the third sketch after this list), and
(7) resilience co-design tools to study the cost/benefit trade-off between the key system design factors.

Our work in HPC resilience, including the knowledge gained and the solutions developed, ensures that DOE's Leadership computing systems continue to enable scientific breakthroughs by operating with an acceptable efficiency and productivity.
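To make the fault injection idea in (1) concrete, here is a minimal sketch, not one of ORNL's actual tools: a single bit of one operand in a toy dot-product kernel is flipped to emulate a silent data corruption, and the result is compared against a fault-free reference run. The kernel, the uniform bit-flip model, and all names are illustrative assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <time.h>

#define N 1024

/* Toy kernel standing in for an application computation. */
static double dot(const double *a, const double *b, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

/* Flip one bit of a[idx] to emulate a silent data corruption. */
static void flip_bit(double *a, int idx, int bit) {
    uint64_t v;
    memcpy(&v, &a[idx], sizeof v);
    v ^= (uint64_t)1 << bit;
    memcpy(&a[idx], &v, sizeof v);
}

int main(void) {
    double a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    srand((unsigned)time(NULL));
    double golden = dot(a, b, N);          /* fault-free reference run */
    flip_bit(a, rand() % N, rand() % 64);  /* inject a single bit flip */
    double faulty = dot(a, b, N);

    printf("golden=%g faulty=%g delta=%g\n", golden, faulty, faulty - golden);
    return 0;
}

Repeating such runs over many injection sites yields the vulnerability and propagation statistics mentioned above.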
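The reactive mechanisms in (3) can be illustrated with application-level checkpoint/restart. This sketch periodically saves an iterative computation's state to a file and, after a crash, resumes from the last checkpoint instead of from the beginning. The file name, interval, and stand-in workload are assumptions for illustration; production implementations add atomic writes, fsync, and parallel-file-system awareness.

#include <stdio.h>

enum { STEPS = 1000000, CKPT_INTERVAL = 100000 };
static const char *CKPT_FILE = "state.ckpt";

int main(void) {
    long step = 0;
    double state = 0.0;

    /* On startup, resume from the last checkpoint if one exists. */
    FILE *f = fopen(CKPT_FILE, "rb");
    if (f) {
        if (fread(&step, sizeof step, 1, f) != 1 ||
            fread(&state, sizeof state, 1, f) != 1) {
            step = 0; state = 0.0;  /* unreadable checkpoint: start over */
        } else {
            printf("restarting at step %ld\n", step);
        }
        fclose(f);
    }

    for (; step < STEPS; step++) {
        state += 1e-6 * step;       /* stand-in for real work */

        /* After finishing a step, periodically record the next step
         * to execute together with the state it depends on. */
        long next = step + 1;
        if (next % CKPT_INTERVAL == 0) {
            f = fopen(CKPT_FILE, "wb");
            if (f) {
                fwrite(&next, sizeof next, 1, f);
                fwrite(&state, sizeof state, 1, f);
                fclose(f);  /* production: write temp file, fsync, rename */
            }
        }
    }
    printf("done, final state %g\n", state);
    return 0;
}

The checkpoint interval embodies the performance/resilience trade-off discussed above: frequent checkpoints bound the lost work after a failure but add I/O overhead to every run.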
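The algorithm-based fault tolerance in (6) can be illustrated with a checksum scheme in the style of Huang and Abraham: because the update y = a*x + y is linear, a sum checksum carried alongside each vector must satisfy the same relation, so a corrupted element shows up as a checksum mismatch. The vector size, the injected error, and the tolerance are illustrative assumptions.

#include <stdio.h>
#include <math.h>

#define N 8

static double sum(const double *v, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += v[i];
    return s;
}

int main(void) {
    double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = i + 1.0; y[i] = 1.0; }

    double a = 2.0;
    double sx = sum(x, N), sy = sum(y, N);  /* checksums of the inputs */

    /* y = a*x + y is linear, so the carried checksum obeys the
     * same relation: sy = a*sx + sy. */
    for (int i = 0; i < N; i++) y[i] = a * x[i] + y[i];
    sy = a * sx + sy;

    y[3] += 100.0;  /* simulate a silent error in one element */

    /* Verify: the recomputed sum should match the carried checksum. */
    double diff = fabs(sum(y, N) - sy);
    if (diff > 1e-9)
        printf("silent error detected (checksum off by %g)\n", diff);
    else
        printf("checksum OK\n");
    return 0;
}

In the fault-oblivious variants mentioned above, the algorithm instead tolerates such errors without explicit detection, for example by converging despite them.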

 
