A team of researchers from Oak Ridge National Laboratory (ORNL) designed, implemented, and evaluated a high-performance computing (HPC) runtime system that uses design patterns to orchestrate resilience capabilities, providing efficient protection against faults, errors, and failures.
The pattern-oriented resilient runtime offers a novel way to orchestrate efficient resilience strategies based on the reliability properties, resilience capabilities, and resilience needs of HPC systems and applications. It lets system designers and users actively balance the trade-off between the performance overhead and the protection coverage of different resilience solutions. The result is a resilient supercomputing software stack that can adapt to emerging reliability threats with efficient responses, delivering science through advanced computing with high productivity and correctness.
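The overhead-versus-coverage trade-off described above can be illustrated with a minimal sketch. It is not the PLEXUS interface: the `ResiliencePattern` type, the `select_pattern` function, the pattern names, and all overhead and coverage figures are hypothetical and chosen only for illustration. The sketch assumes each candidate resilience solution can be summarized by one overhead value and one coverage value, and picks the cheapest candidate that meets a required coverage level.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ResiliencePattern:
    """Hypothetical summary of one resilience solution."""
    name: str
    overhead: float  # extra runtime as a fraction of fault-free runtime (illustrative)
    coverage: float  # fraction of faults the pattern protects against (illustrative)


def select_pattern(
    patterns: Iterable[ResiliencePattern], required_coverage: float
) -> Optional[ResiliencePattern]:
    """Return the lowest-overhead pattern whose coverage meets the requirement.

    Returns None when no candidate provides enough coverage.
    """
    eligible = [p for p in patterns if p.coverage >= required_coverage]
    return min(eligible, key=lambda p: p.overhead, default=None)


# Illustrative candidates; the numbers are made up, not measured results.
CANDIDATES = [
    ResiliencePattern("no_protection", overhead=0.0, coverage=0.0),
    ResiliencePattern("checkpoint_restart", overhead=0.10, coverage=0.90),
    ResiliencePattern("replication", overhead=1.00, coverage=0.99),
]

if __name__ == "__main__":
    for need in (0.0, 0.85, 0.95):
        choice = select_pattern(CANDIDATES, need)
        print(need, choice.name if choice else None)
```

A real runtime would weigh many more properties than two scalars, but the sketch shows the basic shape of the decision: raising the required coverage moves the choice toward more expensive protection, and lowering it recovers performance.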
PI/Facility Lead(s): Christian Engelmann (ORNL CSMD)
ASCR Program/Facility: Early Career Program
Publication(s) for this work: Saurabh Hukerikar and Christian Engelmann, “PLEXUS: A Pattern-Oriented Runtime System Architecture for Resilient Extreme-Scale High-Performance Computing Systems.” Proceedings of the 25th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC) 2020, Perth, Australia, December 1-4, 2020. DOI: 10.1109/PRDC50213.2020.00014