Research Highlight

PLEXUS: A Pattern-Oriented Runtime System Architecture for Resilient Extreme-Scale High-Performance Computing Systems

The architecture of the Plexus resilient runtime system interfacing with programming model runtimes, libraries, system monitoring, and job and resource management. Image credit: Computer Science and Mathematics Division (CSMD), ORNL

The Science

A team of researchers from Oak Ridge National Laboratory (ORNL) designed, implemented, and evaluated Plexus, a high-performance computing (HPC) runtime system that applies the design pattern concept to orchestrate resilience capabilities for efficient protection against faults, errors, and failures.


The Impact

The pattern-oriented resilient runtime offers a novel way to orchestrate efficient resilience strategies based on the reliability properties, resilience capabilities, and resilience needs of HPC systems and applications. It lets system designers and users actively balance the trade-off between the performance overhead and the protection coverage of different resilience solutions. The result is a resilient supercomputing software stack that can adapt to emerging reliability threats with efficient responses, delivering science through advanced computing with high productivity and correctness.
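
The trade-off between performance overhead and protection coverage can be illustrated with a minimal sketch. This is not the Plexus interface: the `ResilienceOption` type, the `select_option` function, the candidate names, and all overhead and coverage figures are hypothetical, chosen only to show how a runtime might pick the option with the best coverage that still fits a user-defined overhead budget.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ResilienceOption:
    """Hypothetical description of one resilience solution."""
    name: str
    overhead: float  # extra runtime as a fraction, e.g. 0.10 means 10%
    coverage: float  # fraction of anticipated faults handled, 0.0 to 1.0


def select_option(
    options: Sequence[ResilienceOption], max_overhead: float
) -> Optional[ResilienceOption]:
    """Return the affordable option with the highest coverage.

    Ties in coverage are broken in favor of the lower overhead.
    Returns None when no option fits within the overhead budget.
    """
    affordable = [o for o in options if o.overhead <= max_overhead]
    if not affordable:
        return None
    return max(affordable, key=lambda o: (o.coverage, -o.overhead))


# Illustrative values only; these are assumptions, not measured results.
CANDIDATES = [
    ResilienceOption("no_protection", 0.00, 0.00),
    ResilienceOption("checkpoint_restart", 0.10, 0.60),
    ResilienceOption("algorithm_based_fault_tolerance", 0.20, 0.75),
    ResilienceOption("dual_redundancy", 1.00, 0.95),
]

if __name__ == "__main__":
    for budget in (0.0, 0.15, 0.25, 2.0):
        choice = select_option(CANDIDATES, budget)
        print(f"budget={budget:.2f} -> {choice.name if choice else 'none'}")
```

A real runtime would derive such figures from system monitoring and application reliability properties rather than from a fixed table, and could re-evaluate the choice as conditions change.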

PI/Facility Lead(s): Christian Engelmann (ORNL CSMD)
ASCR Program/Facility: Early Career Program
Funding: ASCR
Publication(s) for this work: Saurabh Hukerikar and Christian Engelmann, “PLEXUS: A Pattern-Oriented Runtime System Architecture for Resilient Extreme-Scale High-Performance Computing Systems.” Proceedings of the 25th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC) 2020, Perth, Australia, December 1-4, 2020. DOI: 10.1109/PRDC50213.2020.00014