Research Highlight

RDPM: An Extensible Tool for Resilience Design Patterns Modeling

Performance, reliability, and availability of multi-level rollback (e.g., accelerator-level and application-level checkpoint/restart) as modeled by the RDPM software tool. The system mean-time-to-failure (MTTF) varies from 24 to 168 hours (1 to 7 days), 80% of the computation is offloaded to the accelerator and protected by both levels, the checkpoint/restart time at the accelerator level is 1 second, and the checkpoint/restart time at the application level is 1, 5, or 10 minutes (about 0.02, 0.08, or 0.17 hours, respectively).

Resilience to faults, errors, and failures in extreme-scale high-performance computing systems is a critical challenge. Resilience design patterns offer a new, structured hardware and software design approach for improving resilience. Prior work developed performance, reliability, and availability models for resilience design patterns. This work extends those models with the Resilience Design Patterns Modeling (RDPM) tool, which supports (1) exploring the performance, reliability, and availability of each resilience design pattern, (2) customizing parameters to optimize performance, reliability, and availability, and (3) investigating trade-off models that combine multiple patterns into practical resilience solutions.
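
To give a feel for the kind of parameter sweep described in the caption above, the following is a minimal sketch of a single-level checkpoint/restart model. It is not the RDPM tool or its models: it uses Young's well-known first-order approximation for the checkpoint interval, ignores restart time and failures during checkpointing, and does not model multi-level rollback or accelerator offload. The function names `young_interval` and `first_order_efficiency` are hypothetical and chosen only for illustration.

```python
import math


def young_interval(checkpoint_hours, mttf_hours):
    """Young's first-order checkpoint interval: sqrt(2 * C * M).

    C is the time to write one checkpoint and M is the system MTTF,
    both in hours. This is an approximation, not RDPM's model.
    """
    return math.sqrt(2.0 * checkpoint_hours * mttf_hours)


def first_order_efficiency(checkpoint_hours, mttf_hours, interval_hours=None):
    """Approximate fraction of time spent on useful computation.

    First-order waste = C / T (checkpoint overhead)
                      + T / (2 * M) (expected rework after a failure),
    where T is the checkpoint interval. If no interval is given,
    Young's interval is used.
    """
    if interval_hours is None:
        interval_hours = young_interval(checkpoint_hours, mttf_hours)
    waste = checkpoint_hours / interval_hours + interval_hours / (2.0 * mttf_hours)
    return max(0.0, 1.0 - waste)


if __name__ == "__main__":
    # Sweep taken from the caption: MTTF of 1 to 7 days and
    # application-level checkpoint times of 1, 5, and 10 minutes.
    for minutes in (1, 5, 10):
        cost = minutes / 60.0  # checkpoint time in hours
        for days in range(1, 8):
            mttf = 24.0 * days
            eff = first_order_efficiency(cost, mttf)
            print(f"checkpoint={minutes:2d} min  MTTF={mttf:5.0f} h  efficiency={eff:.4f}")
```

Under these assumptions, efficiency rises with MTTF and falls as the checkpoint time grows, which is the qualitative trade-off that a tool such as RDPM lets a designer explore with richer, pattern-specific models.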

Secondary Media Contact

Christian Engelmann
Oak Ridge National Laboratory