Models for Resilience Design Patterns...

by Mohit Kumar, Christian Engelmann
Publication Type
Conference Paper
Journal Name
IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)
Book Title
2020 IEEE/ACM 10th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)
Page Numbers
21 to 30
Conference Name
Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop
Conference Location
Atlanta, Georgia, United States of America

Resilience plays an important role in supercomputers by keeping operation correct and efficient in the presence of faults, errors, and failures. Resilience design patterns offer blueprints for effectively applying resilience technologies. Prior work focused on developing initial efficiency and performance models for resilience design patterns. This paper extends that work by (1) describing performance, reliability, and availability models for all structural resilience design patterns, (2) providing more detailed models that include flowcharts and state diagrams, and (3) introducing the Resilience Design Pattern Modeling (RDPM) tool, which calculates and plots the performance, reliability, and availability metrics of individual patterns and pattern combinations.