RDPM: An Extensible Tool for Resilience Design Patterns Modelling...

by Mohit Kumar, Christian Engelmann
Publication Type: Conference Paper
Book Title: Euro-Par 2021: Parallel Processing Workshops
Page Numbers: 283 to 297
Conference Name: 14th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids
Conference Location: Lisbon, Portugal
Conference Sponsor: 27th International European Conference on Parallel and Distributed Computing (Euro-Par) 2021

Resilience to faults, errors, and failures in extreme-scale high-performance computing (HPC) systems is a critical challenge. Resilience design patterns offer a new, structured hardware and software design approach for improving resilience. While prior work focused on developing performance, reliability, and availability models for resilience design patterns, this paper extends that work with a Resilience Design Patterns Modeling (RDPM) tool that supports (1) exploring the performance, reliability, and availability of each resilience design pattern, (2) customizing parameters to optimize performance, reliability, and availability, and (3) investigating trade-off models for combining multiple patterns into practical resilience solutions.