Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. Projections based on the current generation of HPC systems and technology roadmaps suggest that future systems will experience very high fault rates. The errors resulting from these faults will propagate and generate various kinds of failures, with outcomes ranging from result corruption to catastrophic application crashes. Therefore, the resilience challenge for extreme-scale HPC systems requires coordination between various hardware and software technologies that are capable of handling a broad set of fault models at accelerated fault rates. Also, due to practical limits on power consumption, future HPC systems are likely to embrace innovative architectures, increasing the level of hardware and software complexity. Therefore, techniques that seek to improve resilience must navigate the complex trade-off space between resilience and the overheads in power consumption and performance. While the HPC community has developed various resilience solutions, both application-level techniques and system-based solutions, the solution space of HPC resilience techniques remains fragmented. There are no formal methods to integrate the various HPC resilience techniques into composite solutions, nor are there methods to holistically evaluate the adequacy and efficacy of such solutions in terms of their protection coverage and their performance and power efficiency characteristics. Additionally, few implementations of current resilience solutions are portable to newer architectures and software environments that will be deployed on future systems.
We developed a new structured approach to the management of HPC resilience using the concept of resilience-based design patterns. In general, a design pattern is a repeatable solution to a commonly occurring problem. We identified the well-known solutions that are commonly used to deal with faults, errors and failures in HPC systems.
In the initial design patterns specification (version 1.0), we described the various solutions, which address specific problems in the design of resilient HPC environments, in the form of patterns. Each pattern describes a problem caused by a fault, error, or failure event in an HPC environment, and then describes the core of the solution to the problem in such a way that this solution may be adapted to different systems and implemented at different layers of the system stack. The catalog of these resilience design patterns provides designers with a collection of design elements. To construct complete resilience solutions using combinations of various patterns, we defined a framework that enhances HPC designers' understanding of the important constraints and the opportunities for the design patterns to be implemented and deployed at various layers of the system stack. The design framework is also useful for establishing interfaces and mechanisms to coordinate flexible fault management across hardware and software components, as well as for considering the trade-offs among performance, resilience, and power consumption when constructing a solution. The resilience design patterns specification version 1.1 included more detailed explanations of the pattern solutions, the context in which the patterns are applicable, and the implications for hardware or software design. It also provided several additional examples and detailed case studies to demonstrate the use of patterns to build realistic solutions.
In version 1.2 of the specification document, we have improved the pattern descriptions, including graphical representations of the pattern components. These improvements are largely based on critical comments, feedback, and suggestions received from pattern experts and readers of the previous versions of the specification. The pattern classification has been modified to further clarify the relationships between pattern categories. This version of the specification also introduces a pattern language for resilience design patterns. The pattern language presents the patterns in the catalog as a network, revealing the relations among the resilience patterns. The language provides designers with the means to explore alternative techniques for handling a specific fault model that may have different efficiency and complexity characteristics. Using the pattern language also enables the design and implementation of comprehensive resilience solutions as a set of interconnected resilience patterns that can be instantiated across layers of the system stack. The overall goal of this work is to provide hardware and software designers, as well as the users and operators of HPC systems, with a systematic methodology for the design and evaluation of resilience technologies in HPC systems that keep scientific applications running to a correct solution in a timely and cost-efficient manner despite frequent faults, errors, and failures of various types.
Version 2.0 expands the resilience design pattern classification and catalog to include self-stabilization patterns, as well as reliability, availability, and performance models for each structural pattern.