Achievement: Developed tools to analyze system logs from the Titan, Jaguar, and Eos systems at OLCF, extracted characteristics of interest, and created a fault catalogue.
Significance and Impact: Understanding the characteristics of system failures in large-scale supercomputers is essential for designing mitigation techniques, and it offers the research community insight into system behavior. This research is developing a suite of analysis methods and gaining insight into system errors, faults, and failures. The result will be a fault taxonomy and models that capture the observed behavior of current systems.
- Developed tools to analyze system logs from the Titan, Jaguar, and Eos systems.
- Overcame several pitfalls in this analysis by combining information from different log files and creating a consistent log format.
- Used established methods and some new methods to model the temporal and spatial behavior of these failure events.
- Examined how temporal and spatial behavior has evolved over the years for these systems.
- Analyzed how these failure types correlate with each other (i.e., is one failure type very likely to follow another type?).
- Compared and contrasted the behavior of the three systems based on this analysis.
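As a rough illustration of the temporal and correlation analysis described in the list above, the following minimal Python sketch computes the mean time between failures and estimates how often one failure type is followed by another within a time window. It is not the project's actual tooling: the event records, the failure type names (`gpu_error`, `kernel_panic`, `link_failure`), the one-hour window, and the function names are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

# Hypothetical, made-up events in an assumed unified format:
# (timestamp, failure type). These are NOT real OLCF log records.
EVENTS = [
    ("2015-01-01 00:00:00", "gpu_error"),
    ("2015-01-01 00:10:00", "kernel_panic"),
    ("2015-01-01 05:00:00", "gpu_error"),
    ("2015-01-01 05:20:00", "kernel_panic"),
    ("2015-01-01 12:00:00", "gpu_error"),
    ("2015-01-01 18:00:00", "link_failure"),
]


def parse(events):
    """Convert timestamp strings to datetime objects and sort by time."""
    return sorted(
        (datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), kind) for ts, kind in events
    )


def mean_time_between_failures(events):
    """Average gap in seconds between consecutive failure events."""
    times = [t for t, _ in parse(events)]
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps) if gaps else None


def follow_probabilities(events, window=timedelta(hours=1)):
    """Estimate how often a failure of type A is immediately followed
    by a failure of type B within `window` (an assumed threshold)."""
    ordered = parse(events)
    counts = defaultdict(Counter)  # counts[a][b]: a followed by b in window
    totals = Counter()             # totals[a]: events of type a with a successor
    for (t1, k1), (t2, k2) in zip(ordered, ordered[1:]):
        totals[k1] += 1
        if t2 - t1 <= window:
            counts[k1][k2] += 1
    return {
        a: {b: n / totals[a] for b, n in followers.items()}
        for a, followers in counts.items()
    }


if __name__ == "__main__":
    print(mean_time_between_failures(EVENTS))
    print(follow_probabilities(EVENTS))
```

Only the immediately following event is considered here, which keeps the sketch short. A fuller analysis would likely also account for spatial location (for example, the node or cabinet involved) and for longer chains of events.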
Sponsor/Facility: Work was performed at Oak Ridge National Laboratory. This work was sponsored by the US Department of Energy's Office of Advanced Scientific Computing Research.
PI and affiliation: Christian Engelmann from CSMD – Oak Ridge National Laboratory
Resilience is one of the key challenges in maintaining high efficiency in future extreme-scale supercomputers leading up to exascale. Understanding the characteristics of system failures in large-scale supercomputer deployments is therefore essential. In this work, we compare and contrast the reliability characteristics of three such deployments and discuss the takeaways for professionals focusing on system design, procurement, deployment, and operation.
1. How can we extrapolate these insights to future systems?
2. Which errors and faults lead to these failure events, and how do they propagate?