A team of researchers from Oak Ridge National Laboratory applied advanced statistical methods from biomedical research to study an unexpected failure mode of the general-purpose graphics processing units (GPGPUs) in the Titan supercomputer, ORNL’s flagship computing system from 2012 to 2018. Beginning in 2016, GPGPUs failed because of silver sulfide corrosion of a resistor on the GPGPU boards. The analysis revealed strong correlations between GPGPU failures and heat dissipation, explainable by cooling air transport. It also showed strong correlations between GPGPU failures and usage, explainable by the scheduling of computational jobs.
Using novel statistical methods, the research yielded new insights into root causes and failure cascades. It shows that leadership-class systems should prepare for the unexpected: detailed system data collection combined with advanced statistical methods can (a) enable configurable causal analysis, (b) provide early warning of reliability issues, and (c) inform mitigation strategies.
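As an illustration of the kind of method borrowed from biomedical research, the sketch below implements the Kaplan-Meier estimator, a standard survival-analysis technique that handles units still working when observation ends (censored units). It is a minimal sketch and not the authors' analysis: the function name `kaplan_meier` and the sample values are hypothetical, and the data are made up for illustration rather than taken from Titan.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct failure time.

    durations: how long each unit was observed (any consistent time unit)
    observed:  True if the unit failed at that time, False if it was
               censored (still working when observation ended)
    """
    if len(durations) != len(observed):
        raise ValueError("durations and observed must have the same length")

    failures = Counter(t for t, failed in zip(durations, observed) if failed)
    leaving = Counter(durations)  # failed or censored units leaving the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = failures.get(t, 0)
        if d:
            # Product-limit step: scale by the fraction surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Made-up illustrative data (hypothetical months in service), not Titan data.
durations = [12, 20, 20, 31, 40, 40, 52, 60]
observed = [True, True, False, True, False, True, False, False]

curve = kaplan_meier(durations, observed)
for time, prob in curve:
    print(f"t={time:>3}  S(t)={prob:.3f}")
```

Censoring matters here because a unit that had not failed by the end of observation still contributes information: it stays in the at-risk count until its observation ends instead of being discarded or counted as a failure.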
PI/Facility Lead(s): Georgia Tourassi (OLCF) and Christian Engelmann (ORNL CSMD)
ASCR Program/Facility: OLCF, Early Career
Funding: U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Facilities and Resilience for Extreme Scale Supercomputing Systems Program
Publication: Ostrouchov, George, Don Maxwell, et al., “GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability.” In Proceedings of the 33rd IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Atlanta, GA, USA, November 15-20, 2020.