Abstract
Designing dependable supercomputers begins with an understanding of errors in real-world, large-scale systems. The Titan supercomputer at Oak Ridge National Laboratory provides a unique opportunity to investigate errors when an actual system is actively used by multiple concurrent users and workloads from diverse domains at varying scales. This study presents a thorough analysis of 6, 908, 497 hardware errors from 18, 688 compute nodes of Titan for 312, 215 user jobs over a 3-year time period. Through careful joining of two system logs – the Machine Check Architecture (MCA) log and the job scheduler log – we show the correlated pattern of hardware errors for each job and user, in addition to individual descriptive statistics of errors, jobs, and users. Since the majority of hardware errors are memory errors, this study also shows the importance of error correcting in memory systems.