Publication

Understanding the Interplay between Hardware Errors and User Job Characteristics on the Titan Supercomputer...

by Seung-hwan Lim, Ross G Miller, Sudharshan S Vazhkudai
Publication Type
Conference Paper
Journal Name
Proceedings of 34th IEEE International Parallel and Distributed Processing Symposium
Book Title
2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Publication Date
Page Numbers
180 to 190
Issue
1
Conference Name
34th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
Conference Location
New Orleans, Louisiana, United States of America
Conference Sponsor
IEEE
Conference Date
-

Designing dependable supercomputers begins with an understanding of errors in real-world, large-scale systems. The Titan supercomputer at Oak Ridge National Laboratory provides a unique opportunity to investigate errors while an actual system is actively used by multiple concurrent users and workloads from diverse domains at varying scales. This study presents a thorough analysis of 6,908,497 hardware errors from 18,688 compute nodes of Titan for 312,215 user jobs over a 3-year period. Through careful joining of two system logs – the Machine Check Architecture (MCA) log and the job scheduler log – we show the correlated pattern of hardware errors for each job and user, in addition to individual descriptive statistics of errors, jobs, and users. Since the majority of hardware errors are memory errors, this study also shows the importance of error correction in memory systems.
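
As an illustration of what joining an error log with a scheduler log can look like, here is a minimal sketch in Python. It is not the authors' pipeline. The record layouts (`Job`, `HardwareError`), the function name `join_errors_to_jobs`, and the sample data are all hypothetical, and the sketch assumes that each node runs at most one job at a time, so an error is attributed to the job that held the node when the error was logged.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    # Hypothetical scheduler-log record; field names are assumptions.
    job_id: str
    user: str
    nodes: frozenset  # node IDs allocated to the job
    start: float      # start time, seconds
    end: float        # end time, seconds (exclusive)


@dataclass(frozen=True)
class HardwareError:
    # Hypothetical error-log record; field names are assumptions.
    node: int
    time: float
    kind: str


def join_errors_to_jobs(errors, jobs):
    """Attribute each error to the job holding that node at that time."""
    # Index jobs by node so each error only scans jobs that used its node.
    jobs_by_node = {}
    for job in jobs:
        for node in job.nodes:
            jobs_by_node.setdefault(node, []).append(job)

    per_job = Counter()
    per_user = Counter()
    unattributed = 0  # errors logged while no job held the node
    for err in errors:
        owner = next(
            (j for j in jobs_by_node.get(err.node, ())
             if j.start <= err.time < j.end),
            None,
        )
        if owner is None:
            unattributed += 1
            continue
        per_job[owner.job_id] += 1
        per_user[owner.user] += 1
    return per_job, per_user, unattributed


# Made-up sample data, for illustration only.
jobs = [
    Job("job-1", "alice", frozenset({0, 1}), 0.0, 100.0),
    Job("job-2", "bob", frozenset({1, 2}), 100.0, 200.0),
]
errors = [
    HardwareError(1, 50.0, "memory"),
    HardwareError(1, 150.0, "memory"),
    HardwareError(2, 120.0, "memory"),
    HardwareError(0, 500.0, "memory"),
]
per_job, per_user, unattributed = join_errors_to_jobs(errors, jobs)
```

The per-job and per-user counters correspond to the kind of correlated view the abstract describes. At the scale reported here, a linear scan per node would likely be replaced by a sorted interval lookup or a database join.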