Understanding the Interplay between Hardware Errors and User Job Characteristics on the Titan Supercomputer

by Seung-hwan Lim, Ross G Miller, Sudharshan S Vazhkudai

Publication Type

Conference Paper

Book Title

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Publication Date

May, 2020

Page Numbers

180 to 190

Conference Name

34th IEEE International Parallel & Distributed Processing Symposium (IPDPS)

Conference Location

New Orleans, Louisiana, United States of America

Conference Sponsor

IEEE

Conference Date

May 18, 2020 - May 22, 2020

Abstract

Designing dependable supercomputers begins with an understanding of errors in real-world, large-scale systems. The Titan supercomputer at Oak Ridge National Laboratory provides a unique opportunity to investigate errors when an actual system is actively used by multiple concurrent users and workloads from diverse domains at varying scales. This study presents a thorough analysis of 6,908,497 hardware errors from 18,688 compute nodes of Titan for 312,215 user jobs over a 3-year time period. Through careful joining of two system logs – the Machine Check Architecture (MCA) log and the job scheduler log – we show the correlated pattern of hardware errors for each job and user, in addition to individual descriptive statistics of errors, jobs, and users. Since the majority of hardware errors are memory errors, this study also shows the importance of error correction in memory systems.
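
The log join mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the record layouts (`Job`, `ErrorEvent`), the field names, and the attribution rule (an error belongs to a job if it occurred on one of the job's nodes while the job was running) are all assumptions made for illustration, and the sample data is invented.

```python
from collections import Counter
from typing import NamedTuple


class Job(NamedTuple):
    """Hypothetical scheduler-log record (field names are assumptions)."""
    job_id: str
    user: str
    nodes: frozenset  # node IDs allocated to the job
    start: int        # start time, seconds
    end: int          # end time, seconds (exclusive)


class ErrorEvent(NamedTuple):
    """Hypothetical hardware-error record (field names are assumptions)."""
    node: int
    timestamp: int
    kind: str


def errors_per_job(jobs, errors):
    """Count errors that fall on a job's nodes during its run time."""
    counts = Counter()
    for err in errors:
        for job in jobs:
            on_node = err.node in job.nodes
            in_window = job.start <= err.timestamp < job.end
            if on_node and in_window:
                counts[job.job_id] += 1
    return counts


def errors_per_user(jobs, per_job):
    """Roll per-job error counts up to the submitting user."""
    counts = Counter()
    for job in jobs:
        counts[job.user] += per_job[job.job_id]
    return counts


# Invented sample data, for illustration only.
jobs = [
    Job("j1", "alice", frozenset({1, 2}), 0, 100),
    Job("j2", "bob", frozenset({2, 3}), 100, 200),
]
errors = [
    ErrorEvent(1, 10, "memory"),
    ErrorEvent(2, 150, "memory"),
    ErrorEvent(3, 250, "cache"),   # no job running on node 3 at t=250
    ErrorEvent(2, 50, "memory"),
]

per_job = errors_per_job(jobs, errors)
per_user = errors_per_user(jobs, per_job)
```

The nested loop is quadratic in the number of records; at the scale reported in the abstract, an interval index keyed by node would likely be needed, but the attribution logic would stay the same.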
