From Failure to Insight: Analyzing Disk Breakdowns in Large-Scale HPC Environments...

Publication Type

Conference Paper

Book Title

SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis

Publication Date

November, 2024

Page Numbers

484 to 495

Publisher Location

New Jersey, United States of America

Conference Name

Workshop on Fault-Tolerance for HPC at eXtreme Scale (FTXS 2024) at Supercomputing Conference (SC24)

Conference Location

Atlanta, Georgia, United States of America

Conference Sponsor

IEEE Computer Society, TCHPC, ACM, SIGHPC

Conference Date

Nov 17, 2024 - Nov 22, 2024

Abstract

Disk failure data provides valuable insights for preventing failures, enhancing storage robustness, guiding system design and deployment, and ensuring reliable operations at data centers. This paper introduces two disk failure datasets collected from large-scale HPC production environments over the past five years, comprising over 5,000 failure records from more than 40,000 disks. We analyzed these datasets across multiple dimensions, including temporal, spatial, and relational trends, and performed a comprehensive reliability assessment. Our analysis yielded numerous observations and insights that influence various operational aspects of HPC storage systems. We believe this study offers a holistic understanding of disk failure trends likely to interest the HPC storage community.
