US Department of Energy, Office of Science, High-Performance Computing Facility 2024 Operational Assessment Oak Ridge Leadership Computing Facility ORNL Report April, 2025
Reliability Lessons Learned From GPU Experience With The Titan Supercomputer at Oak Ridge Leadership Computing Facility Conference Paper November, 2015
Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems... Conference Paper June, 2015
Experience with GPUs on the Titan Supercomputer from a Reliability, Performance and Power Perspective Conference Paper May, 2015
Analyzing the Interplay of Failures and Workload on a Leadership-Class Supercomputer Conference Paper April, 2015
Understanding GPU Errors on Large-scale HPC Systems and the Implications for System Design and Operation... Conference Paper February, 2015