Empowering Scientific Innovation Through An Integrated Research Infrastructure: The Role of the Advanced Computing Ecosystem (Conference Paper, November 2024)
OLCF’s Advanced Computing Ecosystem (ACE): FY24 Efforts for the DOE Integrated Research Infrastructure (IRI) Program (ORNL Report, September 2024)
Reliability Lessons Learned From GPU Experience With The Titan Supercomputer at Oak Ridge Leadership Computing Facility (Conference Paper, November 2015)
Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems... (Conference Paper, June 2015)
Experience with GPUs on the Titan Supercomputer from a Reliability, Performance and Power Perspective (Conference Paper, May 2015)
Analyzing the Interplay of Failures and Workload on a Leadership-Class Supercomputer (Conference Paper, April 2015)
Understanding GPU Errors on Large-scale HPC Systems and the Implications for System Design and Operation... (Conference Paper, February 2015)