![Relative Failure Hazards depend on scheduling (via X torus coordinate) and location (via col, row, cage, and node) due to heat generation and architecture of air transport heat dissipation. Fill-in scheduling and distance from cool air increase failure hazard. Computer Science and Mathematics Division CSMD ORNL](/sites/default/files/styles/list_page_thumbnail/public/2021-05/gpu_lifetimes_on_titan_supercomputer-_survival_analysis_and_reliability.png?h=452a48d2&itok=sfaD_mLZ)
A team of researchers from Oak Ridge National Laboratory applied advanced statistical methods from biomedical research to study an unexpected failure mode of general-purpose graphics processing units (GPGPUs).
Researchers developed a novel algorithm for resilient and communication-efficient parallel matrix multiplication in HPC systems.