Using Performance Tools to Support Experiments in HPC Resilience

The high performance computing (HPC) community is working to address concerns associated with fault tolerance and resilience in current and future large-scale computing platforms. This is driving enhancements in programming environments, specifically research on enhancing the Message Passing Interface (MPI) to support fault-tolerant computing capabilities. As these enhancements emerge, tools for resilience experimentation are becoming more important. In the workshop paper titled "Using Performance Tools to Support Experiments in HPC Resilience", we consider how HPC performance-focused tools and methods can be extended ("repurposed") to benefit the resilience community.

The paper describes the initial motivation for leveraging standard HPC performance analysis techniques to develop diagnostic tools that assist fault tolerance experiments for HPC applications. These diagnostic procedures help provide context about the state of the system when errors (failures) occurred. We describe the extension of an existing MPI tracing package, DUMPI, to support the User Level Failure Mitigation (ULFM) specification that has been proposed to the MPI Forum by the Fault Tolerance Working Group (MPI-FTWG). The data obtained from these traces can assist application developers and fault tolerance (FT) implementers in diagnosing problems and can help with postmortem analysis. To investigate the usefulness of the trace tool, we extended a simple molecular dynamics application to use the ULFM enhancements to MPI. Our initial experiments used the trace files from the tests to gain insight into the context of the job during resilience experiments. The traces helped to highlight two problems we encountered during fault injection experiments: i) a fault-injection logic error that still produced correct results (application output) but killed more ranks than anticipated; ii) an issue in failure detection/propagation with the ULFM prototype that was affected by the method used to simulate the rank failure. The trace files also help to explain changes in overall performance when MPI fault tolerance mechanisms are employed.

"Using Performance Tools to Support Experiments in HPC Resilience", Thomas Naughton, Swen Boehm, Christian Engelmann and Geoffroy Vallee, (To appear) Lecture Notes in Computer Science: Proceedings of the 19th European Conference on Parallel and Distributed Computing (Euro-Par) Workshops: 6th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, August 26, 2013, Aachen, Germany. Springer Verlag, Berlin, Germany.

ORNL Team Members: Thomas Naughton, Swen Boehm, Christian Engelmann and Geoffroy Vallee

