Publication

Towards a Resilience Investigation Framework for High Performance Computing...

by Thomas J Naughton III
Publication Type
Thesis / Dissertation
Publication Date

As large-scale scientific computing platforms increase in size and capability, their complexity also grows. These systems require great care and attention, much of which is due to the rise in failures from increased node/component counts. Fault tolerance, or resilience, is a key challenge for computing and a major factor in the successful utilization of high-end scientific computing platforms. As the importance of fault tolerance increases, methods for experimentation into new mechanisms and policies are critical. The methodical investigation of failure in these systems is hampered by their scale and by a lack of tools for controlled experimentation. The focus of this research is to provide a versatile, low-overhead platform for fault tolerance/resilience experimentation in a high-performance computing (HPC) environment. The objective is to extend the HPC workflow and toolkit to provide ways for studying large-scale scientific applications at extreme scales with synthetic faults (errors) in a controlled environment. As part of this research, we leverage prior work in the areas of HPC system software and performance evaluation tools to enable controlled experimentation through fault injection, while maintaining acceptable performance for scientific workloads. The research identifies two crucial characteristics that must be balanced in fault-injection experiments: (i) integration (context), and (ii) isolation (protection). The result of this research is a Resilience Investigation Framework (RIF) that provides HPC users and developers with a versatile experimental framework that balances integration and isolation when exploring resilience methods and policies in large-scale systems.