Publication

Proactive Fault Tolerance Using Preemptive Migration...

by Christian Engelmann, Geoffroy R Vallee, Thomas J Naughton III, Stephen L Scott
Publication Type
Conference Paper
Book Title
Proceedings of the 17th Euromicro International Conference on Parallel, Distributed, and Network-based Processing (PDP) 2009
Publication Date
Page Numbers
252 to 257
Publisher Location
Los Alamitos, California, United States of America
Conference Name
17th Euromicro International Conference on Parallel, Distributed, and Network-based Processing (PDP) 2009
Conference Location
Weimar, Germany
Conference Date
-

Proactive fault tolerance (FT) in high-performance computing prevents compute node failures from impacting running parallel applications by preemptively migrating application parts away from nodes that are about to fail. This paper provides a foundation for proactive FT by defining its architecture and classifying implementation options. It also relates prior work to the presented architecture and classification, and discusses the challenges ahead for the supporting technologies that proactive FT needs.
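
The core idea of preemptive migration can be illustrated with a minimal Python sketch. It is not the architecture defined in the paper: the `Node` class, the temperature metric, the `TEMP_LIMIT_C` threshold, and the functions `plan_migrations` and `apply_migrations` are all hypothetical names chosen for illustration, and a fixed threshold merely stands in for whatever failure prediction a real system would use.

```python
from dataclasses import dataclass, field

# Hypothetical threshold for illustration; not a value from the paper.
TEMP_LIMIT_C = 85.0


@dataclass
class Node:
    name: str
    temperature_c: float  # illustrative health metric (an assumption)
    tasks: list = field(default_factory=list)

    def is_at_risk(self) -> bool:
        # A simple threshold stands in for real failure prediction.
        return self.temperature_c >= TEMP_LIMIT_C


def plan_migrations(nodes):
    """Return (task, source, target) moves that evacuate at-risk nodes."""
    healthy = [n for n in nodes if not n.is_at_risk()]
    moves = []
    if not healthy:
        return moves  # no healthy node available as a migration target
    # Track projected load so evacuated tasks spread across healthy nodes.
    load = {n.name: len(n.tasks) for n in healthy}
    for node in nodes:
        if node.is_at_risk():
            for task in node.tasks:
                target = min(load, key=load.get)  # least-loaded healthy node
                moves.append((task, node.name, target))
                load[target] += 1
    return moves


def apply_migrations(nodes, moves):
    """Move each task from its source node to its target node."""
    by_name = {n.name: n for n in nodes}
    for task, source, target in moves:
        by_name[source].tasks.remove(task)
        by_name[target].tasks.append(task)


nodes = [
    Node("n0", 60.0, ["rank0"]),
    Node("n1", 91.0, ["rank1", "rank2"]),  # exceeds the threshold
    Node("n2", 55.0, []),
]
moves = plan_migrations(nodes)
apply_migrations(nodes, moves)
```

In this sketch the planning step is kept separate from the step that performs the moves, so the prediction and placement policy could be changed without touching the migration mechanism. After the run, the at-risk node `n1` holds no tasks and its two tasks are spread over the healthy nodes.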