Proactive Fault Tolerance for HPC with Xen Virtualization...

by Arun Nagarajan, Frank Mueller, Christian Engelmann, Stephen L Scott

Publication Type

Conference Paper

Publication Date

June, 2007

Page Numbers

23 to 32

Conference Name

21st ACM International Conference on Supercomputing (ICS) 2007

Conference Location

Seattle, Washington, United States of America

Conference Date

Jun 16, 2007 - Jun 20, 2007

Abstract

Large-scale parallel computing is relying increasingly on clusters
with thousands of processors. At such large counts of compute
nodes, faults are becoming commonplace. Current techniques to
tolerate faults focus on reactive schemes to recover from faults and
generally rely on a checkpoint/restart mechanism. Yet, in today's
systems, node failures can often be anticipated by detecting a deteriorating
health status.
Instead of a reactive scheme for fault tolerance (FT), we are
promoting a proactive one where processes automatically migrate
from unhealthy nodes to healthy ones. Our approach relies on
operating system virtualization techniques exemplified by but not
limited to Xen. This paper contributes an automatic and transparent
mechanism for proactive FT for arbitrary MPI applications.
It leverages virtualization techniques combined with health monitoring
and load-based migration. We exploit Xen's live migration
mechanism for a guest operating system (OS) to migrate an
MPI task from a health-deteriorating node to a healthy one without
stopping the MPI task during most of the migration. Our proactive
FT daemon orchestrates the tasks of health monitoring, load
determination and initiation of guest OS migration. Experimental
results demonstrate that live migration hides migration costs and
limits the overhead to only a few seconds, making it an attractive
approach to realize FT in HPC systems. Overall, our enhancements
make proactive FT a valuable asset for long-running MPI
applications, one that is complementary to reactive FT using full
checkpoint/restart schemes, since checkpoint frequencies can be reduced
as fewer unanticipated failures are encountered. In the context of
OS virtualization, we believe that this is the first comprehensive
study of proactive fault tolerance where live migration is actually
triggered by health monitoring.
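
The decision step described in the abstract (monitor node health, determine load, then initiate migration of a guest OS) can be illustrated with a minimal sketch. This is not the authors' daemon: the Node fields, the temperature threshold, and the function names are hypothetical, and the code only plans migrations without calling Xen.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical limit: the abstract does not say which sensors or thresholds are used.
TEMP_LIMIT_C = 70.0


@dataclass
class Node:
    name: str
    temperature_c: float  # assumed health reading
    load: float           # assumed load metric (lower means less busy)
    hosts_guest: bool     # True if a guest OS with an MPI task runs here


def is_healthy(node: Node) -> bool:
    """A node counts as healthy while its reading stays below the limit."""
    return node.temperature_c < TEMP_LIMIT_C


def plan_migrations(nodes: List[Node]) -> List[Tuple[str, str]]:
    """Pair each unhealthy guest-hosting node with the least-loaded
    healthy node that has no guest and is not already a chosen target."""
    plans = []
    taken = set()
    for src in nodes:
        if not src.hosts_guest or is_healthy(src):
            continue
        free = [n for n in nodes
                if is_healthy(n) and not n.hosts_guest and n.name not in taken]
        if not free:
            continue  # no spare node: reactive checkpoint/restart remains the fallback
        target = min(free, key=lambda n: n.load)
        taken.add(target.name)
        plans.append((src.name, target.name))
    return plans


cluster = [
    Node("n1", 82.0, 1.0, True),   # deteriorating node running an MPI task
    Node("n2", 45.0, 0.3, False),  # healthy spare
    Node("n3", 50.0, 0.1, False),  # healthy spare with the lowest load
    Node("n4", 40.0, 0.9, True),   # healthy node running an MPI task
]
print(plan_migrations(cluster))
```

In a real deployment, each planned pair would be handed to the hypervisor's live migration facility, so the MPI task keeps running during most of the transfer.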
