Abstract
with thousands of processors. At such large counts of compute
nodes, faults are becoming commonplace. Current techniques to
tolerate faults focus on reactive schemes to recover from faults and
generally rely on a checkpoint/restart mechanism. Yet, in today's
systems, node failures can often be anticipated by detecting a deteriorating
health status.
Instead of a reactive scheme for fault tolerance (FT), we are
promoting a proactive one where processes automatically migrate
from “unhealthy” nodes to healthy ones. Our approach relies on
operating system virtualization techniques exemplified by, but not
limited to, Xen. This paper contributes an automatic and transparent
mechanism for proactive FT for arbitrary MPI applications.
It leverages virtualization techniques combined with health monitoring
and load-based migration. We exploit Xen's live migration
mechanism for a guest operating system (OS) to migrate an
MPI task from a health-deteriorating node to a healthy one without
stopping the MPI task during most of the migration. Our proactive
FT daemon orchestrates the tasks of health monitoring, load
determination and initiation of guest OS migration. Experimental
results demonstrate that live migration hides migration costs and
limits the overhead to only a few seconds, making it an attractive
approach to realize FT in HPC systems. Overall, our enhancements
make proactive FT a valuable asset for long-running MPI
applications, one that is complementary to reactive FT using full
checkpoint/restart schemes, since checkpoint frequencies can be reduced
as fewer unanticipated failures are encountered. In the context of
OS virtualization, we believe that this is the first comprehensive
study of proactive fault tolerance where live migration is actually
triggered by health monitoring.
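
The decision step that the proactive FT daemon is described as orchestrating (health monitoring, load determination, and initiation of guest OS migration) might be sketched as follows. This is only an illustrative sketch, not the implementation evaluated in the paper: the names `NodeStatus` and `plan_migrations`, the boolean health flag, and the scalar load metric are all assumptions introduced here, and the actual hand-off to Xen's live migration is deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class NodeStatus:
    healthy: bool  # assumed outcome of a health-monitoring check
    load: float    # assumed load metric; lower means less loaded


def plan_migrations(
    nodes: Dict[str, NodeStatus],
    guests: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """Return (guest, source_node, target_node) for each guest on an unhealthy node.

    `guests` maps a guest OS name to the node currently hosting it.
    The target is the least-loaded healthy node; ties are broken by node name.
    Guests are skipped when no healthy node exists.
    """
    healthy = [(status.load, name) for name, status in nodes.items() if status.healthy]
    if not healthy:
        return []
    _, target = min(healthy)

    plan = []
    for guest, host in sorted(guests.items()):
        if not nodes[host].healthy:
            # A real daemon would trigger live migration of the guest here.
            plan.append((guest, host, target))
    return plan


if __name__ == "__main__":
    cluster = {
        "n1": NodeStatus(healthy=False, load=0.2),
        "n2": NodeStatus(healthy=True, load=0.9),
        "n3": NodeStatus(healthy=True, load=0.1),
    }
    placement = {"vm-a": "n1", "vm-b": "n2"}
    print(plan_migrations(cluster, placement))
```

For simplicity the sketch does not update the target's load after each assignment, so several guests leaving unhealthy nodes would all be sent to the same target; a fuller version would re-evaluate load between migrations.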