As scientific computing applications push the boundaries of the most capable High Performance Computing (HPC) systems, the HPC community responds with larger and increasingly complex systems. Unfortunately, as HPC systems grow, they become more susceptible to component failures that can unexpectedly cripple a scientific application during computation. Long-running applications at more modest scales must also contend with component failure as their run times approach the mean time to failure (MTTF) of the HPC system. As a result, HPC application users, developers, and system administrators must work together to manage these increasingly dynamic HPC environments.
The fault tolerance research in the Open Systems Laboratory (OSL) focuses on developing scalable, transparent and semi-transparent middleware solutions to support the management of faults in dynamic HPC environments. This site explores many of the fault tolerance and reliability projects currently under development in the OSL.
Transparent checkpoint/restart process fault tolerance allows the state of a running application to be saved to stable storage and recovered at a later time. This technique does not require any changes to the application source code, making it a convenient solution for complex legacy applications and for scheduler-based dynamic resource management.
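The checkpoint/restart cycle itself can be illustrated with a minimal sketch. Note that this is an application-level example written for illustration only: the application saves its own state explicitly, which is exactly the source code change that the transparent approach avoids. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the `fail_at` parameter, and the use of a local file as "stable storage" are all hypothetical and are not part of any OSL or Open MPI interface.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to path atomically (a local file stands in for stable storage)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the data to the device before renaming
    # Rename last, so a crash mid-write never leaves a partial checkpoint at path.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing after every step.

    fail_at is a hypothetical knob that simulates a component failure
    just before the given step is computed.
    """
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated component failure")
        state["total"] += state["step"]
        state["step"] += 1
        save_checkpoint(path, state)
    return state["total"]
```

If a run is interrupted, calling `run` again with the same checkpoint path resumes from the last saved step rather than from the beginning. A transparent system aims to provide the same recovery behavior without the application containing any of this logic.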
We provide a transparent checkpoint/restart process fault tolerance solution for MPI-1.3-compliant applications using Open MPI. Our solution was incorporated into the development trunk of Open MPI in March 2007 and was later released as part of the v1.3 release series.
The Coordinated Infrastructure for Fault Tolerant Systems (CIFTS) project aims to provide a coordinated infrastructure that will enable Fault Tolerant Systems to adapt to faults occurring in the operating environment in a holistic manner. Our work on this project focuses on integrating the CIFTS infrastructure into Open MPI, and making Open MPI more robust in the face of failure.