
(Local Copy) Fault Tolerance Research @ Open Systems Laboratory

Checkpoint/Restart Enabled Debugging



The isolation and correction of programmatic errors (debugging) is often the most time-consuming part of software development, especially in parallel programming. Typically, a cyclic debugging technique is used in conjunction with a debugger. During cyclic debugging, the developer analyzes multiple executions of a program, building up knowledge about the program state surrounding the problem area. The analysis usually focuses on a relatively small time slice of program execution surrounding the suspected error. The current debugging process requires the developer to restart the program under debug from the beginning of execution for every pass of the cyclic debugging process. By checkpointing the process under debug at regular and user-defined intervals, a debugger can let the developer return to an intermediate state of program execution closer to the bug, saving a considerable amount of developer time.
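The saving described above can be illustrated with a toy model. The following is a minimal sketch, not the project's debugger or MPI interface: the class `CheckpointedRun`, the function `step_fn`, and the in-memory deep copy used as a "checkpoint" are all hypothetical stand-ins for a real process image written by a checkpoint/restart service. It assumes the program is deterministic, so replaying from a checkpoint reproduces the same state.

```python
import copy


class CheckpointedRun:
    """Toy stand-in for a process under debug (hypothetical, for illustration).

    The "program" is a deterministic step function applied repeatedly to a
    state object. A checkpoint is a deep copy of that state taken every
    `interval` steps.
    """

    def __init__(self, step_fn, initial_state, interval):
        self.step_fn = step_fn
        self.interval = interval
        # Step 0 is always available, which models a restart from the beginning.
        self.checkpoints = {0: copy.deepcopy(initial_state)}
        self.steps_executed = 0  # total work done across all debugging passes

    def run_to(self, target_step):
        """One debugging pass: restart from the nearest checkpoint at or
        before target_step, then execute forward to target_step."""
        base = max(s for s in self.checkpoints if s <= target_step)
        state = copy.deepcopy(self.checkpoints[base])
        step = base
        while step < target_step:
            state = self.step_fn(state)
            step += 1
            self.steps_executed += 1
            # Regular-interval checkpoint, as described in the text above.
            if step % self.interval == 0:
                self.checkpoints[step] = copy.deepcopy(state)
        return state


def step_fn(state):
    # Hypothetical workload: accumulate a running sum.
    return {"total": state["total"] + state["i"], "i": state["i"] + 1}


run = CheckpointedRun(step_fn, {"total": 0, "i": 0}, interval=100)

# First pass: no checkpoints exist yet, so all 950 steps are executed.
state_first = run.run_to(950)
first_pass_steps = run.steps_executed

# Second pass: restart from the checkpoint at step 900 and replay 50 steps.
state_second = run.run_to(950)
second_pass_steps = run.steps_executed - first_pass_steps

print(first_pass_steps, second_pass_steps)
```

In this sketch the second pass replays only the steps between the last checkpoint and the point of interest, which is the effect the text attributes to checkpoint/restart-enabled debugging. A real parallel job would also need the checkpoints of all processes to be coordinated, which this single-process model does not attempt to show.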

This project aims to achieve checkpoint/restart-enabled parallel debugging by defining an interface between a parallel debugger and a checkpoint/restart-enabled MPI implementation. This website documents the progress toward that goal and presents some examples of how the debugger, MPI, and checkpoint/restart services interact.


  • Joshua Hursey, Chris January, Mark O'Connor, Paul H. Hargrove, David Lecomber, Jeffrey M. Squyres, and Andrew Lumsdaine. Checkpoint/Restart-Enabled Parallel Debugging. EuroMPI 2010. September 2010. [Slides]


Currently Supported


No additional notes at this time