(Local Copy) Fault Tolerance Research @ Open Systems Laboratory

Transparent Checkpoint/Restart in Open MPI

Command Line Tools

Command line tools to support checkpoint/restart in Open MPI.

ompi-checkpoint

The ompi-checkpoint command checkpoints a running MPI application. Its one required argument is the PID of the mpirun process, and it must be launched on the same machine as that mpirun process. Once a checkpoint request has completed, ompi-checkpoint returns a global snapshot reference and a sequence number. You will need both to restart the MPI job at a later time.

Interface

ompi-checkpoint PID_OF_MPIRUN \
    [-h | --help]
    [-v | --verbose]
    [-V #]
    [--term]
    [--stop]
    [-w | --nowait]
    [-s | --status]
    [-l | --list]
    [-attach | --attach]
    [-detach | --detach]
    [-crdebug | --crdebug]

Example

shell$ mpirun my-app <args> &
shell$ export PID_OF_MPIRUN=1234
shell$ ompi-checkpoint $PID_OF_MPIRUN
Snapshot Ref.: 0 ompi-global-snapshot-1234
shell$ ompi-checkpoint $PID_OF_MPIRUN
Snapshot Ref.: 1 ompi-global-snapshot-1234
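
A script that checkpoints a job usually needs to keep the sequence number and global snapshot reference for a later restart. The following is a minimal sketch, assuming the output line has exactly the "Snapshot Ref.: SEQ REF" shape shown in the example above. The function name parse_snapshot_ref is hypothetical and not part of Open MPI.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): extract the sequence number and
# global snapshot reference from ompi-checkpoint output read on stdin.
# Assumes lines of the form "Snapshot Ref.: SEQ REF", as in the example above.
parse_snapshot_ref() {
    # Remember the last matching line and print "SEQ REF" for it.
    awk '$1 == "Snapshot" && $2 == "Ref.:" { seq = $3; ref = $4 }
         END { if (ref != "") print seq, ref }'
}

# Demonstration on the sample output from the example (no MPI job required).
printf 'Snapshot Ref.: 1 ompi-global-snapshot-1234\n' | parse_snapshot_ref
```

In a live session the same function could read from a pipe, for example: ompi-checkpoint $PID_OF_MPIRUN | parse_snapshot_ref.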

Arguments

Argument              Description
PID_OF_MPIRUN         PID of the mpirun process
-h | --help           Display help
-v | --verbose        Display verbose output
-V #                  Display verbose output up to the specified level
--term                Terminate the application after the checkpoint
--stop                Send SIGSTOP to the application just after the checkpoint. The checkpoint will not finish until SIGCONT is sent. Cannot be used with --term.
-w | --nowait         Not implemented. Do not wait for the application to finish checkpointing before returning.
-s | --status         Display status messages describing the progress of the checkpoint
-l | --list           Display a list of checkpoint files available on this machine
-attach | --attach    Wait for the debugger to attach directly after taking the checkpoint. (Introduced in r23587; included in v1.5.1 and later releases.)
-detach | --detach    Do not wait for the debugger to reattach after taking the checkpoint. (Introduced in r23587; included in v1.5.1 and later releases.)
-crdebug | --crdebug  Enable C/R Enhanced Debugging. (Introduced in r23587; included in v1.5.1 and later releases.)
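
The argument table states that --stop cannot be used with --term. A wrapper script might check this before calling ompi-checkpoint. The following is a minimal sketch. The function name checkpoint_args_ok is hypothetical, and the check is an illustration rather than a description of how ompi-checkpoint itself validates its arguments.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of Open MPI): reject an argument
# list that combines --term and --stop, which the table says is not allowed.
checkpoint_args_ok() {
    has_term=0
    has_stop=0
    for arg in "$@"; do
        case "$arg" in
            --term) has_term=1 ;;
            --stop) has_stop=1 ;;
        esac
    done
    if [ "$has_term" -eq 1 ] && [ "$has_stop" -eq 1 ]; then
        echo "error: --term and --stop cannot be combined" >&2
        return 1
    fi
    return 0
}

# Demonstration: a valid combination passes the check and prints "ok".
checkpoint_args_ok --status --term 1234 && echo "ok"
```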

Notes

Users familiar with the LAM/MPI checkpoint/restart commands should note that ompi-checkpoint does not require the user to specify which checkpoint/restart service (e.g., BLCR or SELF) to use when checkpointing the application. This information is automatically detected and stored with the checkpoint snapshot.

ompi-restart

The ompi-restart command is provided to restart a previously-checkpointed MPI application. The one required argument to this command is the global snapshot reference returned by ompi-checkpoint. The global snapshot reference contains all of the necessary information to properly restart an MPI application. Invoking ompi-restart results in a new mpirun being launched.

Interface

ompi-restart GLOBAL_SNAPSHOT_REF \
    [-h | --help]
    [-v | --verbose]
    [--fork]
    [-s | --seq #]
    [--hostfile]
    [--machinefile]
    [-i | --info]
    [-a | --apponly]
    [-crdebug | --crdebug]
    [-mpirun_opts | --mpirun_opts]
    [--showme]

Example

shell$ ompi-restart ompi-global-snapshot-1234

Arguments

Argument                      Description
GLOBAL_SNAPSHOT_REF           Global snapshot reference
-h | --help                   Display help
-v | --verbose                Display verbose output
--fork                        Fork off a new process to become the restarted process instead of replacing orte_restart
-s | --seq #                  Sequence number of the checkpoint to restart from (default: -1, the most recent)
--hostfile | --machinefile    Provide a hostfile to use for launch
-i | --info                   Display information about the checkpoint
-a | --apponly                Only create the app context file; do not restart from it. (Introduced in r23587; included in v1.5.1 and later releases.)
-crdebug | --crdebug          Enable C/R Enhanced Debugging. (Introduced in r23587; included in v1.5.1 and later releases.)
-mpirun_opts | --mpirun_opts  Command line options to pass directly to mpirun (be sure to quote long strings and escape internal quotes). (Introduced in r23587; included in v1.5.1 and later releases.)
--showme                      Display the full command line that would have been exec'ed. (Introduced in r23587; included in v1.5.1 and later releases.)
-p | --preload                Preload the checkpoint files before restarting (default: disabled). (Deprecated in r23587, that is, in v1.5.1 and later releases.)
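
To show how the snapshot reference, the sequence number, and a hostfile fit together, the following minimal sketch assembles an ompi-restart command line and prints it without running it. The function name restart_cmd and the file name my-hostfile are hypothetical. The argument order follows the Interface listing above, with the snapshot reference first.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): print an ompi-restart command
# line for a snapshot reference, an optional sequence number, and an optional
# hostfile. It only prints the command; it does not execute it.
restart_cmd() {
    ref=$1
    seq_num=${2:-}
    hostfile=${3:-}
    cmd="ompi-restart $ref"
    if [ -n "$seq_num" ]; then
        cmd="$cmd -s $seq_num"            # restart from a specific checkpoint
    fi
    if [ -n "$hostfile" ]; then
        cmd="$cmd --hostfile $hostfile"   # launch on the hosts listed in the file
    fi
    echo "$cmd"
}

# Demonstration: restart from sequence number 0 using a hypothetical hostfile.
restart_cmd ompi-global-snapshot-1234 0 my-hostfile
```

Leaving out the sequence number yields a plain "ompi-restart REF" command, which according to the table restarts from the most recent checkpoint.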

Notes

Users familiar with the LAM/MPI checkpoint/restart commands should note that ompi-restart does not require the user to specify which checkpoint/restart service (e.g., BLCR or SELF) was used when checkpointing the application. This information is stored with the checkpoint snapshot and is used automatically by the ompi-restart command.

ompi-migrate

Introduced in r23587. Included in v1.5.1 and later releases.

The ompi-migrate command is provided to migrate an MPI application. The one required argument to this command is the PID of the mpirun process. This command must be launched on the same machine as the running mpirun process.

Interface

ompi-migrate PID_OF_MPIRUN \
    [-h | --help]
    [-v | --verbose]
    [-r | --ranks]
    [-t | --onto]
    [-x | --off]

Example

shell$ ompi-migrate -x node123,node124 1234
shell$ ompi-migrate -x node123,node124 -t node125,node126 1234
shell$ ompi-migrate -r 1,3,5,7 1234

Arguments

Argument        Description
PID_OF_MPIRUN   PID of the mpirun process
-h | --help     Display help
-v | --verbose  Display verbose output
-r | --ranks    List of MPI_COMM_WORLD ranks to migrate (comma separated)
-t | --onto     List of nodes to migrate onto (comma separated)
-x | --off      List of nodes to migrate off of (comma separated)
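
The -r, -t, and -x options each take a comma-separated list. When the ranks or node names come from a script as separate words, they have to be joined first. The following is a minimal sketch. The function name join_commas is hypothetical, and the command is only printed, not executed.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): join all arguments with commas,
# producing the list format that -r, -t, and -x expect.
join_commas() {
    # In a subshell, set IFS so that "$*" joins the arguments with ",".
    ( IFS=,; echo "$*" )
}

# Demonstration: build the rank list used in the example above and print
# the resulting ompi-migrate command line.
ranks=$(join_commas 1 3 5 7)
echo "ompi-migrate -r $ranks 1234"
```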
