Fault Tolerance Research @ Open Systems Laboratory

Transparent Checkpoint/Restart in Open MPI


Examples and Use Cases

Below are some common use cases and examples.

Use Case: Transparent Checkpoint to NFS:

This use case demonstrates the basic checkpoint/restart functionality of Open MPI. Checkpoints are stored directly to a globally mounted file system in the /home/me/checkpoints/ directory. For this example we assume an unmodified application and Open MPI using the BLCR library for generating local snapshots.

$HOME/.openmpi/mca-params.conf

# Local snapshot directory (not used in this scenario)
# crs_base_snapshot_dir was deprecated as of r23587 (v1.5.1 and later releases).
# crs_base_snapshot_dir=/home/me/tmp
sstore_stage_local_snapshot_dir=/home/me/tmp

# Remote snapshot directory (globally mounted file system)
# snapc_base_global_snapshot_dir was deprecated as of r23587 (v1.5.1 and later releases).
# snapc_base_global_snapshot_dir=/home/me/checkpoints
sstore_base_global_snapshot_dir=/home/me/checkpoints

Shell #1:

Start an MPI job enabling fault tolerance. (Assume that the PID of mpirun is 1234).

shell$ mpirun -am ft-enable-cr my-app <args>
...

Shell #2:

Checkpoint the MPI job with mpirun PID 1234. At the second checkpoint, terminate the job.

shell$ ompi-checkpoint 1234
Snapshot Ref.: 0 ompi_global_snapshot_1234.ckpt
shell$ echo "wait for some time..."
shell$ ompi-checkpoint --term 1234
Snapshot Ref.: 1 ompi_global_snapshot_1234.ckpt
shell$

Shell #1:

Restart the job from the most recent checkpoint.

shell$ ompi-restart ompi_global_snapshot_1234.ckpt
...


Use Case: Transparent Checkpoint to Local Disk:

This use case demonstrates checkpointing to locally mounted file systems. Local snapshots are written to the /tmp/me/local directory on each node, then staged to the /tmp/me/global directory. For this example we assume an unmodified application and Open MPI using the BLCR library for generating local snapshots.

$HOME/.openmpi/mca-params.conf

# Transfer the files from the local snapshot directory to the global snapshot
# directory
# snapc_base_store_in_place was deprecated as of r23587 (v1.5.1 and later releases).
# snapc_base_store_in_place=0
sstore=stage

# Local snapshot directory (locally mounted file system)
# crs_base_snapshot_dir was deprecated as of r23587 (v1.5.1 and later releases).
# crs_base_snapshot_dir=/tmp/me/local
sstore_stage_local_snapshot_dir=/tmp/me/local

# Remote snapshot directory (locally mounted file system)
# snapc_base_global_snapshot_dir was deprecated as of r23587 (v1.5.1 and later releases).
# snapc_base_global_snapshot_dir=/tmp/me/global
sstore_base_global_snapshot_dir=/tmp/me/global

Shell #1:

Start an MPI job enabling fault tolerance. (Assume that the PID of mpirun is 1234).

shell$ mpirun -am ft-enable-cr my-app <args>
...

Shell #2:

Checkpoint the MPI job with mpirun PID 1234. At the second checkpoint, terminate the job.

shell$ ompi-checkpoint 1234
Snapshot Ref.: 0 ompi_global_snapshot_1234.ckpt
shell$ echo "wait for some time..."
shell$ ompi-checkpoint --term 1234
Snapshot Ref.: 1 ompi_global_snapshot_1234.ckpt
shell$

Shell #1:

Restart the job from the most recent checkpoint. Make sure to pass the --preload option so that the checkpoint files are transferred to the remote systems during startup.

shell$ ompi-restart --preload ompi_global_snapshot_1234.ckpt
...


Use Case: Checkpointing and SIGSTOP/SIGCONT:

This use case demonstrates how to checkpoint and immediately send SIGSTOP to an Open MPI application. The application can then be continued using SIGCONT. Alternatively the application may also be terminated, and restarted at a later point in time from the generated checkpoint.

This functionality is useful in a gang scheduled environment where a running application may be stopped and held in memory while another application uses the machines. The new application can safely kill the stopped application if it needs more memory, since the stopped application can be restarted from a checkpoint. Alternatively if more resources become available the stopped application can be terminated and restarted on the free resources.

Shell #1:

Start an MPI job enabling fault tolerance. (Assume that the PID of mpirun is 1234).

shell$ mpirun -am ft-enable-cr my-app <args>
...

Shell #2:

Checkpoint the MPI job with mpirun PID 1234, passing the --stop option to send SIGSTOP to the application immediately after the checkpoint completes. The generated checkpoint can be used as usual. When restarting from it, the SIGCONT signal is automatically forwarded to the restarted processes.

shell$ ompi-checkpoint --stop -v 1234
[localhost:001300] [  0.00 /   0.20]                 Requested - ...
[localhost:001300] [  0.00 /   0.20]                   Pending - ...
[localhost:001300] [  0.01 /   0.21]                   Running - ...
[localhost:001300] [  1.01 /   1.22]                   Stopped - ompi_global_snapshot_1234.ckpt
Snapshot Ref.: 0 ompi_global_snapshot_1234.ckpt
shell$ echo "Application is now stopped"
shell$

Shell #2:

To resume the job, send the SIGCONT signal to mpirun, which forwards the signal to all of the processes in the application.

shell$ kill -CONT 1234
shell$ echo "Application resumes computation"


Example: Using the SELF Checkpoint/Restart System:

The SELF component will invoke the user-defined functions to save and restore checkpoints. It is simply a mechanism for user-defined functions to be invoked at Open MPI's Checkpoint, Continue, and Restart phases. Hence, the only data saved during the checkpoint is what the user's checkpoint function writes; no MPI library state is saved at all. As such, the model for the SELF component differs slightly from, for example, the BLCR component. Specifically, the Restart function is not invoked in the same process image as the process that was checkpointed. Instead, the Restart phase is invoked during MPI_INIT of a new instance of the application (i.e., it starts over from main()).

Below is an example of an application that takes advantage of the SELF Checkpoint/Restart System. Checkpointing and restarting of the MPI job occurs exactly as in the Transparent Checkpoint Use Cases.

Compiling

The user callback functions are located by symbol name at runtime, so they must be exported when linking:

shell$ mpicc my-app.c -export -export-dynamic -o my-app

Running

Using the default callback function prefix (opal_crs_self_user):

shell$ mpirun -np 2 -am ft-enable-cr my-app

Using a custom prefix, so the SELF component invokes my_personal_checkpoint(), my_personal_continue(), and my_personal_restart() instead:

shell$ mpirun -np 2 -am ft-enable-cr -mca crs_self_prefix my_personal my-app

my-app.c:

/*
 * Example Open PAL CRS self program
 * Author: Joshua Hursey
 */
#define _GNU_SOURCE   /* for asprintf() */
#include <mpi.h>
#include <stdio.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>   /* for getpid() and sleep() */

#define LIMIT 100

/************************
 * Function Declarations
 ************************/
void signal_handler(int sig);

/* Default OPAL crs self callback functions */
int opal_crs_self_user_checkpoint(char **restart_cmd);
int opal_crs_self_user_continue(void);
int opal_crs_self_user_restart(void);

/* OPAL crs self callback functions */
int my_personal_checkpoint(char **restart_cmd);
int my_personal_continue(void);
int my_personal_restart(void);

/*******************
 * Global Variables
 *******************/
int am_done = 1;
int current_step  = 0;
char ckpt_file[128] = "my-personal-cr-file.ckpt";
char restart_path[128] = "/full/path/to/personal-cr";

/*********
 *  Main
 *********/
int main(int argc, char *argv[]) {
    int rank, size;
    
    current_step = 0;

    MPI_Init(&argc, &argv);

    /* So we can exit cleanly */
    signal(SIGINT,  signal_handler);
    signal(SIGTERM, signal_handler);

    for(; current_step <  LIMIT; current_step += 1) {
        printf("%d) Step %d\n", (int)getpid(), current_step);
        sleep(1);
        if(0 == am_done) {
            break;
        }
    }

    MPI_Finalize();

    return 0;
}

void signal_handler(int sig) {
    printf("Received Signal %d\n", sig);
    am_done = 0;
}

/* OPAL crs self callbacks for checkpoint */
int opal_crs_self_user_checkpoint(char **restart_cmd) {
    printf("opal_crs_self_user_checkpoint callback...\n");
    my_personal_checkpoint(restart_cmd);
    return 0;
}

int opal_crs_self_user_continue(void) {
    printf("opal_crs_self_user_continue callback...\n");
    my_personal_continue();
    return 0;
}

int opal_crs_self_user_restart(void) {
    printf("opal_crs_self_user_restart callback...\n");
    my_personal_restart();
    return 0;
}

/* OPAL crs self callback for checkpoint */
int my_personal_checkpoint(char **restart_cmd) {
    FILE *fp;
    
    *restart_cmd = NULL;

    printf("my_personal_checkpoint callback...\n");
    
    /*
     * Open our checkpoint file
     */
    if( NULL == (fp = fopen(ckpt_file, "w")) ) {
        fprintf(stderr, "Error: Unable to open file (%s)\n", ckpt_file);
        return -1;
    }
    
    /*
     * Save the process state
     */
    fprintf(fp, "%d\n", current_step);
    
    /*
     * Close the checkpoint file
     */
    fclose(fp);

    /*
     * Figure out the restart command
     */
    asprintf(restart_cmd, "%s", restart_path);

    return 0;
}

int my_personal_continue() {
    printf("my_personal_continue callback...\n");
    /* Don't need to do anything here since we are in the
     *  state that we want to be in already.
     */
    return 0;
}

int my_personal_restart() {
    FILE *fp;

    printf("my_personal_restart callback...\n");

    /*
     * Open our checkpoint file
     */
    if( NULL == (fp = fopen(ckpt_file, "r")) ) {
        fprintf(stderr, "Error: Unable to open file (%s)\n", ckpt_file);
        return -1;
    }

    /*
     * Access the process state that we saved and 
     * update the current step variable.
     */
    fscanf(fp, "%d", &current_step);
    
    fclose(fp);

    printf("my_personal_restart: Restarting from step %d\n", current_step);    
    
    return 0;
}
