November 2022

Checkpointing program

A lightweight, entirely self-contained checkpointing program that can save and restore the complete state of any running program.


Repo: https://github.com/alex-w-99/Checkpointing-Program

Demo video: https://youtu.be/FrD10-QyvNs

The magic of checkpointing

Checkpointing exists because long-running programs fail in annoying ways. Jobs crash, machines reboot, SSH sessions drop, and processes get killed. If the program can only run from start to finish uninterrupted, you’re one power glitch away from losing hours of work.

Checkpointing avoids this by saving the entire execution state, all the way down to the registers, to disk. Restoring the checkpoint puts the program back exactly where it paused, as if it never stopped.

How it works

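This project implements checkpointing at the operating system level in C. The workflow has two steps. First, launch the target program through ./ckpt; when a checkpoint is triggered, the program's state is written to myckpt.dat. Later, run ./restart to resume the program from that file.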

Technical details

The implementation relies on a few systems-level techniques:

The LD_PRELOAD trick

By setting LD_PRELOAD to our custom library, we get our code loaded into the target process before main() runs. A dynamically linked ELF binary on Linux does not start at main(): the dynamic loader first maps the shared libraries, including any preloaded ones, and runs their constructors. Only then does control reach the startup code at _start, which eventually calls main(). That gives us a clean hook to install signal handlers and prepare for checkpointing without modifying the target program.

Capturing program state

When a checkpoint is triggered, the program must save:

  • CPU register values (using ucontext_t to capture the execution context)
  • The entire memory layout (stack, heap, data segments, memory-mapped regions)
  • File descriptors and other process state

All of this gets serialized into a single myckpt.dat file in a structured header-data format.

Restoring the process

When ./restart runs, it:

  1. Reads the checkpoint file
  2. Reconstructs the original memory layout using mmap()
  3. Restores all memory regions to their checkpointed values
  4. Restores CPU registers using setcontext()
  5. Jumps back into the program as if nothing happened

From the program's perspective, it's as if time stood still. All variables maintain their values, the call stack is intact, and execution continues seamlessly.
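
The register half of this can be tried inside a single process. The sketch below (the function name is made up for the example) saves a context with getcontext(), then jumps back to it with setcontext(). Execution resumes right after the getcontext() call a second time, and a counter kept in memory is what tells the two passes apart.

```c
#include <assert.h>
#include <ucontext.h>

/* Returns how many times execution passed the getcontext() call. */
static int resume_demo(void)
{
    ucontext_t ctx;
    /* volatile keeps the counter in memory. setcontext() restores
     * registers only, so a value held in memory survives the jump. */
    volatile int passes = 0;

    int rc = getcontext(&ctx);   /* "checkpoint": save the registers */
    assert(rc == 0);
    (void)rc;

    passes++;
    if (passes == 1)
        setcontext(&ctx);        /* "restart": resume after getcontext() */

    return passes;               /* 2: once normally, once after the jump */
}
```

A checkpointing library needs the same kind of distinction: after getcontext() returns, it has to know whether it has just saved a checkpoint or has just been restored from one, and a flag stored in memory is one common way to tell.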

Universal compatibility

The beauty of this implementation is that it's entirely self-contained. Any dynamically linked executable built with -rdynamic (which adds its symbols to the dynamic symbol table) works with this checkpointing system, with no modifications to the source code required.

Want to checkpoint a Python script? A data processing pipeline? A long-running simulation? Just run it through ./ckpt and you're covered.

Why this matters

Checkpointing is a fundamental technique in:

  • High-performance computing: Long simulations on cluster nodes
  • Distributed systems: Fault tolerance and recovery
  • Containerization: Technologies like Docker's checkpoint/restore, which builds on CRIU
  • Live migration: Moving running processes between machines

This project demonstrates these concepts at a foundational level, showing how operating systems can manipulate running processes in powerful ways.

The limitations (and learning)

Building this system revealed the intricate details of how programs execute:

  • Memory layout intricacies and segment permissions
  • The relationship between contexts, signals, and control flow
  • Why certain registers (like %fs, which holds the thread-local storage base on x86-64) require special handling
  • The challenges of making checkpointing thread-safe

If you understand something well enough, you can exploit that knowledge to do something unconventional.