November 2022
Checkpointing program
A lightweight, entirely self-contained checkpointing program that can save and restore the complete state of any running program.
Repo: https://github.com/alex-w-99/Checkpointing-Program
Demo video: https://youtu.be/FrD10-QyvNs
The magic of checkpointing
Checkpointing exists because long-running programs fail in annoying ways. Jobs crash, machines reboot, SSH sessions drop, and processes get killed. If the program can only run from start to finish uninterrupted, you’re one power glitch away from losing hours of work.
Checkpointing avoids this by saving the entire execution state, all the way down to the registers, to disk. Restoring the checkpoint puts the program back exactly where it paused, as if it never stopped.
How it works
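This project implements checkpointing at the operating system level in C. The workflow is simple: launch the target program through ./ckpt, trigger a checkpoint so its state is written to myckpt.dat, then run ./restart to resume from that file.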
Technical details
The implementation exploits several clever systems-level techniques:
The LD_PRELOAD trick
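By setting LD_PRELOAD to our custom library, we get our code loaded into the target process before main() runs. A dynamically linked ELF binary on Linux does not begin execution at main(): the dynamic loader first maps the shared libraries (preloaded ones included) and runs their constructors, and only then does the program’s _start entry point set things up and call main(). That gives us a clean hook to install signal handlers and prepare for checkpointing without modifying the target program. Statically linked binaries skip the dynamic loader, so the trick does not apply to them.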
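As a rough illustration of that hook, here is a minimal sketch of a preloaded library's constructor. The choice of SIGUSR2 and the names ckpt_init, on_ckpt_signal, and ckpt_requested are assumptions made for this example, not necessarily what the project uses.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler. A real library would write the checkpoint at this point. */
static volatile sig_atomic_t ckpt_requested = 0;

/* Hypothetical handler: only records that a checkpoint was requested. */
static void on_ckpt_signal(int signo)
{
    (void)signo;
    ckpt_requested = 1;
}

/* Runs when the loader initializes this library, i.e. before main(). */
__attribute__((constructor))
static void ckpt_init(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_ckpt_signal;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR2, &sa, NULL);   /* SIGUSR2 is an assumed trigger signal */
}
```

Built as a shared object and named in LD_PRELOAD, a library like this would have its handler in place by the time the target's main() starts.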
Capturing program state
When a checkpoint is triggered, the program must save:
- CPU register values (using ucontext_t to capture the execution context)
- The entire memory layout (stack, heap, data segments, memory-mapped regions)
- File descriptors and other process state
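One common way to discover the memory layout on Linux is to read /proc/self/maps. The sketch below assumes that approach (the post does not say how the project enumerates regions), and the struct region type and both function names are hypothetical.

```c
#include <stdio.h>

/* Hypothetical record for one mapped region. */
struct region {
    unsigned long start, end;   /* address range [start, end) */
    char perms[5];              /* e.g. "rw-p" */
};

/* Parse one line of /proc/<pid>/maps. Returns 1 on success, 0 otherwise. */
static int parse_maps_line(const char *line, struct region *r)
{
    return sscanf(line, "%lx-%lx %4s", &r->start, &r->end, r->perms) == 3;
}

/* Fill out[] with up to max regions of the calling process.
 * Returns the number of regions found, or -1 if the file cannot be opened. */
static int list_regions(struct region *out, int max)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f)
        return -1;

    char line[4352];            /* room for the fixed fields plus a long path */
    int n = 0;
    while (n < max && fgets(line, sizeof line, f))
        if (parse_maps_line(line, &out[n]))
            n++;

    fclose(f);
    return n;
}
```

Each region's start, end, and permissions are what a checkpointer would need to record alongside the bytes themselves. stdio is used here for brevity; code running inside a signal handler would normally stick to async-signal-safe calls such as open() and read().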
All of this gets serialized into a single myckpt.dat file in a structured header-data format.
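To make "header-data format" concrete, here is one possible shape for it: a fixed-size header describing a region, followed by that region's raw bytes. The field names and layout of struct ckpt_header are assumptions for illustration; the project's actual on-disk format may differ.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical header written before each chunk of data. */
struct ckpt_header {
    uint64_t start;       /* address where the region begins */
    uint64_t length;      /* number of data bytes that follow this header */
    uint32_t prot;        /* permission bits, e.g. PROT_READ | PROT_WRITE */
    uint32_t is_context;  /* 1 if the payload is a saved register context */
};

/* Write one header followed by its payload. Returns 0 on success, -1 on error. */
static int write_record(FILE *f, const struct ckpt_header *h, const void *data)
{
    if (fwrite(h, sizeof *h, 1, f) != 1)
        return -1;
    if (h->length && fwrite(data, 1, h->length, f) != h->length)
        return -1;
    return 0;
}

/* Read the next header. The caller then reads h->length payload bytes. */
static int read_header(FILE *f, struct ckpt_header *h)
{
    return fread(h, sizeof *h, 1, f) == 1 ? 0 : -1;
}
```

The reader never has to guess sizes: each header says exactly how many bytes to consume before the next header starts.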
Restoring the process
When ./restart runs, it:
- Reads the checkpoint file
- Reconstructs the original memory layout using mmap()
- Restores all memory regions to their checkpointed values
- Restores CPU registers using setcontext()
- Jumps back into the program as if nothing happened
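The memory-restoring steps might look roughly like the sketch below, which assumes the region is recreated with MAP_FIXED at its original address and then filled from the saved bytes. The restore_region helper is hypothetical.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

/* Map len bytes at exactly addr, copy the saved bytes in, then apply the
 * region's original permissions. Returns 0 on success, -1 on error. */
static int restore_region(void *addr, size_t len, int prot, const void *data)
{
    /* Map writable first so the contents can be copied in. */
    void *p = mmap(addr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memcpy(p, data, len);
    return mprotect(p, len, prot);
}
```

MAP_FIXED silently replaces whatever is already mapped at that address, so a restart program has to make sure its own code and stack do not overlap the regions it is about to restore.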
From the program's perspective, it's as if time stood still. All variables maintain their values, the call stack is intact, and execution continues seamlessly.
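The register half of this can be seen in miniature with getcontext() and setcontext() inside a single process. This is only a toy, not the project's restart code: it shows that setcontext() puts execution back at the point where the context was captured, while memory keeps whatever values it currently holds. The function name resume_count is made up for the example.

```c
#include <ucontext.h>

/* Captures a context, then jumps back to it once. Returns the number of
 * times execution passed the capture point. */
static int resume_count(void)
{
    ucontext_t ctx;
    volatile int passes = 0;   /* volatile: kept in memory, not in a register */

    getcontext(&ctx);          /* execution resumes here after setcontext() */
    passes++;

    if (passes == 1)
        setcontext(&ctx);      /* restores registers; does not return on success */

    return passes;
}
```

Because setcontext() restores registers but leaves memory alone, passes keeps counting across the jump. That is also why a restart has to put memory back first and restore the registers last.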
Universal compatibility
The beauty of this implementation is that it's entirely self-contained. Any executable linked with -rdynamic (which exports its symbols to the dynamic symbol table) works with this checkpointing system, with no modifications to its source code.
Want to checkpoint a Python script? A data processing pipeline? A long-running simulation? Just run it through ./ckpt and you're covered.
Why this matters
Checkpointing is a fundamental technique in:
- High-performance computing: Long simulations on cluster nodes
- Distributed systems: Fault tolerance and recovery
- Containerization: Technologies like Docker's experimental checkpoint/restore, which builds on CRIU
- Live migration: Moving running processes between machines
This project demonstrates these concepts at a foundational level, showing how operating systems can manipulate running processes in powerful ways.
The limitations (and learning)
Building this system revealed the intricate details of how programs execute:
- Memory layout intricacies and segment permissions
- The relationship between contexts, signals, and control flow
- Why certain registers (like %fs, which holds the thread-local storage base on x86-64) require special handling
- The challenges of making checkpointing thread-safe
If you understand something well enough, you can exploit that knowledge to do something unconventional.