Distributed checkpoints via torch.distributed.checkpoint (DCP):
what's saved, how resharding works, auto-resume rules, and
HuggingFace interchange.
```{toctree}
:maxdepth: 1

dcp-model
resharding
train-state
auto-resume
hf-conversion
```
Every checkpoint lands in `{config.checkpoint.dir}/step_{N}/` and contains two kinds of state (sharded model/optimizer state and train state), plus two small bookkeeping files:
| File(s) | Contents | Format |
|---|---|---|
| DCP shards (`.distcp` + `.metadata`) | Model + optimizer state, one shard per rank | DCP § Model + optimizer |
| `train_state.pt` | step, tokens_seen, scheduler, RNG, extras (e.g. `phase_idx`, `wandb_run_id`) | Train state |
| `metadata.json` | Human-readable `{"step": N, "tokens_seen": M}` | Plain JSON |
| `latest` symlink | Points at the most recent `step_N` (updated atomically) | Auto-resume |
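To make the layout concrete, here is a minimal sketch of one save. The function `save_step` and its signature are hypothetical (the real logic is `CheckpointManager.save()` in `kempnerforge/checkpoint/manager.py`, listed below); `dcp.save` and `get_state_dict` are the public `torch.distributed.checkpoint` API, and a process group must already be initialized.

```python
import json
import os

import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict


def save_step(ckpt_dir: str, step: int, model, optimizer, train_state: dict):
    """Hypothetical sketch of one checkpoint save at `step`."""
    step_dir = os.path.join(ckpt_dir, f"step_{step}")

    # 1. DCP shards: a collective call; every rank writes its own .distcp
    #    shard, and DCP writes a shared .metadata file describing the layout.
    model_sd, optim_sd = get_state_dict(model, optimizer)
    dcp.save({"model": model_sd, "optimizer": optim_sd}, checkpoint_id=step_dir)

    # 2. Train state and metadata.json are tiny, so rank 0 writes them alone.
    if int(os.environ.get("RANK", "0")) == 0:
        torch.save(train_state, os.path.join(step_dir, "train_state.pt"))
        with open(os.path.join(step_dir, "metadata.json"), "w") as f:
            json.dump({"step": step, "tokens_seen": train_state["tokens_seen"]}, f)

        # 3. Repoint `latest` atomically: symlink under a temp name, then
        #    rename over the old link so readers never see a dangling link.
        tmp = os.path.join(ckpt_dir, "latest.tmp")
        if os.path.lexists(tmp):
            os.remove(tmp)
        os.symlink(f"step_{step}", tmp)
        os.replace(tmp, os.path.join(ckpt_dir, "latest"))
```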
The implementation lives in:

- `kempnerforge/checkpoint/manager.py`: `CheckpointManager.save()`/`load()`/`wait()`, `latest` symlink maintenance, retention cleanup.
- `kempnerforge/checkpoint/async_save.py`: `AsyncCheckpointer`, with sync / async / pinned-memory modes.
- `kempnerforge/checkpoint/state.py`: `build_train_state`/`restore_train_state`, RNG capture.
- `kempnerforge/resilience/elastic.py`: `resolve_resume_path()`, which checks the `latest` symlink and falls back to the highest `step_N` (sketched after this list).
- `scripts/convert_checkpoint.py`: the `dcp-to-hf` and `hf-to-dcp` CLI.
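A hedged sketch of the fallback order described for `resolve_resume_path()`; treat it as an illustration of the rule (prefer `latest`, else the highest `step_N`), not the actual implementation:

```python
import re
from pathlib import Path
from typing import Optional


def resolve_resume_path(ckpt_dir: str) -> Optional[Path]:
    """Illustrative version of the resume-path rule; not the real code."""
    root = Path(ckpt_dir)
    if not root.is_dir():
        return None  # first run: nothing to resume from

    # Prefer the `latest` symlink when it resolves to a real step directory.
    latest = root / "latest"
    if latest.is_symlink() and latest.resolve().is_dir():
        return latest.resolve()

    # Fall back to the highest step_N on disk (e.g. the symlink was lost
    # mid-update or never created).
    steps = []
    for child in root.iterdir():
        m = re.fullmatch(r"step_(\d+)", child.name)
        if m and child.is_dir():
            steps.append((int(m.group(1)), child))
    return max(steps)[1] if steps else None
```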
Where to start:

- New reader: DCP model + optimizer → Train state.
- Resuming a job: Auto-resume first, then Resharding if you're changing GPU count.
- Exporting for inference or HF checkpoints: HF conversion.
- Config knobs: Configuration § CheckpointConfig (search for `interval`, `async_mode`, `keep_last_n`, `load_path`); a sketch of these fields follows below.
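For orientation only, a guessed shape of `CheckpointConfig` built from the field names above; the defaults and the set of `async_mode` values are assumptions, so check Configuration § CheckpointConfig for the real definition:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckpointConfig:
    """Illustrative sketch; defaults here are assumed, not the real values."""
    dir: str = "checkpoints"         # root holding step_{N}/ directories
    interval: int = 1000             # save every N optimizer steps (assumed default)
    async_mode: str = "async"        # "sync" | "async" | "pinned" (assumed values)
    keep_last_n: int = 3             # retention: prune older step_{N}/ dirs
    load_path: Optional[str] = None  # explicit resume path; overrides `latest`
```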