Add max_checkpoints to limit permanent checkpoint retention #694

Open
TimDettmers wants to merge 2 commits into main from timd/add-max-checkpoints

@TimDettmers TimDettmers commented May 27, 2026

Summary

  • Adds a max_checkpoints parameter to CheckpointerCallback (default: 3) that trims the oldest permanent checkpoints after each save
  • Uses the existing _schedule_for_removal / _remove_checkpoint infrastructure — no new deletion code needed
  • Set to None to preserve the previous keep-all behavior
  • Includes input validation (max_checkpoints >= 1)
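The behavior listed above can be sketched in a few lines. This is a minimal illustration, not the OLMo-core implementation: `ToyCheckpointer`, its `save` method, and the `_removed` list are hypothetical stand-ins, and the `_schedule_for_removal` here only records a path instead of deleting anything. It assumes `_checkpoints` is ordered oldest first.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyCheckpointer:
    """Hypothetical stand-in for CheckpointerCallback; not OLMo-core code."""

    max_checkpoints: Optional[int] = 3
    _checkpoints: List[str] = field(default_factory=list)
    _removed: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Input validation: None means keep everything, otherwise require >= 1.
        if self.max_checkpoints is not None and self.max_checkpoints < 1:
            raise ValueError("max_checkpoints must be >= 1 or None")

    def _schedule_for_removal(self, path: str) -> None:
        # Stand-in for the existing removal hook; here it only records the path.
        self._removed.append(path)

    def _trim_checkpoints(self) -> None:
        if self.max_checkpoints is None:
            return
        # Assumes the oldest entries sit at the front of the list.
        while len(self._checkpoints) > self.max_checkpoints:
            self._schedule_for_removal(self._checkpoints.pop(0))

    def save(self, path: str) -> None:
        # Record the new permanent checkpoint, then enforce the limit.
        self._checkpoints.append(path)
        self._trim_checkpoints()


ckpt = ToyCheckpointer(max_checkpoints=3)
for step in (200, 400, 600, 800):
    ckpt.save(f"step{step}")
```

After the fourth save, the toy object retains `step400`, `step600`, and `step800` and has scheduled `step200` for removal. Passing `max_checkpoints=None` skips trimming entirely.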

Motivation

Permanent checkpoints accumulate indefinitely because there is no retention limit: the _checkpoints list is appended to on every save but never trimmed. On AI2's WEKA storage, this has produced ~143 TB of checkpoint waste across the oe-adapt partition (42.6% of used space). A single GRPO run on a 32B model with checkpoint_state_freq=200 creates 50 checkpoints totaling 4.4 TB (about 88 GB each); with max_checkpoints=3, this caps at ~264 GB.
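The ~264 GB figure follows from the numbers in the paragraph above. A quick check, assuming all 50 checkpoints are roughly the same size and using decimal units (1 TB = 1000 GB):

```python
total_tb = 4.4          # total size of the 50 checkpoints, from the PR description
num_checkpoints = 50
max_checkpoints = 3

# Assumption: checkpoints are equally sized; 1 TB = 1000 GB.
per_checkpoint_gb = total_tb * 1000 / num_checkpoints
capped_gb = per_checkpoint_gb * max_checkpoints

print(f"{per_checkpoint_gb:.0f} GB per checkpoint, {capped_gb:.0f} GB retained")
```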

The older OLMo v1 framework had save_num_checkpoints_to_keep with working cleanup logic. When OLMo-core replaced it, the feature was not carried over. Ephemeral checkpoints already have retention (keep latest 1) — this applies the same pattern to permanent checkpoints.

Every downstream team that overrides checkpoint defaults sets save_num_checkpoints_to_keep=1 (molmo, molmo2, molmoact, MolmoBot). A default of 3 covers the common use cases (latest for resume, a few intermediates for model selection).

Test plan

  • Verify existing checkpoint tests pass (src/test/train/callbacks/, integration tests)
  • Confirm max_checkpoints=None preserves old behavior (no trimming)
  • Confirm max_checkpoints=3 trims oldest after the 4th permanent checkpoint save
  • Verify fixed_steps checkpoints are also subject to the limit
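The trimming cases in the test plan can be expressed as a small replay of save events. This is only a sketch: `retained_steps` is a hypothetical helper, not part of the test suite, and it assumes that fixed_steps saves enter the same ordered list as interval saves, which is what the last bullet sets out to verify.

```python
from typing import List, Optional


def retained_steps(save_steps: List[int], max_checkpoints: Optional[int]) -> List[int]:
    """Replay saves in order and return the steps still kept at the end."""
    kept: List[int] = []
    for step in save_steps:
        kept.append(step)
        if max_checkpoints is not None:
            # Drop the oldest entries until the limit is satisfied.
            del kept[: max(0, len(kept) - max_checkpoints)]
    return kept


# Interval saves every 200 steps plus one hypothetical fixed_steps entry at 500.
saves = sorted([200, 400, 600, 800] + [500])

keep_all = retained_steps(saves, None)       # no trimming
after_three = retained_steps(saves[:3], 3)   # at the limit, nothing trimmed yet
after_four = retained_steps(saves[:4], 3)    # 4th save trims the oldest
keep_three = retained_steps(saves, 3)        # fixed step 500 counts toward the limit
```

In this replay the fixed step at 500 is treated like any other permanent checkpoint, so it is dropped once three newer saves exist.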

Permanent checkpoints accumulate indefinitely because there is no retention
limit. This adds a max_checkpoints parameter (default=3) that trims the
oldest permanent checkpoints after each save, using the existing
_schedule_for_removal infrastructure. Set to None for the previous
keep-all behavior.

This addresses ~143 TB of checkpoint waste on WEKA storage where training
runs produce dozens of permanent checkpoints that are never cleaned up.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@TimDettmers
Author

CI failure analysis: Test transformer (GPU)

This test has failed twice on this PR, both times due to CI infrastructure issues unrelated to the changes here.

Run 1: Beaker job was externally canceled (ExperimentFailedError: Job canceled) while the test suite was at 44% — every test that actually ran passed, including all checkpoint tests (CPU and GPU).

Run 2: Two issues:

  1. GitHub 502 during Gantry setup — uv pip install failed to clone microsoft/dion (fatal: unable to access 'https://github.com/microsoft/dion.git/': The requested URL returned error: 502).
  2. ModuleNotFoundError: No module named 'dataclass_extensions' in olmo_core/fs_cache.py:12, likely a downstream consequence of the failed install in item 1.

Why this isn't related to our change: This PR adds a single field (max_checkpoints) and a _trim_checkpoints() method to CheckpointerCallback in checkpointer.py. It does not touch imports, dependencies, fs_cache.py, or any transformer code. The failing test (test_context_parallel_transformer_ulysses) tests context-parallel transformer forward passes, which have no connection to checkpoint retention logic.

Should pass on retry when the CI environment is stable.
