Add max_checkpoints to limit permanent checkpoint retention by TimDettmers · Pull Request #694 · allenai/OLMo-core

TimDettmers · 2026-05-27T11:45:25Z

Summary

Adds a max_checkpoints parameter to CheckpointerCallback (default: 3) that trims the oldest permanent checkpoints after each save
Uses the existing _schedule_for_removal / _remove_checkpoint infrastructure — no new deletion code needed
Set to None to preserve the previous keep-all behavior
Includes input validation (max_checkpoints >= 1)
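
The retention behavior listed above can be sketched in isolation. This is a toy model under stated assumptions, not the code in this PR: `RetentionSketch`, `record_save`, `trim`, and the `removed` list are hypothetical names, and the real callback defers deletion to the existing `_schedule_for_removal` / `_remove_checkpoint` helpers rather than tracking removals in a list.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RetentionSketch:
    """Toy stand-in for the callback's bookkeeping; NOT the OLMo-core class."""

    # None keeps every permanent checkpoint (the previous behavior).
    max_checkpoints: Optional[int] = 3
    # Permanent checkpoint paths, oldest first.
    checkpoints: List[str] = field(default_factory=list)
    # Hypothetical record of what would be handed to the removal helpers.
    removed: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Mirrors the validation described in the summary: max_checkpoints >= 1.
        if self.max_checkpoints is not None and self.max_checkpoints < 1:
            raise ValueError("max_checkpoints must be >= 1 or None")

    def record_save(self, path: str) -> None:
        # Append the new checkpoint, then trim immediately after the save.
        self.checkpoints.append(path)
        self.trim()

    def trim(self) -> None:
        if self.max_checkpoints is None:
            return
        # Drop from the front of the list until the limit is satisfied.
        while len(self.checkpoints) > self.max_checkpoints:
            self.removed.append(self.checkpoints.pop(0))


retention = RetentionSketch()  # default limit of 3
for step in (200, 400, 600, 800):
    retention.record_save(f"step{step}")
print(retention.checkpoints)
print(retention.removed)
```

With the default limit, the fourth save is the first one that triggers a removal, which matches the trimming case in the test plan.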

Motivation

Permanent checkpoints accumulate indefinitely because there is no retention limit — the _checkpoints list is appended to on every save but never trimmed. On AI2's WEKA storage, this has produced ~143 TB of checkpoint waste across the oe-adapt partition (42.6% of used space). A single GRPO run on a 32B model with checkpoint_state_freq=200 creates 50 checkpoints totaling 4.4 TB; with max_checkpoints=3, this caps at ~264 GB.
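
The ~264 GB figure follows from the numbers quoted above. A short check of the arithmetic, assuming decimal units (1 TB = 1000 GB) and equally sized checkpoints, neither of which the description states explicitly:

```python
# Figures quoted in the motivation; units and equal sizing are assumptions.
total_gb = 4_400          # 4.4 TB across the whole run
num_checkpoints = 50      # permanent checkpoints written by the run
max_checkpoints = 3       # proposed default retention limit

per_checkpoint_gb = total_gb / num_checkpoints      # size of one checkpoint
retained_gb = per_checkpoint_gb * max_checkpoints   # footprint with the limit
saved_fraction = 1 - retained_gb / total_gb         # share of space reclaimed

print(f"{per_checkpoint_gb:.0f} GB per checkpoint")
print(f"{retained_gb:.0f} GB retained")
print(f"{saved_fraction:.0%} reclaimed")
```

Under these assumptions each checkpoint is about 88 GB, so three of them occupy about 264 GB, roughly 6% of the original footprint for that run.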

The older OLMo v1 framework had save_num_checkpoints_to_keep with working cleanup logic. When OLMo-core replaced it, the feature was not carried over. Ephemeral checkpoints already have retention (keep latest 1) — this applies the same pattern to permanent checkpoints.

Every downstream team that overrides checkpoint defaults sets save_num_checkpoints_to_keep=1 (molmo, molmo2, molmoact, MolmoBot). A default of 3 covers the common use cases (latest for resume, a few intermediates for model selection).

Test plan

Verify existing checkpoint tests pass (src/test/train/callbacks/, integration tests)
Confirm max_checkpoints=None preserves old behavior (no trimming)
Confirm max_checkpoints=3 trims oldest after the 4th permanent checkpoint save
Verify fixed_steps checkpoints are also subject to the limit
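
The last three items of the plan can be expressed as small checks against a pure-function model of the trimming rule. This is a sketch only: `surviving_steps` and `save_schedule` are hypothetical helpers, the `save_interval` argument is an assumed name, and none of this exercises the real `CheckpointerCallback`.

```python
from typing import Iterable, List, Optional


def surviving_steps(saved_steps: Iterable[int], max_checkpoints: Optional[int]) -> List[int]:
    """Model of oldest-first trimming: return the steps left after all saves."""
    kept: List[int] = []
    for step in saved_steps:
        kept.append(step)
        if max_checkpoints is not None:
            # Remove the oldest entries beyond the limit (no-op if under it).
            del kept[: max(0, len(kept) - max_checkpoints)]
    return kept


def save_schedule(save_interval: int, total_steps: int, fixed_steps: Iterable[int]) -> List[int]:
    """Hypothetical merged schedule of interval-based and fixed-step saves."""
    interval_steps = set(range(save_interval, total_steps + 1, save_interval))
    return sorted(interval_steps | set(fixed_steps))


def test_none_keeps_everything() -> None:
    assert surviving_steps([200, 400, 600, 800], None) == [200, 400, 600, 800]


def test_fourth_save_trims_oldest() -> None:
    assert surviving_steps([200, 400, 600], 3) == [200, 400, 600]
    assert surviving_steps([200, 400, 600, 800], 3) == [400, 600, 800]


def test_fixed_steps_are_trimmed_too() -> None:
    schedule = save_schedule(save_interval=200, total_steps=800, fixed_steps=[50, 450])
    # The early fixed step (50) is trimmed like any other permanent checkpoint.
    assert surviving_steps(schedule, 3) == [450, 600, 800]


if __name__ == "__main__":
    test_none_keeps_everything()
    test_fourth_save_trims_oldest()
    test_fixed_steps_are_trimmed_too()
```

In this model a fixed-step checkpoint survives only while it is among the most recent saves, which is the behavior the final test-plan item asks to verify.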

Permanent checkpoints accumulate indefinitely because there is no retention limit. This adds a max_checkpoints parameter (default=3) that trims the oldest permanent checkpoints after each save, using the existing _schedule_for_removal infrastructure. Set to None for the previous keep-all behavior. This addresses ~143 TB of checkpoint waste on WEKA storage where training runs produce dozens of permanent checkpoints that are never cleaned up.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

TimDettmers · 2026-05-27T12:58:32Z

CI failure analysis: Test transformer (GPU)

This test has failed twice on this PR, both times due to CI infrastructure issues unrelated to the changes here.

Run 1: Beaker job was externally canceled (ExperimentFailedError: Job canceled) while the test suite was at 44% — every test that actually ran passed, including all checkpoint tests (CPU and GPU).

Run 2: Two issues:

GitHub 502 during Gantry setup — uv pip install failed to clone microsoft/dion (fatal: unable to access 'https://github.com/microsoft/dion.git/': The requested URL returned error: 502).
ModuleNotFoundError: No module named 'dataclass_extensions' in olmo_core/fs_cache.py:12, most likely a downstream consequence of the failed install above.

Why this isn't related to our change: This PR adds a single field (max_checkpoints) and a _trim_checkpoints() method to CheckpointerCallback in checkpointer.py. It does not touch imports, dependencies, fs_cache.py, or any transformer code. The failing test (test_context_parallel_transformer_ulysses) tests context-parallel transformer forward passes, which have no connection to checkpoint retention logic.

Should pass on retry when the CI environment is stable.

TimDettmers mentioned this pull request May 27, 2026

Wire max_checkpoints through SFT, DPO, and GRPO paths allenai/open-instruct#1701 (Open, 4 tasks)

Add CHANGELOG entry for max_checkpoints parameter

14306a7

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

TimDettmers wants to merge 2 commits into main from timd/add-max-checkpoints
