Add max_checkpoints to limit permanent checkpoint retention #694
TimDettmers wants to merge 2 commits
Permanent checkpoints accumulate indefinitely because there is no retention limit. This adds a max_checkpoints parameter (default=3) that trims the oldest permanent checkpoints after each save, using the existing _schedule_for_removal infrastructure. Set to None for the previous keep-all behavior.

This addresses ~143 TB of checkpoint waste on WEKA storage, where training runs produce dozens of permanent checkpoints that are never cleaned up.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CI failure analysis: Test transformer (GPU)

This test has failed twice on this PR, both times due to CI infrastructure issues unrelated to the changes here.

Run 1: Beaker job was externally canceled (…)

Run 2: Two issues: …

Why this isn't related to our change: This PR adds a single field (…)

Should pass on retry when the CI environment is stable.
Summary
- Add a `max_checkpoints` parameter to `CheckpointerCallback` (default: 3) that trims the oldest permanent checkpoints after each save
- Reuse the existing `_schedule_for_removal`/`_remove_checkpoint` infrastructure; no new deletion code needed
- Set to `None` to preserve the previous keep-all behavior
- Validate the value (`max_checkpoints >= 1`)

Motivation
Permanent checkpoints accumulate indefinitely because there is no retention limit: the `_checkpoints` list is appended to on every save but never trimmed. On AI2's WEKA storage, this has produced ~143 TB of checkpoint waste across the oe-adapt partition (42.6% of used space). A single GRPO run on a 32B model with `checkpoint_state_freq=200` creates 50 checkpoints totaling 4.4 TB; with `max_checkpoints=3`, this caps at ~264 GB.

The older OLMo v1 framework had `save_num_checkpoints_to_keep` with working cleanup logic. When OLMo-core replaced it, the feature was not carried over. Ephemeral checkpoints already have retention (keep latest 1); this applies the same pattern to permanent checkpoints.

Every downstream team that overrides checkpoint defaults sets `save_num_checkpoints_to_keep=1` (molmo, molmo2, molmoact, MolmoBot). A default of 3 covers the common use cases (latest for resume, a few intermediates for model selection).

Test plan
- Existing tests (`src/test/train/callbacks/`, integration tests)
- `max_checkpoints=None` preserves old behavior (no trimming)
- `max_checkpoints=3` trims oldest after the 4th permanent checkpoint save
- `fixed_steps` checkpoints are also subject to the limit
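
The retention behavior listed in the test plan can be illustrated with a minimal, self-contained sketch. This is not the OLMo-core implementation: `RetentionSketch`, `record_save`, `kept`, `removed`, and `simulate` are hypothetical names used only for illustration, and the real callback would hand trimmed paths to its existing removal machinery rather than to a plain list.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class RetentionSketch:
    """Hypothetical stand-in for the callback's bookkeeping (not OLMo-core code)."""

    max_checkpoints: Optional[int] = 3
    kept: List[str] = field(default_factory=list)     # retained paths, oldest first
    removed: List[str] = field(default_factory=list)  # paths handed off for deletion

    def __post_init__(self) -> None:
        # Mirrors the validation named in the summary: max_checkpoints >= 1.
        if self.max_checkpoints is not None and self.max_checkpoints < 1:
            raise ValueError("max_checkpoints must be >= 1 or None")

    def record_save(self, path: str) -> None:
        """Register a new permanent checkpoint, then trim the oldest ones."""
        self.kept.append(path)
        if self.max_checkpoints is None:
            return  # keep-all behavior: nothing is ever trimmed
        while len(self.kept) > self.max_checkpoints:
            # Assumption: the real callback would schedule this path for removal.
            self.removed.append(self.kept.pop(0))


def simulate(steps: Iterable[int], max_checkpoints: Optional[int]) -> RetentionSketch:
    """Save one checkpoint per step and return the resulting bookkeeping."""
    sketch = RetentionSketch(max_checkpoints=max_checkpoints)
    for step in steps:
        sketch.record_save(f"step{step}")
    return sketch
```

Under these assumptions, `simulate([200, 400, 600, 800], 3)` keeps the three newest checkpoints and trims `step200` on the fourth save, while `simulate(range(200, 10001, 200), None)` retains all 50. The storage figures in the motivation follow the same logic: 4.4 TB across 50 checkpoints is about 88 GB each, so three retained checkpoints come to roughly 264 GB.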