improve: warn when training auto-resumes from existing ckpt folder (#631) #780

Open
lonexreb wants to merge 1 commit into meta-pytorch:main from lonexreb:improve/631-warn-on-auto-resume-from-ckpt

Conversation

@lonexreb

Summary

Addresses #631.

Torchtitan's checkpointer treats checkpoint.folder as the source of truth: if the folder already contains saved step-N directories, it loads from there and silently ignores initial_load_path. The YAML configs document this (# Ignored if folder exists), but only someone who reads the config comments will see it. A user running back-to-back experiments will not notice that the second run picks up from the first run's tail instead of starting from the base model.

This PR adds forge.util.checkpoint.warn_if_resuming_from_existing_folder and calls it from both checkpoint-load sites:

  • TitanTrainer.setup (src/forge/actors/trainer/titan.py:118)
  • ForgeSFTRecipe.setup (apps/sft/main.py:142)

The helper logs a single WARNING right before the load, naming the folder, the latest step directory found (sorted by numeric suffix, so step-200 beats step-50), and the initial_load_path that's about to be ignored:

WARNING  Resuming training from existing checkpoint folder './checkpoint' (found 3 saved
         step dir(s); latest: step-200). Configured initial_load_path='hf://meta-llama/...'
         will be ignored until the folder is cleared or renamed.
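
For reviewers who want a concrete picture, here is a minimal sketch of how a helper with this behavior might look. It is not the code in this PR: the signature, the `step-<N>` regex, and the boolean return value are assumptions reconstructed from the description and test plan.

```python
import logging
import os
import re

logger = logging.getLogger(__name__)

# Assumption: step directories are named "step-<N>", as described in this PR.
_STEP_DIR = re.compile(r"^step-(\d+)$")


def warn_if_resuming_from_existing_folder(folder, initial_load_path=None):
    """Log one WARNING and return True if `folder` already holds step-N dirs."""
    if not folder or not os.path.isdir(folder):
        return False
    try:
        entries = os.listdir(folder)
    except OSError as exc:
        # Permission problems and similar should not produce a spurious WARNING.
        logger.debug("Could not list checkpoint folder %r: %s", folder, exc)
        return False

    # Collect (step_number, dir_name) pairs so the latest is chosen numerically.
    steps = []
    for name in entries:
        match = _STEP_DIR.match(name)
        if match and os.path.isdir(os.path.join(folder, name)):
            steps.append((int(match.group(1)), name))
    if not steps:
        return False

    latest = max(steps)[1]
    message = (
        f"Resuming training from existing checkpoint folder {folder!r} "
        f"(found {len(steps)} saved step dir(s); latest: {latest})."
    )
    if initial_load_path:
        message += (
            f" Configured initial_load_path={initial_load_path!r} will be "
            "ignored until the folder is cleared or renamed."
        )
    logger.warning(message)
    return True
```

In this sketch the call site would invoke the helper once, immediately before the checkpoint load, and ignore the return value unless it wants to branch on it.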

No behavior change to the checkpointer itself. This is a visibility fix per @felipemello1's suggestion in #631 that "the easiest option seems to be to enable a flag." A flag is a bigger API change; a clear WARNING gives the same observability improvement with zero config-surface impact and is fully backward compatible.

  • +175 / -0
  • New file: tests/unit_tests/util/test_checkpoint.py (7 cases, all pass against the fix)

Test plan

  • None / empty-string / missing folder → no warning, returns False
  • Folder exists but has no step-* dirs → no warning
  • Folder with step-N dirs → WARNING emitted, names latest by numeric sort
  • WARNING includes initial_load_path text when set
  • OSError on os.listdir (perms etc.) → logged at DEBUG, returns False, no spurious WARN
  • Manual: rerun an SFT/GRPO training with a non-empty checkpoint.folder and confirm the WARNING shows in stderr before the first step (cannot run locally; please verify in GPU CI)
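
The "latest by numeric sort" case exists because plain string comparison picks the wrong directory. A small illustration, assuming directory names of the form `step-<N>` (the `step_number` helper is hypothetical and not part of this PR):

```python
def step_number(name):
    # Assumption: names look like "step-<N>"; take the integer after the dash.
    return int(name.rsplit("-", 1)[1])


step_dirs = ["step-50", "step-100", "step-200"]

# String comparison goes character by character, so "5" > "2" > "1".
lexicographic_latest = max(step_dirs)

# Comparing the parsed integers gives the step that was actually saved last.
numeric_latest = max(step_dirs, key=step_number)

print(lexicographic_latest)  # step-50
print(numeric_latest)        # step-200
```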

Notes

If maintainers prefer the bigger-API approach (resume_from_ckpt: bool flag that errors when the folder exists but the flag is unset), I'm happy to redo this as a follow-up. Starting with the smallest visibility-only change to keep the PR scope tight.
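
To make the alternative concrete, here is a rough sketch of what the flag-based check could look like. Everything in it is hypothetical: `check_resume_allowed`, `has_saved_steps`, the error type, and the message are illustrative only, and `resume_from_ckpt` is the flag name proposed above, not an existing config option.

```python
import os
import re

# Assumption: step directories are named "step-<N>".
_STEP_DIR = re.compile(r"^step-\d+$")


def has_saved_steps(folder):
    """Return True if `folder` exists and contains at least one step-N entry."""
    if not folder or not os.path.isdir(folder):
        return False
    return any(_STEP_DIR.match(name) for name in os.listdir(folder))


def check_resume_allowed(folder, resume_from_ckpt=False):
    """Raise if `folder` has saved steps but the user did not opt into resuming."""
    if has_saved_steps(folder) and not resume_from_ckpt:
        raise RuntimeError(
            f"Checkpoint folder {folder!r} already contains saved steps. "
            "Set resume_from_ckpt=True to resume, or clear or rename the folder."
        )
```

The trade-off is the one noted above: this version fails fast instead of warning, but it changes the config surface and would break existing runs that rely on implicit resume.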

…eta-pytorch#631)

Torchtitan's checkpointer treats ``checkpoint.folder`` as the source of
truth: if it already contains saved step-N directories, it loads from
there and silently ignores ``initial_load_path``. Users running back-to-
back experiments without clearing the folder hit this footgun without
noticing — the next run starts from the prior run's tail, not the
configured base model.

Add ``forge.util.checkpoint.warn_if_resuming_from_existing_folder`` and
call it from both load sites:
- TitanTrainer.setup (src/forge/actors/trainer/titan.py)
- ForgeSFTRecipe.setup (apps/sft/main.py)

The helper logs a single WARNING right before the load, naming the
folder, the latest step directory found, and (when set) the
initial_load_path that's about to be ignored. No behavior change to the
checkpointer itself — this is a visibility fix so the resume shows up in
the standard training logs.

Step directories are sorted by numeric suffix so the warning reports
the truly-latest step (step-200, not step-50 from lexicographic order).

Test plan: tests/unit_tests/util/test_checkpoint.py (7 cases)
- None / empty / missing folder paths → no warning, returns False
- Folder with no step dirs → no warning
- Folder with step-N dirs → WARNING emitted, names latest by numeric sort
- WARNING includes initial_load_path when provided
- OSError on listdir → logged at DEBUG, returns False, no spurious WARN
@meta-cla meta-cla Bot added the CLA Signed label Jun 14, 2026

Labels

CLA Signed