improve: warn when training auto-resumes from existing ckpt folder (#631)#780
Open
lonexreb wants to merge 1 commit into
Open
improve: warn when training auto-resumes from existing ckpt folder (#631)#780lonexreb wants to merge 1 commit into
lonexreb wants to merge 1 commit into
Conversation
…eta-pytorch#631) Torchtitan's checkpointer treats ``checkpoint.folder`` as the source of truth: if it already contains saved step-N directories, it loads from there and silently ignores ``initial_load_path``. Users running back-to- back experiments without clearing the folder hit this footgun without noticing — the next run starts from the prior run's tail, not the configured base model. Add ``forge.util.checkpoint.warn_if_resuming_from_existing_folder`` and call it from both load sites: - TitanTrainer.setup (src/forge/actors/trainer/titan.py) - ForgeSFTRecipe.setup (apps/sft/main.py) The helper logs a single WARNING right before the load, naming the folder, the latest step directory found, and (when set) the initial_load_path that's about to be ignored. No behavior change to the checkpointer itself — this is a visibility fix so the resume shows up in the standard training logs. Step directories are sorted by numeric suffix so the warning reports the truly-latest step (step-200, not step-50 from lexicographic order). Test plan: tests/unit_tests/util/test_checkpoint.py (7 cases) - None / empty / missing folder paths → no warning, returns False - Folder with no step dirs → no warning - Folder with step-N dirs → WARNING emitted, names latest by numeric sort - WARNING includes initial_load_path when provided - OSError on listdir → logged at DEBUG, returns False, no spurious WARN
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Addresses #631.
Torchtitan's checkpointer treats
checkpoint.folderas the source of truth: if it already contains savedstep-Ndirectories, it loads from there and silently ignoresinitial_load_path. The YAML configs even document this (# Ignored if folder exists), but it's only visible to someone who reads the config comments — a user running back-to-back experiments will not notice that the second run is picking up from the first run's tail instead of starting from the base model.This PR adds
forge.util.checkpoint.warn_if_resuming_from_existing_folderand calls it from both checkpoint-load sites:TitanTrainer.setup—src/forge/actors/trainer/titan.py:118ForgeSFTRecipe.setup—apps/sft/main.py:142The helper logs a single WARNING right before the load, naming the folder, the latest step directory found (sorted by numeric suffix, so
step-200beatsstep-50), and theinitial_load_paththat's about to be ignored:No behavior change to the checkpointer itself — this is a visibility fix per @felipemello1's suggestion in #631 that "the easiest option seems to be to enable a flag." A flag is a bigger API change; a clear WARNING is the same observability improvement with zero config-surface impact and is fully backward compatible.
tests/unit_tests/util/test_checkpoint.py(7 cases, all pass against the fix)Test plan
None/ empty-string / missing folder → no warning, returnsFalsestep-*dirs → no warningstep-Ndirs → WARNING emitted, names latest by numeric sortinitial_load_pathtext when setos.listdir(perms etc.) → logged at DEBUG, returnsFalse, no spurious WARNcheckpoint.folderand confirm the WARNING shows in stderr before the first step (cannot run locally — please verify in GPU CI)Notes
If maintainers prefer the bigger-API approach (
resume_from_ckpt: boolflag that errors when the folder exists but the flag is unset), I'm happy to redo this as a follow-up. Starting with the smallest visibility-only change to keep the PR scope tight.