Fix TQDMProgressBar showing unknown total after resuming from a mid-epoch checkpoint #21806
Open
vineethsaivs wants to merge 2 commits into
Resuming from a checkpoint saved mid-epoch intentionally skips on_train_epoch_start in the resumed process (the RestartStage rework in 2.5), but TQDMProgressBar only sets the bar's total and the 'Epoch N' description in that hook. The whole resumed epoch therefore rendered as 'n/?' with the initial 'Training' description.

Lazily initialize the bar in on_train_batch_start when the epoch-start hook did not run: set the total from total_train_batches and the epoch description. Normal epochs are untouched (the total is already set), and genuinely infinite dataloaders keep an unknown total exactly as before.

Fixes Lightning-AI#20603
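As a rough illustration of the lazy initialization described in this commit message, the sketch below models the hook ordering in plain Python. It is not the code from this PR: `FakeBar`, `SketchProgressBar`, `run_resumed_epoch`, and the `lazy_init` switch are hypothetical names invented for the example, and the local `convert_inf` is a stand-in for the helper of the same name. Treating both `None` and `0` as "no total yet" is also an assumption.

```python
import math
from typing import Optional, Tuple, Union

Number = Union[int, float]


def convert_inf(x: Optional[Number]) -> Optional[Number]:
    """Stand-in helper: map an infinite or NaN batch count to None."""
    if x is None or math.isinf(x) or math.isnan(x):
        return None
    return x


class FakeBar:
    """Hypothetical minimal bar that only tracks a total and a description."""

    def __init__(self) -> None:
        self.total: Optional[Number] = None  # unknown total, rendered as 'n/?'
        self.description = "Training"


class SketchProgressBar:
    """Hypothetical model of the progress bar hooks discussed above."""

    def __init__(self, total_train_batches: Number, lazy_init: bool) -> None:
        self.total_train_batches = total_train_batches
        self.lazy_init = lazy_init  # False models the behavior before the fix
        self.bar = FakeBar()

    def on_train_start(self) -> None:
        # The bar is created here without a total.
        self.bar = FakeBar()

    def on_train_epoch_start(self, epoch: int) -> None:
        # The only place the total and description are set before the fix.
        self.bar.total = convert_inf(self.total_train_batches)
        self.bar.description = f"Epoch {epoch}"

    def on_train_batch_start(self, epoch: int) -> None:
        # Assumption: a falsy total (None or 0) means "not initialized yet".
        if not self.lazy_init or self.bar.total:
            return
        total = convert_inf(self.total_train_batches)
        if total is not None:  # infinite dataloaders keep an unknown total
            self.bar.total = total
            self.bar.description = f"Epoch {epoch}"


def run_resumed_epoch(
    total_batches: Number, lazy_init: bool, restarted_mid_epoch: bool = True
) -> Tuple[Optional[Number], str]:
    """Drive the hooks for the first batch of an epoch and report bar state."""
    pb = SketchProgressBar(total_batches, lazy_init)
    pb.on_train_start()
    if not restarted_mid_epoch:
        # On a mid-epoch resume the loop skips this hook by design.
        pb.on_train_epoch_start(epoch=0)
    pb.on_train_batch_start(epoch=0)
    return pb.bar.total, pb.bar.description
```

In this model, `run_resumed_epoch(4, lazy_init=False)` leaves the bar without a total, while `lazy_init=True` fills it in on the first batch. A normal epoch (`restarted_mid_epoch=False`) never enters the lazy branch because the total is already set.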
What does this PR do?
Fixes #20603.
After resuming training from a checkpoint saved mid-epoch (e.g. `ModelCheckpoint(every_n_train_steps=N)`), the TQDM training bar renders the entire resumed epoch as `n/?` (unknown total) with the initial `Training` description instead of `Epoch N`.

Root cause: since the 2.5 `RestartStage` rework, `_FitLoop.on_advance_start` intentionally skips the `on_train_epoch_start` hooks when `restarted_mid_epoch` (this is by design and test-enforced in `test_hooks.py`). But `TQDMProgressBar` creates the bar in `on_train_start` without a total and only sets the total and the `Epoch {n}` description in `on_train_epoch_start`, so on a mid-epoch resume the bar is never initialized for the resumed epoch. Worked in 2.4.x, where the hooks ran unconditionally.

Fix (in the progress bar, not the loop, since the hook skip is intentional): lazily initialize in `on_train_batch_start` when the epoch-start hook did not run: if the bar has no total yet, set it from `total_train_batches` and set the epoch description. Normal epochs are unaffected (`on_train_epoch_start` has already set the total, so the branch no-ops), and genuinely infinite dataloaders keep an unknown total exactly as today (`convert_inf` returns `None`, nothing is set).

Test: `test_tqdm_progress_bar_mid_epoch_resume` does a real mid-epoch checkpoint (`every_n_train_steps=2` of 4 batches) and resume, recording the bar's total and description at every `on_train_batch_end`: fails on master (`totals == [0, 0]` under `MockTqdm`, `None` under real tqdm, description `Training`), passes with the fix (`[4, 4]`, `Epoch 0`). Full `test_tqdm_progress_bar.py`: 52 passed, 2 skipped. `ruff check` / `ruff format --check` clean; CHANGELOG entry added.

None.
Before submitting