Fix TQDMProgressBar showing unknown total after resuming from a mid-epoch checkpoint #21806

Open
vineethsaivs wants to merge 2 commits into
Lightning-AI:master from
vineethsaivs:tqdm-progress-bar-mid-epoch-resume
Conversation

@vineethsaivs

What does this PR do?

Fixes #20603.

After resuming training from a checkpoint saved mid-epoch (e.g. ModelCheckpoint(every_n_train_steps=N)), the TQDM training bar renders the entire resumed epoch as n/? (unknown total) with the initial Training description instead of Epoch N.

Root cause: since the RestartStage rework in 2.5, _FitLoop.on_advance_start intentionally skips the on_train_epoch_start hooks when restarted_mid_epoch is true (this is by design and enforced by tests in test_hooks.py). TQDMProgressBar, however, creates the bar in on_train_start without a total and only sets the total and the Epoch {n} description in on_train_epoch_start, so on a mid-epoch resume the bar is never initialized for the resumed epoch. This worked in 2.4.x, where the hooks ran unconditionally.
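To make the failure mode concrete, here is a minimal, dependency-free sketch of the hook ordering described above. ToyBar, ToyProgressBar, and run_epoch are hypothetical stand-ins written for illustration only; they are not Lightning or tqdm classes and only mimic which hook sets the total and the description.

```python
from typing import List, Optional


class ToyBar:
    """Hypothetical stand-in for a tqdm bar: tracks only total and description."""

    def __init__(self) -> None:
        self.total: Optional[int] = None  # unknown until an epoch-start hook sets it
        self.desc = "Training"            # initial description

    def render(self, n: int) -> str:
        total = "?" if self.total is None else str(self.total)
        return f"{self.desc}: {n}/{total}"


class ToyProgressBar:
    """Hypothetical callback: the bar is created in one hook and configured in another."""

    def on_train_start(self) -> None:
        self.bar = ToyBar()

    def on_train_epoch_start(self, epoch: int, num_batches: int) -> None:
        self.bar.total = num_batches
        self.bar.desc = f"Epoch {epoch}"


def run_epoch(
    callback: ToyProgressBar,
    epoch: int,
    num_batches: int,
    start_batch: int = 0,
    restarted_mid_epoch: bool = False,
) -> List[str]:
    """Toy loop: the epoch-start hook is skipped on a mid-epoch restart."""
    if not restarted_mid_epoch:
        callback.on_train_epoch_start(epoch, num_batches)
    return [callback.bar.render(n + 1) for n in range(start_batch, num_batches)]


fresh = ToyProgressBar()
fresh.on_train_start()
fresh_lines = run_epoch(fresh, epoch=0, num_batches=4)

resumed = ToyProgressBar()
resumed.on_train_start()
resumed_lines = run_epoch(
    resumed, epoch=0, num_batches=4, start_batch=2, restarted_mid_epoch=True
)

print(fresh_lines[-1])    # Epoch 0: 4/4
print(resumed_lines[-1])  # Training: 4/?
```

In the toy model, the resumed run never reaches the only code path that configures the bar, which matches the n/? rendering reported in the issue.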

Fix (in the progress bar rather than the loop, since the hook skip is intentional): lazily initialize the bar in on_train_batch_start when the epoch-start hook did not run. If the bar has no total yet, set it from total_train_batches and set the epoch description. Normal epochs are unaffected (on_train_epoch_start has already set the total, so the branch is a no-op), and genuinely infinite dataloaders keep an unknown total exactly as they do today (convert_inf returns None, so nothing is set).
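The three cases in the paragraph above (resumed epoch, normal epoch, infinite dataloader) can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: ToyBar and lazy_init are hypothetical names, convert_inf_sketch is a local stand-in that only mirrors the behavior described here (infinite or missing totals map to None), and "no total yet" is assumed to mean total is None.

```python
import math
from typing import Optional, Union


def convert_inf_sketch(x: Optional[Union[int, float]]) -> Optional[Union[int, float]]:
    """Local stand-in (assumption): treat None, inf, and nan as 'unknown total'."""
    if x is None or math.isinf(x) or math.isnan(x):
        return None
    return x


class ToyBar:
    """Hypothetical stand-in for a tqdm bar: tracks only total and description."""

    def __init__(self, total: Optional[int] = None, desc: str = "Training") -> None:
        self.total = total
        self.desc = desc


def lazy_init(bar: ToyBar, total_train_batches: Union[int, float], epoch: int) -> None:
    """What a batch-start hook could do when the epoch-start hook was skipped."""
    if bar.total is not None:
        return  # normal epoch: the epoch-start hook already configured the bar
    total = convert_inf_sketch(total_train_batches)
    if total is None:
        return  # infinite dataloader: keep the unknown total
    bar.total = total
    bar.desc = f"Epoch {epoch}"


# Case 1: mid-epoch resume, the bar was never configured.
resumed_bar = ToyBar()
lazy_init(resumed_bar, total_train_batches=4, epoch=0)

# Case 2: normal epoch, the bar is already configured and must not change.
normal_bar = ToyBar(total=4, desc="Epoch 1")
lazy_init(normal_bar, total_train_batches=10, epoch=5)

# Case 3: infinite dataloader, the total stays unknown.
infinite_bar = ToyBar()
lazy_init(infinite_bar, total_train_batches=float("inf"), epoch=0)

print(resumed_bar.total, resumed_bar.desc)    # 4 Epoch 0
print(normal_bar.total, normal_bar.desc)      # 4 Epoch 1
print(infinite_bar.total, infinite_bar.desc)  # None Training
```

Because the check runs on every batch, it needs to be cheap and idempotent; guarding on the bar's own total keeps it so without adding extra state to the callback.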

Test: test_tqdm_progress_bar_mid_epoch_resume performs a real mid-epoch checkpoint (every_n_train_steps=2 of 4 batches) and resume, recording the bar's total and description at every on_train_batch_end. It fails on master (totals == [0, 0] under MockTqdm, None under real tqdm, with description Training) and passes with the fix (totals == [4, 4], description Epoch 0). Full test_tqdm_progress_bar.py run: 52 passed, 2 skipped. ruff check and ruff format --check are clean, and a CHANGELOG entry has been added.

None.

Before submitting

Resuming from a checkpoint saved mid-epoch intentionally skips
on_train_epoch_start in the resumed process (the RestartStage rework in
2.5), but TQDMProgressBar only sets the bar's total and the 'Epoch N'
description in that hook. The whole resumed epoch therefore rendered as
'n/?' with the initial 'Training' description.

Lazily initialize the bar in on_train_batch_start when the epoch-start
hook did not run: set the total from total_train_batches and the epoch
description. Normal epochs are untouched (the total is already set), and
genuinely infinite dataloaders keep an unknown total exactly as before.

Fixes Lightning-AI#20603
Development

Successfully merging this pull request may close these issues.

Progress bar is broken when loading trainer state from checkpoint