Fix TQDMProgressBar showing unknown total after resuming from a mid-epoch checkpoint by vineethsaivs · Pull Request #21806 · Lightning-AI/pytorch-lightning

vineethsaivs · 2026-07-04T17:03:18Z

What does this PR do?

Fixes #20603.

After resuming training from a checkpoint saved mid-epoch (e.g. ModelCheckpoint(every_n_train_steps=N)), the TQDM training bar renders the entire resumed epoch as n/? (unknown total) with the initial Training description instead of Epoch N.

Root cause: since the 2.5 RestartStage rework, _FitLoop.on_advance_start intentionally skips the on_train_epoch_start hooks when restarted_mid_epoch is true (this is by design and test-enforced in test_hooks.py). However, TQDMProgressBar creates the bar in on_train_start without a total and only sets the total and the Epoch {n} description in on_train_epoch_start, so on a mid-epoch resume the bar is never initialized for the resumed epoch. This worked in 2.4.x, where the hooks ran unconditionally.
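To make the hook ordering concrete, here is a toy model of the behavior described above. It is only a sketch: fit_epoch_hooks is a hypothetical helper, not Lightning code, and it models just the hooks named in this PR.

```python
from typing import List


def fit_epoch_hooks(num_batches: int, start_batch: int, restarted_mid_epoch: bool) -> List[str]:
    """Return the hook order a toy fit loop fires for one epoch (hypothetical model)."""
    hooks = ["on_train_start"]
    # Per the PR description, the epoch-start hook is skipped on a mid-epoch restart.
    if not restarted_mid_epoch:
        hooks.append("on_train_epoch_start")
    # Only the batches that have not been processed yet are run.
    for _ in range(start_batch, num_batches):
        hooks += ["on_train_batch_start", "on_train_batch_end"]
    return hooks


# Fresh run: 4 batches, epoch-start hook fires before the first batch.
fresh = fit_epoch_hooks(num_batches=4, start_batch=0, restarted_mid_epoch=False)
# Resume from a checkpoint taken after batch 2: epoch-start hook never fires.
resumed = fit_epoch_hooks(num_batches=4, start_batch=2, restarted_mid_epoch=True)
```

In the resumed case the only hook that runs before the remaining batches is on_train_start, which is why a bar that gets its total solely in on_train_epoch_start stays uninitialized.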

Fix (in the progress bar rather than the loop, since the hook skip is intentional): lazily initialize the bar in on_train_batch_start when the epoch-start hook did not run. If the bar has no total yet, set it from total_train_batches and set the epoch description. Normal epochs are unaffected (on_train_epoch_start has already set the total, so the branch is a no-op), and genuinely infinite dataloaders keep an unknown total exactly as today (convert_inf returns None, so nothing is set).
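The lazy initialization can be sketched without Lightning installed. This is not the actual diff: FakeBar and SketchProgressBar are hypothetical stand-ins, the hook signatures are simplified, and convert_inf is a local approximation of the helper named above (assumed to map inf, nan, and None to None).

```python
import math
from typing import Optional, Union

Number = Union[int, float]


def convert_inf(x: Optional[Number]) -> Optional[Number]:
    """Local stand-in (assumption): treat inf/nan/None as 'unknown total'."""
    if x is None or math.isinf(x) or math.isnan(x):
        return None
    return x


class FakeBar:
    """Hypothetical bar that tracks only the two fields discussed in the PR."""

    def __init__(self) -> None:
        self.total: Optional[Number] = None
        self.desc = "Training"


class SketchProgressBar:
    """Hypothetical stand-in for the callback; signatures are simplified."""

    def __init__(self, total_train_batches: Number) -> None:
        self.total_train_batches = total_train_batches
        self.bar: Optional[FakeBar] = None

    def on_train_start(self) -> None:
        # The bar is created without a total.
        self.bar = FakeBar()

    def on_train_epoch_start(self, current_epoch: int) -> None:
        # Normal path: total and description are set once per epoch.
        self.bar.total = convert_inf(self.total_train_batches)
        self.bar.desc = f"Epoch {current_epoch}"

    def on_train_batch_start(self, current_epoch: int) -> None:
        # Lazy path: only acts when the epoch-start hook did not run.
        # A falsy check covers both None and 0 as "no total yet".
        if not self.bar.total:
            total = convert_inf(self.total_train_batches)
            if total is not None:
                self.bar.total = total
                self.bar.desc = f"Epoch {current_epoch}"


# Mid-epoch resume: on_train_epoch_start is skipped.
resumed = SketchProgressBar(total_train_batches=4)
resumed.on_train_start()
resumed.on_train_batch_start(current_epoch=0)
print(resumed.bar.total, resumed.bar.desc)  # 4 Epoch 0

# Infinite dataloader: the total stays unknown and nothing is set.
infinite = SketchProgressBar(total_train_batches=float("inf"))
infinite.on_train_start()
infinite.on_train_batch_start(current_epoch=0)
```

Putting the check in the batch-start hook keeps the loop's intentional hook skip untouched, at the cost of one cheap truthiness test per batch.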

Test: test_tqdm_progress_bar_mid_epoch_resume does a real mid-epoch checkpoint (every_n_train_steps=2 of 4 batches) and resume, recording the bar's total and description at every on_train_batch_end: fails on master (totals == [0, 0] under MockTqdm, None under real tqdm, description Training), passes with the fix ([4, 4], Epoch 0). Full test_tqdm_progress_bar.py: 52 passed, 2 skipped. ruff check / ruff format --check clean; CHANGELOG entry added.

None.

Before submitting

Was this discussed/agreed via a GitHub issue? (Progress bar is broken when loading trainer state from checkpoint #20603)
Did you make sure to update the documentation with your changes? (CHANGELOG)
Did you write any new necessary tests?

Resuming from a checkpoint saved mid-epoch intentionally skips on_train_epoch_start in the resumed process (the RestartStage rework in 2.5), but TQDMProgressBar only sets the bar's total and the 'Epoch N' description in that hook. The whole resumed epoch therefore rendered as 'n/?' with the initial 'Training' description. Lazily initialize the bar in on_train_batch_start when the epoch-start hook did not run: set the total from total_train_batches and the epoch description. Normal epochs are untouched (the total is already set), and genuinely infinite dataloaders keep an unknown total exactly as before. Fixes Lightning-AI#20603

vineethsaivs requested review from ethanwharris, justusschock and tchaton as code owners July 4, 2026 17:03

[pre-commit.ci] auto fixes from pre-commit.com hooks

65c0ffb

for more information, see https://pre-commit.ci

vineethsaivs wants to merge 2 commits into Lightning-AI:master from vineethsaivs:tqdm-progress-bar-mid-epoch-resume
