Skip to content

Fix max iters issue and add tests#3439

Draft
goanpeca wants to merge 1 commit intopytorch:masterfrom
goanpeca:fix/max-iters
Draft

Fix max iters issue and add tests#3439
goanpeca wants to merge 1 commit intopytorch:masterfrom
goanpeca:fix/max-iters

Conversation

@goanpeca
Copy link
Copy Markdown
Collaborator

@goanpeca goanpeca commented Aug 29, 2025

@github-actions github-actions Bot added module: engine Engine module module: base Base module module: metrics Metrics module module: handlers Core Handlers module ci CI labels Aug 29, 2025
@goanpeca goanpeca marked this pull request as ready for review August 29, 2025 19:12
Copilot AI review requested due to automatic review settings August 29, 2025 19:12
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements a fix for max_iters handling in the Engine state dictionary and adds comprehensive test coverage. The changes extend the Engine class to properly support max_iters as an alternative to max_epochs for controlling training duration.

Key changes include:

  • Enhanced Engine state dict validation to support both max_epochs and max_iters termination modes
  • Updated serialization logic to include appropriate termination parameters based on the training configuration
  • Extended test coverage for various max_iters scenarios including resuming, state loading, and edge cases

Reviewed Changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/ignite/engine/test_max_iters_fix.py Comprehensive test suite for max_iters functionality covering state dict operations, resuming, and validation
tests/ignite/engine/test_engine_state_dict.py Updated existing tests to account for additional state dict keys
tests/ignite/conftest.py Added defensive checks for pytest configuration options
tests/ignite/base/test_mixins_update.py Tests for updated Serializable mixin validation logic
ignite/engine/engine.py Core Engine implementation with max_iters support and improved state management
ignite/base/mixins.py Enhanced Serializable base class with grouped optional key validation
Various workflow and metric files Minor fixes and updates

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment thread ignite/engine/engine.py Outdated
Comment thread ignite/engine/engine.py
Comment thread ignite/metrics/gan/fid.py
Comment thread ignite/metrics/vision/object_detection_average_precision_recall.py Outdated
Comment thread ignite/metrics/mean_average_precision.py Outdated
@goanpeca goanpeca force-pushed the fix/max-iters branch 4 times, most recently from 075f57c to eba390f Compare August 29, 2025 19:37
@github-actions github-actions Bot added the docs label Aug 29, 2025
@goanpeca goanpeca closed this Aug 29, 2025
@goanpeca goanpeca reopened this Aug 29, 2025
Copy link
Copy Markdown
Collaborator

@vfdev-5 vfdev-5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the PR @goanpeca !
I made the first pass and left comments essentially to split this PR into multiple ones and explain certain changes.
I'll check in more depth the max_iters related changes a bit later whether it works as expected.
Anyway, great job!

Comment thread .github/workflows/code-style-checks.yml
Comment thread docs/source/conf.py
Comment thread ignite/engine/engine.py Outdated
Comment thread ignite/handlers/clearml_logger.py Outdated
Comment thread ignite/handlers/wandb_logger.py
Comment thread ignite/metrics/gan/fid.py Outdated
Comment thread ignite/metrics/vision/object_detection_average_precision_recall.py Outdated
Comment thread ignite/metrics/mean_average_precision.py Outdated
Comment thread mypy.ini Outdated
@goanpeca goanpeca marked this pull request as draft September 4, 2025 16:10
@goanpeca goanpeca force-pushed the fix/max-iters branch 9 times, most recently from fc1fcb5 to cfd0d13 Compare September 5, 2025 15:24
Comment thread tests/ignite/conftest.py Outdated
Comment thread ignite/engine/engine.py
)
self.state.max_iters = max_iters

def _check_and_set_epoch_length(self, data: Optional[Iterable], epoch_length: Optional[int] = None) -> None:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is not used anywhere. Best to remove it.

Comment thread ignite/engine/engine.py
Comment on lines +952 to +973
# Check if we need to create new state or resume
# Create new state if:
# 1. No termination params set (first run), OR
# 2. Training is done AND generator is None AND no new params provided
# 3. Training is done AND same termination params provided (restart case)
should_create_new_state = (
(self.state.max_epochs is None and self.state.max_iters is None)
or (
self._is_done(self.state)
and self._internal_run_generator is None
and max_epochs is None
and max_iters is None
)
or (
self._is_done(self.state)
and self._internal_run_generator is None
and (
(max_epochs is not None and max_epochs == self.state.max_epochs)
or (max_iters is not None and max_iters == self.state.max_iters)
)
)
)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should discuss this more. Why are we creating a new state only if the same parameters passed and resuming otherwise? I personally think the master behavior of always restarting is much more intutive. Current proposed change can be misleading for the user.

For instance, on master, engine.run(data, max_epochs=10) after completing 5 epochs would restart from 0 but this PR is proposing for it to continue from 5 and go to 10, while for engine.run(data, max_epochs=5) (same value as initial run), it would restart. This can be confusing

Comment thread ignite/base/mixins.py
Comment on lines +10 to +15
def __init__(self) -> None:
self._state_dict_user_keys: List[str] = []

@property
def state_dict_user_keys(self) -> List:
return self._state_dict_user_keys
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some subclasses like EarlyStopping, Checkpoint, Metric never call super().__init__(). I think a better approach would be to revert it so _state_dict_user_keys stays in Engine as not all subclasses need this.

Comment on lines +414 to +415
engine2.state.max_iters = 25
engine2.run(data)
Copy link
Copy Markdown
Collaborator

@TahaZahid05 TahaZahid05 Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not do engine2.run(data, max_iters=25) instead? This way we can verify the actural resume path of run()

Comment on lines +55 to +65
def test_load_state_dict_with_max_iters():
"""Test load_state_dict with max_iters."""
engine = Engine(lambda e, b: 1)

state_dict = {"iteration": 150, "max_iters": 250, "epoch_length": 100}

engine.load_state_dict(state_dict)
assert engine.state.iteration == 150
assert engine.state.max_iters == 250
assert engine.state.epoch_length == 100
assert engine.state.epoch == 1 # 150 // 100
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets test epoch + max_iters as well

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A Checkpoint test case would also be good since it is quite a common use so best to test it out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci CI docs module: base Base module module: engine Engine module module: handlers Core Handlers module module: metrics Metrics module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Possible issues with max_iters when loading/saving engine's state

4 participants