Fix max iters issue and add tests#3439
Conversation
f62141f to
6f9c0f4
Compare
There was a problem hiding this comment.
Pull Request Overview
This PR implements a fix for max_iters handling in the Engine state dictionary and adds comprehensive test coverage. The changes extend the Engine class to properly support max_iters as an alternative to max_epochs for controlling training duration.
Key changes include:
- Enhanced Engine state dict validation to support both
max_epochsandmax_iterstermination modes - Updated serialization logic to include appropriate termination parameters based on the training configuration
- Extended test coverage for various max_iters scenarios including resuming, state loading, and edge cases
Reviewed Changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
tests/ignite/engine/test_max_iters_fix.py |
Comprehensive test suite for max_iters functionality covering state dict operations, resuming, and validation |
tests/ignite/engine/test_engine_state_dict.py |
Updated existing tests to account for additional state dict keys |
tests/ignite/conftest.py |
Added defensive checks for pytest configuration options |
tests/ignite/base/test_mixins_update.py |
Tests for updated Serializable mixin validation logic |
ignite/engine/engine.py |
Core Engine implementation with max_iters support and improved state management |
ignite/base/mixins.py |
Enhanced Serializable base class with grouped optional key validation |
| Various workflow and metric files | Minor fixes and updates |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
075f57c to
eba390f
Compare
vfdev-5
left a comment
There was a problem hiding this comment.
Thanks a lot for the PR @goanpeca !
I made the first pass and left comments essentially to split this PR into multiple ones and explain certain changes.
I'll check in more depth the max_iters related changes a bit later whether it works as expected.
Anyway, great job!
fc1fcb5 to
cfd0d13
Compare
cfd0d13 to
697ba81
Compare
697ba81 to
243afad
Compare
| ) | ||
| self.state.max_iters = max_iters | ||
|
|
||
| def _check_and_set_epoch_length(self, data: Optional[Iterable], epoch_length: Optional[int] = None) -> None: |
There was a problem hiding this comment.
This function is not used anywhere. Best to remove it.
| # Check if we need to create new state or resume | ||
| # Create new state if: | ||
| # 1. No termination params set (first run), OR | ||
| # 2. Training is done AND generator is None AND no new params provided | ||
| # 3. Training is done AND same termination params provided (restart case) | ||
| should_create_new_state = ( | ||
| (self.state.max_epochs is None and self.state.max_iters is None) | ||
| or ( | ||
| self._is_done(self.state) | ||
| and self._internal_run_generator is None | ||
| and max_epochs is None | ||
| and max_iters is None | ||
| ) | ||
| or ( | ||
| self._is_done(self.state) | ||
| and self._internal_run_generator is None | ||
| and ( | ||
| (max_epochs is not None and max_epochs == self.state.max_epochs) | ||
| or (max_iters is not None and max_iters == self.state.max_iters) | ||
| ) | ||
| ) | ||
| ) |
There was a problem hiding this comment.
We should discuss this more. Why are we creating a new state only if the same parameters passed and resuming otherwise? I personally think the master behavior of always restarting is much more intutive. Current proposed change can be misleading for the user.
For instance, on master, engine.run(data, max_epochs=10) after completing 5 epochs would restart from 0 but this PR is proposing for it to continue from 5 and go to 10, while for engine.run(data, max_epochs=5) (same value as initial run), it would restart. This can be confusing
| def __init__(self) -> None: | ||
| self._state_dict_user_keys: List[str] = [] | ||
|
|
||
| @property | ||
| def state_dict_user_keys(self) -> List: | ||
| return self._state_dict_user_keys |
There was a problem hiding this comment.
Some subclasses like EarlyStopping, Checkpoint, Metric never call super().__init__(). I think a better approach would be to revert it so _state_dict_user_keys stays in Engine as not all subclasses need this.
| engine2.state.max_iters = 25 | ||
| engine2.run(data) |
There was a problem hiding this comment.
Why not do engine2.run(data, max_iters=25) instead? This way we can verify the actural resume path of run()
| def test_load_state_dict_with_max_iters(): | ||
| """Test load_state_dict with max_iters.""" | ||
| engine = Engine(lambda e, b: 1) | ||
|
|
||
| state_dict = {"iteration": 150, "max_iters": 250, "epoch_length": 100} | ||
|
|
||
| engine.load_state_dict(state_dict) | ||
| assert engine.state.iteration == 150 | ||
| assert engine.state.max_iters == 250 | ||
| assert engine.state.epoch_length == 100 | ||
| assert engine.state.epoch == 1 # 150 // 100 |
There was a problem hiding this comment.
Lets test epoch + max_iters as well
There was a problem hiding this comment.
A Checkpoint test case would also be good since it is quite a common use so best to test it out.
Fixes #1521
Depends on: