Skip to content

Add CI tests for dataloader mid-iteration resume#51

Merged
yzhangcs merged 1 commit into
mainfrom
tests/dataloader-resume
Apr 22, 2026
Merged

Add CI tests for dataloader mid-iteration resume#51
yzhangcs merged 1 commit into
mainfrom
tests/dataloader-resume

Conversation

@yzhangcs
Copy link
Copy Markdown
Member

Covers the core invariant that OnlineTokenizedIterableDataset + ParallelAwareDataLoader must preserve: tokens yielded after resuming from a checkpoint match those from an uninterrupted run. Parametrized across num_workers ∈ {0, 2, 4} and rank/world configurations, plus a multi-rank merge/restore case that mimics the per-rank key scheme used at DCP save time. CI runs on Python 3.10 and 3.12.

Covers the core invariant that `OnlineTokenizedIterableDataset` +
`ParallelAwareDataLoader` must preserve: tokens yielded after resuming
from a checkpoint match those from an uninterrupted run. Parametrized
across num_workers ∈ {0, 2, 4} and rank/world configurations, plus a
multi-rank merge/restore case that mimics the per-rank key scheme used
at DCP save time. CI runs on Python 3.10 and 3.12.
@yzhangcs yzhangcs merged commit 24dea91 into main Apr 22, 2026
2 of 3 checks passed
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds a test harness setup to stub torchtitan for CI environments and introduces a comprehensive test suite for OnlineTokenizedIterableDataset and ParallelAwareDataLoader, focusing on mid-iteration resume and state isolation. Feedback suggests catching ImportError specifically in the stubbing logic, parametrizing parallel resume tests for multiple worker counts, and optimizing iterator usage in sequence length preservation tests.

Comment thread tests/conftest.py
import torchtitan.tools.logging # noqa: F401
import torchtitan.tools.utils # noqa: F401
return
except Exception:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Catching a broad Exception can mask unexpected issues (such as syntax errors in dependencies) during the stubbing process. It is safer to catch ImportError specifically, as that is the expected exception when torchtitan is not installed.

Suggested change
except Exception:
except ImportError:

Comment thread tests/test_data.py
_assert_equal_sequences(before, after)


def test_parallel_aware_resume_per_rank(num_workers=0):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This test is not parametrized for num_workers, unlike other resume tests in this file. Since the PR description mentions testing across num_workers ∈ {0, 2, 4}, and multi-worker state management is a common source of bugs in dataloaders, it would be better to include these cases here as well.

@pytest.mark.parametrize("num_workers", [0, 2, 4])
def test_parallel_aware_resume_per_rank(num_workers):

Comment thread tests/test_data.py
Comment on lines +269 to +270
for _ in range(5):
batch = next(iter(dl2))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Calling iter(dl2) inside the loop creates a new iterator on every iteration. While StatefulDataLoader and OnlineTokenizedIterableDataset might partially handle this by updating internal state, it is inefficient and unconventional. It is better to create the iterator once and use it throughout the loop to ensure consecutive batches are checked correctly.

Suggested change
for _ in range(5):
batch = next(iter(dl2))
it2 = iter(dl2)
for _ in range(5):
batch = next(it2)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant