Add auto-resume and FP8 7B regression smoke tests #66
Merged
TestAutoResume covers the checkpoint save/load path end-to-end for dense FSDP and MoE FSDP: train 20 steps with `interval=10`, relaunch with `max_steps=30`, and assert the five resume log markers (latest-checkpoint discovery, RNG restore, resume-step, `skip_batches`, stashed-dataloader apply). The MoE variant also checks `moe/aux_loss` logs post-resume. Exercises the init-path barrier timeout, RNG coverage across all four generators, the ownership-gated `train_state.pt` load, and monotonic `batches_yielded`.

`TestRealConfigs::test_fp8_7b_config` runs `configs/train/7b_16gpu_fp8.toml` at reduced scope (3 steps, batch_size=4, compile off) so drift in the shipped config surfaces in CI. Asserts that Float8 application and FSDP2 float8 all-gather are both wired.

New CLI flags `--data-path`, `--file-pattern`, and `--data-vocab-size` feed these tests. Both TestAutoResume tests require `--data-path` because `scripts/train.py` falls back to synthetic `torch.randint` batches when no data source is configured, which bypasses `StatefulDataLoader` entirely.

Also fixes `_detect_slurm_env` to read `AllocTRES gres/gpu=N`. Interactive `salloc` sessions report `NumTasks=1` even with `--gres=gpu:4`, so the previous task-count derivation returned `gpus_per_node=1` and every multi-GPU smoke test was skipped. It now takes the larger of the task-based and gres-based counts, so `sbatch` and `salloc` both resolve correctly.
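The `_detect_slurm_env` fix can be sketched roughly as follows. This is a minimal illustration of the described logic (parse `gres/gpu=N` out of the `AllocTRES` field and take the max against the task-based count); the function name and signature are hypothetical, not the repo's actual API:

```python
import re

def gpus_per_node_from_alloc_tres(alloc_tres: str, num_tasks: int) -> int:
    """Derive GPUs per node from scontrol-style job info.

    `alloc_tres` is the AllocTRES field, e.g.
    "cpu=32,mem=256G,node=1,billing=32,gres/gpu=4".
    Interactive `salloc --gres=gpu:4` reports NumTasks=1, so we take the
    larger of the task-based and gres-based counts, per the PR description.
    """
    m = re.search(r"gres/gpu=(\d+)", alloc_tres)
    gres_gpus = int(m.group(1)) if m else 0
    return max(num_tasks, gres_gpus)

# sbatch --ntasks-per-node=4: both derivations agree
print(gpus_per_node_from_alloc_tres("cpu=32,gres/gpu=4", num_tasks=4))      # 4
# interactive salloc --gres=gpu:4: NumTasks=1, gres-based count wins
print(gpus_per_node_from_alloc_tres("cpu=8,node=1,gres/gpu=4", num_tasks=1))  # 4
```

Taking the max (rather than preferring one source) is what lets both `sbatch --ntasks-per-node=4` and a bare `salloc --gres=gpu:4` resolve to 4 GPUs.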
Naeemkh (Member) approved these changes on Apr 26, 2026 and left a comment:
LGTM, but please make sure that my comments on previous PRs are addressed and those PRs are merged.
Closes #65.
What
Two new smoke-test classes covering the cross-process and config-drift gaps left after the eight post-release fixes (#49, #51, #53, #55, #57, #59, #61, #63), plus a SLURM detection fix so they actually run on interactive salloc.

- `tests/smoke/test_smoke.py`: `TestAutoResume::test_auto_resume_dense` and `test_auto_resume_moe`: train 20 steps with `checkpoint.interval=10`, relaunch with `max_steps=30`, assert the five resume log markers (latest-checkpoint discovery, RNG restore, resume-step, `skip_batches`, stashed-dataloader apply). The MoE variant additionally checks `moe/aux_loss` logs post-resume. Exercises the init-path barrier timeout, RNG coverage across all four generators, the ownership-gated `train_state.pt` load, and monotonic `batches_yielded` end-to-end.
- `TestRealConfigs::test_fp8_7b_config`: runs `configs/train/7b_16gpu_fp8.toml` at reduced scope (3 steps, batch_size=4, compile off). Asserts both Float8 application and FSDP2 float8 all-gather are wired so silent drift in the shipped config surfaces in CI.
- `tests/smoke/conftest.py`: new CLI flags `--data-path`, `--file-pattern`, `--data-vocab-size`. The auto-resume tests require `--data-path` because `scripts/train.py` otherwise falls back to synthetic `torch.randint` batches, which bypass `StatefulDataLoader` and would make the resume assertions a no-op.
- `_detect_slurm_env` now also parses `AllocTRES gres/gpu=N` and takes `max(tasks_per_node, gres_per_node)`. Interactive `salloc --gres=gpu:4` (no `--ntasks`) reports `NumTasks=1`, so the previous derivation returned `gpus_per_node=1` and silently skipped every multi-GPU test. Both `sbatch --ntasks-per-node=4` and bare `salloc` now resolve correctly.

Verification
4xH200, jobid 7115669: TestAutoResume dense + MoE + TestRealConfigs::test_fp8_7b_config all pass in 7m37s.
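As an illustration of the five-marker check that TestAutoResume performs on the relaunched run's output, a minimal sketch follows. The marker strings here are placeholders, not the repo's actual log lines:

```python
# Hypothetical marker substrings; the real ones are the five log lines
# named in the PR description (emitted by scripts/train.py on resume).
RESUME_MARKERS = [
    "latest checkpoint",   # latest-checkpoint discovery
    "RNG state restored",  # RNG restore
    "resuming from step",  # resume-step
    "skip_batches",        # dataloader fast-forward
    "stashed dataloader",  # stashed-dataloader apply
]

def assert_resumed(log_text: str) -> None:
    """Fail with the list of missing markers if any resume step did not log."""
    missing = [m for m in RESUME_MARKERS if m not in log_text]
    assert not missing, f"resume markers not found: {missing}"
```

Collecting all missing markers before asserting (instead of asserting one at a time) means a single failed relaunch reports every broken resume step at once.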