
Add auto-resume and FP8 7B regression smoke tests#66

Merged
mmshad merged 1 commit into main from smoke-tests-autoresume-fp8
Apr 26, 2026

Conversation

Collaborator

@mmshad mmshad commented Apr 22, 2026

Closes #65.

What

Two new smoke-test classes covering the cross-process and config-drift gaps left after the eight post-release fixes (#49, #51, #53, #55, #57, #59, #61, #63), plus a SLURM detection fix so they actually run on interactive salloc.

tests/smoke/test_smoke.py:

  • TestAutoResume::test_auto_resume_dense and test_auto_resume_moe: train 20 steps with checkpoint.interval=10, relaunch with max_steps=30, assert the five resume log markers (latest-checkpoint discovery, RNG restore, resume-step, skip_batches, stashed-dataloader apply). MoE variant additionally checks moe/aux_loss logs post-resume. Exercises the init-path barrier timeout, RNG coverage across all four generators, the ownership-gated train_state.pt load, and monotonic batches_yielded end-to-end.
  • TestRealConfigs::test_fp8_7b_config: runs configs/train/7b_16gpu_fp8.toml at reduced scope (3 steps, batch_size=4, compile off). Asserts both Float8 application and FSDP2 float8 all-gather are wired so silent drift in the shipped config surfaces in CI.
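The five resume-marker assertions above can be sketched as a small log-scanning helper. The marker strings below are illustrative placeholders, not the exact log lines emitted by the trainer or asserted in test_smoke.py:

```python
import re

# Hypothetical resume-log markers; the real strings asserted in
# tests/smoke/test_smoke.py may differ.
RESUME_MARKERS = [
    r"found latest checkpoint",     # latest-checkpoint discovery
    r"restored RNG state",          # RNG restore (all four generators)
    r"resuming from step",          # resume-step
    r"skip_batches",                # dataloader fast-forward
    r"applied stashed dataloader",  # stashed-dataloader apply
]

def missing_resume_markers(log_text: str) -> list[str]:
    """Return the markers absent from a relaunch log (empty list == pass)."""
    return [m for m in RESUME_MARKERS if not re.search(m, log_text)]
```

A test would relaunch training, capture stdout, and assert `missing_resume_markers(log) == []`, failing with the list of absent markers for easy triage.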

tests/smoke/conftest.py:

  • New CLI flags --data-path, --file-pattern, --data-vocab-size. The auto-resume tests require --data-path because scripts/train.py otherwise falls back to synthetic torch.randint batches, which bypass StatefulDataLoader and would make the resume assertions a no-op.
  • _detect_slurm_env now also parses AllocTRES gres/gpu=N and takes max(tasks_per_node, gres_per_node). Interactive salloc --gres=gpu:4 (no --ntasks) reports NumTasks=1, so the previous derivation returned gpus_per_node=1 and silently skipped every multi-GPU test. Both sbatch --ntasks-per-node=4 and bare salloc now resolve correctly.
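The max-of-both-derivations logic can be sketched as follows. The field names mirror `scontrol show job` output (`NumTasks`, `AllocTRES`); the function name and exact parsing are a simplification of what conftest.py does, not a copy of it:

```python
import re

def gpus_per_node(tasks_per_node: int, alloc_tres: str) -> int:
    """Derive GPUs per node from Slurm job info.

    Interactive `salloc --gres=gpu:4` (no --ntasks) reports NumTasks=1,
    so a task-count-only derivation yields 1 and multi-GPU tests skip.
    Taking the max of the task-based and gres-based counts handles both
    sbatch --ntasks-per-node=N and bare salloc --gres=gpu:N.
    """
    m = re.search(r"gres/gpu=(\d+)", alloc_tres)
    gres_per_node = int(m.group(1)) if m else 0
    return max(tasks_per_node, gres_per_node)
```

With this, `gpus_per_node(1, "cpu=16,mem=64G,node=1,gres/gpu=4")` resolves to 4 for the interactive-salloc case, while an sbatch allocation with `--ntasks-per-node=4` and no gres entry still resolves to 4 from the task count.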

Verification

4xH200, jobid 7115669: TestAutoResume dense + MoE + TestRealConfigs::test_fp8_7b_config all pass in 7m37s.

@mmshad mmshad self-assigned this Apr 22, 2026
@mmshad mmshad requested a review from Naeemkh April 22, 2026 01:45
Member

@Naeemkh Naeemkh left a comment


LGTM, but please make sure that my comments on previous PRs are addressed and those PRs are merged.

@mmshad mmshad merged commit 499adcf into main Apr 26, 2026
3 checks passed
@mmshad mmshad deleted the smoke-tests-autoresume-fp8 branch April 26, 2026 17:01
