The eight post-release fixes shipped in #49, #51, #53, #55, #57, #59, #61, #63 cover failure modes that none of the existing unit, integration, or distributed tests exercised. Without smoke coverage, future regressions in checkpoint resume, RNG seeding, or the FP8 7B config can land without a CI signal.
Gaps
No end-to-end test runs scripts/train.py to step N, kills the job, relaunches, and checks that the resume actually picked up the right step, RNG state, dataloader cursor, and stashed loader state. The unit tests stub the loader and skip the cross-process restart.
No test loads a real shipped training config (configs/train/7b_16gpu_fp8.toml) and runs a few steps. Drift in the config (a removed field, a renamed module, a Float8 path that quietly stops applying) lands without a signal.
The smoke harness's GPU detection broke for interactive salloc sessions: scontrol show job reports NumTasks=1 when the user does not pass --ntasks, even with --gres=gpu:4, so gpus_per_node = total_tasks // nodes evaluates to 1 and every multi-GPU smoke test was silently skipped. The failing arithmetic is sketched just below.
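For illustration, a minimal sketch of that arithmetic, assuming the harness derives the GPU count from scontrol show job fields (variable names are illustrative, not the harness's actual code):

```python
# What `scontrol show job` reports for `salloc --gres=gpu:4` when the
# user never passed --ntasks: SLURM defaults the allocation to one task.
scontrol_fields = {"NumNodes": 1, "NumTasks": 1}

total_tasks = scontrol_fields["NumTasks"]
nodes = scontrol_fields["NumNodes"]

# Old detection: tasks per node stands in for GPUs per node.
gpus_per_node = total_tasks // nodes  # == 1, despite --gres=gpu:4

# Every test gated on `gpus_per_node >= 2` then skips without failing.
```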
Tests added
tests/smoke/test_smoke.py:
TestAutoResume::test_auto_resume_dense and test_auto_resume_moe: train 20 steps with checkpoint.interval=10, relaunch with max_steps=30, and assert the five resume log markers (latest-checkpoint discovery, RNG restore, resume-step, skip_batches, stashed-dataloader apply). The MoE variant additionally checks moe/aux_loss logs post-resume. Together these exercise the init-path barrier timeout (#63), RNG coverage across all four generators (#59), the ownership-gated train_state.pt load (#61), and monotonic batches_yielded (#57).
TestRealConfigs::test_fp8_7b_config: runs configs/train/7b_16gpu_fp8.toml at reduced scope (3 steps, batch_size=4, compile off) and asserts both Float8 application and FSDP2 float8 all-gather are wired. Catches drift in the shipped config that unit tests cannot see. Condensed sketches of both tests follow.
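A sketch of the auto-resume test's shape, assuming a subprocess-based harness; the launch command, config-override syntax, and marker strings are illustrative, not the exact ones in tests/smoke/test_smoke.py:

```python
import re
import subprocess
from pathlib import Path

# Illustrative patterns; the real test asserts five specific log lines.
RESUME_MARKERS = [
    r"found latest checkpoint",
    r"restored RNG state",
    r"resuming from step",
    r"skip_batches",
    r"applied stashed dataloader state",
]

def _run_train(ckpt_dir: Path, max_steps: int) -> str:
    """Launch scripts/train.py in a fresh process and capture its logs."""
    cmd = [
        "torchrun", "--nproc_per_node=4", "scripts/train.py",
        f"--checkpoint.folder={ckpt_dir}",  # hypothetical override syntax
        "--checkpoint.interval=10",
        f"--max_steps={max_steps}",
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=600)
    assert proc.returncode == 0, proc.stderr
    return proc.stdout + proc.stderr

def test_auto_resume_dense(tmp_path: Path) -> None:
    # First run: 20 steps, so checkpoints land at steps 10 and 20.
    _run_train(tmp_path, max_steps=20)
    # Relaunch: must discover the step-20 checkpoint and continue to 30,
    # not restart from step 0 with fresh RNG and dataloader state.
    log = _run_train(tmp_path, max_steps=30)
    for marker in RESUME_MARKERS:
        assert re.search(marker, log), f"missing resume marker: {marker}"
```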
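And the real-config check in the same hedged style: run the shipped TOML at smoke scope and assert the Float8 wiring shows up in the logs (override flags and marker strings are assumptions, not the actual CLI):

```python
import subprocess

def test_fp8_7b_config() -> None:
    cmd = [
        "torchrun", "--nproc_per_node=4", "scripts/train.py",
        "--config=configs/train/7b_16gpu_fp8.toml",  # the shipped config
        "--max_steps=3",    # reduced scope for smoke runtime
        "--batch_size=4",
        "--compile=false",
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=900)
    assert proc.returncode == 0, proc.stderr
    log = proc.stdout + proc.stderr
    # Both paths must be wired: if config drift silently drops either,
    # this assertion is the only CI signal.
    assert "Float8" in log             # illustrative marker
    assert "float8 all-gather" in log  # illustrative marker
```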
Harness changes
tests/smoke/conftest.py:
New CLI flags --data-path, --file-pattern, --data-vocab-size to feed pre-tokenized shards to the auto-resume and FP8 tests. Both TestAutoResume tests require --data-path because scripts/train.py falls back to synthetic torch.randint batches when no data source is configured, which bypasses StatefulDataLoader entirely (so resume coverage would be a no-op).
_detect_slurm_env now also reads AllocTRES gres/gpu=N and takes max(tasks_per_node, gres_per_node), so sbatch --ntasks-per-node=4 --gres=gpu:4 and salloc --gres=gpu:4 (no --ntasks) both resolve to 4 GPUs per node; a sketch follows below.
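Roughly how the two conftest.py changes fit together, as a sketch with simplified single-pass parsing; the real helper handles multi-line scontrol output and missing fields:

```python
import re
import subprocess

def pytest_addoption(parser):
    # New flags: point the smoke tests at pre-tokenized shards so that
    # scripts/train.py exercises StatefulDataLoader instead of falling
    # back to synthetic torch.randint batches.
    parser.addoption("--data-path", default=None)
    parser.addoption("--file-pattern", default=None)
    parser.addoption("--data-vocab-size", type=int, default=None)

def _detect_slurm_env(job_id: str) -> int:
    """Return GPUs per node, robust to salloc sessions without --ntasks."""
    out = subprocess.run(
        ["scontrol", "show", "job", job_id],
        capture_output=True, text=True, check=True,
    ).stdout

    nodes = int(re.search(r"NumNodes=(\d+)", out).group(1))
    tasks = int(re.search(r"NumTasks=(\d+)", out).group(1))
    tasks_per_node = tasks // nodes  # 1 under salloc with no --ntasks

    # New: AllocTRES carries the job-total GPU count even when NumTasks
    # defaults to 1, e.g. "AllocTRES=cpu=8,...,gres/gpu=4".
    gres = re.search(r"AllocTRES=\S*gres/gpu=(\d+)", out)
    gres_per_node = int(gres.group(1)) // nodes if gres else 0

    # max() covers both launch styles: sbatch --ntasks-per-node=4
    # --gres=gpu:4 and salloc --gres=gpu:4 both resolve to 4.
    return max(tasks_per_node, gres_per_node)
```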
Verification
Run on 4xH200 (job ID 7115669): TestAutoResume (dense + MoE) and TestRealConfigs::test_fp8_7b_config all pass in 7m37s.