Add auto-resume and FP8 7B regression smoke tests #65

@mmshad

Description

The eight post-release fixes shipped in #49, #51, #53, #55, #57, #59, #61, and #63 address failure modes that none of the existing unit, integration, or distributed tests exercised. Without smoke coverage, future regressions in checkpoint resume, RNG seeding, or the FP8 7B config can land without a CI signal.

Gaps

  • No end-to-end test runs scripts/train.py to step N, kills the job, relaunches it, and verifies that the resumed run picks up the correct step, RNG state, dataloader cursor, and stashed loader state. The unit tests stub the loader and skip the cross-process restart.
  • No test loads a real shipped training config (configs/train/7b_16gpu_fp8.toml) and runs a few steps. Drift in the config (a removed field, a renamed module, a Float8 path that quietly stops applying) lands without a signal.
  • The smoke harness's GPU detection broke for interactive salloc sessions: scontrol show job reports NumTasks=1 when the user passes --gres=gpu:4 but omits --ntasks, so gpus_per_node = total_tasks // nodes evaluates to 1 and every multi-GPU smoke test is silently skipped.
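
The first gap comes down to one invariant: a run killed at step N and relaunched should continue exactly where an uninterrupted run would be. The following is a minimal single-process sketch of that check, not the test in this repository. The names train and check_resume and the JSON checkpoint layout are hypothetical, and the toy loop stands in for scripts/train.py; it does not model the cross-process restart or StatefulDataLoader.

```python
import json
import random
import tempfile
from pathlib import Path


def train(ckpt: Path, stop_at: int, data: list) -> list:
    """Toy loop (hypothetical): resume from ckpt if it exists, run to stop_at, save."""
    rng = random.Random(0)
    step, cursor = 0, 0
    if ckpt.exists():
        state = json.loads(ckpt.read_text())
        step, cursor = state["step"], state["cursor"]
        # JSON stores the RNG state tuple as lists; setstate needs tuples back.
        version, internal, gauss_next = state["rng"]
        rng.setstate((version, tuple(internal), gauss_next))

    log = []
    while step < stop_at:
        sample = data[cursor % len(data)]  # dataloader cursor stand-in
        noise = rng.random()               # RNG-dependent value, e.g. dropout
        log.append((step, sample, noise))
        step += 1
        cursor += 1

    ckpt.write_text(
        json.dumps({"step": step, "cursor": cursor, "rng": rng.getstate()})
    )
    return log


def check_resume(total_steps: int = 6, kill_at: int = 3):
    """Return (uninterrupted log, killed-and-resumed log) for comparison."""
    data = list(range(100, 110))
    with tempfile.TemporaryDirectory() as tmp:
        baseline = train(Path(tmp) / "baseline.json", total_steps, data)
        before_kill = train(Path(tmp) / "resumed.json", kill_at, data)
        after_resume = train(Path(tmp) / "resumed.json", total_steps, data)
    return baseline, before_kill + after_resume


if __name__ == "__main__":
    baseline, resumed = check_resume()
    print("resume matches baseline:", baseline == resumed)
```

If the step counter, the cursor, or the RNG state were dropped from the checkpoint, the two logs would diverge after the kill point, which is the kind of regression the end-to-end test is meant to surface.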

Tests added

tests/smoke/test_smoke.py:

Harness changes

tests/smoke/conftest.py:

  • New CLI flags --data-path, --file-pattern, --data-vocab-size to feed pre-tokenized shards to the auto-resume and FP8 tests. Both TestAutoResume tests require --data-path because scripts/train.py falls back to synthetic torch.randint batches when no data source is configured, which bypasses StatefulDataLoader entirely (so resume coverage would be a no-op).
  • _detect_slurm_env now also reads AllocTRES gres/gpu=N and takes max(tasks_per_node, gres_per_node). sbatch --ntasks-per-node=4 --gres=gpu:4 and salloc --gres=gpu:4 (no --ntasks) both resolve to 4 GPUs per node.
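
The detection change can be sketched as follows. This is an illustrative reconstruction, not the code in tests/smoke/conftest.py: the function names, the regular expressions, and the sample scontrol output are assumptions, and the sketch assumes the gres/gpu count in AllocTRES is a job-wide total that is divided by the node count. Typed GRES entries such as gres/gpu:TYPE=N are not handled.

```python
import re


def _int_field(text: str, pattern: str, default: int) -> int:
    """Return the first captured integer for pattern, or default if absent."""
    match = re.search(pattern, text)
    return int(match.group(1)) if match else default


def tasks_only_gpus_per_node(scontrol_text: str) -> int:
    """Old behaviour described above: total_tasks // nodes."""
    nodes = max(_int_field(scontrol_text, r"\bNumNodes=(\d+)", 1), 1)
    return _int_field(scontrol_text, r"\bNumTasks=(\d+)", 0) // nodes


def gpus_per_node(scontrol_text: str) -> int:
    """New behaviour: max(tasks_per_node, gres_per_node)."""
    nodes = max(_int_field(scontrol_text, r"\bNumNodes=(\d+)", 1), 1)
    tasks_per_node = _int_field(scontrol_text, r"\bNumTasks=(\d+)", 0) // nodes

    # Only look for gres/gpu inside the AllocTRES field itself.
    alloc = re.search(r"\bAllocTRES=(\S+)", scontrol_text)
    gres_total = _int_field(alloc.group(1), r"gres/gpu=(\d+)", 0) if alloc else 0

    return max(tasks_per_node, gres_total // nodes)


# Hand-written sample output (illustrative, not captured from a real job).
SALLOC_NO_NTASKS = (
    "NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=1\n"
    "   AllocTRES=cpu=8,mem=64G,node=1,billing=8,gres/gpu=4\n"
)
SBATCH_WITH_NTASKS = (
    "NumNodes=1 NumCPUs=32 NumTasks=4 CPUs/Task=8\n"
    "   AllocTRES=cpu=32,mem=256G,node=1,billing=32,gres/gpu=4\n"
)

if __name__ == "__main__":
    print(tasks_only_gpus_per_node(SALLOC_NO_NTASKS))  # task-derived count
    print(gpus_per_node(SALLOC_NO_NTASKS))             # with the GRES fallback
    print(gpus_per_node(SBATCH_WITH_NTASKS))
```

Taking the maximum rather than preferring one source keeps the sbatch case unchanged while letting the GRES count rescue allocations where NumTasks understates the GPU count.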

Verification

Run on 4xH200 (jobid 7115669): TestAutoResume dense + MoE plus TestRealConfigs::test_fp8_7b_config all pass in 7m37s.
