The eight post-release fixes shipped in #49, #51, #53, #55, #57, #59, #61, #63 cover failure modes that none of the existing unit, integration, or distributed tests exercised. Without smoke coverage, future regressions in checkpoint resume, RNG seeding, or the FP8 7B config can land without a CI signal.
Gaps
No end-to-end test runs scripts/train.py to step N, kills the job, relaunches, and checks that the resume actually picked up the right step, RNG state, dataloader cursor, and stashed loader state. The unit tests stub the loader and skip the cross-process restart.
No test loads a real shipped training config (configs/train/7b_16gpu_fp8.toml) and runs a few steps. Drift in the config (a removed field, a renamed module, a Float8 path that quietly stops applying) lands without a signal.
The smoke harness's GPU detection broke for interactive salloc sessions: scontrol show job reports NumTasks=1 when the user does not pass --ntasks, even with --gres=gpu:4, so gpus_per_node = total_tasks // nodes evaluates to 1 and every multi-GPU smoke test was silently skipped. The failing arithmetic is sketched just below.
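For illustration, a minimal sketch of that arithmetic, assuming the harness derives the GPU count from scontrol show job fields (variable names are illustrative, not the harness's actual code):

```python
# What `scontrol show job` reports for `salloc --gres=gpu:4` when the
# user never passed --ntasks: SLURM defaults the allocation to one task.
scontrol_fields = {"NumNodes": 1, "NumTasks": 1}

total_tasks = scontrol_fields["NumTasks"]
nodes = scontrol_fields["NumNodes"]

# Old detection: tasks per node stands in for GPUs per node.
gpus_per_node = total_tasks // nodes  # == 1, despite --gres=gpu:4

# Every test gated on `gpus_per_node >= 2` then skips without failing.
```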
Tests added
tests/smoke/test_smoke.py:
TestAutoResume::test_auto_resume_dense and test_auto_resume_moe: train 20 steps with checkpoint.interval=10, relaunch with max_steps=30, and assert the five resume log markers (latest-checkpoint discovery, RNG restore, resume-step, skip_batches, stashed-dataloader apply). The MoE variant additionally checks moe/aux_loss logs post-resume. Together these exercise the init-path barrier timeout (#63), RNG coverage across all four generators (#59), the ownership-gated train_state.pt load (#61), and monotonic batches_yielded (#57).
TestRealConfigs::test_fp8_7b_config: runs configs/train/7b_16gpu_fp8.toml at reduced scope (3 steps, batch_size=4, compile off) and asserts both Float8 application and FSDP2 float8 all-gather are wired. Catches drift in the shipped config that unit tests cannot see. Condensed sketches of both tests follow.
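A sketch of the auto-resume test's shape, assuming a subprocess-based harness; the launch command, config-override syntax, and marker strings are illustrative, not the exact ones in tests/smoke/test_smoke.py:

```python
import re
import subprocess
from pathlib import Path

# Illustrative patterns; the real test asserts five specific log lines.
RESUME_MARKERS = [
    r"found latest checkpoint",
    r"restored RNG state",
    r"resuming from step",
    r"skip_batches",
    r"applied stashed dataloader state",
]

def _run_train(ckpt_dir: Path, max_steps: int) -> str:
    """Launch scripts/train.py in a fresh process and capture its logs."""
    cmd = [
        "torchrun", "--nproc_per_node=4", "scripts/train.py",
        f"--checkpoint.folder={ckpt_dir}",  # hypothetical override syntax
        "--checkpoint.interval=10",
        f"--max_steps={max_steps}",
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=600)
    assert proc.returncode == 0, proc.stderr
    return proc.stdout + proc.stderr

def test_auto_resume_dense(tmp_path: Path) -> None:
    # First run: 20 steps, so checkpoints land at steps 10 and 20.
    _run_train(tmp_path, max_steps=20)
    # Relaunch: must discover the step-20 checkpoint and continue to 30,
    # not restart from step 0 with fresh RNG and dataloader state.
    log = _run_train(tmp_path, max_steps=30)
    for marker in RESUME_MARKERS:
        assert re.search(marker, log), f"missing resume marker: {marker}"
```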
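And the real-config check in the same hedged style: run the shipped TOML at smoke scope and assert the Float8 wiring shows up in the logs (override flags and marker strings are assumptions, not the actual CLI):

```python
import subprocess

def test_fp8_7b_config() -> None:
    cmd = [
        "torchrun", "--nproc_per_node=4", "scripts/train.py",
        "--config=configs/train/7b_16gpu_fp8.toml",  # the shipped config
        "--max_steps=3",    # reduced scope for smoke runtime
        "--batch_size=4",
        "--compile=false",
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=900)
    assert proc.returncode == 0, proc.stderr
    log = proc.stdout + proc.stderr
    # Both paths must be wired: if config drift silently drops either,
    # this assertion is the only CI signal.
    assert "Float8" in log             # illustrative marker
    assert "float8 all-gather" in log  # illustrative marker
```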
Harness changes
tests/smoke/conftest.py:
New CLI flags --data-path, --file-pattern, --data-vocab-size to feed pre-tokenized shards to the auto-resume and FP8 tests. Both TestAutoResume tests require --data-path because scripts/train.py falls back to synthetic torch.randint batches when no data source is configured, which bypasses StatefulDataLoader entirely (so resume coverage would be a no-op).
_detect_slurm_env now also reads AllocTRES gres/gpu=N and takes max(tasks_per_node, gres_per_node), so sbatch --ntasks-per-node=4 --gres=gpu:4 and salloc --gres=gpu:4 (no --ntasks) both resolve to 4 GPUs per node; a sketch follows below.
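Roughly how the two conftest.py changes fit together, as a sketch with simplified single-pass parsing; the real helper handles multi-line scontrol output and missing fields:

```python
import re
import subprocess

def pytest_addoption(parser):
    # New flags: point the smoke tests at pre-tokenized shards so that
    # scripts/train.py exercises StatefulDataLoader instead of falling
    # back to synthetic torch.randint batches.
    parser.addoption("--data-path", default=None)
    parser.addoption("--file-pattern", default=None)
    parser.addoption("--data-vocab-size", type=int, default=None)

def _detect_slurm_env(job_id: str) -> int:
    """Return GPUs per node, robust to salloc sessions without --ntasks."""
    out = subprocess.run(
        ["scontrol", "show", "job", job_id],
        capture_output=True, text=True, check=True,
    ).stdout

    nodes = int(re.search(r"NumNodes=(\d+)", out).group(1))
    tasks = int(re.search(r"NumTasks=(\d+)", out).group(1))
    tasks_per_node = tasks // nodes  # 1 under salloc with no --ntasks

    # New: AllocTRES carries the job-total GPU count even when NumTasks
    # defaults to 1, e.g. "AllocTRES=cpu=8,...,gres/gpu=4".
    gres = re.search(r"AllocTRES=\S*gres/gpu=(\d+)", out)
    gres_per_node = int(gres.group(1)) // nodes if gres else 0

    # max() covers both launch styles: sbatch --ntasks-per-node=4
    # --gres=gpu:4 and salloc --gres=gpu:4 both resolve to 4.
    return max(tasks_per_node, gres_per_node)
```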
Verification
Run on 4xH200 (job ID 7115669): TestAutoResume (dense + MoE) and TestRealConfigs::test_fp8_7b_config all pass in 7m37s.