kempnerforge/distributed/setup.py::_set_seed calls torch.manual_seed
and torch.cuda.manual_seed on cold start. Python's random and NumPy's
legacy global RNG are left unseeded, so they fall back to process-level
entropy (os.urandom, wall clock) instead of the configured seed.
torch.cuda.manual_seed seeds only the current device, so on a rank with
multiple visible devices the additional devices also fall back to
process-level entropy.
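For context, the pre-fix seeding amounts to roughly the following. This is a sketch of the behavior described above, not the verbatim source; in particular the derivation of effective_seed is an assumption:

```python
import torch

def _set_seed(seed: int, rank: int, pp_rank: int = 0) -> None:
    # Assumed derivation of effective_seed; the real rule is not shown here.
    effective_seed = seed + pp_rank
    torch.manual_seed(effective_seed)       # torch CPU generator
    torch.cuda.manual_seed(effective_seed)  # seeds the *current* device only
    # Python's random and NumPy's global RNG are never touched, so a cold
    # start leaves them on whatever entropy the process picked up.
```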
kempnerforge.checkpoint.state.get_rng_state captures all four generators
(python, numpy, torch_cpu, torch_cuda) and set_rng_state restores them.
Warm resumes therefore have strictly stronger reproducibility than cold
starts. Any code path using random.random() or np.random.rand() on
rank > 0 produces different values on cold start than on warm resume of
the same step.
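For reference, a minimal sketch of the capture/restore pair. The exact dict layout in kempnerforge.checkpoint.state is assumed, but the four generators match the list above:

```python
import random
import numpy as np
import torch

def get_rng_state() -> dict:
    """Capture all four global generators for checkpointing."""
    state = {
        "python": random.getstate(),
        "numpy": np.random.get_state(),   # legacy global RNG
        "torch_cpu": torch.get_rng_state(),
    }
    if torch.cuda.is_available():
        state["torch_cuda"] = torch.cuda.get_rng_state_all()  # every device
    return state

def set_rng_state(state: dict) -> None:
    """Restore the generators captured by get_rng_state."""
    random.setstate(state["python"])
    np.random.set_state(state["numpy"])
    torch.set_rng_state(state["torch_cpu"])
    if "torch_cuda" in state and torch.cuda.is_available():
        torch.cuda.set_rng_state_all(state["torch_cuda"])
```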
Fix
In _set_seed(seed, rank, pp_rank=0):
- Call torch.manual_seed(effective_seed) (unchanged).
- Replace torch.cuda.manual_seed with torch.cuda.manual_seed_all so
  every visible CUDA device is seeded.
- Add random.seed(effective_seed) (Python stdlib).
- Add numpy.random.seed(effective_seed) (NumPy legacy global RNG, the
  same generator checkpoint.state captures).
Cold start and warm resume now seed the same set of generators; a sketch
of the fixed function follows.
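Put together, the fixed function looks roughly like this. The derivation of effective_seed is again an assumption; the coverage tests below only require that it varies with pp_rank and is identical across data-parallel ranks:

```python
import random
import numpy as np
import torch

def _set_seed(seed: int, rank: int, pp_rank: int = 0) -> None:
    # Assumed derivation: varies with pp_rank, identical across DP ranks.
    effective_seed = seed + pp_rank

    torch.manual_seed(effective_seed)           # torch CPU (unchanged)
    torch.cuda.manual_seed_all(effective_seed)  # every visible CUDA device
    random.seed(effective_seed)                 # Python stdlib RNG
    np.random.seed(effective_seed)              # NumPy legacy global RNG
```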
Drive-by
tests/unit/test_eval.py::TestRunEval::test_perfect_model_low_loss
depended on sum(embed(0)) happening to land positive under inherited
global RNG state. Zero the embedding and set row 0 to ones so the
assertion is deterministic regardless of test ordering.
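The fixup amounts to something like the following; the embedding sizes and variable name are illustrative, not taken from the test file:

```python
import torch

embed = torch.nn.Embedding(num_embeddings=16, embedding_dim=8)  # sizes illustrative
with torch.no_grad():
    embed.weight.zero_()        # wipe whatever the inherited global RNG produced
    embed.weight[0].fill_(1.0)  # row 0 = all ones

assert torch.sum(embed(torch.tensor(0))) > 0  # always embedding_dim, here 8.0
```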
Coverage
tests/unit/test_distributed_seed.py adds four tests (the first is sketched below):
test_set_seed_seeds_all_four_generators
test_set_seed_varies_with_pp_rank
test_set_seed_same_across_dp_ranks
test_set_seed_matches_checkpoint_coverage
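As an illustration, a minimal version of the first test. The import path comes from the module under discussion; the seed value and assertion style are assumptions:

```python
import random

import numpy as np
import torch

from kempnerforge.distributed.setup import _set_seed

def test_set_seed_seeds_all_four_generators():
    _set_seed(seed=1234, rank=0)
    first = (random.random(), np.random.rand(), torch.rand(1).item())

    _set_seed(seed=1234, rank=0)
    second = (random.random(), np.random.rand(), torch.rand(1).item())

    # Re-seeding must reproduce identical draws from the python, numpy,
    # and torch CPU generators; CUDA is exercised only when available.
    assert first == second

    if torch.cuda.is_available():
        _set_seed(seed=1234, rank=0)
        a = torch.rand(4, device="cuda")
        _set_seed(seed=1234, rank=0)
        b = torch.rand(4, device="cuda")
        assert torch.equal(a, b)
```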