kempnerforge/distributed/setup.py::_set_seed calls torch.manual_seed
and torch.cuda.manual_seed on cold start. Python's random and NumPy's
legacy global RNG are left unseeded, so they fall back to process-level
entropy (os.urandom, wall clock) instead of the configured seed.
torch.cuda.manual_seed seeds only the current device, so on a rank with
multiple visible devices the additional devices also fall back to
process-level entropy.
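For context, the pre-fix seeding amounts to roughly the following. This is a sketch of the behavior described above, not the verbatim source; in particular the derivation of effective_seed is an assumption:

```python
import torch

def _set_seed(seed: int, rank: int, pp_rank: int = 0) -> None:
    # Assumed derivation of effective_seed; the real rule is not shown here.
    effective_seed = seed + pp_rank
    torch.manual_seed(effective_seed)       # torch CPU generator
    torch.cuda.manual_seed(effective_seed)  # seeds the *current* device only
    # Python's random and NumPy's global RNG are never touched, so a cold
    # start leaves them on whatever entropy the process picked up.
```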
kempnerforge.checkpoint.state.get_rng_state captures all four generators
(python, numpy, torch_cpu, torch_cuda) and set_rng_state restores them.
Warm resumes therefore have strictly stronger reproducibility than cold
starts. Any code path using random.random() or np.random.rand() on
rank > 0 produces different values on cold start than on warm resume of
the same step.
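For reference, a minimal sketch of the capture/restore pair. The exact dict layout in kempnerforge.checkpoint.state is assumed, but the four generators match the list above:

```python
import random
import numpy as np
import torch

def get_rng_state() -> dict:
    """Capture all four global generators for checkpointing."""
    state = {
        "python": random.getstate(),
        "numpy": np.random.get_state(),   # legacy global RNG
        "torch_cpu": torch.get_rng_state(),
    }
    if torch.cuda.is_available():
        state["torch_cuda"] = torch.cuda.get_rng_state_all()  # every device
    return state

def set_rng_state(state: dict) -> None:
    """Restore the generators captured by get_rng_state."""
    random.setstate(state["python"])
    np.random.set_state(state["numpy"])
    torch.set_rng_state(state["torch_cpu"])
    if "torch_cuda" in state and torch.cuda.is_available():
        torch.cuda.set_rng_state_all(state["torch_cuda"])
```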
Fix
In _set_seed(seed, rank, pp_rank=0):
- Call torch.manual_seed(effective_seed) (unchanged).
- Replace torch.cuda.manual_seed with torch.cuda.manual_seed_all so
  every visible CUDA device is seeded.
- Add random.seed(effective_seed) (Python stdlib).
- Add numpy.random.seed(effective_seed) (NumPy legacy global RNG, the
  same generator checkpoint.state captures).
Cold start and warm resume now seed the same set of generators; a sketch
of the fixed function follows.
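Put together, the fixed function looks roughly like this. The derivation of effective_seed is again an assumption; the coverage tests below only require that it varies with pp_rank and is identical across data-parallel ranks:

```python
import random
import numpy as np
import torch

def _set_seed(seed: int, rank: int, pp_rank: int = 0) -> None:
    # Assumed derivation: varies with pp_rank, identical across DP ranks.
    effective_seed = seed + pp_rank

    torch.manual_seed(effective_seed)           # torch CPU (unchanged)
    torch.cuda.manual_seed_all(effective_seed)  # every visible CUDA device
    random.seed(effective_seed)                 # Python stdlib RNG
    np.random.seed(effective_seed)              # NumPy legacy global RNG
```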
Drive-by
tests/unit/test_eval.py::TestRunEval::test_perfect_model_low_loss
depended on sum(embed(0)) happening to land positive under inherited
global RNG state. Zero the embedding and set row 0 to ones so the
assertion is deterministic regardless of test ordering.
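The fixup amounts to something like the following; the embedding sizes and variable name are illustrative, not taken from the test file:

```python
import torch

embed = torch.nn.Embedding(num_embeddings=16, embedding_dim=8)  # sizes illustrative
with torch.no_grad():
    embed.weight.zero_()        # wipe whatever the inherited global RNG produced
    embed.weight[0].fill_(1.0)  # row 0 = all ones

assert torch.sum(embed(torch.tensor(0))) > 0  # always embedding_dim, here 8.0
```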
Coverage
tests/unit/test_distributed_seed.py adds four tests (the first is sketched below):
test_set_seed_seeds_all_four_generators
test_set_seed_varies_with_pp_rank
test_set_seed_same_across_dp_ranks
test_set_seed_matches_checkpoint_coverage
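As an illustration, a minimal version of the first test. The import path comes from the module under discussion; the seed value and assertion style are assumptions:

```python
import random

import numpy as np
import torch

from kempnerforge.distributed.setup import _set_seed

def test_set_seed_seeds_all_four_generators():
    _set_seed(seed=1234, rank=0)
    first = (random.random(), np.random.rand(), torch.rand(1).item())

    _set_seed(seed=1234, rank=0)
    second = (random.random(), np.random.rand(), torch.rand(1).item())

    # Re-seeding must reproduce identical draws from the python, numpy,
    # and torch CPU generators; CUDA is exercised only when available.
    assert first == second

    if torch.cuda.is_available():
        _set_seed(seed=1234, rank=0)
        a = torch.rand(4, device="cuda")
        _set_seed(seed=1234, rank=0)
        b = torch.rand(4, device="cuda")
        assert torch.equal(a, b)
```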