_set_seed covers fewer RNGs than the checkpoint path captures #59

@mmshad

Description

kempnerforge/distributed/setup.py::_set_seed calls torch.manual_seed
and torch.cuda.manual_seed on cold start. Python's random and NumPy's
legacy global RNG are left unseeded, so they pick up process-level entropy
(os.urandom, or the wall clock as a fallback) instead of the configured
seed. PYTHONHASHSEED does not help here: it controls hash randomization,
not the random module.

torch.cuda.manual_seed seeds only the current device, so on a rank with
multiple visible devices the additional devices also fall back to
process-level entropy.

kempnerforge.checkpoint.state.get_rng_state captures all four generators
(python, numpy, torch_cpu, torch_cuda) and set_rng_state restores them.
Warm resumes therefore have strictly stronger reproducibility than cold
starts. Any code path using random.random() or np.random.rand() on
rank > 0 produces different values on cold start than on warm resume of
the same step.
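
To illustrate the warm-resume side, here is a minimal sketch of capturing and restoring the four generators. The helper names capture_rng_state, restore_rng_state and draw are hypothetical stand-ins, not the actual kempnerforge.checkpoint.state code, and the NumPy and PyTorch imports are guarded only so the sketch runs where they are not installed.

```python
import random

try:
    import numpy as np
except ImportError:  # NumPy is optional for this sketch
    np = None

try:
    import torch
except ImportError:  # PyTorch is optional for this sketch
    torch = None


def capture_rng_state():
    """Hypothetical stand-in for get_rng_state: snapshot every available generator."""
    state = {"python": random.getstate()}
    if np is not None:
        state["numpy"] = np.random.get_state()  # legacy global RNG
    if torch is not None:
        state["torch_cpu"] = torch.get_rng_state()
        if torch.cuda.is_available():
            # One state tensor per visible CUDA device.
            state["torch_cuda"] = torch.cuda.get_rng_state_all()
    return state


def restore_rng_state(state):
    """Hypothetical stand-in for set_rng_state: restore whatever was captured."""
    random.setstate(state["python"])
    if np is not None and "numpy" in state:
        np.random.set_state(state["numpy"])
    if torch is not None and "torch_cpu" in state:
        torch.set_rng_state(state["torch_cpu"])
        if "torch_cuda" in state and torch.cuda.is_available():
            torch.cuda.set_rng_state_all(state["torch_cuda"])


def draw():
    """Draw one value from each available CPU-side generator."""
    values = [random.random()]
    if np is not None:
        values.append(float(np.random.rand()))
    if torch is not None:
        values.append(float(torch.rand(1)))
    return values


snapshot = capture_rng_state()
before = draw()
restore_rng_state(snapshot)
after = draw()  # same values as `before`, because every generator was restored
```

A cold start has no snapshot to restore, so any generator that _set_seed does not seed explicitly cannot reproduce these draws.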

Fix

In _set_seed(seed, rank, pp_rank=0):

  • Call torch.manual_seed(effective_seed) (unchanged).
  • Replace torch.cuda.manual_seed with torch.cuda.manual_seed_all so
    every visible CUDA device is seeded.
  • Add random.seed(effective_seed) (Python stdlib).
  • Add numpy.random.seed(effective_seed) (NumPy legacy global RNG, the
    same generator checkpoint.state captures).

Cold start and warm resume now seed the same set of generators.
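
The four calls listed above could look roughly like the sketch below. The function name seed_all_generators is hypothetical, and the sketch takes effective_seed as an input because this issue does not spell out how it is derived from seed, rank and pp_rank. The guarded imports are only there so the sketch runs without NumPy or PyTorch installed.

```python
import random

try:
    import numpy as np
except ImportError:  # NumPy is optional for this sketch
    np = None

try:
    import torch
except ImportError:  # PyTorch is optional for this sketch
    torch = None


def seed_all_generators(effective_seed: int) -> None:
    """Hypothetical sketch: seed the same generators the checkpoint path captures."""
    # Python stdlib RNG.
    random.seed(effective_seed)
    if np is not None:
        # NumPy legacy global RNG; the seed must fit in an unsigned 32-bit integer.
        np.random.seed(effective_seed)
    if torch is not None:
        # Default PyTorch generator.
        torch.manual_seed(effective_seed)
        # Every visible CUDA device, not just the current one.
        # Safe to call on a machine without CUDA.
        torch.cuda.manual_seed_all(effective_seed)


seed_all_generators(1234)
first = random.random()
seed_all_generators(1234)
second = random.random()  # equal to `first`: the stdlib RNG was reseeded
```

One constraint worth keeping in mind: np.random.seed rejects values outside [0, 2**32 - 1], so an effective seed built by adding offsets to a large base seed may need to be reduced modulo 2**32 before it reaches NumPy.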

Drive-by

tests/unit/test_eval.py::TestRunEval::test_perfect_model_low_loss
depended on sum(embed(0)) landing positive from inherited global RNG
state. Zero the embedding and set row 0 to ones so the assertion is
deterministic regardless of test ordering.

Coverage

tests/unit/test_distributed_seed.py adds four tests:

  • test_set_seed_seeds_all_four_generators
  • test_set_seed_varies_with_pp_rank
  • test_set_seed_same_across_dp_ranks
  • test_set_seed_matches_checkpoint_coverage
