Conversation
BlueCrescent
left a comment
Overall LGTM.
Should we also explicitly allow seeding for the "model_initialized" component?
It will probably inherit the random state from the model_raw component, but it seems risky to assume that no other interaction with the random state happens between these two components, now or in the future (though probably only interactions that are asymmetrical between the ranks would be problematic). In particular, since we cannot guarantee the order in which the components are built, something like a dataloader component might even re-seed the random state.
le1nux
left a comment
I checked the seeding (not the test), and from my understanding the changes do not provide the expected results (which is also what @BlueCrescent was hinting at).
When we seed the raw model, the model weights are indeed deterministic at instantiation time. However, model weight initialization runs afterwards and would simply overwrite the weights, undoing the seeding.
Additionally, passing device_mesh to the model couples two components that should normally not know anything about each other.
I think we have to integrate the seeding into the weight initializer component, and we can also pass in the device_mesh there.
Yes, that makes sense. I moved the seeding to the model initialization component.
See #426 (comment)
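A minimal sketch of that direction, assuming a standalone initializer component that owns the seed (the class name, method name, and normal-init are illustrative, not the repo's actual API; it also assumes a PyTorch version where `torch.nn.init` accepts a `generator` argument):

```python
import torch
import torch.nn as nn


class SeededWeightInitializer:
    """Hypothetical initializer component that owns the seed,
    so the model itself never touches the RNG state."""

    def __init__(self, seed: int) -> None:
        self.seed = seed

    def initialize_in_place(self, model: nn.Module) -> None:
        # Local generator: deterministic init without mutating the global RNG.
        generator = torch.Generator()  # assumes parameters live on CPU here
        generator.manual_seed(self.seed)
        for parameter in model.parameters():
            nn.init.normal_(parameter, mean=0.0, std=0.02, generator=generator)
```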
le1nux
left a comment
Generally in a good state.
Left a couple of comments.
My main concern is the global setting of the seed; a Generator object might be preferable.
| """NNModel class to define a base model.""" | ||
|
|
||
| def __init__(self, seed: int = None, weight_decay_groups: Optional[WeightDecayGroups] = None): | ||
| def __init__(self, seed: Optional[int] = None, weight_decay_groups: Optional[WeightDecayGroups] = None): |
Suggested change:

```diff
- def __init__(self, seed: Optional[int] = None, weight_decay_groups: Optional[WeightDecayGroups] = None):
+ def __init__(self, seed: int | None = None, weight_decay_groups: Optional[WeightDecayGroups] = None):
```
Do we even want to allow setting the seed here?
Could torch.manual_seed below have side effects with the new weight init implementation?
It probably could have side effects, e.g. on default submodule initialization, random ops, and the ambient global RNG state of unrelated code. It is also mostly redundant, since we now use a local generator for weight initialization. I would suggest removing it; the snippet below illustrates the difference.
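A small standalone illustration (not repo code) of why the global seed is risky while a local Generator is not:

```python
import torch

# torch.manual_seed mutates the global RNG that every unrelated random op
# shares, so anything else drawing random numbers shifts the stream.
torch.manual_seed(0)
a = torch.rand(3)  # drawn from (and advancing) the global stream

# A local Generator keeps the draw reproducible without touching that state.
gen = torch.Generator()
gen.manual_seed(0)
b = torch.rand(3, generator=gen)  # global stream is left untouched
```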
le1nux
left a comment
Looks good to me! Nice work :)
We should discuss the comment regarding the per-device Generator and decide what makes the most sense there.
I would add one last test that checks that two models instantiated from the same config file (with a specified seed) have 100% matching parameter weights; a sketch follows below. I'd keep that one simple (no advanced sharding like TP or PP, only FSDP).
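A rough sketch of such a test; `build_model_from_config` and the config path are hypothetical placeholders for however the repo wires up instantiation plus weight init:

```python
import torch


def test_seeded_models_have_identical_weights(build_model_from_config):
    # `build_model_from_config` is a hypothetical pytest fixture that runs the
    # full instantiation + weight-init pipeline from a config with a fixed seed.
    model_a = build_model_from_config("config_with_seed.yaml")
    model_b = build_model_from_config("config_with_seed.yaml")

    params_a = dict(model_a.named_parameters())
    params_b = dict(model_b.named_parameters())
    assert params_a.keys() == params_b.keys()
    for name, param_a in params_a.items():
        assert torch.equal(param_a, params_b[name]), f"Mismatch in {name}"
```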
```python
try:
    from transformers.utils.generic import check_model_inputs
except ImportError:

    def check_model_inputs(func: Callable) -> Callable:
        ...
```
Was this removed in transformers?
If it is part of a legacy API, I think we should also remove it on our end.
What do you think, @BlueCrescent? I think you added it, right?
The function was removed in transformers version 5.2. In our pyproject.toml we specify the requirement "transformers>=4.57.4,<5.0.0", so I was using an unsupported transformers version here. Should we remove it just to be on the safe side?
Yes, I think we should tackle transformers 5.0.0+ support soon anyway.
```python
self.seed = torch.initial_seed() if seed is None else seed
self._generators: dict[str, torch.Generator] = {}

def _get_generator(self, parameter: torch.Tensor) -> torch.Generator:
    ...
```
A few things are not clear to me:

1. Do we actually have the case where, within a single process, tensors sit on different GPUs?
2. If 1. is the case, then we can end up with tensors that are initialized identically, since we create multiple generators from the same seed.

I'm not sure what the best way to solve this is; the PyTorch API around Generators also seems rather limited to me.
Since we start a single process per rank via torchrun, this shouldn't happen, right? Or am I missing something?
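For reference, a hedged reconstruction of the full caching logic; only the two attribute lines and the `_get_generator` signature in the diff above are from the PR, the body is an assumption about how the per-device lookup could work:

```python
import torch


class PerDeviceGenerators:
    """Sketch: one Generator per device, created lazily and cached.

    Note that every cached generator is seeded with the same self.seed,
    which is exactly the identical-initialization concern from point 2.
    """

    def __init__(self, seed: int | None = None) -> None:
        self.seed = torch.initial_seed() if seed is None else seed
        self._generators: dict[str, torch.Generator] = {}

    def _get_generator(self, parameter: torch.Tensor) -> torch.Generator:
        device_key = str(parameter.device)
        if device_key not in self._generators:
            generator = torch.Generator(device=parameter.device)
            generator.manual_seed(self.seed)
            self._generators[device_key] = generator
        return self._generators[device_key]
```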
le1nux
left a comment
Minor comment. Otherwise all looks good to me! :)
Nice work! 👍
What does this PR do?
This PR gives each pipeline-parallel (PP) rank a unique model seed, so that parameters are initialized differently across ranks.
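A minimal sketch of this per-rank derivation, assuming a DeviceMesh with a "pp" dimension and a simple additive offset (the actual derivation in the PR may differ):

```python
from torch.distributed.device_mesh import DeviceMesh


def derive_pp_rank_seed(base_seed: int, device_mesh: DeviceMesh) -> int:
    # Offset the configured seed by the pipeline-parallel rank so that each
    # PP rank initializes different parameters.
    pp_rank = device_mesh.get_local_rank("pp")
    return base_seed + pp_rank
```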
General Changes
Breaking Changes
Checklist before submitting final PR
- Tests pass (`python tests/tests.py`)
- `CHANGELOG_DEV.md` updated