fix: Diverse model seeding across PP ranks #426

Merged
rrutmann merged 26 commits into main from seed on May 8, 2026
Conversation

@rrutmann
Collaborator

@rrutmann rrutmann commented Dec 10, 2025

What does this PR do?

This PR gives each PP rank a unique model seed, so that parameters are initialized differently across ranks.

General Changes

  • On each rank, add the PP rank to the model seed (sketched below).
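
A minimal sketch of the idea, assuming a torch.distributed DeviceMesh with a named "pp" dimension (the helper name is hypothetical, not the merged code):

```python
import torch
from torch.distributed.device_mesh import DeviceMesh


def derive_model_seed(base_seed: int, device_mesh: DeviceMesh) -> int:
    # Hypothetical helper: offset the configured model seed by the
    # pipeline-parallel rank so each PP stage initializes differently.
    pp_rank = device_mesh["pp"].get_local_rank()
    return base_seed + pp_rank
```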

Breaking Changes

  • None

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)

Member

@BlueCrescent BlueCrescent left a comment

Overall LGTM.
Should we also explicitly allow seeding for the "model_initialized" component?
It will probably inherit the random state from the model_raw component, but it seems a bit risky to me to assume that (also in the future) no other interaction with the random state happens between these two components (though probably only interactions that are asymmetrical between the ranks would be problematic). In particular, since we cannot guarantee the order in which the components are built, something like a dataloader component might even re-seed the random state.

Comment thread tests/fsdp2_parallelization/test_parallel_seed_initialization.py Outdated
@rrutmann rrutmann requested a review from le1nux December 19, 2025 10:25
@rrutmann rrutmann self-assigned this Dec 19, 2025
Member

@le1nux le1nux left a comment

I checked the seeding (not the test), and from my understanding the changes do not provide the expected results (also what @BlueCrescent was hinting at).

When we seed the raw model, the model weights are indeed deterministic at instantiation time. However, we also have model weight initialization, which runs afterwards and would simply override the weights / seeding.

Additionally, passing device_mesh to the model couples two components that should normally not know anything about each other.

I think we have to integrate the seeding into the weight initializer component and can also pass in the device_mesh there.
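
A hedged sketch of what that integration could look like (class and method names are assumptions, not the merged implementation):

```python
import torch
from torch.distributed.device_mesh import DeviceMesh


class SeededModelInitializer:
    # Hypothetical weight initializer component that owns the seeding
    # and receives the device_mesh, instead of the model doing so.
    def __init__(self, base_seed: int, device_mesh: DeviceMesh | None = None):
        pp_rank = device_mesh["pp"].get_local_rank() if device_mesh is not None else 0
        self.seed = base_seed + pp_rank  # diverse seed per PP rank

    def initialize_in_place(self, model: torch.nn.Module) -> None:
        # A local generator keeps the init deterministic without
        # mutating global RNG state (assumes CPU parameters here).
        generator = torch.Generator().manual_seed(self.seed)
        with torch.no_grad():
            for parameter in model.parameters():
                parameter.normal_(mean=0.0, std=0.02, generator=generator)
```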

@rrutmann
Collaborator Author

> I checked the seeding (not the test), and from my understanding the changes do not provide the expected results (also what @BlueCrescent was hinting at).
>
> When we seed the raw model, the model weights are indeed deterministic at instantiation time. However, we also have model weight initialization, which runs afterwards and would simply override the weights / seeding.
>
> Additionally, passing device_mesh to the model couples two components that should normally not know anything about each other.
>
> I think we have to integrate the seeding into the weight initializer component and can also pass in the device_mesh there.

Yes, that makes sense. I moved the seeding to the model initialization component.

@rrutmann
Collaborator Author

> Overall LGTM. Should we also explicitly allow seeding for the "model_initialized" component? It will probably inherit the random state from the model_raw component, but it seems a bit risky to me to assume that (also in the future) no other interaction with the random state happens between these two components (though probably only interactions that are asymmetrical between the ranks would be problematic). In particular, since we cannot guarantee the order in which the components are built, something like a dataloader component might even re-seed the random state.

See #426 (comment)

Member

@BlueCrescent BlueCrescent left a comment

LGTM

Member

@le1nux le1nux left a comment

Generally in a good state.
Left a couple of comments.
My main concern is the global setting of the seed; a generator object might be preferable.

Comment thread src/modalities/models/model.py Outdated
"""NNModel class to define a base model."""

- def __init__(self, seed: int = None, weight_decay_groups: Optional[WeightDecayGroups] = None):
+ def __init__(self, seed: Optional[int] = None, weight_decay_groups: Optional[WeightDecayGroups] = None):
Member

Suggested change
- def __init__(self, seed: Optional[int] = None, weight_decay_groups: Optional[WeightDecayGroups] = None):
+ def __init__(self, seed: int | None = None, weight_decay_groups: Optional[WeightDecayGroups] = None):

Member

Do we even want to allow setting the seed here?
Could torch.manual_seed below have side effects with the new weight init implementation?

Collaborator Author

@rrutmann rrutmann May 5, 2026

It could probably have side effects, e.g. on default submodule initialization, random ops, and the ambient global RNG state of unrelated code. It is also mostly redundant, since we now use a local generator for weight initialization. I would suggest removing it.
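
For illustration, the difference between global seeding and a local generator (a sketch, not the PR's code):

```python
import torch

# Global seeding mutates ambient RNG state that unrelated code
# (dataloaders, dropout, other components) also draws from:
torch.manual_seed(42)

# A local generator scopes determinism to the weight init alone:
generator = torch.Generator().manual_seed(42)
weight = torch.empty(128, 128)
weight.normal_(mean=0.0, std=0.02, generator=generator)
```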

Comment thread src/modalities/nn/model_initialization/composed_initialization.py Outdated
Comment thread src/modalities/nn/model_initialization/initialization_routines.py Outdated
Comment thread src/modalities/nn/model_initialization/composed_initialization.py
Comment thread src/modalities/nn/model_initialization/initialization_routines.py Outdated
rrutmann and others added 8 commits May 5, 2026 14:00
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
…tialization

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
@le1nux le1nux self-requested a review May 6, 2026 17:10
Member

@le1nux le1nux left a comment

Looks good to me! Nice work :)
We should discuss the comment regarding the per-device Generator and what makes the most sense there.

I would add one last test, which checks that two models instantiated from the same config file (with a specified seed) have 100% matching parameter weights. I'd keep that one simple (no advanced sharding like TP or PP, only FSDP); see the sketch below.
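
A sketch of such a test; the `build_model_from_config` fixture and the config path are assumptions standing in for the project's component-build machinery:

```python
import torch


def test_same_seed_yields_identical_models(build_model_from_config):
    # Hypothetical fixture: runs the full component build from a config
    # file with a fixed seed, including the seeded weight initialization.
    model_a = build_model_from_config("config_with_seed.yaml")
    model_b = build_model_from_config("config_with_seed.yaml")

    for (name_a, p_a), (name_b, p_b) in zip(
        model_a.named_parameters(), model_b.named_parameters()
    ):
        assert name_a == name_b
        assert torch.equal(p_a, p_b), f"Mismatch in parameter {name_a}"
```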

try:
    from transformers.utils.generic import check_model_inputs
except ImportError:

    def check_model_inputs(func: Callable) -> Callable:
Member

Was this removed in transformers?
If it is part of a legacy API I think we should also remove this on our end.
What do you think @BlueCrescent? I think you added it, right?

Collaborator Author

The function was removed in transformers version 5.2. In our pyproject.toml we specify the requirement "transformers>=4.57.4,<5.0.0", so the fallback only matters for a transformers version we do not support anyway. Should we remove it just to be on the safe side?

Member

Yes, I think we should tackle transformers 5.0.0+ support soon anyway.

self.seed = torch.initial_seed() if seed is None else seed
self._generators: dict[str, torch.Generator] = {}

def _get_generator(self, parameter: torch.Tensor) -> torch.Generator:
Member

A few things are not clear to me:

  1. Do we actually have the case where, within a single process, tensors are sitting on different GPUs?
  2. If 1. is the case, then we can end up with tensors that are initialized identically, since we create multiple generators from the same seed.

I'm not sure what the best way to solve this is... it also seems to me that the PyTorch API regarding Generators is somewhat limited.

Collaborator Author

Since we start a single process for each rank via torchrun, this shouldn't happen, right? Or am I missing something?
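
For reference, the per-device generator cache under discussion could look like this (a sketch of the pattern, not the merged code):

```python
import torch


class SeededInit:
    def __init__(self, seed: int | None = None):
        self.seed = torch.initial_seed() if seed is None else seed
        self._generators: dict[str, torch.Generator] = {}

    def _get_generator(self, parameter: torch.Tensor) -> torch.Generator:
        # Generators must live on the same device as the tensor they fill.
        # With torchrun, each rank is one process bound to one GPU, so in
        # practice only a single generator is created per rank.
        key = str(parameter.device)
        if key not in self._generators:
            generator = torch.Generator(device=parameter.device)
            generator.manual_seed(self.seed)
            self._generators[key] = generator
        return self._generators[key]
```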

Comment thread src/modalities/nn/model_initialization/initialization_routines.py
Comment thread src/modalities/nn/model_initialization/initialization_routines.py Outdated
Co-authored-by: Max Lübbering <2804731+le1nux@users.noreply.github.com>
rrutmann and others added 4 commits May 7, 2026 12:58
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
…tiple devices per rank

Co-authored-by: Copilot <copilot@github.com>
Member

@le1nux le1nux left a comment

Minor comment. Otherwise all looks good to me! :)
Nice work! 👍

Comment thread src/modalities/nn/model_initialization/initialization_routines.py Outdated
rrutmann and others added 2 commits May 8, 2026 10:43
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
@rrutmann rrutmann merged commit 4705675 into main May 8, 2026
3 checks passed