fix: Sharding for learnable residual connections. #1099

Draft
jakob-schloer wants to merge 1 commit into main from fix/sharding_lrc

Conversation

@jakob-schloer
Collaborator

@jakob-schloer jakob-schloer commented May 7, 2026

Description

Add model-sharding support for SpectralOrnsteinConnection so it works correctly when num_gpus_per_model > 1.

What problem does this change solve?

SpectralOrnsteinConnection was incompatible with model sharding (grid split across multiple GPUs). Two issues existed:

  1. The inverse SHT produces a spatial weight field with the full grid shape, but when sharded, x only contains the local grid shard. Broadcasting fails on the mismatched grid dimension.
  2. When truncate=True, the forward SHT needs all latitude rings to produce correct spectral coefficients. Operating on a grid shard silently yields incorrect results.
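
The first issue can be illustrated with a minimal NumPy sketch. It is not the code in this PR: `shard_bounds` and `apply_weights_sharded` are hypothetical helpers, and the contiguous, near-equal split of the grid dimension is an assumption about how shards are laid out. It shows one way to avoid the broadcast failure, namely slicing the full-grid weight field down to the local shard before multiplying.

```python
import numpy as np


def shard_bounds(grid_size, world_size, rank):
    """Hypothetical helper: contiguous, near-equal split of the grid dimension."""
    base, extra = divmod(grid_size, world_size)
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


def apply_weights_sharded(x_local, weight_full, rank, world_size):
    """Multiply a local grid shard by the matching slice of the full-grid weights."""
    start, end = shard_bounds(weight_full.shape[0], world_size, rank)
    return x_local * weight_full[start:end]


grid, channels, world_size = 10, 3, 4
rng = np.random.default_rng(0)
x_full = rng.normal(size=(2, grid, channels))     # (batch, grid, channels)
weight_full = rng.normal(size=(grid, channels))   # full-grid weight field
reference = x_full * weight_full                  # unsharded result

pieces = []
for rank in range(world_size):
    start, end = shard_bounds(grid, world_size, rank)
    x_local = x_full[:, start:end]                # what one rank would hold
    pieces.append(apply_weights_sharded(x_local, weight_full, rank, world_size))

# Concatenating the per-rank outputs reproduces the unsharded result.
result = np.concatenate(pieces, axis=1)
```

Multiplying `x_local` by `weight_full` directly would raise a broadcasting error, because the grid dimensions differ (for example 3 versus 10 on rank 0).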

What issue or task does this change relate to?

Additional notes

The gather_tensor call in _apply_truncation introduces one all-gather collective per forward pass. Every rank then holds the full grid, so this could become a communication and memory bottleneck for very large grids.
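
Why a gather is needed before truncation can be sketched with a toy example. The following uses a real FFT along the grid axis as a stand-in for the SHT, and `np.concatenate` as a stand-in for the all-gather; `truncate_modes`, the shard layout, and all sizes are assumptions for illustration, not the Anemoi implementation. It shows that truncating each shard on its own differs from truncating the full field, while gather, truncate, then re-slice matches the unsharded result.

```python
import numpy as np


def truncate_modes(field, keep):
    """Low-pass filter along axis 0: forward transform, zero high modes, inverse."""
    coeffs = np.fft.rfft(field, axis=0)
    coeffs[keep:] = 0.0
    return np.fft.irfft(coeffs, n=field.shape[0], axis=0)


grid, channels, world_size, keep = 16, 2, 4, 2
rng = np.random.default_rng(0)
x_full = rng.normal(size=(grid, channels))
shards = np.split(x_full, world_size, axis=0)   # one equal-sized shard per rank

# Unsharded reference: the transform sees the whole grid.
reference = truncate_modes(x_full, keep)

# Naive sharded version: each rank transforms only its own shard.
naive = np.concatenate([truncate_modes(s, keep) for s in shards], axis=0)

# Gather, transform on the full grid, then keep only the local slice per rank.
shard_len = grid // world_size
local_results = []
for rank in range(world_size):
    gathered = np.concatenate(shards, axis=0)   # stand-in for an all-gather
    filtered = truncate_modes(gathered, keep)
    local_results.append(filtered[rank * shard_len:(rank + 1) * shard_len])
correct = np.concatenate(local_results, axis=0)
```

In this sketch every rank repeats the full-grid transform after the gather, which is the per-forward-pass cost the note above refers to.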

Todos

As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines about those please refer to https://anemoi.readthedocs.io/en/latest/

By opening this pull request, I affirm that all authors agree to the Contributor License Agreement.

@github-project-automation github-project-automation Bot moved this to To be triaged in Anemoi-dev May 7, 2026
@jakob-schloer jakob-schloer self-assigned this May 7, 2026
@github-actions github-actions Bot added the bug and training labels May 7, 2026
@github-actions github-actions Bot added the models label May 7, 2026
@jakob-schloer jakob-schloer requested a review from ssmmnn11 May 7, 2026 09:08
@JPXKQX JPXKQX added the ATS Approval Not Needed label May 7, 2026

Labels

ATS Approval Not Needed, bug, models, training

Projects

Status: To be triaged

3 participants