fix: Sharding for learnable residual connections. by jakob-schloer · Pull Request #1099 · ecmwf/anemoi-core

jakob-schloer · 2026-05-07T09:07:49Z

Description

Add model-sharding support for SpectralOrnsteinConnection so it works correctly when num_gpus_per_model > 1.

What problem does this change solve?

SpectralOrnsteinConnection was incompatible with model sharding (grid split across multiple GPUs). Two issues existed:

The inverse SHT produces a spatial weight field with the full grid shape, but under sharding x contains only the local grid shard, so broadcasting fails on the mismatched grid dimension.
When truncate=True, the forward SHT needs all latitude rings to produce correct spectral coefficients. Operating on a single grid shard silently yields incorrect results.
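
The two issues can be reproduced in a minimal, single-process sketch. It makes several assumptions that are not taken from the PR: plain Python lists stand in for tensors, a naive DFT low-pass filter stands in for the SHT-based truncation, list concatenation stands in for the all-gather, and the helper names (shard_bounds, low_pass) are hypothetical, not Anemoi API.

```python
import cmath
import math


def shard_bounds(shard_sizes, rank):
    """Return the [start, end) range of the grid owned by `rank` (hypothetical helper)."""
    start = sum(shard_sizes[:rank])
    return start, start + shard_sizes[rank]


def low_pass(values, keep):
    """Keep the `keep` lowest frequencies of a real signal.

    A naive DFT is used here only as a stand-in for the forward/inverse SHT:
    like the SHT, it is a global transform that needs every grid point.
    """
    n = len(values)
    coeffs = [
        sum(v * cmath.exp(-2j * cmath.pi * k * i / n) for i, v in enumerate(values))
        for k in range(n)
    ]
    kept = [c if min(k, n - k) < keep else 0 for k, c in enumerate(coeffs)]
    return [
        sum(c * cmath.exp(2j * cmath.pi * k * i / n) for k, c in enumerate(kept)).real / n
        for i in range(n)
    ]


n_grid = 12
shard_sizes = [6, 6]  # two simulated ranks
ranks = range(len(shard_sizes))
full_x = [math.cos(2 * math.pi * i / n_grid) for i in range(n_grid)]
full_weight = [0.5 + 0.01 * i for i in range(n_grid)]  # full-grid weight field
shards = [full_x[slice(*shard_bounds(shard_sizes, r))] for r in ranks]

# Issue 1: the weight field covers the full grid, so each rank slices it
# down to its own shard before the element-wise product.
full_out = [w * x for w, x in zip(full_weight, full_x)]
sharded_out = []
for rank, local_x in enumerate(shards):
    start, end = shard_bounds(shard_sizes, rank)
    local_weight = full_weight[start:end]
    sharded_out += [w * x for w, x in zip(local_weight, local_x)]

# Issue 2: filtering each shard on its own runs without error but gives a
# different result from filtering the full grid.
reference = low_pass(full_x, keep=2)
naive = [v for local_x in shards for v in low_pass(local_x, keep=2)]

# Gather the shards, filter on the full grid, then split again.
gathered = [v for local_x in shards for v in local_x]
filtered = low_pass(gathered, keep=2)
gathered_out = [v for r in ranks for v in filtered[slice(*shard_bounds(shard_sizes, r))]]

naive_error = max(abs(a - b) for a, b in zip(naive, reference))
gathered_error = max(abs(a - b) for a, b in zip(gathered_out, reference))
print(f"per-shard filter error: {naive_error:.3f}")
print(f"gather-then-filter error: {gathered_error:.3e}")
```

In this sketch the per-shard filter raises no exception, which is what makes the second issue easy to miss; only the comparison against the full-grid reference exposes it.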

What issue or task does this change relate to?

Additional notes

The gather_tensor call in _apply_truncation introduces one all-gather collective per forward pass, which could become a bottleneck for very large grids.
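
As a rough back-of-the-envelope sketch of that cost: after an all-gather every rank holds the full tensor, so with equal shards each rank receives the (num_gpus - 1) / num_gpus fraction it does not own. The function name and the grid, channel, and dtype sizes below are illustrative assumptions, not values from the PR.

```python
def all_gather_cost(grid_points, channels, bytes_per_element, num_gpus):
    """Estimate per-rank cost of all-gathering a grid-sharded tensor.

    Assumes equal shard sizes. Returns (bytes held per rank after the gather,
    bytes each rank receives from the other ranks).
    """
    full_bytes = grid_points * channels * bytes_per_element
    received = full_bytes * (num_gpus - 1) // num_gpus
    return full_bytes, received


# Illustrative numbers only: 1e6 grid points, 100 channels, float32, 4 GPUs.
full_bytes, received = all_gather_cost(1_000_000, 100, 4, 4)
print(f"held per rank after gather: {full_bytes / 1e6:.0f} MB")
print(f"received per rank: {received / 1e6:.0f} MB")
```

The memory held per rank does not shrink as more GPUs are added, which is why the gather scales poorly with grid size even when the rest of the model is sharded.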

Todos

Waiting for the shard_shapes refactor to be merged: refactor(models, training): shard_shapes #964
Testing with deterministic model
Testing with ensemble model
Integration test

As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines about those please refer to https://anemoi.readthedocs.io/en/latest/

By opening this pull request, I affirm that all authors agree to the Contributor License Agreement.

fix: Sharding for SpectralOrnsteinConnection.

fe90981

github-project-automation Bot added this to Anemoi-dev May 7, 2026

github-project-automation Bot moved this to To be triaged in Anemoi-dev May 7, 2026

jakob-schloer self-assigned this May 7, 2026

github-actions Bot added the bug and training labels May 7, 2026

jakob-schloer assigned japols May 7, 2026

github-actions Bot added the models label May 7, 2026

jakob-schloer requested a review from ssmmnn11 May 7, 2026 09:08

JPXKQX added the ATS Approval Not Needed label May 7, 2026

jakob-schloer wants to merge 1 commit into main from fix/sharding_lrc

jakob-schloer commented May 7, 2026 (edited by JPXKQX)


3 participants
