
fix: update TE GroupedLinear integration for single-parameter mode #1680

Open
SwekeR-463 wants to merge 4 commits into NVIDIA-NeMo:main from SwekeR-463:fix/te-grp-linear-to-1-param

Conversation

@SwekeR-463
Contributor

What does this PR do ?

Update GroupedExpertsTE to use TE GroupedLinear single-parameter mode and keep AutoModel’s MoE state dict format unchanged.

Changelog

  • Switched TE MoE expert construction to single_grouped_parameter=True.
  • Updated GroupedExpertsTE weight handling to read and write the grouped weight parameter directly.
  • Kept AutoModel MoE checkpoint serialization in stacked tensor format.
  • Updated EP grad-scaling name matching for the grouped TE parameter layout.
  • Added and updated unit tests for the new TE path.
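To illustrate what the second and third changelog items imply, here is a minimal sketch of converting between the two layouts: the stacked per-expert checkpoint format that AutoModel keeps, and a single fused buffer like the one TE's single-parameter mode holds. All names and shapes here are hypothetical stand-ins (plain Python lists instead of tensors), not the actual NeMo or Transformer Engine implementation.

```python
def to_single_grouped(stacked):
    """Flatten a stacked [num_experts][out][in] layout into one contiguous
    buffer, mimicking a single grouped parameter. Hypothetical sketch."""
    flat = []
    for expert_weight in stacked:
        for row in expert_weight:
            flat.extend(row)
    return flat

def from_single_grouped(flat, num_experts, out_features, in_features):
    """Recover the stacked checkpoint layout from the fused buffer, so the
    serialized MoE state dict format stays unchanged. Hypothetical sketch."""
    per_expert = out_features * in_features
    stacked = []
    for e in range(num_experts):
        chunk = flat[e * per_expert:(e + 1) * per_expert]
        stacked.append([chunk[r * in_features:(r + 1) * in_features]
                        for r in range(out_features)])
    return stacked
```

The key property the PR relies on is that this conversion round-trips exactly, so checkpoints written before and after the change remain interchangeable.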

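The EP grad-scaling item means the name filter that selects expert parameters must now match the grouped layout as well as the per-expert one. A minimal sketch, assuming purely illustrative parameter names (the real NeMo naming may differ):

```python
import re

# Hypothetical parameter names -- per-expert layout has an expert index,
# the grouped TE layout does not.
PER_EXPERT = re.compile(r"experts\.(\d+)\.linear_fc[12]\.weight")
GROUPED = re.compile(r"experts\.linear_fc[12]\.weight")

def needs_ep_grad_scaling(name):
    """Return True if EP grad scaling applies to this parameter name,
    covering both the old per-expert and the new grouped layouts."""
    return bool(PER_EXPERT.fullmatch(name) or GROUPED.fullmatch(name))
```

With only the per-expert pattern, the grouped parameter would silently skip grad scaling, which is the failure mode the changelog entry addresses.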
Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

Signed-off-by: SwekeR-463 <swekerswasti@gmail.com>
@copy-pr-bot

copy-pr-bot bot commented Apr 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@SwekeR-463 SwekeR-463 changed the title (fix): update TE GroupedLinear integration for single-parameter mode fix: update TE GroupedLinear integration for single-parameter mode Apr 4, 2026
@akoumpa
Contributor

akoumpa commented Apr 4, 2026

/ok to test 5346557

@hemildesai
Contributor

Hi @SwekeR-463, thanks a lot for your contribution. Do you have an example wandb run verifying the convergence before and after this change?

@SwekeR-463
Contributor Author

> Hi @SwekeR-463, thanks a lot for your contribution. Do you have an example wandb run verifying the convergence before and after this change?

I haven't done any runs yet; I will run them and update here.

@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Apr 10, 2026
@akoumpa
Contributor

akoumpa commented Apr 10, 2026

/ok to test bf85bc9

@SwekeR-463
Contributor Author

> Hi @SwekeR-463, thanks a lot for your contribution. Do you have an example wandb run verifying the convergence before and after this change?

Hello @hemildesai, I attempted to run experiments to verify convergence before and after the change, but ran into repeated setup issues on my end. Rather than delay further, I wanted to let you know. 🙂

@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Apr 13, 2026
@akoumpa
Contributor

akoumpa commented Apr 16, 2026

/ok to test f6f3fcd

@akoumpa
Contributor

akoumpa commented Apr 16, 2026

Hi @SwekeR-463 I restarted CI, I apologize for the long delay.

@chtruong814 chtruong814 added waiting-on-customer Waiting on the original author to respond and removed needs-follow-up Issue needs follow-up labels Apr 17, 2026