
feat: FSDP2 w/ weight prefetching and async TP optimization #1711

Merged
ZhiyuLi-Nvidia merged 19 commits into main from zhiyul/fsdp_optimized
Apr 10, 2026

Conversation

@ZhiyuLi-Nvidia
Contributor

ZhiyuLi-Nvidia commented Apr 7, 2026

What does this PR do ?

Perf improvements:

  • llama3-70b PEFT: MFU 31%, up from 23%
  • llama3-70b PEFT, 2 nodes: MFU 29%, up from 19%
  • qwen2.5-32b PEFT: MFU 30%, up from 25%
  • qwen2.5-32b PEFT, 2 nodes: MFU 28%, up from 25%
  • llama3-70b full weight tuning: MFU 39%, up from 36%

FSDP2 throughput optimizations + removal of combined projection modules.

FSDP2 Optimizations (FSDP2Config new flags)

  • Weight prefetching: configurable forward/backward prefetch depth
  • Async tensor parallel (enable_async_tensor_parallel): overlaps ReduceScatter with compute via _micro_pipeline_tp; requires sequence_parallel=True
  • Per-layer compile (enable_compile): applies torch.compile after checkpoint load to avoid _orig_mod key mismatches
  • patch_is_packed_sequence: removes per-layer CPU-GPU sync (aten::is_nonzero) and enables static shapes for compile
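The new flags above could be grouped into a config object along these lines. This is a minimal sketch, not the PR's actual `FSDP2Config` definition: field names other than `enable_async_tensor_parallel`, `sequence_parallel`, and `enable_compile` (e.g. the prefetch-depth field names) are assumptions, and the validation rule simply encodes the stated requirement that async TP needs sequence parallelism.

```python
from dataclasses import dataclass


@dataclass
class FSDP2Config:
    # Hypothetical sketch of the flags described above; prefetch field
    # names are assumed, not taken from the PR.
    forward_prefetch_depth: int = 1    # layers of weights to all-gather ahead in forward
    backward_prefetch_depth: int = 1   # prefetch depth for backward
    enable_async_tensor_parallel: bool = False  # overlap ReduceScatter with compute
    sequence_parallel: bool = False    # required when async TP is enabled
    enable_compile: bool = False       # per-layer torch.compile after checkpoint load

    def __post_init__(self) -> None:
        # Encodes the constraint stated in the PR description.
        if self.enable_async_tensor_parallel and not self.sequence_parallel:
            raise ValueError(
                "enable_async_tensor_parallel requires sequence_parallel=True"
            )
```

Usage would look like `FSDP2Config(enable_async_tensor_parallel=True, sequence_parallel=True)`; constructing it with async TP but without sequence parallelism raises immediately rather than failing later inside `_micro_pipeline_tp`.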

Compile Compatibility

  • TPLinear: replaces F.linear with bmm to avoid aten.view crashes on sharded DTensor in AOT-autograd backward
  • _patch_dtensor_spec_hash_for_symint: fixes DTensorSpec hash crash when torch.compile uses symbolic shapes

Combined Projection Deprecation

  • Combined projections originally reduced the overhead of three small matmuls, but the gain is minimal because almost 90% of the all-gather communication is already overlapped with computation
  • Removes CombinedQKVAttentionMixin, CombinedGateUpMLP, and CombinedProjectionStateDictAdapter
  • Llama and Qwen2 revert to standard HF-style separate q/k/v and gate/up projections, which simplifies TP sharding and state dict handling
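The layout being reverted to looks roughly like the standard HF attention projections below. This is a generic sketch, not code from the PR: the class name and dimensions are illustrative, and real models shard these three `nn.Linear` modules independently for TP, which is what simplifies sharding and state-dict handling relative to one fused qkv weight.

```python
import torch
import torch.nn as nn


class SeparateQKV(nn.Module):
    # HF-style layout: three independent projections instead of one fused
    # qkv matmul. Each weight is a separate state-dict entry (q_proj.weight,
    # k_proj.weight, v_proj.weight), so no adapter is needed to split or
    # merge a combined tensor at load/save time.
    def __init__(self, hidden_size: int, head_dim: int) -> None:
        super().__init__()
        self.q_proj = nn.Linear(hidden_size, head_dim, bias=False)
        self.k_proj = nn.Linear(hidden_size, head_dim, bias=False)
        self.v_proj = nn.Linear(hidden_size, head_dim, bias=False)

    def forward(self, x: torch.Tensor):
        return self.q_proj(x), self.k_proj(x), self.v_proj(x)
```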

Test plan

  • Llama/Qwen2 unit tests
  • 8-node Llama 3.1 70B pretrain benchmark

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

@copy-pr-bot

copy-pr-bot Bot commented Apr 7, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ZhiyuLi-Nvidia
Contributor Author

/ok to test 1b4762a

@ZhiyuLi-Nvidia
Contributor Author

/ok to test 0d9ffef

@ZhiyuLi-Nvidia
Contributor Author

/ok to test 295ed82

@ZhiyuLi-Nvidia
Contributor Author

/ok to test 03ee22d

@ZhiyuLi-Nvidia
Contributor Author

/ok to test 2ab2123

Contributor

@jgerh jgerh left a comment


Did another tech pubs pass on docs/performance-summary.md. Found a few duplicate entries and link issues; please check and verify the suggestions.

Comment thread docs/performance-summary.md
Comment thread docs/performance-summary.md Outdated
@akoumpa
Contributor

akoumpa commented Apr 10, 2026

/ok to test 4120722

ZhiyuLi-Nvidia and others added 18 commits April 10, 2026 05:09
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
…frastructure

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
@ZhiyuLi-Nvidia
Contributor Author

/ok to test 77d24bd

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
@ZhiyuLi-Nvidia
Contributor Author

/ok to test 592f9c8


Labels

r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.


3 participants