
feat: FSDP2 w/ weight prefetching and async TP optimization #1711

Merged
ZhiyuLi-Nvidia merged 19 commits into main from zhiyul/fsdp_optimized
Apr 10, 2026

Conversation

@ZhiyuLi-Nvidia
Contributor

ZhiyuLi-Nvidia commented Apr 7, 2026

What does this PR do ?

Perf improvements:

  • llama3-70b PEFT: MFU 31%, up from 23%
  • llama3-70b PEFT, 2 nodes: MFU 29%, up from 19%
  • qwen2.5-32b PEFT: MFU 30%, up from 25%
  • qwen2.5-32b PEFT, 2 nodes: MFU 28%, up from 25%
  • llama3-70b full weight tuning: MFU 39%, up from 36%

FSDP2 throughput optimizations + removal of combined projection modules.

FSDP2 Optimizations (FSDP2Config new flags)

  • Weight prefetching: configurable forward/backward prefetch depth
  • Async tensor parallel (enable_async_tensor_parallel): overlaps ReduceScatter with compute via _micro_pipeline_tp; requires sequence_parallel=True
  • Per-layer compile (enable_compile): applies torch.compile after checkpoint load to avoid _orig_mod key mismatches
  • patch_is_packed_sequence: removes per-layer CPU-GPU sync (aten::is_nonzero) and enables static shapes for compile
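The new flags above could be grouped into a config object along these lines. This is a minimal sketch, not the PR's actual `FSDP2Config` definition: field names other than `enable_async_tensor_parallel`, `sequence_parallel`, and `enable_compile` (e.g. the prefetch-depth field names) are assumptions, and the validation rule simply encodes the stated requirement that async TP needs sequence parallelism.

```python
from dataclasses import dataclass


@dataclass
class FSDP2Config:
    # Hypothetical sketch of the flags described above; prefetch field
    # names are assumed, not taken from the PR.
    forward_prefetch_depth: int = 1    # layers of weights to all-gather ahead in forward
    backward_prefetch_depth: int = 1   # prefetch depth for backward
    enable_async_tensor_parallel: bool = False  # overlap ReduceScatter with compute
    sequence_parallel: bool = False    # required when async TP is enabled
    enable_compile: bool = False       # per-layer torch.compile after checkpoint load

    def __post_init__(self) -> None:
        # Encodes the constraint stated in the PR description.
        if self.enable_async_tensor_parallel and not self.sequence_parallel:
            raise ValueError(
                "enable_async_tensor_parallel requires sequence_parallel=True"
            )
```

Usage would look like `FSDP2Config(enable_async_tensor_parallel=True, sequence_parallel=True)`; constructing it with async TP but without sequence parallelism raises immediately rather than failing later inside `_micro_pipeline_tp`.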

Compile Compatibility

  • TPLinear: replaces F.linear with bmm to avoid aten.view crashes on sharded DTensor in AOT-autograd backward
  • _patch_dtensor_spec_hash_for_symint: fixes DTensorSpec hash crash when torch.compile uses symbolic shapes

Combined Projection Deprecation

  • Combined projections originally reduced the overhead of three small matmuls, but the gain is minimal because almost 90% of the all-gather communication is already overlapped with computation
  • Removes CombinedQKVAttentionMixin, CombinedGateUpMLP, and CombinedProjectionStateDictAdapter
  • Llama and Qwen2 revert to standard HF-style separate q/k/v and gate/up projections, which simplifies TP sharding and state dict handling
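The layout being reverted to looks roughly like the standard HF attention projections below. This is a generic sketch, not code from the PR: the class name and dimensions are illustrative, and real models shard these three `nn.Linear` modules independently for TP, which is what simplifies sharding and state-dict handling relative to one fused qkv weight.

```python
import torch
import torch.nn as nn


class SeparateQKV(nn.Module):
    # HF-style layout: three independent projections instead of one fused
    # qkv matmul. Each weight is a separate state-dict entry (q_proj.weight,
    # k_proj.weight, v_proj.weight), so no adapter is needed to split or
    # merge a combined tensor at load/save time.
    def __init__(self, hidden_size: int, head_dim: int) -> None:
        super().__init__()
        self.q_proj = nn.Linear(hidden_size, head_dim, bias=False)
        self.k_proj = nn.Linear(hidden_size, head_dim, bias=False)
        self.v_proj = nn.Linear(hidden_size, head_dim, bias=False)

    def forward(self, x: torch.Tensor):
        return self.q_proj(x), self.k_proj(x), self.v_proj(x)
```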

Test plan

  • Llama/Qwen2 unit tests
  • 8-node Llama 3.1 70B pretrain benchmark

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

@copy-pr-bot

copy-pr-bot Bot commented Apr 7, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ZhiyuLi-Nvidia
Contributor Author

/ok to test 1b4762a

@ZhiyuLi-Nvidia
Contributor Author

/ok to test 0d9ffef

@ZhiyuLi-Nvidia
Contributor Author

/ok to test 295ed82

@ZhiyuLi-Nvidia
Contributor Author

/ok to test 03ee22d

@ZhiyuLi-Nvidia
Contributor Author

/ok to test 2ab2123

Contributor

@jgerh jgerh left a comment


Did another tech pubs pass on docs/performance-summary.md. Found a few duplicate entries and link issues; please check and verify the suggestions.

Comment thread docs/performance-summary.md
Comment thread docs/performance-summary.md Outdated
@akoumpa
Contributor

akoumpa commented Apr 10, 2026

/ok to test 4120722

ZhiyuLi-Nvidia and others added 18 commits April 10, 2026 05:09
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
…frastructure

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
@ZhiyuLi-Nvidia
Contributor Author

/ok to test 77d24bd

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
@ZhiyuLi-Nvidia
Contributor Author

/ok to test 592f9c8


Labels

r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.


3 participants