
fix: FSDP2 meta-device crash for Qwen3.5 GatedDeltaNet fp32 params#1813

Merged
HuiyingLi merged 7 commits into r0.4.0 from huiyingl/fix-qwen35-fsdp2-meta-device on Apr 15, 2026
Conversation

@HuiyingLi
Contributor

HuiyingLi commented Apr 13, 2026

Summary

  • PR #1711 (feat: FSDP2 weight prefetching and async TP optimization) changed _should_load_before_shard to return False for multi-GPU DP, so models stay on the meta device through FSDP wrapping. This broke the __dict__ trick in patch_hf_model from PR #1710 (fix: Qwen3.5 dense CP support and FSDP mixed-dtype fix).
  • Move the gate computation (g = -A_log.exp() * softplus(a + dt_bias)) into _Fp32ParamHolder.forward() so FSDP's unshard/reshard lifecycle fires naturally around the fp32 params.
  • Override CPAwareGatedDeltaNet forward for both CP and non-CP paths to route through the holder. Class swap is now unconditional (needed for the non-CP forward override).
  • Add __getattr__ on the class for checkpoint/state_dict access to moved params.
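As a quick numeric illustration of the gate formula quoted above (the names A_log, a, and dt_bias come from the PR description; the scalar values and helper functions here are made up for illustration, not the model's actual tensor code):

```python
import math

def softplus(x: float) -> float:
    # Numerically stable softplus: log(1 + exp(x))
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def gate(A_log: float, a: float, dt_bias: float) -> float:
    # g = -A_log.exp() * softplus(a + dt_bias), per the PR summary,
    # written out for scalars.
    return -math.exp(A_log) * softplus(a + dt_bias)

g = gate(A_log=0.0, a=0.0, dt_bias=0.0)
# softplus(0) = ln 2, so g = -ln 2 ≈ -0.6931
```

Since A_log, a, and dt_bias are the fp32 parameters the PR moves into _Fp32ParamHolder, computing g inside the holder's forward() is what lets FSDP's unshard/reshard hooks fire around exactly these parameters.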

Test plan

🤖 Generated with Claude Code

PR #1711 changed _should_load_before_shard to return False for multi-GPU
DP, so models stay on meta device through FSDP wrapping. This broke the
__dict__ trick in PR #1710's patch_hf_model.

Move the gate computation into _Fp32ParamHolder.forward() so FSDP's
unshard/reshard lifecycle fires naturally. Override CPAwareGatedDeltaNet
forward for both CP and non-CP paths to route through the holder.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented Apr 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@HuiyingLi
Contributor Author

/ok to test a08e559

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@HuiyingLi
Contributor Author

/claude review

@HuiyingLi
Contributor Author

/ok to test b9a7513

@HuiyingLi
Contributor Author

/claude review


_logger = logging.getLogger(__name__)
patched = 0
patched_classes = set()
Contributor


Bug: patched_classes is local to each patch_hf_model call, but CPAwareGatedDeltaNet inherits from Qwen3_5GatedDeltaNet, so a second invocation would pass the isinstance check on line 515 and re-wrap cls.__getattr__, creating a growing chain of wrappers (each calling the next, all checking _fp32_params).

This may not happen today since parallelizer.py only calls it once, but it's a latent bug. A simple guard would be a sentinel on the class:

Suggested change (keep the existing line and add a module-level sentinel name below it):

patched_classes = set()
_PATCHED_ATTR = "_fp32_getattr_patched"

Then at line 537:

if cls not in patched_classes and not getattr(cls, _PATCHED_ATTR, False):
    cls.__getattr__ = _make_fp32_getattr(cls.__getattr__)
    setattr(cls, _PATCHED_ATTR, True)
    patched_classes.add(cls)
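The wrapper-chain failure mode and the sentinel guard can be demonstrated in isolation. Everything below (make_getattr, Dummy, patch) is an illustrative stand-in for the PR's _make_fp32_getattr and patch_hf_model, not the actual code:

```python
CALLS = []

def make_getattr(prev):
    # Mimics _make_fp32_getattr: wraps the class's previous __getattr__.
    def wrapped(self, name):
        CALLS.append("wrapper")
        return prev(self, name)
    return wrapped

class Dummy:
    def __getattr__(self, name):
        raise AttributeError(name)

_PATCHED_ATTR = "_fp32_getattr_patched"

def patch(cls):
    # Sentinel guard: wrap __getattr__ at most once per class,
    # even across repeated patch() invocations.
    if not getattr(cls, _PATCHED_ATTR, False):
        cls.__getattr__ = make_getattr(cls.__getattr__)
        setattr(cls, _PATCHED_ATTR, True)

patch(Dummy)
patch(Dummy)  # second call is a no-op thanks to the sentinel

try:
    Dummy().missing
except AttributeError:
    pass
# Without the guard, CALLS would be ["wrapper", "wrapper"]: a chain.
assert CALLS == ["wrapper"]
```

Without the sentinel, the second patch() would wrap the already-wrapped __getattr__, and every attribute miss would walk the whole chain, exactly the latent bug described above.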

…guard

Add unit tests for:
- _Fp32ParamHolder.forward gate computation and dtype preservation
- _compute_gate routing through holder vs inline fallback
- patch_hf_model sentinel preventing __getattr__ re-wrapping

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@HuiyingLi
Contributor Author

/ok to test cea2a0c

Add 14 new tests covering the critical _forward_no_cp method (lines
91-193) and forward() dispatch logic (lines 207-213) to satisfy
codecov/patch requirements for PR #1813:

- _forward_no_cp basic forward, cache_params=None, causal_conv1d_fn
  fallback, causal_conv1d_fn set, attention_mask, GQA repeat-interleave,
  _compute_gate delegation, and output dtype
- forward() dispatch when _cp_mesh is None or size <= 1, parameter
  pass-through, and extra CP kwargs
- _make_fp32_getattr fallback to AttributeError and real attr resolution
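The dispatch condition tested above (route to the non-CP path when _cp_mesh is None or its size is <= 1) can be sketched with stubs. DummyMesh and GatedDeltaNetStub are hypothetical stand-ins modeling only the dispatch logic, not CPAwareGatedDeltaNet's real forward:

```python
class DummyMesh:
    # Minimal stand-in for a device mesh: only size() is needed here.
    def __init__(self, n: int):
        self._n = n

    def size(self) -> int:
        return self._n

class GatedDeltaNetStub:
    def __init__(self, cp_mesh=None):
        self._cp_mesh = cp_mesh

    def forward(self, x):
        # Non-CP path when there is no CP mesh, or the mesh has one rank.
        if self._cp_mesh is None or self._cp_mesh.size() <= 1:
            return self._forward_no_cp(x)
        return self._forward_cp(x)

    def _forward_no_cp(self, x):
        return ("no_cp", x)

    def _forward_cp(self, x):
        return ("cp", x)

assert GatedDeltaNetStub().forward(1) == ("no_cp", 1)
assert GatedDeltaNetStub(DummyMesh(1)).forward(2) == ("no_cp", 2)
assert GatedDeltaNetStub(DummyMesh(4)).forward(3) == ("cp", 3)
```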

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@HuiyingLi
Contributor Author

/claude review

@HuiyingLi
Contributor Author

/ok to test a17d84a


claude Bot left a comment


LGTM


Labels

r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.
