fix(inference): restore NemotronH mixer.D after vLLM 0.22 layerwise reload by hallerite · Pull Request #2714 · PrimeIntellect-ai/prime-rl

hallerite · 2026-06-04T13:36:51Z

Summary

After a weight update, NemotronH inference produced NaN logits / garbage generations. Root cause: vLLM 0.22's layerwise online reload (load_weights_checkpoint_layerwise) drops the weight load for every Mamba layer's mixer.D (the SSD skip connection). The reload materializes each layer's tensors as uninitialized empty_strided memory and then replays buffered loads; mixer.D's load is dropped, so it is never written and reads back as non-deterministic garbage — NaN, inf, or huge finite values (~1e17) — which makes the logits NaN.

Direct measurement confirms that this is not a dtype/stride issue: post-reload D is a contiguous bf16 tensor of shape [8], identical in dtype, shape, and stride to its neighbours dt_bias and A, which load fine.

The symptom is delayed/non-deterministic because of async lag: the orchestrator runs a couple of steps on the initial weights, so the garbage only appears at the first step that uses reloaded weights (Mismatch KL jumps from ~1e-3 to 2–5).

This supersedes #2701: the existing monkey_patch_vllm_layerwise_reload_alias_buffers (which #2701 tweaks) targets conv_weights, but that alias is a red herring — vLLM's reload finalize re-derives it correctly. The monkey-patch's copy-back loop instead getattrs conv_weights after it's been delattr'd, producing AttributeError: 'MambaMixer2' object has no attribute 'conv_weights' (500). #2701 crashes identically and does not address the real mixer.D drop.

Why only `mixer.D` (and not `e_score_correction_bias`)

An earlier per-tensor norm-delta also flagged the MoE router's gate.e_score_correction_bias. Direct post-reload measurement shows that one is a false positive: its post-reload value equals the received broadcast value exactly (norms match to 4 decimals across layers). It only looked changed because the trainer broadcasts it shifted by -bias.min() (converting_nemotron_h.py, HF→prime; the prime→HF path renames but never re-adds the min) so the ~57-magnitude bias fits bf16 without its ~0.04 inter-expert spread collapsing. That is a routing-invariant constant shift (top-k is invariant to a constant added to every expert; routing weights come from raw sigmoid), so reloading the shifted value is correct — and reversing the shift would reintroduce the bf16 collapse. So e_score_correction_bias is not corrupted, and the fix is scoped to mixer.D only.
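Both claims above (a constant shift leaves top-k selection unchanged, and an unshifted bias of magnitude ~57 loses a ~0.04 spread in bf16) can be checked with a small stdlib-only sketch. The bias and score values are made up for illustration, `to_bf16` is a hypothetical helper that emulates bf16 storage by rounding a float32 bit pattern to its top 16 bits, and selecting experts by "score plus bias" is a simplified assumption about the router, not code from this PR.

```python
import struct


def to_bf16(x: float) -> float:
    """Emulate bf16 storage: keep the top 16 bits of the float32 pattern,
    rounding to nearest even (valid for finite, non-overflowing inputs)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even on the low 16 bits
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def top_k(scores, bias, k):
    """Indices of the k largest (score + bias) entries, sorted for comparison."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    return sorted(ranked[:k])


# Illustrative values only: a ~57-magnitude bias with a ~0.04 inter-expert spread.
bias = [57.00, 57.01, 57.02, 57.04]
shifted = [b - min(bias) for b in bias]  # the -bias.min() shift
scores = [0.30, 0.10, 0.25, 0.20]        # stand-in per-expert scores

# Adding the same constant to every expert does not change which experts win.
same_routing = top_k(scores, bias, 2) == top_k(scores, shifted, 2)

# In the range [32, 64) adjacent bf16 values are 0.25 apart, so the raw bias collapses.
raw_bf16 = [to_bf16(b) for b in bias]
shifted_bf16 = [to_bf16(b) for b in shifted]

print(same_routing, raw_bf16, shifted_bf16)
```

Under these assumptions the raw bias rounds to a single bf16 value while the shifted bias keeps four distinct values, which is why reversing the shift would be harmful rather than corrective.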

Changes

- Drop monkey_patch_vllm_layerwise_reload_alias_buffers (call + definition). It crashes on vLLM 0.22, and conv_weights is handled by vLLM's native reload finalize (#42481).
- Restore mixer.D after reload (_restore_reload_corrupted_params in the NCCL weight-update worker): capture the received broadcast value for .mixer.D while streaming into load_weights_checkpoint_layerwise, then restore it via the param's own weight_loader (correct sharding). The received value is by definition the intended one.
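The capture-then-restore pattern described above can be sketched without any vLLM or torch dependency. This is not the PR's code: `capture_suffixed`, `restore_captured`, and `FakeParam` are hypothetical names, plain lists stand in for tensors, and the one-argument `weight_loader` is a simplification of a real parameter loader.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Same suffix tuple as in the PR diff; everything else here is illustrative.
RELOAD_CORRUPTED_SUFFIXES = (".mixer.D",)

NamedWeight = Tuple[str, List[float]]


def capture_suffixed(weights: Iterable[NamedWeight], sink: Dict[str, List[float]]) -> Iterator[NamedWeight]:
    """Pass every (name, value) through unchanged, keeping a copy of matching ones."""
    for name, value in weights:
        if name.endswith(RELOAD_CORRUPTED_SUFFIXES):
            sink[name] = list(value)  # snapshot, independent of the stream's buffer
        yield name, value


class FakeParam:
    """Stand-in for a parameter that owns its loader (assumption for this sketch)."""

    def __init__(self, size: int) -> None:
        self.data = [float("nan")] * size  # simulates uninitialized memory

    def weight_loader(self, loaded: List[float]) -> None:
        self.data[:] = loaded


def restore_captured(params: Dict[str, FakeParam], received: Dict[str, List[float]]) -> None:
    """Write each captured value back through the parameter's own loader."""
    for name, value in received.items():
        params[name].weight_loader(value)


stream = [("layers.0.mixer.dt_bias", [0.1] * 8), ("layers.0.mixer.D", [1.0] * 8)]
received: Dict[str, List[float]] = {}
for _name, _value in capture_suffixed(stream, received):
    pass  # a real reload would consume the stream here

params = {"layers.0.mixer.D": FakeParam(8)}
restore_captured(params, received)
print(params["layers.0.mixer.D"].data)
```

Wrapping the iterator keeps the reload call itself untouched: the snapshot is a side effect of streaming, and the restore runs only after the reload has finished.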

Validation

2-node SLURM RL run (Nemotron-3-Nano-30B, reverse-text):

- Before: generations go to NaN/garbage at the first step using reloaded weights; Mismatch KL 0.001 → 2–5; reward collapses. With the monkey-patch present, the first reload instead fails with a hard 500 AttributeError.
- After (20 steps): completes cleanly, with no NaN and no 500s; Mismatch KL steady at ~1e-3 across all 20 weight updates; reward climbs 0.014 → 0.17 (the model actually learns reverse-text).
- Direct post-reload measurement (pre-restore): mixer.D comes back NaN / inf / 1e17 (non-deterministic, uninitialized), with the same dtype/strides as dt_bias/A; e_score_correction_bias equals the received value exactly.

Notes

The underlying defect is in vLLM's layerwise reload — it conflates "elements copied" with "elements loaded." The reload finalizes a layer when load_numel >= load_numel_total, where:
- load_numel is tracked by CopyCounter, a TorchDispatchMode that adds numel() on every aten.copy_.default op;
- load_numel_total = get_layer_size(layer) counts the layer's param elements (implicitly assuming one copy per element).
These disagree for any loader that writes a param more than once. The Mamba mixer's load_numel_total is 24 (A+D+dt_bias, 8 elems each) and it streams params in the order dt_bias, A_log, D, …. A's loader is composed_weight_loader(sharded_weight_loader(0), -exp), whose composed_loader issues two copy_ calls into the 8-element A: (1) default_weight_loader → param.data.copy_(shard), then (2) param.data.copy_(-exp(param)) to post-process. So CopyCounter attributes 16 to A, while D/dt_bias (plain sharded loader, one copy_) count 8 each.

Result: after dt_bias(8) + A(16), load_numel == 24 == load_numel_total and _layerwise_process finalizes the mixer — materializing it via empty_strided and replaying only dt_bias+A — before D (third in the stream) arrives. D's late load then hits the online_process_loader "Excessive loading" early-return and is dropped, leaving D uninitialized. Measured directly: [LW-PROC] buffered=['dt_bias','A'] D_in=False numel=24/24 in 368/368 observations, and .mixer.D is broadcast exactly once per layer (23×) — so this is a vLLM bug, not a conversion/broadcast bug. (vLLM's own online_process_loader comment acknowledges a sibling case — qconfigs that "load the same weight multiple times" overshooting load_numel_total — but doesn't handle the composed_weight_loader transform case.) This worker-side restore is a workaround that can be removed once fixed upstream.
The e_score_correction_bias -bias.min() shift in converting_nemotron_h.py is intentional and correct (bf16 representability, routing-invariant) — it should not be "reversed".
Requires the separate NemotronH offline-init fix (use_mamba_kernels=False, merged in #2713, "fix(trainer): disable NemotronH HF-Hub mamba kernels for offline init") for the trainer to start under HF_HUB_OFFLINE=1 and exercise this path end-to-end.
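The counting mismatch described in the first note can be reproduced with a toy model. The sketch below is not vLLM code: `LayerReloadSim` is a hypothetical class that mimics only the finalize rule (finalize once copied elements reach the layer's element total) and the dropping of late loads, using the sizes and stream order given above.

```python
from typing import Dict, List


class LayerReloadSim:
    """Toy model of the finalize rule: finalize once load_numel >= load_numel_total."""

    def __init__(self, param_sizes: Dict[str, int]) -> None:
        self.load_numel_total = sum(param_sizes.values())  # one copy per element assumed
        self.load_numel = 0
        self.buffered: List[str] = []
        self.dropped: List[str] = []
        self.finalized = False

    def load(self, name: str, numel: int, copies: int = 1) -> None:
        if self.finalized:
            self.dropped.append(name)  # a late load after finalize is ignored
            return
        self.buffered.append(name)
        self.load_numel += numel * copies  # every copy_ call adds numel()
        if self.load_numel >= self.load_numel_total:
            self.finalized = True


sim = LayerReloadSim({"A": 8, "D": 8, "dt_bias": 8})
sim.load("dt_bias", 8)      # plain sharded loader: one copy_
sim.load("A", 8, copies=2)  # composed loader: copy the shard, then copy -exp(param)
sim.load("D", 8)            # arrives after the layer has already finalized

print(sim.buffered, sim.dropped, f"{sim.load_numel}/{sim.load_numel_total}")
```

With these inputs the model buffers only dt_bias and A, reaches 24/24, and drops D, which matches the [LW-PROC] observation quoted above.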

🤖 Generated with Claude Code

Note

Medium Risk
Changes live inference weight-update behavior for NemotronH; wrong restore logic could corrupt Mamba skip weights, but the fix is narrow (.mixer.D only) and uses received broadcast values plus existing loaders.

Overview
Fixes NemotronH inference after NCCL weight updates on vLLM 0.22 by removing a broken layerwise-reload monkey-patch and re-applying mixer.D from the broadcast stream after load_weights_checkpoint_layerwise.

patches.py: Stops registering monkey_patch_vllm_layerwise_reload_alias_buffers (call and implementation). That patch tried to skip buffer copies that alias parameters; on 0.22 it can raise AttributeError during reload finalize instead of fixing the real issue.

nccl.py: Wraps the incoming weight iterator to snapshot .mixer.D tensors on CPU, runs layerwise reload as before, then _restore_reload_corrupted_params writes them back via each param’s weight_loader (or copy_ fallback), keyed by layers.* suffixes. This works around vLLM finalizing Mamba mixer layers before D is loaded, which left skip-connection weights uninitialized and produced NaN logits after the first reload step.

Scope is NCCL online updates only (not the quantize/kernel path or filesystem loader).

Reviewed by Cursor Bugbot for commit d0d32e3.

…fter vLLM reload

vLLM 0.22's layerwise reload mis-loads exactly two NemotronH per-layer param families through the online-reload path -- mixer.D (Mamba SSD skip) and the MoE router's gate.e_score_correction_bias -- while loading all other weights correctly. mixer.D becomes non-deterministic garbage/inf (NaN logits) and the gate bias gets a wrong value (broken routing), so generations go to NaN after a weight update. Restore both from the received broadcast (correct by definition) via each param's own weight_loader.

Also drop monkey_patch_vllm_layerwise_reload_alias_buffers: it crashes on vLLM 0.22 (AttributeError on the delattr'd conv_weights) and conv_weights is handled correctly by vLLM's native reload finalize.

Supersedes #2701.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

S1ro1 · 2026-06-04T19:25:24Z

+_RELOAD_CORRUPTED_SUFFIXES = (".mixer.D",)
+
+
+def _restore_reload_corrupted_params(model: Module, received: dict[str, torch.Tensor]) -> None:


Let's remove this excessive comment, only the middle part is sufficient I think

Direct measurement of the post-reload (pre-restore) state shows only mixer.D is actually corrupted by vLLM 0.22's layerwise online reload: its weight load is dropped and the param is left as uninitialized empty_strided memory -> non-deterministic garbage (NaN, inf, or huge finite values like 1e17) -> NaN logits. Same dtype (bf16) and strides as its neighbours dt_bias/A, which load fine, so it's a dropped load, not a dtype/stride issue.

The MoE gate.e_score_correction_bias reloads correctly (post-reload value equals the received value exactly). It only appeared corrupted in an earlier norm-delta because the trainer broadcasts it shifted by -bias.min() (converting_nemotron_h.py) for bf16 representability -- a routing-invariant constant shift, not corruption. Restoring it was a no-op, so dropping it from the fix is behaviour-preserving.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

S1ro1

lgtm chief

hallerite marked this pull request as ready for review June 4, 2026 13:40

hallerite and others added 3 commits June 4, 2026 15:40

Merge branch 'main' into fix/nemotron-vllm-reload

4e93192

Merge branch 'main' into fix/nemotron-vllm-reload

80f98e3

chore(inference): drop debug log from NemotronH reload restore

ff53c5c

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

hallerite changed the title ~~fix(inference): restore NemotronH mixer.D + e_score_correction_bias after vLLM reload~~ fix(inference): restore NemotronH mixer.D after vLLM 0.22 layerwise reload Jun 4, 2026

hallerite force-pushed the fix/nemotron-vllm-reload branch from 570ce05 to a10d187 on June 4, 2026 19:07

S1ro1 reviewed Jun 4, 2026

hallerite force-pushed the fix/nemotron-vllm-reload branch from a10d187 to d0d32e3 on June 4, 2026 19:32

S1ro1 approved these changes Jun 4, 2026

hallerite merged commit 7f6ca61 into main Jun 4, 2026
18 checks passed

hallerite deleted the fix/nemotron-vllm-reload branch June 4, 2026 20:00

hallerite merged 5 commits into main from fix/nemotron-vllm-reload
