fix(inference): restore NemotronH mixer.D after vLLM 0.22 layerwise reload #2714

Merged
hallerite merged 5 commits into main from fix/nemotron-vllm-reload
Jun 4, 2026

@hallerite hallerite commented Jun 4, 2026

Summary

After a weight update, NemotronH inference produced NaN logits / garbage generations. Root cause: vLLM 0.22's layerwise online reload (load_weights_checkpoint_layerwise) drops the weight load for every Mamba layer's mixer.D (the SSD skip connection). The reload materializes each layer's tensors as uninitialized empty_strided memory and then replays buffered loads; mixer.D's load is dropped, so it is never written and reads back as non-deterministic garbage (NaN, inf, or huge finite values around 1e17), which makes the logits NaN.

Confirmed by direct measurement that this is not a dtype/stride issue: post-reload D is bf16 [8] contiguous, byte-identical in dtype/shape/stride to its neighbours dt_bias/A, which load fine.

The symptom is delayed/non-deterministic because of async lag: the orchestrator runs a couple of steps on the initial weights, so the garbage only appears at the first step that uses reloaded weights (Mismatch KL jumps from ~1e-3 to 2–5).

This supersedes #2701: the existing monkey_patch_vllm_layerwise_reload_alias_buffers (which #2701 tweaks) targets conv_weights, but that alias is a red herring — vLLM's reload finalize re-derives it correctly. The monkey-patch's copy-back loop instead getattrs conv_weights after it's been delattr'd, producing AttributeError: 'MambaMixer2' object has no attribute 'conv_weights' (500). #2701 crashes identically and does not address the real mixer.D drop.

Why only mixer.D (and not e_score_correction_bias)

An earlier per-tensor norm-delta also flagged the MoE router's gate.e_score_correction_bias. Direct post-reload measurement shows that one is a false positive: its post-reload value equals the received broadcast value exactly (norms match to 4 decimals across layers). It only looked changed because the trainer broadcasts it shifted by -bias.min() (converting_nemotron_h.py, HF→prime; the prime→HF path renames but never re-adds the min) so the ~57-magnitude bias fits bf16 without its ~0.04 inter-expert spread collapsing. That is a routing-invariant constant shift (top-k is invariant to a constant added to every expert; routing weights come from raw sigmoid), so reloading the shifted value is correct — and reversing the shift would reintroduce the bf16 collapse. So e_score_correction_bias is not corrupted, and the fix is scoped to mixer.D only.
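Both halves of that argument (the bf16 collapse and the routing invariance) can be checked with a small pure-Python sketch. The bias and score values below are illustrative, chosen only to match the magnitudes quoted above (~57 with a ~0.04 spread); `to_bf16` and `top_k` are hypothetical helpers, not code from the trainer or vLLM.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a float to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add the rounding bias, then keep only the upper 16 bits.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def top_k(scores, k):
    """Indices of the k largest scores, returned in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return sorted(ranked[:k])


# Illustrative per-expert bias: magnitude ~57, inter-expert spread 0.04.
bias = [57.00, 57.01, 57.02, 57.03, 57.04]

# Stored directly in bf16, the spread is lost (bf16 spacing near 57 is 0.25).
raw_bf16 = [to_bf16(b) for b in bias]

# Shifted by -min(bias), the values sit near zero where bf16 is much finer.
shifted = [b - min(bias) for b in bias]
shifted_bf16 = [to_bf16(b) for b in shifted]

# Top-k selection is unchanged by adding the same constant to every expert.
scores = [0.9, 0.1, 0.5, 0.3, 0.7]  # stand-ins for sigmoid router scores
picked_raw = top_k([s + b for s, b in zip(scores, bias)], 2)
picked_shifted = top_k([s + b for s, b in zip(scores, shifted)], 2)
same_routing = picked_raw == picked_shifted
```

Here the unshifted bias rounds to a single bf16 value, while the shifted bias keeps five distinct values and selects the same experts.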

Changes

  1. Drop monkey_patch_vllm_layerwise_reload_alias_buffers (call + definition). It crashes on vLLM 0.22, and conv_weights is handled by vLLM's native reload finalize (#42481).
  2. Restore mixer.D after reload (_restore_reload_corrupted_params in the NCCL weight-update worker): capture the received broadcast value for .mixer.D while streaming into load_weights_checkpoint_layerwise, then restore it via the param's own weight_loader (correct sharding). The received value is by definition the intended one.
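The capture half of change 2 can be pictured as a pass-through generator around the weight stream. This is a minimal sketch under assumptions, not the code in this PR: `capture_while_streaming` and its `snapshot` parameter are hypothetical names, and plain lists stand in for tensors.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Tuple

# Same suffix tuple as in the PR; everything else here is illustrative.
RELOAD_CORRUPTED_SUFFIXES = (".mixer.D",)


def capture_while_streaming(
    weights: Iterable[Tuple[str, Any]],
    captured: Dict[str, Any],
    suffixes: Tuple[str, ...] = RELOAD_CORRUPTED_SUFFIXES,
    snapshot: Callable[[Any], Any] = lambda t: t,
) -> Iterator[Tuple[str, Any]]:
    """Yield every (name, tensor) unchanged, keeping a copy of matching ones.

    `snapshot` is a hypothetical hook: with real tensors it would make an
    independent copy so the saved value outlives the streaming buffer.
    """
    for name, tensor in weights:
        if name.endswith(suffixes):
            captured[name] = snapshot(tensor)
        yield name, tensor


# Usage: the consumer (standing in for the layerwise reload) sees the full
# stream, while `captured` ends up holding only the .mixer.D entries.
stream = [
    ("model.layers.0.mixer.dt_bias", [0.1]),
    ("model.layers.0.mixer.D", [1.0]),
    ("model.layers.0.mixer.in_proj.weight", [0.2]),
]
captured: Dict[str, Any] = {}
loaded = [name for name, _ in capture_while_streaming(stream, captured)]
```

Because the wrapper is lazy, the capture happens during the same pass that feeds the reload, so no second broadcast is needed.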

Validation

2-node SLURM RL run (Nemotron-3-Nano-30B, reverse-text):

  • Before: generations go to NaN/garbage at the first step using reloaded weights; Mismatch KL 0.001 → 2–5; reward collapses. With the monkey-patch present, the first reload instead fails with a hard 500 AttributeError.
  • After (20 steps): completes clean — no NaN / no 500s, Mismatch KL steady ~1e-3 across all 20 weight updates, reward climbs 0.014 → 0.17 (the model actually learns reverse-text).
  • Direct post-reload measurement (pre-restore): mixer.D comes back NaN / inf / 1e17 (non-deterministic, uninitialized), same dtype/strides as dt_bias/A; e_score_correction_bias equals the received value exactly.

Notes

  • The underlying defect is in vLLM's layerwise reload — it conflates "elements copied" with "elements loaded." The reload finalizes a layer when load_numel >= load_numel_total, where:

    • load_numel is tracked by CopyCounter, a TorchDispatchMode that adds numel() on every aten.copy_.default op;
    • load_numel_total = get_layer_size(layer) counts the layer's param elements (implicitly assuming one copy per element).

    These disagree for any loader that writes a param more than once. The Mamba mixer's load_numel_total is 24 (A+D+dt_bias, 8 elems each) and it streams params in the order dt_bias, A_log, D, …. A's loader is composed_weight_loader(sharded_weight_loader(0), -exp), whose composed_loader issues two copy_ calls into the 8-element A: (1) default_weight_loader → param.data.copy_(shard), then (2) param.data.copy_(-exp(param)) to post-process. So CopyCounter attributes 16 to A, while D/dt_bias (plain sharded loader, one copy_) count 8 each.

    Result: after dt_bias(8) + A(16), load_numel == 24 == load_numel_total and _layerwise_process finalizes the mixer, materializing it via empty_strided and replaying only dt_bias+A, before D (third in the stream) arrives. D's late load then hits the online_process_loader "Excessive loading" early-return and is dropped, leaving D uninitialized. Measured directly: [LW-PROC] buffered=['dt_bias','A'] D_in=False numel=24/24 in 368/368 observations, and .mixer.D is broadcast exactly once per layer (23×), so this is a vLLM bug, not a conversion/broadcast bug. (vLLM's own online_process_loader comment acknowledges a sibling case, qconfigs that "load the same weight multiple times" overshooting load_numel_total, but doesn't handle the composed_weight_loader transform case.) This worker-side restore is a workaround that can be removed once fixed upstream.

  • The e_score_correction_bias -bias.min() shift in converting_nemotron_h.py is intentional and correct (bf16 representability, routing-invariant) — it should not be "reversed".

  • Requires the separate NemotronH offline-init fix (use_mamba_kernels=False, merged in fix(trainer): disable NemotronH HF-Hub mamba kernels for offline init #2713) for the trainer to start under HF_HUB_OFFLINE=1 and exercise this path end-to-end.
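The counting mismatch in the first note can be reproduced with a toy model. This is not vLLM's code: `simulate_layerwise_reload` is a hypothetical function that mimics only the bookkeeping described above (one counter bumped per copy, finalize once the counter reaches the layer size, drop anything that arrives later).

```python
def simulate_layerwise_reload(stream, copies_per_param, numel_total):
    """Toy model of the finalize-on-count logic.

    stream:           list of (param_name, numel) in arrival order
    copies_per_param: how many copy_ calls each param's loader issues
                      (defaults to 1 when a name is absent)
    numel_total:      the layer's total parameter element count
    """
    load_numel = 0
    buffered, dropped = [], []
    finalized = False
    for name, numel in stream:
        if finalized:
            # Mirrors the "Excessive loading" early return: the load is ignored.
            dropped.append(name)
            continue
        buffered.append(name)
        # Every copy_ into the param adds its numel to the counter.
        load_numel += numel * copies_per_param.get(name, 1)
        if load_numel >= numel_total:
            finalized = True
    return buffered, dropped, load_numel


mixer_stream = [("dt_bias", 8), ("A", 8), ("D", 8)]

# A's composed loader copies twice (raw shard, then -exp), so it counts as 16.
buggy = simulate_layerwise_reload(mixer_stream, {"A": 2}, numel_total=24)

# If every loader copied exactly once, the layer would finalize only after D.
ideal = simulate_layerwise_reload(mixer_stream, {}, numel_total=24)
```

With the double copy, the counter reaches 24 after `dt_bias` and `A`, so `D` lands in the dropped list; with one copy per param, all three are buffered before finalization.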

🤖 Generated with Claude Code


Note

Medium Risk
Changes live inference weight-update behavior for NemotronH; wrong restore logic could corrupt Mamba skip weights, but the fix is narrow (.mixer.D only) and uses received broadcast values plus existing loaders.

Overview
Fixes NemotronH inference after NCCL weight updates on vLLM 0.22 by removing a broken layerwise-reload monkey-patch and re-applying mixer.D from the broadcast stream after load_weights_checkpoint_layerwise.

patches.py: Stops registering monkey_patch_vllm_layerwise_reload_alias_buffers (call and implementation). That patch tried to skip buffer copies that alias parameters; on 0.22 it can raise AttributeError on reload finalize instead of fixing the real issue.

nccl.py: Wraps the incoming weight iterator to snapshot .mixer.D tensors on CPU, runs layerwise reload as before, then _restore_reload_corrupted_params writes them back via each param’s weight_loader (or copy_ fallback), keyed by layers.* suffixes. This works around vLLM finalizing Mamba mixer layers before D is loaded, which left skip-connection weights uninitialized and produced NaN logits after the first reload step.
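A rough sketch of the restore step described above, written with plain Python objects so it runs without vLLM or torch. `layers_key`, `restore_params`, and `FakeParam` are hypothetical names for illustration; the `(param, loaded_weight)` loader call shape and the exact key matching are assumptions, not a copy of `_restore_reload_corrupted_params`.

```python
def layers_key(name: str) -> str:
    """Return the name from 'layers.' onward, so differing prefixes still match."""
    idx = name.find("layers.")
    return name[idx:] if idx >= 0 else name


def restore_params(named_params, received):
    """Write captured values back into the params whose layers.* key matches."""
    by_key = {layers_key(n): v for n, v in received.items()}
    restored = []
    for name, param in named_params:
        value = by_key.get(layers_key(name))
        if value is None:
            continue
        loader = getattr(param, "weight_loader", None)
        if loader is not None:
            # Assumed call shape: loader(param, loaded_weight).
            loader(param, value)
        else:
            # Fallback: plain in-place copy (copy_ on a real tensor).
            param.data[:] = value
        restored.append(name)
    return restored


class FakeParam:
    """Stand-in for a model parameter; a list plays the role of the tensor."""

    def __init__(self, data, weight_loader=None):
        self.data = list(data)
        if weight_loader is not None:
            self.weight_loader = weight_loader


nan = float("nan")
d_param = FakeParam([nan] * 8)  # left uninitialized by the reload
a_param = FakeParam([-1.0] * 8)  # loaded correctly, must stay untouched
params = [("backbone.layers.0.mixer.D", d_param),
          ("backbone.layers.0.mixer.A", a_param)]
received = {"model.layers.0.mixer.D": [1.0] * 8}
restored = restore_params(params, received)
```

Going through the param's own loader when one exists is what keeps sharding correct; the in-place copy is only a fallback for params that have none.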

Scope is NCCL online updates only (not the quantize/kernel path or filesystem loader).

Reviewed by Cursor Bugbot for commit d0d32e3.

…fter vLLM reload

vLLM 0.22's layerwise reload mis-loads exactly two NemotronH per-layer param
families through the online-reload path -- mixer.D (Mamba SSD skip) and the MoE
router's gate.e_score_correction_bias -- while loading all other weights
correctly. mixer.D becomes non-deterministic garbage/inf (NaN logits) and the
gate bias gets a wrong value (broken routing), so generations go to NaN after a
weight update. Restore both from the received broadcast (correct by definition)
via each param's own weight_loader.

Also drop monkey_patch_vllm_layerwise_reload_alias_buffers: it crashes on vLLM
0.22 (AttributeError on the delattr'd conv_weights) and conv_weights is handled
correctly by vLLM's native reload finalize. Supersedes #2701.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@hallerite hallerite marked this pull request as ready for review June 4, 2026 13:40
@hallerite hallerite changed the title fix(inference): restore NemotronH mixer.D + e_score_correction_bias after vLLM reload fix(inference): restore NemotronH mixer.D after vLLM 0.22 layerwise reload Jun 4, 2026
@hallerite hallerite force-pushed the fix/nemotron-vllm-reload branch from 570ce05 to a10d187 Compare June 4, 2026 19:07
_RELOAD_CORRUPTED_SUFFIXES = (".mixer.D",)


def _restore_reload_corrupted_params(model: Module, received: dict[str, torch.Tensor]) -> None:

Let's remove this excessive comment, only the middle part is sufficient I think

Direct measurement of the post-reload (pre-restore) state shows only
mixer.D is actually corrupted by vLLM 0.22's layerwise online reload:
its weight load is dropped and the param is left as uninitialized
empty_strided memory -> non-deterministic garbage (NaN, inf, or huge
finite values like 1e17) -> NaN logits. Same dtype (bf16) and strides
as its neighbours dt_bias/A, which load fine, so it's a dropped load,
not a dtype/stride issue.

The MoE gate.e_score_correction_bias reloads correctly (post-reload value
equals the received value exactly). It only appeared corrupted in an
earlier norm-delta because the trainer broadcasts it shifted by
-bias.min() (converting_nemotron_h.py) for bf16 representability -- a
routing-invariant constant shift, not corruption. Restoring it was a
no-op, so dropping it from the fix is behaviour-preserving.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@hallerite hallerite force-pushed the fix/nemotron-vllm-reload branch from a10d187 to d0d32e3 Compare June 4, 2026 19:32

@S1ro1 S1ro1 left a comment

lgtm chief

@hallerite hallerite merged commit 7f6ca61 into main Jun 4, 2026
18 checks passed
@hallerite hallerite deleted the fix/nemotron-vllm-reload branch June 4, 2026 20:00