fix(inference): scope NemotronH reload fix to mixer.D

hallerite · claude · hallerite · commit a10d187a254b · 2026-06-05T00:37:39.000+05:30
Direct measurement of the post-reload (pre-restore) state shows only
mixer.D is actually corrupted by vLLM 0.22's layerwise online reload:
its weight load is dropped and the param is left as uninitialized
empty_strided memory -> non-deterministic garbage (NaN, inf, or huge
finite values like 1e17) -> NaN logits. Same dtype (bf16) and strides
as its neighbours dt_bias/A, which load fine, so it's a dropped load,
not a dtype/stride issue.
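
The docstring added in this commit traces the dropped load to
element-count bookkeeping. The sketch below is a toy model of that
accounting, not vLLM's actual code: ToyLayerLoader, load() and the
copies argument are hypothetical names, and only load_numel,
load_numel_total and the 8/16/24 element counts come from the
measurement described there.

```python
# Toy model of the accounting described in the docstring; NOT vLLM code.
# ToyLayerLoader, load() and `copies` are hypothetical, for illustration only.


class ToyLayerLoader:
    def __init__(self, sizes):
        self.sizes = sizes
        # Expected element count for the layer: 24 for dt_bias + A + D.
        self.load_numel_total = sum(sizes.values())
        self.load_numel = 0
        self.loaded = []
        self.dropped = []
        self.finalized = False

    def load(self, name, copies=1):
        if self.finalized:
            # Stand-in for the "Excessive loading" early return.
            self.dropped.append(name)
            return
        self.loaded.append(name)
        # Every copy is counted, so a loader that copies twice counts double.
        self.load_numel += self.sizes[name] * copies
        if self.load_numel >= self.load_numel_total:
            # The layer is considered complete and is finalized right here.
            self.finalized = True


def simulate():
    loader = ToyLayerLoader({"dt_bias": 8, "A": 8, "D": 8})
    loader.load("dt_bias")      # 8 elements counted
    loader.load("A", copies=2)  # composed loader: 16 counted for 8 elements
    loader.load("D")            # arrives after finalization
    return loader
```

In this model the counter reaches 24 after dt_bias and A, so D lands in
dropped, which matches the observed symptom.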

The MoE gate.e_score_correction_bias reloads correctly (post-reload value
equals the received value exactly). It only appeared corrupted in an
earlier norm-delta comparison because the trainer broadcasts it shifted
by -bias.min() (converting_nemotron_h.py) for bf16 representability -- a
routing-invariant constant shift, not corruption. Restoring it was a
no-op, so dropping it from the fix is behaviour-preserving.
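
Why a constant shift is routing-invariant can be checked with a small
sketch. It assumes the bias enters routing only as an additive term on
the per-expert scores used for top-k selection. top_k_experts and
shift_bias are hypothetical helpers, not the trainer's or vLLM's code,
and the numbers are made up for illustration.

```python
# Hypothetical helpers; assumes the bias is only added to the selection scores.


def top_k_experts(scores, bias, k):
    """Return the indices of the k experts with the highest biased score."""
    biased = [s + b for s, b in zip(scores, bias)]
    order = sorted(range(len(biased)), key=lambda i: biased[i], reverse=True)
    return order[:k]


def shift_bias(bias):
    """Shift the bias so its minimum is zero, i.e. add -min(bias) to every entry."""
    lowest = min(bias)
    return [b - lowest for b in bias]


scores = [0.1, 0.7, 0.3, 0.5]    # illustrative router scores
bias = [-2.0, -1.5, -3.0, -1.0]  # illustrative correction bias

original_choice = top_k_experts(scores, bias, k=2)
shifted_choice = top_k_experts(scores, shift_bias(bias), k=2)
```

Adding the same constant to every expert preserves the ordering, so
both calls select the same experts. In floating point, near-ties could
in principle resolve differently, which this sketch does not cover.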

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
diff --git a/src/prime_rl/inference/vllm/worker/nccl.py b/src/prime_rl/inference/vllm/worker/nccl.py
@@ -24,19 +24,35 @@
 
 logger = init_logger("vllm.inference.vllm.worker_nccl")
 
-# NemotronH params that vLLM 0.22's layerwise reload mis-loads through the online-reload path.
-_RELOAD_CORRUPTED_SUFFIXES = (".mixer.D", ".e_score_correction_bias")
+# NemotronH mixer.D is dropped by vLLM 0.22's layerwise online-reload path (left uninitialized).
+_RELOAD_CORRUPTED_SUFFIXES = (".mixer.D",)
 
 
 def _restore_reload_corrupted_params(model: Module, received: dict[str, torch.Tensor]) -> None:
     """Work around a vLLM 0.22 layerwise-reload bug for NemotronH.
 
-    The online reload mis-loads exactly two per-layer parameter families -- ``mixer.D`` (Mamba SSD
-    skip) and the MoE router's ``gate.e_score_correction_bias`` -- while loading all other weights
-    correctly. ``mixer.D`` ends up as non-deterministic garbage/inf (NaN logits) and the gate bias
-    gets a wrong value (broken expert routing), so generations go to NaN after a weight update.
-
-    The received broadcast value is correct, so restore those params from it via each param's own
+    The online reload drops the weight load for every Mamba layer's ``mixer.D`` (the SSD skip
+    connection): the param is materialized as uninitialized ``empty_strided`` memory and never
+    written, so it reads back as non-deterministic garbage (NaN, inf, or huge finite values like
+    1e17), which makes the logits NaN after a weight update. Measured directly -- D has the same
+    dtype (bf16) and strides as its neighbours ``dt_bias``/``A`` (which load fine), so this is a
+    dropped load, not a dtype/stride issue.
+
+    Precise trigger (instrumented): the mixer streams its params in the order dt_bias, A_log, D, ...
+    and its ``load_numel_total`` is 24 (A+D+dt_bias, 8 each). ``A``'s loader is
+    ``composed_weight_loader(sharded_weight_loader, -exp)``, whose extra copy makes vLLM's
+    ``CopyCounter`` attribute 16 elements to the 8-element ``A``. So after dt_bias (8) + A (16),
+    ``load_numel`` already equals ``load_numel_total`` and ``_layerwise_process`` finalizes the mixer
+    -- materializing it via ``empty_strided`` and replaying only dt_bias+A -- before ``D`` (third in
+    the stream) arrives; ``D``'s late load then hits the "Excessive loading" early-return and is
+    dropped. (D is broadcast correctly, exactly once per layer, so this is a vLLM bug, not a
+    conversion/broadcast bug.)
+
+    (The MoE ``gate.e_score_correction_bias`` reloads correctly -- it only looked corrupted in a
+    norm-delta because the trainer broadcasts it shifted by ``-bias.min()`` for bf16 representability,
+    a routing-invariant constant shift.)
+
+    The received broadcast value is correct, so restore D from it via the param's own
     ``weight_loader`` (which applies the right sharding). Remove once the upstream reload bug is fixed.
     """
 
@@ -182,8 +198,8 @@ def update_weights_from_path(self, weight_dir: str) -> None:
             update_mla_absorbed_weights(model)
             return
 
-        # vLLM 0.22's layerwise reload mis-loads NemotronH mixer.D and MoE gate.e_score_correction_bias
-        # (see _restore_reload_corrupted_params). Capture the correct received values to restore after.
+        # vLLM 0.22's layerwise reload drops NemotronH mixer.D's weight load (see
+        # _restore_reload_corrupted_params). Capture the correct received value to restore after.
         received_reload_fix: dict[str, torch.Tensor] = {}
 
         def _capture_reload_fix(weights):