fix(inference): scope NemotronH reload fix to mixer.D

hallerite · claude · hallerite · commit 570ce052d51d · 2026-06-04T23:12:29.000+05:30
Direct measurement of the post-reload (pre-restore) state shows only
mixer.D is actually corrupted by vLLM 0.22's layerwise online reload:
its weight load is dropped and the param is left as uninitialized
empty_strided memory -> non-deterministic garbage (NaN, inf, or huge
finite values like 1e17) -> NaN logits. D has the same dtype (bf16) and
strides as its neighbours dt_bias/A, which load fine, so it's a dropped
load, not a dtype/stride issue.
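
The symptom above (NaN, inf, or implausibly large finite values) can be
screened for with a small heuristic. This is a minimal sketch in plain
Python rather than torch; the helper name looks_uninitialized and the
1e6 threshold are illustrative assumptions, not part of this commit.

```python
import math


def looks_uninitialized(values, max_abs=1e6):
    """Return True if any value is NaN, inf, or implausibly large.

    Hypothetical helper: max_abs is an arbitrary illustrative threshold,
    not a value taken from the commit or from vLLM.
    """
    return any(not math.isfinite(v) or abs(v) > max_abs for v in values)
```

Uninitialized memory can also happen to hold plausible-looking values,
so a clean result here does not prove the load happened; comparing the
post-reload value against the received value is the stronger check.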

The MoE gate.e_score_correction_bias reloads correctly (post-reload value
equals the received value exactly). It only appeared corrupted in an
earlier norm-delta because the trainer broadcasts it shifted by
-bias.min() (converting_nemotron_h.py) for bf16 representability -- a
routing-invariant constant shift, not corruption. Restoring it was a
no-op, so dropping it from the fix is behaviour-preserving.
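
Why a constant shift is routing-invariant can be shown with a toy
example. This sketch assumes the bias enters expert selection only as
score + bias followed by top-k; the function top_k_experts and the
sample numbers are hypothetical, and the real router works on bf16
tensors rather than Python floats.

```python
def top_k_experts(scores, bias, k):
    """Indices of the k experts with the largest score + bias.

    Ties are broken by the lower expert index. Hypothetical helper for
    illustration only.
    """
    biased = [s + b for s, b in zip(scores, bias)]
    return sorted(range(len(biased)), key=lambda i: (-biased[i], i))[:k]


scores = [0.5, 0.25, 0.75, 0.125]
bias = [0.25, 0.625, -0.25, 0.125]
# Same kind of shift the commit describes: bias - bias.min().
shifted = [b - min(bias) for b in bias]

# Adding one constant to every expert preserves the ordering,
# so the selected experts are identical.
assert top_k_experts(scores, bias, 2) == top_k_experts(scores, shifted, 2)
```

The shifted bias differs from the original in norm, which is how a
norm-delta comparison can flag it even though routing is unchanged.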

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
diff --git a/src/prime_rl/inference/vllm/worker/nccl.py b/src/prime_rl/inference/vllm/worker/nccl.py
@@ -24,19 +24,23 @@
 
 logger = init_logger("vllm.inference.vllm.worker_nccl")
 
-# NemotronH params that vLLM 0.22's layerwise reload mis-loads through the online-reload path.
-_RELOAD_CORRUPTED_SUFFIXES = (".mixer.D", ".e_score_correction_bias")
+# NemotronH mixer.D is dropped by vLLM 0.22's layerwise online-reload path (left uninitialized).
+_RELOAD_CORRUPTED_SUFFIXES = (".mixer.D",)
 
 
 def _restore_reload_corrupted_params(model: Module, received: dict[str, torch.Tensor]) -> None:
     """Work around a vLLM 0.22 layerwise-reload bug for NemotronH.
 
-    The online reload mis-loads exactly two per-layer parameter families -- ``mixer.D`` (Mamba SSD
-    skip) and the MoE router's ``gate.e_score_correction_bias`` -- while loading all other weights
-    correctly. ``mixer.D`` ends up as non-deterministic garbage/inf (NaN logits) and the gate bias
-    gets a wrong value (broken expert routing), so generations go to NaN after a weight update.
+    The online reload drops the weight load for every Mamba layer's ``mixer.D`` (the SSD skip
+    connection): the param is materialized as uninitialized ``empty_strided`` memory and never
+    written, so it reads back as non-deterministic garbage (NaN, inf, or huge finite values like
+    1e17), which makes the logits NaN after a weight update. Measured directly -- D has the same
+    dtype (bf16) and strides as its neighbours ``dt_bias``/``A`` (which load fine), so this is a
+    dropped load, not a dtype/stride issue. (The MoE ``gate.e_score_correction_bias`` reloads
+    correctly -- it only looked corrupted in a norm-delta because the trainer broadcasts it shifted
+    by ``-bias.min()`` for bf16 representability, a routing-invariant constant shift.)
 
-    The received broadcast value is correct, so restore those params from it via each param's own
+    The received broadcast value is correct, so restore D from it via the param's own
     ``weight_loader`` (which applies the right sharding). Remove once the upstream reload bug is fixed.
     """
 
@@ -182,8 +186,8 @@ def update_weights_from_path(self, weight_dir: str) -> None:
             update_mla_absorbed_weights(model)
             return
 
-        # vLLM 0.22's layerwise reload mis-loads NemotronH mixer.D and MoE gate.e_score_correction_bias
-        # (see _restore_reload_corrupted_params). Capture the correct received values to restore after.
+        # vLLM 0.22's layerwise reload drops NemotronH mixer.D's weight load (see
+        # _restore_reload_corrupted_params). Capture the correct received value to restore after.
         received_reload_fix: dict[str, torch.Tensor] = {}
 
         def _capture_reload_fix(weights):