Commit d0d32e3
fix(inference): scope NemotronH reload fix to mixer.D

Direct measurement of the post-reload (pre-restore) state shows only
mixer.D is actually corrupted by vLLM 0.22's layerwise online reload:
its weight load is dropped and the param is left as uninitialized
empty_strided memory -> non-deterministic garbage (NaN, inf, or huge
finite values like 1e17) -> NaN logits. Same dtype (bf16) and strides
as its neighbours dt_bias/A, which load fine, so it's a dropped load,
not a dtype/stride issue.
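The kind of check described above can be sketched as a comparison of each parameter's post-reload value against the value that was received. This is a minimal illustration only, not the code from this commit: `find_bad_params`, the `huge` threshold, and the toy values are all hypothetical, and plain Python lists stand in for tensors.

```python
import math

def find_bad_params(received, post_reload, huge=1e6):
    """Return {name: reason} for params whose post-reload values look wrong.

    `huge` is an assumed plausibility bound, not a value from the commit.
    """
    bad = {}
    for name, expected in received.items():
        actual = post_reload[name]
        if any(math.isnan(v) or math.isinf(v) for v in actual):
            bad[name] = "non-finite"
        elif any(abs(v) > huge for v in actual):
            bad[name] = "implausibly large"
        elif actual != expected:
            bad[name] = "differs from received"
    return bad

# Toy values: mixer.D holds garbage, its neighbours match what was sent.
received = {
    "mixer.D": [1.0, 1.0],
    "mixer.dt_bias": [0.5, -0.25],
    "mixer.A": [-1.0, -2.0],
}
post_reload = {
    "mixer.D": [float("nan"), 1e17],
    "mixer.dt_bias": [0.5, -0.25],
    "mixer.A": [-1.0, -2.0],
}
report = find_bad_params(received, post_reload)
```

With these toy inputs only `mixer.D` is reported. Comparing exact values rather than norms avoids flagging parameters that were deliberately transformed before broadcast.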
The MoE gate.e_score_correction_bias reloads correctly (post-reload value
equals the received value exactly). It only appeared corrupted in an
earlier norm-delta because the trainer broadcasts it shifted by
-bias.min() (converting_nemotron_h.py) for bf16 representability -- a
routing-invariant constant shift, not corruption. Restoring it was a
no-op, so dropping it from the fix is behaviour-preserving.
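A small sketch of why a constant shift leaves routing unchanged while still moving the norm. It assumes the correction bias is only added to the router scores to pick the top-k experts; `select_experts`, `l2_norm`, and all numbers are hypothetical and do not come from converting_nemotron_h.py.

```python
def select_experts(scores, bias, k):
    """Indices of the k experts with the highest score + bias (ties: lower index)."""
    biased = [s + b for s, b in zip(scores, bias)]
    return sorted(range(len(biased)), key=lambda i: (-biased[i], i))[:k]

def l2_norm(xs):
    return sum(x * x for x in xs) ** 0.5

scores = [0.10, 0.40, 0.25, 0.25]        # toy router scores
bias = [8.0, 7.5, 8.25, 7.75]            # toy correction bias
shifted = [b - min(bias) for b in bias]  # shifted by -bias.min(), as described above

# The same constant is subtracted from every expert, so the ranking is unchanged.
same_routing = select_experts(scores, bias, 2) == select_experts(scores, shifted, 2)

# The norm changes a lot, which is how a norm-delta check can mistake
# the shift for corruption.
norm_delta = abs(l2_norm(bias) - l2_norm(shifted))
```

Here `same_routing` is true while `norm_delta` is large, matching the observation that the bias only appeared corrupted under a norm comparison.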
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

1 parent ff53c5c commit d0d32e3
1 file changed
Lines changed: 8 additions & 11 deletions
Diff (line contents not captured):
@@ -24,20 +24,17 @@ 9 deletions, 6 additions
@@ -182,8 +179,8 @@ 2 deletions, 2 additions