Commit 570ce05
fix(inference): scope NemotronH reload fix to mixer.D
Direct measurement of the post-reload (pre-restore) state shows only
mixer.D is actually corrupted by vLLM 0.22's layerwise online reload:
its weight load is dropped and the param is left as uninitialized
empty_strided memory -> non-deterministic garbage (NaN, inf, or huge
finite values like 1e17) -> NaN logits. Same dtype (bf16) and strides
as its neighbours dt_bias/A, which load fine, so it's a dropped load,
not a dtype/stride issue.
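
A minimal sketch of the kind of check that separates a dropped load from a healthy one. It uses plain Python lists in place of tensors. The helper names, the `magnitude_limit` threshold, and the sample values are all illustrative assumptions, not taken from the commit:

```python
import math


def corruption_report(values, magnitude_limit=1e6):
    """Classify each entry of a flat list of floats.

    magnitude_limit is a hypothetical threshold for "implausibly large";
    a real check would pick it from the expected parameter range.
    """
    report = {"nan": 0, "inf": 0, "huge": 0, "ok": 0}
    for v in values:
        if math.isnan(v):
            report["nan"] += 1
        elif math.isinf(v):
            report["inf"] += 1
        elif abs(v) > magnitude_limit:
            report["huge"] += 1
        else:
            report["ok"] += 1
    return report


def is_corrupted(values, magnitude_limit=1e6):
    """True if any entry is NaN, inf, or beyond magnitude_limit."""
    report = corruption_report(values, magnitude_limit)
    return report["ok"] != len(values)


# Made-up post-reload snapshots, for illustration only.
post_reload_D = [float("nan"), 1e17, float("inf"), 0.0]
post_reload_dt_bias = [0.01, -0.2, 0.3, 0.05]

print("mixer.D corrupted:", is_corrupted(post_reload_D))
print("dt_bias corrupted:", is_corrupted(post_reload_dt_bias))
```

Because uninitialized memory is non-deterministic, a single clean pass of such a check does not prove the load happened. Comparing the post-reload value against the received value is the stronger test.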
The MoE gate.e_score_correction_bias reloads correctly (post-reload value
equals the received value exactly). It only appeared corrupted in an
earlier norm-delta comparison because the trainer broadcasts it shifted by
-bias.min() (converting_nemotron_h.py) for bf16 representability -- a
routing-invariant constant shift, not corruption. Restoring it was a
no-op, so dropping it from the fix is behaviour-preserving.
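
The routing-invariance argument can be illustrated with a small sketch. It assumes the correction bias is only added to per-expert scores before top-k selection. The `top_k_experts` helper and all numbers are hypothetical, chosen so that floating-point rounding cannot reorder the experts:

```python
import math


def top_k_experts(scores, bias, k):
    """Return the indices of the k experts with the highest biased score."""
    biased = [s + b for s, b in zip(scores, bias)]
    ranked = sorted(range(len(biased)), key=lambda i: biased[i], reverse=True)
    return sorted(ranked[:k])


def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))


# Illustrative values only.
scores = [0.10, 0.70, 0.30, 0.55]
bias = [-0.25, -0.5, 0.125, 0.0]

# The shift described above: subtract the minimum so every entry is >= 0.
shifted_bias = [b - min(bias) for b in bias]

# Same experts are selected, because every biased score moved by the same constant.
print(top_k_experts(scores, bias, k=2), top_k_experts(scores, shifted_bias, k=2))

# The norms differ, which is why a norm-based comparison flags the parameter.
print(l2_norm(bias), l2_norm(shifted_bias))
```

Under that assumption, a norm-based comparison against the unshifted trainer value reports a difference even though the selected experts are identical.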
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent ff53c5c commit 570ce05
1 file changed
Lines changed: 13 additions & 9 deletions
Diff (line contents not preserved; hunk ranges reconstructed from the line-number columns):
@@ -24,19 +24,23 @@ 7 deletions, 11 additions
@@ -182,8 +186,8 @@ 2 deletions, 2 additions