[codex] Fix vLLM layerwise reload alias buffers #2701
Draft
samsja wants to merge 2 commits into
dcc607a to 305689d
hallerite added a commit that referenced this pull request on Jun 4, 2026
…fter vLLM reload

vLLM 0.22's layerwise reload mis-loads exactly two NemotronH per-layer param families through the online-reload path -- mixer.D (Mamba SSD skip) and the MoE router's gate.e_score_correction_bias -- while loading all other weights correctly. mixer.D becomes non-deterministic garbage/inf (NaN logits) and the gate bias gets a wrong value (broken routing), so generations go to NaN after a weight update. Restore both from the received broadcast (correct by definition) via each param's own weight_loader.

Also drop monkey_patch_vllm_layerwise_reload_alias_buffers: it crashes on vLLM 0.22 (AttributeError on the delattr'd conv_weights) and conv_weights is handled correctly by vLLM's native reload finalize.

Supersedes #2701.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
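The restore step described in this commit message can be sketched roughly as follows. This is a toy, pure-Python model and not the code from the referenced commit: `ToyParam`, `restore_from_broadcast`, and `RESTORE_SUFFIXES` are hypothetical names, plain lists stand in for tensors, and parameter names are assumed to match between the model and the broadcast. The only idea carried over from the message is that each parameter exposes its own `weight_loader(param, loaded_weight)` callable.

```python
import math

# Hypothetical suffixes for the two param families named in the commit message.
RESTORE_SUFFIXES = ("mixer.D", "gate.e_score_correction_bias")


class ToyParam:
    """Stand-in for a model parameter; `data` is a plain list, not a tensor."""

    def __init__(self, data, weight_loader=None):
        self.data = list(data)
        # Assumption: a param may carry its own loader, else a default copy is used.
        self.weight_loader = weight_loader or default_weight_loader


def default_weight_loader(param, loaded_weight):
    """Copy the received values into the existing storage in place."""
    if len(param.data) != len(loaded_weight):
        raise ValueError("shape mismatch between param and loaded weight")
    param.data[:] = loaded_weight


def restore_from_broadcast(named_params, broadcast):
    """Re-apply broadcast values for the affected param families only.

    named_params: dict name -> ToyParam (post-reload, possibly corrupted)
    broadcast:    dict name -> list of floats received from the trainer
    Returns the names that were restored, in iteration order.
    """
    restored = []
    for name, param in named_params.items():
        if not name.endswith(RESTORE_SUFFIXES):
            continue  # all other weights are assumed to have loaded correctly
        if name not in broadcast:
            continue  # nothing received for this param; leave it untouched
        param.weight_loader(param, broadcast[name])
        restored.append(name)
    return restored


params = {
    "layers.0.mixer.D": ToyParam([math.inf, math.nan]),  # corrupted after reload
    "layers.1.mixer.gate.e_score_correction_bias": ToyParam([9.0]),  # wrong value
    "layers.0.mixer.in_proj.weight": ToyParam([0.5, 0.25]),  # loaded correctly
}
received = {
    "layers.0.mixer.D": [1.0, 2.0],
    "layers.1.mixer.gate.e_score_correction_bias": [0.1],
    "layers.0.mixer.in_proj.weight": [0.5, 0.25],
}
restored_names = restore_from_broadcast(params, received)
```

Going through the parameter's own loader, rather than assigning the value directly, keeps any per-parameter handling (sharding, reshaping) in one place; the toy default loader here only does an in-place copy.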
Summary
- […] `MambaMixer2.conv_weights -> conv1d.weight` and skip alias buffers that vLLM intentionally omitted from restore metadata.
- […] `/update_weights` results, not only trainer-side broadcast logs.

Details
The vLLM API server `/update_weights` path reloads checkpoint-format weights layer by layer. NemotronH/Mamba can register buffers on a parent module that alias parameter storage owned by a child module. The previous alias handling only covered direct layer parameters and could either copy stale buffer data back over parameter storage or fail when vLLM omitted an alias buffer from the restored module state.

The patch compares aliased buffers against recursive parameter storage and captured kernel parameters, uses module registries instead of `getattr`, and skips absent alias buffers.

Validation
- `UV_NO_SYNC=1 uv run pytest tests/unit/inference/test_vllm_reload_patches.py`
- `UV_NO_SYNC=1 uv run ruff check src/prime_rl/inference/patches.py tests/unit/inference/test_vllm_reload_patches.py`
- […] `23599` with the external-LB config fix from [codex] Fix external-LB inference config sizing #2705 reached all 4 API servers ready, completed rollouts with `Error 0.0%`, all 16 vLLM workers reloaded checkpoint-format weights, all four `/update_weights` calls returned `200 OK`, inference resumed, and the trainer started step 1. Final log scan found no `ERROR`, `Traceback`, `Fatal`, `ValueError`, `data_parallel_rank`, `conv_weights`, or `Internal Server Error` failures.
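The alias handling described under Details can be illustrated with a toy, pure-Python model. It is not the patch itself: `ToyTensor`, `ToyModule`, `iter_param_storages`, and `find_alias_buffers` are hypothetical names, an integer `storage_id` stands in for the address of a tensor's underlying storage, and the captured kernel parameters mentioned above are left out. The `_parameters`, `_buffers`, and `_modules` registries are named after the dictionaries that `torch.nn.Module` maintains.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTensor:
    storage_id: int  # stand-in for the address of the underlying storage


@dataclass
class ToyModule:
    # Registries named after the ones torch.nn.Module maintains.
    _parameters: dict = field(default_factory=dict)
    _buffers: dict = field(default_factory=dict)
    _modules: dict = field(default_factory=dict)


def iter_param_storages(module):
    """Yield storage ids of parameters on this module and all descendants."""
    for param in module._parameters.values():
        if param is not None:
            yield param.storage_id
    for child in module._modules.values():
        yield from iter_param_storages(child)


def find_alias_buffers(module, buffer_names):
    """Return names of buffers that share storage with some parameter.

    Names absent from the buffer registry are skipped rather than raising,
    unlike a plain getattr lookup on a deleted attribute.
    """
    param_storages = set(iter_param_storages(module))
    aliases = []
    for name in buffer_names:
        buf = module._buffers.get(name)  # registry lookup, no getattr
        if buf is None:
            continue  # alias buffer was omitted or removed; nothing to restore
        if buf.storage_id in param_storages:
            aliases.append(name)
    return aliases


# The child conv1d owns the weight; the parent mixer registers an aliasing buffer.
conv1d = ToyModule(_parameters={"weight": ToyTensor(storage_id=100)})
mixer = ToyModule(
    _buffers={
        "conv_weights": ToyTensor(storage_id=100),  # aliases conv1d.weight
        "running_stat": ToyTensor(storage_id=200),  # independent buffer
    },
    _modules={"conv1d": conv1d},
)
found = find_alias_buffers(mixer, ["conv_weights", "running_stat", "missing_buf"])
```

In this toy setup only `conv_weights` is reported as an alias, because its storage matches a parameter owned by the child module; a check limited to the parent's direct parameters would miss it, and the absent `missing_buf` is skipped without an error.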