Enable Gemma 4 E2B / E4B inference via vLLM RPA#4053
Draft
gagika wants to merge 1 commit into
Draft
Conversation
|
🤖 Hi @gagika, I've received your request, and I'm working on it now! You can track my progress in the logs for more details. |
There was a problem hiding this comment.
This Pull Request successfully enables inference for Gemma 4 E2B/E4B models by implementing cross-layer KV sharing within the vLLM RPA (Ragged Paged Attention) path. The implementation is technically sound, handling the complex layer-to-slot mapping required for shared KV caches while maintaining compatibility with existing inference workflows.
🔍 General Feedback
- Robust KV-Sharing Implementation: The mapping logic in
decoders.pycorrectly handles the redirection of shared layers to donor slots, ensuring efficient memory usage during TPU inference. - Improved Attention Logic: The fix in
attentions.pyto restrict sliding window attention toLOCAL_SLIDINGlayers is a necessary correction for hybrid attention models like Gemma 4. - Clear Documentation: The added recipes in
Run_Gemma4.mdprovide essential guidance on system prompts and sampling parameters required for coherent output from these smaller checkpoints.
| donor_idx = gemma4_small.kv_donor_layer_idx(lyr, layer_types, num_kv_shared) | ||
| if donor_idx is not None: | ||
| cache_index_of[lyr] = cache_index_of[donor_idx] | ||
| else: |
There was a problem hiding this comment.
🟡 Rebuilding the `cache_index_of` dictionary on every call to `_apply_gemma4_small_layers` is redundant since the layer types and sharing pattern are static for a given model configuration. While the overhead is likely negligible for tracing, precomputing this mapping during initialization would be cleaner.
Suggested change
| else: | |
| # tpu-inference allocates one `kv_caches` slot per non-shared layer; | |
| # KV-shared layers reuse the donor's slot. Map decoder layer index -> slot. | |
| cache_index_of: dict[int, int] = {} | |
| next_slot = 0 | |
| for lyr in range(cfg.num_decoder_layers): | |
| donor_idx = gemma4_small.kv_donor_layer_idx(lyr, layer_types, num_kv_shared) | |
| if donor_idx is not None: | |
| cache_index_of[lyr] = cache_index_of[donor_idx] | |
| else: | |
| cache_index_of[lyr] = next_slot | |
| next_slot += 1 |
| if config.system_prompt: | ||
| messages.append({"role": "system", "content": config.system_prompt}) | ||
| messages.append({"role": "user", "content": config.prompt}) | ||
| input_with_chat_template = tokenizer.apply_chat_template( |
There was a problem hiding this comment.
🟢 The message creation logic is duplicated in both `decode_with_vllm` and `decode_with_tunix`. Consider refactoring this into a helper function to improve maintainability.
Suggested change
| input_with_chat_template = tokenizer.apply_chat_template( | |
| messages = [] | |
| if config.system_prompt: | |
| messages.append({"role": "system", "content": config.system_prompt}) | |
| messages.append({"role": "user", "content": config.prompt}) |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Add KV-shared layers wiring adds a
system_promptflag tovllm_decode(required by the E-family-itcheckpoints), and documents a verified inference recipe inRun_Gemma4.md.Key changes
gemma4_small.py— decoder layer returns the kernel-updatedkv_cache(was dropped).decoders.py— KV-shared layers redirect to the donor'skv_cachesslot via a layer→slot map; cache is written back per layer.attentions.py— sliding-window only onLOCAL_SLIDINGlayers; KV-shared layers no longer overwrite the donor's cache (update_kv_cache=False).vllm_decode.py/types.py— newsystem_promptconfig knob, prepended as a system message whenuse_chat_template=True.Run_Gemma4.md— E2B / E4B recipe: system prompt, model-card sampling, fulleos_token_id[1, 106, 50]stop-token set.Tests
On v5p-8, e2b + e4b (vLLM TP=1): cross-checked top-1 logits at greedy vs the native checkpoint — bit-identical. With the documented recipe, models generate coherent output and stop cleanly. CLI and Python API verified.
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-reviewlabel.