Respect user override for Gemma4 attention backend #25547
Only auto-select trtllm_mha (sm100) or triton when the user hasn't explicitly set --attention-backend; otherwise assert the chosen backend is one of trtllm_mha or triton.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
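The rule in this commit message can be sketched as a small standalone function. This is an illustration only, not the code in server_args.py: the function name `choose_gemma4_backend`, the `is_sm100` parameter, and the `GEMMA4_BACKENDS` constant are hypothetical names introduced here.

```python
from typing import Optional

# Backends the commit message names as valid for Gemma4.
# The constant name is hypothetical, used only in this sketch.
GEMMA4_BACKENDS = ("trtllm_mha", "triton")


def choose_gemma4_backend(user_backend: Optional[str], is_sm100: bool) -> str:
    """Pick the attention backend, respecting an explicit user choice."""
    if user_backend is None:
        # No --attention-backend given: auto-select based on hardware.
        return "trtllm_mha" if is_sm100 else "triton"
    # The user chose a backend: keep it, but reject unsupported values.
    assert user_backend in GEMMA4_BACKENDS, (
        f"Gemma4 supports only {GEMMA4_BACKENDS}, got {user_backend!r}"
    )
    return user_backend
```

With this shape, an explicit `triton` on an sm100 GPU is preserved rather than replaced by `trtllm_mha`, which is the override behavior the PR title describes.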
Code Review
This pull request updates the attention backend selection logic for the Gemma4ForConditionalGeneration model, defaulting to trtllm_mha or triton based on hardware support and adding validation for user-specified backends. A review comment identifies a potential logic error where the new assertion could fail if attention_backend is None while other specific backend flags are set, and suggests using get_attention_backends() for safer validation.
/rerun-test test/registered/core/test_gemma4_moe_deterministic.py
Address Gemini review feedback: when the user sets only the split prefill/decode backend flags, self.attention_backend is None and the plain-attribute assertion misses the actual prefill/decode choices. Route validation through get_attention_backends() and fall back to a Gemma4-compatible default if only one split side is set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
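A minimal sketch of the split-flag case described in this commit, under stated assumptions: the `Args` dataclass, its field names, and `resolve_gemma4_backends` are hypothetical stand-ins, and the fallback order shown here (split flag, then the combined flag, then the hardware default) is one plausible reading of the message. It does not reproduce the real get_attention_backends() helper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

GEMMA4_BACKENDS = ("trtllm_mha", "triton")  # hypothetical constant name


@dataclass
class Args:
    """Hypothetical stand-in for the relevant server arguments."""
    attention_backend: Optional[str] = None
    prefill_attention_backend: Optional[str] = None
    decode_attention_backend: Optional[str] = None


def resolve_gemma4_backends(args: Args, is_sm100: bool) -> Tuple[str, str]:
    """Return (prefill, decode) backends, validating the resolved values."""
    default = "trtllm_mha" if is_sm100 else "triton"
    # Each side prefers its own flag, then the combined flag, then the default.
    prefill = args.prefill_attention_backend or args.attention_backend or default
    decode = args.decode_attention_backend or args.attention_backend or default
    # Validate what will actually run, not the possibly-None combined attribute.
    for side, backend in (("prefill", prefill), ("decode", decode)):
        assert backend in GEMMA4_BACKENDS, (
            f"Gemma4 {side} backend must be one of {GEMMA4_BACKENDS}, got {backend!r}"
        )
    return prefill, decode
```

Checking the resolved pair is what closes the gap from the review: an unsupported value passed only through a split flag is still caught even though the combined attribute is None.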
/rerun-test test/registered/core/test_gemma4_moe_deterministic.py
All changes are guarded by the Gemma 4 branch, so this should be safe enough.
- server_args.py: extend Gemma4 accepted_backends to include intel_xpu so the model can be served with --attention-backend intel_xpu (PR sgl-project#25547 whitelist had restricted to trtllm_mha / triton).
- test/srt/xpu/test_gemma_4_31b.py: 31B XPU smoke test mirroring the e2b stencil (OpenAI /v1, single Q&A).
- test/srt/xpu/gemma_4_{31b,e2b}_comparison.txt: comparison logs from the attention-backend A/B runs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
- The Gemma4 branch previously overwrote self.attention_backend, so --attention-backend was silently ignored.
- Only auto-select (trtllm_mha on sm100, triton otherwise) when no backend has been set by the user.
- When the user passes --attention-backend, assert it is one of trtllm_mha or triton (the two Gemma4-supported backends).

Test plan
- Launch without --attention-backend on an sm100 GPU and confirm trtllm_mha is selected (log line).
- Launch without --attention-backend on a non-sm100 GPU and confirm triton is selected.
- Launch with --attention-backend trtllm_mha/triton and confirm the user value is preserved (no log override).
- Launch with --attention-backend flashinfer and confirm the assertion fires with a clear message.

🤖 Generated with Claude Code
CI States
Latest PR Test (Base): ❌ Missing run-ci label; add it to run CI tests.
Latest PR Test (Extra): ❌ Blocked; run-ci is required first.