Respect user override for Gemma4 attention backend #25547

Merged
Fridge003 merged 2 commits into sgl-project:main from kpham-sgl:gemma4-attn-backend-guard on May 18, 2026
Conversation

@kpham-sgl
Collaborator

@kpham-sgl kpham-sgl commented May 17, 2026

Summary

  • Follow-up to #25006 (Enable trtllm_mha as gemma4 default attn backend). The Gemma4 default-backend block unconditionally overwrote self.attention_backend, so --attention-backend was silently ignored.
  • Only auto-select (trtllm_mha on sm100, triton otherwise) when no backend has been set by the user.
  • If the user did pass --attention-backend, assert it is one of trtllm_mha or triton (the two Gemma4-supported backends).
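
The behavior described in the bullets above can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the code in python/sglang/srt/server_args.py: the function name `resolve_gemma4_backend`, the `is_sm100` parameter, and the `GEMMA4_BACKENDS` constant are hypothetical names introduced only for this example.

```python
from typing import Optional

# Hypothetical constant for this sketch: the two backends the PR
# description names as Gemma4-supported.
GEMMA4_BACKENDS = ("trtllm_mha", "triton")


def resolve_gemma4_backend(user_backend: Optional[str], is_sm100: bool) -> str:
    """Return the attention backend to use for Gemma4 (illustrative only)."""
    if user_backend is None:
        # No --attention-backend was given: auto-select based on hardware.
        return "trtllm_mha" if is_sm100 else "triton"
    # The user passed a value: keep it, but require a supported backend.
    assert user_backend in GEMMA4_BACKENDS, (
        f"Gemma4 supports only {GEMMA4_BACKENDS}, got {user_backend!r}"
    )
    return user_backend
```

The point of the guard is the `is None` check: the default is applied only when nothing was chosen, so an explicit choice is never overwritten.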

Test plan

  • Launch Gemma4 with no --attention-backend on an sm100 GPU and confirm trtllm_mha is selected (log line).
  • Launch Gemma4 with no --attention-backend on a non-sm100 GPU and confirm triton is selected.
  • Launch Gemma4 with --attention-backend trtllm_mha / triton and confirm the user value is preserved (no log override).
  • Launch Gemma4 with --attention-backend flashinfer and confirm the assertion fires with a clear message.

🤖 Generated with Claude Code


CI States

Latest PR Test (Base): ❌ Missing run-ci label — add it to run CI tests.
Latest PR Test (Extra): ❌ Blocked: run-ci is required first.

Only auto-select trtllm_mha (sm100) or triton when the user hasn't
explicitly set --attention-backend; otherwise assert the chosen backend
is one of trtllm_mha or triton.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the attention backend selection logic for the Gemma4ForConditionalGeneration model, defaulting to trtllm_mha or triton based on hardware support and adding validation for user-specified backends. A review comment identifies a potential logic error where the new assertion could fail if attention_backend is None while other specific backend flags are set, and suggests using get_attention_backends() for safer validation.

Comment thread python/sglang/srt/server_args.py Outdated
@kpham-sgl
Collaborator Author

/rerun-test test/registered/core/test_gemma4_moe_deterministic.py

@github-actions
Contributor

github-actions Bot commented May 18, 2026

🚀 2-gpu-h100 (1 test): ✅ View workflow run

cd test/ && python3 registered/core/test_gemma4_moe_deterministic.py

Address Gemini review feedback: when the user sets only the split
prefill/decode backend flags, self.attention_backend is None and the
plain-attribute assertion misses the actual prefill/decode choices.
Route validation through get_attention_backends() and fall back to a
Gemma4-compatible default if only one split side is set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
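
One possible reading of the commit message above, sketched as standalone code. The `Args` dataclass is a stand-in written for this example. Its field names and the fallback semantics of its `get_attention_backends()` method (a split flag wins, otherwise the shared flag is used) are assumptions, and `apply_gemma4_defaults`, `is_sm100`, and `GEMMA4_BACKENDS` are hypothetical names. The merged change in server_args.py may differ in detail.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical constant: the backends the PR names as Gemma4-supported.
GEMMA4_BACKENDS = ("trtllm_mha", "triton")


@dataclass
class Args:
    """Stand-in for the server arguments; field names are assumptions."""

    attention_backend: Optional[str] = None
    prefill_attention_backend: Optional[str] = None
    decode_attention_backend: Optional[str] = None

    def get_attention_backends(self) -> Tuple[Optional[str], Optional[str]]:
        # Assumed semantics: a split flag wins, else use the shared flag.
        prefill = self.prefill_attention_backend or self.attention_backend
        decode = self.decode_attention_backend or self.attention_backend
        return prefill, decode


def apply_gemma4_defaults(args: Args, is_sm100: bool) -> Tuple[str, str]:
    """Fill in a Gemma4-compatible default, then validate both sides."""
    default = "trtllm_mha" if is_sm100 else "triton"
    prefill, decode = args.get_attention_backends()
    if prefill is None or decode is None:
        # Nothing, or only one split side, was set: the shared flag
        # supplies the default for whichever side is still missing.
        args.attention_backend = default
        prefill, decode = args.get_attention_backends()
    for side, name in (("prefill", prefill), ("decode", decode)):
        assert name in GEMMA4_BACKENDS, (
            f"Gemma4 {side} backend must be one of {GEMMA4_BACKENDS}, got {name!r}"
        )
    return prefill, decode
```

Validating the resolved (prefill, decode) pair rather than the plain attribute is what catches a case such as a user passing only an unsupported decode backend while attention_backend is still None.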
@kpham-sgl
Collaborator Author

/rerun-test test/registered/core/test_gemma4_moe_deterministic.py

@github-actions
Contributor

github-actions Bot commented May 18, 2026

🚀 2-gpu-h100 (1 test): ✅ View workflow run

cd test/ && python3 registered/core/test_gemma4_moe_deterministic.py

@Fridge003
Collaborator

All protected under gemma 4 branch, should be safe enough

@Fridge003 Fridge003 merged commit b29e41e into sgl-project:main May 18, 2026
74 of 84 checks passed
jmunetong added a commit to jmunetong/sglang that referenced this pull request May 18, 2026
- server_args.py: extend Gemma4 accepted_backends to include intel_xpu so
  the model can be served with --attention-backend intel_xpu (PR sgl-project#25547
  whitelist had restricted to trtllm_mha / triton).
- test/srt/xpu/test_gemma_4_31b.py: 31B XPU smoke test mirroring the e2b
  stencil (OpenAI /v1, single Q&A).
- test/srt/xpu/gemma_4_{31b,e2b}_comparison.txt: comparison logs from the
  attention-backend A/B runs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>