
[vLLM IR] rework gemma_rms_norm #39014

Merged
vllm-bot merged 16 commits into vllm-project:main from ZJY0516:rework-gemma-norm on Apr 7, 2026

Conversation

ZJY0516 (Member) commented Apr 5, 2026

Purpose

rework for #38780

cc @ProExpertProg @wxsIcey

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request implements mixed-dtype support for RMSNorm by ensuring that provider kernels (vllm_c, aiter, xpu) fall back to a native implementation when input and weight dtypes do not match. It refactors GemmaRMSNorm to use the unified IR operation and adds comprehensive tests for mixed-dtype scenarios. Review feedback identifies a potential precision loss in the native multiplication step, a bug where the residual tensor is returned in float32 instead of the original dtype, and a performance regression in GemmaRMSNorm caused by the removal of torch.compile and redundant weight conversions.

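For context, here is a minimal sketch of the dispatch described above. It is an illustration rather than the PR's actual code: the helper name `rms_norm_native` and the provider-branch placeholder are assumptions.

```python
import torch


def rms_norm_native(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Reference path: normalize in float32 for stability, apply the weight in
    # float32 as well, then cast back to the activation dtype.
    orig_dtype = x.dtype
    x = x.to(torch.float32)
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    x = x * torch.rsqrt(variance + eps)
    return (x * weight.to(torch.float32)).to(orig_dtype)


def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    if x.dtype != weight.dtype:
        # Provider kernels (vllm_c, aiter, xpu) assume weight.dtype == x.dtype,
        # so mixed-dtype calls take the native path instead.
        return rms_norm_native(x, weight, eps)
    # Matching dtypes: a provider kernel would be dispatched here; the native
    # path stands in for it in this sketch.
    return rms_norm_native(x, weight, eps)
```
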
Comment thread vllm/ir/ops/layernorm.py
Comment thread vllm/model_executor/layers/layernorm.py Outdated
Comment thread vllm/model_executor/layers/layernorm.py

chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 979bea2a64

Comment on lines +172 to +174
if not _rms_weight_dtype_matches_input(input, weight):
    result_rms = vllm.ir.ops.rms_norm(input, weight, self.epsilon)
    return self.quant_matcher(result_rms, scale)[0]

P1: Move mixed-dtype gate out of traced replacement function

This mixed-dtype fallback is placed inside a replacement closure that Inductor traces using example tensors, so the Python if is resolved at trace time (with homogeneous sample dtypes) and the fallback branch is not preserved in the replacement graph. As a result, mixed-dtype RMSNorm+quant graphs introduced by this commit can still be rewritten to fused kernels that require weight.dtype == input.dtype, which leads to runtime failures (or undefined behavior for kernels that reinterpret weight as input scalar type). Please enforce this constraint via match filtering (e.g., extra_check/separate pattern registration) rather than a runtime Python branch inside replacement.

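A minimal sketch of the match-filter approach suggested here, assuming the matched input and weight are exposed as FX nodes in `match.kwargs` with fake tensors in `meta["val"]`; the kwarg names and the registration call are illustrative, not vLLM's actual fusion-pass code.

```python
from torch._inductor.pattern_matcher import Match


def rms_quant_dtypes_match(match: Match) -> bool:
    # Only allow the fused rms_norm+quant rewrite when the weight dtype equals
    # the input dtype; mixed-dtype matches keep the unfused graph.
    input_val = match.kwargs["input"].meta["val"]
    weight_val = match.kwargs["weight"].meta["val"]
    return input_val.dtype == weight_val.dtype


# Passed at registration time instead of branching inside the replacement, e.g.:
#   register_replacement(pattern, replacement, example_inputs, fwd_only,
#                        pass_patterns, extra_check=rms_quant_dtypes_match)
```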

ZJY0516 added 2 commits April 5, 2026 13:35
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
ZJY0516 added 5 commits April 5, 2026 16:27
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
@ZJY0516 ZJY0516 added the ready-run-all-tests Trigger CI with all tests for wide-ranging PRs label Apr 5, 2026

ProExpertProg (Collaborator) commented:

This is a nice fix! IIUC, you convert x to fp32 before passing it to rms_norm so it's the same dtype as weight. I think it would be good to check the final compiled graph and get performance numbers - if x is 16-bit and produced by a custom kernel, I'd want to avoid an extra Triton kernel to convert to 32-bit. I also wonder if it's slower to read & write x at 32 bit instead of 16.

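For illustration only (not the PR's code): with an fp32 norm weight, as in the GGUF case discussed below, upcasting a 16-bit activation adds an extra elementwise cast and roughly doubles the bytes the norm reads and writes.

```python
import torch

x = torch.randn(8, 4096, dtype=torch.bfloat16)    # 16-bit activation from a custom kernel
weight = torch.zeros(4096, dtype=torch.float32)   # fp32 RMSNorm weight (dtype mismatch)

x32 = x.float()                                   # the extra cast kernel the comment wants to avoid
variance = x32.pow(2).mean(dim=-1, keepdim=True)
# Gemma-style norm scales by (1 + weight); cast back to the activation dtype at the end.
out = (x32 * torch.rsqrt(variance + 1e-6) * (1.0 + weight)).to(x.dtype)
```
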
ZJY0516 (Member, Author) commented Apr 5, 2026

> This is a nice fix! IIUC, you convert x to fp32 before passing it to rms_norm so it's the same dtype as weight. I think it would be good to check the final compiled graph and get performance numbers - if x is 16-bit and produced by a custom kernel, I'd want to avoid an extra Triton kernel to convert to 32-bit. I also wonder if it's slower to read & write x at 32 bit instead of 16.

Actually, this will make models/quantization/test_gguf.py::test_models[1-5-32-bfloat16-model6] fail

Need to fix

https://buildkite.com/vllm/ci/builds/59823/steps/canvas?jid=019d5d54-ee19-441f-9417-854701716e26&tab=output

wxsIcey (Contributor) commented Apr 5, 2026

Thanks for your work! The current solution seems to be causing accuracy issues. I think we could try the solution provided by Luka: restrict fusion so it doesn't happen with fp32 weights.

ZJY0516 (Member, Author) commented Apr 5, 2026

> Thanks for your work! The current solution seems to be causing accuracy issues. I think we could try the solution provided by Luka: restrict fusion so it doesn't happen with fp32 weights.

It's not about fusion. It's because in this PR the Gemma GGUF test uses the CUDA kernel, whereas previously it used the native implementation.

wxsIcey (Contributor) commented Apr 5, 2026

> Thanks for your work! The current solution seems to be causing accuracy issues. I think we could try the solution provided by Luka: restrict fusion so it doesn't happen with fp32 weights.

> It's not about fusion. It's because in this PR the Gemma GGUF test uses the CUDA kernel, whereas previously it used the native implementation.

If x isn't converted with float(), the dtypes of x and the weight are inconsistent, which prevents the CUDA kernel from being used. Then, by implementing the restriction in the fusion pass, perhaps CI can pass completely.

@ZJY0516 ZJY0516 removed the ready-run-all-tests Trigger CI with all tests for wide-ranging PRs label Apr 5, 2026
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
@ZJY0516 ZJY0516 added ready ONLY add when PR is ready to merge/full CI is needed and removed ready ONLY add when PR is ready to merge/full CI is needed labels Apr 5, 2026

ProExpertProg (Collaborator) left a comment

Slight simplification?

Comment thread vllm/ir/ops/layernorm.py Outdated
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com>

mergify bot commented Apr 6, 2026

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @ZJY0516.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Apr 6, 2026
# Conflicts:
#	tests/kernels/core/test_layernorm.py

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
@mergify mergify Bot removed the needs-rebase label Apr 6, 2026
ZJY0516 added 3 commits April 6, 2026 10:22
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
@ZJY0516 ZJY0516 force-pushed the rework-gemma-norm branch from da2e44c to 672fb48 on April 6, 2026 10:26
ZJY0516 (Member, Author) commented Apr 6, 2026

@ProExpertProg I disabled allreduce_rms fusion when the dtypes are mismatched; it would cause accuracy issues for quantized models.

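A hedged sketch of the guard described above, assuming the fusion pass can inspect the norm weight and the activation dtype when deciding whether to apply the allreduce+rms_norm pattern; the function name is illustrative, not the PR's code.

```python
import torch


def allreduce_rms_fusion_allowed(norm_weight: torch.Tensor, act_dtype: torch.dtype) -> bool:
    # The fused allreduce+rms_norm kernel assumes the norm weight and the
    # activations share a dtype; with mismatched dtypes (e.g. fp32 weights and
    # bf16 activations) keep the unfused path to avoid the accuracy issue
    # mentioned above for quantized models.
    return norm_weight.dtype == act_dtype
```
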
ProExpertProg (Collaborator) left a comment

Disabling allreduce+rms fusion with mismatched types sounds good. CI failing though?

yma11 (Contributor) commented Apr 7, 2026

cc @chaojun-zhang

ZJY0516 (Member, Author) commented Apr 7, 2026

> Disabling allreduce+rms fusion with mismatched types sounds good. CI failing though?

This failure is unrelated, I think.

@vllm-bot vllm-bot merged commit 8060bb0 into vllm-project:main Apr 7, 2026
179 of 181 checks passed
jacob-lou pushed a commit to jacob-lou/vllm that referenced this pull request Apr 7, 2026
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: Jacob Lou <jacoblou0924@gmail.com>
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request May 10, 2026
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>

Labels

ready-run-all-tests Trigger CI with all tests for wide-ranging PRs

6 participants