Reject QDQ Gemm→QGemm fusion when alpha != 1 with bias#28131
Conversation
|
Friendly ping on this one. @tianleiwu |
There was a problem hiding this comment.
Pull request overview
This PR fixes an incorrect QDQ Gemm → QGemm fusion case by preventing fusion when a Gemm bias is present and alpha != 1, and extends the existing QDQ Gemm transformer tests to cover alpha != 1 scenarios (including fastmath).
Changes:
- Update
GemmNodeGroupSelector::Checkto reject QDQ Gemm→QGemm fusion for bias-present Gemm whenalpha != 1. - Add
alpha_not_onecoverage toQDQTransformerGemmTestsin both the standard and fastmath test suites. - Update test graph expectations so fusion is skipped specifically for the bias +
alpha_not_onecombination.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.cc |
Adds an alpha == 1 requirement (when bias is present) to avoid incorrect bias scaling in QGemm fusion. |
onnxruntime/test/optimizer/qdq_transformer_test.cc |
Extends QDQ Gemm tests with alpha_not_one cases and updates fusion expectations accordingly. |
onnxruntime/test/optimizer/qdq_transformer_fastmath_test.cc |
Mirrors the same alpha_not_one test coverage for the fastmath variant. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Hi @tianleiwu, this PR is blocked from auto-merge by a single failing check (windows_x64_asan / build_x64).Could you re-run the ASAN job? Thanks! |
Done |
|
Gentle ping - @tianleiwu. this PR is still blocked by a same single check. Could yo re-run this job again? |
Job re-tried. Hope that it can pass this time. |
|
@tianleiwu Failed again😅. I think there is a st problem with it 🤔 . |
Last retry (failed 5 times). If it does not work, please merge/rebase main. |
The Gemm→QGemm QDQ fusion selector only validated beta == 1, letting Gemms with alpha != 1 and a bias through. QGemm broadcasts the int32 bias into the accumulator before applying the alpha*sa*sb output scale, so the bias ends up scaled by alpha too — producing incorrect outputs when alpha != 1 (bias == 0 masks the issue). Add an alpha == 1 check alongside the existing beta == 1 check in GemmNodeGroupSelector::Check (only when bias is present — without bias the fused path is still exact). Extend QDQTransformerGemmTests and the fastmath variant with an alpha_not_one parameter so the regression is covered. Follow-up tracked in the issue: absorb alpha into the int32 bias in GemmReplaceWithQuant so alpha != 1 cases can keep the fusion.
Head branch was pushed to by a user without write access
184d71d to
e8c90e9
Compare
|
@tianleiwu Sadly failed at the same point. As you suggested, rebased onto current main. Sory for the trouble |
|
Hi @tianleiwu, sorry to keep bothering you with this. After rebasing to the main, the CI still failed at the same point with the same ASAN OOM, so I suspect the added test coverage might be pushing it over the memory limit. I've reduced the new cases down to the minimal essential ones (keeping the bias + alpha != 1 regression checks). Could you trigger CI once more when you have a moment? |
Here is AI analysis of the failed CI job:The failure is most likely caused by
Root cause
In std::vector<int> opsets = {15, 18, 21};
if constexpr (std::is_same_v<QuantType, uint16_t> || std::is_same_v<QuantType, int16_t>) {
if (q_domain == kOnnxDomain) {
opsets = std::vector<int>{21};
}
}So this single test expands into many ASan-heavy session builds. Since the log shows the crash happens exactly when Best fixReduce the matrix for this specific no-axis test under ASan, especially for the 16-bit variants and/or MS-domain duplicates, because the behavior being validated is already covered by neighboring tests:
A focused change is to keep one representative no-axis case and skip the redundant high-memory variants in ASan builds. Suggested code changeUpdate TEST(TransposeOptimizerTests, TestDequantizeLinearNoAxis) {
std::optional<std::vector<int64_t>> zp_input_shape = std::vector<int64_t>{};
std::vector<int64_t> zp_value_shape{};
std::optional<ONNX_NAMESPACE::AttributeProto> no_axis; // Empty axis value will not be set.
RunDequantizeLinearTestCase<uint8_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);
#if !defined(__SANITIZE_ADDRESS__)
RunDequantizeLinearTestCase<uint16_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);
RunDequantizeLinearTestCase<int16_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);
#if !defined(DISABLE_CONTRIB_OPS)
RunDequantizeLinearTestCase<uint8_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
RunDequantizeLinearTestCase<uint16_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
RunDequantizeLinearTestCase<int16_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
#endif
#endif
}More portable version for MSVC ASanBecause compiler macros differ on Windows, a safer pattern is to add a small local guard: #if defined(__SANITIZE_ADDRESS__) || defined(ADDRESS_SANITIZER) || defined(_MSC_VER)
#define ORT_ASAN_BUILD 1
#endifThen: TEST(TransposeOptimizerTests, TestDequantizeLinearNoAxis) {
std::optional<std::vector<int64_t>> zp_input_shape = std::vector<int64_t>{};
std::vector<int64_t> zp_value_shape{};
std::optional<ONNX_NAMESPACE::AttributeProto> no_axis;
RunDequantizeLinearTestCase<uint8_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);
#if !defined(ORT_ASAN_BUILD)
RunDequantizeLinearTestCase<uint16_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);
RunDequantizeLinearTestCase<int16_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);
#if !defined(DISABLE_CONTRIB_OPS)
RunDequantizeLinearTestCase<uint8_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
RunDequantizeLinearTestCase<uint16_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
RunDequantizeLinearTestCase<int16_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
#endif
#endif
}Preferred cleaner fixIf you want to preserve all coverage outside ASan without sprinkling guards in tests, add an optional template <typename QuantType>
static void RunDequantizeLinearTestCase(const std::optional<std::vector<int64_t>>& zp_input_shape,
const std::vector<int64_t>& zp_value_shape,
std::optional<ONNX_NAMESPACE::AttributeProto> axis,
const std::string& q_domain = "",
std::vector<int> opsets_override = {}) {
...
std::vector<int> opsets = opsets_override.empty() ? std::vector<int>{15, 18, 21} : std::move(opsets_override);
if (opsets_override.empty()) {
if constexpr (std::is_same_v<QuantType, uint16_t> || std::is_same_v<QuantType, int16_t>) {
if (q_domain == kOnnxDomain) {
opsets = std::vector<int>{21};
}
}
}
TransformerTester(..., opsets);
}Then: TEST(TransposeOptimizerTests, TestDequantizeLinearNoAxis) {
...
#if defined(ORT_ASAN_BUILD)
RunDequantizeLinearTestCase<uint8_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain, {21});
#else
...
#endif
}Why this is the right solutionThe workflow is an ASan job ( |
|
Hi @tianleiwu, thank you for the analysis! I went with option 1 ( Option 3 changes the helper signature and other call sites, so I left it out of this PR - but I can add it here if you'd prefer, or send it as a small follow-up. Thank you again! |
|
Hi @tianleiwu, the rerun failed again with the same ASan OOM, but the failing test was TransposeOptimizerTests.TestReduceMaxKeepdimsFalse (line 1265), which runs before TestDequantizeLinearNoAxis (line 3760), so the guard never gets a chance to trigger. Looking at the previous failures too, the OOM consistently happens somewhere in the TransposeOptimizerTests block (lines ~1200–1700). |
@elwhyjay, The solution is either reduce memory usage of the test case, or reduce test parallel in ASan to avoid OOM. Here is a PR for the later: #28675. You can merge it to your branch and test it. |
Description
Add an
alpha == 1.0check toGemmNodeGroupSelector::Check(symmetric to the existingbeta == 1.0check) so the QDQGemm → QGemmfusion is skipped when a bias is present andalpha != 1. Extend the QDQ Gemm transformer tests (both the default and fastmath variants) withalpha_not_onecases.Motivation and Context
Fixes #28130.
When a Gemm has
alpha != 1and a bias, the QDQ fusion produces incorrect results. The root cause is in the QGemm kernel:The output scale (
alpha * a_scale * b_scale) is applied to the whole int32 accumulator, which already contains the bias, soalphaends up multiplying the bias too:This matches the reporter's observation that the discrepancy vanishes when the bias is zero.
This PR is the minimal correctness fix — it rejects the fusion in the broken case rather than producing wrong output. The follow-up discussed on the issue is to absorb
alphainto the int32 bias at graph-transform time (C_int_new = round(C_int / alpha)insideGemmReplaceWithQuant), soalpha != 1cases can keep the fused path. That is a separate PR because the precision/overflow characteristics want their own review.Test plan
QDQTransformerGemmTestsand its fastmath variant with analpha_not_oneparameter. Setalpha=2.0on the Gemm node when true, and adjustcheck_binary_op_graphso fusion is expected to be skipped when bias is present.mainthe new cases fail — both on op count and on output (expected -0.624, got -0.429, diff 0.195, tol 0.016), matching the bug report.QDQTransformerTests.Gemm_*(8),QDQTransformerTests.*(146), andGraphTransformationTests.Gemm*(16) all pass locally on macOS arm64.gemm_alpha.pyagainst a locally built wheel (macOS arm64) matches optimized vs unoptimized across all cases:alpha=1.0, optimize=False [[ 39 221 173]] alpha=1.0, optimize=True [[ 39 221 173]] alpha=2.0, optimize=False [[ 47 216 169]] alpha=2.0, optimize=True [[ 47 216 169]] (was [[0 255 218]] on main) alpha=2.0, optimize=False, bias=0 [[144 118 120]] alpha=2.0, optimize=True, bias=0 [[144 118 120]]