Reject QDQ Gemm→QGemm fusion when alpha != 1 with bias#28131
Conversation
|
Friendly ping on this one. @tianleiwu |
There was a problem hiding this comment.
Pull request overview
This PR fixes an incorrect QDQ Gemm → QGemm fusion case by preventing fusion when a Gemm bias is present and alpha != 1, and extends the existing QDQ Gemm transformer tests to cover alpha != 1 scenarios (including fastmath).
Changes:
- Update
GemmNodeGroupSelector::Checkto reject QDQ Gemm→QGemm fusion for bias-present Gemm whenalpha != 1. - Add
alpha_not_onecoverage toQDQTransformerGemmTestsin both the standard and fastmath test suites. - Update test graph expectations so fusion is skipped specifically for the bias +
alpha_not_onecombination.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.cc |
Adds an alpha == 1 requirement (when bias is present) to avoid incorrect bias scaling in QGemm fusion. |
onnxruntime/test/optimizer/qdq_transformer_test.cc |
Extends QDQ Gemm tests with alpha_not_one cases and updates fusion expectations accordingly. |
onnxruntime/test/optimizer/qdq_transformer_fastmath_test.cc |
Mirrors the same alpha_not_one test coverage for the fastmath variant. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Hi @tianleiwu, this PR is blocked from auto-merge by a single failing check (windows_x64_asan / build_x64).Could you re-run the ASAN job? Thanks! |
Done |
|
Gentle ping - @tianleiwu. this PR is still blocked by a same single check. Could yo re-run this job again? |
Job re-tried. Hope that it can pass this time. |
|
@tianleiwu Failed again😅. I think there is a st problem with it 🤔 . |
Last retry (failed 5 times). If it does not work, please merge/rebase main. |
The Gemm→QGemm QDQ fusion selector only validated beta == 1, letting Gemms with alpha != 1 and a bias through. QGemm broadcasts the int32 bias into the accumulator before applying the alpha*sa*sb output scale, so the bias ends up scaled by alpha too — producing incorrect outputs when alpha != 1 (bias == 0 masks the issue). Add an alpha == 1 check alongside the existing beta == 1 check in GemmNodeGroupSelector::Check (only when bias is present — without bias the fused path is still exact). Extend QDQTransformerGemmTests and the fastmath variant with an alpha_not_one parameter so the regression is covered. Follow-up tracked in the issue: absorb alpha into the int32 bias in GemmReplaceWithQuant so alpha != 1 cases can keep the fusion.
Head branch was pushed to by a user without write access
184d71d to
e8c90e9
Compare
|
@tianleiwu Sadly failed at the same point. As you suggested, rebased onto current main. Sory for the trouble |
|
Hi @tianleiwu, sorry to keep bothering you with this. After rebasing to the main, the CI still failed at the same point with the same ASAN OOM, so I suspect the added test coverage might be pushing it over the memory limit. I've reduced the new cases down to the minimal essential ones (keeping the bias + alpha != 1 regression checks). Could you trigger CI once more when you have a moment? |
Here is AI analysis of the failed CI job:The failure is most likely caused by
Root cause
In std::vector<int> opsets = {15, 18, 21};
if constexpr (std::is_same_v<QuantType, uint16_t> || std::is_same_v<QuantType, int16_t>) {
if (q_domain == kOnnxDomain) {
opsets = std::vector<int>{21};
}
}So this single test expands into many ASan-heavy session builds. Since the log shows the crash happens exactly when Best fixReduce the matrix for this specific no-axis test under ASan, especially for the 16-bit variants and/or MS-domain duplicates, because the behavior being validated is already covered by neighboring tests:
A focused change is to keep one representative no-axis case and skip the redundant high-memory variants in ASan builds. Suggested code changeUpdate TEST(TransposeOptimizerTests, TestDequantizeLinearNoAxis) {
std::optional<std::vector<int64_t>> zp_input_shape = std::vector<int64_t>{};
std::vector<int64_t> zp_value_shape{};
std::optional<ONNX_NAMESPACE::AttributeProto> no_axis; // Empty axis value will not be set.
RunDequantizeLinearTestCase<uint8_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);
#if !defined(__SANITIZE_ADDRESS__)
RunDequantizeLinearTestCase<uint16_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);
RunDequantizeLinearTestCase<int16_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);
#if !defined(DISABLE_CONTRIB_OPS)
RunDequantizeLinearTestCase<uint8_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
RunDequantizeLinearTestCase<uint16_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
RunDequantizeLinearTestCase<int16_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
#endif
#endif
}More portable version for MSVC ASanBecause compiler macros differ on Windows, a safer pattern is to add a small local guard: #if defined(__SANITIZE_ADDRESS__) || defined(ADDRESS_SANITIZER) || defined(_MSC_VER)
#define ORT_ASAN_BUILD 1
#endifThen: TEST(TransposeOptimizerTests, TestDequantizeLinearNoAxis) {
std::optional<std::vector<int64_t>> zp_input_shape = std::vector<int64_t>{};
std::vector<int64_t> zp_value_shape{};
std::optional<ONNX_NAMESPACE::AttributeProto> no_axis;
RunDequantizeLinearTestCase<uint8_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);
#if !defined(ORT_ASAN_BUILD)
RunDequantizeLinearTestCase<uint16_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);
RunDequantizeLinearTestCase<int16_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);
#if !defined(DISABLE_CONTRIB_OPS)
RunDequantizeLinearTestCase<uint8_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
RunDequantizeLinearTestCase<uint16_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
RunDequantizeLinearTestCase<int16_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
#endif
#endif
}Preferred cleaner fixIf you want to preserve all coverage outside ASan without sprinkling guards in tests, add an optional template <typename QuantType>
static void RunDequantizeLinearTestCase(const std::optional<std::vector<int64_t>>& zp_input_shape,
const std::vector<int64_t>& zp_value_shape,
std::optional<ONNX_NAMESPACE::AttributeProto> axis,
const std::string& q_domain = "",
std::vector<int> opsets_override = {}) {
...
std::vector<int> opsets = opsets_override.empty() ? std::vector<int>{15, 18, 21} : std::move(opsets_override);
if (opsets_override.empty()) {
if constexpr (std::is_same_v<QuantType, uint16_t> || std::is_same_v<QuantType, int16_t>) {
if (q_domain == kOnnxDomain) {
opsets = std::vector<int>{21};
}
}
}
TransformerTester(..., opsets);
}Then: TEST(TransposeOptimizerTests, TestDequantizeLinearNoAxis) {
...
#if defined(ORT_ASAN_BUILD)
RunDequantizeLinearTestCase<uint8_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain, {21});
#else
...
#endif
}Why this is the right solutionThe workflow is an ASan job ( |
|
Hi @tianleiwu, thank you for the analysis! I went with option 1 ( Option 3 changes the helper signature and other call sites, so I left it out of this PR - but I can add it here if you'd prefer, or send it as a small follow-up. Thank you again! |
|
Hi @tianleiwu, the rerun failed again with the same ASan OOM, but the failing test was TransposeOptimizerTests.TestReduceMaxKeepdimsFalse (line 1265), which runs before TestDequantizeLinearNoAxis (line 3760), so the guard never gets a chance to trigger. Looking at the previous failures too, the OOM consistently happens somewhere in the TransposeOptimizerTests block (lines ~1200–1700). |
@elwhyjay, The solution is either reduce memory usage of the test case, or reduce test parallel in ASan to avoid OOM. Here is a PR for the later: #28675. You can merge it to your branch and test it. |
Head branch was pushed to by a user without write access
|
Hi @tianleiwu, thank you for #28675! I've merged the latest main (which now includes your --test_parallel 4 change) and also reverted the ASan guard I had added to transpose_optimizer_test.cc since it's no longer needed. Could you trigger CI when you have a moment? |
|
@elwhyjay, it seems that Asan pipeline still failed. |
|
HI @tianleiwu ! To reduce cumulative ASan pressure from this PR, I reduced the new alpha_not_one Gemm matrix to one representative uint8 QGemm path with int32 bias, while keeping bias/no-bias and output-q/no-output-q coverage in both default and fastmath tests. This should still exercise the selector behavior this PR changes, while avoiding repeated alpha coverage across every type combination. Hopefully this is a reasonable smaller test surface to try for the ASan pipeline. |
|
@tianleiwu The previous reduction did not clear the ASan job, but it does look like some progress: compared with the earlier log where ASan ran out around the transpose optimizer area, the latest run got further through This latest commit keeps the new |
# PR: Fix ASan OOM in QDQ Gemm transformer tests ## Description PR #28131 ("Reject QDQ Gemm→QGemm fusion when alpha != 1 with bias") added `alpha_not_one` coverage to the QDQ Gemm fusion tests. This multiplied the number of `TransformerTester` session builds inside the already-large `Gemm_U8U8U8` test matrix and pushed the AddressSanitizer (ASan) build of `onnxruntime_test_all` over its allocator limit, causing the `windows_x64_asan` CI to fail with `AddressSanitizer: Out of memory. The process has exhausted 8192MB for size class 8192`. This PR isolates the `alpha != 1` coverage into small, dedicated tests so the peak memory of any single test is reduced. ## Summary of Changes | File | Change | |------|--------| | `onnxruntime/test/optimizer/qdq_transformer_test.cc` | Added an `opset_version` parameter to `QDQTransformerGemmTests` (default `0` = run opsets 12/18/19); replaced the three hardcoded `TransformerTester` calls with a loop over the selected opset(s); removed the inline `alpha_not_one` block from the templated `QDQTransformerGemmTests()`; added a dedicated `TEST(QDQTransformerTests, Gemm_AlphaNotOne_U8U8U8)` that runs only uint8/uint8/uint8 at opset 19. | | `onnxruntime/test/optimizer/qdq_transformer_fastmath_test.cc` | Same refactor for the fastmath variant; added `TEST(QDQTransformerTests, Gemm_AlphaNotOne_U8U8U8_FastMath)` running only uint8/uint8/uint8 at opset 19. | The net effect is that the incremental `alpha_not_one` work added by #28131 drops from 24 session builds (4 alpha variants × 3 opsets, in each of the regular and fastmath files) to 8, and is no longer part of the large `Gemm_U8U8U8` matrix test — directly lowering the peak memory consumed in a single test. ## Testing - Build with `--enable_address_sanitizer` (the `windows_x64_asan` configuration) and run `onnxruntime_test_all`; confirm `QDQTransformerTests.Gemm_AlphaNotOne_U8U8U8`, `QDQTransformerTests.Gemm_AlphaNotOne_U8U8U8_FastMath`, and `QDQTransformerTests.Gemm_U8U8U8` pass and the suite no longer hits the ASan OOM. - Fusion behavior is unchanged: the same `alpha != 1` rejection logic is still exercised, just with a narrower datatype/opset footprint. ## Motivation and Context The ASan failure is the sanitizer's internal allocator size-class limit (8 GB for `size class 8192`), not a runner RAM cap that can simply be raised. Loosening it via `ASAN_OPTIONS` quarantine tuning would weaken the sanitizer's bug-detection guarantees, so the fix targets the test's memory footprint instead. ### Options considered 1. **`--test_parallel` (reduce CTest concurrency).** Lower the parallelism in the ASan workflow (e.g., `--test_parallel 1`) so fewer test binaries run concurrently. This only addresses cumulative/overlapping process memory; it does **not** reduce the peak memory of a single test, and it slows the CI down for every run. Rejected as a blunt, non-durable workaround. 2. **Shard the ASan tests.** Split `onnxruntime_test_all` into N gtest shards (`GTEST_TOTAL_SHARDS` / `GTEST_SHARD_INDEX`) so the ASan allocator resets between shards. This helps with cumulative growth across the whole binary, but it still does **not** reduce the peak memory of any individual test — if one test alone approaches the limit, sharding the binary will not help. Rejected for the same root-cause reason. 3. **Break the test into smaller tests (chosen).** Isolate the `alpha != 1` coverage into dedicated tests that run a single datatype (uint8/uint8/uint8) at a single opset (19), and remove the alpha cases from the large `Gemm_U8U8U8` matrix. This reduces the work done in the heaviest single test and addresses the peak-memory problem at its source while keeping the same fusion behavior under test. Reference: PR #28131 (merge commit `585273033e`). ## Checklist - [x] Tests added/updated - [ ] Documentation updated (if applicable) - [x] No breaking changes (test-only change) - [ ] CI passes
Description
Add an
alpha == 1.0check toGemmNodeGroupSelector::Check(symmetric to the existingbeta == 1.0check) so the QDQGemm → QGemmfusion is skipped when a bias is present andalpha != 1. Extend the QDQ Gemm transformer tests (both the default and fastmath variants) withalpha_not_onecases.Motivation and Context
Fixes #28130.
When a Gemm has
alpha != 1and a bias, the QDQ fusion produces incorrect results. The root cause is in the QGemm kernel:The output scale (
alpha * a_scale * b_scale) is applied to the whole int32 accumulator, which already contains the bias, soalphaends up multiplying the bias too:This matches the reporter's observation that the discrepancy vanishes when the bias is zero.
This PR is the minimal correctness fix — it rejects the fusion in the broken case rather than producing wrong output. The follow-up discussed on the issue is to absorb
alphainto the int32 bias at graph-transform time (C_int_new = round(C_int / alpha)insideGemmReplaceWithQuant), soalpha != 1cases can keep the fused path. That is a separate PR because the precision/overflow characteristics want their own review.Test plan
QDQTransformerGemmTestsand its fastmath variant with analpha_not_oneparameter. Setalpha=2.0on the Gemm node when true, and adjustcheck_binary_op_graphso fusion is expected to be skipped when bias is present.mainthe new cases fail — both on op count and on output (expected -0.624, got -0.429, diff 0.195, tol 0.016), matching the bug report.QDQTransformerTests.Gemm_*(8),QDQTransformerTests.*(146), andGraphTransformationTests.Gemm*(16) all pass locally on macOS arm64.gemm_alpha.pyagainst a locally built wheel (macOS arm64) matches optimized vs unoptimized across all cases:alpha=1.0, optimize=False [[ 39 221 173]] alpha=1.0, optimize=True [[ 39 221 173]] alpha=2.0, optimize=False [[ 47 216 169]] alpha=2.0, optimize=True [[ 47 216 169]] (was [[0 255 218]] on main) alpha=2.0, optimize=False, bias=0 [[144 118 120]] alpha=2.0, optimize=True, bias=0 [[144 118 120]]