Skip to content

Reject QDQ Gemm→QGemm fusion when alpha != 1 with bias#28131

Merged
tianleiwu merged 8 commits into
microsoft:mainfrom
elwhyjay:fix/qdq-gemm-alpha-28130
Jun 3, 2026
Merged

Reject QDQ Gemm→QGemm fusion when alpha != 1 with bias#28131
tianleiwu merged 8 commits into
microsoft:mainfrom
elwhyjay:fix/qdq-gemm-alpha-28130

Conversation

@elwhyjay

@elwhyjay elwhyjay commented Apr 19, 2026

Copy link
Copy Markdown
Contributor

Description

Add an alpha == 1.0 check to GemmNodeGroupSelector::Check (symmetric to the existing beta == 1.0 check) so the QDQ Gemm → QGemm fusion is skipped when a bias is present and alpha != 1. Extend the QDQ Gemm transformer tests (both the default and fastmath variants) with alpha_not_one cases.

Motivation and Context

Fixes #28130.

When a Gemm has alpha != 1 and a bias, the QDQ fusion produces incorrect results. The root cause is in the QGemm kernel:

acc_int32 = (A - zpA)(B - zpB) + C_int        // bias added to int accumulator
Y_float   = (alpha * sa * sb) * acc_int32      // output scale applied to everything

The output scale (alpha * a_scale * b_scale) is applied to the whole int32 accumulator, which already contains the bias, so alpha ends up multiplying the bias too:

Y = alpha * A_dq * B_dq + alpha * C_float      (buggy)
Y = alpha * A_dq * B_dq + C_float              (expected)

This matches the reporter's observation that the discrepancy vanishes when the bias is zero.

This PR is the minimal correctness fix — it rejects the fusion in the broken case rather than producing wrong output. The follow-up discussed on the issue is to absorb alpha into the int32 bias at graph-transform time (C_int_new = round(C_int / alpha) inside GemmReplaceWithQuant), so alpha != 1 cases can keep the fused path. That is a separate PR because the precision/overflow characteristics want their own review.

Test plan

  • Extend QDQTransformerGemmTests and its fastmath variant with an alpha_not_one parameter. Set alpha=2.0 on the Gemm node when true, and adjust check_binary_op_graph so fusion is expected to be skipped when bias is present.
  • On main the new cases fail — both on op count and on output (expected -0.624, got -0.429, diff 0.195, tol 0.016), matching the bug report.
  • With the selector fix, QDQTransformerTests.Gemm_* (8), QDQTransformerTests.* (146), and GraphTransformationTests.Gemm* (16) all pass locally on macOS arm64.
  • End-to-end with the reporter's gemm_alpha.py against a locally built wheel (macOS arm64) matches optimized vs unoptimized across all cases:
    alpha=1.0, optimize=False     [[ 39 221 173]]
    alpha=1.0, optimize=True      [[ 39 221 173]]
    alpha=2.0, optimize=False     [[ 47 216 169]]
    alpha=2.0, optimize=True      [[ 47 216 169]]   (was [[0 255 218]] on main)
    alpha=2.0, optimize=False, bias=0   [[144 118 120]]
    alpha=2.0, optimize=True,  bias=0   [[144 118 120]]

@elwhyjay

Copy link
Copy Markdown
Contributor Author

Friendly ping on this one. @tianleiwu
Most CI is green; the only failure I see is windows_x64_asan, which appears to be an AddressSanitizer OOM in unrelated tests. Happy to rebase/update if preferred.
and would appreciate any guidance on whether this minimal fix is acceptable, or whether maintainers would prefer folding the follow-up bias-rescaling approach into this PR.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes an incorrect QDQ Gemm → QGemm fusion case by preventing fusion when a Gemm bias is present and alpha != 1, and extends the existing QDQ Gemm transformer tests to cover alpha != 1 scenarios (including fastmath).

Changes:

  • Update GemmNodeGroupSelector::Check to reject QDQ Gemm→QGemm fusion for bias-present Gemm when alpha != 1.
  • Add alpha_not_one coverage to QDQTransformerGemmTests in both the standard and fastmath test suites.
  • Update test graph expectations so fusion is skipped specifically for the bias + alpha_not_one combination.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
onnxruntime/core/optimizer/qdq_transformer/selectors_actions/qdq_selectors.cc Adds an alpha == 1 requirement (when bias is present) to avoid incorrect bias scaling in QGemm fusion.
onnxruntime/test/optimizer/qdq_transformer_test.cc Extends QDQ Gemm tests with alpha_not_one cases and updates fusion expectations accordingly.
onnxruntime/test/optimizer/qdq_transformer_fastmath_test.cc Mirrors the same alpha_not_one test coverage for the fastmath variant.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

tianleiwu
tianleiwu previously approved these changes May 1, 2026
@tianleiwu tianleiwu enabled auto-merge (squash) May 1, 2026 18:25
@elwhyjay

elwhyjay commented May 7, 2026

Copy link
Copy Markdown
Contributor Author

Hi @tianleiwu, this PR is blocked from auto-merge by a single failing check (windows_x64_asan / build_x64).Could you re-run the ASAN job? Thanks!

@tianleiwu

Copy link
Copy Markdown
Contributor

Hi @tianleiwu, this PR is blocked from auto-merge by a single failing check (windows_x64_asan / build_x64).Could you re-run the ASAN job? Thanks!

Done

@elwhyjay

Copy link
Copy Markdown
Contributor Author

Gentle ping - @tianleiwu. this PR is still blocked by a same single check. Could yo re-run this job again?

@tianleiwu

Copy link
Copy Markdown
Contributor

Gentle ping - @tianleiwu. this PR is still blocked by a same single check. Could yo re-run this job again?

Job re-tried. Hope that it can pass this time.

@elwhyjay

Copy link
Copy Markdown
Contributor Author

@tianleiwu Failed again😅. I think there is a st problem with it 🤔 .

@tianleiwu

tianleiwu commented May 13, 2026

Copy link
Copy Markdown
Contributor

@tianleiwu Failed again😅. I think there is a st problem with it 🤔 .

Last retry (failed 5 times). If it does not work, please merge/rebase main.

The Gemm→QGemm QDQ fusion selector only validated beta == 1, letting
Gemms with alpha != 1 and a bias through. QGemm broadcasts the int32
bias into the accumulator before applying the alpha*sa*sb output scale,
so the bias ends up scaled by alpha too — producing incorrect outputs
when alpha != 1 (bias == 0 masks the issue).

Add an alpha == 1 check alongside the existing beta == 1 check in
GemmNodeGroupSelector::Check (only when bias is present — without bias
the fused path is still exact). Extend QDQTransformerGemmTests and the
fastmath variant with an alpha_not_one parameter so the regression is
covered.

Follow-up tracked in the issue: absorb alpha into the int32 bias in
GemmReplaceWithQuant so alpha != 1 cases can keep the fusion.
auto-merge was automatically disabled May 13, 2026 02:26

Head branch was pushed to by a user without write access

@elwhyjay elwhyjay force-pushed the fix/qdq-gemm-alpha-28130 branch from 184d71d to e8c90e9 Compare May 13, 2026 02:26
@elwhyjay

Copy link
Copy Markdown
Contributor Author

@tianleiwu Sadly failed at the same point. As you suggested, rebased onto current main. Sory for the trouble

@elwhyjay

Copy link
Copy Markdown
Contributor Author

Hi @tianleiwu, sorry to keep bothering you with this. After rebasing to the main, the CI still failed at the same point with the same ASAN OOM, so I suspect the added test coverage might be pushing it over the memory limit. I've reduced the new cases down to the minimal essential ones (keeping the bias + alpha != 1 regression checks). Could you trigger CI once more when you have a moment?

@tianleiwu

Copy link
Copy Markdown
Contributor

@elwhyjay,

Here is AI analysis of the failed CI job:

The failure is most likely caused by TransposeOptimizerTests.TestDequantizeLinearNoAxis, which crashes under ASan with OOM immediately after starting:

Root cause

TestDequantizeLinearNoAxis invokes RunDequantizeLinearTestCase for 6 variants:

  • ONNX domain: uint8_t, uint16_t, int16_t
  • MS domain: uint8_t, uint16_t, int16_t

In RunDequantizeLinearTestCase, all variants use TransformerTester(...) across multiple opsets:

std::vector<int> opsets = {15, 18, 21};
if constexpr (std::is_same_v<QuantType, uint16_t> || std::is_same_v<QuantType, int16_t>) {
  if (q_domain == kOnnxDomain) {
    opsets = std::vector<int>{21};
  }
}

So this single test expands into many ASan-heavy session builds. Since the log shows the crash happens exactly when TestDequantizeLinearNoAxis starts, the issue is test memory pressure, not a product correctness failure.

Best fix

Reduce the matrix for this specific no-axis test under ASan, especially for the 16-bit variants and/or MS-domain duplicates, because the behavior being validated is already covered by neighboring tests:

  • TestDequantizeLinearScalarIgnoreAxis
  • TestDequantizeLinearVector

A focused change is to keep one representative no-axis case and skip the redundant high-memory variants in ASan builds.

Suggested code change

Update TestDequantizeLinearNoAxis to limit coverage under ASan:

TEST(TransposeOptimizerTests, TestDequantizeLinearNoAxis) {
  std::optional<std::vector<int64_t>> zp_input_shape = std::vector<int64_t>{};
  std::vector<int64_t> zp_value_shape{};
  std::optional<ONNX_NAMESPACE::AttributeProto> no_axis;  // Empty axis value will not be set.

  RunDequantizeLinearTestCase<uint8_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);

#if !defined(__SANITIZE_ADDRESS__)
  RunDequantizeLinearTestCase<uint16_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);
  RunDequantizeLinearTestCase<int16_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);
#if !defined(DISABLE_CONTRIB_OPS)
  RunDequantizeLinearTestCase<uint8_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
  RunDequantizeLinearTestCase<uint16_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
  RunDequantizeLinearTestCase<int16_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
#endif
#endif
}

More portable version for MSVC ASan

Because compiler macros differ on Windows, a safer pattern is to add a small local guard:

#if defined(__SANITIZE_ADDRESS__) || defined(ADDRESS_SANITIZER) || defined(_MSC_VER)
#define ORT_ASAN_BUILD 1
#endif

Then:

TEST(TransposeOptimizerTests, TestDequantizeLinearNoAxis) {
  std::optional<std::vector<int64_t>> zp_input_shape = std::vector<int64_t>{};
  std::vector<int64_t> zp_value_shape{};
  std::optional<ONNX_NAMESPACE::AttributeProto> no_axis;

  RunDequantizeLinearTestCase<uint8_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);

#if !defined(ORT_ASAN_BUILD)
  RunDequantizeLinearTestCase<uint16_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);
  RunDequantizeLinearTestCase<int16_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain);
#if !defined(DISABLE_CONTRIB_OPS)
  RunDequantizeLinearTestCase<uint8_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
  RunDequantizeLinearTestCase<uint16_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
  RunDequantizeLinearTestCase<int16_t>(zp_input_shape, zp_value_shape, no_axis, kMSDomain);
#endif
#endif
}

Preferred cleaner fix

If you want to preserve all coverage outside ASan without sprinkling guards in tests, add an optional opsets override or reduced-coverage mode to the helper and use it only here. Example:

template <typename QuantType>
static void RunDequantizeLinearTestCase(const std::optional<std::vector<int64_t>>& zp_input_shape,
                                        const std::vector<int64_t>& zp_value_shape,
                                        std::optional<ONNX_NAMESPACE::AttributeProto> axis,
                                        const std::string& q_domain = "",
                                        std::vector<int> opsets_override = {}) {
  ...
  std::vector<int> opsets = opsets_override.empty() ? std::vector<int>{15, 18, 21} : std::move(opsets_override);

  if (opsets_override.empty()) {
    if constexpr (std::is_same_v<QuantType, uint16_t> || std::is_same_v<QuantType, int16_t>) {
      if (q_domain == kOnnxDomain) {
        opsets = std::vector<int>{21};
      }
    }
  }

  TransformerTester(..., opsets);
}

Then:

TEST(TransposeOptimizerTests, TestDequantizeLinearNoAxis) {
  ...
#if defined(ORT_ASAN_BUILD)
  RunDequantizeLinearTestCase<uint8_t>(zp_input_shape, zp_value_shape, no_axis, kOnnxDomain, {21});
#else
  ...
#endif
}

Why this is the right solution

The workflow is an ASan job (windows_build_x64_asan), and the log shows the process dies from memory exhaustion rather than a semantic assertion failure. The fix should therefore reduce peak memory from this test’s combinatorial coverage, not change runtime code.

@elwhyjay

Copy link
Copy Markdown
Contributor Author

Hi @tianleiwu, thank you for the analysis! I went with option 1 (__SANITIZE_ADDRESS__ guard) because adding _MSC_VER would also disable the variants on non-ASan Windows builds, and MSVC 16.9+ defines __SANITIZE_ADDRESS__ under /fsanitize=address anyway. Also restored the alpha_not_one cases I had trimmed earlier.

Option 3 changes the helper signature and other call sites, so I left it out of this PR - but I can add it here if you'd prefer, or send it as a small follow-up. Thank you again!

@tianleiwu tianleiwu enabled auto-merge (squash) May 22, 2026 04:07
tianleiwu
tianleiwu previously approved these changes May 22, 2026
@elwhyjay

Copy link
Copy Markdown
Contributor Author

Hi @tianleiwu, the rerun failed again with the same ASan OOM, but the failing test was TransposeOptimizerTests.TestReduceMaxKeepdimsFalse (line 1265), which runs before TestDequantizeLinearNoAxis (line 3760), so the guard never gets a chance to trigger. Looking at the previous failures too, the OOM consistently happens somewhere in the TransposeOptimizerTests block (lines ~1200–1700).
So it looks like the cumulative memory pressure builds up before reaching the guard. Could you advise on how you'd like to proceed?

@tianleiwu

tianleiwu commented May 26, 2026

Copy link
Copy Markdown
Contributor

Hi @tianleiwu, the rerun failed again with the same ASan OOM, but the failing test was TransposeOptimizerTests.TestReduceMaxKeepdimsFalse (line 1265), which runs before TestDequantizeLinearNoAxis (line 3760), so the guard never gets a chance to trigger. Looking at the previous failures too, the OOM consistently happens somewhere in the TransposeOptimizerTests block (lines ~1200–1700). So it looks like the cumulative memory pressure builds up before reaching the guard. Could you advise on how you'd like to proceed?

@elwhyjay, The solution is either reduce memory usage of the test case, or reduce test parallel in ASan to avoid OOM.

Here is a PR for the later: #28675. You can merge it to your branch and test it.

auto-merge was automatically disabled May 27, 2026 01:45

Head branch was pushed to by a user without write access

@elwhyjay

Copy link
Copy Markdown
Contributor Author

Hi @tianleiwu, thank you for #28675! I've merged the latest main (which now includes your --test_parallel 4 change) and also reverted the ASan guard I had added to transpose_optimizer_test.cc since it's no longer needed. Could you trigger CI when you have a moment?

@tianleiwu

Copy link
Copy Markdown
Contributor

@elwhyjay, it seems that Asan pipeline still failed.

@elwhyjay

elwhyjay commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

HI @tianleiwu ! To reduce cumulative ASan pressure from this PR, I reduced the new alpha_not_one Gemm matrix to one representative uint8 QGemm path with int32 bias, while keeping bias/no-bias and output-q/no-output-q coverage in both default and fastmath tests. This should still exercise the selector behavior this PR changes, while avoiding repeated alpha coverage across every type combination. Hopefully this is a reasonable smaller test surface to try for the ASan pipeline.

@elwhyjay

elwhyjay commented Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

@tianleiwu
Thanks again for helping with the CI reruns here. I pushed one more small test-surface reduction in ea615ec.

The previous reduction did not clear the ASan job, but it does look like some progress: compared with the earlier log where ASan ran out around the transpose optimizer area, the latest run got further through onnxruntime_test_all before hitting the same 8GB ASan allocator limit.

This latest commit keeps the new alpha_not_one coverage on the representative uint8/int32-bias QGemm path, and additionally avoids repeating the same selector behavior across the larger shape and contrib-QDQ variants. Hopefully this preserves the regression coverage for the behavior changed in this PR while giving the ASan pipeline a smaller incremental surface to try.

@tianleiwu tianleiwu merged commit 5852730 into microsoft:main Jun 3, 2026
85 checks passed
tianleiwu added a commit that referenced this pull request Jun 4, 2026
# PR: Fix ASan OOM in QDQ Gemm transformer tests

## Description

PR #28131 ("Reject QDQ Gemm→QGemm fusion when alpha != 1 with bias")
added `alpha_not_one`
coverage to the QDQ Gemm fusion tests. This multiplied the number of
`TransformerTester`
session builds inside the already-large `Gemm_U8U8U8` test matrix and
pushed the
AddressSanitizer (ASan) build of `onnxruntime_test_all` over its
allocator limit, causing the
`windows_x64_asan` CI to fail with `AddressSanitizer: Out of memory. The
process has exhausted
8192MB for size class 8192`. This PR isolates the `alpha != 1` coverage
into small, dedicated
tests so the peak memory of any single test is reduced.

## Summary of Changes

| File | Change |
|------|--------|
| `onnxruntime/test/optimizer/qdq_transformer_test.cc` | Added an
`opset_version` parameter to `QDQTransformerGemmTests` (default `0` =
run opsets 12/18/19); replaced the three hardcoded `TransformerTester`
calls with a loop over the selected opset(s); removed the inline
`alpha_not_one` block from the templated `QDQTransformerGemmTests()`;
added a dedicated `TEST(QDQTransformerTests, Gemm_AlphaNotOne_U8U8U8)`
that runs only uint8/uint8/uint8 at opset 19. |
| `onnxruntime/test/optimizer/qdq_transformer_fastmath_test.cc` | Same
refactor for the fastmath variant; added `TEST(QDQTransformerTests,
Gemm_AlphaNotOne_U8U8U8_FastMath)` running only uint8/uint8/uint8 at
opset 19. |

The net effect is that the incremental `alpha_not_one` work added by
#28131 drops from 24
session builds (4 alpha variants × 3 opsets, in each of the regular and
fastmath files) to 8,
and is no longer part of the large `Gemm_U8U8U8` matrix test — directly
lowering the peak
memory consumed in a single test.

## Testing

- Build with `--enable_address_sanitizer` (the `windows_x64_asan`
configuration) and run
`onnxruntime_test_all`; confirm
`QDQTransformerTests.Gemm_AlphaNotOne_U8U8U8`,
`QDQTransformerTests.Gemm_AlphaNotOne_U8U8U8_FastMath`, and
`QDQTransformerTests.Gemm_U8U8U8`
  pass and the suite no longer hits the ASan OOM.
- Fusion behavior is unchanged: the same `alpha != 1` rejection logic is
still exercised, just
  with a narrower datatype/opset footprint.

## Motivation and Context

The ASan failure is the sanitizer's internal allocator size-class limit
(8 GB for
`size class 8192`), not a runner RAM cap that can simply be raised.
Loosening it via
`ASAN_OPTIONS` quarantine tuning would weaken the sanitizer's
bug-detection guarantees, so the
fix targets the test's memory footprint instead.

### Options considered

1. **`--test_parallel` (reduce CTest concurrency).** Lower the
parallelism in the ASan workflow
(e.g., `--test_parallel 1`) so fewer test binaries run concurrently.
This only addresses
cumulative/overlapping process memory; it does **not** reduce the peak
memory of a single
test, and it slows the CI down for every run. Rejected as a blunt,
non-durable workaround.

2. **Shard the ASan tests.** Split `onnxruntime_test_all` into N gtest
shards
(`GTEST_TOTAL_SHARDS` / `GTEST_SHARD_INDEX`) so the ASan allocator
resets between shards.
This helps with cumulative growth across the whole binary, but it still
does **not** reduce
the peak memory of any individual test — if one test alone approaches
the limit, sharding
   the binary will not help. Rejected for the same root-cause reason.

3. **Break the test into smaller tests (chosen).** Isolate the `alpha !=
1` coverage into
dedicated tests that run a single datatype (uint8/uint8/uint8) at a
single opset (19), and
remove the alpha cases from the large `Gemm_U8U8U8` matrix. This reduces
the work done in
the heaviest single test and addresses the peak-memory problem at its
source while keeping
   the same fusion behavior under test.

Reference: PR #28131 (merge commit `585273033e`).

## Checklist

- [x] Tests added/updated
- [ ] Documentation updated (if applicable)
- [x] No breaking changes (test-only change)
- [ ] CI passes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimization breaks QDQ-quantized Gemm graph when alpha != 1

3 participants