[GLUTEN-12013][VL] Fix bloom-filter bytes corruption on whole-stage AQE fallback #12151

Open
brijrajk wants to merge 4 commits into apache:main from brijrajk:fix/12013-bloom-filter-stage-fallback

Conversation

@brijrajk

@brijrajk brijrajk commented May 27, 2026

Contributor

What changes are proposed in this pull request?

Fixes #12013

Root cause

When ExpandFallbackPolicy triggers a whole-stage AQE fallback it reverts to the plan captured before HeuristicTransform runs (i.e. before all pre-transform rewrites). BloomFilterMightContainJointRewriteRule was registered as a pre-transform Rule[SparkPlan], so its substitutions were silently undone in the fallback plan.

If Stage 0 (bloom_filter_agg subquery) had already executed natively it produced Velox-format bloom filter bytes. The vanilla BloomFilterMightContain in the fallen-back filter stage then called BloomFilterImpl.readFrom() on those bytes, throwing:

java.io.IOException: Unexpected Bloom filter version number (16777217)
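
The reported number is easier to interpret byte by byte. The following is a minimal sketch, assuming the version field is consumed as a 4-byte big-endian integer (which is how `java.io.DataInputStream.readInt` behaves); the object and method names are illustrative and not part of Gluten or Spark.

```scala
import java.nio.ByteBuffer

object VersionNumberDecode {
  // Encode an Int as the 4 bytes a big-endian reader would have consumed.
  // ByteBuffer uses big-endian byte order by default.
  def bigEndianBytes(value: Int): Seq[Int] =
    ByteBuffer.allocate(4).putInt(value).array().toSeq.map(_ & 0xff)

  def main(args: Array[String]): Unit = {
    val reported = 16777217
    println(f"0x$reported%08X")                      // hexadecimal form of the reported value
    println(bigEndianBytes(reported).mkString(" "))  // header bytes that produce the reported value
    println(bigEndianBytes(1).mkString(" "))         // header bytes for version number 1
  }
}
```

16777217 is 0x01000001, so the reader saw the leading bytes 01 00 00 01, whereas a big-endian version number of 1 would be 00 00 00 01. That is consistent with the bytes having been written in a different layout than the reader expects.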

A second issue caused BloomFilterMightContainJointRewriteRule to rewrite every BloomFilterAggregate it encountered — including the standalone DataFrame.stat.bloomFilter() path. That API collects the aggregate output bytes and passes them directly to BloomFilter.readFrom(), which expects Spark-native format; receiving Velox-format bytes surfaced as GlutenDataFrameStatSuite - Bloom filter failing in CI.

A third issue emerged in Spark 4.1 CI: in Spark 4.1, Gluten's injectOptimizerRule runs after InjectRuntimeFilter, so our optimizer rule sees DPP-injected bloom filters (BloomFilterMightContain(ScalarSubquery(...), xxhash64(col, seed))). The previous catch-all pattern rewrote them to VeloxBloomFilterMightContain, which FilterExecTransformer rejects as a non-native Velox expression — causing the filter stage and its subquery aggregate to fall back to JVM ObjectHashAggregate, changing the physical plan structure and breaking 19 TPC-DS plan-stability golden files in Spark 4.1.

Fix

Move BloomFilterMightContainJointRewriteRule from injectPreTransform (Rule[SparkPlan]) to injectOptimizerRule (Rule[LogicalPlan]), modelled after CollectRewriteRule. Running as an optimizer rule ensures both substitutions are baked into the originalPlan snapshot before ExpandFallbackPolicy takes it. Both sides of the bloom-filter pair therefore always produce and consume the same byte format, regardless of which stages fall back and in what order.
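
The ordering argument can be illustrated with a toy model. This is only a sketch under simplifying assumptions: a plan is a flat list of names, `originalPlan` is captured after the optimizer rules and before the pre-transform rules, and a whole-stage fallback simply returns that snapshot. None of the names below are Gluten APIs.

```scala
object FallbackSnapshotSketch {
  // A plan is modelled as a flat list of operator/expression names.
  type Plan = List[String]
  type PlanRule = Plan => Plan

  // Stand-in for the joint rewrite: swap the vanilla expression for the Velox one.
  val jointRewrite: PlanRule = plan =>
    plan.map {
      case "might_contain" => "velox_might_contain"
      case other           => other
    }

  // Returns the plan that would actually execute in this simplified model.
  def executedPlan(
      input: Plan,
      optimizerRules: Seq[PlanRule],
      preTransformRules: Seq[PlanRule],
      wholeStageFallback: Boolean): Plan = {
    val optimized = optimizerRules.foldLeft(input)((p, rule) => rule(p))
    // Snapshot taken before any pre-transform rewrite runs.
    val originalPlan = optimized
    val transformed = preTransformRules.foldLeft(originalPlan)((p, rule) => rule(p))
    // A whole-stage fallback discards everything done after the snapshot.
    if (wholeStageFallback) originalPlan else transformed
  }
}
```

In this model, registering `jointRewrite` as a pre-transform rule loses the substitution on fallback, while registering it as an optimizer rule keeps it, which is the behaviour the fix relies on.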

The rule only rewrites BloomFilterAggregate when it appears inside the ScalarSubquery of a BloomFilterMightContain. Standalone usages such as DataFrame.stat.bloomFilter() are intentionally left untouched so that collected bytes remain in Spark-native format.

DPP-injected bloom filters are also deliberately excluded: the rule only matches BloomFilterMightContain(subq: ScalarSubquery, v: Attribute). DPP filters (from InjectRuntimeFilter) always hash the join key — xxhash64(col, seed) — before passing it to BloomFilterMightContain, so their value expression is never a bare Attribute and the pattern never matches.
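
The matching behaviour described above can be sketched with a small stand-in expression tree. The case classes below are hypothetical simplifications that only borrow the Catalyst names (the real `ScalarSubquery` wraps a logical plan, not an expression), so this shows the shape of the guard rather than the actual rule.

```scala
object JointRewriteSketch {
  sealed trait Expr
  case class Attribute(name: String) extends Expr
  case class XxHash64(child: Expr, seed: Long) extends Expr
  case class BloomFilterAggregate(child: Expr) extends Expr
  case class VeloxBloomFilterAggregate(child: Expr) extends Expr
  case class ScalarSubquery(plan: Expr) extends Expr
  case class BloomFilterMightContain(filter: Expr, value: Expr) extends Expr
  case class VeloxBloomFilterMightContain(filter: Expr, value: Expr) extends Expr

  def rewrite(e: Expr): Expr = e match {
    // Only a subquery-built filter probed with a bare column reference matches,
    // and both sides of the pair are rewritten together.
    case BloomFilterMightContain(ScalarSubquery(BloomFilterAggregate(c)), v: Attribute) =>
      VeloxBloomFilterMightContain(ScalarSubquery(VeloxBloomFilterAggregate(c)), v)
    // Hashed DPP probes and standalone aggregates fall through unchanged.
    case other => other
  }
}
```

With this shape, a probe value of `XxHash64(Attribute("k"), 42L)` never satisfies the `v: Attribute` type pattern, and a top-level `BloomFilterAggregate` is never visited by the first arm.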

Files changed

  • BloomFilterMightContainJointRewriteRule.scala — Rewritten as Rule[LogicalPlan]; rewrites BloomFilterAggregate only when inside a ScalarSubquery of BloomFilterMightContain whose value is a plain column reference (Attribute)
  • VeloxRuleApi.scala — Moves registration from injectPreTransform to injectOptimizerRule
  • GlutenBloomFilterFallbackSuite.scala (new, gluten-ut/test) — Four regression tests covering all fallback scenarios

How was this patch tested?

Four regression tests were added to GlutenBloomFilterFallbackSuite in gluten-ut/test, guarded with requireBloomFilterAggMightContainJointFallback().

Test 1 — only filter stage falls back (threshold=2)

  • COLUMNAR_FILTER_ENABLED=false forces FilterExec to fall back (transition cost=2 meets threshold)
  • bloom_filter_agg subquery (cost=1) continues to run natively and emits Velox-format bytes
  • Asserts velox_might_contain appears in the optimized plan

Test 2 — both stages fall back (threshold=1)

  • COLUMNAR_WHOLESTAGE_FALLBACK_THRESHOLD=1 causes Stage 0 (bloom_filter_agg) to also fall back
  • Both sides execute via JNI in JVM row-mode, producing and consuming Velox-format bytes consistently

Test 3 — DataFrame.stat.bloomFilter() produces Spark-readable bytes

  • Calls df.stat.bloomFilter("col", 1000L, 0.01) directly and asserts no IOException and mightContainLong returns the expected result
  • Regression guard for the GlutenDataFrameStatSuite - Bloom filter CI failure

Test 4 — native bloom filter disabled (enableNativeBloomFilter=false)

  • Exercises the early-exit path of the optimizer rule
  • Asserts the query returns correct results and the optimized plan does not contain velox_might_contain

All tests pass, verified locally against the gluten-ut/spark40 module:

| Suite | Tests | Result |
|---|---|---|
| GlutenDataFrameStatSuite | 25/25 | passed (was failing in CI) |
| GlutenBloomFilterFallbackSuite | 4/4 | passed |
| GlutenBloomFilterAggregateQuerySuite | 14/14 | passed |
| GlutenInjectRuntimeFilterSuite | 13/13 | passed |

Was this patch authored or co-authored using generative AI tooling?

Yes. Claude Code (claude-sonnet-4-6) was used as an AI assistant during development.

@github-actions github-actions Bot added CORE works for Gluten Core VELOX labels May 27, 2026
@brijrajk brijrajk force-pushed the fix/12013-bloom-filter-stage-fallback branch 2 times, most recently from 4a56662 to 9bf19dc Compare May 27, 2026 11:38
@github-actions

Run Gluten Clickhouse CI on x86

@brijrajk

Contributor Author

Could a maintainer please remove the CORE label? All three changed files are Velox-backend-specific (backends-velox/ and gluten-ut/spark40/) — no common core code is touched. VELOX label only is correct. Thanks!

@brijrajk

Contributor Author

Gentle ping for a maintainer review. The link-referenced-issues CI check that initially failed has since re-run successfully — all checks are green.

Also re-raising: could a maintainer remove the CORE label? The three changed files are all Velox-backend-specific (backends-velox/ and gluten-ut/spark40/) — no common core code is touched, so VELOX label only is correct.

Copilot AI left a comment

Contributor

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@philo-he

Member

Gentle ping for a maintainer review. The link-referenced-issues CI check that initially failed has since re-run successfully — all checks are green.

Also re-raising: could a maintainer remove the CORE label? The three changed files are all Velox-backend-specific (backends-velox/ and gluten-ut/spark40/) — no common core code is touched, so VELOX label only is correct.

@brijrajk, thanks for the PR. Could you rebase the code to see if the CI failures go away?

@brijrajk brijrajk force-pushed the fix/12013-bloom-filter-stage-fallback branch from 9bf19dc to 009a9a8 Compare June 11, 2026 05:38
@brijrajk

Contributor Author

Done — rebased onto current main and force-pushed. Fresh CI triggered.

@brijrajk brijrajk force-pushed the fix/12013-bloom-filter-stage-fallback branch from 009a9a8 to 3148dbe Compare June 11, 2026 05:50
@philo-he philo-he requested a review from Copilot June 11, 2026 16:30
@philo-he philo-he removed the CORE works for Gluten Core label Jun 11, 2026

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment on lines +44 to +48
override def apply(plan: SparkPlan): SparkPlan = {
  if (!BackendsApiManager.getSettings.requireBloomFilterAggMightContainJointFallback()) {
    return plan
  }
  plan match {
Comment on lines +173 to +177
val df = spark.sql(sqlString)
// Must not throw java.io.IOException: Unexpected Bloom filter version number (16777217)
df.collect
// All 200003 rows match the bloom filter built from the same data.
assert(df.count() == 200003L)
@philo-he

Member

@brijrajk, could you first check if Copilot's comments make sense?

@github-actions github-actions Bot added the CORE works for Gluten Core label Jun 11, 2026
@brijrajk

Contributor Author

Thanks for flagging this, @philo-he!

Both of Copilot's comments were valid:

1. Patcher active when native bloom filter is disabled

When spark.gluten.sql.native.bloomFilter=false, Stage 0 falls back to Spark and produces Spark-format bytes. The joint-fallback rule still wraps Stage 1 in a FallbackNode, so the patcher was incorrectly rewriting it to VeloxBloomFilterMightContain — which would cause the same IOException the patcher was introduced to fix, just from the opposite trigger.

Added a second guard: if (!GlutenConfig.get.enableNativeBloomFilter) return plan. This mirrors the existing guard already in BloomFilterMightContainJointRewriteRule.

2. df.collect + df.count() runs the query twice

Combined into assert(df.collect().length == 200003L) — single execution, same failure signal if the IOException is thrown.
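
The double execution is easy to see with a stand-in for a lazily evaluated query. `CountingQuery` below is a hypothetical class, not a Spark type; it only assumes that every action re-runs the query, as `collect` and `count` do on a DataFrame without caching.

```scala
object SingleExecutionSketch {
  // Hypothetical stand-in for a lazily evaluated query: every action re-runs it.
  final class CountingQuery(rows: Seq[Int]) {
    private var runs = 0
    def executions: Int = runs
    def collect(): Array[Int] = { runs += 1; rows.toArray }
    def count(): Long = { runs += 1; rows.size.toLong }
  }
}
```

Calling `collect()` and then `count()` leaves `executions` at 2, while asserting on `collect().length` alone leaves it at 1 and still fails if the action throws.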

@philo-he

Member

@brijrajk, thanks for the update. Could you check if my following understanding is correct?

Besides the spark.gluten.sql.native.bloomFilter=false setting, which makes the bloom filter fall back in stage 0, there's another case: the fallback policy can also cause it to fall back. In that case, if we rely solely on checking that config, could it lead to an incompatibility issue in stage 1?

@brijrajk

Contributor Author

@philo-he You are absolutely right. We confirmed it with a test case.

How threshold and cost work

ExpandFallbackPolicy counts the number of columnar↔row conversion boundaries inside a stage. If that count (cost) meets COLUMNAR_WHOLESTAGE_FALLBACK_THRESHOLD, the entire stage is wrapped in a FallbackNode and runs as plain Spark.

| Scenario | Threshold | Stage 0 cost | Stage 1 cost | Outcome |
|---|---|---|---|---|
| Original fix (PR as-is) | 2 | 1 → native ✓ | 2 → whole-stage fallback | Stage 0 Velox bytes, Stage 1 JVM; patcher correct |
| Your scenario | 1 | 1 → whole-stage fallback | ≥ 1 → whole-stage fallback | Stage 0 Spark bytes, Stage 1 JVM; patcher misfires |
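
The decision rule described above can be written down as a small sketch. It assumes the cost is simply the number of adjacent operators whose execution mode differs and that "meets the threshold" means greater than or equal; the types and function names are illustrative, not the ExpandFallbackPolicy implementation.

```scala
object FallbackCostSketch {
  sealed trait Mode
  case object Columnar extends Mode
  case object Row extends Mode

  // Cost = number of adjacent operator pairs whose execution mode differs.
  def transitionCost(stage: Seq[Mode]): Int =
    stage.zip(stage.drop(1)).count { case (a, b) => a != b }

  // The whole stage falls back once the cost reaches the threshold.
  def fallsBack(cost: Int, threshold: Int): Boolean = cost >= threshold
}
```

Under these assumptions, a cost of 1 stays native at threshold 2 but falls back at threshold 1, and a cost of 2 falls back in both cases, matching the two scenarios discussed here.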

Test case confirming the failure

testGluten(
  "Test bloom_filter_agg whole-stage fallback when both stages fall back",
  Issue12013) {
  ...
  if (BackendsApiManager.getSettings.requireBloomFilterAggMightContainJointFallback()) {
    // threshold=1: Stage 0's inherent transition cost of 1 meets the threshold, so
    // ExpandFallbackPolicy promotes Stage 0 to a whole-stage fallback as well.
    // Stage 0 runs as Spark and produces Spark-format bytes. Stage 1 also falls back.
    // The patcher must NOT rewrite BloomFilterMightContain -> VeloxBloomFilterMightContain
    // in this case.
    withSQLConf(
      GlutenConfig.COLUMNAR_FILTER_ENABLED.key -> "false",
      GlutenConfig.COLUMNAR_WHOLESTAGE_FALLBACK_THRESHOLD.key -> "1",
      SQLConf.ANSI_ENABLED.key -> "false"
    ) {
      val df = spark.sql(sqlString)
      assert(df.collect().length == 200003L)
    }
  }
}

Output

- Gluten - Test bloom_filter_agg whole-stage fallback when both stages fall back *** FAILED ***
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 1 times,
  most recent failure: Lost task 0.0 in stage 7.0: org.apache.gluten.exception.GlutenException:
  Exception: VeloxUserError
  Error Source: USER
  Error Code: INVALID_ARGUMENT
  Reason: (1 vs. 0)
  Retriable: False
  Expression: kBloomFilterV1 == version
  Function: mayContain
  File: velox/common/base/BloomFilter.h
  Line: 70

    at org.apache.gluten.utils.VeloxBloomFilterJniWrapper.mightContainLongOnSerializedBloom(Native Method)
    at org.apache.gluten.utils.VeloxBloomFilter.mightContainLongOnSerializedBloom(VeloxBloomFilter.java:163)
    ...

Tests: succeeded 1, failed 1

kBloomFilterV1 == version failing with (1 vs. 0) is the exact version-byte mismatch: Velox's reader expected its own format (1) but got Spark's format (0).
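
The `(1 vs. 0)` comparison can be reproduced in isolation. This is a sketch under two assumptions inferred from the error messages in this thread rather than from the source code: the Spark-format header stores the version as a 4-byte big-endian int, and the native reader compares only a leading version byte. The object and method names are illustrative.

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream}

object VersionByteMismatch {
  // Header as produced by a writer that stores the version as a 4-byte big-endian int.
  def bigEndianIntHeader(version: Int): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new DataOutputStream(buffer)
    out.writeInt(version)
    out.flush()
    buffer.toByteArray
  }

  // Reader that, by assumption, treats only the first byte as the version.
  def firstByteVersion(bytes: Array[Byte]): Int = bytes(0) & 0xff
}
```

For a header written with version 1, `firstByteVersion` yields 0, so a check expecting 1 fails in the same way as the `kBloomFilterV1 == version` assertion above.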

Proposed fix

The root cause is that enableNativeBloomFilter answers "is native bloom filter on in config?" but the right question is "did Stage 0 actually run natively?" The fix is to make the guard structural: inside patchBloomFilterMightContain, before rewriting, inspect the physical plan referenced by bloomFilterExpression. If Stage 0's plan is itself a FallbackNode, it will produce Spark-format bytes and Stage 1 must be left with the vanilla BloomFilterMightContain.

Do you see any concerns with this approach, or is there a cleaner way you would handle it?
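
A minimal sketch of the structural guard proposed here, using hypothetical stand-in plan types rather than Gluten's classes. The thread later drops the patcher in favour of a logical rule, so this is illustrative only.

```scala
object StructuralGuardSketch {
  sealed trait PhysicalPlan
  case class NativeBloomFilterAgg(column: String) extends PhysicalPlan
  case class VanillaBloomFilterAgg(column: String) extends PhysicalPlan
  case class FallbackNode(child: PhysicalPlan) extends PhysicalPlan

  // Rewrite the probe side only when the build side (Stage 0) is not wrapped
  // in a FallbackNode, i.e. when it is expected to emit Velox-format bytes.
  def shouldRewriteMightContain(stage0: PhysicalPlan): Boolean = stage0 match {
    case FallbackNode(_) => false
    case _               => true
  }
}
```

The point of the design is that the decision follows what Stage 0 will actually do, instead of what the configuration flag says it is allowed to do.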

@brijrajk

brijrajk commented Jun 18, 2026

Contributor Author

@philo-he Just a gentle ping — we've already added a failing test that reproduces your scenario (see above). Does the proposed structural fix approach look good to you? Ready to implement it as soon as you confirm.

@brijrajk brijrajk force-pushed the fix/12013-bloom-filter-stage-fallback branch 3 times, most recently from 25c7fd9 to 2727774 Compare June 19, 2026 04:23

@zhztheplayer zhztheplayer left a comment

Member

@brijrajk thanks.

* This rule runs as a second fallback-policy pass, after `ExpandFallbackPolicy`, so it only acts
* when the plan is already wrapped in a `FallbackNode`.
*/
case class BloomFilterMightContainFallbackPatcher() extends Rule[SparkPlan] {

Member

I don't recall why BloomFilterMightContainJointRewriteRule was made a physical rule, but could you try turning it into a logical rule anyway, so that a patcher rule like this can be avoided?

Contributor Author

Done. BloomFilterMightContainJointRewriteRule is now a Rule[LogicalPlan] registered via injectOptimizerRule, modelled after CollectRewriteRule. The patcher is gone. Running as an optimizer rule ensures both substitutions (BloomFilterAggregate → VeloxBloomFilterAggregate and BloomFilterMightContain → VeloxBloomFilterMightContain) are captured in the originalPlan snapshot before ExpandFallbackPolicy takes it, so the byte format stays consistent regardless of which stages fall back. This also fixes the threshold=1 case where Stage 0 itself falls back (the patcher would incorrectly rewrite the filter side while Stage 0 was producing Spark-format bytes).

@brijrajk brijrajk force-pushed the fix/12013-bloom-filter-stage-fallback branch from 2727774 to cac891f Compare June 19, 2026 17:41
@brijrajk brijrajk force-pushed the fix/12013-bloom-filter-stage-fallback branch from 3e210d8 to ee314fe Compare June 22, 2026 16:26
@github-actions

Run Gluten Clickhouse CI on x86

@philo-he philo-he left a comment

Member

@brijrajk, Thank you so much for your work. LGTM. One extra minor comment. Please also check whether the CI failures are related.
cc @zhztheplayer

// plan still uses VeloxBloomFilterMightContain so the JVM filter reads Velox-format bytes.
testWithMinSparkVersion(
"GLUTEN-12013: bloom_filter_agg whole-stage fallback does not corrupt bloom filter bytes",
"3.3") {

Member

Since 3.3 is already Gluten's minimum supported version, it might be unnecessary to use "testWithMinSparkVersion". Ditto for other tests. Thanks.

Contributor Author

Done, replaced all three testWithMinSparkVersion calls with the plain variant. Re-triggered CI to check whether the spark34/spark40/spark41 failures are related to this change or are pre-existing flaky tests.

@github-actions

Run Gluten Clickhouse CI on x86

@brijrajk brijrajk force-pushed the fix/12013-bloom-filter-stage-fallback branch from bfe384f to 7cc6faf Compare June 23, 2026 03:49
@github-actions

Run Gluten Clickhouse CI on x86

@zhztheplayer

Member

@brijrajk would you help check whether CI failures are related?

@zhztheplayer zhztheplayer left a comment

Member

I ran a local test and there are unexpected fallbacks on bloom filter operators with this PR.

@brijrajk can you run some local end-to-end tests to check?

@brijrajk brijrajk left a comment

Contributor Author

Thanks for running the local test, @zhztheplayer.

The unexpected fallbacks were caused by a second bug in the original patch: BloomFilterMightContainJointRewriteRule was rewriting every BloomFilterAggregate it encountered — including standalone usages such as DataFrame.stat.bloomFilter(). That API collects the aggregate output bytes and feeds them directly to BloomFilter.readFrom(), which expects Spark-native format. When Velox-format bytes were produced instead, the session fell back with java.io.IOException: Unexpected Bloom filter version number (visible as the GlutenDataFrameStatSuite - Bloom filter CI failure).

Fix (pushed): the rule now only rewrites BloomFilterAggregate when it appears inside the ScalarSubquery of a BloomFilterMightContain. Standalone aggregates are left untouched.

I ran the following suites locally (Spark 4.0, Velox backend) against the updated patch:

| Suite | Tests | Result |
|---|---|---|
| GlutenDataFrameStatSuite | 25/25 | passed (was failing) |
| GlutenBloomFilterFallbackSuite | 4/4 | passed |
| GlutenBloomFilterAggregateQuerySuite | 14/14 | passed |
| GlutenInjectRuntimeFilterSuite | 13/13 | passed |

A new regression test (GLUTEN-12013: DataFrame.stat.bloomFilter() produces Spark-readable bytes) has also been added to GlutenBloomFilterFallbackSuite to guard against this in future.

@github-actions

Run Gluten Clickhouse CI on x86

@brijrajk brijrajk force-pushed the fix/12013-bloom-filter-stage-fallback branch from 8639584 to 7323e4e Compare June 23, 2026 15:50
@github-actions

Run Gluten Clickhouse CI on x86

@brijrajk

brijrajk commented Jun 23, 2026

Contributor Author

@zhztheplayer The branch has been rebased onto the latest upstream main (as of 2026-06-24). CI is re-running.

The spark-test-spark40 and spark-test-spark41 failures seen in recent runs are pre-existing infrastructure flakiness unrelated to this PR — the exact same GlutenCSVv1Suite / GlutenTextSuite failures appear on unrelated branches (e.g. gluten-12348) with no connection to bloom filter code. These appear to have been introduced by a recent upstream commit unrelated to this change.

Our bloom filter-specific test suites all pass locally (Spark 4.0, Velox backend):

| Suite | Tests | Result |
|---|---|---|
| GlutenDataFrameStatSuite | 25/25 | passed (was failing before the fix) |
| GlutenBloomFilterFallbackSuite | 4/4 | passed |
| GlutenBloomFilterAggregateQuerySuite | 14/14 | passed |
| GlutenInjectRuntimeFilterSuite | 13/13 | passed |

The root fix (moving BloomFilterMightContainJointRewriteRule to injectOptimizerRule) is unchanged and addresses the bytes-corruption issue described in the PR.

@brijrajk brijrajk force-pushed the fix/12013-bloom-filter-stage-fallback branch from 7323e4e to 299f4f8 Compare June 24, 2026 16:57
@github-actions

Run Gluten Clickhouse CI on x86

@zhztheplayer

Member

@brijrajk thanks for the update. I've rerun the CI.

…QE fallback

`BloomFilterMightContainJointRewriteRule` previously rewrote every
`BloomFilterAggregate` it encountered, including standalone usages such
as `DataFrame.stat.bloomFilter()`.  That API collects the aggregate
output bytes and passes them directly to `BloomFilter.readFrom()`, which
expects Spark-native format; receiving Velox-format bytes caused
`java.io.IOException: Unexpected Bloom filter version number` (surfaced
as a CI failure in `GlutenDataFrameStatSuite - Bloom filter`).

Fix: only rewrite `BloomFilterAggregate` when it appears inside the
`ScalarSubquery` of a `BloomFilterMightContain`.  Standalone aggregates
are left untouched so that collected bytes remain in Spark-native format.

Add a regression test (`GlutenBloomFilterFallbackSuite`) to guard
against reintroducing this regression.

Local test results (Spark 4.0, Velox backend):
- GlutenDataFrameStatSuite            : 25/25 passed (was failing)
- GlutenBloomFilterFallbackSuite      :  4/4  passed
- GlutenBloomFilterAggregateQuerySuite: 14/14 passed
- GlutenInjectRuntimeFilterSuite      : 13/13 passed
@brijrajk brijrajk force-pushed the fix/12013-bloom-filter-stage-fallback branch from 299f4f8 to 9d096a3 Compare June 26, 2026 03:55
@github-actions

Run Gluten Clickhouse CI on x86

@brijrajk

brijrajk commented Jun 26, 2026

Contributor Author

CI failure explanation — spark-test-spark40 / spark-test-spark41

The 3 failing checks (spark-test-spark40 ×2, spark-test-spark41) are all caused by a pre-existing GlutenTPCHPlanStabilitySuite tpch/q19 failure, unrelated to this PR's bloom filter changes.

Root cause: GlutenPlanStabilitySuite.glutenNormalizeIds() uses a regex that matches any #<digits> pattern, including TPC-H string literals. The p_brand filter in q19 uses the values Brand#11, Brand#12, Brand#13. Over the 264 commits since the golden file was added in #11805 (c37fee4e5, 2026-03-24), new optimizer rules shifted the ExprId counter so Brand#12 now normalizes to Brand#6, causing a spurious plan mismatch. The suite code itself warns about this at lines 67–68:

"Running all suites together in one JVM is recommended to avoid ExprId normalization issues where string constants (e.g., Brand#23 in TPCH q19) may collide with ExprId numbers."

Evidence this is pre-existing (not introduced by this PR):

Ran GlutenTPCHPlanStabilitySuite on main at commit 6097b59a6 (2026-06-25, [MINOR][VL] Build Arrow 18 with patch for Power #12344) — without this PR applied:

Tests: succeeded 21, failed 1  ← q19 fails on main too
BUILD FAILURE

This PR's bloom filter injectOptimizerRule does not affect q19 — the rule finds no BloomFilterMightContain in q19's plan and returns it unchanged.

Why this was only exposed by this PR:

spark-test-spark40 is only triggered in CI when Velox backend Scala files are modified. This PR touches VeloxRuleApi.scala and BloomFilterMightContainJointRewriteRule.scala — so spark-test-spark40 runs here. Most other PRs modify native C++ code, documentation, or non-Velox modules, so they never trigger this check and the stale q19 golden file goes unnoticed.

Fix: Opened #12374 (tracks #12375) to refresh the q19 golden file. Once that merges, these CI checks should go green.
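
The collision described in this comment can be sketched as follows. The numbering scheme (each distinct `#<digits>` token is replaced by its order of first appearance) is an assumption made for illustration and is not taken from `glutenNormalizeIds()`; only the `#\d+` pattern comes from the description above.

```scala
import scala.collection.mutable
import scala.util.matching.Regex

object NormalizeIdsSketch {
  // Matches ExprId suffixes such as "#40", but also literals such as "Brand#12".
  private val idPattern: Regex = "#\\d+".r

  // Assumed scheme: renumber each distinct match by order of first appearance.
  def normalize(plan: String): String = {
    val seen = mutable.HashMap.empty[String, Int]
    idPattern.replaceAllIn(plan, m => {
      val index = seen.getOrElseUpdate(m.matched, seen.size)
      "#" + index
    })
  }
}
```

Because the literal is renumbered along with real ExprIds, adding one more ExprId earlier in the plan changes what the literal normalizes to, which is enough to invalidate a golden file that was generated before the shift.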

@brijrajk

Contributor Author

Temporarily cherry-picked the q19 golden file fix from #12374 onto this branch (commit c06593c96) to verify that the CI failures go green. This commit will be dropped once #12374 merges into main and this branch is rebased.

If CI passes with this commit, it confirms that the only blocker was the pre-existing q19 plan stability issue and the bloom filter fix itself is clean.

@github-actions

Run Gluten Clickhouse CI on x86

brijrajk and others added 2 commits June 26, 2026 13:39
The ExprId normalizer in GlutenPlanStabilitySuite uses regex `#\d+`
which inadvertently matches TPC-H string literals such as Brand#11,
Brand#12, Brand#13 (p_brand values in q19's filter). Over the 264
commits since the golden file was added in apache#11805, new optimizer rules
shifted the ExprId counter so Brand#12 now normalizes to Brand#6 and
_pre_1#14 to _pre_1#13, causing a spurious plan mismatch.

Regenerated by running GlutenTPCHPlanStabilitySuite with
SPARK_GENERATE_GOLDEN_FILES=1. Only q19/explain.txt changes; simplified.txt
and all other queries are unaffected.

Verified: q19 fails on main without this fix (21/22); passes with it (22/22).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lters

In Spark 4.1, the optimizer processes BloomFilterMightContainJointRewriteRule
after InjectRuntimeFilter, so the rule sees DPP-injected bloom filters
(which use xxhash64(col, seed) as their value expression).  Rewriting those
to VeloxBloomFilterMightContain caused FilterExecTransformer to fail Velox
validation and fall back to JVM, changing the physical plan structure and
breaking 19 TPC-DS plan-stability golden files across v1.4, v2.7, and
modified suites (q2, q10, q16, q24a, q24b, q32, q37, q40, q59, q69, q80,
q82, q85, q92, q94, q95, q10a, q64, q80a).

Fix: guard the first arm of transformAllExpressions with `v: Attribute`.
DPP bloom filters always hash the join key — xxhash64(col, seed) — so
their value is never a bare Attribute and the guard skips them cleanly.
User-facing bloom filters (might_contain(subquery, col)) pass a plain
column reference and continue to be rewritten correctly.

Also guard the catch-all arm against ScalarSubquery to be consistent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@brijrajk brijrajk force-pushed the fix/12013-bloom-filter-stage-fallback branch from c06593c to aca2904 Compare June 26, 2026 08:10
@github-actions

Run Gluten Clickhouse CI on x86

Scalastyle nonascii.message rule rejects Unicode em-dash (U+2014).
Replace all occurrences in docstring and inline comments with ASCII --.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
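
A small sketch of the kind of check and replacement this commit describes. The helper names are hypothetical, and the character is written as a Unicode escape so that the snippet itself stays ASCII.

```scala
object AsciiCommentSketch {
  // True when every character is within the 7-bit ASCII range.
  def isAscii(text: String): Boolean = text.forall(_ <= 0x7f)

  // Replace the Unicode em-dash (U+2014) with an ASCII double hyphen.
  def replaceEmDash(text: String): String = text.replace("\u2014", "--")
}
```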
@github-actions

Run Gluten Clickhouse CI on x86


Labels

CORE works for Gluten Core VELOX

Development

Successfully merging this pull request may close these issues.

Fail to read the native bloom_filter when the stage fallback to java

5 participants