[SPARK-56355][SQL] Improve join stats estimation when equi-join keys lack column statistics by wecharyu · Pull Request #55195 · apache/spark

wecharyu · 2026-04-04T11:23:31Z

What changes were proposed in this pull request?

Improve join estimation when column statistics are missing for the join keys of an equi-join. Currently, the estimator can fall back to the Cartesian product T(A) * T(B), which often leads to severe overestimation.

We can instead use max(T(A), T(B)) as a more accurate estimate.
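
To make the difference between the two fallbacks concrete, here is a minimal standalone sketch. It is not the code from this patch or from Spark's estimator; the object name `JoinCardinalitySketch` and both method names are hypothetical, and row counts are modeled as plain `BigInt` values.

```scala
// Standalone illustration only; names here are hypothetical, not Spark APIs.
object JoinCardinalitySketch {

  /** Cartesian-product fallback: T(A) * T(B). */
  def cartesianEstimate(leftRows: BigInt, rightRows: BigInt): BigInt =
    leftRows * rightRows

  /** Proposed fallback when join-key column stats are missing: max(T(A), T(B)). */
  def maxRowsEstimate(leftRows: BigInt, rightRows: BigInt): BigInt =
    leftRows.max(rightRows)

  def main(args: Array[String]): Unit = {
    val a = BigInt(1000000) // T(A)
    val b = BigInt(1000)    // T(B)
    println(cartesianEstimate(a, b)) // prints 1000000000
    println(maxRowsEstimate(a, b))   // prints 1000000
  }
}
```

With these example inputs the Cartesian fallback is larger by a factor of T(B), which is the kind of gap that can mislead size-based planning decisions downstream.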

Why are the changes needed?

To make join estimation more reliable, especially when AQE re-optimizes the plan.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a unit test and tested on a production Spark job.

TPC-DS benchmark shows no performance regression:

Before

[info] TPCDS:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q24a                                             237404         237715         440          1.3         775.7       1.0X

After

[info] Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
[info] TPCDS:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] q24a                                             236725         237440        1011          1.3         773.5       1.0X

Was this patch authored or co-authored using generative AI tooling?

Cursor / Opus 4.6


wecharyu · 2026-04-05T16:44:16Z

cc @cloud-fan @wangyum

lakechd · 2026-04-16T20:34:36Z

 import org.apache.spark.sql.catalyst.dsl.plans._
 import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, AttributeMap, AttributeReference, Literal, SortOrder}
-import org.apache.spark.sql.catalyst.plans.{Inner, PlanTest}
+import org.apache.spark.sql.catalyst.plans.{Inner, LeftOuter, PlanTest}


Why LeftOuter and not Inner?

Test for both Inner and LeftOuter.

wecharyu · 2026-04-17T02:29:50Z

Gentle ping @cloud-fan, could you help take a look? The original overestimation could lead Gluten to choose the wrong build side and cause an OOM; details in apache/gluten#11774.

wecharyu added 2 commits April 4, 2026 19:15

[SPARK-56355][SQL] Improve join stats estimation when equi-join keys …

2feeeca

…lack column statistics

fix tests

f0c18fa

lakechd reviewed Apr 16, 2026

wecharyu wants to merge 2 commits into apache:master from wecharyu:enhance_join_estimate


2 participants
