
Commit 96ebbaa

JkSelf authored and dongjoon-hyun committed
[SPARK-56485][SQL] Fix RowCount Estimation in CBO to avoid unintended BroadcastHashJoin
### What changes were proposed in this pull request?

When running TPC-DS Q4 on a 1TB TPC-DS dataset with CBO enabled and `spark.sql.autoBroadcastJoinThreshold` set to `10MB`, the query fails with a SparkException:

```
Py4JJavaError: An error occurred while calling o596.collectToPython.
: org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8.0 GiB: 10.7 GiB
	at org.apache.gluten.backendsapi.velox.VeloxSparkPlanExecApi.createBroadcastRelation(VeloxSparkPlanExecApi.scala:850)
	at org.apache.spark.sql.execution.ColumnarBroadcastExchangeExec.$anonfun$relationFuture$2(ColumnarBroadcastExchangeExec.scala:79)
	at org.apache.gluten.utils.Arm$.withResource(Arm.scala:25)
	at org.apache.gluten.metrics.GlutenTimeMetric$.millis(GlutenTimeMetric.scala:37)
	at org.apache.spark.sql.execution.ColumnarBroadcastExchangeExec.$anonfun$relationFuture$1(ColumnarBroadcastExchangeExec.scala:66)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$2(SQLExecution.scala:230)
	at org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:225)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
```

The error indicates that Spark attempted to broadcast a table far larger than the 8 GiB hard limit (actual size ~10.7 GiB). Notably, the query runs successfully when CBO is disabled.

The issue stems from a bug in Spark's CBO filter estimation. When processing equality filters (e.g., `d_year = 2001`), the CBO incorrectly estimates the rowCount as 0. Because the estimated row count is 0, the optimizer concludes that the join result will be very small and chooses a BroadcastHashJoin.
However, the actual data size is roughly 10.7 GiB, which exceeds Spark's hard limit for broadcasting.

The root of this miscalculation is in [FilterEstimation.scala](https://github.com/apache/spark/blob/15ffa544ca53cd9f8a25baaf11fa0171dac7c85f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala#L347). Under normal circumstances, Spark should use column statistics to estimate selectivity. In this case, although column information is present, most of the statistical fields (min, max, distinct count, etc.) are null/None, except for versioning information.

```
Filter (isnotnull(d_year#7208) AND (d_year#7208 = 2001)), Statistics(sizeInBytes=1.0 B, rowCount=0)
+- RelationV2[d_date_sk#7202, d_date_id#7203, d_date#7204, d_month_seq#7205, d_week_seq#7206, d_quarter_seq#7207, d_year#7208, d_dow#7209, d_moy#7210, d_dom#7211, d_qoy#7212, d_fy_year#7213, d_fy_quarter_seq#7214, d_fy_week_seq#7215, d_day_name#7216, d_quarter_name#7217, d_holiday#7218, d_weekend#7219, d_following_holiday#7220, d_first_dom#7221, d_last_dom#7222, d_same_day_ly#7223, d_same_day_lq#7224, d_current_day#7225, ... 4 more fields] spark_catalog.wxd_icebergtpcdsdb1000g.date_dim, Statistics(sizeInBytes=20.1 MiB, rowCount=7.30E+4)
```

As a result, Spark fails to enter the correct evaluation logic in [lines 310-313](https://github.com/apache/spark/blob/15ffa544ca53cd9f8a25baaf11fa0171dac7c85f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala#L310-L313) and defaults to an incorrect estimation.
```
// Example of the empty stats being passed:
d_year#20690 -> ColumnStat(None,None,None,None,None,None,None,2)
```

### Why are the changes needed?

Fixes the broadcast OOM failure caused by the incorrect zero row-count estimate.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added new unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #55350 from JkSelf/fix-equal-filter-estimation.

Authored-by: Ke Jia <ke.jia@ibm.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
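The corrected behavior can be illustrated outside Spark: when min/max are absent, an equality predicate should fall back to `1/ndv` selectivity (or "unknown") rather than conclude the range excludes the literal. The following is a minimal standalone sketch; `ColStat` and `equalitySelectivity` are simplified stand-ins invented for illustration, not Spark's actual `ColumnStat`/`FilterEstimation` API:

```scala
// Simplified stand-in for Spark's ColumnStat: every field is optional.
case class ColStat(
    distinctCount: Option[Long] = None,
    min: Option[Int] = None,
    max: Option[Int] = None)

// Selectivity of `col = literal`, mirroring the patched logic:
// - with min/max present: 0.0 if the literal is out of range, else 1/ndv
// - with min/max missing: the range is unknown, so still fall back to 1/ndv
//   (the buggy path effectively treated this case as "not in range" -> 0 rows)
// - with no ndv either: None, so the caller can apply a default selectivity
def equalitySelectivity(stat: ColStat, literal: Int): Option[Double] = {
  val inRange = (stat.min, stat.max) match {
    case (Some(lo), Some(hi)) => literal >= lo && literal <= hi
    case _ => true // min/max unknown: the literal may be in range
  }
  if (!inRange) Some(0.0)
  else stat.distinctCount.filter(_ > 0).map(ndv => 1.0 / ndv.toDouble)
}
```

For example, `equalitySelectivity(ColStat(distinctCount = Some(10L)), 2)` yields `Some(0.1)`, so a 73,000-row `date_dim` no longer collapses to an estimated 0 rows just because min/max were never collected.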
1 parent 5d491f6 commit 96ebbaa

2 files changed

Lines changed: 82 additions & 29 deletions


sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

Lines changed: 37 additions & 29 deletions
```diff
@@ -325,39 +325,44 @@ case class FilterEstimation(plan: Filter) extends Logging {
     }
     val colStat = colStatsMap(attr)

-    // decide if the value is in [min, max] of the column.
-    // We currently don't store min/max for binary/string type.
-    // Hence, we assume it is in boundary for binary/string type.
-    val statsInterval = ValueInterval(colStat.min, colStat.max, attr.dataType)
-    if (statsInterval.contains(literal)) {
-      if (update) {
-        // We update ColumnStat structure after apply this equality predicate:
-        // Set distinctCount to 1, nullCount to 0, and min/max values (if exist) to the literal
-        // value.
-        val newStats = attr.dataType match {
-          case StringType | BinaryType =>
-            colStat.copy(distinctCount = Some(1), nullCount = Some(0))
-          case _ =>
-            colStat.copy(distinctCount = Some(1), min = Some(literal.value),
-              max = Some(literal.value), nullCount = Some(0))
-        }
-        colStatsMap.update(attr, newStats)
-      }
+    // Decide if the value is in [min, max] of the column.
+    // We currently don't store min/max for binary/string type. For other types, if min/max are
+    // missing, treat the range as unknown (instead of "all nulls") and fall back to NDV/histogram.
+    val valueInRange = attr.dataType match {
+      case StringType | BinaryType =>
+        true
+      case _ if !colStat.hasMinMaxStats =>
+        true
+      case _ =>
+        ValueInterval(colStat.min, colStat.max, attr.dataType).contains(literal)
+    }

-      if (colStat.histogram.isEmpty) {
-        if (!colStat.distinctCount.isEmpty) {
-          // returns 1/ndv if there is no histogram
-          Some(1.0 / colStat.distinctCount.get.toDouble)
-        } else {
-          None
-        }
-      } else {
+    if (!valueInRange) return Some(0.0)
+
+    val percent = if (colStat.histogram.isDefined) {
+      if (colStat.hasMinMaxStats) {
         Some(computeEqualityPossibilityByHistogram(literal, colStat))
+      } else {
+        None
       }
+    } else {
+      colStat.distinctCount.filter(_ > 0).map(ndv => 1.0 / ndv.toDouble)
+    }

-    } else { // not in interval
-      Some(0.0)
+    if (update && percent.isDefined) {
+      // We update ColumnStat structure after apply this equality predicate:
+      // Set distinctCount to 1, nullCount to 0, and min/max values (if exist) to the literal value.
+      val newStats = attr.dataType match {
+        case StringType | BinaryType =>
+          colStat.copy(distinctCount = Some(1), nullCount = Some(0))
+        case _ =>
+          colStat.copy(distinctCount = Some(1), min = Some(literal.value),
+            max = Some(literal.value), nullCount = Some(0))
+      }
+      colStatsMap.update(attr, newStats)
     }
+
+    percent
   }

   /**
@@ -409,9 +414,12 @@ case class FilterEstimation(plan: Filter) extends Logging {
     // use [min, max] to filter the original hSet
     dataType match {
       case _: NumericType | BooleanType | DateType | TimestampType =>
-        if (ndv.toDouble == 0 || colStat.min.isEmpty || colStat.max.isEmpty) {
+        if (ndv.toDouble == 0) {
           return Some(0.0)
         }
+        if (colStat.min.isEmpty || colStat.max.isEmpty) {
+          return None
+        }

         val statsInterval =
           ValueInterval(colStat.min, colStat.max, dataType).asInstanceOf[NumericValueInterval]
```
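The `evaluateInSet` hunk above applies the same principle: `ndv == 0` still means "no matching rows" (`Some(0.0)`), but missing min/max now means "unknown" (`None`) rather than zero. A minimal sketch of that distinction; `InStat` and `inSetSelectivity` are hypothetical names for illustration, not Spark's API:

```scala
// Simplified column stats for an IN-set estimate.
case class InStat(ndv: Long, min: Option[Int], max: Option[Int])

def inSetSelectivity(stat: InStat, hSet: Set[Int]): Option[Double] = {
  if (stat.ndv == 0) return Some(0.0) // no distinct values: nothing can match
  (stat.min, stat.max) match {
    case (Some(lo), Some(hi)) =>
      // Keep only set members inside [min, max], then estimate |validSet| / ndv.
      val valid = hSet.count(v => v >= lo && v <= hi)
      Some(math.min(valid.toDouble / stat.ndv.toDouble, 1.0))
    case _ =>
      // Before the patch this path also returned Some(0.0); now the range is
      // unknown, so report "no estimate" and let the caller use a default.
      None
  }
}
```

With `InStat(10, Some(1), Some(10))` and the set `{3, 4, 5}`, this gives `Some(0.3)`; with min/max missing it gives `None` instead of wrongly pruning all rows.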

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala

Lines changed: 45 additions & 0 deletions
```diff
@@ -211,6 +211,36 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
       expectedRowCount = 1)
   }

+  test("cint = 2 - missing min/max but has NDV") {
+    val attrIntNoMinMax = AttributeReference("cint_no_min_max", IntegerType)()
+    val colStatIntNoMinMax = ColumnStat(distinctCount = Some(10), min = None, max = None,
+      nullCount = Some(0), avgLen = Some(4), maxLen = Some(4))
+    val childPlan = StatsTestPlan(
+      outputList = Seq(attrIntNoMinMax),
+      rowCount = 10L,
+      attributeStats = AttributeMap(Seq(attrIntNoMinMax -> colStatIntNoMinMax))
+    )
+    validateEstimatedStats(
+      Filter(EqualTo(attrIntNoMinMax, Literal(2)), childPlan),
+      Seq(attrIntNoMinMax -> colStatIntNoMinMax.copy(distinctCount = Some(1),
+        min = Some(2), max = Some(2), nullCount = Some(0))),
+      expectedRowCount = 1)
+  }
+
+  test("cint = 2 - missing all column stats") {
+    val attrIntNoStats = AttributeReference("cint_no_stats", IntegerType)()
+    val colStatIntNoStats = ColumnStat()
+    val childPlan = StatsTestPlan(
+      outputList = Seq(attrIntNoStats),
+      rowCount = 10L,
+      attributeStats = AttributeMap(Seq(attrIntNoStats -> colStatIntNoStats))
+    )
+    validateEstimatedStats(
+      Filter(EqualTo(attrIntNoStats, Literal(2)), childPlan),
+      Seq(attrIntNoStats -> colStatIntNoStats),
+      expectedRowCount = 10)
+  }
+
   test("cint <=> 2") {
     validateEstimatedStats(
       Filter(EqualNullSafe(attrInt, Literal(2)), childStatsTestPlan(Seq(attrInt), 10L)),
@@ -387,6 +417,21 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
       expectedRowCount = 3)
   }

+  test("cint IN (3, 4, 5) - missing min/max but has NDV") {
+    val attrIntNoMinMax = AttributeReference("cint_no_min_max", IntegerType)()
+    val colStatIntNoMinMax = ColumnStat(distinctCount = Some(10), min = None, max = None,
+      nullCount = Some(0), avgLen = Some(4), maxLen = Some(4))
+    val childPlan = StatsTestPlan(
+      outputList = Seq(attrIntNoMinMax),
+      rowCount = 10L,
+      attributeStats = AttributeMap(Seq(attrIntNoMinMax -> colStatIntNoMinMax))
+    )
+    validateEstimatedStats(
+      Filter(InSet(attrIntNoMinMax, Set(3, 4, 5)), childPlan),
+      Seq(attrIntNoMinMax -> colStatIntNoMinMax),
+      expectedRowCount = 10)
+  }
+
   test("evaluateInSet with all zeros") {
     validateEstimatedStats(
       Filter(InSet(attrString, Set(3, 4, 5)),
```
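The expected row counts in these tests follow from `estimated rows = ceil(child rows × selectivity)`, with the child's row count kept when no estimate is available, which is consistent with the no-stats and no-min/max IN tests still expecting 10 rows. A hedged arithmetic sketch; `estimatedRows` is an illustrative helper, not Spark's API:

```scala
// estimated rows = ceil(childRowCount * selectivity); an unknown selectivity
// (None) leaves the child's row count unchanged, matching the new tests where
// a 10-row child with no usable stats is still estimated at 10 rows.
def estimatedRows(childRowCount: Long, selectivity: Option[Double]): Long =
  selectivity match {
    case Some(s) => math.ceil(childRowCount * s).toLong
    case None    => childRowCount
  }
```

So the "missing min/max but has NDV" equality test gets `ceil(10 * 1/10) = 1` row, while the 0-row estimate that triggered the unintended broadcast can no longer arise from merely absent statistics.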
