fix: improve inner join cardinality estimation for FK joins

When distinct count statistics are absent, the join cardinality
estimator falls back to using num_rows as the distinct count
estimate. Previously it used max(left_distinct, right_distinct)
as the selectivity denominator, which for a dimension-fact FK
join like warehouse(5) ⋈ catalog_sales(1.4M) would compute:
(5 * 1.4M) / max(5, 1.4M) = 5 rows — a severe underestimate.

This caused the optimizer to keep the 1.4M-row fact table as
the hash join build side (since it appeared to be "5 rows"),
leading to massive concat_batches allocations and 100x+ slowdowns
on queries like TPC-DS Q99.

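A minimal sketch of the old fallback (hypothetical function names, not the actual estimator code): with no distinct-count statistics, each side's distinct count defaults to its row count, so max() in the denominator cancels the larger side and the estimate collapses to the dimension table's cardinality.

```rust
/// Hypothetical sketch of the previous fallback behavior.
/// Standard equi-join formula: |L| * |R| / max(d(L), d(R)),
/// where d(·) falls back to num_rows when stats are missing.
fn old_join_estimate(left_rows: u64, right_rows: u64) -> u64 {
    // Distinct counts unknown -> assume every row is distinct.
    let left_distinct = left_rows;
    let right_distinct = right_rows;
    // For warehouse(5) ⋈ catalog_sales(1.4M) this yields
    // 7M / 1.4M = 5 rows.
    (left_rows * right_rows) / left_distinct.max(right_distinct)
}
```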
Fix: when no actual distinct count stats are available, use
min(left_distinct, right_distinct) instead of max. This gives
(5 * 1.4M) / min(5, 1.4M) = 1.4M — the correct FK join estimate.
The optimizer then correctly swaps to put the small dimension
table as the build side.

Also handle the edge case where selectivity is 0 (one side has
no non-null values): return 0 rows instead of None.

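The fixed fallback can be sketched as follows (again with hypothetical names): min() in the denominator produces the classic FK-join estimate, roughly the fact table's row count, and a zero denominator (one side empty or all-null keys) short-circuits to 0 rows.

```rust
/// Hypothetical sketch of the fixed fallback behavior:
/// |L| * |R| / min(d(L), d(R)), with d(·) falling back to num_rows.
fn new_join_estimate(left_rows: u64, right_rows: u64) -> u64 {
    let denominator = left_rows.min(right_rows);
    if denominator == 0 {
        // Selectivity 0: one side has no non-null join keys,
        // so an inner join produces no rows.
        return 0;
    }
    // warehouse(5) ⋈ catalog_sales(1.4M): 7M / 5 = 1.4M rows.
    (left_rows * right_rows) / denominator
}
```

With the fact table no longer mis-estimated at 5 rows, the planner's build-side selection sees the true size asymmetry and builds the hash table from the dimension side.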
TPC-DS Q99 improvement: 10.4s → 59ms (~176x faster).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>