Improve CostHashAgg with single-column NDV and spill-aware cost model
Two enhancements to CostHashAgg in CCostModelGPDB:
1. Single-column NDV optimization for local partial HashAgg:
When GROUP BY has exactly one column, use GetNDVs() (the global NDV
from column statistics) instead of pci->Rows() to estimate the output
row count of the local partial aggregation stage. Because the NDV is
already global, it is not multiplied by UlHosts().
This lets the optimizer distinguish high-NDV cases (partial agg
streams nearly as many rows as input, 2-phase has little benefit)
from low-NDV cases (partial agg significantly reduces data before
redistribution, 2-phase is preferred).
Multi-column GROUP BY falls back to the original behavior:
num_output_rows = pci->Rows() * UlHosts().
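The row-count selection described above can be sketched as follows. This is a minimal illustration, not the code in CCostModelGPDB: the struct, its field names, and the function name are hypothetical stand-ins for pci->Rows(), UlHosts(), and GetNDVs().

```cpp
#include <cstddef>

// Hypothetical stand-ins for the optimizer inputs; not the real ORCA types.
struct LocalAggInputs {
    double rows_per_host;        // stands in for pci->Rows()
    double hosts;                // stands in for UlHosts()
    double global_ndv;           // stands in for GetNDVs() on the grouping column
    std::size_t num_group_cols;  // number of GROUP BY columns
};

// Estimate how many rows the local partial HashAgg stage emits.
double EstimateLocalAggOutputRows(const LocalAggInputs& in) {
    if (in.num_group_cols == 1) {
        // Single grouping column: the NDV is already a global figure,
        // so it is used as-is without scaling by the host count.
        return in.global_ndv;
    }
    // Multi-column GROUP BY: original behavior.
    return in.rows_per_host * in.hosts;
}
```

With 1000 rows per host on 3 hosts and a global NDV of 50, the single-column estimate is 50 rows, while the multi-column fallback gives 3000.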
2. Spill-aware cost model:
When num_output_rows * width exceeds the spilling memory threshold
(EcpHJSpillingMemThreshold, 50 MB), apply higher cost unit values
to reflect disk I/O overhead. This reuses the existing hash join (HJ)
spilling cost parameters (EcpHJFeedingTupColumnSpillingCostUnit etc.),
which are already tuned for spilling scenarios.
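A rough sketch of the spill check, under stated assumptions: the function names are hypothetical, the width is taken to be in bytes, 50 MB is taken as 50 * 1024 * 1024 bytes, and the two cost-unit arguments are placeholders for whichever in-memory and spilling parameters the cost model reads.

```cpp
// Assumed value of the threshold in bytes (50 MB read as 50 * 1024 * 1024).
constexpr double kSpillThresholdBytes = 50.0 * 1024.0 * 1024.0;

// True when the estimated hash table size exceeds the memory threshold.
// Assumes width_bytes is the tuple width in bytes.
bool WouldSpill(double num_output_rows, double width_bytes,
                double threshold_bytes = kSpillThresholdBytes) {
    return num_output_rows * width_bytes > threshold_bytes;
}

// Pick the per-tuple cost unit: the spilling variant when the aggregate
// is expected to spill, otherwise the in-memory one. Both unit values
// are supplied by the caller; none are hard-coded here.
double PickCostUnit(double num_output_rows, double width_bytes,
                    double in_memory_unit, double spilling_unit) {
    return WouldSpill(num_output_rows, width_bytes) ? spilling_unit
                                                    : in_memory_unit;
}
```

For example, 1,000,000 output rows of 100 bytes (about 95 MB) cross the threshold and select the spilling unit, while 1,000 rows of the same width do not.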
TPC-H benchmark: -14.3% overall (Q17 -60%, Q03 -6%).
TPC-DS benchmark: -0.4% overall (Q59 -29%).