Improve CostHashAgg with single-column NDV and spill-aware cost model #1672

Draft
yjhjstz wants to merge 2 commits into apache:main from yjhjstz:fix_orca_agg_cost

@yjhjstz yjhjstz commented Apr 9, 2026

TPC-H: -17.7% overall (274,201ms → 225,537ms)

Only 2 queries changed plans:

| Query | Baseline | Optimized | Change | Cause |
|-------|----------|-----------|--------|-------|
| Q17 | 60,282 ms | 23,952 ms | -60.3% | 2-phase (spill 1,342 MB) → 1-phase (spill 491 MB) |
| Q03 | 19,977 ms | 8,014 ms | -59.9% | 2-phase (spill 7 MB) → 1-phase (no spill) |

Q17 is the dominant win. Its group key, l_partkey, is a single high-NDV column, so partial aggregation emitted nearly as many rows as it read, and the Finalize stage spilled 1.3 GB to disk. The optimizer now chooses a 1-phase plan, which reduces spill to 491 MB and cuts latency by 36 seconds. Q03 also improves substantially, and its spill disappears entirely. No plan change caused a regression.
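
As a rough illustration of why a high-NDV key defeats partial aggregation, the sketch below bounds the rows a local partial HashAgg can emit on one segment. The helper names and the min() bound are assumptions made for illustration only; they are not ORCA code and not necessarily the estimate this PR uses.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper, not part of ORCA: upper bound on the rows a local
// partial HashAgg emits on one segment. A segment emits at most one row per
// distinct group key, and never more rows than it reads.
double PartialAggOutputRows(double rows_per_segment, double global_ndv)
{
    assert(rows_per_segment >= 0.0 && global_ndv >= 0.0);
    return std::min(rows_per_segment, global_ndv);
}

// Fraction of input rows that survive the partial stage.
// 1.0 means the partial stage removes nothing before redistribution.
double PartialAggSurvivalRatio(double rows_per_segment, double global_ndv)
{
    if (rows_per_segment <= 0.0)
    {
        return 0.0;
    }
    return PartialAggOutputRows(rows_per_segment, global_ndv) / rows_per_segment;
}
```

With 1,000 rows per segment, a key with 250 distinct values lets the partial stage keep at most a quarter of its input, while a key with 5,000 distinct values gives a ratio of 1.0: the partial stage may forward every row, which is the situation described for Q17.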


TPC-DS: +0.2% overall (605,646ms → 606,700ms)

Only 2 queries changed plans:

| Query | Baseline | Optimized | Change | Cause |
|-------|----------|-----------|--------|-------|
| Q59 | 11,141 ms | 8,393 ms | -24.7% | 2-phase → 1-phase |
| Q04 | 14,596 ms | 14,541 ms | -0.4% | Subtree plan detail change, negligible impact |

Q59 benefits for the same reason as Q17 and Q03. The Q04 change affects only one of three aggregation subtrees (the smallest, rows=71504), with no meaningful performance impact. No plan change caused a regression.

@yjhjstz yjhjstz marked this pull request as draft April 9, 2026 22:32
Two enhancements to CostHashAgg in CCostModelGPDB:

1. Single-column NDV optimization for local partial HashAgg:
   When GROUP BY has exactly 1 column, use GetNDVs() (global NDV from
   column statistics) instead of pci->Rows() to estimate the output
   row count of the local partial aggregation stage. GetNDVs() returns
   the global NDV directly, so no * UlHosts() scaling is needed.

   This lets the optimizer distinguish high-NDV cases (partial agg
   streams nearly as many rows as input, 2-phase has little benefit)
   from low-NDV cases (partial agg significantly reduces data before
   redistribution, 2-phase is preferred).

   Multi-column GROUP BY falls back to the original behavior:
   num_output_rows = pci->Rows() * UlHosts().

2. Spill-aware cost model:
   When num_output_rows * width exceeds the spilling memory threshold
   (EcpHJSpillingMemThreshold, 50 MB), apply higher cost unit values
   to reflect disk I/O overhead. Uses the existing HJ spilling cost
   parameters (EcpHJFeedingTupColumnSpillingCostUnit etc.) which are
   already tuned for spilling scenarios.
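
The two rules above can be sketched as follows. This is a minimal standalone illustration, not the code in CCostModelGPDB: the struct, the function names, the byte value used for 50 MB, and the way a cost unit is picked are all assumptions made for the example.

```cpp
#include <cassert>

// Illustrative stand-ins for the optimizer inputs; not ORCA's actual API.
struct AggCostInputs
{
    double local_rows;      // row estimate that pci->Rows() stands for here
    double num_hosts;       // segment count that UlHosts() stands for here
    double global_ndv;      // global NDV of the group key (GetNDVs() stand-in)
    int    num_group_cols;  // number of GROUP BY columns
    double width_bytes;     // average output row width in bytes
};

// Assumption: 50 MB expressed in bytes (stand-in for EcpHJSpillingMemThreshold).
constexpr double kSpillThresholdBytes = 50.0 * 1024.0 * 1024.0;

// Rule 1: a single GROUP BY column uses the global NDV directly, with no
// host scaling; multi-column GROUP BY keeps the original rows * hosts formula.
double EstimateOutputRows(const AggCostInputs &in)
{
    assert(in.num_hosts > 0.0);
    if (in.num_group_cols == 1)
    {
        return in.global_ndv;
    }
    return in.local_rows * in.num_hosts;
}

// Rule 2: the aggregate is treated as spilling when rows * width exceeds
// the memory threshold.
bool ExceedsSpillThreshold(const AggCostInputs &in)
{
    return EstimateOutputRows(in) * in.width_bytes > kSpillThresholdBytes;
}

// Pick the higher (spilling) cost unit once the threshold is exceeded.
// Both unit values are supplied by the caller in this sketch.
double PickCostUnit(const AggCostInputs &in, double in_memory_unit,
                    double spilling_unit)
{
    return ExceedsSpillThreshold(in) ? spilling_unit : in_memory_unit;
}
```

For example, a single-column key with an NDV of 10,000,000 and 8-byte rows gives an estimated 80,000,000 bytes, which is above the threshold, so the spilling unit is chosen; the same inputs with two GROUP BY columns fall back to rows * hosts.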

TPC-H benchmark: -14.3% overall (Q17 -60%, Q03 -6%).
TPC-DS benchmark: -0.4% overall (Q59 -29%).
@yjhjstz yjhjstz force-pushed the fix_orca_agg_cost branch from d6a7e60 to ae39433 Compare April 10, 2026 14:53