fix(tesseract): JOIN-model assembly for multi-stage reduce_by#10947
Conversation
…reduce_by Add three ignored multi-stage integration tests covering count_distinct, avg, and count outer aggregations with reduce_by. Their snapshots hold the honest expected values so when the window-based path is replaced with the JOIN-based model these tests turn green; until then they document the silent-wrong / invalid-SQL bugs by name.
Extend the reduce_by integration suite with measures and tests that probe the full grain matrix: sum-of-count and add_group_by+reduce_by combo are numerically correct under the current window path and pass; max-of-sum, sum-of-max stay ignored with snapshots pinned to JOIN-semantic results so the assertions stop documenting the broken window shape.
Add FullKeyAggregateKeysInput { refs, keys } structure on FullKeyAggregate
and a parallel keys_input vec on MultiStageQueryDescription. The planner
now populates the description's keys_input when reduce_by / group_by
actually shrinks parent's state grain (add_group_by alone never triggers),
detected via a partition_new_state computed alongside new_state. Inputs
keep their existing full-grain shape and the renderer is untouched, so
this is purely metadata for the upcoming JOIN-based assembly.
When a multi-stage Aggregate inode shrinks parent's state grain via reduce_by / group_by and no add_group_by is in play, route the CTE through a JOIN-based path instead of the window function. The planner keeps the original full-grain inputs as keys_input and runs a fresh make_childs round at partition grain for the measure-side; member planner threads both sides into a FullKeyAggregateKeysInput and skips the Window flag. A new KeysInputFullKeyAggregateStrategy renders the result as LEFT JOIN of measure-side refs against the keys-side ref, JOIN-keys derived from schema intersection. Combo cases (add_group_by + reduce_by) and Rank still go through the legacy window path; non-idempotent outer types (count, count_distinct) remain ignored until measure references can be threaded through add_ungrouped_measure_reference. reduce_by suite: 12 passing, 2 ignored. avg / max_of_sum / sum_of_max now converge to the JOIN-semantic results.
…m-of-…
Outer multi-stage type for rolling up count / count_distinct base measures
to a coarser grain is `sum` — same rollup rule the pre-aggregation path
uses (Count → sum via pre_aggregate_wrap). The earlier `type: count` /
`type: count_distinct` shapes were the wrong user-facing pattern: outer
`count(...)` on a single broadcasted row collapses to 1.
Switch both ignored tests to `type: sum` with `sql: "{CUBE.count}"` /
`sql: "{CUBE.unique_customers}"`. Snapshots pin the JOIN-semantic results
(3 / 6 / 6 total orders per status, 3 / 3 / 3 distinct customers per
status, broadcast across categories). reduce_by suite: 14 passing, 0
ignored.
…ategy
Fold the JOIN-model assembly into the existing `KeysFullKeyAggregateStrategy`
so one code path handles both cases:
- keys-side from the logical plan's `keys_input` (JOIN-model: measure
refs live at partition grain, keys refs at leaf grain);
- keys-side derived from measure refs themselves (legacy keys path used
when the dialect lacks FULL JOIN).
JOIN-keys come from the intersection of the keys-side dim set with each
measure ref's schema, which collapses to partition grain in JOIN-model and
to full grain in the legacy path. The separate `KeysInputFullKeyAggregateStrategy`
is removed.
`FullKeyAggregateKeysInput` drops the redundant `keys` field — the strategy
reads key dims off `refs[0].schema()` directly. Add `test_reduce_by_calculated_over_two`
covering a Calculate inode that sums two reduce_by Aggregate children.
reduce_by suite: 15 passing.
Pass `-v` to `docker rm -f` in the atexit-cleanup callback so anonymous volumes attached to each test Postgres container are removed alongside the container itself. Without it disk usage steadily grows across test runs.
Drop the legacy Window path for Aggregate inodes — `window_function_to_use` is now always `None` for them (Rank still uses Rank). The unused branch stays as a comment for context. In the planner: apply modifiers in the right order — first reduce_by / group_by (shrink parent grain), then add_group_by (extend leaf). The previous order silently dropped `add_group_by`-added dims when `group_by` was a keep-only filter, breaking the leaf grain for cases like multi-stage count_distinct sum-by-quarter (the pre-aggregation match relied on the leaf staying at full leaf grain). With the corrected order `add_group_by + reduce_by` combo lands on the JOIN-model: leaf at `state - reduce_by + add_group_by`, keys_input rebuilt at parent state via a second make_childs. Combo numerical case (test_reduce_by_add_group_by_combo) now matches its honest expected result through the JOIN path, without the window fallback. Suite: 925 passing, 11 ignored (none reduce_by-related).
- assert in reduce_by tests which path (window vs JOIN-model) is taken
- simplify is_window_path_eligible: single-slice dep pattern + Result-typed
as_measure(), no expect()
- drop branch-history language ("legacy", "now", "see <case>") from
multi-stage docs and inline comments
|
Claude finished @waralexrom's task in 2m 37s —— View job Code Review: fix(tesseract): JOIN-model assembly for multi-stage reduce_by
Overall AssessmentThis is a well-structured fix for a correctness bug — Findings1. Duplicate measure definitions in YAML fixture (Minor)
- name: count_reduce_category # line 111
type: sum
sql: "{CUBE.count}"
multi_stage: true
reduce_by:
- orders.category
- name: count_sum_reduce_category # line 125
type: sum
sql: "{CUBE.count}"
multi_stage: true
reduce_by:
- orders.categoryBoth are 2. Missing path assertion in
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #10947 +/- ##
==========================================
+ Coverage 78.98% 83.57% +4.59%
==========================================
Files 470 254 -216
Lines 93132 76126 -17006
Branches 3454 0 -3454
==========================================
- Hits 73559 63623 -9936
+ Misses 19070 12503 -6567
+ Partials 503 0 -503
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
- drop duplicate `count_sum_reduce_category` measure (identical to `count_reduce_category`) and the redundant `test_reduce_by_sum_of_count` it covered — `test_reduce_by_count` already exercises sum-of-count with the window-path assertion - add FIXME on `partition_filter` pointing to its sibling `member_partition_by_logical` — the two reshape the same way on different inputs and should be merged
|
Claude finished @waralexrom's task in 2m 35s —— View job Code Review: fix(tesseract): JOIN-model assembly for multi-stage reduce_by
Overall AssessmentThis is a well-executed correctness fix for a real bug — The code is well-structured: the window-path optimization is preserved for the narrow safe subset, the modifier ordering is correct, and the test matrix covers the important shapes. I have a few observations, mostly low-risk and for future consideration. Findings1.
|
Summary
Multi-stage measures with
reduce_bypreviously routed through a windowfunction — for non-additive base measures (e.g.
count_distinct) thisemitted invalid SQL like
countDistinct(countDistinct(x)) OVER (...).This PR routes every
Aggregateinode through a JOIN-based assembly bydefault, preserves the window-based render as an opt-in optimisation for
the narrow safe subset (single dep, additive identity rollup, no
leaf-extending modifiers), and adds integration coverage for the
previously-unspecified shapes.
Changes
Aggregateinodes assemble viaFullKeyAggregate+ LEFT JOIN againsta leaf-grain dim grid (new
keys_inputslot on the multi-stage logicalplan). Modifiers apply in order:
reduce_by/group_by→add_group_by→
time_shift(the swap was the root cause of the count_distinctsum-by-quarter regression).
sum-over-{sum, count}aggregates withno
add_group_by, when the resulting partition is a strict subset ofquery dims (otherwise the window degenerates into a plain group-by and
we skip it).
FullKeyAggregatekeys-side handling unified: a single strategyhandles both explicit
keys_input(JOIN-model) and keys-derived-from-measure-refs (when the dialect lacks FULL JOIN).
reduce_bywithavg,count_distinct,max(sum),sum(max),sum(count),add_group_bycombo, and a
Calculateparent over mixed children; each test assertswhich path the planner picked (window vs JOIN).
Testing
cargo test --features integration-postgres multi_stage→ 108 passed,6 ignored (unrelated).
cargo test --features integration-postgres reduce_by→ 18 passed.the integration seed.