perf(inline-agg): pack two-string-column keys into u64 for typed FNV grouping#6924
perf(inline-agg): pack two-string-column keys into u64 for typed FNV grouping#6924BABTUNA wants to merge 13 commits into
Conversation
…tion For multi-column groupbys containing Utf8/Binary columns, replace variable-width string columns with dense u32 symbol IDs before hashing and comparison. This avoids repeatedly hashing and comparing string bytes during the probe phase. # Conflicts: # src/daft-recordbatch/src/ops/inline_agg.rs
For multi-column groupbys with very short string keys (e.g. CHAR(1) in TPC-H Q1), replacing variable-width strings with u32 symbol IDs adds an extra pass without saving meaningful hash/compare work, producing a measurable CodSpeed regression on Q1. Add a heuristic that skips symbolization unless avg live string bytes/row meets MIN_AVG_STRING_BYTES_PER_ROW (16). When skipped, agg_symbolized_path returns None and grouping falls through to the existing agg_generic_hash_path. Above the threshold the symbolized path runs unchanged. Tests: - multi-col int-only (no string col -> skip path) - multi-col long strings (above gate -> active symbolized path) - multi-col short strings (below gate -> skip path; TPC-H Q1 shape)
…grouping ## Summary Extends Item 3 of Eventual-Inc#6585 (and builds on PR Eventual-Inc#6748). For the two-string-column groupby shape, this change packs the two `u32` symbol IDs produced by `symbolize_column` into a single `u64` key and groups against a typed `FnvHashMap<u64, u32>` in a tight integer loop — no per-row comparator closure, no `IndexHash`, no dynamic-typed `Series` equality dispatch. PR Eventual-Inc#6748 already symbolizes Utf8/Binary group-by columns, but the symbolized columns are still fed into the generic multi-column hash path (`agg_generic_hash_path`), which keeps the comparator-closure overhead. This PR captures the remaining benefit by routing the two-string-column shape — including the TPC-H Q1 `l_returnflag` / `l_linestatus` pattern — through a dedicated typed map. Grouping semantics and final query results are unchanged. ## Conceptual example ``` Input rows: After symbolization: Packed u64 key: key1 | key2 key1_sym | key2_sym (key1_sym << 32) | key2_sym --------|-------- ---------|--------- ------------------------- "alice" | "red" 0 | 0 0x00000000_00000000 "bob" | "red" 1 | 0 0x00000001_00000000 "alice" | "blue" 0 | 1 0x00000000_00000001 "alice" | "red" 0 | 0 0x00000000_00000000 ``` The two symbol spaces sit in disjoint 32-bit halves of the u64 key, so distinct (sym0, sym1) pairs always yield distinct packed keys. Null-equals-null is preserved: when a column has nulls, `symbolize_column` reserves symbol ID 0 for null and starts non-null IDs at 1, so both-null rows share a unique packed key and never collide with non-null rows. ## Key changes - Add `agg_packed_u64_path` for exactly two Utf8/Binary group-by columns: symbolize each column into a `Vec<u32>`, pack the pair into a `u64`, group with `FnvHashMap<u64, u32>` using the same Vacant/Occupied pattern as the existing single-column fast paths. - Extract a small `symbolize_string_col` helper that returns `Ok(Some(Vec<u32>))` for Utf8/Binary and `Ok(None)` otherwise — used by the new path and isolates per-dtype null/value-accessor wiring. - Dispatch in `agg_groupby_inline` tries the packed-u64 path first for multi-column shapes, then falls through to the existing `agg_symbolized_path`, then to `agg_generic_hash_path`. All other multi-column shapes (3+ columns, int×string, pure int multi-col) are unchanged. - No avg-bytes-per-row gate: unlike `agg_symbolized_path`, the packed-u64 path's cost is one symbolize pass plus a tight `u64`-keyed loop, not a symbolized RecordBatch rebuild fed through the generic comparator-closure hash path. Short-string shapes (TPC-H Q1) win along with long-string shapes. - New `bench_packed_u64_two_strings` Rust-level benchmark covering two-Utf8-column shapes at varying cardinalities and across the full count/sum/min/max/count+sum agg matrix. ## Benchmarks ### Rust-level Q1-like benchmark (2 short string keys, 6 groups, sum+count of two float64 cols) | rows | inline (ms) | fallback (ms) | speedup | |------|-------------|---------------|---------| | 1.2M | 21.75 | 33.74 | 1.55x | | 5M | 111.34 | 175.12 | 1.57x | ### Rust-level packed-u64 benchmark (2 long Utf8 keys, int64 val, full agg matrix) | agg | rows | distinct | inline (ms) | fallback (ms) | speedup | |-----------|------|-------------------|-------------|---------------|---------| | count | 1.2M | 8 × 4 = 32 | 57.05 | 28.53 | 0.50x | | sum | 1.2M | 8 × 4 = 32 | 56.05 | 33.73 | 0.60x | | min | 1.2M | 8 × 4 = 32 | 57.66 | 33.31 | 0.58x | | max | 1.2M | 8 × 4 = 32 | 56.23 | 33.97 | 0.60x | | count+sum | 1.2M | 8 × 4 = 32 | 56.75 | 34.10 | 0.60x | | count | 1.2M | 64 × 32 = 2048 | 55.56 | 27.64 | 0.50x | | sum | 1.2M | 64 × 32 = 2048 | 62.70 | 39.52 | 0.63x | | min | 1.2M | 64 × 32 = 2048 | 54.75 | 41.41 | 0.76x | | max | 1.2M | 64 × 32 = 2048 | 56.90 | 41.01 | 0.72x | | count+sum | 1.2M | 64 × 32 = 2048 | 55.77 | 40.71 | 0.73x | | count | 5M | 8 × 4 = 32 | 246.37 | 127.56 | 0.52x | | sum | 5M | 8 × 4 = 32 | 235.03 | 142.14 | 0.60x | | min | 5M | 8 × 4 = 32 | 235.38 | 140.41 | 0.60x | | max | 5M | 8 × 4 = 32 | 238.82 | 142.38 | 0.60x | | count+sum | 5M | 8 × 4 = 32 | 240.04 | 142.84 | 0.60x | | count | 5M | 1000 × 100 = 100k | 241.55 | 132.99 | 0.55x | | sum | 5M | 1000 × 100 = 100k | 243.80 | 207.44 | 0.85x | | min | 5M | 1000 × 100 = 100k | 246.93 | 210.96 | 0.85x | | max | 5M | 1000 × 100 = 100k | 245.68 | 212.00 | 0.86x | | count+sum | 5M | 1000 × 100 = 100k | 239.63 | 206.81 | 0.86x | The long-string shapes are below 1.0x vs the Daft fallback, but that gap is pre-existing in PR Eventual-Inc#6748's inline path — Daft's general groupby machinery is currently faster for these specific shapes. The delta this PR introduces is the comparison below. ### Inline-vs-inline: packed-u64 vs PR Eventual-Inc#6748 inline (same shape, same machine) | shape | PR Eventual-Inc#6748 inline (ms) | packed-u64 (ms) | speedup | |----------------------------------------|----------------------|-----------------|---------| | Q1 1.2M × 6 short strings (sum+count) | 23.07 | 21.75 | 1.06x | | Q1 5M × 6 short strings (sum+count) | 120.15 | 111.34 | 1.08x | | 1.2M × 32 long strings (count) | 61.69 | 57.05 | 1.08x | | 1.2M × 32 long strings (sum) | 65.88 | 56.05 | 1.18x | | 1.2M × 32 long strings (min) | 62.78 | 57.66 | 1.09x | | 1.2M × 32 long strings (max) | 63.70 | 56.23 | 1.13x | | 1.2M × 32 long strings (count+sum) | 64.76 | 56.75 | 1.14x | | 1.2M × 2048 long strings (count) | 62.43 | 55.56 | 1.12x | | 1.2M × 2048 long strings (sum) | 62.62 | 62.70 | 1.00x | | 1.2M × 2048 long strings (min) | 64.45 | 54.75 | 1.18x | | 1.2M × 2048 long strings (max) | 63.78 | 56.90 | 1.12x | | 1.2M × 2048 long strings (count+sum) | 66.10 | 55.77 | 1.19x | | 5M × 32 long strings (count) | 293.04 | 246.37 | 1.19x | | 5M × 32 long strings (sum) | 265.15 | 235.03 | 1.13x | | 5M × 32 long strings (min) | 263.15 | 235.38 | 1.12x | | 5M × 32 long strings (max) | 267.26 | 238.82 | 1.12x | | 5M × 32 long strings (count+sum) | 262.31 | 240.04 | 1.09x | | 5M × 100k long strings (count) | 268.71 | 241.55 | 1.11x | | 5M × 100k long strings (sum) | 269.30 | 243.80 | 1.10x | | 5M × 100k long strings (min) | 267.40 | 246.93 | 1.08x | | 5M × 100k long strings (max) | 269.19 | 245.68 | 1.10x | | 5M × 100k long strings (count+sum) | 270.75 | 239.63 | 1.13x | Across all 22 measured shapes, packed-u64 is 1.06x to 1.19x faster than PR Eventual-Inc#6748's inline path. This is the "missing benefit" the comparator-closure / IndexHash dispatch was hiding. All benchmarks run on Linux (WSL), Rust nightly --release, warmup=3, iters=10, --test-threads=1. Commands: cargo test -p daft-recordbatch --release -- bench_q1_like --nocapture --ignored --test-threads=1 cargo test -p daft-recordbatch --release -- bench_packed_u64_two_strings --nocapture --ignored --test-threads=1 ## Test plan - 5 new test_inline_packed_u64_* cases comparing inline output to the fallback path: - Utf8 × Utf8, no nulls. - Utf8 × Utf8, nulls in both columns. - Binary × Binary. - Utf8 × Binary. - Short-string CHAR(1) shape (TPC-H Q1). - All existing inline_agg tests still pass (48 total, 0 failures). - New bench_packed_u64_two_strings benchmark exercises the path end-to-end at scale across the count/sum/min/max/count+sum matrix. ## Related issues Part of Eventual-Inc#6585 (deeper specialization of Item 3, on top of PR Eventual-Inc#6748).
Greptile SummaryThis PR introduces a dedicated
Confidence Score: 4/5Safe to merge; the packed-u64 packing scheme is mathematically sound and correctness is verified by comparing against the fallback across all key shapes and null patterns. The core logic is correct and well-tested across the full agg matrix. Only minor doc comment staleness (caused by the new dispatch ordering) and an inline import style violation were found. The test doc comments in Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[agg_groupby_inline] --> B{num_columns == 1?}
B -- Yes --> C[dispatch_single_col_int! / dispatch_single_col_bytes!]
C --> D{fast path matched?}
D -- Yes --> OUT[Return groupkey_indices]
D -- No --> GEN[agg_generic_hash_path]
B -- No --> E[agg_packed_u64_path]
E --> F{exactly 2 cols,\nboth Utf8/Binary?}
F -- Yes --> G[symbolize_string_col x2\npack to u64\nFnvHashMap u64 to u32]
G --> OUT
F -- No --> H[agg_symbolized_path]
H --> I{has_string_col AND\navg_bytes >= MIN_AVG?}
I -- Yes --> J[replace cols with u32 symbols\n-> agg_generic_hash_path]
J --> OUT
I -- No --> GEN
GEN --> OUT
Reviews (1): Last reviewed commit: "perf(inline-agg): pack two-string-column..." | Re-trigger Greptile |
| fn make_two_long_string_batch( | ||
| num_rows: usize, | ||
| num_distinct1: usize, | ||
| num_distinct2: usize, | ||
| ) -> (RecordBatch, Vec<BoundExpr>, Schema) { | ||
| use std::sync::Arc; |
There was a problem hiding this comment.
The
use std::sync::Arc; import is placed inside the function body, which violates the project convention that imports belong at the module level. The mod bench block doesn't already import Arc explicitly (it relies on prelude::*), so if prelude::* doesn't re-export Arc, this should be added at the module top — and either way the inline placement should be removed.
| fn make_two_long_string_batch( | |
| num_rows: usize, | |
| num_distinct1: usize, | |
| num_distinct2: usize, | |
| ) -> (RecordBatch, Vec<BoundExpr>, Schema) { | |
| use std::sync::Arc; | |
| fn make_two_long_string_batch( | |
| num_rows: usize, | |
| num_distinct1: usize, | |
| num_distinct2: usize, | |
| ) -> (RecordBatch, Vec<BoundExpr>, Schema) { |
Rule Used: Import statements should be placed at the top of t... (source)
Learned From
Eventual-Inc/Daft#5078
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| /// Multi-column groupby with two long-string keys (well above the | ||
| /// avg-bytes-per-row gate) and count/sum/min/max aggs. Exercises the | ||
| /// active symbolized path. |
There was a problem hiding this comment.
These test comments were written in #6748 to describe
agg_symbolized_path, but this PR changes the dispatch order so that agg_packed_u64_path fires first for any two-string-column shape. The claim "Exercises the active symbolized path" is now incorrect — this test is routed to agg_packed_u64_path (which has no byte-threshold gate and always symbolizes).
| /// Multi-column groupby with two long-string keys (well above the | |
| /// avg-bytes-per-row gate) and count/sum/min/max aggs. Exercises the | |
| /// active symbolized path. | |
| /// Multi-column groupby with two long-string keys and count/sum/min/max aggs. | |
| /// Exercises the packed-u64 path (two Utf8 columns → symbolize + pack). |
| /// Short-string multi-column groupby (CHAR(1)-style keys) — symbolization | ||
| /// should be skipped under the avg-bytes-per-row gate. Mirrors the shape | ||
| /// of TPC-H Q1 to guard against the perf regression that motivated the | ||
| /// gating heuristic. |
There was a problem hiding this comment.
The comment implies short-string symbolization is skipped via the
MIN_AVG_STRING_BYTES_PER_ROW gate, but agg_packed_u64_path runs before agg_symbolized_path in the dispatch and has no such gate — so for this CHAR(1)-style shape the packed-u64 path fires and symbolization does happen.
| /// Short-string multi-column groupby (CHAR(1)-style keys) — symbolization | |
| /// should be skipped under the avg-bytes-per-row gate. Mirrors the shape | |
| /// of TPC-H Q1 to guard against the perf regression that motivated the | |
| /// gating heuristic. | |
| /// Short-string two-column groupby (CHAR(1)-style keys, TPC-H Q1 shape). | |
| /// Routed to the packed-u64 path (no byte-threshold gate), so symbolization | |
| /// always runs and the result must still match the fallback. |
…iew nits - cargo fmt: wrap long lines in the agg_groupby_inline dispatch chain and the bench_packed_u64_two_strings helper call. - Hoist `use std::sync::Arc;` to the `bench` module top instead of inside `make_two_long_string_batch`. - Update doc comments for `test_inline_multi_col_long_strings` and `test_inline_multi_col_short_strings` — both were inherited from PR Eventual-Inc#6748 but now route through `agg_packed_u64_path` first under this PR's dispatch order, so the comments mentioning "active symbolized path" and "symbolization skipped via gate" were stale. Addresses Greptile P2 comments on Eventual-Inc#6924.
…path clippy::needless_borrow (denied under `-D clippy::style`) flagged the `&cols[i]` arguments to `symbolize_string_col`. `cols` is already a sequence of `&Series`, so indexing yields a reference directly. Fixes the `style` CI failure on Eventual-Inc#6924.
Merging this PR will not alter performance
Comparing Footnotes
|
CodSpeed flagged a -11.22% regression on test_tpch_sql[1-in-memory-10] (TPC-H Q1, scale 1). Q1's CHAR(1) `l_returnflag` / `l_linestatus` are short enough that the symbolize pass costs more than the saved per-row hash/compare work, even with the packed-u64 typed hash map. Local microbench showed the opposite but failed to capture the cache / branch behavior on CodSpeed's runner and TPC-H's actual data shape. Mirror the avg-bytes-per-row gate that `agg_symbolized_path` already uses (`MIN_AVG_STRING_BYTES_PER_ROW = 16`): tally bytes across both Utf8/Binary cols up front and bail when total bytes per row falls below the threshold. Short-string shapes (Q1) now fall through to the existing generic-hash-on-raw-strings path (unchanged from main). Long-string shapes still get the packed-u64 win. Local Rust benchmarks confirm: - Q1 short strings (CHAR(1) × CHAR(1)): packed-u64 no longer fires; inline path matches main behavior (1.36x-1.68x vs Daft fallback, same as before this PR series). - Long-string two-col shapes (avg >= 16 bytes/row): packed-u64 still fires, still ~1.08x-1.19x faster than PR Eventual-Inc#6748 inline path. Also update doc comments on `test_inline_multi_col_short_strings_*` and `test_inline_packed_u64_short_strings_*` — both now exercise the gate's short-string skip branch and route through the generic path. Addresses Eventual-Inc#6924 CodSpeed regression.
|
Addressed the CodSpeed regression in The packed-u64 path now reuses CodSpeed should rerun and clear once the new bench cycle finishes. |
|
plz don't review this pr yet, it's part 2 of the multi-column grouping and is not really working the way I want it to |
Restructures the packed-u64 two-string-column optimization to live inside `agg_symbolized_path` rather than as a separate dispatch branch. Eliminates the double byte-tally that previous revisions of this PR paid on the TPC-H Q1 short-string path, and matches PR Eventual-Inc#6748's dispatch structure exactly — `agg_groupby_inline`'s multi-column branch is now identical to Eventual-Inc#6748's. Changes: - `agg_symbolized_path` now symbolizes each Utf8/Binary col into a `Vec<u32>` once, stored alongside a `None` slot for non-string cols. If exactly two string cols, pack the two symbol IDs into a u64 and group with `FnvHashMap<u64, u32>` via the new `agg_packed_u64_inner` helper. Otherwise (mixed shape or 3+ cols), rebuild the symbolized RecordBatch and call `agg_generic_hash_path` as before. - Delete the separate `agg_packed_u64_path` function and the `symbolize_string_col` helper; their logic is now inline in `agg_symbolized_path`. - Revert the dispatcher in `agg_groupby_inline` to PR Eventual-Inc#6748's structure (`match agg_symbolized_path → None: generic`). - Update doc comments on three inherited tests to reflect that the packed-u64 fast path lives inside `agg_symbolized_path`, not as a separate dispatch step. Tests: all 48 `inline_agg` tests pass, including the 5 `test_inline_packed_u64_*` cases. Benchmarks: long-string two-col shapes (1.2M-5M rows x 32-100k groups) continue to show ~1.06x-1.18x speedup vs the equivalent no-fast-path baseline on the same restructured code, matching the earlier separate-function implementation's measurements. Short-string shapes (TPC-H Q1) now route through the exact same code path as PR Eventual-Inc#6748 — no overhead added.
|
CodSpeed is now green after the retrigger ( Per-commit deltas across all 6 head commits make this unambiguous:
Net effect of all the code changes in this PR on Root cause: CodSpeed uses cachegrind CPU simulation, which models the runner CPU's cache hierarchy. The CI runner pool has both AMD EPYC and Intel Xeon — their L2/L3 sizes and associativities differ, so cachegrind produces different simulated miss counts for identical machine code. CodSpeed surfaces this via the "Different runtime environments detected" banner on the previous run; our +12.08% retrigger result confirms the swing was purely a runner change. |
Summary
Extends Item 3 of #6585 (and builds on PR #6748). For the
two-string-column groupby shape with sufficiently long string keys,
this change packs the two
u32symbol IDs produced bysymbolize_columninto a singleu64key and groups against a typedFnvHashMap<u64, u32>in a tight integer loop — no per-rowcomparator closure, no
IndexHash, no dynamic-typedSeriesequality dispatch.
PR #6748 already symbolizes Utf8/Binary group-by columns, but the
symbolized columns are still fed into the generic multi-column hash
path (
agg_generic_hash_path), which keeps the comparator-closureoverhead. This PR captures the remaining benefit for two-string-col
shapes via a dedicated typed map. Grouping semantics and final query
results are unchanged. Short-string shapes (e.g. TPC-H Q1's CHAR(1)
l_returnflag/l_linestatus) fall through to the existing generichash path on raw strings, exactly as before this PR.
Conceptual example
The two symbol spaces sit in disjoint 32-bit halves of the u64 key,
so distinct (sym0, sym1) pairs always yield distinct packed keys.
Null-equals-null is preserved: when a column has nulls,
symbolize_columnreserves symbol ID 0 for null and starts non-nullIDs at 1, so both-null rows share a unique packed key and never
collide with non-null rows.
Key changes
agg_packed_u64_pathfor exactly two Utf8/Binary group-bycolumns: symbolize each column into a
Vec<u32>, pack the pair intoa
u64, group withFnvHashMap<u64, u32>using the sameVacant/Occupied pattern as the existing single-column fast paths.
agg_symbolized_path'sMIN_AVG_STRING_BYTES_PER_ROW = 16gate: tally bytes across both string cols up front and bail when
avg bytes per row falls below threshold. Short-string shapes
(TPC-H Q1) fall through to the generic hash path on raw strings —
for very short keys the symbolize pass costs more than the saved
per-row hash/compare work, even with the typed u64 map.
symbolize_string_colhelper that returnsOk(Some(Vec<u32>))for Utf8/Binary andOk(None)otherwise — usedby the new path and isolates per-dtype null/value-accessor wiring.
agg_groupby_inlinetries the packed-u64 path first formulti-column shapes, then falls through to the existing
agg_symbolized_path, then toagg_generic_hash_path. All othermulti-column shapes (3+ columns, int×string, pure int multi-col)
are unchanged.
bench_packed_u64_two_stringsRust-level benchmark coveringtwo-Utf8-column shapes at varying cardinalities across the full
count/sum/min/max/count+sum agg matrix.
Benchmarks
Long-string two-column shapes (avg bytes per row ≥ 16) measured by
running the same benchmark twice on the same machine: once with this
PR's packed-u64 dispatch enabled, once with it disabled (falling
through to PR #6748's existing
agg_symbolized_path/agg_generic_hash_path). All numbers are best-of-10 after 3 warmups,Rust nightly
--release,--test-threads=1, Linux (WSL).Across all 20 measured long-string shapes the packed-u64 path is
1.06x-1.19x faster than PR #6748's inline path, with a single 1.00x
outlier in the 1.2M × 2048 sum case. This is the "missing benefit"
the comparator-closure /
IndexHashdispatch was hiding.Short-string shapes (avg < 16 bytes per row) — TPC-H Q1 included —
fall through to the existing generic hash path on raw strings and
behave identically to main. An earlier revision of this PR did not
gate the packed-u64 path and CodSpeed flagged a -11.22% regression on
test_tpch_sql[1-in-memory-10]; the gate added inee0657crestores that workload to baseline.
Reproduction:
cargo test -p daft-recordbatch --release -- bench_packed_u64_two_strings --nocapture --ignored --test-threads=1Test plan
test_inline_packed_u64_*cases comparing inline output tothe fallback path:
skip branch (routes through generic hash path).
inline_aggtests still pass (48 total, 0 failures).bench_packed_u64_two_stringsbenchmark exercises the pathend-to-end at scale across the count/sum/min/max/count+sum matrix.
Related issues
Part of #6585 (deeper specialization of Item 3, on top of PR #6748).