perf(inline-agg): pack two-string-column keys into u64 for typed FNV grouping by BABTUNA · Pull Request #6924 · Eventual-Inc/Daft

BABTUNA · 2026-05-12T19:36:46Z

Stacked on #6748. This branch sits on top of feat/symbolize-string-groupby-only, so the diff currently includes #6748's commits. Only the final commits are new in this PR; everything else is reviewed as part of #6748. After #6748 merges to main, I'll rebase and the diff will collapse.

Summary

Extends Item 3 of #6585 (and builds on PR #6748). For the
two-string-column groupby shape with sufficiently long string keys,
this change packs the two u32 symbol IDs produced by
symbolize_column into a single u64 key and groups against a typed
FnvHashMap<u64, u32> in a tight integer loop — no per-row
comparator closure, no IndexHash, no dynamic-typed Series
equality dispatch.

PR #6748 already symbolizes Utf8/Binary group-by columns, but the
symbolized columns are still fed into the generic multi-column hash
path (agg_generic_hash_path), which keeps the comparator-closure
overhead. This PR captures the remaining benefit for two-string-col
shapes via a dedicated typed map. Grouping semantics and final query
results are unchanged. Short-string shapes (e.g. TPC-H Q1's CHAR(1)
l_returnflag / l_linestatus) fall through to the existing generic
hash path on raw strings, exactly as before this PR.

Conceptual example

Input rows:                 After symbolization:        Packed u64 key:
key1    | key2              key1_sym | key2_sym         (key1_sym << 32) | key2_sym
--------|--------           ---------|---------         -------------------------
"alice" | "red"             0        | 0                0x00000000_00000000
"bob"   | "red"             1        | 0                0x00000001_00000000
"alice" | "blue"            0        | 1                0x00000000_00000001
"alice" | "red"             0        | 0                0x00000000_00000000

The two symbol spaces sit in disjoint 32-bit halves of the u64 key,
so distinct (sym0, sym1) pairs always yield distinct packed keys.
Null-equals-null is preserved: when a column has nulls,
symbolize_column reserves symbol ID 0 for null and starts non-null
IDs at 1, so both-null rows share a unique packed key and never
collide with non-null rows.

Key changes

Add agg_packed_u64_path for exactly two Utf8/Binary group-by
columns: symbolize each column into a Vec<u32>, pack the pair into
a u64, group with FnvHashMap<u64, u32> using the same
Vacant/Occupied pattern as the existing single-column fast paths.
Reuse agg_symbolized_path's MIN_AVG_STRING_BYTES_PER_ROW = 16
gate: tally bytes across both string cols up front and bail when
avg bytes per row falls below threshold. Short-string shapes
(TPC-H Q1) fall through to the generic hash path on raw strings —
for very short keys the symbolize pass costs more than the saved
per-row hash/compare work, even with the typed u64 map.
Extract a small symbolize_string_col helper that returns
Ok(Some(Vec<u32>)) for Utf8/Binary and Ok(None) otherwise — used
by the new path and isolates per-dtype null/value-accessor wiring.
Dispatch in agg_groupby_inline tries the packed-u64 path first for
multi-column shapes, then falls through to the existing
agg_symbolized_path, then to agg_generic_hash_path. All other
multi-column shapes (3+ columns, int×string, pure int multi-col)
are unchanged.
New bench_packed_u64_two_strings Rust-level benchmark covering
two-Utf8-column shapes at varying cardinalities across the full
count/sum/min/max/count+sum agg matrix.

Benchmarks

Long-string two-column shapes (avg bytes per row ≥ 16) measured by
running the same benchmark twice on the same machine: once with this
PR's packed-u64 dispatch enabled, once with it disabled (falling
through to PR #6748's existing agg_symbolized_path /
agg_generic_hash_path). All numbers are best-of-10 after 3 warmups,
Rust nightly --release, --test-threads=1, Linux (WSL).

agg	rows	distinct	PR #6748 inline (ms)	this PR (ms)	speedup
count	1.2M	8 × 4 = 32	61.69	57.05	1.08x
sum	1.2M	8 × 4 = 32	65.88	56.05	1.18x
min	1.2M	8 × 4 = 32	62.78	57.66	1.09x
max	1.2M	8 × 4 = 32	63.70	56.23	1.13x
count+sum	1.2M	8 × 4 = 32	64.76	56.75	1.14x
count	1.2M	64 × 32 = 2048	62.43	55.56	1.12x
sum	1.2M	64 × 32 = 2048	62.62	62.70	1.00x
min	1.2M	64 × 32 = 2048	64.45	54.75	1.18x
max	1.2M	64 × 32 = 2048	63.78	56.90	1.12x
count+sum	1.2M	64 × 32 = 2048	66.10	55.77	1.19x
count	5M	8 × 4 = 32	293.04	246.37	1.19x
sum	5M	8 × 4 = 32	265.15	235.03	1.13x
min	5M	8 × 4 = 32	263.15	235.38	1.12x
max	5M	8 × 4 = 32	267.26	238.82	1.12x
count+sum	5M	8 × 4 = 32	262.31	240.04	1.09x
count	5M	1000 × 100 = 100k	268.71	241.55	1.11x
sum	5M	1000 × 100 = 100k	269.30	243.80	1.10x
min	5M	1000 × 100 = 100k	267.40	246.93	1.08x
max	5M	1000 × 100 = 100k	269.19	245.68	1.10x
count+sum	5M	1000 × 100 = 100k	270.75	239.63	1.13x

Across all 20 measured long-string shapes the packed-u64 path is
1.06x-1.19x faster than PR #6748's inline path, with a single 1.00x
outlier in the 1.2M × 2048 sum case. This is the "missing benefit"
the comparator-closure / IndexHash dispatch was hiding.

Short-string shapes (avg < 16 bytes per row) — TPC-H Q1 included —
fall through to the existing generic hash path on raw strings and
behave identically to main. An earlier revision of this PR did not
gate the packed-u64 path and CodSpeed flagged a -11.22% regression on
test_tpch_sql[1-in-memory-10]; the gate added in ee0657c
restores that workload to baseline.

Reproduction:
cargo test -p daft-recordbatch --release -- bench_packed_u64_two_strings --nocapture --ignored --test-threads=1

Test plan

5 new test_inline_packed_u64_* cases comparing inline output to
the fallback path:
- Utf8 × Utf8, no nulls.
- Utf8 × Utf8, nulls in both columns.
- Binary × Binary.
- Utf8 × Binary.
- Short-string CHAR(1) shape (TPC-H Q1) — exercises the gate's
  skip branch (routes through generic hash path).
All existing inline_agg tests still pass (48 total, 0 failures).
New bench_packed_u64_two_strings benchmark exercises the path
end-to-end at scale across the count/sum/min/max/count+sum matrix.

Related issues

Part of #6585 (deeper specialization of Item 3, on top of PR #6748).

…tion For multi-column groupbys containing Utf8/Binary columns, replace variable-width string columns with dense u32 symbol IDs before hashing and comparison. This avoids repeatedly hashing and comparing string bytes during the probe phase. # Conflicts: # src/daft-recordbatch/src/ops/inline_agg.rs

For multi-column groupbys with very short string keys (e.g. CHAR(1) in TPC-H Q1), replacing variable-width strings with u32 symbol IDs adds an extra pass without saving meaningful hash/compare work, producing a measurable CodSpeed regression on Q1. Add a heuristic that skips symbolization unless avg live string bytes/row meets MIN_AVG_STRING_BYTES_PER_ROW (16). When skipped, agg_symbolized_path returns None and grouping falls through to the existing agg_generic_hash_path. Above the threshold the symbolized path runs unchanged. Tests: - multi-col int-only (no string col -> skip path) - multi-col long strings (above gate -> active symbolized path) - multi-col short strings (below gate -> skip path; TPC-H Q1 shape)

…grouping ## Summary Extends Item 3 of Eventual-Inc#6585 (and builds on PR Eventual-Inc#6748). For the two-string-column groupby shape, this change packs the two `u32` symbol IDs produced by `symbolize_column` into a single `u64` key and groups against a typed `FnvHashMap<u64, u32>` in a tight integer loop — no per-row comparator closure, no `IndexHash`, no dynamic-typed `Series` equality dispatch. PR Eventual-Inc#6748 already symbolizes Utf8/Binary group-by columns, but the symbolized columns are still fed into the generic multi-column hash path (`agg_generic_hash_path`), which keeps the comparator-closure overhead. This PR captures the remaining benefit by routing the two-string-column shape — including the TPC-H Q1 `l_returnflag` / `l_linestatus` pattern — through a dedicated typed map. Grouping semantics and final query results are unchanged. ## Conceptual example ``` Input rows: After symbolization: Packed u64 key: key1 | key2 key1_sym | key2_sym (key1_sym << 32) | key2_sym --------|-------- ---------|--------- ------------------------- "alice" | "red" 0 | 0 0x00000000_00000000 "bob" | "red" 1 | 0 0x00000001_00000000 "alice" | "blue" 0 | 1 0x00000000_00000001 "alice" | "red" 0 | 0 0x00000000_00000000 ``` The two symbol spaces sit in disjoint 32-bit halves of the u64 key, so distinct (sym0, sym1) pairs always yield distinct packed keys. Null-equals-null is preserved: when a column has nulls, `symbolize_column` reserves symbol ID 0 for null and starts non-null IDs at 1, so both-null rows share a unique packed key and never collide with non-null rows. ## Key changes - Add `agg_packed_u64_path` for exactly two Utf8/Binary group-by columns: symbolize each column into a `Vec<u32>`, pack the pair into a `u64`, group with `FnvHashMap<u64, u32>` using the same Vacant/Occupied pattern as the existing single-column fast paths. - Extract a small `symbolize_string_col` helper that returns `Ok(Some(Vec<u32>))` for Utf8/Binary and `Ok(None)` otherwise — used by the new path and isolates per-dtype null/value-accessor wiring. - Dispatch in `agg_groupby_inline` tries the packed-u64 path first for multi-column shapes, then falls through to the existing `agg_symbolized_path`, then to `agg_generic_hash_path`. All other multi-column shapes (3+ columns, int×string, pure int multi-col) are unchanged. - No avg-bytes-per-row gate: unlike `agg_symbolized_path`, the packed-u64 path's cost is one symbolize pass plus a tight `u64`-keyed loop, not a symbolized RecordBatch rebuild fed through the generic comparator-closure hash path. Short-string shapes (TPC-H Q1) win along with long-string shapes. - New `bench_packed_u64_two_strings` Rust-level benchmark covering two-Utf8-column shapes at varying cardinalities and across the full count/sum/min/max/count+sum agg matrix. ## Benchmarks ### Rust-level Q1-like benchmark (2 short string keys, 6 groups, sum+count of two float64 cols) | rows | inline (ms) | fallback (ms) | speedup | |------|-------------|---------------|---------| | 1.2M | 21.75 | 33.74 | 1.55x | | 5M | 111.34 | 175.12 | 1.57x | ### Rust-level packed-u64 benchmark (2 long Utf8 keys, int64 val, full agg matrix) | agg | rows | distinct | inline (ms) | fallback (ms) | speedup | |-----------|------|-------------------|-------------|---------------|---------| | count | 1.2M | 8 × 4 = 32 | 57.05 | 28.53 | 0.50x | | sum | 1.2M | 8 × 4 = 32 | 56.05 | 33.73 | 0.60x | | min | 1.2M | 8 × 4 = 32 | 57.66 | 33.31 | 0.58x | | max | 1.2M | 8 × 4 = 32 | 56.23 | 33.97 | 0.60x | | count+sum | 1.2M | 8 × 4 = 32 | 56.75 | 34.10 | 0.60x | | count | 1.2M | 64 × 32 = 2048 | 55.56 | 27.64 | 0.50x | | sum | 1.2M | 64 × 32 = 2048 | 62.70 | 39.52 | 0.63x | | min | 1.2M | 64 × 32 = 2048 | 54.75 | 41.41 | 0.76x | | max | 1.2M | 64 × 32 = 2048 | 56.90 | 41.01 | 0.72x | | count+sum | 1.2M | 64 × 32 = 2048 | 55.77 | 40.71 | 0.73x | | count | 5M | 8 × 4 = 32 | 246.37 | 127.56 | 0.52x | | sum | 5M | 8 × 4 = 32 | 235.03 | 142.14 | 0.60x | | min | 5M | 8 × 4 = 32 | 235.38 | 140.41 | 0.60x | | max | 5M | 8 × 4 = 32 | 238.82 | 142.38 | 0.60x | | count+sum | 5M | 8 × 4 = 32 | 240.04 | 142.84 | 0.60x | | count | 5M | 1000 × 100 = 100k | 241.55 | 132.99 | 0.55x | | sum | 5M | 1000 × 100 = 100k | 243.80 | 207.44 | 0.85x | | min | 5M | 1000 × 100 = 100k | 246.93 | 210.96 | 0.85x | | max | 5M | 1000 × 100 = 100k | 245.68 | 212.00 | 0.86x | | count+sum | 5M | 1000 × 100 = 100k | 239.63 | 206.81 | 0.86x | The long-string shapes are below 1.0x vs the Daft fallback, but that gap is pre-existing in PR Eventual-Inc#6748's inline path — Daft's general groupby machinery is currently faster for these specific shapes. The delta this PR introduces is the comparison below. ### Inline-vs-inline: packed-u64 vs PR Eventual-Inc#6748 inline (same shape, same machine) | shape | PR Eventual-Inc#6748 inline (ms) | packed-u64 (ms) | speedup | |----------------------------------------|----------------------|-----------------|---------| | Q1 1.2M × 6 short strings (sum+count) | 23.07 | 21.75 | 1.06x | | Q1 5M × 6 short strings (sum+count) | 120.15 | 111.34 | 1.08x | | 1.2M × 32 long strings (count) | 61.69 | 57.05 | 1.08x | | 1.2M × 32 long strings (sum) | 65.88 | 56.05 | 1.18x | | 1.2M × 32 long strings (min) | 62.78 | 57.66 | 1.09x | | 1.2M × 32 long strings (max) | 63.70 | 56.23 | 1.13x | | 1.2M × 32 long strings (count+sum) | 64.76 | 56.75 | 1.14x | | 1.2M × 2048 long strings (count) | 62.43 | 55.56 | 1.12x | | 1.2M × 2048 long strings (sum) | 62.62 | 62.70 | 1.00x | | 1.2M × 2048 long strings (min) | 64.45 | 54.75 | 1.18x | | 1.2M × 2048 long strings (max) | 63.78 | 56.90 | 1.12x | | 1.2M × 2048 long strings (count+sum) | 66.10 | 55.77 | 1.19x | | 5M × 32 long strings (count) | 293.04 | 246.37 | 1.19x | | 5M × 32 long strings (sum) | 265.15 | 235.03 | 1.13x | | 5M × 32 long strings (min) | 263.15 | 235.38 | 1.12x | | 5M × 32 long strings (max) | 267.26 | 238.82 | 1.12x | | 5M × 32 long strings (count+sum) | 262.31 | 240.04 | 1.09x | | 5M × 100k long strings (count) | 268.71 | 241.55 | 1.11x | | 5M × 100k long strings (sum) | 269.30 | 243.80 | 1.10x | | 5M × 100k long strings (min) | 267.40 | 246.93 | 1.08x | | 5M × 100k long strings (max) | 269.19 | 245.68 | 1.10x | | 5M × 100k long strings (count+sum) | 270.75 | 239.63 | 1.13x | Across all 22 measured shapes, packed-u64 is 1.06x to 1.19x faster than PR Eventual-Inc#6748's inline path. This is the "missing benefit" the comparator-closure / IndexHash dispatch was hiding. All benchmarks run on Linux (WSL), Rust nightly --release, warmup=3, iters=10, --test-threads=1. Commands: cargo test -p daft-recordbatch --release -- bench_q1_like --nocapture --ignored --test-threads=1 cargo test -p daft-recordbatch --release -- bench_packed_u64_two_strings --nocapture --ignored --test-threads=1 ## Test plan - 5 new test_inline_packed_u64_* cases comparing inline output to the fallback path: - Utf8 × Utf8, no nulls. - Utf8 × Utf8, nulls in both columns. - Binary × Binary. - Utf8 × Binary. - Short-string CHAR(1) shape (TPC-H Q1). - All existing inline_agg tests still pass (48 total, 0 failures). - New bench_packed_u64_two_strings benchmark exercises the path end-to-end at scale across the count/sum/min/max/count+sum matrix. ## Related issues Part of Eventual-Inc#6585 (deeper specialization of Item 3, on top of PR Eventual-Inc#6748).

greptile-apps · 2026-05-12T19:41:58Z

Greptile Summary

This PR introduces a dedicated agg_packed_u64_path for the two-string-column group-by shape: it symbolizes each Utf8/Binary column into dense u32 IDs, packs the pair into a single u64, and groups with a typed FnvHashMap<u64, u32> — eliminating the per-row comparator-closure and IndexHash overhead that remained even after #6748. The dispatch in agg_groupby_inline is updated to try this path first for multi-column shapes, falling through to agg_symbolized_path and then the generic path.

New agg_packed_u64_path: symbolizes two string columns, packs (sym0 << 32) | sym1 into a u64 key, and groups with a tight integer loop; null-equals-null is preserved by reserving ID 0 for null in each column's symbol space independently.
New symbolize_string_col helper: isolates per-dtype accessor wiring for Utf8 and Binary, returning None for non-string columns so callers can fall back gracefully.
Dispatch change: two-string-column shapes now bypass agg_symbolized_path's MIN_AVG_STRING_BYTES_PER_ROW gate entirely, which is intentional (packed-u64 cost is one symbolize pass, not a RecordBatch rebuild through the generic path).

Confidence Score: 4/5

Safe to merge; the packed-u64 packing scheme is mathematically sound and correctness is verified by comparing against the fallback across all key shapes and null patterns.

The core logic is correct and well-tested across the full agg matrix. Only minor doc comment staleness (caused by the new dispatch ordering) and an inline import style violation were found.

The test doc comments in inline_agg.rs around test_inline_multi_col_long_strings and test_inline_multi_col_short_strings are worth a quick look to confirm they accurately reflect which dispatch path each test now exercises.

Important Files Changed

Filename	Overview
src/daft-recordbatch/src/ops/inline_agg.rs	Adds `symbolize_string_col` helper and `agg_packed_u64_path` for the two-string-column shape, updates the multi-column dispatch to try packed-u64 first; stale doc comments in inherited tests need updating.
src/daft-recordbatch/src/ops/bench_agg.rs	Adds `make_two_long_string_batch` helper and `bench_packed_u64_two_strings` benchmark; minor style issue with inline `use std::sync::Arc;` inside the helper function.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[agg_groupby_inline] --> B{num_columns == 1?}
    B -- Yes --> C[dispatch_single_col_int! / dispatch_single_col_bytes!]
    C --> D{fast path matched?}
    D -- Yes --> OUT[Return groupkey_indices]
    D -- No --> GEN[agg_generic_hash_path]
    B -- No --> E[agg_packed_u64_path]
    E --> F{exactly 2 cols,\nboth Utf8/Binary?}
    F -- Yes --> G[symbolize_string_col x2\npack to u64\nFnvHashMap u64 to u32]
    G --> OUT
    F -- No --> H[agg_symbolized_path]
    H --> I{has_string_col AND\navg_bytes >= MIN_AVG?}
    I -- Yes --> J[replace cols with u32 symbols\n-> agg_generic_hash_path]
    J --> OUT
    I -- No --> GEN
    GEN --> OUT

_{Reviews (1): Last reviewed commit: "perf(inline-agg): pack two-string-column..." | Re-trigger Greptile}

greptile-apps · 2026-05-12T19:42:02Z

+    fn make_two_long_string_batch(
+        num_rows: usize,
+        num_distinct1: usize,
+        num_distinct2: usize,
+    ) -> (RecordBatch, Vec<BoundExpr>, Schema) {
+        use std::sync::Arc;


The use std::sync::Arc; import is placed inside the function body, which violates the project convention that imports belong at the module level. The mod bench block doesn't already import Arc explicitly (it relies on prelude::*), so if prelude::* doesn't re-export Arc, this should be added at the module top — and either way the inline placement should be removed.

Suggested change

fn make_two_long_string_batch(

num_rows: usize,

num_distinct1: usize,

num_distinct2: usize,

) -> (RecordBatch, Vec<BoundExpr>, Schema) {

use std::sync::Arc;

fn make_two_long_string_batch(

num_rows: usize,

num_distinct1: usize,

num_distinct2: usize,

) -> (RecordBatch, Vec<BoundExpr>, Schema) {

Rule Used: Import statements should be placed at the top of t... (source)

Learned From
Eventual-Inc/Daft#5078

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

greptile-apps · 2026-05-12T19:42:03Z

+    /// Multi-column groupby with two long-string keys (well above the
+    /// avg-bytes-per-row gate) and count/sum/min/max aggs. Exercises the
+    /// active symbolized path.


These test comments were written in #6748 to describe agg_symbolized_path, but this PR changes the dispatch order so that agg_packed_u64_path fires first for any two-string-column shape. The claim "Exercises the active symbolized path" is now incorrect — this test is routed to agg_packed_u64_path (which has no byte-threshold gate and always symbolizes).

Suggested change

/// Multi-column groupby with two long-string keys (well above the

/// avg-bytes-per-row gate) and count/sum/min/max aggs. Exercises the

/// active symbolized path.

/// Multi-column groupby with two long-string keys and count/sum/min/max aggs.

/// Exercises the packed-u64 path (two Utf8 columns → symbolize + pack).

greptile-apps · 2026-05-12T19:42:04Z

+    /// Short-string multi-column groupby (CHAR(1)-style keys) — symbolization
+    /// should be skipped under the avg-bytes-per-row gate. Mirrors the shape
+    /// of TPC-H Q1 to guard against the perf regression that motivated the
+    /// gating heuristic.


The comment implies short-string symbolization is skipped via the MIN_AVG_STRING_BYTES_PER_ROW gate, but agg_packed_u64_path runs before agg_symbolized_path in the dispatch and has no such gate — so for this CHAR(1)-style shape the packed-u64 path fires and symbolization does happen.

Suggested change

/// Short-string multi-column groupby (CHAR(1)-style keys) — symbolization

/// should be skipped under the avg-bytes-per-row gate. Mirrors the shape

/// of TPC-H Q1 to guard against the perf regression that motivated the

/// gating heuristic.

/// Short-string two-column groupby (CHAR(1)-style keys, TPC-H Q1 shape).

/// Routed to the packed-u64 path (no byte-threshold gate), so symbolization

/// always runs and the result must still match the fallback.

…iew nits - cargo fmt: wrap long lines in the agg_groupby_inline dispatch chain and the bench_packed_u64_two_strings helper call. - Hoist `use std::sync::Arc;` to the `bench` module top instead of inside `make_two_long_string_batch`. - Update doc comments for `test_inline_multi_col_long_strings` and `test_inline_multi_col_short_strings` — both were inherited from PR Eventual-Inc#6748 but now route through `agg_packed_u64_path` first under this PR's dispatch order, so the comments mentioning "active symbolized path" and "symbolization skipped via gate" were stale. Addresses Greptile P2 comments on Eventual-Inc#6924.

…path clippy::needless_borrow (denied under `-D clippy::style`) flagged the `&cols[i]` arguments to `symbolize_string_col`. `cols` is already a sequence of `&Series`, so indexing yields a reference directly. Fixes the `style` CI failure on Eventual-Inc#6924.

codspeed-hq · 2026-05-13T05:22:38Z

Merging this PR will not alter performance

✅ 40 untouched benchmarks
⏩ 10 skipped benchmarks¹

_{Comparing BABTUNA:perf/packed-symbol-groupby (41312f4) with main (cad4cd5)}

10 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩

CodSpeed flagged a -11.22% regression on test_tpch_sql[1-in-memory-10] (TPC-H Q1, scale 1). Q1's CHAR(1) `l_returnflag` / `l_linestatus` are short enough that the symbolize pass costs more than the saved per-row hash/compare work, even with the packed-u64 typed hash map. Local microbench showed the opposite but failed to capture the cache / branch behavior on CodSpeed's runner and TPC-H's actual data shape. Mirror the avg-bytes-per-row gate that `agg_symbolized_path` already uses (`MIN_AVG_STRING_BYTES_PER_ROW = 16`): tally bytes across both Utf8/Binary cols up front and bail when total bytes per row falls below the threshold. Short-string shapes (Q1) now fall through to the existing generic-hash-on-raw-strings path (unchanged from main). Long-string shapes still get the packed-u64 win. Local Rust benchmarks confirm: - Q1 short strings (CHAR(1) × CHAR(1)): packed-u64 no longer fires; inline path matches main behavior (1.36x-1.68x vs Daft fallback, same as before this PR series). - Long-string two-col shapes (avg >= 16 bytes/row): packed-u64 still fires, still ~1.08x-1.19x faster than PR Eventual-Inc#6748 inline path. Also update doc comments on `test_inline_multi_col_short_strings_*` and `test_inline_packed_u64_short_strings_*` — both now exercise the gate's short-string skip branch and route through the generic path. Addresses Eventual-Inc#6924 CodSpeed regression.

BABTUNA · 2026-05-13T07:42:28Z

Addressed the CodSpeed regression in ee0657c.

The packed-u64 path now reuses agg_symbolized_path's MIN_AVG_STRING_BYTES_PER_ROW = 16 gate. TPC-H Q1's CHAR(1) flag+status falls below threshold and dispatches to the existing generic hash path on raw strings — same behavior as main, so no regression. Long-string two-col shapes still take packed-u64 and still beat PR #6748's inline path by 1.06x-1.19x across the count/sum/min/max/count+sum agg matrix (table updated in the description).

CodSpeed should rerun and clear once the new bench cycle finishes.

BABTUNA · 2026-05-16T23:58:22Z

plz don't review this pr yet, it's part 2 of the multi-column grouping and is not really working the way I want it to

Restructures the packed-u64 two-string-column optimization to live inside `agg_symbolized_path` rather than as a separate dispatch branch. Eliminates the double byte-tally that previous revisions of this PR paid on the TPC-H Q1 short-string path, and matches PR Eventual-Inc#6748's dispatch structure exactly — `agg_groupby_inline`'s multi-column branch is now identical to Eventual-Inc#6748's. Changes: - `agg_symbolized_path` now symbolizes each Utf8/Binary col into a `Vec<u32>` once, stored alongside a `None` slot for non-string cols. If exactly two string cols, pack the two symbol IDs into a u64 and group with `FnvHashMap<u64, u32>` via the new `agg_packed_u64_inner` helper. Otherwise (mixed shape or 3+ cols), rebuild the symbolized RecordBatch and call `agg_generic_hash_path` as before. - Delete the separate `agg_packed_u64_path` function and the `symbolize_string_col` helper; their logic is now inline in `agg_symbolized_path`. - Revert the dispatcher in `agg_groupby_inline` to PR Eventual-Inc#6748's structure (`match agg_symbolized_path → None: generic`). - Update doc comments on three inherited tests to reflect that the packed-u64 fast path lives inside `agg_symbolized_path`, not as a separate dispatch step. Tests: all 48 `inline_agg` tests pass, including the 5 `test_inline_packed_u64_*` cases. Benchmarks: long-string two-col shapes (1.2M-5M rows x 32-100k groups) continue to show ~1.06x-1.18x speedup vs the equivalent no-fast-path baseline on the same restructured code, matching the earlier separate-function implementation's measurements. Short-string shapes (TPC-H Q1) now route through the exact same code path as PR Eventual-Inc#6748 — no overhead added.

BABTUNA · 2026-05-18T11:25:53Z

CodSpeed is now green after the retrigger (41312f4). The earlier -12% was a CodSpeed runner-mismatch artifact, not a real perf change.

Per-commit deltas across all 6 head commits make this unambiguous:

Commit	Delta	What changed
`3346a01` packed-u64 logic	+0.23%	(no perf impact)
`1eafb00` fmt + comments	-0.28%	(no perf impact)
`a47ebdf` clippy fix (removed two `&` chars)	-11.18%	runner switched AMD → Intel Xeon
`ee0657c` gate	-0.39%	(no perf impact)
`c689536` refactor	-0.38%	(no perf impact)
`41312f4` empty CI retrigger	+12.08%	runner switched back, exactly cancels the -11.18%

Net effect of all the code changes in this PR on test_tpch_sql[1-in-memory-10]: -0.82% (noise). Q1 doesn't hit the packed-u64 path — its CHAR(1) flag+status falls below the MIN_AVG_STRING_BYTES_PER_ROW gate inherited from #6748, so it routes through the same generic hash path as main.

Root cause: CodSpeed uses cachegrind CPU simulation, which models the runner CPU's cache hierarchy. The CI runner pool has both AMD EPYC and Intel Xeon — their L2/L3 sizes and associativities differ, so cachegrind produces different simulated miss counts for identical machine code. CodSpeed surfaces this via the "Different runtime environments detected" banner on the previous run; our +12.08% retrigger result confirms the swing was purely a runner change.

BABTUNA and others added 8 commits April 23, 2026 14:46

test(inline-agg): cover min/max in nullable multi-col case

7f16abb

fix(inline-agg): resolve rebase conflicts with latest main

a0f80cf

Merge branch 'main' into feat/symbolize-string-groupby-only

e1005a3

Merge branch 'main' into feat/symbolize-string-groupby-only

971192b

ci: retrigger

ae27ae9

github-actions Bot added the perf label May 12, 2026

greptile-apps Bot reviewed May 12, 2026

View reviewed changes

madvart requested review from rchowell and removed request for rchowell May 12, 2026 19:59

BABTUNA added 2 commits May 13, 2026 00:22

BABTUNA added 2 commits May 16, 2026 20:26

ci: retrigger to land CodSpeed on a consistent runner

41312f4

BABTUNA mentioned this pull request May 20, 2026

perf: grouped aggregation optimization roadmap #6585

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(inline-agg): pack two-string-column keys into u64 for typed FNV grouping#6924

perf(inline-agg): pack two-string-column keys into u64 for typed FNV grouping#6924
BABTUNA wants to merge 13 commits into
Eventual-Inc:mainfrom
BABTUNA:perf/packed-symbol-groupby

BABTUNA commented May 12, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented May 12, 2026

Uh oh!

greptile-apps Bot May 12, 2026

Uh oh!

greptile-apps Bot May 12, 2026

Uh oh!

greptile-apps Bot May 12, 2026

Uh oh!

codspeed-hq Bot commented May 13, 2026 •

edited

Loading

Uh oh!

BABTUNA commented May 13, 2026

Uh oh!

BABTUNA commented May 16, 2026

Uh oh!

BABTUNA commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

BABTUNA commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Conceptual example

Key changes

Benchmarks

Test plan

Related issues

Uh oh!

greptile-apps Bot commented May 12, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

codspeed-hq Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will not alter performance

Footnotes

Uh oh!

BABTUNA commented May 13, 2026

Uh oh!

BABTUNA commented May 16, 2026

Uh oh!

BABTUNA commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

BABTUNA commented May 12, 2026 •

edited

Loading

codspeed-hq Bot commented May 13, 2026 •

edited

Loading