Skip to content

perf(inline-agg): pack two-string-column keys into u64 for typed FNV grouping#6924

Open
BABTUNA wants to merge 13 commits into
Eventual-Inc:mainfrom
BABTUNA:perf/packed-symbol-groupby
Open

perf(inline-agg): pack two-string-column keys into u64 for typed FNV grouping#6924
BABTUNA wants to merge 13 commits into
Eventual-Inc:mainfrom
BABTUNA:perf/packed-symbol-groupby

Conversation

@BABTUNA
Copy link
Copy Markdown
Contributor

@BABTUNA BABTUNA commented May 12, 2026

Stacked on #6748. This branch sits on top of feat/symbolize-string-groupby-only, so the diff currently includes #6748's commits. Only the final commits are new in this PR; everything else is reviewed as part of #6748. After #6748 merges to main, I'll rebase and the diff will collapse.

Summary

Extends Item 3 of #6585 (and builds on PR #6748). For the
two-string-column groupby shape with sufficiently long string keys,
this change packs the two u32 symbol IDs produced by
symbolize_column into a single u64 key and groups against a typed
FnvHashMap<u64, u32> in a tight integer loop — no per-row
comparator closure, no IndexHash, no dynamic-typed Series
equality dispatch.

PR #6748 already symbolizes Utf8/Binary group-by columns, but the
symbolized columns are still fed into the generic multi-column hash
path (agg_generic_hash_path), which keeps the comparator-closure
overhead. This PR captures the remaining benefit for two-string-col
shapes via a dedicated typed map. Grouping semantics and final query
results are unchanged. Short-string shapes (e.g. TPC-H Q1's CHAR(1)
l_returnflag / l_linestatus) fall through to the existing generic
hash path on raw strings, exactly as before this PR.

Conceptual example

Input rows:                 After symbolization:        Packed u64 key:
key1    | key2              key1_sym | key2_sym         (key1_sym << 32) | key2_sym
--------|--------           ---------|---------         -------------------------
"alice" | "red"             0        | 0                0x00000000_00000000
"bob"   | "red"             1        | 0                0x00000001_00000000
"alice" | "blue"            0        | 1                0x00000000_00000001
"alice" | "red"             0        | 0                0x00000000_00000000

The two symbol spaces sit in disjoint 32-bit halves of the u64 key,
so distinct (sym0, sym1) pairs always yield distinct packed keys.
Null-equals-null is preserved: when a column has nulls,
symbolize_column reserves symbol ID 0 for null and starts non-null
IDs at 1, so both-null rows share a unique packed key and never
collide with non-null rows.

Key changes

  • Add agg_packed_u64_path for exactly two Utf8/Binary group-by
    columns: symbolize each column into a Vec<u32>, pack the pair into
    a u64, group with FnvHashMap<u64, u32> using the same
    Vacant/Occupied pattern as the existing single-column fast paths.
  • Reuse agg_symbolized_path's MIN_AVG_STRING_BYTES_PER_ROW = 16
    gate: tally bytes across both string cols up front and bail when
    avg bytes per row falls below threshold. Short-string shapes
    (TPC-H Q1) fall through to the generic hash path on raw strings —
    for very short keys the symbolize pass costs more than the saved
    per-row hash/compare work, even with the typed u64 map.
  • Extract a small symbolize_string_col helper that returns
    Ok(Some(Vec<u32>)) for Utf8/Binary and Ok(None) otherwise — used
    by the new path and isolates per-dtype null/value-accessor wiring.
  • Dispatch in agg_groupby_inline tries the packed-u64 path first for
    multi-column shapes, then falls through to the existing
    agg_symbolized_path, then to agg_generic_hash_path. All other
    multi-column shapes (3+ columns, int×string, pure int multi-col)
    are unchanged.
  • New bench_packed_u64_two_strings Rust-level benchmark covering
    two-Utf8-column shapes at varying cardinalities across the full
    count/sum/min/max/count+sum agg matrix.

Benchmarks

Long-string two-column shapes (avg bytes per row ≥ 16) measured by
running the same benchmark twice on the same machine: once with this
PR's packed-u64 dispatch enabled, once with it disabled (falling
through to PR #6748's existing agg_symbolized_path /
agg_generic_hash_path). All numbers are best-of-10 after 3 warmups,
Rust nightly --release, --test-threads=1, Linux (WSL).

agg rows distinct PR #6748 inline (ms) this PR (ms) speedup
count 1.2M 8 × 4 = 32 61.69 57.05 1.08x
sum 1.2M 8 × 4 = 32 65.88 56.05 1.18x
min 1.2M 8 × 4 = 32 62.78 57.66 1.09x
max 1.2M 8 × 4 = 32 63.70 56.23 1.13x
count+sum 1.2M 8 × 4 = 32 64.76 56.75 1.14x
count 1.2M 64 × 32 = 2048 62.43 55.56 1.12x
sum 1.2M 64 × 32 = 2048 62.62 62.70 1.00x
min 1.2M 64 × 32 = 2048 64.45 54.75 1.18x
max 1.2M 64 × 32 = 2048 63.78 56.90 1.12x
count+sum 1.2M 64 × 32 = 2048 66.10 55.77 1.19x
count 5M 8 × 4 = 32 293.04 246.37 1.19x
sum 5M 8 × 4 = 32 265.15 235.03 1.13x
min 5M 8 × 4 = 32 263.15 235.38 1.12x
max 5M 8 × 4 = 32 267.26 238.82 1.12x
count+sum 5M 8 × 4 = 32 262.31 240.04 1.09x
count 5M 1000 × 100 = 100k 268.71 241.55 1.11x
sum 5M 1000 × 100 = 100k 269.30 243.80 1.10x
min 5M 1000 × 100 = 100k 267.40 246.93 1.08x
max 5M 1000 × 100 = 100k 269.19 245.68 1.10x
count+sum 5M 1000 × 100 = 100k 270.75 239.63 1.13x

Across all 20 measured long-string shapes the packed-u64 path is
1.06x-1.19x faster than PR #6748's inline path, with a single 1.00x
outlier in the 1.2M × 2048 sum case. This is the "missing benefit"
the comparator-closure / IndexHash dispatch was hiding.

Short-string shapes (avg < 16 bytes per row) — TPC-H Q1 included —
fall through to the existing generic hash path on raw strings and
behave identically to main. An earlier revision of this PR did not
gate the packed-u64 path and CodSpeed flagged a -11.22% regression on
test_tpch_sql[1-in-memory-10]; the gate added in ee0657c
restores that workload to baseline.

Reproduction:
cargo test -p daft-recordbatch --release -- bench_packed_u64_two_strings --nocapture --ignored --test-threads=1

Test plan

  • 5 new test_inline_packed_u64_* cases comparing inline output to
    the fallback path:
    • Utf8 × Utf8, no nulls.
    • Utf8 × Utf8, nulls in both columns.
    • Binary × Binary.
    • Utf8 × Binary.
    • Short-string CHAR(1) shape (TPC-H Q1) — exercises the gate's
      skip branch (routes through generic hash path).
  • All existing inline_agg tests still pass (48 total, 0 failures).
  • New bench_packed_u64_two_strings benchmark exercises the path
    end-to-end at scale across the count/sum/min/max/count+sum matrix.

Related issues

Part of #6585 (deeper specialization of Item 3, on top of PR #6748).

BABTUNA and others added 8 commits April 23, 2026 14:46
…tion

For multi-column groupbys containing Utf8/Binary columns, replace
variable-width string columns with dense u32 symbol IDs before
hashing and comparison. This avoids repeatedly hashing and comparing
string bytes during the probe phase.

# Conflicts:
#	src/daft-recordbatch/src/ops/inline_agg.rs
For multi-column groupbys with very short string keys (e.g. CHAR(1) in
TPC-H Q1), replacing variable-width strings with u32 symbol IDs adds
an extra pass without saving meaningful hash/compare work, producing
a measurable CodSpeed regression on Q1.

Add a heuristic that skips symbolization unless avg live string
bytes/row meets MIN_AVG_STRING_BYTES_PER_ROW (16). When skipped,
agg_symbolized_path returns None and grouping falls through to the
existing agg_generic_hash_path. Above the threshold the symbolized
path runs unchanged.

Tests:
- multi-col int-only (no string col -> skip path)
- multi-col long strings (above gate -> active symbolized path)
- multi-col short strings (below gate -> skip path; TPC-H Q1 shape)
…grouping

## Summary

Extends Item 3 of Eventual-Inc#6585 (and builds on PR Eventual-Inc#6748). For the
two-string-column groupby shape, this change packs the two `u32`
symbol IDs produced by `symbolize_column` into a single `u64` key and
groups against a typed `FnvHashMap<u64, u32>` in a tight integer loop
— no per-row comparator closure, no `IndexHash`, no dynamic-typed
`Series` equality dispatch.

PR Eventual-Inc#6748 already symbolizes Utf8/Binary group-by columns, but the
symbolized columns are still fed into the generic multi-column hash
path (`agg_generic_hash_path`), which keeps the comparator-closure
overhead. This PR captures the remaining benefit by routing the
two-string-column shape — including the TPC-H Q1 `l_returnflag` /
`l_linestatus` pattern — through a dedicated typed map. Grouping
semantics and final query results are unchanged.

## Conceptual example

```
Input rows:                 After symbolization:        Packed u64 key:
key1    | key2              key1_sym | key2_sym         (key1_sym << 32) | key2_sym
--------|--------           ---------|---------         -------------------------
"alice" | "red"             0        | 0                0x00000000_00000000
"bob"   | "red"             1        | 0                0x00000001_00000000
"alice" | "blue"            0        | 1                0x00000000_00000001
"alice" | "red"             0        | 0                0x00000000_00000000
```

The two symbol spaces sit in disjoint 32-bit halves of the u64 key,
so distinct (sym0, sym1) pairs always yield distinct packed keys.
Null-equals-null is preserved: when a column has nulls,
`symbolize_column` reserves symbol ID 0 for null and starts non-null
IDs at 1, so both-null rows share a unique packed key and never collide
with non-null rows.

## Key changes

- Add `agg_packed_u64_path` for exactly two Utf8/Binary group-by
  columns: symbolize each column into a `Vec<u32>`, pack the pair into
  a `u64`, group with `FnvHashMap<u64, u32>` using the same
  Vacant/Occupied pattern as the existing single-column fast paths.
- Extract a small `symbolize_string_col` helper that returns
  `Ok(Some(Vec<u32>))` for Utf8/Binary and `Ok(None)` otherwise — used
  by the new path and isolates per-dtype null/value-accessor wiring.
- Dispatch in `agg_groupby_inline` tries the packed-u64 path first for
  multi-column shapes, then falls through to the existing
  `agg_symbolized_path`, then to `agg_generic_hash_path`. All other
  multi-column shapes (3+ columns, int×string, pure int multi-col) are
  unchanged.
- No avg-bytes-per-row gate: unlike `agg_symbolized_path`, the
  packed-u64 path's cost is one symbolize pass plus a tight `u64`-keyed
  loop, not a symbolized RecordBatch rebuild fed through the generic
  comparator-closure hash path. Short-string shapes (TPC-H Q1) win
  along with long-string shapes.
- New `bench_packed_u64_two_strings` Rust-level benchmark covering
  two-Utf8-column shapes at varying cardinalities and across the full
  count/sum/min/max/count+sum agg matrix.

## Benchmarks

### Rust-level Q1-like benchmark (2 short string keys, 6 groups, sum+count of two float64 cols)

| rows | inline (ms) | fallback (ms) | speedup |
|------|-------------|---------------|---------|
| 1.2M | 21.75       | 33.74         | 1.55x   |
| 5M   | 111.34      | 175.12        | 1.57x   |

### Rust-level packed-u64 benchmark (2 long Utf8 keys, int64 val, full agg matrix)

| agg       | rows | distinct          | inline (ms) | fallback (ms) | speedup |
|-----------|------|-------------------|-------------|---------------|---------|
| count     | 1.2M | 8 × 4 = 32        | 57.05       | 28.53         | 0.50x   |
| sum       | 1.2M | 8 × 4 = 32        | 56.05       | 33.73         | 0.60x   |
| min       | 1.2M | 8 × 4 = 32        | 57.66       | 33.31         | 0.58x   |
| max       | 1.2M | 8 × 4 = 32        | 56.23       | 33.97         | 0.60x   |
| count+sum | 1.2M | 8 × 4 = 32        | 56.75       | 34.10         | 0.60x   |
| count     | 1.2M | 64 × 32 = 2048    | 55.56       | 27.64         | 0.50x   |
| sum       | 1.2M | 64 × 32 = 2048    | 62.70       | 39.52         | 0.63x   |
| min       | 1.2M | 64 × 32 = 2048    | 54.75       | 41.41         | 0.76x   |
| max       | 1.2M | 64 × 32 = 2048    | 56.90       | 41.01         | 0.72x   |
| count+sum | 1.2M | 64 × 32 = 2048    | 55.77       | 40.71         | 0.73x   |
| count     | 5M   | 8 × 4 = 32        | 246.37      | 127.56        | 0.52x   |
| sum       | 5M   | 8 × 4 = 32        | 235.03      | 142.14        | 0.60x   |
| min       | 5M   | 8 × 4 = 32        | 235.38      | 140.41        | 0.60x   |
| max       | 5M   | 8 × 4 = 32        | 238.82      | 142.38        | 0.60x   |
| count+sum | 5M   | 8 × 4 = 32        | 240.04      | 142.84        | 0.60x   |
| count     | 5M   | 1000 × 100 = 100k | 241.55      | 132.99        | 0.55x   |
| sum       | 5M   | 1000 × 100 = 100k | 243.80      | 207.44        | 0.85x   |
| min       | 5M   | 1000 × 100 = 100k | 246.93      | 210.96        | 0.85x   |
| max       | 5M   | 1000 × 100 = 100k | 245.68      | 212.00        | 0.86x   |
| count+sum | 5M   | 1000 × 100 = 100k | 239.63      | 206.81        | 0.86x   |

The long-string shapes are below 1.0x vs the Daft fallback, but that
gap is pre-existing in PR Eventual-Inc#6748's inline path — Daft's general groupby
machinery is currently faster for these specific shapes. The delta
this PR introduces is the comparison below.

### Inline-vs-inline: packed-u64 vs PR Eventual-Inc#6748 inline (same shape, same machine)

| shape                                  | PR Eventual-Inc#6748 inline (ms) | packed-u64 (ms) | speedup |
|----------------------------------------|----------------------|-----------------|---------|
| Q1 1.2M × 6 short strings (sum+count)  |  23.07               |  21.75          | 1.06x   |
| Q1 5M × 6 short strings (sum+count)    | 120.15               | 111.34          | 1.08x   |
| 1.2M × 32 long strings (count)         |  61.69               |  57.05          | 1.08x   |
| 1.2M × 32 long strings (sum)           |  65.88               |  56.05          | 1.18x   |
| 1.2M × 32 long strings (min)           |  62.78               |  57.66          | 1.09x   |
| 1.2M × 32 long strings (max)           |  63.70               |  56.23          | 1.13x   |
| 1.2M × 32 long strings (count+sum)     |  64.76               |  56.75          | 1.14x   |
| 1.2M × 2048 long strings (count)       |  62.43               |  55.56          | 1.12x   |
| 1.2M × 2048 long strings (sum)         |  62.62               |  62.70          | 1.00x   |
| 1.2M × 2048 long strings (min)         |  64.45               |  54.75          | 1.18x   |
| 1.2M × 2048 long strings (max)         |  63.78               |  56.90          | 1.12x   |
| 1.2M × 2048 long strings (count+sum)   |  66.10               |  55.77          | 1.19x   |
| 5M × 32 long strings (count)           | 293.04               | 246.37          | 1.19x   |
| 5M × 32 long strings (sum)             | 265.15               | 235.03          | 1.13x   |
| 5M × 32 long strings (min)             | 263.15               | 235.38          | 1.12x   |
| 5M × 32 long strings (max)             | 267.26               | 238.82          | 1.12x   |
| 5M × 32 long strings (count+sum)       | 262.31               | 240.04          | 1.09x   |
| 5M × 100k long strings (count)         | 268.71               | 241.55          | 1.11x   |
| 5M × 100k long strings (sum)           | 269.30               | 243.80          | 1.10x   |
| 5M × 100k long strings (min)           | 267.40               | 246.93          | 1.08x   |
| 5M × 100k long strings (max)           | 269.19               | 245.68          | 1.10x   |
| 5M × 100k long strings (count+sum)     | 270.75               | 239.63          | 1.13x   |

Across all 22 measured shapes, packed-u64 is 1.06x to 1.19x faster
than PR Eventual-Inc#6748's inline path. This is the "missing benefit" the
comparator-closure / IndexHash dispatch was hiding.

All benchmarks run on Linux (WSL), Rust nightly --release, warmup=3,
iters=10, --test-threads=1. Commands:
cargo test -p daft-recordbatch --release -- bench_q1_like --nocapture --ignored --test-threads=1
cargo test -p daft-recordbatch --release -- bench_packed_u64_two_strings --nocapture --ignored --test-threads=1

## Test plan

- 5 new test_inline_packed_u64_* cases comparing inline output to
  the fallback path:
  - Utf8 × Utf8, no nulls.
  - Utf8 × Utf8, nulls in both columns.
  - Binary × Binary.
  - Utf8 × Binary.
  - Short-string CHAR(1) shape (TPC-H Q1).
- All existing inline_agg tests still pass (48 total, 0 failures).
- New bench_packed_u64_two_strings benchmark exercises the path
  end-to-end at scale across the count/sum/min/max/count+sum matrix.

## Related issues

Part of Eventual-Inc#6585 (deeper specialization of Item 3, on top of PR Eventual-Inc#6748).
@github-actions github-actions Bot added the perf label May 12, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 12, 2026

Greptile Summary

This PR introduces a dedicated agg_packed_u64_path for the two-string-column group-by shape: it symbolizes each Utf8/Binary column into dense u32 IDs, packs the pair into a single u64, and groups with a typed FnvHashMap<u64, u32> — eliminating the per-row comparator-closure and IndexHash overhead that remained even after #6748. The dispatch in agg_groupby_inline is updated to try this path first for multi-column shapes, falling through to agg_symbolized_path and then the generic path.

  • New agg_packed_u64_path: symbolizes two string columns, packs (sym0 << 32) | sym1 into a u64 key, and groups with a tight integer loop; null-equals-null is preserved by reserving ID 0 for null in each column's symbol space independently.
  • New symbolize_string_col helper: isolates per-dtype accessor wiring for Utf8 and Binary, returning None for non-string columns so callers can fall back gracefully.
  • Dispatch change: two-string-column shapes now bypass agg_symbolized_path's MIN_AVG_STRING_BYTES_PER_ROW gate entirely, which is intentional (packed-u64 cost is one symbolize pass, not a RecordBatch rebuild through the generic path).

Confidence Score: 4/5

Safe to merge; the packed-u64 packing scheme is mathematically sound and correctness is verified by comparing against the fallback across all key shapes and null patterns.

The core logic is correct and well-tested across the full agg matrix. Only minor doc comment staleness (caused by the new dispatch ordering) and an inline import style violation were found.

The test doc comments in inline_agg.rs around test_inline_multi_col_long_strings and test_inline_multi_col_short_strings are worth a quick look to confirm they accurately reflect which dispatch path each test now exercises.

Important Files Changed

Filename Overview
src/daft-recordbatch/src/ops/inline_agg.rs Adds symbolize_string_col helper and agg_packed_u64_path for the two-string-column shape, updates the multi-column dispatch to try packed-u64 first; stale doc comments in inherited tests need updating.
src/daft-recordbatch/src/ops/bench_agg.rs Adds make_two_long_string_batch helper and bench_packed_u64_two_strings benchmark; minor style issue with inline use std::sync::Arc; inside the helper function.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[agg_groupby_inline] --> B{num_columns == 1?}
    B -- Yes --> C[dispatch_single_col_int! / dispatch_single_col_bytes!]
    C --> D{fast path matched?}
    D -- Yes --> OUT[Return groupkey_indices]
    D -- No --> GEN[agg_generic_hash_path]
    B -- No --> E[agg_packed_u64_path]
    E --> F{exactly 2 cols,\nboth Utf8/Binary?}
    F -- Yes --> G[symbolize_string_col x2\npack to u64\nFnvHashMap u64 to u32]
    G --> OUT
    F -- No --> H[agg_symbolized_path]
    H --> I{has_string_col AND\navg_bytes >= MIN_AVG?}
    I -- Yes --> J[replace cols with u32 symbols\n-> agg_generic_hash_path]
    J --> OUT
    I -- No --> GEN
    GEN --> OUT
Loading

Reviews (1): Last reviewed commit: "perf(inline-agg): pack two-string-column..." | Re-trigger Greptile

Comment on lines +211 to +216
fn make_two_long_string_batch(
num_rows: usize,
num_distinct1: usize,
num_distinct2: usize,
) -> (RecordBatch, Vec<BoundExpr>, Schema) {
use std::sync::Arc;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The use std::sync::Arc; import is placed inside the function body, which violates the project convention that imports belong at the module level. The mod bench block doesn't already import Arc explicitly (it relies on prelude::*), so if prelude::* doesn't re-export Arc, this should be added at the module top — and either way the inline placement should be removed.

Suggested change
fn make_two_long_string_batch(
num_rows: usize,
num_distinct1: usize,
num_distinct2: usize,
) -> (RecordBatch, Vec<BoundExpr>, Schema) {
use std::sync::Arc;
fn make_two_long_string_batch(
num_rows: usize,
num_distinct1: usize,
num_distinct2: usize,
) -> (RecordBatch, Vec<BoundExpr>, Schema) {

Rule Used: Import statements should be placed at the top of t... (source)

Learned From
Eventual-Inc/Daft#5078

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +2229 to +2231
/// Multi-column groupby with two long-string keys (well above the
/// avg-bytes-per-row gate) and count/sum/min/max aggs. Exercises the
/// active symbolized path.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 These test comments were written in #6748 to describe agg_symbolized_path, but this PR changes the dispatch order so that agg_packed_u64_path fires first for any two-string-column shape. The claim "Exercises the active symbolized path" is now incorrect — this test is routed to agg_packed_u64_path (which has no byte-threshold gate and always symbolizes).

Suggested change
/// Multi-column groupby with two long-string keys (well above the
/// avg-bytes-per-row gate) and count/sum/min/max aggs. Exercises the
/// active symbolized path.
/// Multi-column groupby with two long-string keys and count/sum/min/max aggs.
/// Exercises the packed-u64 path (two Utf8 columns → symbolize + pack).

Comment on lines +2283 to +2286
/// Short-string multi-column groupby (CHAR(1)-style keys) — symbolization
/// should be skipped under the avg-bytes-per-row gate. Mirrors the shape
/// of TPC-H Q1 to guard against the perf regression that motivated the
/// gating heuristic.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The comment implies short-string symbolization is skipped via the MIN_AVG_STRING_BYTES_PER_ROW gate, but agg_packed_u64_path runs before agg_symbolized_path in the dispatch and has no such gate — so for this CHAR(1)-style shape the packed-u64 path fires and symbolization does happen.

Suggested change
/// Short-string multi-column groupby (CHAR(1)-style keys) — symbolization
/// should be skipped under the avg-bytes-per-row gate. Mirrors the shape
/// of TPC-H Q1 to guard against the perf regression that motivated the
/// gating heuristic.
/// Short-string two-column groupby (CHAR(1)-style keys, TPC-H Q1 shape).
/// Routed to the packed-u64 path (no byte-threshold gate), so symbolization
/// always runs and the result must still match the fallback.

@madvart madvart requested review from rchowell and removed request for rchowell May 12, 2026 19:59
BABTUNA added 2 commits May 13, 2026 00:22
…iew nits

- cargo fmt: wrap long lines in the agg_groupby_inline dispatch chain
  and the bench_packed_u64_two_strings helper call.
- Hoist `use std::sync::Arc;` to the `bench` module top instead of
  inside `make_two_long_string_batch`.
- Update doc comments for `test_inline_multi_col_long_strings` and
  `test_inline_multi_col_short_strings` — both were inherited from
  PR Eventual-Inc#6748 but now route through `agg_packed_u64_path` first under
  this PR's dispatch order, so the comments mentioning "active
  symbolized path" and "symbolization skipped via gate" were stale.

Addresses Greptile P2 comments on Eventual-Inc#6924.
…path

clippy::needless_borrow (denied under `-D clippy::style`) flagged the
`&cols[i]` arguments to `symbolize_string_col`. `cols` is already a
sequence of `&Series`, so indexing yields a reference directly.

Fixes the `style` CI failure on Eventual-Inc#6924.
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 13, 2026

Merging this PR will not alter performance

✅ 40 untouched benchmarks
⏩ 10 skipped benchmarks1


Comparing BABTUNA:perf/packed-symbol-groupby (41312f4) with main (cad4cd5)

Open in CodSpeed

Footnotes

  1. 10 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

CodSpeed flagged a -11.22% regression on test_tpch_sql[1-in-memory-10]
(TPC-H Q1, scale 1). Q1's CHAR(1) `l_returnflag` / `l_linestatus` are
short enough that the symbolize pass costs more than the saved per-row
hash/compare work, even with the packed-u64 typed hash map. Local
microbench showed the opposite but failed to capture the cache /
branch behavior on CodSpeed's runner and TPC-H's actual data shape.

Mirror the avg-bytes-per-row gate that `agg_symbolized_path` already
uses (`MIN_AVG_STRING_BYTES_PER_ROW = 16`): tally bytes across both
Utf8/Binary cols up front and bail when total bytes per row falls
below the threshold. Short-string shapes (Q1) now fall through to the
existing generic-hash-on-raw-strings path (unchanged from main).
Long-string shapes still get the packed-u64 win.

Local Rust benchmarks confirm:
- Q1 short strings (CHAR(1) × CHAR(1)): packed-u64 no longer fires;
  inline path matches main behavior (1.36x-1.68x vs Daft fallback,
  same as before this PR series).
- Long-string two-col shapes (avg >= 16 bytes/row): packed-u64 still
  fires, still ~1.08x-1.19x faster than PR Eventual-Inc#6748 inline path.

Also update doc comments on `test_inline_multi_col_short_strings_*`
and `test_inline_packed_u64_short_strings_*` — both now exercise the
gate's short-string skip branch and route through the generic path.

Addresses Eventual-Inc#6924 CodSpeed regression.
@BABTUNA
Copy link
Copy Markdown
Contributor Author

BABTUNA commented May 13, 2026

Addressed the CodSpeed regression in ee0657c.

The packed-u64 path now reuses agg_symbolized_path's MIN_AVG_STRING_BYTES_PER_ROW = 16 gate. TPC-H Q1's CHAR(1) flag+status falls below threshold and dispatches to the existing generic hash path on raw strings — same behavior as main, so no regression. Long-string two-col shapes still take packed-u64 and still beat PR #6748's inline path by 1.06x-1.19x across the count/sum/min/max/count+sum agg matrix (table updated in the description).

CodSpeed should rerun and clear once the new bench cycle finishes.

@BABTUNA
Copy link
Copy Markdown
Contributor Author

BABTUNA commented May 16, 2026

plz don't review this pr yet, it's part 2 of the multi-column grouping and is not really working the way I want it to

BABTUNA added 2 commits May 16, 2026 20:26
Restructures the packed-u64 two-string-column optimization to live
inside `agg_symbolized_path` rather than as a separate dispatch
branch. Eliminates the double byte-tally that previous revisions of
this PR paid on the TPC-H Q1 short-string path, and matches PR
Eventual-Inc#6748's dispatch structure exactly — `agg_groupby_inline`'s
multi-column branch is now identical to Eventual-Inc#6748's.

Changes:
- `agg_symbolized_path` now symbolizes each Utf8/Binary col into a
  `Vec<u32>` once, stored alongside a `None` slot for non-string cols.
  If exactly two string cols, pack the two symbol IDs into a u64 and
  group with `FnvHashMap<u64, u32>` via the new `agg_packed_u64_inner`
  helper. Otherwise (mixed shape or 3+ cols), rebuild the symbolized
  RecordBatch and call `agg_generic_hash_path` as before.
- Delete the separate `agg_packed_u64_path` function and the
  `symbolize_string_col` helper; their logic is now inline in
  `agg_symbolized_path`.
- Revert the dispatcher in `agg_groupby_inline` to PR Eventual-Inc#6748's
  structure (`match agg_symbolized_path → None: generic`).
- Update doc comments on three inherited tests to reflect that the
  packed-u64 fast path lives inside `agg_symbolized_path`, not as a
  separate dispatch step.

Tests: all 48 `inline_agg` tests pass, including the 5
`test_inline_packed_u64_*` cases.

Benchmarks: long-string two-col shapes (1.2M-5M rows x 32-100k
groups) continue to show ~1.06x-1.18x speedup vs the equivalent
no-fast-path baseline on the same restructured code, matching the
earlier separate-function implementation's measurements. Short-string
shapes (TPC-H Q1) now route through the exact same code path as PR
Eventual-Inc#6748 — no overhead added.
@BABTUNA
Copy link
Copy Markdown
Contributor Author

BABTUNA commented May 18, 2026

CodSpeed is now green after the retrigger (41312f4). The earlier -12% was a CodSpeed runner-mismatch artifact, not a real perf change.

Per-commit deltas across all 6 head commits make this unambiguous:

Commit Delta What changed
3346a01 packed-u64 logic +0.23% (no perf impact)
1eafb00 fmt + comments -0.28% (no perf impact)
a47ebdf clippy fix (removed two & chars) -11.18% runner switched AMD → Intel Xeon
ee0657c gate -0.39% (no perf impact)
c689536 refactor -0.38% (no perf impact)
41312f4 empty CI retrigger +12.08% runner switched back, exactly cancels the -11.18%

Net effect of all the code changes in this PR on test_tpch_sql[1-in-memory-10]: -0.82% (noise). Q1 doesn't hit the packed-u64 path — its CHAR(1) flag+status falls below the MIN_AVG_STRING_BYTES_PER_ROW gate inherited from #6748, so it routes through the same generic hash path as main.

Root cause: CodSpeed uses cachegrind CPU simulation, which models the runner CPU's cache hierarchy. The CI runner pool has both AMD EPYC and Intel Xeon — their L2/L3 sizes and associativities differ, so cachegrind produces different simulated miss counts for identical machine code. CodSpeed surfaces this via the "Different runtime environments detected" banner on the previous run; our +12.08% retrigger result confirms the swing was purely a runner change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant