perf: replace SMJ's join_filter_not_matched_map HashMap with Vec<FilterState>#21517
Conversation
|
run benchmark smj |
|
run benchmark tpch10 |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing smj_replace_filtered_hashmap (701fddc) to 6a770aa (merge-base) diff using: smj File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing smj_replace_filtered_hashmap (701fddc) to 6a770aa (merge-base) diff using: tpch10 File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagesmj — base (merge-base)
smj — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpch10 — base (merge-base)
tpch10 — branch
File an issue against this benchmark runner |
|
This is the only relevant microbenchmark query, I think. TPC-H does not have a relevant query for this codepath IIRC. |
There was a problem hiding this comment.
Thanks @mbutrovich
Makes the code much readable, the benchmarks look like within a noise range
Which issue does this PR close?
Partially addresses #20910, might be the last one for now.
Rationale for this change
In full outer joins with filters,
BufferedBatchtracks which buffered rows had all filter evaluations fail using aHashMap<u64, bool>. This map is read and written per-row in a hot loop duringfreeze_streamed_matched. The HashMap pays ~40-60 bytes per entry (8-byte u64 key + 1-byte bool value + hash table overhead), hashes every key twice per iteration (once forget, once forinsert), and scatters entries across heap allocations with poor cache locality.What changes are included in this PR?
Replaces
HashMap<u64, bool>withVec<FilterState>indexed by absolute row position within the batch.FilterStateis a#[repr(u8)]enum with three variants (Unvisited,AllFailed,SomePassed), so the Vec is 1 byte per row — allocated once, direct-indexed, no hashing. At the default batch size of 8192 rows the Vec is 8 KB (fits in L1 cache). Even at large batch sizes (32K+), 32 KB is still within L1 on most machines, while the HashMap at 32K entries would consume ~1-2 MB of scattered heap memory.Three states are needed because a simple
Vec<bool>cannot distinguish "never matched" (handled separately bynull_joined) from "matched but all filters failed" (must be emitted as null-joined). The enum variant names are self-documenting, unlikeOption<bool>whereNone/Some(true)/Some(false)would be opaque.Are these changes tested?
Existing tests.
Are there any user-facing changes?
No.