Commit 0be5982
perf: sort-merge join (SMJ) batch deferred filtering and move mark joins to bitwise stream. Near-unique LEFT and FULL SMJ 20-50x faster (#21184)
## Which issue does this PR close?
Partially addresses #20910. Fixes #21197.
## Rationale for this change
Sort-merge join with a filter on outer joins (LEFT/RIGHT/FULL) runs
`process_filtered_batches()` on every key transition in the Init state.
With near-unique keys (1:1 cardinality), this means running the full
deferred filtering pipeline (concat + `get_corrected_filter_mask` +
`filter_record_batch_by_join_type`) once per row — making filtered
LEFT/RIGHT/FULL **55x slower** than INNER for 10M unique keys.
Additionally, mark join logic in `MaterializingSortMergeJoinStream`
materializes full `(streamed, buffered)` pairs only to discard most of
them via `get_corrected_filter_mask()`. Mark joins are structurally
identical to semi joins (one output row per outer row with a boolean
result) and belong in `BitwiseSortMergeJoinStream`, which avoids pair
materialization entirely using a per-outer-batch bitset.
## What changes are included in this PR?
Three areas of improvement, building on the specialized semi/anti stream
from #20806:
**1. Move mark joins to `BitwiseSortMergeJoinStream`**
- Match on join type; `emit_outer_batch()` emits all rows with the match
bitset as a boolean column (vs semi's filter / anti's invert-and-filter)
- Route `LeftMark`/`RightMark` from `SortMergeJoinExec::execute()` to
the bitwise stream
- Remove all mark-specific logic from `MaterializingSortMergeJoinStream`
(`mark_row_as_match`, `is_not_null` column generation, mark arms in
filter correction)
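
The difference between the three emission modes can be sketched without Arrow. This is a minimal illustration, not the PR's code: `BitwiseJoin`, `Emitted`, and the `Vec<bool>` standing in for the per-outer-batch bitset are all hypothetical names and types chosen for the example; only the name `emit_outer_batch` comes from the description above.

```rust
/// Join flavours served by a bitset-based stream (hypothetical enum for this sketch).
#[derive(Clone, Copy)]
pub enum BitwiseJoin {
    Semi,
    Anti,
    Mark,
}

/// What one outer batch turns into once its match bits are known.
#[derive(Debug, PartialEq)]
pub enum Emitted {
    /// Semi/anti: a filtered subset of the outer rows.
    Rows(Vec<i64>),
    /// Mark: every outer row, paired with its boolean "matched" flag.
    Marked(Vec<(i64, bool)>),
}

/// `matched[i]` is true when outer row `i` had at least one inner match
/// that also passed the join filter. No (outer, inner) pairs are built.
pub fn emit_outer_batch(join: BitwiseJoin, outer: &[i64], matched: &[bool]) -> Emitted {
    assert_eq!(outer.len(), matched.len());
    match join {
        // Semi: keep rows whose bit is set.
        BitwiseJoin::Semi => Emitted::Rows(
            outer.iter().zip(matched).filter(|(_, m)| **m).map(|(v, _)| *v).collect(),
        ),
        // Anti: invert the bitset, then filter.
        BitwiseJoin::Anti => Emitted::Rows(
            outer.iter().zip(matched).filter(|(_, m)| !**m).map(|(v, _)| *v).collect(),
        ),
        // Mark: emit all rows and attach the bitset as a boolean column.
        BitwiseJoin::Mark => Emitted::Marked(
            outer.iter().copied().zip(matched.iter().copied()).collect(),
        ),
    }
}
```

With `outer = [10, 20, 30]` and `matched = [true, false, true]`, semi keeps rows 10 and 30, anti keeps row 20, and mark returns all three rows with their flags.
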
**2. Batch filter evaluation in `freeze_streamed()`**
- Split `freeze_streamed()` into null-joined classification +
`freeze_streamed_matched()` for batched materialization
- Collect indices across chunks, materialize left/right columns once
using tiered Arrow kernels (`slice` → `take` → `interleave`)
- Single `RecordBatch` construction and single `expression.evaluate()`
per freeze instead of per chunk
- Vectorize `append_filter_metadata()` using builder `extend()` instead
of per-element loop
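
One way to picture the tiered kernel choice is as a classification over the collected `(batch, row)` indices. The sketch below is an assumption about how such a tiering could be decided, not the logic in `freeze_streamed_matched()`: `Gather` and `plan_gather` are hypothetical names, and the rule (contiguous run in one batch → `slice`, arbitrary rows in one batch → `take`, rows spanning batches → `interleave`) is inferred from the kernel names alone.

```rust
/// Cheapest-first materialization strategies (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum Gather {
    /// All rows are one contiguous run inside a single batch: zero-copy slice.
    Slice { batch: usize, offset: usize, len: usize },
    /// All rows come from one batch but are not contiguous: gather by row index.
    Take { batch: usize, rows: Vec<usize> },
    /// Rows come from several batches: gather by (batch, row) pairs.
    Interleave(Vec<(usize, usize)>),
}

/// Pick a strategy for a list of `(batch_index, row_index)` pairs.
/// Returns `None` when there is nothing to materialize.
pub fn plan_gather(indices: &[(usize, usize)]) -> Option<Gather> {
    let (first_batch, first_row) = *indices.first()?;

    // More than one source batch forces the most general kernel.
    if indices.iter().any(|(b, _)| *b != first_batch) {
        return Some(Gather::Interleave(indices.to_vec()));
    }

    // Single batch: check whether the rows form first_row, first_row + 1, ...
    let contiguous = indices
        .iter()
        .enumerate()
        .all(|(i, (_, r))| *r == first_row + i);

    if contiguous {
        Some(Gather::Slice { batch: first_batch, offset: first_row, len: indices.len() })
    } else {
        Some(Gather::Take {
            batch: first_batch,
            rows: indices.iter().map(|(_, r)| *r).collect(),
        })
    }
}
```

The point of the ordering is that each tier does strictly more work than the one before it, so the planner only falls through when the cheaper option cannot represent the index pattern.
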
**3. Batch deferred filtering in Init state** (this is the big win for
Q22 and Q23)
- Gate `process_filtered_batches()` on accumulated rows >= `batch_size`
instead of running on every Init entry
- Accumulated data is bounded to ~2×batch_size (one batch from
`freeze_dequeuing_buffered`, one accumulating toward the next freeze), so
this does not reintroduce the unbounded buffering fixed by PR #20482
- `Exhausted` state flushes any remainder
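
The effect of the gate is easiest to see by counting pipeline runs. The following is a toy model under stated assumptions, not the stream's state machine: `DeferredFilter`, `on_init`, `on_exhausted`, and `runs_for_unique_keys` are hypothetical names, and the expensive concat + mask correction + filter pipeline is reduced to a counter.

```rust
/// Toy model of gated deferred filtering (hypothetical type for this sketch).
pub struct DeferredFilter {
    batch_size: usize,
    pending_rows: usize,
    /// How many times the expensive deferred-filter pipeline ran.
    pub pipeline_runs: usize,
}

impl DeferredFilter {
    pub fn new(batch_size: usize) -> Self {
        Self { batch_size, pending_rows: 0, pipeline_runs: 0 }
    }

    /// Called on each key transition with the rows frozen since the last one.
    /// The pipeline only runs once at least `batch_size` rows have accumulated.
    pub fn on_init(&mut self, new_rows: usize) {
        self.pending_rows += new_rows;
        if self.pending_rows >= self.batch_size {
            self.run_pipeline();
        }
    }

    /// Called once when both inputs are exhausted: flush any remainder.
    pub fn on_exhausted(&mut self) {
        if self.pending_rows > 0 {
            self.run_pipeline();
        }
    }

    fn run_pipeline(&mut self) {
        self.pipeline_runs += 1;
        self.pending_rows = 0;
    }
}

/// Pipeline runs for `rows` unique keys, i.e. one row per key transition.
pub fn runs_for_unique_keys(rows: usize, batch_size: usize) -> usize {
    let mut filter = DeferredFilter::new(batch_size);
    for _ in 0..rows {
        filter.on_init(1);
    }
    filter.on_exhausted();
    filter.pipeline_runs
}
```

In this model, 10,000 unique keys with a gate of 8,192 rows trigger two runs (one full batch plus the flushed remainder), while a gate of 1 row reproduces the old once-per-row behaviour with 10,000 runs.
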
**Cleanup:**
- Rename `SortMergeJoinStream` → `MaterializingSortMergeJoinStream`
(materializes explicit row pairs for join output) and
`SemiAntiMarkSortMergeJoinStream` → `BitwiseSortMergeJoinStream` (tracks
matches via boolean bitset)
- Consolidate `semi_anti_mark_sort_merge_join/` into `sort_merge_join/`
as `bitwise_stream.rs` / `bitwise_tests.rs`; rename `stream.rs` →
`materializing_stream.rs` and `tests.rs` → `materializing_tests.rs`
- Consolidate `SpillManager` construction into
`SortMergeJoinExec::execute()` (shared across both streams); move
`peak_mem_used` gauge into `BitwiseSortMergeJoinStream::try_new`
- `MaterializingSortMergeJoinStream` now handles only
Inner/Left/Right/Full — all semi/anti/mark branching removed
- `get_corrected_filter_mask()`: merge identical Left/Right/Full
branches; add null-metadata passthrough for already-null-joined rows
- `filter_record_batch_by_join_type()`: rewrite from `filter(true) +
filter(false) + concat` to `zip()` for in-place null-joining — preserves
row ordering and removes `create_null_joined_batch()` entirely; add
early return for empty batches
- `filter_record_batch_by_join_type()`: use `compute::filter()` directly
on `BooleanArray` instead of wrapping in temporary `RecordBatch`
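
Why the `zip()` rewrite preserves row ordering can be shown on plain vectors. This is a std-only sketch with hypothetical helper names (`null_join_in_place`, `null_join_by_concat`) and a two-column `Row` type invented for the example; the real code operates on Arrow arrays and record batches.

```rust
/// (left value, right value); a `None` right side means "null-joined".
pub type Row = (i32, Option<i32>);

/// zip-style: walk rows and mask together and null out the right side
/// in place where the filter failed. Output order equals input order.
pub fn null_join_in_place(rows: &[Row], mask: &[bool]) -> Vec<Row> {
    rows.iter()
        .zip(mask)
        .map(|(&(left, right), &keep)| (left, if keep { right } else { None }))
        .collect()
}

/// Older shape: filter(true) + filter(false) + concat. Rows that failed
/// the filter are moved behind the rows that passed it.
pub fn null_join_by_concat(rows: &[Row], mask: &[bool]) -> Vec<Row> {
    let passed = rows.iter().zip(mask).filter(|(_, m)| **m).map(|(r, _)| *r);
    let failed = rows
        .iter()
        .zip(mask)
        .filter(|(_, m)| !**m)
        .map(|(r, _)| (r.0, None));
    passed.chain(failed).collect()
}
```

For rows `[(1, 10), (2, 20), (3, 30)]` and mask `[true, false, true]`, the in-place version yields left keys in order 1, 2, 3, whereas the concat version yields 1, 3, 2. The in-place version also makes a single pass instead of two filters plus a concatenation.
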
## Benchmarks
`cargo run --release --bin dfbench -- smj`
| Query | Join Type | Rows | Keys | Filter | Main (ms) | PR (ms) | Speedup |
|-------|-----------|------|------|--------|-----------|---------|---------|
| Q1 | INNER | 1M×1M | 1:1 | — | 16.3 | 14.4 | 1.1x |
| Q2 | INNER | 1M×10M | 1:10 | — | 117.4 | 120.1 | 1.0x |
| Q3 | INNER | 1M×1M | 1:100 | — | 74.2 | 66.6 | 1.1x |
| Q4 | INNER | 1M×10M | 1:10 | 1% | 17.1 | 15.1 | 1.1x |
| Q5 | INNER | 1M×1M | 1:100 | 10% | 18.4 | 14.4 | 1.3x |
| Q6 | LEFT | 1M×10M | 1:10 | — | 129.3 | 122.7 | 1.1x |
| Q7 | LEFT | 1M×10M | 1:10 | 50% | 150.2 | 142.2 | 1.1x |
| Q8 | FULL | 1M×1M | 1:10 | — | 16.6 | 16.7 | 1.0x |
| Q9 | FULL | 1M×10M | 1:10 | 10% | 153.5 | 136.2 | 1.1x |
| Q10 | LEFT SEMI | 1M×10M | 1:10 | — | 53.1 | 53.1 | 1.0x |
| Q11 | LEFT SEMI | 1M×10M | 1:10 | 1% | 15.5 | 14.7 | 1.1x |
| Q12 | LEFT SEMI | 1M×10M | 1:10 | 50% | 65.0 | 67.3 | 1.0x |
| Q13 | LEFT SEMI | 1M×10M | 1:10 | 90% | 105.7 | 109.8 | 1.0x |
| Q14 | LEFT ANTI | 1M×10M | 1:10 | — | 54.3 | 53.9 | 1.0x |
| Q15 | LEFT ANTI | 1M×10M | 1:10 | partial | 51.5 | 50.5 | 1.0x |
| Q16 | LEFT ANTI | 1M×1M | 1:1 | — | 10.3 | 11.3 | 0.9x |
| Q17 | INNER | 1M×50M | 1:50 | 5% | 75.9 | 79.0 | 1.0x |
| Q18 | LEFT SEMI | 1M×50M | 1:50 | 2% | 50.2 | 49.0 | 1.0x |
| Q19 | LEFT ANTI | 1M×50M | 1:50 | partial | 336.4 | 344.2 | 1.0x |
| Q20 | INNER | 1M×10M | 1:100 | GROUP BY | 763.7 | 803.9 | 1.0x |
| Q21 | INNER | 10M×10M | 1:1 | 50% | 186.1 | 187.8 | 1.0x |
| Q22 | LEFT | 10M×10M | 1:1 | 50% | 10,193.8 | 185.8 | **54.9x** |
| Q23 | FULL | 10M×10M | 1:1 | 50% | 10,194.7 | 233.6 | **43.6x** |
| Q24 | LEFT MARK | 1M×10M | 1:10 | 1% | FAILS | 15.1 | — |
| Q25 | LEFT MARK | 1M×10M | 1:10 | 50% | FAILS | 67.3 | — |
| Q26 | LEFT MARK | 1M×10M | 1:10 | 90% | FAILS | 110.0 | — |
General workload (Q1-Q20, various join
types/cardinalities/selectivities): no regressions.
## Are these changes tested?
In addition to existing unit and sqllogictests:
- I ran 50 iterations of the fuzz tests (modified to only test against
hash join as the baseline because nested loop join takes too long)
`cargo test -p datafusion --features extended_tests --test fuzz --
join_fuzz`
- One new sqllogictest for #21197 that fails on main
- Four new unit tests: three for full join with filter that spills
- One new fuzz test to exercise full join with filter that spills
- New benchmark queries Q21-Q23: 10M×10M unique keys with 50% join
filter for INNER/LEFT/FULL — exercises the degenerate case this PR fixes
- New benchmark queries Q24-Q26 duplicate Q11-Q13 but for mark joins,
showing that they perform comparably to the other join types (`LeftSemi`)
that use this stream
## Are there any user-facing changes?
No.

1 parent 010e5ee, commit 0be5982
File tree (13 files changed, +2006 -1872 lines):
- benchmarks/src
- datafusion
  - core/tests/fuzz_cases
  - physical-plan/src/joins
    - semi_anti_sort_merge_join
    - sort_merge_join
  - sqllogictest/test_files