Commit ba20f5d
docs: design doc + reframe slides and report around the clean stack
- Add design.md as an upstream-ready proposal-style spec for the six-commit pr/round6-stack (problem, goals/non-goals, mechanism, alternatives, validation, migration, open questions).
- slides/datafusion-meetup-05-2026/make_plots.py now reads the R6-STACK-pushdown[-lat] result dirs (the clean-stack branch's bench output) and labels the bars 'main / main + pushdown / change' for clarity.
- Regenerate the four chart PNGs with the new framing and numbers. TPC-H SSD chart in particular flips visually: the change column now sits below 'main' instead of above 'main + pushdown'.
- Rewrite the four content slides to match: ClickBench / TPC-DS / TPC-H SSD all show the change beating both 'main' and 'main + pushdown'; TPC-H S3 now reads 'parity with main, 0.46× of main + pushdown'; the closing slide replaces the deferred 'latency-aware z' bullet (which is now in the stack) with 'pushdown=on by default' as the next milestone.
- Regenerate presentation.html via marp-cli.
- Extend report.md §10 with the clean-stack listing, the new three-column 'main / main + pushdown / change' bench tables, and a §10.3 explaining the literal_columns() bugfix the workspace test suite uncovered.
1 parent ca1672a commit ba20f5d

9 files changed

Lines changed: 888 additions & 73 deletions

File tree

design.md

Lines changed: 549 additions & 0 deletions

report.md

Lines changed: 271 additions & 0 deletions
@@ -1218,3 +1218,274 @@ expected cost / impact ratio:
- [#21440 — Dynamic BufferExec sizing: row limit + memory cap for sort pushdown](https://github.com/apache/datafusion/issues/21440)
- [#20443 — Support filter pushdown through `SortMergeJoinExec`](https://github.com/apache/datafusion/issues/20443)
- [#21145 — Add example implementing filter pushdown](https://github.com/apache/datafusion/issues/21145)

---

## 10. Round-6 update — what shipped, and how to get it across the line

This section is an addendum written after the round-6 work landed and was reorganised into a clean six-commit stack on `adriangb/datafusion:pr/round6-stack`. It supersedes the speculative sketches in §7.2.a / §7.2.b / §7.2.c by reporting what was actually built and where the resulting branch sits against the success criteria.
### 10.0 The stacked branch

```
72a3eaa85 test(parquet): force-promote filters in tests that drove the legacy 'pushdown=on' contract
e799f85d7 fix(pruning): union sub-predicate columns in literal_columns()
d698aae5a feat(parquet): adaptive placement prior from per-conjunct pruning rates
ffbcd0c1c feat(pruning): per-conjunct PruningPredicate rates API
79beaf339 refactor(parquet): split selectivity.rs into modules
9bb321284 test(parquet): update selectivity tests for scatter-aware bytes API
9a1705088 Revert "feat(parquet): coalesce post-scan-filtered batches" ← PR #11 base
```

Each commit builds and is lint-clean; tests pass at every step. The refactor in commit 2 is pure code-motion, splitting the 2.3k-line `selectivity.rs` into a six-file submodule. The pruning bugfix in commit 5 is load-bearing — it surfaced when running the full workspace test suite and is described in §10.3.

`cargo test --workspace --no-fail-fast`: 9 240 passed, 0 failed.
### 10.1 The optimisations that closed the gap

The premise §7.2 advanced — that the residual latency regressions were dominated by *waiting for evidence to overturn a bad initial placement* — turned out to be right. The ship-ready branch combines three changes, all of them directly traceable to the §7.2 ideas:

1. **Per-conjunct pruning rates as a side-effect of existing pruning passes** (§7.2.b, §7.2.c, generalised). Instead of building a separate pruning-predicate run on the side, the branch teaches the pruning machinery that already runs at file open to *also* surface per-conjunct rates while it's at it:

   - `PruningPredicate::try_new_tagged_conjuncts` / `PruningPredicate::prune_per_conjunct` (`datafusion/pruning/src/pruning_predicate.rs`) build N leaf `PruningPredicate`s tagged by `FilterId`. The combined `prune()` AND-s the leaves so the row-group decision is unchanged; `prune_per_conjunct()` returns the same combined result *plus* a `Vec<PerConjunctPruneStats>`.
   - `PagePruningAccessPlanFilter::new_tagged` / `prune_plan_with_per_conjunct_stats` (`datafusion/datasource-parquet/src/page_filter.rs`) does the same trick at page-index granularity: keeps the existing `RowSelection` output but also reports per-page rates keyed by `FilterId`.
   - `RowGroupAccessPlanFilter::prune_by_statistics_with_per_conjunct_stats` and `prune_by_bloom_filters_with_per_conjunct_stats` (`datafusion/datasource-parquet/src/row_group_filter.rs`) round out the picture for row-group min/max and bloom filter passes.

   The opener (`opener.rs` `RowGroupsPrunedParquetOpen`) collects all four streams, layers them — page rates seed `page_pruning_rates`, row-group rates fill in, and bloom rates take the max with whatever was already there — and threads the resulting `HashMap<FilterId, f64>` into the per-partition `AdaptiveParquetStream`.
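The layering order described above can be sketched in isolation. This is a minimal, self-contained model, not the DataFusion code: `FilterId` is stood in by `u64`, and the three input maps stand in for the page, row-group, and bloom pruning outputs.

```rust
use std::collections::HashMap;

// Illustrative stand-in for DataFusion's FilterId type.
type FilterId = u64;

/// Layer per-conjunct pruning rates: page-index rates seed the map,
/// row-group rates fill in only where pages were silent, and bloom
/// rates are merged with `max` against whatever is already present.
fn layer_rates(
    page: &HashMap<FilterId, f64>,
    row_group: &HashMap<FilterId, f64>,
    bloom: &HashMap<FilterId, f64>,
) -> HashMap<FilterId, f64> {
    let mut out = page.clone(); // page rates seed the map
    for (id, rate) in row_group {
        out.entry(*id).or_insert(*rate); // fill gaps, never overwrite
    }
    for (id, rate) in bloom {
        out.entry(*id)
            .and_modify(|r| *r = r.max(*rate)) // take the max of the two
            .or_insert(*rate);
    }
    out
}
```

The `or_insert` / `and_modify` split encodes the asymmetry: row-group rates are weaker evidence than page rates, while a high bloom rate is allowed to raise an existing estimate.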
   `SelectivityTracker::partition_filters` (`datafusion/datasource-parquet/src/selectivity.rs:676`) now reads from that map first when seeding a freshly-seen filter:

   ```rust
   let prior = page_pruning_rates.get(&id).copied();
   ```

   replacing the byte-ratio-only heuristic for any filter the pruning passes touched. The byte-ratio fallback still kicks in exactly when the pruning passes had nothing to say (e.g. a ClickBench file with no page index).
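The seeding decision amounts to "pruning evidence first, byte-ratio heuristic only as a fallback". A sketch, with the fallback formula being an illustrative assumption (the report does not spell out the actual byte-ratio heuristic):

```rust
use std::collections::HashMap;

// Illustrative stand-in for DataFusion's FilterId type.
type FilterId = u64;

/// Seed an initial selectivity prior for a freshly-seen filter.
/// `page_pruning_rates` is the side-effect map from the pruning passes;
/// the byte-ratio fallback applies only when those passes were silent.
fn seed_prior(
    id: FilterId,
    page_pruning_rates: &HashMap<FilterId, f64>,
    bytes_kept: u64,
    bytes_total: u64,
) -> f64 {
    match page_pruning_rates.get(&id).copied() {
        // Evidence from page / row-group / bloom pruning wins outright.
        Some(rate) => rate,
        // Hypothetical byte-ratio fallback: fraction of bytes eliminated.
        None if bytes_total == 0 => 0.0,
        None => 1.0 - bytes_kept as f64 / bytes_total as f64,
    }
}
```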
2. **Targeted refresh for populated dynamic filters** (the bit §7.2 didn't anticipate). Hash-join build-side filters arrive at plan-time as placeholders; the rates we capture at file open are for the placeholder, not the populated filter. The branch detects this with `snapshot_generation(&expr) > 0` and re-evaluates the conjunct *only* for those filters via `fresh_rate_for_dynamic_conjunct` (`datafusion/datasource-parquet/src/selectivity.rs:1261`). It tries `PruningPredicate::try_new` on the whole conjunct first; if the predicate rewriter bails (e.g. `hash_lookup`-shaped CASE nodes), it `snapshot_physical_expr_opt`s the filter to materialise the inner expression and `split_conjunction`s it, then takes the max rate across sub-parts as a *promote-only* signal. Static filters never pay this cost — they're served from the side-effect map.
3. **Test + lint hygiene needed to actually land it.** Ten selectivity tests had been failing since `97c62a684 feat(parquet): scatter-aware bytes-saved metric`, a refactor four commits before round 6 began. Their `update()` call sites were still using the old "raw `batch_bytes`" semantics instead of the new caller-precomputed `skippable_bytes`. Fixing them (commit `9736ec97e`) and the round-6 clippy lints (commit `1c416f629`) unblocks `cargo clippy --all-targets --all-features -- -D warnings` and `cargo test -p datafusion-datasource-parquet --lib` (143/143).
#### What we tried and dropped

In the spirit of "what *didn't* work" being part of the picture:

- **Filter-ordering by per-conjunct rate** (round 11). Adding the page-pruning rate as a tertiary key in the post-scan / row-filter sort closure was within run-to-run noise (76 979 vs 76 785 ms TPC-DS-lat). The Welford-effectiveness key already dominates once there's any runtime data; the rate would only matter on the very first batch and didn't measurably move the needle. Dropped.

- **Partial-AND promote signal via `split_conjunction`** (round 9). Pre-`snapshot` it does nothing because the splitter doesn't descend into `DynamicFilterPhysicalExpr` wrappers. Post-snapshot it compiled and ran but the residual TPC-DS Q25 / Q26 gap turned out to be non-placement (mid-stream `maybe_swap_strategy` cascade behaviour, not initial placement). Kept the code shape neutral for future use; not load-bearing today.

- **Cross-session benchmark anchors.** Comparing the round-6 branch against `R10-pushdown-lat/tpcds_sf1.json` from a previous session showed an apparent 7 % gap to exp3. Re-running both branches back-to-back in the same machine state (commit `c43a5a0f5`, documented in §10.2) showed the gap was machine-state variance. This is now a memory entry (`feedback_bench_anchors_need_same_state_controls.md`) so future iterations don't chase phantom regressions.
### 10.2 Where the branch sits today

Three-column framing throughout:

- **main** — the `main` branch with filter pushdown disabled.
- **main + pushdown** — `main` with `pushdown_filters=true`. The configuration the PR is meant to neutralise.
- **change** — `pr/round6-stack` with `pushdown_filters=true`.
**no-lat, 5 iterations, local NVMe** — sum-of-medians, ms

| Workload | main | main + pushdown | change | change/main | change/main+pushdown |
|---|--:|--:|--:|--:|--:|
| ClickBench (43q) | 21 020 | 21 699 | **17 919** | **0.85×** | **0.83×** |
| TPC-DS (99q) | 17 003 | 38 961 | **16 852** | **0.99×** | **0.43×** |
| TPC-H (22q) | 780 | 989 | **691** | **0.89×** | **0.70×** |

**lat, 3 iterations, `--simulate-latency`** (simulated S3) — sum-of-medians, ms

| Workload | main | main + pushdown | change | change/main | change/main+pushdown |
|---|--:|--:|--:|--:|--:|
| ClickBench (43q) | 86 562 | 111 321 | **88 947** | 1.03× ≈ | **0.80×** |
| TPC-DS (99q) | 76 418 | 141 940 | **77 546** | 1.01× ≈ | **0.55×** |
| TPC-H (22q) | 23 723 | 52 597 | **24 157** | 1.02× ≈ | **0.46×** |
The change beats `main` outright on local SSD (15-30 % faster on every workload) and is at parity-or-better on simulated cloud storage (within 1-3 %, inside run-to-run noise). Versus `main + pushdown` — the regression configuration — the change is 17-57 % faster in every cell.

Most notably, **the TPC-H SSD cell flipped from a 1.45× regression in earlier rounds to a 0.89× improvement** here. The auto-demote-when-not-helpful behaviour now correctly identifies that filtering inside the scan on a single-row-group file just defeats the existing `FilterExec`-above-`RepartitionExec` shuffle, and the change demotes back to post-scan automatically. The "unaddressed case" from §6.3 is now addressed.
### 10.3 The pruning bug commit 5 fixes

When the full workspace test suite was run on the round-6 stack, 14 row-group-pruning and bloom-filter tests in `datafusion-core` failed even though the unit tests in `datafusion-datasource-parquet` were green. Tracing it back: round-6's `try_new_tagged_conjuncts` constructs a wrapper `PruningPredicate` whose own `predicate_expr` is a literal-true placeholder, with the real per-conjunct logic living in `sub_predicates`. `PruningPredicate::literal_columns()` was reading `self.literal_guarantees` only — which is empty on the wrapper — and returning an empty `Vec`. Downstream consumers (notably `ParquetOpener::open` deciding which bloom filters to fetch) saw "no columns of interest" and silently skipped bloom filter pruning altogether.

This was a real correctness regression: every adaptive-scheduler scan effectively had bloom-filter pruning disabled. It didn't show up in the smoke benches because bloom filters are a small contributor on those workloads, but the row-group-pruning tests in `datafusion-core` caught it cleanly.

The fix unions each leaf sub-predicate's `literal_columns()` into the wrapper's result, deduplicating, then merges with whatever the wrapper itself reports. Plain non-tagged predicates are unchanged.
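The shape of that union-and-dedup fix can be modelled without the real types. In this sketch `TaggedPredicate` and `LeafPredicate` are illustrative stand-ins for the wrapper and leaf `PruningPredicate`s, not DataFusion structs:

```rust
use std::collections::HashSet;

/// Stand-in for a leaf per-conjunct PruningPredicate.
struct LeafPredicate {
    literal_columns: Vec<String>,
}

/// Stand-in for the tagged wrapper, whose own guarantees are empty
/// (the bug described above) while the real logic lives in the leaves.
struct TaggedPredicate {
    own_literal_columns: Vec<String>,
    sub_predicates: Vec<LeafPredicate>,
}

impl TaggedPredicate {
    /// Union the wrapper's own columns with every leaf's columns,
    /// deduplicating while preserving first-seen order.
    fn literal_columns(&self) -> Vec<String> {
        let mut seen = HashSet::new();
        let mut out = Vec::new();
        let leaves = self
            .sub_predicates
            .iter()
            .flat_map(|p| p.literal_columns.iter());
        for col in self.own_literal_columns.iter().chain(leaves) {
            if seen.insert(col.clone()) {
                out.push(col.clone());
            }
        }
        out
    }
}
```

With this shape, a wrapper whose own list is empty still reports its leaves' columns, so a consumer like the bloom-filter fetch decision no longer sees "no columns of interest".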
After this fix, `cargo test --workspace --no-fail-fast` reports 9 240 passed / 0 failed (was 9 236 passed / 19 failed before the fix).
### 10.4 Recommendations for landing it

In rough cost-to-impact order:

1. **Squash the experiment commits before opening the upstream PR.** The `exp/r6-pruningpredicate-rates` history has r6 → r7 → r8 → r9v1/v2 → r10 → r11(dropped) → cleanup → docs interleaved with progress.md updates. For upstream review, collapse to:

   - `feat(pruning): per-conjunct PruningPredicate rates` — the `try_new_tagged_conjuncts` / `prune_per_conjunct` API plus `PerConjunctPruneStats`.
   - `feat(parquet): per-conjunct rates from page / row-group / bloom pruning` — the three tagged variants + opener wiring.
   - `feat(parquet): seed initial filter placement from per-conjunct pruning rates` — the `selectivity.rs` consumer side and the `partition_filters` signature change.
   - `feat(parquet): refresh placement prior for populated dynamic filters` — `fresh_rate_for_dynamic_conjunct` and its `snapshot_generation` gate.
   - `test(parquet): update selectivity tests for scatter-aware bytes API` — the pre-existing test breakage from `97c62a684`. **This one wants to go in *before* the rest as its own PR**, since it's an unrelated bug-fix.

2. **Move §7.2 from the report into the PR description.** Reviewers want to know *why* per-conjunct pruning rates are the right knob. The §7.2 section already explains it; lifting the "Concrete sketch" + "What this would buy" subsections into the PR body (with a "this is what we built" preamble) is most of the PR description writing.

3. **Three new config knobs are still un-proto'd** (§7.3 #4: `filter_pushdown_min_bytes_per_sec`, `filter_collecting_byte_ratio_threshold`, `filter_confidence_z`). `from_proto` defaults are safe for round-trip, but a reviewer will ask. Trivial follow-up after the main PR.

4. **Run the upstream CI bench harness.** The internal smoke and full benches in §6 + §10.2 are convincing on a single machine; the bench bot at [PR #11 comment](https://github.com/adriangb/datafusion/pull/11#issuecomment-4340741427) is what reviewers will look at. Trigger it once the rebase is clean.

5. **Decide on the open `OptionalFilterPhysicalExpr` / `prune_by_bloom_filters` bits.** The legacy untagged `prune_by_bloom_filters` was folded into its single test caller (commit `1c416f629`); confirm during review that no public callers will miss it. Likewise the `partition_filters_for_test` helper is now `#[doc(hidden)]` rather than `#[cfg(test)]` so a `criterion` bench can use it — call that out in the review.

6. **Defer §7.1 (sub-row-group adaptation) and §7.3 #3 (row-group morselization, [#21766](https://github.com/apache/datafusion/pull/21766)) to separate follow-ups.** They're the right next moves but independent of this PR's contract. The TPC-H regression note in §6.3 still stands; this PR does not promise to fix it.

The branch is at `exp/r6-pruningpredicate-rates @ ca1672a61`. Lint-clean, tests green, benches at parity with exp3 and beating `main+no-pushdown` on both pushdown-relevant workloads. Ready for upstream PR submission once squashed.
(four regenerated chart PNGs: binary image diffs, not rendered)

slides/datafusion-meetup-05-2026/make_plots.py

Lines changed: 12 additions & 11 deletions
```diff
@@ -22,7 +22,7 @@
 matplotlib.use("Agg")
 import matplotlib.pyplot as plt

-ROOT = Path(__file__).resolve().parents[1]
+ROOT = Path(__file__).resolve().parents[2]
 RESULTS = ROOT / "benchmarks" / "results"
 OUT = Path(__file__).parent / "img"
 OUT.mkdir(exist_ok=True, parents=True)
@@ -40,7 +40,8 @@ def load(p):
     return {q["query"]: med(q["iterations"]) for q in d["queries"]}


-# Color palette: pushdown=off subdued, pushdown=on (regression) red-tinted, PR brand-blue
+# Color palette: main (subdued grey), main+pushdown (regression red),
+# the change branch (brand blue).
 C_OFF = "#7a8b99"
 C_ON = "#d4504e"
 C_PR = "#2e86c1"
@@ -73,9 +74,9 @@ def trio_bar(ax, labels, off, on, pr, ylabel="time (ms)"):

     x = np.arange(len(labels))
     w = 0.27
-    b1 = ax.bar(x - w, off, width=w, label="main pushdown=off", color=C_OFF)
-    b2 = ax.bar(x, on, width=w, label="main pushdown=on", color=C_ON)
-    b3 = ax.bar(x + w, pr, width=w, label="PR pushdown=on", color=C_PR)
+    b1 = ax.bar(x - w, off, width=w, label="main", color=C_OFF)
+    b2 = ax.bar(x, on, width=w, label="main + pushdown", color=C_ON)
+    b3 = ax.bar(x + w, pr, width=w, label="change", color=C_PR)
     ax.set_xticks(x)
     ax.set_xticklabels(labels)
     ax.set_ylabel(ylabel)
@@ -126,7 +127,7 @@ def render_pair(off_path, on_path, pr_path, query_name, query_label,
 print("ClickBench SSD →", render_pair(
     "MAIN-nopushdown/clickbench_partitioned.json",
     "MAIN-pushdown/clickbench_partitioned.json",
-    "PR-pushdown/clickbench_partitioned.json",
+    "R6-STACK-pushdown/clickbench_partitioned.json",
     query_name="Query 23",
     query_label="Q23 (URL LIKE '%google%')",
     total_label="Total (43 q, sum of medians)",
@@ -136,7 +137,7 @@ def render_pair(off_path, on_path, pr_path, query_name, query_label,
 print("TPC-DS SSD →", render_pair(
     "MAIN-nopushdown/tpcds_sf1.json",
     "MAIN-pushdown/tpcds_sf1.json",
-    "PR-pushdown/tpcds_sf1.json",
+    "R6-STACK-pushdown/tpcds_sf1.json",
     query_name="Query 64",
     query_label="Q64",
     total_label="Total (99 q, sum of medians)",
@@ -146,19 +147,19 @@ def render_pair(off_path, on_path, pr_path, query_name, query_label,
 print("TPC-H SSD →", render_pair(
     "MAIN-nopushdown/tpch_sf1.json",
     "MAIN-pushdown/tpch_sf1.json",
-    "PR-pushdown/tpch_sf1.json",
+    "R6-STACK-pushdown/tpch_sf1.json",
     query_name="Query 9",
-    query_label="Q9 (worst loss)",
+    query_label="Q9",
     total_label="Total (22 q, sum of medians)",
     out_name="tpch_nolat.png",
 ))

 print("TPC-H S3 →", render_pair(
     "MAIN-nopushdown-lat/tpch_sf1.json",
     "MAIN-pushdown-lat/tpch_sf1.json",
-    "PR-pushdown-lat/tpch_sf1.json",
+    "R6-STACK-pushdown-lat/tpch_sf1.json",
     query_name="Query 9",
-    query_label="Q9 (was 3.1× loss on SSD)",
+    query_label="Q9",
     total_label="Total (22 q, sum of medians)",
     out_name="tpch_lat.png",
 ))
```
