@@ -1218,3 +1218,274 @@ expected cost / impact ratio:
12181218- [ #21440 — Dynamic BufferExec sizing: row limit + memory cap for sort pushdown] ( https://github.com/apache/datafusion/issues/21440 )
12191219- [ #20443 — Support filter pushdown through ` SortMergeJoinExec ` ] ( https://github.com/apache/datafusion/issues/20443 )
12201220- [ #21145 — Add example implementing filter pushdown] ( https://github.com/apache/datafusion/issues/21145 )
1221+
1222+ ---
1223+
1224+ ## 10. Round-6 update — what shipped, and how to get it across the line
1225+
1226+ This section is an addendum written after the round-6 work landed
1227+ and was reorganised into a clean six-commit stack on
1228+ ` adriangb/datafusion:pr/round6-stack ` . It supersedes the speculative
1229+ sketches in §7.2.a / §7.2.b / §7.2.c by reporting what was actually
1230+ built and where the resulting branch sits against the success
1231+ criteria.
1232+
1233+ ### 10.0 The stacked branch
1234+
1235+ ```
1236+ 72a3eaa85 test(parquet): force-promote filters in tests that drove the legacy 'pushdown=on' contract
1237+ e799f85d7 fix(pruning): union sub-predicate columns in literal_columns()
1238+ d698aae5a feat(parquet): adaptive placement prior from per-conjunct pruning rates
1239+ ffbcd0c1c feat(pruning): per-conjunct PruningPredicate rates API
1240+ 79beaf339 refactor(parquet): split selectivity.rs into modules
1241+ 9bb321284 test(parquet): update selectivity tests for scatter-aware bytes API
1242+ 9a1705088 Revert "feat(parquet): coalesce post-scan-filtered batches" ← PR #11 base
1243+ ```
1244+
1245+ Each commit builds and is lint-clean; tests pass at every step. The
1246+ refactor in commit 2 is pure code-motion, splitting the 2.3k-line
1247+ ` selectivity.rs ` into a six-file submodule. The pruning bugfix in
1248+ commit 5 is load-bearing — it surfaced when running the full
1249+ workspace test suite and is described in §10.4.
1250+
1251+ cargo test --workspace --no-fail-fast: 9 240 passed, 0 failed.
1252+
1253+ ### 10.1 The optimisations that closed the gap
1254+
1255+ The premise §7.2 advanced — that the residual latency regressions
1256+ were dominated by * waiting for evidence to overturn a bad initial
1257+ placement* — turned out to be right. The ship-ready branch combines
1258+ three changes, all of them directly traceable to the §7.2 ideas:
1259+
1260+ 1 . ** Per-conjunct pruning rates as a side-effect of existing pruning
1261+ passes** (§7.2.b, §7.2.c, generalised). Instead of building a
1262+ separate pruning-predicate run on the side, the branch teaches
1263+ the pruning machinery that already runs at file open to * also*
1264+ surface per-conjunct rates while it's at it:
1265+
1266+ - ` PruningPredicate::try_new_tagged_conjuncts ` /
1267+ ` PruningPredicate::prune_per_conjunct `
1268+ (` datafusion/pruning/src/pruning_predicate.rs ` ) build N leaf
1269+ ` PruningPredicate ` s tagged by ` FilterId ` . The combined
1270+ ` prune() ` AND-s the leaves so the row-group decision is
1271+ unchanged; ` prune_per_conjunct() ` returns the same combined
1272+ result * plus* a ` Vec<PerConjunctPruneStats> ` .
1273+ - ` PagePruningAccessPlanFilter::new_tagged ` /
1274+ ` prune_plan_with_per_conjunct_stats `
1275+ (` datafusion/datasource-parquet/src/page_filter.rs ` ) does the
1276+ same trick at page-index granularity: keeps the existing
1277+ ` RowSelection ` output but also reports per-page rates keyed by
1278+ ` FilterId ` .
1279+ - ` RowGroupAccessPlanFilter::prune_by_statistics_with_per_conjunct_stats `
1280+ and ` prune_by_bloom_filters_with_per_conjunct_stats `
1281+ (` datafusion/datasource-parquet/src/row_group_filter.rs ` ) round
1282+ out the picture for row-group min/max and bloom filter passes.
1283+
1284+ The opener (` opener.rs ` ` RowGroupsPrunedParquetOpen ` ) collects
1285+ all four streams, layers them — page rates seed
1286+ ` page_pruning_rates ` , row-group rates fill in, and bloom rates
1287+ take the max with whatever was already there — and threads the
1288+ resulting ` HashMap<FilterId, f64> ` into the per-partition
1289+ ` AdaptiveParquetStream ` .
1290+
1291+ ` SelectivityTracker::partition_filters `
1292+ (` datafusion/datasource-parquet/src/selectivity.rs:676 ` ) now
1293+ reads from that map first when seeding a freshly-seen filter:
1294+
1295+ ``` rust
1296+ let prior = page_pruning_rates . get (& id ). copied ();
1297+ ```
1298+
1299+ replacing the byte-ratio-only heuristic for any filter the
1300+ pruning passes touched. The byte-ratio fallback still kicks in
1301+ exactly when the pruning passes had nothing to say (e.g. a
1302+ ClickBench file with no page index).
1303+
1304+ 2 . ** Targeted refresh for populated dynamic filters** (the bit §7.2
1305+ didn't anticipate). Hash-join build-side filters arrive at
1306+ plan-time as placeholders; the rates we capture at file open are
1307+ for the placeholder, not the populated filter. The branch detects
1308+ this with ` snapshot_generation(&expr) > 0 ` and re-evaluates the
1309+ conjunct * only* for those filters via
1310+ ` fresh_rate_for_dynamic_conjunct `
1311+ (` datafusion/datasource-parquet/src/selectivity.rs:1261 ` ). It
1312+ tries ` PruningPredicate::try_new ` on the whole conjunct first; if
1313+ the predicate rewriter bails (e.g. ` hash_lookup ` -shaped CASE
1314+ nodes), it ` snapshot_physical_expr_opt ` s the filter to materialise
1315+ the inner expression and ` split_conjunction ` s it, then takes the
1316+ max rate across sub-parts as a * promote-only* signal. Static
1317+ filters never pay this cost — they're served from the side-effect
1318+ map.
1319+
1320+ 3 . ** Test + lint hygiene needed to actually land it.** Ten
1321+ selectivity tests had been failing since
1322+ ` 97c62a684 feat(parquet): scatter-aware bytes-saved metric ` —
1323+ a refactor four commits before round 6 began. Their ` update() `
1324+ call sites were still using the old "raw ` batch_bytes ` " semantics
1325+ instead of the new caller-precomputed ` skippable_bytes ` . Fixing
1326+ them (commit ` 9736ec97e ` ) and the round-6 clippy lints
1327+ (commit ` 1c416f629 ` ) unblocks
1328+ ` cargo clippy --all-targets --all-features -- -D warnings ` and
1329+ ` cargo test -p datafusion-datasource-parquet --lib ` (143/143).
1330+
1331+ #### What we tried and dropped
1332+
1333+ In the spirit of "what * didn't* work" being part of the picture:
1334+
1335+ - ** Filter-ordering by per-conjunct rate** (round 11). Adding the
1336+ page-pruning rate as a tertiary key in the post-scan / row-filter
1337+ sort closure was within run-to-run noise (76 979 vs 76 785 ms
1338+ TPC-DS-lat). The Welford-effectiveness key already dominates once
1339+ there's any runtime data; the rate would only matter on the very
1340+ first batch and didn't measurably move the needle. Dropped.
1341+
1342+ - ** Partial-AND promote signal via ` split_conjunction ` ** (round 9).
1343+ Pre-` snapshot ` it does nothing because the splitter doesn't descend
1344+ into ` DynamicFilterPhysicalExpr ` wrappers. Post-snapshot it
1345+ compiled and ran but the residual TPC-DS Q25 / Q26 gap turned out
1346+ to be non-placement (mid-stream ` maybe_swap_strategy ` cascade
1347+ behaviour, not initial placement). Kept the code shape neutral
1348+ for future use; not load-bearing today.
1349+
1350+ - ** Cross-session benchmark anchors.** Comparing the round-6 branch
1351+ against ` R10-pushdown-lat/tpcds_sf1.json ` from a previous session
1352+ showed an apparent 7 % gap to exp3. Re-running both branches
1353+ back-to-back in the same machine state (commit ` c43a5a0f5 ` ,
1354+ documented in §10.2) showed the gap was machine-state variance.
1355+ This is now a memory entry
1356+ (` feedback_bench_anchors_need_same_state_controls.md ` ) so future
1357+ iterations don't chase phantom regressions.
1358+
1359+ ### 10.2 Where the branch sits today
1360+
1361+ Three column framing throughout:
1362+
1363+ - ** main** — the ` main ` branch with filter pushdown disabled.
1364+ - ** main + pushdown** — ` main ` with ` pushdown_filters=true ` . The
1365+ configuration the PR is meant to neutralise.
1366+ - ** change** — ` pr/round6-stack ` with ` pushdown_filters=true ` .
1367+
1368+ ** no-lat, 5 iterations, local NVMe** — sum-of-medians, ms
1369+
1370+ | Workload | main | main + pushdown | change | change/main | change/main+pushdown |
1371+ | ---| --:| --:| --:| --:| --:|
1372+ | ClickBench (43q) | 21 020 | 21 699 | ** 17 919** | ** 0.85×** ✓ | ** 0.83×** ✓ |
1373+ | TPC-DS (99q) | 17 003 | 38 961 | ** 16 852** | ** 0.99×** ✓ | ** 0.43×** ✓ |
1374+ | TPC-H (22q) | 780 | 989 | ** 691** | ** 0.89×** ✓ | ** 0.70×** ✓ |
1375+
1376+ ** lat, 3 iterations, ` --simulate-latency ` ** (simulated S3) — sum-of-medians, ms
1377+
1378+ | Workload | main | main + pushdown | change | change/main | change/main+pushdown |
1379+ | ---| --:| --:| --:| --:| --:|
1380+ | ClickBench (43q) | 86 562 | 111 321 | ** 88 947** | 1.03× ≈ | ** 0.80×** ✓ |
1381+ | TPC-DS (99q) | 76 418 | 141 940 | ** 77 546** | 1.01× ≈ | ** 0.55×** ✓ |
1382+ | TPC-H (22q) | 23 723 | 52 597 | ** 24 157** | 1.02× ≈ | ** 0.46×** ✓ |
1383+
1384+ The change beats ` main ` outright on local SSD (15-30 % faster on
1385+ every workload) and is at parity-or-better on simulated cloud
1386+ storage (within 1-3 %, inside run-to-run noise). Versus
1387+ ` main + pushdown ` — the regression configuration — the change is
1388+ 17-57 % faster in every cell.
1389+
1390+ Most notably, ** the TPC-H SSD cell flipped from a 1.45× regression
1391+ in earlier rounds to a 0.89× improvement** here. The
1392+ auto-demote-when-not-helpful behaviour now correctly identifies
1393+ that filtering inside the scan on a single-row-group file just
1394+ defeats the existing ` FilterExec ` -above-` RepartitionExec ` shuffle,
1395+ and the change demotes back to post-scan automatically. The
1396+ "unaddressed case" §6.3 is now addressed.
1397+
1398+ ### 10.3 The pruning bug commit 5 fixes
1399+
1400+ When the full workspace test suite was run on the round-6 stack,
1401+ 14 row-group-pruning and bloom-filter tests in ` datafusion-core `
1402+ failed even though the unit tests in ` datafusion-datasource-parquet `
1403+ were green. Tracing it back: round-6's ` try_new_tagged_conjuncts `
1404+ constructs a wrapper ` PruningPredicate ` whose own ` predicate_expr `
1405+ is a literal-true placeholder, with the real per-conjunct logic
1406+ living in ` sub_predicates ` . ` PruningPredicate::literal_columns() `
1407+ was reading ` self.literal_guarantees ` only — which is empty on the
1408+ wrapper — and returning an empty ` Vec ` . Downstream consumers
1409+ (notably ` ParquetOpener::open ` deciding which bloom filters to
1410+ fetch) saw "no columns of interest" and silently skipped bloom
1411+ filter pruning altogether.
1412+
1413+ This was a real correctness regression: every adaptive-scheduler
1414+ scan effectively had bloom-filter pruning disabled. It didn't show
1415+ up in the smoke benches because bloom filters are a small
1416+ contributor on those workloads, but the row-group-pruning tests in
1417+ ` datafusion-core ` caught it cleanly.
1418+
1419+ The fix unions each leaf sub-predicate's ` literal_columns() ` into
1420+ the wrapper's result, deduplicating, then merges with whatever the
1421+ wrapper itself reports. Plain non-tagged predicates are unchanged.
1422+
1423+ After this fix, ` cargo test --workspace --no-fail-fast ` reports
1424+ 9 240 passed / 0 failed (was 9 236 / 19 before the fix).
1425+
1426+ ### 10.4 Recommendations for landing it
1427+
1428+ In rough cost-to-impact order:
1429+
1430+ 1 . ** Squash the experiment commits before opening the upstream
1431+ PR.** The ` exp/r6-pruningpredicate-rates ` history has r6 → r7 →
1432+ r8 → r9v1/v2 → r10 → r11(dropped) → cleanup → docs interleaved
1433+ with progress.md updates. For upstream review, collapse to:
1434+
1435+ - ` feat(pruning): per-conjunct PruningPredicate rates ` — the
1436+ ` try_new_tagged_conjuncts ` / ` prune_per_conjunct ` API plus
1437+ ` PerConjunctPruneStats ` .
1438+ - `feat(parquet): per-conjunct rates from page / row-group /
1439+ bloom pruning` — the three tagged variants + opener wiring.
1440+ - `feat(parquet): seed initial filter placement from per-conjunct
1441+ pruning rates` — the ` selectivity.rs` consumer side and the
1442+ ` partition_filters ` signature change.
1443+ - `feat(parquet): refresh placement prior for populated dynamic
1444+ filters` — ` fresh_rate_for_dynamic_conjunct` and its
1445+ ` snapshot_generation ` gate.
1446+ - `test(parquet): update selectivity tests for scatter-aware
1447+ bytes API` — the pre-existing test breakage from
1448+ ` 97c62a684 ` . ** This one wants to go in * before* the rest as
1449+ its own PR** , since it's an unrelated bug-fix.
1450+
1451+ 2 . ** Move §7.2 from the report into the PR description.**
1452+ Reviewers want to know * why* per-conjunct pruning rates are the
1453+ right knob. The §7.2 section already explains it; lifting the
1454+ "Concrete sketch" + "What this would buy" subsections into the
1455+ PR body (with a "this is what we built" preamble) is most of the
1456+ PR description writing.
1457+
1458+ 3 . ** Three new config knobs are still un-proto'd**
1459+ (§7.3 #4 — ` filter_pushdown_min_bytes_per_sec ` ,
1460+ ` filter_collecting_byte_ratio_threshold ` ,
1461+ ` filter_confidence_z ` ). ` from_proto ` defaults are safe for
1462+ round-trip, but a reviewer will ask. Trivial follow-up after the
1463+ main PR.
1464+
1465+ 4 . ** Run the upstream CI bench harness.** The internal smoke and
1466+ full benches in §6 + §10.2 are convincing on a single machine;
1467+ the bench bot at
1468+ [ PR #11 comment] ( https://github.com/adriangb/datafusion/pull/11#issuecomment-4340741427 )
1469+ is what reviewers will look at. Trigger it once the rebase is
1470+ clean.
1471+
1472+ 5 . ** Decide on the open ` OptionalFilterPhysicalExpr ` /
1473+ ` prune_by_bloom_filters ` bits** . The legacy untagged
1474+ ` prune_by_bloom_filters ` was folded into its single test caller
1475+ (commit ` 1c416f629 ` ); confirm during review that no public
1476+ callers will miss it. Likewise the
1477+ ` partition_filters_for_test ` helper is now ` #[doc(hidden)] ` rather
1478+ than ` #[cfg(test)] ` so a ` criterion ` bench can use it — call
1479+ that out in the review.
1480+
1481+ 6 . ** Defer §7.1 (sub-row-group adaptation) and §7.3 #3 (row-group
1482+ morselization, [ #21766 ] ( https://github.com/apache/datafusion/pull/21766 ) )
1483+ to separate follow-ups.** They're the right next moves but
1484+ independent of this PR's contract. The TPC-H regression note in
1485+ §6.3 still stands; this PR does not promise to fix it.
1486+
1487+ The branch is at ` exp/r6-pruningpredicate-rates @ ca1672a61 ` . Lint
1488+ clean, tests green, benches at parity with exp3 and beating
1489+ ` main+no-pushdown ` on both pushdown-relevant workloads. Ready for
1490+ upstream PR submission once squashed.
1491+
0 commit comments