Commit 0b26d70
feat: SamplePushdown rule + Sample logical/physical nodes for parquet
Adds the infrastructure for pushing TABLESAMPLE-shaped sampling into
file sources, with parquet as the first absorbing source. There is
no SQL surface yet; this commit only ships the primitives. Wiring a
RelationPlanner / ExtensionPlanner so it works out of the box from
SQL is a follow-up.
- `Sample` `UserDefinedLogicalNodeCore` extension node in
`datafusion-expr` (`logical_plan/sample.rs`). Schema-preserving;
validates `fraction ∈ (0, 1]`. Currently encodes
`SampleMethod::System` only.
- `SampleExec` placeholder in `datafusion-physical-plan`. Errors at
`execute` (it's a marker — the `SamplePushdown` rule is expected
to remove it). Implements filter / sort pushdown passthrough so
unrelated optimizer rules see straight through it.
- New `try_push_sample` method on `ExecutionPlan` and `FileSource`,
returning `Absorbed { inner }` / `Passthrough` / `Unsupported
{ reason }`. Default is `Unsupported`; per-node `Passthrough`
overrides on filter, projection, coalesce_batches,
coalesce_partitions, repartition, and non-fetch sort.
- `ParquetSource::try_push_sample` runs the (intentionally private)
hierarchical block-level reduction across files / row groups /
rows, with adaptive collapse when an axis can't reduce. Coordinates
with the opener via a `pub(crate)` `system_target_remaining` field
on `ParquetSampling`. Single-file, single-row-group inputs hit
~p × N rows instead of overshooting at p^(1/3) × N.
- `SamplePushdown` optimizer rule (between `PushdownSort` and
`EnsureCooperative`) walks top-down. On `Absorbed` it replaces
`SampleExec` with the rebuilt source; on `Passthrough` it pushes
through the single-child node and recurses; on `Unsupported` it
errors at planning time with `"TABLESAMPLE is not supported for
this source"`. There is intentionally no generic post-scan
`SampleExec` yet.
- EXPLAIN visibility: `ParquetSource::fmt_extra` surfaces
`sample_system_target_remaining` when set.
- `optimizer_rule_reference.md` updated to list `SamplePushdown` in
the documented rule order.
- `explain.slt` updated with `physical_plan after SamplePushdown SAME
TEXT AS ABOVE` lines under each verbose-explain test.
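
The three-outcome contract and the top-down walk described above can be sketched roughly as follows. This is a toy model, not the DataFusion code: `Plan`, `SamplePushdownResult`, `push_down`, and `sample_pushdown` are hypothetical stand-ins, and the real `try_push_sample` is a method on `ExecutionPlan` / `FileSource` whose exact signature is not shown in this message.

```rust
// Hypothetical, simplified stand-ins for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// Marker node carrying the requested sampling fraction.
    Sample { fraction: f64, input: Box<Plan> },
    /// Single-child node that sampling may cross.
    Filter { input: Box<Plan> },
    /// Source that can absorb sampling.
    ParquetScan { sample: Option<f64> },
    /// Source that cannot absorb sampling.
    CsvScan,
}

/// Mirrors the three outcomes named in the commit message.
#[derive(Debug, PartialEq)]
enum SamplePushdownResult {
    Absorbed { inner: Plan },
    Passthrough,
    Unsupported { reason: String },
}

/// Ask a node whether it can absorb, forward, or must reject sampling.
fn try_push_sample(node: &Plan, fraction: f64) -> SamplePushdownResult {
    match node {
        Plan::ParquetScan { .. } => SamplePushdownResult::Absorbed {
            inner: Plan::ParquetScan { sample: Some(fraction) },
        },
        Plan::Filter { .. } => SamplePushdownResult::Passthrough,
        // Default outcome for everything else.
        _ => SamplePushdownResult::Unsupported {
            reason: "TABLESAMPLE is not supported for this source".to_string(),
        },
    }
}

/// Push `fraction` into `node`, returning the rewritten subtree.
fn push_down(node: Plan, fraction: f64) -> Result<Plan, String> {
    match try_push_sample(&node, fraction) {
        // The source rebuilt itself with sampling applied.
        SamplePushdownResult::Absorbed { inner } => Ok(inner),
        // Keep the node and recurse into its single child.
        SamplePushdownResult::Passthrough => match node {
            Plan::Filter { input } => Ok(Plan::Filter {
                input: Box::new(push_down(*input, fraction)?),
            }),
            other => Err(format!("passthrough node without a single child: {other:?}")),
        },
        // Surface the failure at planning time.
        SamplePushdownResult::Unsupported { reason } => Err(reason),
    }
}

/// Top-down walk: remove each `Sample` marker by pushing it toward a source.
fn sample_pushdown(plan: Plan) -> Result<Plan, String> {
    match plan {
        Plan::Sample { fraction, input } => push_down(*input, fraction),
        Plan::Filter { input } => Ok(Plan::Filter {
            input: Box::new(sample_pushdown(*input)?),
        }),
        leaf => Ok(leaf),
    }
}
```

In this sketch, a `Sample` over `Filter` over `ParquetScan` is rewritten to `Filter` over a sampled `ParquetScan`, while a `Sample` over `CsvScan` returns the planning error. That matches the message's statement that there is no generic post-scan fallback yet.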
Tests: 7 unit tests on `ParquetSource::try_push_sample` covering the
pushdown contract (full / single-file / multi-file / target clamping
/ REPEATABLE determinism / multi-file rounding compensation), and 2
opener end-to-end tests covering the adaptive split for single vs
multi row group inputs.
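
The arithmetic behind the adaptive split can be illustrated with a small sketch. It rests on an assumption inferred from the p^(1/3) figure above: that the fraction is split evenly (geometrically) across the file, row-group, and row axes, and that an axis with a single unit cannot reduce. The actual reduction is intentionally private and may differ, and every name below is hypothetical.

```rust
/// Number of axes that can actually reduce. Rows are always counted;
/// files and row groups count only when there is more than one of them.
/// (Assumption for this sketch, not taken from the implementation.)
fn reducible_axes(files: usize, row_groups_per_file: usize) -> i32 {
    1 + i32::from(files > 1) + i32::from(row_groups_per_file > 1)
}

/// Fixed three-way split: every axis is given p^(1/3), but axes that
/// cannot reduce keep everything, so the overall kept fraction drifts
/// above p.
fn kept_fixed_split(p: f64, files: usize, row_groups_per_file: usize) -> f64 {
    let per_axis = p.powf(1.0 / 3.0);
    per_axis.powi(reducible_axes(files, row_groups_per_file))
}

/// Adaptive split: p is divided only among the k axes that can reduce,
/// so the product over those axes comes back to p.
fn kept_adaptive_split(p: f64, files: usize, row_groups_per_file: usize) -> f64 {
    let k = reducible_axes(files, row_groups_per_file);
    let per_axis = p.powf(1.0 / f64::from(k));
    per_axis.powi(k)
}
```

Under these assumptions, with p = 0.001 and a single file holding a single row group, the fixed split keeps about 10% of the rows while the adaptive split keeps about 0.1%, which is the requested fraction.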
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 3d0dc4a commit 0b26d70
23 files changed
Lines changed: 1810 additions & 212 deletions
File tree
- datafusion
  - core/src
  - datasource-parquet/src
  - datasource/src
  - expr/src/logical_plan
  - physical-optimizer/src
  - physical-plan/src
    - repartition
    - sorts
  - sqllogictest/test_files