Commit 3d0dc4a
feat(parquet): row-group and row-range sampling on ParquetSource
Adds two opt-in sampling primitives to parquet scans, both built on
the existing `ParquetAccessPlan` infrastructure:
* `ParquetSource::with_row_group_sampling(fraction)` — keep `fraction`
of row groups in each scanned file. Selection is deferred until the
opener has loaded the parquet footer (so sampling uses real row-group
indexes rather than guesses) and is deterministic per `(file_name,
row_group_count, fraction)` via a seeded `SmallRng`.
* `ParquetSource::with_row_fraction(fraction)` — within each kept row
group, keep `fraction` of rows by translating to a `RowSelection` of
K small contiguous windows (size controlled by
`with_row_cluster_size`, default 32 768 rows). The parquet reader
uses the page index to read only the data pages covering the
selected rows, so this gives "page-level" IO savings without
requiring per-column page alignment. Falls back gracefully (no
IO win, still correct) when the page index is missing.
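The deterministic row-group selection can be sketched in self-contained Rust. The real code seeds a `SmallRng` and applies the result through `ParquetAccessPlan`; here `DefaultHasher` stands in as the keyed pseudo-random source, and `sample_row_groups` is a hypothetical helper name, not the actual `apply_row_group_sampling` signature:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Pick which row-group indexes to keep for one file. Deterministic per
/// (file_name, row_group_count, fraction), mirroring the commit's contract.
fn sample_row_groups(file_name: &str, row_group_count: usize, fraction: f64) -> Vec<usize> {
    if fraction >= 1.0 {
        // fraction = 1.0 is a no-op: keep every row group.
        return (0..row_group_count).collect();
    }
    // Keep at least one row group so the scan never goes empty.
    let target = ((row_group_count as f64 * fraction).round() as usize).max(1);
    // Rank every row group by a hash keyed on (file_name, row_group_count,
    // fraction, index); keep the `target` lowest-ranked indexes.
    let mut ranked: Vec<(u64, usize)> = (0..row_group_count)
        .map(|idx| {
            let mut h = DefaultHasher::new();
            file_name.hash(&mut h);
            row_group_count.hash(&mut h);
            fraction.to_bits().hash(&mut h);
            idx.hash(&mut h);
            (h.finish(), idx)
        })
        .collect();
    ranked.sort_unstable();
    let mut kept: Vec<usize> = ranked.into_iter().take(target).map(|(_, i)| i).collect();
    kept.sort_unstable();
    kept
}

fn main() {
    // Same inputs yield the same selection; 10% of 20 row groups keeps 2.
    let a = sample_row_groups("part-0.parquet", 20, 0.1);
    assert_eq!(a, sample_row_groups("part-0.parquet", 20, 0.1));
    assert_eq!(a.len(), 2);
    // Floor of 1, and fraction = 1.0 keeps everything.
    assert_eq!(sample_row_groups("part-0.parquet", 3, 0.01).len(), 1);
    assert_eq!(sample_row_groups("part-0.parquet", 5, 1.0).len(), 5);
    println!("kept: {a:?}");
}
```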
The two layers compose: scanning with both `row_group_fraction=0.1`
and `row_fraction=0.1` reads ~1% of the rows in ~10% of the row
groups, with windows spread out so the sample isn't clustered at one
end of each row group.
Selection within a row group is deterministic-but-random per
`(file_name, row_group_index, fraction, cluster_size)` — same inputs
yield the same windows, so re-runs are repeatable.
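The fraction-to-windows translation can be sketched as below. One simplification to note: the real selection is seeded-random per `(file_name, row_group_index, fraction, cluster_size)`, while this stand-in (`row_windows` is a hypothetical name) places windows at evenly spaced offsets, which is enough to show the shape of the `RowSelection` it produces:

```rust
/// Turn a row fraction into K contiguous `(start, end)` windows inside one
/// row group, each holding at most `cluster_size` rows, spread across the
/// group so the sample isn't clustered at one end.
fn row_windows(num_rows: usize, fraction: f64, cluster_size: usize) -> Vec<(usize, usize)> {
    // Round up and keep at least one row (assumes num_rows > 0).
    let keep = ((num_rows as f64 * fraction).ceil() as usize).clamp(1, num_rows);
    let k = (keep + cluster_size - 1) / cluster_size; // number of windows
    (0..k)
        .map(|i| {
            // The i-th of k equal segments of the row group...
            let seg_start = i * num_rows / k;
            let seg_end = (i + 1) * num_rows / k;
            // ...takes its share of the `keep` rows, clamped to its segment.
            let want = (i + 1) * keep / k - i * keep / k;
            (seg_start, seg_start + want.min(seg_end - seg_start))
        })
        .collect()
}

fn main() {
    // 10% of a 100k-row group with 5k-row clusters: two 5k windows,
    // one at each end of the group.
    let w = row_windows(100_000, 0.1, 5_000);
    assert_eq!(w, vec![(0, 5_000), (50_000, 55_000)]);
    assert_eq!(w.iter().map(|(s, e)| e - s).sum::<usize>(), 10_000);
    println!("windows: {w:?}");
}
```

With the page index present, the reader only fetches the data pages overlapping these windows; without it, the same `RowSelection` is still applied after decode, which is the "still correct, no IO win" fallback.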
## Why this lives on `ParquetSource`
The natural entry point for "I want a sample" is at config time,
before any metadata IO. The actual *which* row groups / *which* rows
selection still has to be deferred to the opener (after the footer is
parsed) — that's why `ParquetSampling` carries fractions plus a cluster
size, and the opener pulls them through to its lazy decision points.
This is intentionally orthogonal to file-level sampling: `ParquetSource`
doesn't own the file list (`FileScanConfig.file_groups` does), so a
file-fraction setter here would have been a confusing no-op. Callers
that want to drop files should rebuild the `FileScanConfig` directly.
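As a shape reference, the config-time surface might be exercised as below. The `ParquetSampling` name, the setter names, and the 32 768-row default come from this commit; the struct's fields and everything else are a guessed, self-contained stand-in, not DataFusion's actual types:

```rust
/// Stand-in for the sampling config carried from `ParquetSource` to the
/// opener (field layout is assumed from the commit description).
#[derive(Clone, Debug, PartialEq)]
struct ParquetSampling {
    row_group_fraction: Option<f64>,
    row_fraction: Option<f64>,
    row_cluster_size: usize,
}

impl Default for ParquetSampling {
    fn default() -> Self {
        // No sampling by default; 32_768 matches the default cluster size.
        Self { row_group_fraction: None, row_fraction: None, row_cluster_size: 32_768 }
    }
}

impl ParquetSampling {
    // Builder-style setters mirroring the names on `ParquetSource`.
    fn with_row_group_sampling(mut self, fraction: f64) -> Self {
        self.row_group_fraction = Some(fraction);
        self
    }
    fn with_row_fraction(mut self, fraction: f64) -> Self {
        self.row_fraction = Some(fraction);
        self
    }
    fn with_row_cluster_size(mut self, rows: usize) -> Self {
        self.row_cluster_size = rows;
        self
    }
}

fn main() {
    // Configured before any metadata IO; the opener applies it lazily
    // per file once the footer (and page index) are available.
    let cfg = ParquetSampling::default()
        .with_row_group_sampling(0.1)
        .with_row_fraction(0.1)
        .with_row_cluster_size(8_192);
    assert_eq!(cfg.row_group_fraction, Some(0.1));
    assert_eq!(cfg.row_cluster_size, 8_192);
    println!("{cfg:?}");
}
```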
## Use cases
* `TABLESAMPLE` SQL syntax (any future implementation can lower to
these primitives).
* Ad-hoc data exploration / `EXPLAIN ANALYZE` against a sample.
* Mini-query-style stats sampling (a layered helper can call these
to bound the cost of computing approximate min/max/NDV/histograms
for the optimizer — out of scope here, see the linked POC in the
PR description).
* `EXPLAIN ANALYZE`-driven debug runs against a representative slice.
## Tests
5 unit tests on `apply_row_group_sampling` (target count, determinism,
file-name dependence, no-op at fraction=1.0, target floor of 1), plus
2 end-to-end tests that build a real parquet file in an `InMemory` object
store and confirm the emitted row counts match what the sampling implies.
`cargo build --workspace`, `cargo fmt --all`, and
`cargo clippy -p datafusion-datasource-parquet --all-targets -- -D warnings`
are clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5 files changed: 448 additions, 2 deletions