Commit 2026412
feat(parquet): row-group and row-range sampling on ParquetSource
Adds two opt-in sampling primitives to parquet scans, both built on
the existing `ParquetAccessPlan` infrastructure:
* `ParquetSource::with_row_group_sampling(fraction)` — keep `fraction`
of row groups in each scanned file. Selection is deferred until the
opener has loaded the parquet footer (so we sample by real row-group
index, not guess) and is deterministic per `(file_name,
row_group_count, fraction)` via a seeded `SmallRng`.
* `ParquetSource::with_row_fraction(fraction)` — within each kept row
group, keep `fraction` of rows by translating to a `RowSelection` of
K small contiguous windows (size controlled by
`with_row_cluster_size`, default 32 768 rows). The parquet reader
uses the page index to read only the data pages covering the
selected rows, so this gives "page-level" IO savings without
requiring per-column page alignment. Falls back gracefully (no
IO win, still correct) when the page index is missing.
The two layers compose: scanning with both `row_group_fraction=0.1`
and `row_fraction=0.1` reads ~1% of the rows in ~10% of the row
groups, with windows spread out so the sample isn't clustered at one
end of each row group.
Selection within a row group is deterministic-but-random per
`(file_name, row_group_index, fraction, cluster_size)` — same inputs
yield the same windows, so re-runs are repeatable.
## Why this lives on `ParquetSource`
The natural entry-point for "I want a sample" is at config time,
before any metadata IO. The actual *which* row groups / *which* rows
selection still has to be deferred to the opener (after the footer is
parsed) — that's why `ParquetSampling` carries fractions plus a cluster
size, and the opener pulls them through to its lazy decision points.
This is intentionally orthogonal to file-level sampling: `ParquetSource`
doesn't own the file list (`FileScanConfig.file_groups` does), so a
file-fraction setter here would have been a confusing no-op. Callers
that want to drop files should rebuild the `FileScanConfig` directly.
## Use cases
* `TABLESAMPLE` SQL syntax (any future implementation can lower to
these primitives).
* Ad-hoc data exploration / `EXPLAIN ANALYZE` against a sample.
* Mini-query-style stats sampling (a layered helper can call these
to bound the cost of computing approximate min/max/NDV/histograms
for the optimizer — out of scope here, see the linked POC in the
PR description).
* `EXPLAIN ANALYZE`-driven debug runs against a representative slice.
## Tests
5 unit tests on `apply_row_group_sampling` (target count, determinism,
file-name dependence, no-op at fraction=1.0, target floor of 1) plus
2 end-to-end tests that build a real parquet file in `InMemory` object
store and confirm the row counts emitted are what the sampling implies.
`cargo build --workspace`, `cargo fmt --all`, and
`cargo clippy -p datafusion-datasource-parquet --all-targets -- -D warnings`
are clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>1 parent 9a29e33 commit 2026412
6 files changed
Lines changed: 502 additions & 2 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
53 | 53 | | |
54 | 54 | | |
55 | 55 | | |
| 56 | + | |
56 | 57 | | |
57 | 58 | | |
58 | 59 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
33 | 33 | | |
34 | 34 | | |
35 | 35 | | |
| 36 | + | |
36 | 37 | | |
37 | 38 | | |
38 | 39 | | |
| |||
46 | 47 | | |
47 | 48 | | |
48 | 49 | | |
| 50 | + | |
49 | 51 | | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
136 | 136 | | |
137 | 137 | | |
138 | 138 | | |
| 139 | + | |
| 140 | + | |
| 141 | + | |
139 | 142 | | |
140 | 143 | | |
141 | 144 | | |
| |||
287 | 290 | | |
288 | 291 | | |
289 | 292 | | |
| 293 | + | |
290 | 294 | | |
291 | 295 | | |
292 | 296 | | |
| |||
656 | 660 | | |
657 | 661 | | |
658 | 662 | | |
| 663 | + | |
659 | 664 | | |
660 | 665 | | |
661 | 666 | | |
| |||
882 | 887 | | |
883 | 888 | | |
884 | 889 | | |
885 | | - | |
| 890 | + | |
886 | 891 | | |
887 | 892 | | |
888 | 893 | | |
889 | | - | |
| 894 | + | |
| 895 | + | |
| 896 | + | |
| 897 | + | |
| 898 | + | |
| 899 | + | |
| 900 | + | |
| 901 | + | |
| 902 | + | |
| 903 | + | |
| 904 | + | |
| 905 | + | |
| 906 | + | |
| 907 | + | |
| 908 | + | |
| 909 | + | |
| 910 | + | |
| 911 | + | |
| 912 | + | |
890 | 913 | | |
891 | 914 | | |
892 | 915 | | |
| |||
1676 | 1699 | | |
1677 | 1700 | | |
1678 | 1701 | | |
| 1702 | + | |
1679 | 1703 | | |
1680 | 1704 | | |
1681 | 1705 | | |
| |||
1702 | 1726 | | |
1703 | 1727 | | |
1704 | 1728 | | |
| 1729 | + | |
1705 | 1730 | | |
1706 | 1731 | | |
1707 | 1732 | | |
| 1733 | + | |
| 1734 | + | |
| 1735 | + | |
| 1736 | + | |
| 1737 | + | |
| 1738 | + | |
1708 | 1739 | | |
1709 | 1740 | | |
1710 | 1741 | | |
| |||
1816 | 1847 | | |
1817 | 1848 | | |
1818 | 1849 | | |
| 1850 | + | |
1819 | 1851 | | |
1820 | 1852 | | |
1821 | 1853 | | |
| |||
2720 | 2752 | | |
2721 | 2753 | | |
2722 | 2754 | | |
| 2755 | + | |
| 2756 | + | |
| 2757 | + | |
| 2758 | + | |
| 2759 | + | |
| 2760 | + | |
| 2761 | + | |
| 2762 | + | |
| 2763 | + | |
| 2764 | + | |
| 2765 | + | |
| 2766 | + | |
| 2767 | + | |
| 2768 | + | |
| 2769 | + | |
| 2770 | + | |
| 2771 | + | |
| 2772 | + | |
| 2773 | + | |
| 2774 | + | |
| 2775 | + | |
| 2776 | + | |
| 2777 | + | |
| 2778 | + | |
| 2779 | + | |
| 2780 | + | |
| 2781 | + | |
| 2782 | + | |
| 2783 | + | |
| 2784 | + | |
| 2785 | + | |
| 2786 | + | |
| 2787 | + | |
| 2788 | + | |
| 2789 | + | |
| 2790 | + | |
| 2791 | + | |
| 2792 | + | |
| 2793 | + | |
| 2794 | + | |
| 2795 | + | |
| 2796 | + | |
| 2797 | + | |
| 2798 | + | |
| 2799 | + | |
| 2800 | + | |
| 2801 | + | |
| 2802 | + | |
| 2803 | + | |
| 2804 | + | |
| 2805 | + | |
| 2806 | + | |
| 2807 | + | |
| 2808 | + | |
| 2809 | + | |
| 2810 | + | |
| 2811 | + | |
| 2812 | + | |
| 2813 | + | |
| 2814 | + | |
| 2815 | + | |
| 2816 | + | |
| 2817 | + | |
| 2818 | + | |
| 2819 | + | |
| 2820 | + | |
| 2821 | + | |
| 2822 | + | |
| 2823 | + | |
| 2824 | + | |
| 2825 | + | |
| 2826 | + | |
| 2827 | + | |
| 2828 | + | |
| 2829 | + | |
| 2830 | + | |
| 2831 | + | |
| 2832 | + | |
| 2833 | + | |
| 2834 | + | |
| 2835 | + | |
| 2836 | + | |
| 2837 | + | |
| 2838 | + | |
| 2839 | + | |
| 2840 | + | |
| 2841 | + | |
| 2842 | + | |
| 2843 | + | |
| 2844 | + | |
| 2845 | + | |
| 2846 | + | |
| 2847 | + | |
| 2848 | + | |
| 2849 | + | |
| 2850 | + | |
| 2851 | + | |
| 2852 | + | |
| 2853 | + | |
| 2854 | + | |
| 2855 | + | |
| 2856 | + | |
| 2857 | + | |
| 2858 | + | |
| 2859 | + | |
| 2860 | + | |
| 2861 | + | |
| 2862 | + | |
2723 | 2863 | | |
0 commit comments