Commit 9a88f04

adriangb and claude committed
docs: TABLESAMPLE SYSTEM in SQL reference + extending-sql guide
User-facing docs explaining what TABLESAMPLE is, the SYSTEM vs BERNOULLI trade-off, the cube-root hybrid implementation against ParquetSource, the deterministic REPEATABLE(seed) behavior, and the list of forms rejected by the built-in planner with a pointer to RelationPlanner for the rest.

Also:

- Updates docs/source/library-user-guide/extending-sql.md to acknowledge that SYSTEM is built-in and reframe the existing example as adding *additional* forms (BERNOULLI, ROW, BUCKET).
- Adds a CSV-rejection case to the sqllogictest fixture: scans whose FileSource doesn't implement try_push_sample (CSV) error at planning time with the diagnostic from the SamplePushdown rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 38315c5 commit 9a88f04

3 files changed

Lines changed: 150 additions & 3 deletions

File tree

datafusion/sqllogictest/test_files/tablesample.slt

Lines changed: 21 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -83,5 +83,26 @@ SELECT count(*) FROM sample_table TABLESAMPLE SYSTEM (150);
8383
statement error TABLESAMPLE fraction must be in
8484
SELECT count(*) FROM sample_table TABLESAMPLE SYSTEM (0);
8585

86+
# Sources that don't implement `try_push_sample` (e.g. CSV) cause the
87+
# `SamplePushdown` rule to error at planning time, since there is no
88+
# generic post-scan SampleExec yet — the rule is the only thing that
89+
# currently turns a `Sample` into a runnable plan.
90+
statement ok
91+
COPY (SELECT i AS id FROM generate_series(1, 16) t(i))
92+
TO 'test_files/scratch/tablesample/sample_table.csv'
93+
STORED AS CSV;
94+
95+
statement ok
96+
CREATE EXTERNAL TABLE sample_csv
97+
STORED AS CSV
98+
LOCATION 'test_files/scratch/tablesample/sample_table.csv'
99+
OPTIONS ('format.has_header' 'true');
100+
101+
statement error TABLESAMPLE could not be pushed into the source
102+
SELECT count(*) FROM sample_csv TABLESAMPLE SYSTEM (50);
103+
104+
statement ok
105+
DROP TABLE sample_csv;
106+
86107
statement ok
87108
DROP TABLE sample_table;

docs/source/library-user-guide/extending-sql.md

Lines changed: 17 additions & 2 deletions
@@ -341,13 +341,28 @@ approach:
341341

342342
### TABLESAMPLE (Custom Logical and Physical Nodes)
343343

344-
The [table_sample.rs] example shows a complete end-to-end implementation of how to
345-
support queries such as:
344+
DataFusion ships with a built-in `RelationPlanner` for
345+
`TABLESAMPLE SYSTEM(p%)` (block-level sampling), auto-registered on
346+
any default `SessionContext` and pushed into `ParquetSource` by the
347+
`SamplePushdown` optimizer rule. See the [`TABLESAMPLE` section] of
348+
the SQL reference for a full description of the semantics and the
349+
parquet implementation.
350+
351+
The [table_sample.rs] example shows how to register a *custom*
352+
`RelationPlanner` ahead of the built-in one to add other forms
353+
(row-level `BERNOULLI`, `ROW` count limits, Hive-style `BUCKET`, etc.) that
354+
DataFusion intentionally doesn't ship in core because the semantics
355+
vary across systems. Because `register_relation_planner` prepends
356+
to the chain, the custom planner runs first; returning
357+
`RelationPlanning::Original` falls through to the built-in
358+
`SYSTEM` planner.
346359

347360
```sql
348361
SELECT * FROM table TABLESAMPLE BERNOULLI(10 PERCENT) REPEATABLE(42)
349362
```
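
The fall-through behaviour described above can be pictured as a small chain of responsibility. The sketch below is only an illustration: `Planning`, `PlannerChain`, `builtin_system`, and `custom_bernoulli` are hypothetical names invented here, not DataFusion's `RelationPlanner` API, and the clause is reduced to a plain string.

```rust
/// Outcome of asking one planner about a clause (hypothetical type,
/// loosely modelled on the `RelationPlanning::Original` fall-through).
#[derive(Debug, PartialEq)]
enum Planning {
    /// The planner handled the clause and produced a plan description.
    Planned(String),
    /// The planner declined; the next planner in the chain is tried.
    Original,
}

type Planner = fn(&str) -> Planning;

/// Hypothetical chain in which the most recent registration runs first.
struct PlannerChain {
    planners: Vec<Planner>,
}

impl PlannerChain {
    fn new() -> Self {
        Self { planners: Vec::new() }
    }

    /// Prepend, so later registrations take priority over earlier ones.
    fn register(&mut self, planner: Planner) {
        self.planners.insert(0, planner);
    }

    /// Return the first plan produced, or `None` if every planner declines.
    fn plan(&self, clause: &str) -> Option<String> {
        self.planners.iter().find_map(|p| match p(clause) {
            Planning::Planned(plan) => Some(plan),
            Planning::Original => None,
        })
    }
}

/// Stand-in for the built-in planner: only understands SYSTEM.
fn builtin_system(clause: &str) -> Planning {
    if clause.starts_with("SYSTEM") {
        Planning::Planned(format!("builtin:{clause}"))
    } else {
        Planning::Original
    }
}

/// Stand-in for a user extension: only understands BERNOULLI.
fn custom_bernoulli(clause: &str) -> Planning {
    if clause.starts_with("BERNOULLI") {
        Planning::Planned(format!("custom:{clause}"))
    } else {
        Planning::Original
    }
}
```

Registering `builtin_system` first and `custom_bernoulli` second means a `BERNOULLI` clause is claimed by the custom planner, while a `SYSTEM` clause falls through to the built-in stand-in.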
350363

364+
[`TABLESAMPLE` section]: ../user-guide/sql/select.md#tablesample-clause
365+
351366
### PIVOT/UNPIVOT (Rewrite Strategy)
352367

353368
The [pivot_unpivot.rs] example demonstrates rewriting custom syntax to standard SQL

docs/source/user-guide/sql/select.md

Lines changed: 112 additions & 1 deletion
@@ -30,7 +30,7 @@ DataFusion supports the following syntax for queries:
3030

3131
[ [WITH](#with-clause) with_query [, ...] ] <br/>
3232
[SELECT](#select-clause) [ ALL | DISTINCT ] select_expr [, ...] <br/>
33-
[ [FROM](#from-clause) from_item [, ...] ] <br/>
33+
[ [FROM](#from-clause) from_item [ [TABLESAMPLE](#tablesample-clause) ... ] [, ...] ] <br/>
3434
[ [JOIN](#join-clause) join_item [, ...] ] <br/>
3535
[ [WHERE](#where-clause) condition ] <br/>
3636
[ [GROUP BY](#group-by-clause) grouping_element [, ...] ] <br/>
@@ -76,6 +76,117 @@ Example:
7676
SELECT t.a FROM table AS t
7777
```
7878

79+
## TABLESAMPLE clause
80+
81+
`TABLESAMPLE` returns a random subset of rows from a table. It's
82+
useful for ad-hoc data exploration ("give me roughly 1% of this
83+
table"), bounded `EXPLAIN ANALYZE` runs against representative data,
84+
and any analytics workload where an approximate answer is acceptable
85+
in exchange for reading less data.
86+
87+
```sql
88+
SELECT * FROM table TABLESAMPLE SYSTEM (10); -- ~10% of the table
89+
SELECT * FROM table TABLESAMPLE SYSTEM (5) REPEATABLE (42); -- deterministic
90+
```
91+
92+
The percentage is in the range `(0, 100]`. `REPEATABLE(seed)` makes
93+
the sample deterministic — the same seed against the same data always
94+
returns the same rows.
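
As a rough illustration of the `(0, 100]` rule, the following sketch converts a percentage argument into a fraction and rejects anything outside the range. The helper name `percent_to_fraction` and its error text are assumptions made for this example; they are not taken from DataFusion.

```rust
/// Convert a `TABLESAMPLE SYSTEM (percent)` argument into a fraction in (0, 1].
/// Hypothetical helper: the name and error wording are illustrative only.
fn percent_to_fraction(percent: f64) -> Result<f64, String> {
    // NaN fails both comparisons, so it is rejected as well.
    if percent > 0.0 && percent <= 100.0 {
        Ok(percent / 100.0)
    } else {
        Err(format!("sample percentage {percent} is outside (0, 100]"))
    }
}
```

Under this rule `SYSTEM (50)` maps to `0.5` and `SYSTEM (100)` to `1.0`, while `SYSTEM (0)` and `SYSTEM (150)` are rejected.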
95+
96+
### What `SYSTEM` means
97+
98+
`SYSTEM` is **block-level** sampling: instead of evaluating a
99+
per-row coin flip, the scan keeps or drops whole blocks of rows
100+
chosen at random. This is the same behaviour PostgreSQL documents
101+
for `TABLESAMPLE SYSTEM` and what Hive calls `BLOCK` sampling
102+
(DataFusion accepts `BLOCK` as an alias for `SYSTEM`).
103+
104+
The trade-off vs. row-level sampling (`BERNOULLI`):
105+
106+
- **`SYSTEM`** is fast: the scan can skip blocks entirely, so I/O
107+
shrinks roughly in proportion to the requested fraction. Rows
108+
inside each kept block are correlated, so estimates computed from
109+
the sample are noisier than with per-row sampling.
110+
- **`BERNOULLI`** evaluates `random() < p` per row, so every row is
111+
read but only some are kept. Statistically tighter, but no I/O
112+
saving.
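
The difference between the two strategies can be sketched over plain row indices. This is a toy model, not DataFusion code: the SplitMix64 generator stands in for whatever random source a real implementation uses, and `bernoulli_sample`, `system_sample`, and `block_size` are names assumed for the example.

```rust
/// One SplitMix64 step: a tiny deterministic generator so the sketch needs no crates.
fn next_u64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Uniform draw in [0, 1) built from the top 53 bits.
fn next_unit(state: &mut u64) -> f64 {
    (next_u64(state) >> 11) as f64 / (1u64 << 53) as f64
}

/// Row-level sampling: one draw per row, so every row is visited.
fn bernoulli_sample(n_rows: usize, fraction: f64, seed: u64) -> Vec<usize> {
    let mut state = seed;
    let kept: Vec<usize> = (0..n_rows)
        .filter(|_| next_unit(&mut state) < fraction)
        .collect();
    kept
}

/// Block-level sampling: one draw per block, so rejected blocks are never visited.
fn system_sample(n_rows: usize, block_size: usize, fraction: f64, seed: u64) -> Vec<usize> {
    assert!(block_size > 0, "block_size must be positive");
    let mut state = seed;
    let mut kept = Vec::new();
    for start in (0..n_rows).step_by(block_size) {
        if next_unit(&mut state) < fraction {
            // Keep the whole block; its rows are contiguous and therefore correlated.
            kept.extend(start..(start + block_size).min(n_rows));
        }
    }
    kept
}
```

With 100 rows and a block size of 10, `system_sample` makes 10 draws and returns rows in runs of 10, whereas `bernoulli_sample` makes 100 draws and returns scattered rows.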
113+
114+
DataFusion only ships `SYSTEM` out of the box. To add `BERNOULLI` or
115+
other forms, register a [`RelationPlanner`] extension; see the
116+
[extending SQL] guide and the [`relation_planner` example].
117+
118+
[`RelationPlanner`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/planner/trait.RelationPlanner.html
119+
[extending SQL]: ../../library-user-guide/extending-sql.md
120+
[`relation_planner` example]: https://github.com/apache/datafusion/tree/main/datafusion-examples/examples/relation_planner
121+
122+
### How `SYSTEM` is implemented for Parquet
123+
124+
For a `ParquetSource`, `TABLESAMPLE SYSTEM(p%)` is pushed all the way
125+
into the scan rather than evaluated as a post-scan filter. The plan
126+
contains no `SampleExec` — instead, `ParquetSource` itself drops
127+
files, row groups, and row-clusters in proportion to `p`.
128+
129+
The selection uses a **cube-root hybrid**: with the requested
130+
fraction `p`, the source applies `q = p^(1/3)` at three independent
131+
levels:
132+
133+
1. **File level** — keep `⌈n_files * q⌉` files (always at least one)
134+
chosen by a seeded shuffle.
135+
2. **Row-group level** — within each kept file, keep
136+
`⌈n_row_groups * q⌉` row groups, chosen by the parquet opener
137+
once it has loaded the footer.
138+
3. **Row level** — within each kept row group, keep `q` of the rows
139+
as a small number of contiguous windows, materialised as a
140+
`RowSelection` so the parquet reader can use the page index to
141+
skip data pages entirely.
142+
143+
Multiplying the three axes gives `q × q × q = p`, so the expected
144+
result size is `p × N` rows, the same as if you'd kept each
145+
row independently with probability `p`. Spreading the reduction
146+
across all three axes means the I/O saving at small fractions does not
147+
concentrate at one granularity: for `p = 0.1`, keeping only 10% of the
148+
files produces a coarser sample than keeping about 46% (`q ≈ 0.46`)
149+
at each of the three levels.
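
The arithmetic above can be checked with a few lines of Rust. This is a minimal sketch of the formulas as stated in the text; `per_level_fraction` and `units_to_keep` are names chosen for the example and do not refer to functions in DataFusion.

```rust
/// Per-level fraction `q` such that `q * q * q` equals `p` up to rounding.
fn per_level_fraction(p: f64) -> f64 {
    p.cbrt()
}

/// Number of files (or row groups) to keep: ceil(n * q), at least one
/// when the input is non-empty and never more than `n`.
fn units_to_keep(n: usize, q: f64) -> usize {
    if n == 0 {
        return 0;
    }
    let kept = (n as f64 * q).ceil() as usize;
    kept.clamp(1, n)
}
```

For `p = 0.5` this gives `q ≈ 0.7937`, so 8 of 10 files are kept; for `p = 0.01` it gives `q ≈ 0.2154`, so 22 of 100 files are kept.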
150+
151+
For a single-file, single-row-group input the file and row-group
152+
axes can't reduce — only the row axis is effective — so the actual
153+
fraction read is `q ≈ 0.79` for `p = 0.5`, not `0.5`. As file and
154+
row-group counts grow, the scan reads progressively closer to the
155+
requested `p`.
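
To see how the rounding at the file and row-group levels affects the fraction actually read, the sketch below multiplies the three per-level ratios. It assumes every file has the same number of row groups and that both counts follow the `⌈n * q⌉` rule from the list above; `expected_fraction_read` is a hypothetical helper, not part of DataFusion.

```rust
/// Approximate fraction of rows read for a uniform layout.
/// Assumes `n_files > 0` and `n_row_groups > 0` (row groups per file).
fn expected_fraction_read(n_files: usize, n_row_groups: usize, p: f64) -> f64 {
    let q = p.cbrt();
    // Share of units kept on one axis after rounding up to a whole unit.
    let axis = |n: usize| ((n as f64 * q).ceil().max(1.0)) / n as f64;
    axis(n_files) * axis(n_row_groups) * q
}
```

With one file and one row group the result for `p = 0.5` is `q ≈ 0.79`; with 1000 files of 1000 row groups each it is within 0.01 of `0.5`.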
156+
157+
`REPEATABLE(seed)` mixes the seed into every random draw, so all
158+
three levels produce the same selection across runs. The selection
159+
also depends on the file name, the row-group index within the file,
160+
and the cluster size, so different files don't accidentally see
161+
correlated samples.
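
One way to picture this mixing is a hash that folds the user seed, the file name, and the row-group index into a single derived seed. The sketch below uses FNV-1a purely as an illustration; the actual mixing function used by the source is not described here, the cluster size is left out for brevity, and `row_group_seed` is a hypothetical name.

```rust
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Fold `bytes` into a running 64-bit FNV-1a hash.
fn fnv1a(hash: u64, bytes: &[u8]) -> u64 {
    bytes
        .iter()
        .fold(hash, |h, b| (h ^ u64::from(*b)).wrapping_mul(FNV_PRIME))
}

/// Derive a deterministic per-row-group seed from the user seed,
/// the file name, and the row-group index within that file.
fn row_group_seed(seed: u64, file_name: &str, row_group: u64) -> u64 {
    let h = fnv1a(FNV_OFFSET, &seed.to_le_bytes());
    let h = fnv1a(h, file_name.as_bytes());
    fnv1a(h, &row_group.to_le_bytes())
}
```

The same `(seed, file, row group)` triple always yields the same derived seed, so repeated runs make the same choices, while changing any one component changes the result.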
162+
163+
The sampling is visible in `EXPLAIN`:
164+
165+
```text
166+
DataSourceExec: file_groups={...}, projection=[...],
167+
file_type=parquet,
168+
sample_row_group_fraction=0.7937, sample_row_fraction=0.7937
169+
```
170+
171+
There is no `SampleExec` in this plan — the `SamplePushdown`
172+
optimizer rule absorbed the sample into the source. If pushdown is
173+
not possible (for example, against a non-Parquet source that does
174+
not implement `try_push_sample`), the rule errors at planning time.
175+
176+
### Limitations
177+
178+
The built-in planner accepts only `TABLESAMPLE SYSTEM(p%)` with an
179+
optional `REPEATABLE(seed)`. The following forms error at planning
180+
time:
181+
182+
- `TABLESAMPLE BERNOULLI(...)` — register a custom `RelationPlanner`.
183+
- `TABLESAMPLE (N ROWS)` — use `LIMIT N` instead, or a custom planner.
184+
- `TABLESAMPLE BUCKET m OUT OF n` — Hive bucket sampling is not
185+
supported.
186+
- `TABLESAMPLE ... OFFSET ...` — ClickHouse-style offset sampling is
187+
not supported.
188+
- Fractions outside `(0, 100]`.
189+
79190
## WHERE clause
80191

81192
Example:
