docs: TABLESAMPLE SYSTEM in SQL reference + extending-sql guide

adriangb · claude · adriangb · commit 9a88f04ab3c3 · 2026-05-03T19:11:54.000-05:00
User-facing docs explaining what TABLESAMPLE is, the SYSTEM vs
BERNOULLI trade-off, the cube-root hybrid implementation against
ParquetSource, the deterministic REPEATABLE(seed) behavior, and the
list of forms rejected by the built-in planner with a pointer to
RelationPlanner for the rest.
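
For reviewers, the arithmetic behind the cube-root hybrid described in
select.md can be sketched in a few lines of Rust. This is an
illustrative sketch only, not the code in ParquetSource: the function
names `per_level_fraction`, `units_kept`, and `effective_fraction` are
hypothetical, `p` is assumed to be a fraction in (0, 1] (the SQL
percentage divided by 100), and every file is assumed to hold the same
number of equally sized row groups.

```rust
/// Per-level keep fraction q = p^(1/3) for an overall fraction `p` in (0, 1].
fn per_level_fraction(p: f64) -> f64 {
    p.cbrt()
}

/// Units (files or row groups) kept out of `n`: ceil(n * q), at least one.
/// The final `min(n)` is an added guard so that an empty input stays empty.
fn units_kept(n: usize, q: f64) -> usize {
    ((n as f64 * q).ceil() as usize).max(1).min(n)
}

/// Approximate fraction of rows read for a uniform layout.
/// Assumes `n_files` and `n_row_groups` are both non-zero.
fn effective_fraction(n_files: usize, n_row_groups: usize, p: f64) -> f64 {
    let q = per_level_fraction(p);
    let file_part = units_kept(n_files, q) as f64 / n_files as f64;
    let row_group_part = units_kept(n_row_groups, q) as f64 / n_row_groups as f64;
    // The row level keeps a fraction q of the rows in each kept row group.
    file_part * row_group_part * q
}
```

With `p = 0.5`, `per_level_fraction` gives about 0.7937, which matches
the `sample_row_fraction` in the EXPLAIN output shown in the docs.
Under the same assumptions, `effective_fraction(1, 1, 0.5)` stays at
about 0.79, while `effective_fraction(100, 100, 0.5)` comes out near
0.51, reflecting how the scan approaches the requested fraction as
file and row-group counts grow.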

Also:

- Updates docs/source/library-user-guide/extending-sql.md to
  acknowledge that SYSTEM is built-in and reframe the existing
  example as adding *additional* forms (BERNOULLI, ROW, BUCKET).
- Adds a CSV-rejection case to the sqllogictest fixture: a scan whose
  FileSource doesn't implement try_push_sample (e.g. CSV) fails at
  planning time with the diagnostic from the SamplePushdown rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/datafusion/sqllogictest/test_files/tablesample.slt b/datafusion/sqllogictest/test_files/tablesample.slt
@@ -83,5 +83,26 @@ SELECT count(*) FROM sample_table TABLESAMPLE SYSTEM (150);
 statement error TABLESAMPLE fraction must be in
 SELECT count(*) FROM sample_table TABLESAMPLE SYSTEM (0);
 
+# Sources that don't implement `try_push_sample` (e.g. CSV) cause the
+# `SamplePushdown` rule to error at planning time, since there is no
+# generic post-scan SampleExec yet — the rule is the only thing that
+# currently turns a `Sample` into a runnable plan.
+statement ok
+COPY (SELECT i AS id FROM generate_series(1, 16) t(i))
+TO 'test_files/scratch/tablesample/sample_table.csv'
+STORED AS CSV;
+
+statement ok
+CREATE EXTERNAL TABLE sample_csv
+STORED AS CSV
+LOCATION 'test_files/scratch/tablesample/sample_table.csv'
+OPTIONS ('format.has_header' 'true');
+
+statement error TABLESAMPLE could not be pushed into the source
+SELECT count(*) FROM sample_csv TABLESAMPLE SYSTEM (50);
+
+statement ok
+DROP TABLE sample_csv;
+
 statement ok
 DROP TABLE sample_table;
diff --git a/docs/source/library-user-guide/extending-sql.md b/docs/source/library-user-guide/extending-sql.md
@@ -341,13 +341,28 @@ approach:
 
 ### TABLESAMPLE (Custom Logical and Physical Nodes)
 
-The [table_sample.rs] example shows a complete end-to-end implementation of how to
-support queries such as:
+DataFusion ships with a built-in `RelationPlanner` for
+`TABLESAMPLE SYSTEM(p%)` (block-level sampling), auto-registered on
+any default `SessionContext` and pushed into `ParquetSource` by the
+`SamplePushdown` optimizer rule. See the [`TABLESAMPLE` section] of
+the SQL reference for a full description of the semantics and the
+parquet implementation.
+
+The [table_sample.rs] example shows how to register a *custom*
+`RelationPlanner` ahead of the built-in one to add other forms (such as
+row-level `BERNOULLI`, `ROW` count limits, or Hive-style `BUCKET`) that
+DataFusion intentionally doesn't ship in core because the semantics
+vary across systems. Because `register_relation_planner` prepends
+to the chain, the custom planner runs first; returning
+`RelationPlanning::Original` falls through to the built-in
+`SYSTEM` planner.
 
 ```sql
 SELECT * FROM table TABLESAMPLE BERNOULLI(10 PERCENT) REPEATABLE(42)
 ```
 
+[`TABLESAMPLE` section]: ../user-guide/sql/select.md#tablesample-clause
+
 ### PIVOT/UNPIVOT (Rewrite Strategy)
 
 The [pivot_unpivot.rs] example demonstrates rewriting custom syntax to standard SQL
diff --git a/docs/source/user-guide/sql/select.md b/docs/source/user-guide/sql/select.md
@@ -30,7 +30,7 @@ DataFusion supports the following syntax for queries:
 
 [ [WITH](#with-clause) with_query [, ...] ] <br/>
 [SELECT](#select-clause) [ ALL | DISTINCT ] select_expr [, ...] <br/>
-[ [FROM](#from-clause) from_item [, ...] ] <br/>
+[ [FROM](#from-clause) from_item [ [TABLESAMPLE](#tablesample-clause) ... ] [, ...] ] <br/>
 [ [JOIN](#join-clause) join_item [, ...] ] <br/>
 [ [WHERE](#where-clause) condition ] <br/>
 [ [GROUP BY](#group-by-clause) grouping_element [, ...] ] <br/>
@@ -76,6 +76,117 @@ Example:
 SELECT t.a FROM table AS t
 ```
 
+## TABLESAMPLE clause
+
+`TABLESAMPLE` returns a random subset of rows from a table. It's
+useful for ad-hoc data exploration ("give me roughly 1% of this
+table"), bounded `EXPLAIN ANALYZE` runs against representative data,
+and any analytics workload where an approximate answer is acceptable
+in exchange for reading less data.
+
+```sql
+SELECT * FROM table TABLESAMPLE SYSTEM (10);             -- ~10% of the table
+SELECT * FROM table TABLESAMPLE SYSTEM (5) REPEATABLE (42);  -- deterministic
+```
+
+The percentage must be in the range `(0, 100]`. `REPEATABLE(seed)` makes
+the sample deterministic — the same seed against the same data always
+returns the same rows.
+
+### What `SYSTEM` means
+
+`SYSTEM` is **block-level** sampling: instead of evaluating a
+per-row coin flip, the scan keeps or drops whole blocks of rows
+chosen at random. This is the same behaviour PostgreSQL documents
+for `TABLESAMPLE SYSTEM` and what Hive calls `BLOCK` sampling
+(DataFusion accepts `BLOCK` as an alias for `SYSTEM`).
+
+The trade-off vs. row-level sampling (`BERNOULLI`):
+
+- **`SYSTEM`** is fast: the scan can skip blocks entirely, so I/O
+  shrinks roughly in proportion to the requested fraction. Rows
+  inside each kept block are correlated, so the sample is
+  statistically noisier than per-row sampling.
+- **`BERNOULLI`** evaluates `random() < p` per row, so every row is
+  read but only some are kept. Statistically tighter, but no I/O
+  saving.
+
+DataFusion only ships `SYSTEM` out of the box. To add `BERNOULLI` or
+other forms, register a [`RelationPlanner`] extension; see the
+[extending SQL] guide and the [`relation_planner` example].
+
+[`RelationPlanner`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/planner/trait.RelationPlanner.html
+[extending SQL]: ../../library-user-guide/extending-sql.md
+[`relation_planner` example]: https://github.com/apache/datafusion/tree/main/datafusion-examples/examples/relation_planner
+
+### How `SYSTEM` is implemented for Parquet
+
+For a `ParquetSource`, `TABLESAMPLE SYSTEM(p%)` is pushed all the way
+into the scan rather than evaluated as a post-scan filter. The plan
+contains no `SampleExec` — instead, `ParquetSource` itself drops
+files, row groups, and row-clusters in proportion to `p`.
+
+The selection uses a **cube-root hybrid**: with the requested
+fraction `p` (the SQL percentage divided by 100), the source applies
+`q = p^(1/3)` at three independent levels:
+
+1. **File level** — keep `⌈n_files * q⌉` files (always at least one)
+   chosen by a seeded shuffle.
+2. **Row-group level** — within each kept file, keep
+   `⌈n_row_groups * q⌉` row groups, chosen by the parquet opener
+   once it has loaded the footer.
+3. **Row level** — within each kept row group, keep `q` of the rows
+   as a small number of contiguous windows, materialised as a
+   `RowSelection` so the parquet reader can use the page index to
+   skip data pages entirely.
+
+Multiplying the three axes gives `q × q × q = p`, so the expected
+result size is `p × N` rows, the same as if you'd kept each
+row independently with probability `p`. Spreading the reduction
+across all three axes means the I/O win at small fractions does not
+concentrate at one granularity: for `p = 0.1`, keeping only 10% of
+files (10× fewer files) produces a coarser sample than keeping
+about 46% at each of the three levels.
+
+For a single-file, single-row-group input the file and row-group
+axes can't reduce — only the row axis is effective — so the actual
+fraction read is `q ≈ 0.79` for `p = 0.5`, not `0.5`. As file and
+row-group counts grow, the scan reads progressively closer to the
+requested `p`.
+
+`REPEATABLE(seed)` mixes the seed into every random draw, so all
+three levels produce the same selection across runs. The selection
+also depends on the file name, the row-group index within the file,
+and the cluster size, so different files don't accidentally see
+correlated samples.
+
+The sampling is visible in `EXPLAIN`:
+
+```text
+DataSourceExec: file_groups={...}, projection=[...],
+  file_type=parquet,
+  sample_row_group_fraction=0.7937, sample_row_fraction=0.7937
+```
+
+There is no `SampleExec` in this plan — the `SamplePushdown`
+optimizer rule absorbed the sample into the source. If pushdown is
+not possible (for example, against a non-Parquet source that does
+not implement `try_push_sample`), the rule errors at planning time.
+
+### Limitations
+
+The built-in planner accepts only `TABLESAMPLE SYSTEM(p%)` with an
+optional `REPEATABLE(seed)`. The following forms error at planning
+time:
+
+- `TABLESAMPLE BERNOULLI(...)` — register a custom `RelationPlanner`.
+- `TABLESAMPLE (N ROWS)` — use `LIMIT N` instead, or a custom planner.
+- `TABLESAMPLE BUCKET m OUT OF n` — Hive bucket sampling is not
+  supported.
+- `TABLESAMPLE ... OFFSET ...` — ClickHouse-style offset sampling is
+  not supported.
+- Fractions outside `(0, 100]`.
+
 ## WHERE clause
 
 Example: