@@ -30,7 +30,7 @@ DataFusion supports the following syntax for queries:
3030
3131[ [WITH](#with-clause) with_query [, ...] ] <br/>
3232[SELECT](#select-clause) [ ALL | DISTINCT ] select_expr [, ...] <br/>
33- [ [FROM](#from-clause) from_item [, ...] ] <br/>
33+ [ [FROM](#from-clause) from_item [ [TABLESAMPLE](#tablesample-clause) ... ] [, ...] ] <br/>
3434[ [JOIN](#join-clause) join_item [, ...] ] <br/>
3535[ [WHERE](#where-clause) condition ] <br/>
3636[ [GROUP BY](#group-by-clause) grouping_element [, ...] ] <br/>
@@ -76,6 +76,117 @@ Example:
7676SELECT t.a FROM table AS t
7777```
7878
79+ ## TABLESAMPLE clause
80+
81+ `TABLESAMPLE` returns a random subset of rows from a table. It's
82+ useful for ad-hoc data exploration ("give me roughly 1% of this
83+ table"), bounded `EXPLAIN ANALYZE` runs against representative data,
84+ and any analytics workload where an approximate answer is acceptable
85+ in exchange for reading less data.
86+
87+ ```sql
88+ SELECT * FROM table TABLESAMPLE SYSTEM (10); -- ~10% of the table
89+ SELECT * FROM table TABLESAMPLE SYSTEM (5) REPEATABLE (42); -- deterministic
90+ ```
91+
92+ The percentage must lie in the range `(0, 100]`. `REPEATABLE(seed)` makes
93+ the sample deterministic: the same seed against the same data always
94+ returns the same rows.
95+
96+ ### What `SYSTEM` means
97+
98+ `SYSTEM` is **block-level** sampling: instead of evaluating a
99+ per-row coin flip, the scan keeps or drops whole blocks of rows
100+ chosen at random. This is the same behaviour PostgreSQL documents
101+ for `TABLESAMPLE SYSTEM` and what Hive calls `BLOCK` sampling
102+ (DataFusion accepts `BLOCK` as an alias for `SYSTEM`).
103+
104+ The trade-off vs. row-level sampling (`BERNOULLI`):
105+
106+ - **`SYSTEM`** is fast: the scan can skip blocks entirely, so I/O
107+ shrinks roughly in proportion to the requested fraction. Rows
108+ inside each kept block are correlated, so the sample is
109+ statistically less precise than per-row sampling.
110+ - **`BERNOULLI`** evaluates `random() < p` per row, so every row is
111+ read but only some are kept. Statistically tighter, but no I/O
112+ saving.
113+
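The trade-off above can be illustrated with a small simulation. This is only a sketch in plain Python, not DataFusion code: `bernoulli_sample`, `block_sample`, and the `block_size` parameter are hypothetical names introduced here, and an in-memory list stands in for a table.

```python
import random


def bernoulli_sample(rows, p, seed):
    """Row-level sampling: visit every row and keep it with probability p."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < p]


def block_sample(rows, block_size, p, seed):
    """Block-level sampling: keep or drop whole blocks with probability p."""
    rng = random.Random(seed)
    kept = []
    for start in range(0, len(rows), block_size):
        if rng.random() < p:
            # A kept block contributes all of its rows; a dropped block
            # would never be read at all in a real scan.
            kept.extend(rows[start:start + block_size])
    return kept


rows = list(range(1_000))
by_row = bernoulli_sample(rows, 0.1, seed=42)
by_block = block_sample(rows, block_size=100, p=0.1, seed=42)
```

Both functions keep about 10% of the rows in expectation, but `by_block` always consists of complete runs of 100 adjacent rows, which is where the correlation between sampled rows comes from.
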
114+ DataFusion only ships `SYSTEM` out of the box. To add `BERNOULLI` or
115+ other forms, register a [`RelationPlanner`] extension; see the
116+ [extending SQL] guide and the [`relation_planner` example].
117+
118+ [`RelationPlanner`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/planner/trait.RelationPlanner.html
119+ [extending SQL]: ../../library-user-guide/extending-sql.md
120+ [`relation_planner` example]: https://github.com/apache/datafusion/tree/main/datafusion-examples/examples/relation_planner
121+
122+ ### How `SYSTEM` is implemented for Parquet
123+
124+ For a `ParquetSource`, `TABLESAMPLE SYSTEM(p%)` is pushed all the way
125+ into the scan rather than evaluated as a post-scan filter. The plan
126+ contains no `SampleExec`; instead, `ParquetSource` itself drops
127+ files, row groups, and row clusters in proportion to `p`.
128+
129+ The selection uses a **cube-root hybrid**: taking `p` as the requested
130+ fraction (the percentage divided by 100), the source applies
131+ `q = p^(1/3)` at three independent levels:
132+
133+ 1. **File level**: keep `⌈n_files * q⌉` files (always at least one)
134+ chosen by a seeded shuffle.
135+ 2. **Row-group level**: within each kept file, keep
136+ `⌈n_row_groups * q⌉` row groups, chosen by the parquet opener
137+ once it has loaded the footer.
138+ 3. **Row level**: within each kept row group, keep `q` of the rows
139+ as a small number of contiguous windows, materialised as a
140+ `RowSelection` so the parquet reader can use the page index to
141+ skip data pages entirely.
142+
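The three levels can be sketched in plain Python. This is an illustration under assumptions, not the `ParquetSource` implementation: `keep_count`, `pick_units`, `row_windows`, and the `n_windows` parameter are hypothetical, the same seeded shuffle is assumed for row groups as for files, and the windows are placed one per equal-sized stratum purely to keep them from overlapping.

```python
import math
import random


def keep_count(n, q):
    """Number of units to keep: ceil(n * q), but never fewer than one."""
    return max(1, math.ceil(n * q))


def pick_units(n, q, seed):
    """Choose which of n files (or row groups) survive via a seeded shuffle."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return sorted(order[:keep_count(n, q)])


def row_windows(n_rows, q, n_windows, seed):
    """Keep about q of the rows as n_windows contiguous (start, end) ranges.

    The row group is cut into n_windows equal strata and one window is
    placed at a random offset inside each stratum, so windows never overlap.
    """
    rng = random.Random(seed)
    stratum = n_rows // n_windows
    width = round(stratum * q)
    windows = []
    for i in range(n_windows):
        offset = rng.randint(0, stratum - width)
        start = i * stratum + offset
        windows.append((start, start + width))
    return windows


p = 0.5
q = p ** (1 / 3)  # about 0.7937
files = pick_units(10, q, seed=42)       # keeps 8 of 10 files
row_groups = pick_units(4, q, seed=42)   # keeps 4 of 4 row groups
windows = row_windows(10_000, q, n_windows=4, seed=42)
```

Because of the ceiling, small counts round up: with four row groups and `q ≈ 0.79` all four are kept, so the row-group axis contributes no reduction in that case.
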
143+ Multiplying the three axes gives `q × q × q = p`, so the expected
144+ result size is about `p × N` rows, the same as if each row had been
145+ kept independently with probability `p`. Spreading the reduction
146+ across all three axes means the I/O saving at small fractions is not
147+ concentrated at one granularity: keeping only 10% of the files (10×
148+ fewer files) produces a coarser sample than spreading the same 90%
149+ reduction evenly across all three axes.
150+
151+ For a single-file, single-row-group input the file and row-group
152+ axes can't reduce anything, so only the row axis is effective and the
153+ actual fraction read is `q ≈ 0.79` for `p = 0.5`, not `0.5`. As file and
154+ row-group counts grow, the scan reads progressively closer to the
155+ requested `p`.
156+
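The arithmetic behind this can be checked with a few lines of Python. The helper `expected_fraction_read` is hypothetical and assumes equally sized files and row groups, which real tables rarely have; it only applies the `⌈n * q⌉` rule described above.

```python
import math


def expected_fraction_read(p, n_files, n_row_groups):
    """Fraction of rows read when q = p ** (1/3) is applied at each level.

    Assumes every file has n_row_groups row groups of equal size.
    """
    q = p ** (1 / 3)
    file_frac = max(1, math.ceil(n_files * q)) / n_files
    row_group_frac = max(1, math.ceil(n_row_groups * q)) / n_row_groups
    return file_frac * row_group_frac * q


single = expected_fraction_read(0.5, n_files=1, n_row_groups=1)
many = expected_fraction_read(0.5, n_files=100, n_row_groups=100)
```

Under these assumptions `single` is about 0.79 (only the row axis reduces anything), while `many` is about 0.51, since 80 of 100 files and 80 of 100 row groups are kept before the row-level fraction is applied.
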
157+ `REPEATABLE(seed)` mixes the seed into every random draw, so all
158+ three levels produce the same selection across runs. The selection
159+ also depends on the file name, the row-group index within the file,
160+ and the cluster size, so different files don't accidentally see
161+ correlated samples.
162+
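One way to picture this seed mixing is to derive a separate random stream per row group from all four inputs. The sketch below is an assumption for illustration only: the `unit_rng` helper, the use of SHA-256, and the file names are not taken from DataFusion, whose actual mixing scheme is not described here.

```python
import hashlib
import random


def unit_rng(seed, file_name, row_group_index, cluster_size):
    """Derive an independent, reproducible RNG for one row group.

    Hashing the seed together with the file name, row-group index, and
    cluster size is an illustrative choice, not DataFusion's actual scheme.
    """
    key = f"{seed}|{file_name}|{row_group_index}|{cluster_size}".encode()
    digest = hashlib.sha256(key).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


# Same inputs give the same draw; a different file gives a different stream.
a = unit_rng(42, "part-0.parquet", 0, 1024).random()
b = unit_rng(42, "part-0.parquet", 0, 1024).random()
c = unit_rng(42, "part-1.parquet", 0, 1024).random()
```

With a scheme like this, rerunning with the same seed reproduces every draw, while two files that happen to share a layout still get unrelated selections.
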
163+ The sampling is visible in `EXPLAIN`:
164+
165+ ```text
166+ DataSourceExec: file_groups={...}, projection=[...],
167+ file_type=parquet,
168+ sample_row_group_fraction=0.7937, sample_row_fraction=0.7937
169+ ```
170+
171+ There is no `SampleExec` in this plan because the `SamplePushdown`
172+ optimizer rule absorbed the sample into the source. If pushdown is
173+ not possible (for example, against a non-Parquet source that does
174+ not implement `try_push_sample`), the rule errors at planning time.
175+
176+ ### Limitations
177+
178+ The built-in planner accepts only `TABLESAMPLE SYSTEM(p%)` with an
179+ optional `REPEATABLE(seed)`. The following forms error at planning
180+ time:
181+
182+ - `TABLESAMPLE BERNOULLI(...)`: register a custom `RelationPlanner`.
183+ - `TABLESAMPLE (N ROWS)`: use `LIMIT N` instead, or a custom planner.
184+ - `TABLESAMPLE BUCKET m OUT OF n`: Hive bucket sampling is not
185+ supported.
186+ - `TABLESAMPLE ... OFFSET ...`: ClickHouse-style offset sampling is
187+ not supported.
188+ - Percentages outside `(0, 100]`.
189+
79190## WHERE clause
80191
81192Example: