hide cube-root math from public API

adriangb · claude · adriangb · commit 4cbbd1053eb9 · 2026-05-03T19:29:34.000-05:00
The hierarchical sampling strategy (cube-root file split, sqrt
row-group/row split, single-axis fallback for small inputs) is an
implementation detail and may evolve. Stop leaking it through the
public API:

- `ParquetSampling.system_target_remaining` becomes `pub(crate)`.
  External callers can't see or set it; it stays a coordination
  channel between `try_push_sample` and the parquet opener.
- `try_push_sample`'s rustdoc no longer describes the cube-root /
  sqrt math. It now documents *behaviour* (absorb a TABLESAMPLE
  request, drop files / row groups / rows in proportion) and
  points at the SQL reference for users who want to know how it's
  implemented today. The math survives as an inline comment for
  maintainers.
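
For reviewers who want the arithmetic in one place, here is a minimal
sketch of the split as the removed rustdoc describes it. The helper
names `split_file_axis` and `split_opener_axes` are hypothetical and
do not exist in the crate; the clamping of the file count is also an
assumption. It only illustrates the math, not the actual code path.

```rust
/// Hypothetical helper (not in the crate): file-level split for a
/// SYSTEM fraction `p`. Returns the number of files to keep (None
/// when the file axis cannot reduce) and the fraction the opener
/// should still retain.
fn split_file_axis(p: f64, num_files: usize) -> (Option<usize>, f64) {
    let p = p.clamp(0.0, 1.0);
    if num_files <= 1 {
        // Single-file scan: pass the full budget to the opener.
        return (None, p);
    }
    // target_files = ceil(n * p^(1/3)); the clamp to [1, n] is an
    // assumption made so the sketch never keeps zero files.
    let target = ((num_files as f64) * p.cbrt()).ceil() as usize;
    let target = target.clamp(1, num_files);
    // remaining = p * num_files / target_files
    let remaining = (p * num_files as f64 / target as f64).min(1.0);
    (Some(target), remaining)
}

/// Hypothetical helper (not in the crate): opener-side split.
/// Returns (row_group_fraction, row_fraction).
fn split_opener_axes(remaining: f64, num_row_groups: usize) -> (f64, f64) {
    if num_row_groups >= 2 {
        // Split evenly so the product stays at `remaining`.
        let q = remaining.sqrt();
        (q, q)
    } else {
        // A single row group cannot reduce; apply it all at row level.
        (1.0, remaining)
    }
}
```

In this sketch, `split_file_axis(0.1, 10)` keeps 5 of 10 files and
leaves 0.2 for the opener, so the overall expected fraction is still
0.5 × 0.2 = 0.1.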

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/datafusion/datasource-parquet/src/source.rs b/datafusion/datasource-parquet/src/source.rs
@@ -346,26 +346,15 @@ pub struct ParquetSampling {
     /// `ceil(target / row_cluster_size)` windows distributed across
     /// the row group with a random offset within each stride.
     pub row_cluster_size: usize,
-    /// SYSTEM-mode adaptive target: the fraction of rows the opener
-    /// should retain *after* file-level filtering has already been
-    /// applied (via `keep_files` from the `SamplePushdown` rule).
+    /// Internal coordination channel between
+    /// [`ParquetSource::try_push_sample`] and the parquet opener.
+    /// Not part of the public sampling API — direct callers configure
+    /// sampling via the per-axis builders above. See the
+    /// [`TABLESAMPLE clause`] section of the SQL reference for the
+    /// pushdown strategy this implements.
     ///
-    /// When set, the opener ignores `row_group_fraction` and
-    /// `row_fraction` and instead chooses them adaptively based on
-    /// the actual row-group count it sees per file:
-    ///
-    /// * Multiple row groups: split as `q = sqrt(remaining)` at both
-    ///   row-group and row level so the product stays at `remaining`.
-    /// * Single row group: skip row-group sampling and apply
-    ///   `remaining` directly at the row level — otherwise a 1-RG
-    ///   file would only achieve `sqrt(remaining)` of the target.
-    ///
-    /// This is the field [`ParquetSource::try_push_sample`] sets when
-    /// it absorbs a `TABLESAMPLE SYSTEM(p%)`. Direct callers of
-    /// [`ParquetSource::with_row_group_sampling`] /
-    /// [`ParquetSource::with_row_fraction`] leave this `None` and
-    /// retain the explicit per-axis behaviour.
-    pub system_target_remaining: Option<f64>,
+    /// [`TABLESAMPLE clause`]: https://datafusion.apache.org/user-guide/sql/select.html#tablesample-clause
+    pub(crate) system_target_remaining: Option<f64>,
 }
 
 impl Default for ParquetSampling {
@@ -987,30 +976,17 @@ impl FileSource for ParquetSource {
         Ok(tnr)
     }
 
-    /// Absorb a TABLESAMPLE-shaped sample request.
+    /// Absorb a `TABLESAMPLE`-shaped sample request into the parquet
+    /// scan: drop files, row groups, and rows in proportion to the
+    /// requested fraction, with no `SampleExec` left in the plan.
     ///
-    /// For SYSTEM sampling we use a hierarchical "cube-root" split
-    /// across files, row groups, and rows. The ideal split for a
-    /// requested fraction `p` is `q = p^(1/3)` at each axis so the
-    /// product stays at `p`, but we adapt when an axis can't reduce:
+    /// `SYSTEM` sampling is the only supported method. The sampling
+    /// strategy (a hierarchical block-level reduction across files,
+    /// row groups, and rows) is described in the [`TABLESAMPLE clause`]
+    /// section of the SQL reference; it is intentionally not part of
+    /// the public sampling API and may evolve.
     ///
-    /// * Single-file scan (`num_files <= 1`): the file axis is fixed
-    ///   at 1.0, so we hand the full `p` budget to the opener via
-    ///   [`ParquetSampling::system_target_remaining`]. The opener
-    ///   then either splits row-group × row at `sqrt(p)` each (when
-    ///   it sees ≥ 2 row groups) or applies the full `p` at the row
-    ///   level alone (when it sees 1 row group).
-    /// * Multi-file scan: keep `target_files = ⌈n × p^(1/3)⌉` files
-    ///   via a seeded shuffle, then ask the opener to retain
-    ///   `remaining = p × num_files / target_files` of the rows it
-    ///   sees. The opener again splits row-group × row adaptively.
-    ///
-    /// The result: the expected output size is always close to
-    /// `p × N_total`, regardless of how many files / row groups the
-    /// scan happens to have. Without this adaptation, a single-file
-    /// `SYSTEM(10)` would read `cbrt(0.1) ≈ 46%` of the rows because
-    /// the file and row-group axes can't reduce. See the PR
-    /// description for the full math.
+    /// [`TABLESAMPLE clause`]: https://datafusion.apache.org/user-guide/sql/select.html#tablesample-clause
     fn try_push_sample(
         &self,
         spec: &datafusion_physical_plan::sample_pushdown::SampleSpec,
@@ -1020,6 +996,12 @@ impl FileSource for ParquetSource {
         use datafusion_datasource::file::FileSourceSampleResult;
         use datafusion_physical_plan::sample_pushdown::SampleMethod;
 
+        // Implementation detail (not promised by the public API): the
+        // file axis uses ~p^(1/3) so the cube-root product across
+        // file × row-group × row stays at p; the opener handles the
+        // remaining row-group × row split adaptively based on the
+        // actual row-group count it sees per file.
+
         match spec.method {
             SampleMethod::System => {
                 let p = spec.fraction.clamp(0.0, 1.0);
@@ -1031,10 +1013,6 @@ impl FileSource for ParquetSource {
                     });
                 }
 
-                // File level: when there is more than one file, drop
-                // ~p^(1/3) of them. With a single file the axis can't
-                // reduce, so pass the full budget through to the
-                // opener.
                 let (keep_files, remaining_p) = if num_files <= 1 {
                     (None, p)
                 } else {