adaptive cube-root: don't undershoot small parquet scans

The fixed cube-root split applies `cbrt(p)` on each of the file,
row-group, and row axes. For a single-file `SYSTEM(10)` scan the
file and row-group axes had no capacity to reduce, so only the row
axis took effect and the scan read `cbrt(0.1) ≈ 46%` of the rows.
The user-visible behaviour was: ask for 10%, read 46%.

Adapt the split based on the actual axis cardinality:
- `try_push_sample` (knows num_files): when num_files == 1, hand the
entire fraction to the opener via a new
`ParquetSampling.system_target_remaining`. When num_files >= 2,
keep ~cbrt(p) of the files (`target_files`) and pass
`p × num_files / target_files` as the residual, which also
compensates for the file-level rounding.
- Opener (knows the row-group count per file): when remaining is set
and n_row_groups >= 2, apply sqrt(remaining) at each of the
row-group and row axes; when n_row_groups == 1, apply the full
residual at the row level. Falls back to the legacy explicit
fractions when `system_target_remaining` is None, so direct callers
of `with_row_group_sampling` / `with_row_fraction` are unaffected.
- EXPLAIN now surfaces `sample_system_target_remaining` for the
pushdown path.
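
The two-stage split described in the bullets above can be sketched as
follows. This is an illustration only, not the code in this commit:
`FileLevelPlan`, `plan_file_level` and `split_in_opener` are hypothetical
names, and rounding the file count to the nearest integer, clamping it to
at least one file, and capping the residual at 1.0 are assumptions.

```rust
/// Hypothetical result of the planner-side (file-axis) decision.
#[derive(Debug, Clone, PartialEq)]
pub struct FileLevelPlan {
    /// Number of files kept by the file-axis sample.
    pub target_files: usize,
    /// Fraction still to be applied inside each kept file.
    pub remaining: f64,
}

/// Planner side: knows `num_files`, not the row-group counts.
pub fn plan_file_level(p: f64, num_files: usize) -> FileLevelPlan {
    if num_files <= 1 {
        // Nothing to drop on the file axis: pass the whole fraction down.
        return FileLevelPlan { target_files: num_files, remaining: p };
    }
    // Keep roughly cbrt(p) of the files (assumed: round to nearest, min 1).
    let target = (num_files as f64 * p.cbrt()).round() as usize;
    let target_files = target.clamp(1, num_files);
    // The residual absorbs the rounding error of the file-axis step.
    let remaining = (p * num_files as f64 / target_files as f64).min(1.0);
    FileLevelPlan { target_files, remaining }
}

/// Opener side: knows the row-group count of one file.
/// Returns `(row_group_fraction, row_fraction)`.
pub fn split_in_opener(remaining: f64, n_row_groups: usize) -> (f64, f64) {
    if n_row_groups >= 2 {
        // Two axes can reduce: share the residual evenly between them.
        let s = remaining.sqrt();
        (s, s)
    } else {
        // Only the row axis can reduce: apply the full residual there.
        (1.0, remaining)
    }
}
```

Under these assumptions, `plan_file_level(0.1, 10)` keeps 5 of 10 files and
passes a residual of 0.2, so the product of the three per-axis fractions is
again 0.1, whereas a single-file, single-row-group scan applies the full 0.1
at the row level.
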
Also:
- 7 unit tests on `try_push_sample` updated for the new contract +
new tests for the adaptive single-file case and rounding
compensation.
- 2 new opener end-to-end tests covering the single-RG and multi-RG
adaptive splits.
- SLT updated: `SYSTEM(50) REPEATABLE(42)` on a 1024-row single-file
/ single-RG input now returns 512 (was 813), and the EXPLAIN line
shows `sample_system_target_remaining=0.5000`.
- SLT error message: "could not be pushed" → "is not supported".
- User-guide docs updated to describe the adaptive split.
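
The row counts quoted above follow from the fractions alone. A minimal
arithmetic check, assuming the row-level sample returns the expected count
rounded to the nearest row (`legacy_row_fraction` and `expected_rows` are
hypothetical helpers, not part of the codebase):

```rust
/// Legacy behaviour on a single-file, single-row-group input: only the
/// row axis can reduce, and it applies cbrt(p) instead of p.
pub fn legacy_row_fraction(p: f64) -> f64 {
    p.cbrt()
}

/// Expected number of rows kept at a given row fraction
/// (assumed: rounded to the nearest whole row).
pub fn expected_rows(total_rows: u64, row_fraction: f64) -> u64 {
    (total_rows as f64 * row_fraction).round() as u64
}
```

With these helpers, `expected_rows(1024, legacy_row_fraction(0.5))` gives
813 and `expected_rows(1024, 0.5)` gives 512, matching the old and new SLT
results; `legacy_row_fraction(0.1)` is about 0.464, the 46% seen for
`SYSTEM(10)`.
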
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>