Skip to content

Commit edf02c2

Browse files
adriangbclaude
andcommitted
refactor(parquet sampling): address PR apache#22000 review feedback
Two changes responding to review on the parent commit: 1. Key sampling on a stable `file_index` instead of `file_name` (apache#22000 (comment)). Both `apply_row_group_sampling` and `apply_row_fraction_sampling` now take `file_index: usize` rather than `file_name: &str`. The parquet opener passes the execution `partition_index`. This makes sampling reproducible across environments (no dependency on the on-disk path), while still decorrelating files assigned to different partitions. 2. Extract the row-window selection into `build_row_window_selectors` and add fuzz coverage (apache#22000 (comment)). The previous inline arithmetic could produce overlapping windows when `target_rows` was close to `total_rows`: `window_size = ceil(target / n_windows)` could exceed `stride = total / n_windows`, so adjacent strides' windows would intersect. The extracted function caps `window_size` at `stride` (the construction that guarantees disjointness) and is covered by: * `row_window_selection_basic_layout` — hand-checked anchor case. * `row_window_selection_returns_none_on_invalid_input` — degenerate inputs return `None` cleanly. * `row_window_selection_full_target_no_overlap` — the previously buggy `target_rows == total_rows` case. * `row_window_selection_fuzz_invariants` — 5 000 randomized `(total_rows, target_rows, cluster_size, seed)` configurations, asserting full coverage, in-bounds positions, and no overlap. * `row_window_selection_fuzz_determinism` — 1 000 iterations verifying identical seeds produce identical layouts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent e4c9d3b commit edf02c2

2 files changed

Lines changed: 314 additions & 71 deletions

File tree

datafusion/datasource-parquet/src/opener.rs

Lines changed: 8 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -896,17 +896,21 @@ impl FiltersPreparedParquetOpen {
896896
// Apply optional row-group and row-range sampling now that we
897897
// know the actual row-group count. Both calls are no-ops when
898898
// their respective fraction is `None`. Selection is
899-
// deterministic per `(file_name, row_group_index, fraction,
900-
// cluster_size)` so re-runs match.
899+
// deterministic per `(partition_index, row_group_index,
900+
// fraction, cluster_size)` so re-runs match. The execution
901+
// `partition_index` is the stable per-file id we plumb in:
902+
// it makes sampling reproducible across environments without
903+
// depending on object-store paths, and decorrelates files
904+
// assigned to different partitions.
901905
prepared.sampling.apply_row_group_sampling(
902906
&mut initial_plan,
903907
rg_metadata.len(),
904-
&prepared.file_name,
908+
prepared.partition_index,
905909
);
906910
prepared.sampling.apply_row_fraction_sampling(
907911
&mut initial_plan,
908912
rg_metadata,
909-
&prepared.file_name,
913+
prepared.partition_index,
910914
);
911915

912916
let mut row_groups = RowGroupAccessPlanFilter::new(initial_plan);

0 commit comments

Comments
 (0)