Commit edf02c2
refactor(parquet sampling): address PR apache#22000 review feedback
Two changes responding to review on the parent commit:
1. Key sampling on a stable `file_index` instead of `file_name`
(apache#22000 (comment)).
Both `apply_row_group_sampling` and `apply_row_fraction_sampling`
now take `file_index: usize` rather than `file_name: &str`. The
parquet opener passes the execution `partition_index`. This makes
sampling reproducible across environments (no dependency on the
on-disk path), while still decorrelating files assigned to
different partitions.
2. Extract the row-window selection into `build_row_window_selectors`
and add fuzz coverage
(apache#22000 (comment)).
The previous inline arithmetic could produce overlapping windows
when `target_rows` was close to `total_rows`: `window_size =
ceil(target / n_windows)` could exceed `stride = total / n_windows`,
so adjacent strides' windows would intersect. The extracted
function caps `window_size` at `stride` (the construction that
guarantees disjointness) and is covered by:
* `row_window_selection_basic_layout` — hand-checked anchor case.
* `row_window_selection_returns_none_on_invalid_input` — degenerate
inputs return `None` cleanly.
* `row_window_selection_full_target_no_overlap` — the previously
buggy `target_rows == total_rows` case.
* `row_window_selection_fuzz_invariants` — 5 000 randomized
`(total_rows, target_rows, cluster_size, seed)` configurations,
asserting full coverage, in-bounds positions, and no overlap.
* `row_window_selection_fuzz_determinism` — 1 000 iterations
verifying identical seeds produce identical layouts.
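The capping step described in item 2 can be sketched as follows. This is a minimal illustration, not the code from this commit: `sketch_row_windows`, `RowWindow`, `ceil_div`, and the SplitMix64 helper are hypothetical names, the derivation `n_windows = ceil(target_rows / cluster_size)` is an assumption, and the real `build_row_window_selectors` may return a different type and use a different random source.

```rust
/// Hypothetical stand-in for the real selector type: a half-open row range.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RowWindow {
    pub start: usize,
    pub len: usize,
}

/// SplitMix64 step, used only to keep this sketch dependency-free.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Ceiling division for a non-zero divisor.
fn ceil_div(a: usize, b: usize) -> usize {
    (a + b - 1) / b
}

/// Places one window inside each stride of the row range.
/// Returns `None` on degenerate input (assumed behavior).
pub fn sketch_row_windows(
    total_rows: usize,
    target_rows: usize,
    cluster_size: usize,
    seed: u64,
) -> Option<Vec<RowWindow>> {
    if total_rows == 0 || target_rows == 0 || cluster_size == 0 || target_rows > total_rows {
        return None;
    }
    // Assumption: the window count is derived from the cluster size.
    let n_windows = ceil_div(target_rows, cluster_size);
    // n_windows <= target_rows <= total_rows, so stride is at least 1.
    let stride = total_rows / n_windows;
    // The cap: a window never exceeds its stride, so neighbours cannot intersect.
    let window_size = ceil_div(target_rows, n_windows).min(stride);
    let slack = stride - window_size;

    let mut state = seed;
    let mut windows = Vec::with_capacity(n_windows);
    for i in 0..n_windows {
        // Seeded offset inside the stride; the window stays within [i*stride, (i+1)*stride).
        let offset = (splitmix64(&mut state) % (slack as u64 + 1)) as usize;
        windows.push(RowWindow { start: i * stride + offset, len: window_size });
    }
    Some(windows)
}
```

With `total_rows = 10`, `target_rows = 10`, and `cluster_size = 3`, the uncapped `window_size` would be 3 against a `stride` of 2, which is the overlap described above. In this sketch the cap reduces each window to 2 rows, so it may select fewer rows than `target_rows` in exchange for disjoint windows.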
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent e4c9d3b commit edf02c2
2 files changed
Lines changed: 314 additions & 71 deletions