Commit ec48974
refactor(parquet sampling): address PR apache#22000 review feedback
Two changes responding to review on the parent commit:
1. Key sampling on a stable `file_index` instead of `file_name`
(apache#22000 (comment)).
Both `apply_row_group_sampling` and `apply_row_fraction_sampling`
now take `file_index: usize` rather than `file_name: &str`. The
parquet opener passes the execution `partition_index`. This makes
sampling reproducible across environments (no dependency on the
on-disk path), while still decorrelating files assigned to
different partitions.
2. Extract the row-window selection into `build_row_window_selectors`
and add fuzz coverage
(apache#22000 (comment)).
The previous inline arithmetic could produce overlapping windows
when `target_rows` was close to `total_rows`: `window_size =
ceil(target / n_windows)` could exceed `stride = total / n_windows`,
so adjacent strides' windows would intersect. The extracted
function caps `window_size` at `stride` (the construction that
guarantees disjointness) and is covered by:
* `row_window_selection_basic_layout` — hand-checked anchor case.
* `row_window_selection_returns_none_on_invalid_input` — degenerate
inputs return `None` cleanly.
* `row_window_selection_full_target_no_overlap` — the previously
buggy `target_rows == total_rows` case.
* `row_window_selection_fuzz_invariants` — 5 000 randomized
`(total_rows, target_rows, cluster_size, seed)` configurations,
asserting full coverage, in-bounds positions, and no overlap.
* `row_window_selection_fuzz_determinism` — 1 000 iterations
verifying identical seeds produce identical layouts.
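To illustrate the disjointness argument above, here is a minimal sketch of the capped-window arithmetic. It is not the code from this commit: `sketch_row_windows`, `RowWindow`, the `n_windows` parameter, and the SplitMix64-style mixer are hypothetical names and choices made for illustration, and the real `build_row_window_selectors` may differ in signature and behavior (for example, it takes a `cluster_size`).

```rust
/// A half-open row range `[start, end)` kept by the sampler (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RowWindow {
    pub start: usize,
    pub end: usize,
}

/// SplitMix64 step: a small deterministic mixer used here only so the
/// sketch is reproducible for a given seed without external crates.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Hypothetical stand-in for the extracted helper: place one window in
/// each stride and cap the window size at the stride length.
pub fn sketch_row_windows(
    total_rows: usize,
    target_rows: usize,
    n_windows: usize,
    seed: u64,
) -> Option<Vec<RowWindow>> {
    // Degenerate inputs return `None` instead of panicking.
    if total_rows == 0
        || target_rows == 0
        || n_windows == 0
        || target_rows > total_rows
        || n_windows > total_rows
    {
        return None;
    }

    // Integer division; at least 1 because n_windows <= total_rows.
    let stride = total_rows / n_windows;
    // ceil(target_rows / n_windows), then capped so a window never
    // spills past the end of its own stride.
    let window_size = ((target_rows + n_windows - 1) / n_windows).min(stride);
    // Free room inside each stride for a seeded offset.
    let slack = stride - window_size;

    let mut state = seed;
    let mut windows = Vec::with_capacity(n_windows);
    for i in 0..n_windows {
        let offset = (splitmix64(&mut state) % (slack as u64 + 1)) as usize;
        let start = i * stride + offset;
        windows.push(RowWindow { start, end: start + window_size });
    }
    Some(windows)
}
```

Because `offset + window_size <= stride`, window `i` stays inside `[i * stride, (i + 1) * stride)`, so adjacent windows cannot intersect and the last window ends at or before `total_rows`. One consequence of the cap in this sketch is that it can select slightly fewer than `target_rows` rows when `total_rows` is not a multiple of `n_windows`.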
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 30cd44d commit ec48974
2 files changed
Lines changed: 346 additions & 111 deletions