Commit 71e5c6a
fix(merge): adapter rejects unsorted input; consumer honors SS-3; stronger test verifiers (#6426)
Three adversarial-review findings on the prefix/RG machinery, bundled
because they touch the same producer/consumer contract:
**F8: Legacy adapter rejects SS-1-violating input upfront.**
The adapter walked rows in physical order and emitted one RG per
prefix-value run. An unsorted legacy input (rows `[A,A,B,B,A,A]`)
produced a 3-RG file where two RGs shared prefix `A`, violating PA-3.
The streaming merge engine would later reject it mid-merge — but only
after a quietly-bad file had been built. Now `compute_prefix_value_slices`
tracks each slice's composite prefix-value bytes and bails with
`LegacyAdapterError::InputNotSorted` on duplicates, surfacing the
SS-1 violation before any file lands on disk.
**F12: Consumer-side SS-3 (cross-layer divergence, discovered while
wiring F2's chunk-level verifier into the SS-3 test).** The adapter
implements SS-3 correctly (missing-from-schema → synthesized NullArray
during slice computation, file stamps `prefix_len = N`). The streaming
engine's reader did not: `find_prefix_parquet_col_indices` hard-required
every named prefix column to be physically present, so a file the
adapter produced from an SS-3 input was unreadable by the merge engine.
Now `find_prefix_parquet_col_indices` returns `Vec<Option<PrefixColumn>>`
and `extract_rg_composite_prefix_key` emits a constant null marker
(`encode_byte_array_prefix(&[])`) for None slots. The column contributes
no cross-RG ordering signal (constant everywhere) so region boundaries
are driven entirely by the present columns. Both halves of SS-3 now
agree end-to-end.
Known limitation: cross-file SS-3 — where some inputs have a sort
column and others don't — uses [0x00, 0x00] for the null contribution,
which sorts BEFORE non-null per the encoded-empty-string convention.
That weakly violates SS-2 (nulls sort last). Single-file SS-3 is
correct because every RG in such a file contributes the same constant.
If cross-file SS-3 becomes a production scenario, the encoding needs
a leading-0xff sentinel instead. Not exercised today.
**F2/F9/F11: Wire `assert_unique_rg_prefix_keys` into prefix-claiming
tests.** Tests asserting `num_row_groups == N` + KV stamped to N would
have passed even with an off-by-one in slice-boundary detection or
column-content scrambling. The verifier reads chunk-level statistics
directly: PA-1 (intra-RG `min == max`) + PA-3 (inter-RG uniqueness)
on the composite key. Wired into six tests:
- streaming engine: `test_streaming_merge_with_prefix_len_two`,
`test_multi_rg_metric_aligned_input_produces_multi_rg_output`,
`test_streaming_merge_with_desc_prefix_col`
- legacy adapter: `test_target_prefix_len_two_splits_by_metric_and_service`,
`test_legacy_input_with_sort_fields_produces_prefix_aligned_multi_rg`,
`test_missing_prefix_col_treated_as_null_satisfies_alignment` (now
passes thanks to F12).
Also: `assert_unique_rg_prefix_keys` no longer short-circuits on
single-RG files — they still go through PA-1 because an unsorted
single-RG file CAN have `min != max` on a prefix column.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>1 parent 54c4872 commit 71e5c6a
2 files changed
Lines changed: 167 additions & 3 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1893 | 1893 | | |
1894 | 1894 | | |
1895 | 1895 | | |
| 1896 | + | |
| 1897 | + | |
| 1898 | + | |
| 1899 | + | |
| 1900 | + | |
| 1901 | + | |
| 1902 | + | |
| 1903 | + | |
| 1904 | + | |
| 1905 | + | |
| 1906 | + | |
1896 | 1907 | | |
1897 | 1908 | | |
1898 | 1909 | | |
| |||
2081 | 2092 | | |
2082 | 2093 | | |
2083 | 2094 | | |
| 2095 | + | |
| 2096 | + | |
| 2097 | + | |
| 2098 | + | |
| 2099 | + | |
| 2100 | + | |
| 2101 | + | |
| 2102 | + | |
| 2103 | + | |
| 2104 | + | |
| 2105 | + | |
| 2106 | + | |
| 2107 | + | |
| 2108 | + | |
2084 | 2109 | | |
2085 | 2110 | | |
2086 | 2111 | | |
| |||
2768 | 2793 | | |
2769 | 2794 | | |
2770 | 2795 | | |
| 2796 | + | |
| 2797 | + | |
| 2798 | + | |
| 2799 | + | |
| 2800 | + | |
| 2801 | + | |
| 2802 | + | |
| 2803 | + | |
| 2804 | + | |
| 2805 | + | |
2771 | 2806 | | |
2772 | 2807 | | |
2773 | 2808 | | |
| |||
Lines changed: 132 additions & 3 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
57 | 57 | | |
58 | 58 | | |
59 | 59 | | |
| 60 | + | |
60 | 61 | | |
61 | 62 | | |
62 | 63 | | |
| |||
141 | 142 | | |
142 | 143 | | |
143 | 144 | | |
| 145 | + | |
| 146 | + | |
| 147 | + | |
| 148 | + | |
| 149 | + | |
| 150 | + | |
| 151 | + | |
| 152 | + | |
| 153 | + | |
| 154 | + | |
| 155 | + | |
| 156 | + | |
| 157 | + | |
| 158 | + | |
| 159 | + | |
| 160 | + | |
| 161 | + | |
| 162 | + | |
| 163 | + | |
| 164 | + | |
144 | 165 | | |
145 | 166 | | |
146 | 167 | | |
| |||
305 | 326 | | |
306 | 327 | | |
307 | 328 | | |
308 | | - | |
| 329 | + | |
309 | 330 | | |
310 | 331 | | |
311 | 332 | | |
| |||
404 | 425 | | |
405 | 426 | | |
406 | 427 | | |
| 428 | + | |
| 429 | + | |
| 430 | + | |
| 431 | + | |
| 432 | + | |
| 433 | + | |
| 434 | + | |
| 435 | + | |
| 436 | + | |
407 | 437 | | |
408 | 438 | | |
409 | 439 | | |
| 440 | + | |
410 | 441 | | |
411 | 442 | | |
412 | 443 | | |
| |||
428 | 459 | | |
429 | 460 | | |
430 | 461 | | |
| 462 | + | |
| 463 | + | |
| 464 | + | |
431 | 465 | | |
432 | 466 | | |
| 467 | + | |
| 468 | + | |
| 469 | + | |
| 470 | + | |
| 471 | + | |
| 472 | + | |
| 473 | + | |
| 474 | + | |
| 475 | + | |
| 476 | + | |
| 477 | + | |
| 478 | + | |
| 479 | + | |
| 480 | + | |
| 481 | + | |
| 482 | + | |
| 483 | + | |
433 | 484 | | |
434 | 485 | | |
435 | | - | |
| 486 | + | |
436 | 487 | | |
437 | 488 | | |
438 | 489 | | |
439 | | - | |
| 490 | + | |
440 | 491 | | |
441 | 492 | | |
442 | 493 | | |
| |||
1363 | 1414 | | |
1364 | 1415 | | |
1365 | 1416 | | |
| 1417 | + | |
| 1418 | + | |
| 1419 | + | |
| 1420 | + | |
| 1421 | + | |
| 1422 | + | |
| 1423 | + | |
| 1424 | + | |
| 1425 | + | |
| 1426 | + | |
| 1427 | + | |
| 1428 | + | |
| 1429 | + | |
1366 | 1430 | | |
1367 | 1431 | | |
1368 | 1432 | | |
| |||
1660 | 1724 | | |
1661 | 1725 | | |
1662 | 1726 | | |
| 1727 | + | |
| 1728 | + | |
| 1729 | + | |
| 1730 | + | |
| 1731 | + | |
| 1732 | + | |
| 1733 | + | |
| 1734 | + | |
| 1735 | + | |
| 1736 | + | |
| 1737 | + | |
| 1738 | + | |
| 1739 | + | |
1663 | 1740 | | |
1664 | 1741 | | |
1665 | 1742 | | |
| |||
1732 | 1809 | | |
1733 | 1810 | | |
1734 | 1811 | | |
| 1812 | + | |
| 1813 | + | |
| 1814 | + | |
| 1815 | + | |
| 1816 | + | |
| 1817 | + | |
| 1818 | + | |
| 1819 | + | |
| 1820 | + | |
| 1821 | + | |
| 1822 | + | |
| 1823 | + | |
| 1824 | + | |
| 1825 | + | |
| 1826 | + | |
| 1827 | + | |
| 1828 | + | |
| 1829 | + | |
| 1830 | + | |
| 1831 | + | |
| 1832 | + | |
| 1833 | + | |
| 1834 | + | |
| 1835 | + | |
| 1836 | + | |
| 1837 | + | |
| 1838 | + | |
| 1839 | + | |
| 1840 | + | |
| 1841 | + | |
| 1842 | + | |
| 1843 | + | |
| 1844 | + | |
| 1845 | + | |
| 1846 | + | |
| 1847 | + | |
| 1848 | + | |
| 1849 | + | |
| 1850 | + | |
| 1851 | + | |
| 1852 | + | |
| 1853 | + | |
| 1854 | + | |
| 1855 | + | |
| 1856 | + | |
| 1857 | + | |
| 1858 | + | |
| 1859 | + | |
| 1860 | + | |
| 1861 | + | |
| 1862 | + | |
| 1863 | + | |
1735 | 1864 | | |
1736 | 1865 | | |
1737 | 1866 | | |
| |||
0 commit comments