Commit 54c4872
feat(legacy-adapter): prefix-aware output with caller-supplied target_prefix_len (#6425)
* feat(legacy-adapter): synthesize prefix-aligned row groups
The legacy adapter previously consolidated multi-RG legacy inputs
into a single oversized row group and left `rg_partition_prefix_len`
at the original's (typically `0`). The streaming merge engine then
sent these single-RG/prefix=0 inputs through the new sub-region
splitting path — correct, but it forfeits the prefix-aware fast path
for outputs derived from legacy inputs and gives up the row-group
pruning that prefix alignment enables.
After consolidating, the adapter now slices the resulting record
batch at first-sort-col transitions (typically `metric_name`) and
emits one parquet row group per slice, stamping the re-encoded file
with `qh.rg_partition_prefix_len = 1`. The merge engine then reads
it through the prefix-aware fast path: one region per metric_name,
the existing duplicate-prefix invariant on read validates uniqueness.
Fallback: if the original file has no `qh.sort_fields` KV, the
sort-fields string fails to parse, the first column can't be
resolved in the arrow schema, or the consolidated batch is empty,
the adapter reverts to a single-RG re-encode without claiming any
prefix alignment. That input still works — the engine's
prefix_len=0 sub-region splitting path picks it up. This keeps the
adapter robust for files written by very early versions of the
indexer that may pre-date the standard KV layout.
Implementation: `reencode_prefix_aligned` replaces
`reencode_as_single_row_group` and either dispatches to the new
multi-RG writer or to the legacy single-RG writer based on whether
the first sort col is resolvable. `RowConverter` handles the
prefix-value equality check uniformly across dictionary, utf8, and
primitive types. The KV injection helper replaces (rather than
appends) any existing `qh.rg_partition_prefix_len` so re-runs and
files mistakenly carrying a stale value still land at the freshly
synthesized prefix.
Tests:
- `test_legacy_input_with_sort_fields_produces_prefix_aligned_multi_rg`
— 3 metrics × 40 rows, multi-RG input → 3 prefix-aligned output
RGs and `qh.rg_partition_prefix_len = 1` KV.
- `test_legacy_input_single_metric_yields_one_rg_with_prefix_kv` —
one metric → one RG, prefix KV still stamped (vacuously aligned).
- `test_legacy_input_without_sort_fields_falls_back_to_single_rg` —
fallback path preserved when sort-fields KV is missing.
- All existing tests pass unchanged (they use empty KVs or
unparseable sort-fields strings, both of which exercise the
fallback path).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(legacy-adapter): parameterize on target_prefix_len with composite-prefix support
`LegacyInputAdapter::try_open` now takes `target_prefix_len: u32`
chosen by the caller, matching the merge plan's consensus prefix
length. The adapter slices the consolidated batch at every transition
of the first N sort columns (composite key, via `RowConverter` over
all N fields) and emits one output row group per slice, stamping the
output with `qh.rg_partition_prefix_len = target_prefix_len`. With
`target_prefix_len = 0` the adapter takes the original single-RG
passthrough path with no prefix-alignment claim.
A sort column that is named in `qh.sort_fields` but missing from the
file's arrow schema is treated as implicitly null at every row per
SS-3. A constantly-null column trivially satisfies alignment on that
column (null == null) and contributes no transitions, so the split
boundaries are driven by the columns that are present. This matches
the merge engine's compaction-time treatment of missing columns and
keeps a legacy file with an evolved schema usable as a prefix-aligned
input.
`PrefixUnresolvable` now fires only on cases where the file doesn't
advertise enough sort *names* to honor the request:
- `qh.sort_fields` absent or unparseable
- `qh.sort_fields` declares fewer sort columns than `target_prefix_len`
A column missing from the arrow schema no longer counts as
unresolvable; the adapter materialises a `NullArray` of the batch's
length in that slot and proceeds.
Tests:
- `test_target_prefix_len_zero_passes_through_as_single_rg` — explicit
N=0 fallback, no prefix KV stamped.
- `test_target_prefix_len_two_splits_by_metric_and_service` — composite
prefix (`metric_name`, `service`) → 4 RGs, KV declares prefix_len=2.
- `test_target_prefix_len_one_without_sort_fields_returns_unresolvable`
— no `qh.sort_fields` KV → `PrefixUnresolvable`.
- `test_target_prefix_len_exceeds_declared_sort_cols_returns_unresolvable`
— sort schema declares 2 cols, caller asks 3 → `PrefixUnresolvable`.
- `test_missing_prefix_col_treated_as_null_satisfies_alignment` —
sort schema declares `metric_name|env|-timestamp_secs` but `env`
is absent from the arrow schema → no error, only metric_name
transitions split RGs, KV still stamps prefix_len=2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(legacy_adapter): note where reader-side SS-3 handling lands
Codex P2 on PR #6425: the adapter records `None` for missing prefix
columns and stamps `rg_partition_prefix_len = target_prefix_len`
anyway. In isolation that produces a file with an advertised prefix
the current reader (`find_prefix_parquet_col_indices` on the #6425
state) bails on.
The reader-side fix — returning `Vec<Option<PrefixColumn>>` and
synthesizing a constant `[0x00, 0x00]` byte for `None` slots —
lands in PR #6426 (the hardening slice, F12 from the adversarial
review). The only caller of this adapter is
`execute_merge_operation`, introduced in PR #6423 which sits above
#6426 in the stack, so no production caller can produce a missing-
column prefix until the reader fix is in place.
Adding the in-code pointer so a future reader bisecting the stack
doesn't have to trace the relationship from scratch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(merge): consumer honors SS-3 (move F12 forward from #6426 to #6425)
Previously the F12 fix — "consumer side honors SS-3 missing prefix
columns" — lived in the hardening PR (#6426). At the #6425
isolation level, the legacy adapter records `None` for a prefix
column absent from the parquet schema and stamps
`rg_partition_prefix_len = target_prefix_len` on the output, but
the reader's `find_prefix_parquet_col_indices` bails on any missing
column. So #6425 + #6424 alone would produce a legacy-adapter file
that the streaming-merge reader rejects mid-merge — i.e. a known-
incoherent intermediate stack state.
Move F12 into this PR so the adapter and reader agree at the same
slice:
- `find_prefix_parquet_col_indices` now returns
`Result<Vec<Option<PrefixColumn>>>`. `Some(_)` when the column
is present in the parquet schema; `None` per SS-3 when the
column is named in `qh.sort_fields` but absent from the schema.
- `extract_rg_composite_prefix_key` skips `None` slots entirely
(no ordinal byte, no value bytes for that column). The trailing
`u8(prefix_len)` sentinel introduced in the storekey refactor
keeps the resulting key well-formed across present/absent
columns.
- Callers that index into `prefix_cols` updated to use
`.as_ref().expect(…)` where they assume presence.
Existing SS-3 test
`test_missing_prefix_col_treated_as_null_satisfies_alignment` in
`legacy_adapter.rs` gets an `assert_unique_rg_prefix_keys` call
verifying the adapter's output is consumable by the reader — pins
the "stack-coherent at #6425" property the F12 hop establishes.
Also incidental nightly-fmt cleanups in `sorted_series::append_prefix_col_to_key`
and the two-input fixture in `test_all_null_prefix_rg_groups_into_separate_region_sorted_last`.
The hardening PR (#6426) will be re-cascaded to drop the now-
duplicated F12 hunks (keeping its F8 adapter-rejects-unsorted +
F2 verifier-strength changes intact).
485 lib tests pass on this slice; workspace clippy + nightly fmt
clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(legacy-adapter): strip stale rg_partition_prefix_len when target=0
Codex P2 on PR #6425: when the legacy adapter is called with
`target_prefix_len == 0` it consolidates the input into a single
RG, but the previous version preserved the input's footer KVs
unchanged. If the input itself already carried a stale nonzero
`qh.rg_partition_prefix_len` claim (e.g., a prefix-aware split
being re-encoded through the legacy fallback path), the
single-RG output would still advertise that claim. Downstream
metadata extraction would take the prefix-aware path against an
RG carrying multiple first-prefix values — failing the PA-1
min/max alignment check on read despite the caller explicitly
asking for the legacy path.
Strip `PARQUET_META_RG_PARTITION_PREFIX_LEN` from `original_kv`
in the `target_prefix_len == 0` branch. Absence of the KV is
the legacy convention for "no alignment claim", matching the
existing `test_target_prefix_len_zero_passes_through_as_single_rg`
test's `prefix_kv.is_none()` assertion.
New regression test
`test_target_prefix_len_zero_strips_stale_prefix_kv_from_input`:
inputs a 2-RG file with `qh.rg_partition_prefix_len = "1"` AND
opens through adapter with `target_prefix_len = 0`; asserts the
re-encoded output has no prefix KV. Pre-fix this test caught the
leak; post-fix the stale value is dropped.
487 lib tests pass on the slice; clippy + nightly fmt clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>1 parent d388bc9 commit 54c4872
3 files changed
Lines changed: 966 additions & 53 deletions
File tree
- quickwit/quickwit-parquet-engine/src
- merge
- streaming
- storage
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
2789 | 2789 | | |
2790 | 2790 | | |
2791 | 2791 | | |
2792 | | - | |
| 2792 | + | |
| 2793 | + | |
| 2794 | + | |
| 2795 | + | |
2793 | 2796 | | |
2794 | 2797 | | |
2795 | 2798 | | |
| |||
Lines changed: 61 additions & 27 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
118 | 118 | | |
119 | 119 | | |
120 | 120 | | |
121 | | - | |
122 | | - | |
| 121 | + | |
| 122 | + | |
| 123 | + | |
| 124 | + | |
| 125 | + | |
| 126 | + | |
| 127 | + | |
| 128 | + | |
| 129 | + | |
| 130 | + | |
| 131 | + | |
| 132 | + | |
| 133 | + | |
123 | 134 | | |
124 | 135 | | |
125 | 136 | | |
126 | 137 | | |
127 | | - | |
128 | | - | |
| 138 | + | |
| 139 | + | |
129 | 140 | | |
130 | 141 | | |
131 | 142 | | |
| |||
145 | 156 | | |
146 | 157 | | |
147 | 158 | | |
| 159 | + | |
| 160 | + | |
148 | 161 | | |
149 | 162 | | |
150 | 163 | | |
151 | 164 | | |
152 | 165 | | |
153 | 166 | | |
154 | 167 | | |
155 | | - | |
156 | | - | |
157 | | - | |
158 | | - | |
159 | | - | |
160 | | - | |
161 | | - | |
162 | | - | |
163 | | - | |
| 168 | + | |
| 169 | + | |
| 170 | + | |
| 171 | + | |
| 172 | + | |
| 173 | + | |
| 174 | + | |
164 | 175 | | |
165 | 176 | | |
166 | 177 | | |
167 | 178 | | |
168 | 179 | | |
169 | 180 | | |
170 | | - | |
| 181 | + | |
171 | 182 | | |
172 | 183 | | |
173 | 184 | | |
174 | 185 | | |
175 | | - | |
| 186 | + | |
176 | 187 | | |
177 | 188 | | |
178 | 189 | | |
| |||
197 | 208 | | |
198 | 209 | | |
199 | 210 | | |
200 | | - | |
201 | | - | |
202 | | - | |
203 | | - | |
| 211 | + | |
| 212 | + | |
| 213 | + | |
| 214 | + | |
| 215 | + | |
| 216 | + | |
| 217 | + | |
204 | 218 | | |
205 | 219 | | |
206 | 220 | | |
207 | 221 | | |
208 | 222 | | |
209 | 223 | | |
210 | 224 | | |
211 | | - | |
| 225 | + | |
212 | 226 | | |
213 | 227 | | |
214 | 228 | | |
215 | 229 | | |
216 | | - | |
| 230 | + | |
| 231 | + | |
| 232 | + | |
| 233 | + | |
| 234 | + | |
| 235 | + | |
| 236 | + | |
| 237 | + | |
| 238 | + | |
217 | 239 | | |
218 | 240 | | |
219 | 241 | | |
| |||
575 | 597 | | |
576 | 598 | | |
577 | 599 | | |
578 | | - | |
579 | | - | |
580 | | - | |
581 | | - | |
| 600 | + | |
| 601 | + | |
| 602 | + | |
| 603 | + | |
| 604 | + | |
| 605 | + | |
| 606 | + | |
| 607 | + | |
| 608 | + | |
| 609 | + | |
| 610 | + | |
| 611 | + | |
| 612 | + | |
| 613 | + | |
| 614 | + | |
| 615 | + | |
582 | 616 | | |
583 | 617 | | |
584 | 618 | | |
| |||
600 | 634 | | |
601 | 635 | | |
602 | 636 | | |
603 | | - | |
604 | | - | |
| 637 | + | |
| 638 | + | |
605 | 639 | | |
606 | 640 | | |
607 | 641 | | |
| |||
0 commit comments