Commit b3d4971
feat(merge): legacy promotion path + body-col schema evolution (#6423)
* fix(merge): adapter rejects unsorted input; consumer honors SS-3; stronger test verifiers
Three adversarial-review findings on the prefix/RG machinery, bundled
because they touch the same producer/consumer contract:
**F8: Legacy adapter rejects SS-1-violating input upfront.**
The adapter walked rows in physical order and emitted one RG per
prefix-value run. An unsorted legacy input (rows `[A,A,B,B,A,A]`)
produced a 3-RG file where two RGs shared prefix `A`, violating PA-3.
The streaming merge engine would later reject it mid-merge — but only
after a quietly-bad file had been built. Now `compute_prefix_value_slices`
tracks each slice's composite prefix-value bytes and bails with
`LegacyAdapterError::InputNotSorted` on duplicates, surfacing the
SS-1 violation before any file lands on disk.
**F12: Consumer-side SS-3 (cross-layer divergence, discovered while
wiring F2's chunk-level verifier into the SS-3 test).** The adapter
implements SS-3 correctly (missing-from-schema → synthesized NullArray
during slice computation, file stamps `prefix_len = N`). The streaming
engine's reader did not: `find_prefix_parquet_col_indices` hard-required
every named prefix column to be physically present, so a file the
adapter produced from an SS-3 input was unreadable by the merge engine.
Now `find_prefix_parquet_col_indices` returns `Vec<Option<PrefixColumn>>`
and `extract_rg_composite_prefix_key` emits a constant null marker
(`encode_byte_array_prefix(&[])`) for None slots. The column contributes
no cross-RG ordering signal (constant everywhere) so region boundaries
are driven entirely by the present columns. Both halves of SS-3 now
agree end-to-end.
Known limitation: cross-file SS-3 — where some inputs have a sort
column and others don't — uses [0x00, 0x00] for the null contribution,
which sorts BEFORE non-null per the encoded-empty-string convention.
That weakly violates SS-2 (nulls sort last). Single-file SS-3 is
correct because every RG in such a file contributes the same constant.
If cross-file SS-3 becomes a production scenario, the encoding needs
a leading-0xff sentinel instead. Not exercised today.
**F2/F9/F11: Wire `assert_unique_rg_prefix_keys` into prefix-claiming
tests.** Tests asserting `num_row_groups == N` + KV stamped to N would
have passed even with an off-by-one in slice-boundary detection or
column-content scrambling. The verifier reads chunk-level statistics
directly: PA-1 (intra-RG `min == max`) + PA-3 (inter-RG uniqueness)
on the composite key. Wired into six tests:
- streaming engine: `test_streaming_merge_with_prefix_len_two`,
`test_multi_rg_metric_aligned_input_produces_multi_rg_output`,
`test_streaming_merge_with_desc_prefix_col`
- legacy adapter: `test_target_prefix_len_two_splits_by_metric_and_service`,
`test_legacy_input_with_sort_fields_produces_prefix_aligned_multi_rg`,
`test_missing_prefix_col_treated_as_null_satisfies_alignment` (now
passes thanks to F12).
Also: `assert_unique_rg_prefix_keys` no longer short-circuits on
single-RG files — they still go through PA-1 because an unsorted
single-RG file CAN have `min != max` on a prefix column.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(merge): legacy-prefix promotion path + schema-evolution body cols
Two adversarial-review follow-ups grouped because they share the
streaming engine's input-routing and union-schema seams.
## (b) Legacy-prefix promotion
A new operation type pairs a prefix_len=0 split with prefix_len>0
peers in one merge, so legacy splits can be folded into prefix-
aligned buckets instead of aging out via retention. Adds:
- `ParquetMergeOperation::promote_legacy(splits, target_prefix_len)`: relaxes MP-3 to allow mixed
`rg_partition_prefix_len` as long as every input is `<= target`. Sort_fields + window equality
unchanged.
- `ParquetMergeOperation::target_prefix_len_override: Option<u32>` field records the promotion
target; `None` is the default regular-merge form.
- `merge_parquet_split_metadata(..., mixed_prefix_ok)`: skips the input-side prefix-len equality
check in promotion mode. The output prefix_len still comes from the writer's KV stamp via
`MergeOutputFile.output_rg_partition_prefix_len` (CS-1 holds by construction post-F1).
- `merge::execute_merge_operation(op, sources, ...)`: new thin executor that opens each input as
either `LegacyInputAdapter` (when `split.rg_partition_prefix_len < target`) or
`StreamingParquetReader` (otherwise), then feeds them to the streaming engine. Becomes the seam
PR-7 will wire from above.
Tests:
- `test_promote_legacy_pairs_legacy_with_aligned_peer`, `test_promote_legacy_rejects_higher_prefix_input`,
`test_promote_legacy_still_enforces_sort_fields`, `test_promote_legacy_all_at_target_is_valid`.
- `test_mixed_prefix_ok_skips_input_equality_check`.
- `test_promote_legacy_executor_end_to_end`: legacy single-RG + aligned multi-RG → 3-RG output
passing `assert_unique_rg_prefix_keys` with `prefix_len = 1`, plus metastore CS-1.
- `test_executor_mismatched_sources_count_bails`.
## F6 + F13: Schema evolution for body columns
The merger now supports MC-4 across heterogeneous body-col schemas:
- F6: `normalize_type` collapses `Binary`/`LargeBinary` (and dict variants) to `Binary`, analogous
to the existing string-flavour collapse. Two inputs whose body col differs only by byte-array
flavour merge cleanly; before this they hit a "type conflict" at alignment time.
- F13: `streaming_writer.rs::write_list_via_serialized_column_writer` (renamed from
`..._non_nullable_...`) now handles nullable outer `List<T>` / `LargeList<T>`. MC-4 forces the
union to be nullable when a List col is present in only some inputs; before this the writer
rejected the merged output. Uses Dremel max_def_level = 2 (0 = outer null, 1 = empty list, 2 =
element present) for nullable outer; non-nullable path unchanged.
Test: `test_mc2_mixed_schemas_round_trip` builds two inputs A and B
with the same sort schema but different body cols (Utf8 vs
Dict<Utf8>, LargeBinary vs Binary, List<Float64> in A only, Int32
A-only, Int64 B-only, common Float64). The merge produces the
union schema; per-row rendering via `render_cell` matches across
flavour boundaries; List cells from B render as nulls.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* style(indexing): re-fmt parquet_merge_executor to latest nightly rustfmt
Same nightly-rustfmt drift as the storekey commit on #6424
(local nightly 2026-05-11 vs CI's 2026-05-17): the `mixed_prefix_ok`
binding and the `merge_parquet_split_metadata` call now fit
single-line under the newer width heuristics. No behaviour change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(merge-executor): route promotion merges through execute_merge_operation
Codex P1 on PR #6423: the executor unconditionally called the
in-memory `merge_sorted_parquet_files` path, which routes through
`extract_and_validate_input_metadata` and bails on mixed
`qh.rg_partition_prefix_len` before any output is produced. So a
real promotion merge — `prefix_len = 0` plus `prefix_len = 1` with
`target_prefix_len_override = Some(1)` — failed before reaching the
downstream `mixed_prefix_ok` plumbing in
`merge_parquet_split_metadata`. The escape hatch existed but was
unreachable for actual promotion inputs.
Fix: branch in the executor's handle on
`target_prefix_len_override.is_some()`. Promotion merges go through
the engine's streaming entry point
`quickwit_parquet_engine::merge::execute_merge_operation`, which
opens each below-target input via `LegacyInputAdapter` and each
at-target input directly. The streaming merge then sees a
homogeneous stream advertising `prefix_len = target` on every
input. Regular (non-promotion) merges keep the in-memory path.
`execute_merge_operation` expects `Vec<Arc<dyn RemoteByteSource>>`
parallel to `op.splits` — the engine deliberately doesn't depend
on `quickwit-storage` (would invert layering and pull cloud SDKs
into a pure parquet library). So this commit adds
`LocalFileByteSource`: a tiny `RemoteByteSource` impl backed by
`tokio::fs::File`, one instance per downloaded split, each bound
to its scratch-directory path. The `path: &Path` argument on the
trait surface is ignored — the downloader has already resolved
each split to a concrete local file before the executor runs.
Coverage:
- Library-level: `quickwit-parquet-engine::merge::streaming::tests::test_promote_legacy_executor_end_to_end`
already exercises `execute_merge_operation` with a
`prefix_len = 0` + `prefix_len = 1` pair, verifying the output
advertises `prefix_len = 1` and passes PA-1 + PA-3 on the
composite key. That's now the same code path the in-tree
executor takes.
- Module doc on the executor rewritten to spell out which path runs
when.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(adr): track legacy promotion planner gap as GAP-011
The streaming Parquet merge stack landing in #6424–#6428 ships the
full legacy-promotion *mechanism* (engine + adapter + executor
wiring) but not the planner-level *trigger*. In production today,
`MergePolicyState::record_split` buckets by
`CompactionScope::from_split` which includes
`rg_partition_prefix_len`, so legacy (prefix=0) and aligned
(prefix>0) splits are separated before `ParquetMergePolicy::operations`
runs. The policy only emits `ParquetMergeOperation::new`; a
repo-wide search finds `promote_legacy` only in tests. Legacy
splits therefore never migrate without an explicit trigger.
Tracking this as GAP-011 so we pick it up at the right time. The
gap doc walks three resolution options (merge buckets in the scope
key, dedicated promotion pass, or hybrid prefer-multi-input-promotion)
and the cost trade-offs between them, so the eventual implementation
PR has a starting point.
Raised by Codex review comment id 4311184497 on PR #6423.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(adr): track download-vs-streaming merge executor gap as GAP-012
The Parquet streaming merge engine is built around `RemoteByteSource`
and was designed to pull pages directly from object storage — two
GETs per input, overlap fetch with merge, no scratch disk. The
production actor pipeline doesn't take that path: a downloader actor
materializes every input on local disk first, and the executor wraps
the local files in a `LocalFileByteSource` to feed `execute_merge_operation`
(or just calls the in-memory `merge_sorted_parquet_files` path). The
streaming engine's central design benefit is unused.
This isn't a correctness bug — both paths give the same result. It's
a perf/architecture gap: every merge pays 2× I/O per input
(network → scratch + scratch → merger), serializes phases
(`max(input download time)` first-byte latency), and consumes scratch
disk that scales with concurrent merges.
Tracking as GAP-012 so we pick it up at the right time. The gap doc
walks four options (stream-directly with download fallback, stream-
by-default with circuit breaker, eliminate in-memory path only,
stream-directly for promotion merges only) and the trade-offs between
them — including the mid-merge retry surface, which is the main
reason download-first is the current default.
Surfaced during PR #6423 code walkthrough.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>1 parent 71e5c6a commit b3d4971
10 files changed
Lines changed: 1344 additions & 128 deletions
File tree
- docs/internals/adr/gaps
- quickwit
- quickwit-indexing/src/actors/parquet_pipeline
- quickwit-parquet-engine/src
- merge
- policy
- storage
Lines changed: 136 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
| 93 | + | |
| 94 | + | |
| 95 | + | |
| 96 | + | |
| 97 | + | |
| 98 | + | |
| 99 | + | |
| 100 | + | |
| 101 | + | |
| 102 | + | |
| 103 | + | |
| 104 | + | |
| 105 | + | |
| 106 | + | |
| 107 | + | |
| 108 | + | |
| 109 | + | |
| 110 | + | |
| 111 | + | |
| 112 | + | |
| 113 | + | |
| 114 | + | |
| 115 | + | |
| 116 | + | |
| 117 | + | |
| 118 | + | |
| 119 | + | |
| 120 | + | |
| 121 | + | |
| 122 | + | |
| 123 | + | |
| 124 | + | |
| 125 | + | |
| 126 | + | |
| 127 | + | |
| 128 | + | |
| 129 | + | |
| 130 | + | |
| 131 | + | |
| 132 | + | |
| 133 | + | |
| 134 | + | |
| 135 | + | |
| 136 | + | |
Lines changed: 164 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
| 93 | + | |
| 94 | + | |
| 95 | + | |
| 96 | + | |
| 97 | + | |
| 98 | + | |
| 99 | + | |
| 100 | + | |
| 101 | + | |
| 102 | + | |
| 103 | + | |
| 104 | + | |
| 105 | + | |
| 106 | + | |
| 107 | + | |
| 108 | + | |
| 109 | + | |
| 110 | + | |
| 111 | + | |
| 112 | + | |
| 113 | + | |
| 114 | + | |
| 115 | + | |
| 116 | + | |
| 117 | + | |
| 118 | + | |
| 119 | + | |
| 120 | + | |
| 121 | + | |
| 122 | + | |
| 123 | + | |
| 124 | + | |
| 125 | + | |
| 126 | + | |
| 127 | + | |
| 128 | + | |
| 129 | + | |
| 130 | + | |
| 131 | + | |
| 132 | + | |
| 133 | + | |
| 134 | + | |
| 135 | + | |
| 136 | + | |
| 137 | + | |
| 138 | + | |
| 139 | + | |
| 140 | + | |
| 141 | + | |
| 142 | + | |
| 143 | + | |
| 144 | + | |
| 145 | + | |
| 146 | + | |
| 147 | + | |
| 148 | + | |
| 149 | + | |
| 150 | + | |
| 151 | + | |
| 152 | + | |
| 153 | + | |
| 154 | + | |
| 155 | + | |
| 156 | + | |
| 157 | + | |
| 158 | + | |
| 159 | + | |
| 160 | + | |
| 161 | + | |
| 162 | + | |
| 163 | + | |
| 164 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
115 | 115 | | |
116 | 116 | | |
117 | 117 | | |
| 118 | + | |
| 119 | + | |
0 commit comments