feat(legacy-adapter): prefix-aware output with caller-supplied target_prefix_len (#6425)

g-talbot · claude · g-talbot · commit 0c7de766d00f · 2026-05-20T21:42:32.000-04:00
* feat(legacy-adapter): synthesize prefix-aligned row groups

The legacy adapter previously consolidated multi-RG legacy inputs into a single oversized row group and left `rg_partition_prefix_len` at the original's (typically `0`). The streaming merge engine then sent these single-RG/prefix=0 inputs through the new sub-region splitting path. That is correct, but it forfeits the prefix-aware fast path for outputs derived from legacy inputs and gives up the row-group pruning that prefix alignment enables.

After consolidating, the adapter now slices the resulting record batch at first-sort-col transitions (typically `metric_name`) and emits one parquet row group per slice, stamping the re-encoded file with `qh.rg_partition_prefix_len = 1`. The merge engine then reads it through the prefix-aware fast path: one region per metric_name, and the existing duplicate-prefix invariant on read validates uniqueness.

Fallback: if the original file has no `qh.sort_fields` KV, the sort-fields string fails to parse, the first column can't be resolved in the arrow schema, or the consolidated batch is empty, the adapter reverts to a single-RG re-encode without claiming any prefix alignment. That input still works: the engine's prefix_len=0 sub-region splitting path picks it up. This keeps the adapter robust for files written by very early versions of the indexer that may pre-date the standard KV layout.

Implementation: `reencode_prefix_aligned` replaces `reencode_as_single_row_group` and either dispatches to the new multi-RG writer or to the legacy single-RG writer based on whether the first sort col is resolvable. `RowConverter` handles the prefix-value equality check uniformly across dictionary, utf8, and primitive types. The KV injection helper replaces (rather than appends) any existing `qh.rg_partition_prefix_len` so re-runs and files mistakenly carrying a stale value still land at the freshly synthesized prefix.

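The slicing step described above can be pictured with a minimal sketch. It assumes the first sort column is already available as a plain, sorted slice of keys; the real adapter works on an Arrow record batch and compares values through `RowConverter`. The name `split_at_transitions` is hypothetical and is not taken from the adapter.

```rust
use std::ops::Range;

/// Splits the row indices `0..keys.len()` into contiguous ranges, starting a
/// new range whenever the key differs from the previous row's key.
/// Hypothetical helper: each returned range stands for one output row group.
fn split_at_transitions<K: PartialEq>(keys: &[K]) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut start = 0;
    for row in 1..keys.len() {
        if keys[row] != keys[row - 1] {
            // The run of equal keys ends just before `row`.
            ranges.push(start..row);
            start = row;
        }
    }
    // Close the final run. An empty input yields no row groups at all.
    if !keys.is_empty() {
        ranges.push(start..keys.len());
    }
    ranges
}
```

For the keys `["cpu", "cpu", "disk", "mem", "mem", "mem"]` this yields the ranges `0..2`, `2..3`, and `3..6`, that is, one row group per metric.
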
Tests:
- `test_legacy_input_with_sort_fields_produces_prefix_aligned_multi_rg`: 3 metrics × 40 rows, multi-RG input → 3 prefix-aligned output RGs and `qh.rg_partition_prefix_len = 1` KV.
- `test_legacy_input_single_metric_yields_one_rg_with_prefix_kv`: one metric → one RG, prefix KV still stamped (vacuously aligned).
- `test_legacy_input_without_sort_fields_falls_back_to_single_rg`: fallback path preserved when sort-fields KV is missing.
- All existing tests pass unchanged (they use empty KVs or unparseable sort-fields strings, both of which exercise the fallback path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(legacy-adapter): parameterize on target_prefix_len with composite-prefix support

`LegacyInputAdapter::try_open` now takes `target_prefix_len: u32` chosen by the caller, matching the merge plan's consensus prefix length. The adapter slices the consolidated batch at every transition of the first N sort columns (composite key, via `RowConverter` over all N fields) and emits one output row group per slice, stamping the output with `qh.rg_partition_prefix_len = target_prefix_len`. With `target_prefix_len = 0` the adapter takes the original single-RG passthrough path with no prefix-alignment claim.

A sort column that is named in `qh.sort_fields` but missing from the file's arrow schema is treated as implicitly null at every row per SS-3. A constantly-null column trivially satisfies alignment on that column (null == null) and contributes no transitions, so the split boundaries are driven by the columns that are present. This matches the merge engine's compaction-time treatment of missing columns and keeps a legacy file with an evolved schema usable as a prefix-aligned input.

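A rough sketch of the composite-prefix split with the SS-3 null treatment follows. It models columns as a `HashMap` of string vectors instead of Arrow arrays, and `split_by_prefix` plus the unit struct standing in for `PrefixUnresolvable` are hypothetical names chosen for illustration. A request for more prefix columns than the sort schema names is rejected; a named column that has no data is simply `None` at every row.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Hypothetical stand-in for the adapter's `PrefixUnresolvable` error.
#[derive(Debug, PartialEq)]
struct PrefixUnresolvable;

/// Returns one row range per run of equal composite prefix values.
/// `columns` maps a column name to its per-row values; a sort column with no
/// entry is treated as null at every row (the SS-3 rule).
fn split_by_prefix(
    sort_fields: &[&str],
    columns: &HashMap<&str, Vec<&str>>,
    num_rows: usize,
    target_prefix_len: usize,
) -> Result<Vec<Range<usize>>, PrefixUnresolvable> {
    // Too few declared sort *names* to honor the request.
    if sort_fields.len() < target_prefix_len {
        return Err(PrefixUnresolvable);
    }
    if num_rows == 0 {
        return Ok(Vec::new());
    }
    // Resolve each prefix column; `None` marks a column absent from the data.
    let prefix: Vec<_> = sort_fields[..target_prefix_len]
        .iter()
        .map(|name| columns.get(*name))
        .collect();
    // Composite key of one row. An absent column contributes `None` at every
    // row, so it can never cause a transition.
    let key = |row: usize| {
        prefix
            .iter()
            .map(|col| col.map(|values| values[row]))
            .collect::<Vec<_>>()
    };
    let mut ranges = Vec::new();
    let mut start = 0;
    for row in 1..num_rows {
        if key(row) != key(row - 1) {
            ranges.push(start..row);
            start = row;
        }
    }
    ranges.push(start..num_rows);
    Ok(ranges)
}
```

With `target_prefix_len = 0` the key is empty at every row, so the sketch returns a single range covering the whole batch, which mirrors the single-RG passthrough.
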
`PrefixUnresolvable` now fires only on cases where the file doesn't advertise enough sort *names* to honor the request:
- `qh.sort_fields` absent or unparseable
- `qh.sort_fields` declares fewer sort columns than `target_prefix_len`

A column missing from the arrow schema no longer counts as unresolvable; the adapter materialises a `NullArray` of the batch's length in that slot and proceeds.

Tests:
- `test_target_prefix_len_zero_passes_through_as_single_rg`: explicit N=0 fallback, no prefix KV stamped.
- `test_target_prefix_len_two_splits_by_metric_and_service`: composite prefix (`metric_name`, `service`) → 4 RGs, KV declares prefix_len=2.
- `test_target_prefix_len_one_without_sort_fields_returns_unresolvable`: no `qh.sort_fields` KV → `PrefixUnresolvable`.
- `test_target_prefix_len_exceeds_declared_sort_cols_returns_unresolvable`: sort schema declares 2 cols, caller asks 3 → `PrefixUnresolvable`.
- `test_missing_prefix_col_treated_as_null_satisfies_alignment`: sort schema declares `metric_name|env|-timestamp_secs` but `env` is absent from the arrow schema → no error, only metric_name transitions split RGs, KV still stamps prefix_len=2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(legacy_adapter): note where reader-side SS-3 handling lands

Codex P2 on PR #6425: the adapter records `None` for missing prefix columns and stamps `rg_partition_prefix_len = target_prefix_len` anyway. In isolation that produces a file with an advertised prefix the current reader (`find_prefix_parquet_col_indices` on the #6425 state) bails on. The reader-side fix (returning `Vec<Option<PrefixColumn>>` and synthesizing a constant `[0x00, 0x00]` byte for `None` slots) lands in PR #6426 (the hardening slice, F12 from the adversarial review). The only caller of this adapter is `execute_merge_operation`, introduced in PR #6423 which sits above #6426 in the stack, so no production caller can produce a missing-column prefix until the reader fix is in place.

Adding the in-code pointer so a future reader bisecting the stack doesn't have to trace the relationship from scratch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(merge): consumer honors SS-3 (move F12 forward from #6426 to #6425)

Previously the F12 fix ("consumer side honors SS-3 missing prefix columns") lived in the hardening PR (#6426). At the #6425 isolation level, the legacy adapter records `None` for a prefix column absent from the parquet schema and stamps `rg_partition_prefix_len = target_prefix_len` on the output, but the reader's `find_prefix_parquet_col_indices` bails on any missing column. So #6425 + #6424 alone would produce a legacy-adapter file that the streaming-merge reader rejects mid-merge, i.e. a known-incoherent intermediate stack state.

Move F12 into this PR so the adapter and reader agree at the same slice:
- `find_prefix_parquet_col_indices` now returns `Result<Vec<Option<PrefixColumn>>>`. `Some(_)` when the column is present in the parquet schema; `None` per SS-3 when the column is named in `qh.sort_fields` but absent from the schema.
- `extract_rg_composite_prefix_key` skips `None` slots entirely (no ordinal byte, no value bytes for that column). The trailing `u8(prefix_len)` sentinel introduced in the storekey refactor keeps the resulting key well-formed across present/absent columns.
- Callers that index into `prefix_cols` updated to use `.as_ref().expect(…)` where they assume presence.

Existing SS-3 test `test_missing_prefix_col_treated_as_null_satisfies_alignment` in `legacy_adapter.rs` gets an `assert_unique_rg_prefix_keys` call verifying the adapter's output is consumable by the reader, which pins the "stack-coherent at #6425" property the F12 hop establishes.

Also incidental nightly-fmt cleanups in `sorted_series::append_prefix_col_to_key` and the two-input fixture in `test_all_null_prefix_rg_groups_into_separate_region_sorted_last`.

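To make the skip-plus-sentinel rule concrete, here is a simplified sketch of how such a key could be composed. It is not the storekey encoding: the per-value encoding (raw bytes followed by a `0x00` terminator) is a placeholder assumption, descending columns are ignored, and `composite_prefix_key` is a hypothetical name. Only the shape matches the description above: an ordinal byte per present column, nothing for a skipped column, and a trailing byte equal to the prefix length.

```rust
/// Builds a simplified composite prefix key for one row group.
/// `values[i]` is the constant value of prefix column `i` in that row group,
/// or `None` when the column is absent from the schema or all-null.
fn composite_prefix_key(values: &[Option<&[u8]>]) -> Vec<u8> {
    let mut key = Vec::new();
    for (ordinal, value) in values.iter().enumerate() {
        // Skipped slot: no ordinal byte and no value bytes are written.
        let Some(bytes) = value else {
            continue;
        };
        key.push(ordinal as u8);
        key.extend_from_slice(bytes);
        // Placeholder terminator (assumption), not the storekey encoding.
        key.push(0x00);
    }
    // Trailing sentinel: the prefix length, which is larger than any ordinal
    // byte a present prefix column can write.
    key.push(values.len() as u8);
    key
}
```

With a two-column prefix, a row group holding `cpu` and `prod` gets ordinal byte `1` right after the first column, while a row group whose second column is skipped gets the sentinel `2` at that position. Under this sketch the skipped key therefore compares greater, which is the nulls-last ordering the doc comments describe.
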
The hardening PR (#6426) will be re-cascaded to drop the now-duplicated F12 hunks (keeping its F8 adapter-rejects-unsorted + F2 verifier-strength changes intact).

485 lib tests pass on this slice; workspace clippy + nightly fmt clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(legacy-adapter): strip stale rg_partition_prefix_len when target=0

Codex P2 on PR #6425: when the legacy adapter is called with `target_prefix_len == 0` it consolidates the input into a single RG, but the previous version preserved the input's footer KVs unchanged. If the input itself already carried a stale nonzero `qh.rg_partition_prefix_len` claim (e.g., a prefix-aware split being re-encoded through the legacy fallback path), the single-RG output would still advertise that claim. Downstream metadata extraction would take the prefix-aware path against an RG carrying multiple first-prefix values, failing the PA-1 min/max alignment check on read despite the caller explicitly asking for the legacy path.

Strip `PARQUET_META_RG_PARTITION_PREFIX_LEN` from `original_kv` in the `target_prefix_len == 0` branch. Absence of the KV is the legacy convention for "no alignment claim", matching the existing `test_target_prefix_len_zero_passes_through_as_single_rg` test's `prefix_kv.is_none()` assertion.

New regression test `test_target_prefix_len_zero_strips_stale_prefix_kv_from_input`: inputs a 2-RG file with `qh.rg_partition_prefix_len = "1"` AND opens through adapter with `target_prefix_len = 0`; asserts the re-encoded output has no prefix KV. Pre-fix this test caught the leak; post-fix the stale value is dropped.

487 lib tests pass on the slice; clippy + nightly fmt clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
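The footer-KV handling from the last fix can be summarised in a small sketch. It assumes the footer metadata is a plain list of string pairs, whereas the parquet crate uses its own key-value type. `stamp_prefix_len` and `PREFIX_LEN_KEY` are hypothetical names; the real code refers to the key through `PARQUET_META_RG_PARTITION_PREFIX_LEN`.

```rust
/// Footer key carrying the alignment claim (assumed to match the string shown
/// in the commit message).
const PREFIX_LEN_KEY: &str = "qh.rg_partition_prefix_len";

/// Returns the footer KVs for the re-encoded file. Any existing prefix-length
/// claim is dropped first, so a stale value can never survive. A new claim is
/// stamped only when the caller asked for a nonzero prefix length.
fn stamp_prefix_len(
    mut kvs: Vec<(String, String)>,
    target_prefix_len: u32,
) -> Vec<(String, String)> {
    // Replace rather than append: remove every earlier claim.
    kvs.retain(|(key, _)| key.as_str() != PREFIX_LEN_KEY);
    if target_prefix_len > 0 {
        kvs.push((PREFIX_LEN_KEY.to_string(), target_prefix_len.to_string()));
    }
    // With target_prefix_len == 0 the key stays absent, which is the legacy
    // convention for "no alignment claim".
    kvs
}
```

Removing before stamping handles both cases with one code path: a re-run with a nonzero target ends up with exactly one claim, and a target of zero ends up with none.
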
diff --git a/quickwit/quickwit-parquet-engine/src/merge/streaming.rs b/quickwit/quickwit-parquet-engine/src/merge/streaming.rs
@@ -2789,7 +2789,10 @@ mod tests {
                 .expect("resolve");
         // Sanity: the second prefix column must be flagged DESC.
         assert!(
-            prefix_cols[1].descending,
+            prefix_cols[1]
+                .as_ref()
+                .expect("env present in this fixture")
+                .descending,
             "env must be parsed as DESC from sort schema",
         );
 
diff --git a/quickwit/quickwit-parquet-engine/src/merge/streaming/region_grouping.rs b/quickwit/quickwit-parquet-engine/src/merge/streaming/region_grouping.rs
@@ -118,14 +118,25 @@ pub(crate) struct PrefixColumn {
 
 /// Resolve the first `prefix_len` sort columns to parquet leaf
 /// indices. Honours the legacy `timestamp` → `timestamp_secs` alias.
-/// Errors if the sort schema has fewer columns than `prefix_len` or
-/// if any column is missing from the parquet schema.
+///
+/// Returns one entry per requested prefix column. `Some(PrefixColumn)`
+/// when the column is present in the parquet schema; `None` when the
+/// column is named in `sort_fields_str` but absent from the parquet
+/// schema. Per SS-3 the missing column is treated as constant null at
+/// every row of the file: [`extract_rg_composite_prefix_key`] skips
+/// that slot entirely (no ordinal byte, no value bytes), so ordering
+/// is driven entirely by the present columns.
+///
+/// Errors only when the sort schema declares fewer columns than
+/// requested — that means we don't have a *name* for one of the
+/// prefix columns and can't claim alignment on something we can't
+/// identify.
 pub(crate) fn find_prefix_parquet_col_indices(
     metadata: &ParquetMetaData,
     sort_fields_str: &str,
     prefix_len: usize,
-    input_idx: usize,
-) -> Result<Vec<PrefixColumn>> {
+    _input_idx: usize,
+) -> Result<Vec<Option<PrefixColumn>>> {
     let sort_field_schema = parse_sort_fields(sort_fields_str)?;
     if sort_field_schema.column.len() < prefix_len {
         bail!(
@@ -145,34 +156,34 @@ pub(crate) fn find_prefix_parquet_col_indices(
         } else {
             sort_col.name.as_str()
         };
+        let descending = sort_col.sort_direction
+            == quickwit_proto::sortschema::SortColumnDirection::SortDirectionDescending as i32;
         let mut found = None;
         for (col_idx, col) in parquet_schema.columns().iter().enumerate() {
             if col.path().parts()[0] == resolved {
                 found = Some(col_idx);
                 break;
             }
         }
-        let parquet_col_idx = found.ok_or_else(|| {
-            anyhow!(
-                "input {input_idx} parquet schema is missing prefix sort column '{}' (position \
-                 {pos})",
-                sort_col.name,
-            )
-        })?;
-        let descending = sort_col.sort_direction
-            == quickwit_proto::sortschema::SortColumnDirection::SortDirectionDescending as i32;
+        // SS-3: missing column → `None`. The composite-key extractor
+        // skips this slot entirely (no ordinal byte, no value bytes);
+        // the trailing prefix-length sentinel in
+        // `extract_rg_composite_prefix_key` ensures the resulting key
+        // still sorts cleanly relative to RGs with present values
+        // (and matches sorted_series's row-level null-skip).
+        //
         // Ordinal matches the column's position in `qh.sort_fields`.
         // For prefix cols (always the first `prefix_len` entries of
         // the sort schema) the ordinal equals the iteration index
         // `pos`, which is also the ordinal `sorted_series` would
         // assign — so the per-RG prefix key composes as a literal
         // byte prefix of every sorted_series key.
-        prefix_cols.push(PrefixColumn {
+        prefix_cols.push(found.map(|parquet_col_idx| PrefixColumn {
             name: sort_col.name.clone(),
             parquet_col_idx,
             descending,
             ordinal: pos as u8,
-        });
+        }));
     }
     Ok(prefix_cols)
 }
@@ -197,23 +208,34 @@ fn parquet_has_column(
 /// in this RG.
 ///
 /// Null handling:
-/// - **All-null RG on a prefix column**: the column is skipped entirely (the next column's higher
-///   ordinal byte appears in its place), so the RG sorts after any RG carrying a non-null value for
-///   this column. This mirrors the row-level convention in `sorted_series` and gives nulls-last
-///   ordering for free.
+/// - **Column absent from schema (`None` in `prefix_cols`)**: SS-3 case. Every row of the file has
+///   a constant null in this slot, so the contribution to the composite is empty (column skipped).
+///   The trailing prefix-length sentinel keeps the resulting key well-formed.
+/// - **All-null RG on a present prefix column**: column skipped for this RG (the next column's
+///   higher ordinal byte — or the trailing sentinel — appears in its place), so the RG sorts after
+///   any RG carrying a non-null value for this column. Mirrors the row-level convention in
+///   `sorted_series` and gives nulls-last ordering for free.
 /// - **Mixed null + non-null in one RG**: rows in the RG would encode to two distinct prefix keys
 ///   (the non-null value's key and the column-skipped key), breaking the
 ///   at-most-one-prefix-value-per-RG invariant (PA-1). Reject.
 /// - **No nulls**: standard `min == max` check on stats, then encode that single value.
 pub(crate) fn extract_rg_composite_prefix_key(
     metadata: &ParquetMetaData,
     rg_idx: usize,
-    prefix_cols: &[PrefixColumn],
+    prefix_cols: &[Option<PrefixColumn>],
     input_idx: usize,
 ) -> Result<Vec<u8>> {
     let rg_meta = metadata.row_group(rg_idx);
     let mut key = Vec::new();
-    for col in prefix_cols {
+    for col_opt in prefix_cols {
+        let Some(col) = col_opt else {
+            // SS-3 implicit null: column absent from schema, so every
+            // row's value is null. Skip the slot entirely — the
+            // trailing prefix-length sentinel will keep this from
+            // colliding with present-value keys, and sorted_series
+            // applies the same "skip null cols" rule at the row level.
+            continue;
+        };
         let chunk = rg_meta.column(col.parquet_col_idx);
         let stats = chunk.statistics().ok_or_else(|| {
             anyhow!(
@@ -575,10 +597,22 @@ pub(crate) fn extract_regions_from_metadata(
         .collect())
 }
 
-/// Post-write check: verify the parquet file at `metadata` has no two
-/// row groups sharing the same composite prefix key, for the first
-/// `prefix_len` sort columns. Returns `Ok(())` immediately if
-/// `prefix_len == 0` (no alignment claim).
+/// Post-write check: verify every row group in `metadata` satisfies
+/// the prefix-alignment claim declared by `prefix_len`.
+///
+/// Enforces both halves of the prefix-alignment contract in one pass:
+/// - **PA-1 (intra-RG constancy):** within each RG, each of the first `prefix_len` sort columns has
+///   `min == max` (the column is constant across the RG). This is checked transitively by
+///   [`extract_rg_composite_prefix_key`] — it returns an error when any prefix column's chunk stats
+///   show `min != max`.
+/// - **PA-3 (inter-RG uniqueness):** no two RGs share the same composite prefix value. The
+///   streaming engine pairs at most one input RG per region per prefix value, so a duplicate would
+///   silently drop rows or corrupt the body-col / sort-col mapping.
+///
+/// Returns `Ok(())` immediately when `prefix_len == 0` (no claim to
+/// verify) or `num_rgs == 0` (no RGs to check). Single-RG files are
+/// NOT short-circuited — they still go through PA-1 because an
+/// unsorted single-RG file CAN have `min != max` on a prefix column.
 ///
 /// This is the writer-side mirror of the read-side check in
 /// `extract_regions_from_metadata` — both indexing and the compaction
@@ -600,8 +634,8 @@ pub(crate) fn assert_unique_rg_prefix_keys(
         return Ok(());
     }
     let num_rgs = metadata.num_row_groups();
-    if num_rgs <= 1 {
-        // Single-RG (or zero-RG) files vacuously satisfy the invariant.
+    if num_rgs == 0 {
+        // Zero-RG files vacuously satisfy both halves of the claim.
         return Ok(());
     }
     let prefix_cols =
diff --git a/quickwit/quickwit-parquet-engine/src/storage/legacy_adapter.rs b/quickwit/quickwit-parquet-engine/src/storage/legacy_adapter.rs