quickwit-oss
diff --git a/docs/internals/adr/gaps/011-no-legacy-promotion-planner.md b/docs/internals/adr/gaps/011-no-legacy-promotion-planner.md
Lines changed: 136 additions & 0 deletions
diff --git a/docs/internals/adr/gaps/012-merge-downloads-instead-of-streaming.md b/docs/internals/adr/gaps/012-merge-downloads-instead-of-streaming.md
Lines changed: 164 additions & 0 deletions
diff --git a/docs/internals/adr/gaps/README.md b/docs/internals/adr/gaps/README.md
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,136 @@
+# GAP-011: No Planner-Level Legacy Promotion
+
+**Status**: Open
+**Discovered**: 2026-05-18
+**Context**: Codex review on PR #6423 (`feat(merge): legacy promotion path + body-col schema evolution`) flagged that the promotion path is wired end-to-end at the library + executor layer but has no production trigger at the planner / policy level.
+
+## Problem
+
+The streaming Parquet merge stack now contains a complete *legacy promotion* pipeline:
+
+- `ParquetMergeOperation::promote_legacy(splits, target_prefix_len)` constructs an operation with
+  `target_prefix_len_override = Some(target)`.
+- `merge::execute_merge_operation` routes each input through `LegacyInputAdapter` when its
+  declared `rg_partition_prefix_len < target` and through `StreamingParquetReader` otherwise. The
+  streaming engine then sees a homogeneous stream advertising `prefix_len = target` on every
+  input.
+- `ParquetMergeExecutor` (in `quickwit-indexing`) detects `target_prefix_len_override.is_some()`
+  and routes those merges through `execute_merge_operation` (with `LocalFileByteSource`) instead
+  of the in-memory `merge_sorted_parquet_files` path.
+- `merge_parquet_split_metadata` accepts a `mixed_prefix_ok: bool` flag so the post-merge
+  aggregator skips the input-side equality check.
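
The two routing decisions in the list above can be sketched as plain functions. This is a minimal illustration, not the code from PR #6423: the enums `InputRoute` and `ExecutorPath` and both function names are hypothetical; only the conditions (`rg_partition_prefix_len < target`, `target_prefix_len_override.is_some()`) come from the description above.

```rust
/// Hypothetical label for which reader wraps a merge input.
#[derive(Debug, PartialEq, Eq)]
pub enum InputRoute {
    /// Stands in for `LegacyInputAdapter`.
    LegacyAdapter,
    /// Stands in for `StreamingParquetReader`.
    StreamingReader,
}

/// Hypothetical label for which merge implementation the executor runs.
#[derive(Debug, PartialEq, Eq)]
pub enum ExecutorPath {
    /// Stands in for `execute_merge_operation` over `LocalFileByteSource`.
    StreamingEngine,
    /// Stands in for the in-memory `merge_sorted_parquet_files` path.
    InMemoryMerge,
}

/// Inputs whose declared prefix length is below the target go through the
/// legacy adapter; everything else is read by the streaming reader.
pub fn route_input(declared_prefix_len: u32, target_prefix_len: u32) -> InputRoute {
    if declared_prefix_len < target_prefix_len {
        InputRoute::LegacyAdapter
    } else {
        InputRoute::StreamingReader
    }
}

/// Only operations carrying a prefix-length override take the streaming path.
pub fn executor_path(target_prefix_len_override: Option<u32>) -> ExecutorPath {
    match target_prefix_len_override {
        Some(_) => ExecutorPath::StreamingEngine,
        None => ExecutorPath::InMemoryMerge,
    }
}
```

Under these assumptions, `route_input(0, 1)` selects the legacy adapter and `executor_path(None)` selects the in-memory merge, which is the path every production merge takes today.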
+
+What's missing: **nothing in the planner ever creates a `promote_legacy` operation in
+production**. `MergePolicyState::record_split` buckets each split by
+`CompactionScope::from_split`, and that scope key includes `rg_partition_prefix_len`. Legacy
+splits (`prefix_len = 0`) and aligned splits (`prefix_len > 0`) therefore land in *different*
+buckets before `ParquetMergePolicy::operations` ever runs. The production policy then iterates
+each bucket independently and emits `ParquetMergeOperation::new` (regular merge). A repo-wide
+search finds `promote_legacy` only in tests.
+
+In a mixed deployment (legacy + aligned splits coexisting), legacy splits therefore stay in
+their `prefix_len = 0` bucket forever — never gaining the prefix alignment that downstream
+locality compaction depends on. The promotion plumbing is reachable only from tests.
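
A minimal sketch of why the bucketing isolates legacy splits, assuming a simplified scope key. `SplitMeta`, `ScopeKey`, and `bucket_splits` are hypothetical stand-ins for the real split metadata, `CompactionScope`, and `MergePolicyState::record_split`; the field set is an illustration, not the actual struct layout.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified split metadata (not the real Quickwit type).
#[derive(Debug, Clone)]
pub struct SplitMeta {
    pub split_id: String,
    pub sort_fields: String,
    pub window_start: i64,
    pub merge_level: u32,
    pub rg_partition_prefix_len: u32,
}

/// Hypothetical scope key. Because the prefix length is part of the key,
/// two splits that differ only in prefix length never share a bucket.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct ScopeKey {
    sort_fields: String,
    window_start: i64,
    merge_level: u32,
    rg_partition_prefix_len: u32,
}

impl ScopeKey {
    pub fn from_split(split: &SplitMeta) -> Self {
        ScopeKey {
            sort_fields: split.sort_fields.clone(),
            window_start: split.window_start,
            merge_level: split.merge_level,
            rg_partition_prefix_len: split.rg_partition_prefix_len,
        }
    }
}

/// Groups split ids by scope key, mirroring the bucketing step described above.
pub fn bucket_splits(splits: &[SplitMeta]) -> BTreeMap<ScopeKey, Vec<String>> {
    let mut buckets: BTreeMap<ScopeKey, Vec<String>> = BTreeMap::new();
    for split in splits {
        buckets
            .entry(ScopeKey::from_split(split))
            .or_default()
            .push(split.split_id.clone());
    }
    buckets
}
```

With one `prefix_len = 0` split and two `prefix_len = 1` splits that are otherwise identical, this produces two buckets, so a policy that iterates buckets independently never sees the mixed set.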
+
+## Evidence
+
+- `quickwit-parquet-engine/src/merge/policy/mod.rs`: `ParquetMergePolicy::operations` calls
+  `ParquetMergeOperation::new(...)` only. `promote_legacy` is constructed only by tests in the
+  same file.
+- `MergePolicyState::record_split` keys its `BTreeMap` by `CompactionScope::from_split`. The
+  scope derivation includes `rg_partition_prefix_len`, so a legacy split and a prefix-aligned
+  split with otherwise identical sort fields / window / merge level are never compared by the
+  policy.
+- The executor branch added in PR #6423 (`scratch.merge_operation.target_prefix_len_override
+  .is_some()`) routes promotion through `execute_merge_operation`. Library coverage at
+  `test_promote_legacy_executor_end_to_end` exercises a `prefix_len = 0` + `prefix_len = 1` pair
+  successfully. But that operation is only ever constructed inside the test.
+
+## State of the Art
+
+- **Iceberg**: Compaction policies inspect file-level metadata (partitioning, sort order) and
+  can rewrite files to align with the latest table partitioning even when individual files
+  pre-date the change. The compaction service treats schema-evolution-style rewrites as
+  first-class operations.
+- **Husky**: Background re-organization passes that promote files into newer storage layouts.
+  Tracked separately from the size-tiered compaction policy so cost trade-offs can be tuned.
+
+In both cases, the design separates the *trigger* (decision to promote) from the *mechanism*
+(how the promotion is performed). Quickwit currently has the mechanism but not the trigger.
+
+## Potential Solutions
+
+### Option A: Merge legacy + aligned buckets in `CompactionScope::from_split`
+
+Drop `rg_partition_prefix_len` from the scope key (or normalize it to a target value before
+bucketing). The policy then sees legacy and aligned splits as candidates for the same
+compaction operation and `ParquetMergePolicy::operations` decides whether to emit a regular
+merge or a `promote_legacy` operation based on whether the bucket contains mixed prefix
+lengths.
+
+Simplest change, but requires the policy to detect mixed-prefix buckets and choose between
+`new` and `promote_legacy` per operation.
+
+### Option B: Dedicated promotion pass
+
+Run a separate pass before the regular compaction policy that scans for legacy splits and emits
+`promote_legacy` operations for them. The regular policy then sees only aligned splits.
+
+Cleaner separation of concerns, but means legacy splits are migrated *before* any opportunity
+to coalesce them with aligned neighbors in a single multi-input merge — possibly more work
+overall.
+
+### Option C: Hybrid — bucket together, prefer single-pass promotion
+
+Keep scope bucketing as in Option A. Inside the policy, when a bucket contains mixed prefix
+lengths AND has enough splits to merit a multi-input merge, emit `promote_legacy`. When only
+legacy splits exist (no aligned neighbor), emit `promote_legacy` with the same target:
+single-input promotion is still valuable because it converts the file to the new format for
+future locality compaction.
+
+Most flexible; gives the policy the freedom to amortize promotion cost when there are aligned
+neighbors AND to still promote isolated legacy splits in the background.
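
One way the Option C decision could look, as a sketch under stated assumptions. `PlannedOp`, `plan_bucket`, and the `min_merge_inputs` threshold are hypothetical; a split counts as legacy when its prefix length is below the target, matching the executor's routing rule. Waiting when a mixed bucket is still too small is an assumption of this sketch, since the option text does not specify that case.

```rust
/// Hypothetical outcome of planning one bucket.
#[derive(Debug, PartialEq, Eq)]
pub enum PlannedOp {
    /// Nothing to emit yet.
    Wait,
    /// Stands in for `ParquetMergeOperation::new`.
    RegularMerge,
    /// Stands in for `ParquetMergeOperation::promote_legacy`.
    PromoteLegacy { target_prefix_len: u32 },
}

/// Plans a bucket that was formed WITHOUT the prefix length in its scope key.
/// `prefix_lens` holds the declared prefix length of every split in the bucket.
pub fn plan_bucket(
    prefix_lens: &[u32],
    target_prefix_len: u32,
    min_merge_inputs: usize,
) -> PlannedOp {
    let legacy = prefix_lens
        .iter()
        .filter(|&&len| len < target_prefix_len)
        .count();
    let aligned = prefix_lens.len() - legacy;
    let enough_inputs = prefix_lens.len() >= min_merge_inputs;

    match (legacy, aligned) {
        (0, 0) => PlannedOp::Wait,
        // Only aligned splits: the regular size-driven merge applies.
        (0, _) if enough_inputs => PlannedOp::RegularMerge,
        (0, _) => PlannedOp::Wait,
        // Only legacy splits: promote even a single input.
        (_, 0) => PlannedOp::PromoteLegacy { target_prefix_len },
        // Mixed bucket: amortize promotion into a multi-input merge.
        (_, _) if enough_inputs => PlannedOp::PromoteLegacy { target_prefix_len },
        // Assumption: a mixed bucket that is too small waits for more splits.
        _ => PlannedOp::Wait,
    }
}
```

For example, `plan_bucket(&[0, 1, 1], 1, 2)` yields a promotion with target 1, while `plan_bucket(&[1], 1, 2)` waits.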
+
+## Signal Impact
+
+Primarily affects **metrics** in the near term: the legacy split format pre-dates the
+prefix-aligned RG layout, and only metrics has both formats in flight today. Traces and logs
+on the Parquet path will eventually reach the same state if a layout change ever happens; the
+same planner machinery would cover them.
+
+## Cost Considerations
+
+Promotion is strictly more expensive than a regular merge: the legacy adapter buffers the full
+input file in memory and re-encodes it as a single-RG stream before the merge engine sees it.
+For 50 MB metrics splits this is acceptable; for larger inputs the in-memory buffer is the
+gating cost.
+
+The planner should account for this when scheduling — promotion is best amortized into a
+multi-input merge rather than performed as a standalone file rewrite. Option C's "prefer
+multi-input promotion, fall back to single-input" structure captures this.
+
+## Impact
+
+- **Severity**: Medium. Legacy splits accumulate cost (every query against them pays the
+  prefix-less scan cost) but correctness is preserved — the locality compaction stack still
+  works on aligned splits.
+- **Frequency**: Persistent. Legacy splits never migrate without an explicit trigger.
+- **Affected Areas**: `quickwit-parquet-engine/src/merge/policy/`, `quickwit-parquet-engine/src/merge/mod.rs` (`MergePolicyState::record_split` + `CompactionScope`).
+
+## Next Steps
+
+- [ ] Decide between options A / B / C based on operational priorities and benchmark data.
+- [ ] Design the policy-level "should promote?" heuristic: how many legacy splits before
+      triggering, whether to wait for aligned neighbors, how to deprioritize promotion vs
+      regular compaction.
+- [ ] Add metrics for `legacy_splits_pending_promotion` and `promotion_operations_emitted` so we
+      can observe the policy in production.
+- [ ] Wire whichever option is chosen, with an integration test that exercises the full path
+      (legacy split → planner → executor → published prefix-aligned split).
+
+## References
+
+- PR #6423 (legacy promotion path + body-col schema evolution).
+- Codex review comment id `4311184497` (raised the gap).
+- `test_promote_legacy_executor_end_to_end` in `quickwit-parquet-engine::merge::streaming` —
+  library-level coverage of the mechanism.
@@ -0,0 +1,164 @@
+# GAP-012: Parquet Merge Executor Downloads Inputs Instead of Streaming Them
+
+**Status**: Open
+**Discovered**: 2026-05-18
+**Context**: Code review of the Parquet streaming merge stack (PRs #6407–#6428) — specifically the executor wiring on #6423 — surfaced the question of why the merge actor downloads every input to local disk before merging when the streaming engine is designed around `RemoteByteSource`.
+
+## Problem
+
+The Parquet streaming merge engine in `quickwit-parquet-engine` consumes inputs through a
+minimal `RemoteByteSource` trait (`file_size`, `get_slice`, `get_slice_stream`). The trait was
+deliberately defined so the engine can pull pages column-major directly from object storage —
+two GETs per input (footer + body stream) and the merge progresses as bytes arrive, holding
+only the page-bounded engine state in memory.
+
+The actor pipeline in `quickwit-indexing` doesn't use that design. The
+`ParquetMergeSplitDownloader` actor pulls each input via `storage.copy_to_file(remote_path,
+local_path)` into a scratch directory, then hands `Vec<PathBuf>` to the
+`ParquetMergeExecutor`. The executor then either:
+
+- Calls the in-memory `merge_sorted_parquet_files(input_paths, ...)` (regular merges), which
+  reads each file fully into Arrow RecordBatches before merging, OR
+- Wraps each local path in a `LocalFileByteSource` and calls `execute_merge_operation` (added in
+  PR #6423 for promotion merges only).
+
+Either way, the streaming engine's central design benefit — overlapping the fetch with the
+merge and skipping the scratch disk entirely — is unused in production. Every merge reads each
+input twice: once over the network into scratch, once off scratch through the merger.
+
+## Evidence
+
+- `quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_split_downloader.rs`: per-split
+  loop calling `self.storage.copy_to_file(...)` to materialize every input on local disk before
+  forwarding `ParquetMergeScratch` to the executor.
+- `quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_executor.rs`: receives
+  `downloaded_parquet_files: Vec<PathBuf>` and chooses between the in-memory path or
+  `execute_merge_operation` with `LocalFileByteSource` wrappers — never a `RemoteByteSource`
+  that actually streams from object storage.
+- `quickwit-parquet-engine/src/storage/streaming_reader.rs:62-67`: the `RemoteByteSource` trait
+  doc explicitly notes that callers in `quickwit-indexing` "provide a thin adapter that
+  delegates to `quickwit_storage::Storage`." The adapter exists in principle but isn't wired up
+  for the merge executor.
+
+## State of the Art
+
+- **ClickHouse `MergeTree`**: parts are accessed via the same storage abstraction whether the
+  merge runs locally or against tiered/object storage. There's no separate "download then
+  merge" actor pair — the merger reads parts where they live.
+- **Iceberg compaction**: data files are read directly from object storage by the compaction
+  job. Local scratch is used only for the output file before commit.
+- **Husky**: column-major streaming merge reads directly from blob storage. Designed around the
+  "two GETs per input" model the Quickwit streaming engine inherits.
+
+Across these systems, downloading inputs before merging is treated as a fallback for
+operational reasons (unreliable network, kernel page-cache effects), not the default.
+
+## Trade-offs
+
+### Why download-first is the current default
+- **Retry locality**: the downloader actor centralizes retry/backoff/timeout for one file at a
+  time. A mid-fetch S3 hiccup retries the download alone; the merger sees only successful
+  downloads.
+- **Pure-compute executor**: once files are on disk the executor has no network dependency.
+  Mid-merge failures are restricted to disk I/O and compute errors.
+- **Predictable disk budget**: scratch usage is bounded by `Σ input_sizes` per concurrent
+  merge. Easy to reason about; easy to cap.
+- **Legacy in-memory path**: `merge_sorted_parquet_files` predates the streaming engine and
+  requires local file paths. The download-first pattern existed before there was a streaming
+  alternative.
+
+### What download-first costs
+- **2× I/O per merge**: each input is transferred over the network into scratch AND read off
+  scratch into the merger. The kernel page cache mitigates the disk-read pass to some extent but
+  doesn't fully erase it.
+- **Serialized phases**: the merge can't start until *all* inputs are downloaded. First-byte
+  latency on the merger is `max(input download time)` instead of `max(input first-byte time)`,
+  since a k-way merge needs only the first pages of every input to begin.
+- **Scratch disk usage**: a typical compaction merging 8× 50 MB splits holds 400 MB of scratch
+  per merge, multiplied by the concurrent merge count. On lightweight indexer pods this caps
+  parallelism.
+- **Underused design**: the streaming engine's single-body-GET model + page-bounded memory was
+  built specifically for the no-scratch-disk case. Wiring through `LocalFileByteSource` works
+  but bypasses the property the design was built around.
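
The arithmetic behind the bullets above, as a small sketch. The function names are hypothetical and the model is deliberately crude: it ignores page-cache hits and assumes all downloads for a merge run in parallel.

```rust
/// Scratch bytes held at once: the sum of input sizes per merge, multiplied
/// by the number of merges running concurrently.
pub fn scratch_bytes(input_sizes: &[u64], concurrent_merges: u64) -> u64 {
    input_sizes.iter().sum::<u64>() * concurrent_merges
}

/// Bytes moved per merge under download-first: every input crosses the
/// network once and is then read off scratch once.
pub fn bytes_moved_download_first(input_sizes: &[u64]) -> u64 {
    2 * input_sizes.iter().sum::<u64>()
}

/// Delay before the merger can start under download-first, assuming parallel
/// downloads: the slowest download gates the whole merge.
/// Returns `None` when there are no inputs.
pub fn merge_start_delay_ms(download_times_ms: &[u64]) -> Option<u64> {
    download_times_ms.iter().copied().max()
}
```

For the example in the text, `scratch_bytes(&[50_000_000; 8], 1)` is 400 MB, and four concurrent merges of that shape hold 1.6 GB of scratch.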
+
+### What streaming-directly would cost
+- **Mid-merge retry surface**: a connection failure mid-body-GET kills the merge attempt
+  entirely. Single-body-GET is forward-only — no partial recovery. The retry surface becomes
+  "the merge failed after 30 % of work," not "the download failed, retry the file."
+- **Per-merge S3 connection count**: an N-way merge holds N concurrent body streams plus N
+  footer connections. On dense merger nodes this multiplies.
+- **Tail latency**: the merge progresses at the speed of the slowest input. With downloads,
+  parallel fetches average out; with streaming a slow input throttles the whole merge.
+
+## Potential Solutions
+
+### Option A: Streaming-directly when the input is reachable, download as fallback
+
+The executor receives a hint from the storage layer (or detects mid-merge failure rates) and
+chooses per merge. Splits stored on reliable, low-latency backends go through `RemoteByteSource`
+adapters that talk directly to `quickwit_storage::Storage`; on flaky or high-latency backends
+the downloader actor still materializes files first.
+
+Largest design lift but matches what mature compaction systems do.
+
+### Option B: Stream-directly by default, fall back to download on persistent failures
+
+Default to streaming; a circuit-breaker on per-merge failure rate routes the next attempt
+through download-first. Operationally simpler than Option A; tail latency is bounded by the
+circuit's reaction time.
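
A minimal sketch of the circuit-breaker idea, assuming a sliding window over recent streaming attempts. `StreamingBreaker`, `FetchMode`, and the `window_size` / `max_failures` parameters are hypothetical; nothing like this exists in the pipeline today, and a real version would also need time-based decay and per-backend state.

```rust
use std::collections::VecDeque;

/// Hypothetical choice of how the next merge fetches its inputs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FetchMode {
    StreamDirect,
    DownloadFirst,
}

/// Hypothetical breaker over the last `window_size` streaming merge attempts.
pub struct StreamingBreaker {
    /// `true` marks an attempt that failed mid-merge.
    window: VecDeque<bool>,
    window_size: usize,
    max_failures: usize,
}

impl StreamingBreaker {
    pub fn new(window_size: usize, max_failures: usize) -> Self {
        StreamingBreaker {
            window: VecDeque::new(),
            window_size,
            max_failures,
        }
    }

    /// Records the outcome of one merge attempt, evicting the oldest entry
    /// once the window is full.
    pub fn record(&mut self, failed: bool) {
        if self.window.len() >= self.window_size {
            self.window.pop_front();
        }
        self.window.push_back(failed);
    }

    /// Streams by default; trips to download-first while the window holds
    /// more than `max_failures` failures.
    pub fn next_mode(&self) -> FetchMode {
        let failures = self.window.iter().filter(|&&failed| failed).count();
        if failures > self.max_failures {
            FetchMode::DownloadFirst
        } else {
            FetchMode::StreamDirect
        }
    }
}
```

With `StreamingBreaker::new(4, 1)`, two recorded failures trip the breaker to `DownloadFirst`, and four subsequent successes push the failures out of the window and restore `StreamDirect`. The window size is what bounds the reaction time mentioned above.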
+
+### Option C: Keep download-first but eliminate the in-memory merge path
+
+Make every merge go through `execute_merge_operation` with `LocalFileByteSource`. This doesn't
+recover the streaming engine's "no scratch disk" benefit but does remove the legacy in-memory
+codepath, simplifying the executor to a single path.
+
+Smallest change, smallest gain. Worth doing regardless of A/B as a stepping stone.
+
+### Option D: Streaming-directly only for promotion merges
+
+Promotion already routes through `execute_merge_operation`; extend it to skip the download
+phase entirely for those operations and let the regular path stay as-is. Gains:
+legacy-adapter-backed promotion merges (the in-memory-buffering-heaviest case in the pipeline)
+avoid double I/O. Costs: the executor logic splits into "promotion = stream" vs
+"regular = download."
+
+## Signal Impact
+
+All Parquet-backed signals. Metrics is the first product to ship, so the impact lands on
+metrics first; traces and logs (when they migrate to Parquet storage) will pay the same cost
+unless this is addressed by then.
+
+## Cost Considerations
+
+The streaming engine's body-col page cache is already designed for backpressure: pages stream
+in column-major order as bytes arrive, and the engine processes them as quickly as it can. The
+bottleneck for streaming-directly becomes the transfer time of the slowest input rather than
+the time to download every input up front: usually a smaller number, but with a longer tail.
+
+## Impact
+
+- **Severity**: Medium. Correctness is unaffected; the streaming engine works equivalently
+  whether the source is local or remote. The cost is bandwidth, disk, and wall-clock latency.
+- **Frequency**: Every merge in production today pays the download cost.
+- **Affected Areas**: `quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_split_downloader.rs`,
+  `quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_executor.rs`,
+  `quickwit-parquet-engine::merge::execute_merge_operation` callers.
+
+## Next Steps
+
+- [ ] Measure the current download-vs-merge phase breakdown on a representative production
+      merge load (wall-clock + bytes-read + disk-write).
+- [ ] Build a `RemoteByteSource` adapter over `quickwit_storage::Storage` and prototype
+      streaming-directly for promotion merges (Option D) to validate the engine's behavior
+      against the existing storage backends.
+- [ ] Decide between options A / B based on observed mid-merge failure rates under real S3
+      conditions.
+- [ ] Even if the default stays download-first, consider Option C as a simplification — the
+      in-memory merge path is dead weight once the streaming engine handles every case.
+
+## References
+
+- PRs #6407–#6428 (Parquet streaming merge stack).
+- [PR #6423 discussion](https://github.com/quickwit-oss/quickwit/pull/6423) — surfaced the
+  question while wiring promotion through `execute_merge_operation`.
+- `quickwit-parquet-engine/src/storage/streaming_reader.rs` (`RemoteByteSource` trait).
+- `quickwit-indexing/src/actors/parquet_pipeline/parquet_merge_executor.rs::LocalFileByteSource`.
@@ -115,3 +115,5 @@ Gap files use sequential numbering: `001-short-description.md`
 | [008](./008-no-high-query-rate-optimization.md) | No High Query Rate Optimization | Open | High |
 | [009](./009-no-leading-edge-prioritization.md) | No Leading Edge Prioritization | Open | High |
 | [010](./010-no-data-caching-or-query-affinity.md) | No Multi-Level Data Caching or Query Affinity Optimization | Open | High |
+| [011](./011-no-legacy-promotion-planner.md) | No Planner-Level Legacy Promotion | Open | Medium |
+| [012](./012-merge-downloads-instead-of-streaming.md) | Parquet Merge Executor Downloads Inputs Instead of Streaming Them | Open | Medium |