Commit 24ec1af

and

committed

feat(merge): adversarial-review test coverage (F4/F5/F7) + F14 sub-region engine fix (#6428)

* feat(merge): close F4/F5/F7/F14 from the adversarial review Three test additions plus one engine fix surfaced by the F4 tests. The existing MS-7 test proved the per-input page-cache bound for one input and one region. F4 extends the coverage: - `test_ms7_per_input_bound_across_num_inputs` sweeps `num_inputs ∈ {1, 3, 8}` × `rows_per_input ∈ {3 000, 30 000}` and asserts the per-input peak stays bounded. Cross-axis growth check: going from 1 input to 8 must not push the peak up. - `test_ms7_per_input_bound_across_sub_regions_does_not_scale_with_rows` runs the prefix_len=0 multi-output sub-region path at 3 000 vs 30 000 rows and asserts peak doesn't scale with input row count. **This test surfaced F14 (below) — without the engine fix, the sub-region path's peak grew ~9× when rows grew 10×.** Tests serialize via `ms7_serial_lock` because `PEAK_BODY_COL_PAGE_CACHE_LEN` is process-global; concurrent tests would pollute each other's readings. Parquet streams emit pages in column-major order (all of col 0, then all of col 1, ...). The old sub-region-outer / col-inner ordering meant that while processing sub-region 0's col K, the stream emitted cols 0..K-1's remaining pages first to reach col K — those skipped pages got cached under their own col_idx for later sub-regions to consume, and the cache scaled with input row count. Fix: new `process_split_region_col_outer` function for the `needs_split` path. Cols iterate in the outer loop, sub-regions in the inner. Each parquet col chunk is fully consumed from the stream across all sub-regions before col K+1 starts. Cache for col K is empty before col K+1's pages arrive. Mechanics: pre-determine writer assignments for the region's sub-regions (a top-level region's sub-regions may span multiple output writers; consecutive sub-regions on the same writer get coalesced into one combined Region so each writer holds one RG concurrently — RGs on the same writer are sequential, so coalescing keeps the parquet writer's single-active-RG constraint intact). Single-region path stays on the existing `process_region`. `prop_merge_prefix_aligned_streaming` sweeps `(num_inputs ∈ 1..=3, per-input RG specs, num_outputs ∈ 1..=3)` with prefix_len=1 and asserts MC-1 (rows preserved), MC-3 (sorted_series monotone within each output), MS-3 (num_row_groups matches footer), PA-1+PA-3 (`assert_unique_rg_prefix_keys`), and CS-1 (metastore prefix_len == KV) on every generated case. 32 cases capped to keep runtime under a second. Fixture: `make_prefix_len_one_input` writes one RG per `(metric_name, rows)` entry by calling `writer.flush()` between batches. `sorted_series` encodes `metric_base + row_offset_within_metric`, mirroring production's storekey property that different metric_names produce non-overlapping `sorted_series` byte ranges. Plus a focused unit test `test_f5_single_input_two_metrics_minimal` that pins one specific case for fast iteration. `test_f7_production_shape_multi_input_multi_rg_multi_output`: 5 inputs × ~4 prefix-aligned RGs each × 4 outputs × prefix_len=1. Asserts the full invariant bundle (MC-1, MS-3, PA-1+PA-3, MS-5 cross-output sorted_series monotonicity, CS-1) — the corner the adversarial review flagged as "untested production case". MS-5 is "across adjacent outputs, sorted_series is monotone non-decreasing." A single metric CAN span outputs (the engine splits at sorted_series transitions inside an overflowing region), so the cross-output invariant is sorted_series monotonicity, not "each metric in one output." - `cargo test -p quickwit-parquet-engine --lib` — 498 unit tests pass. - `cargo clippy -p quickwit-parquet-engine --tests --all-features` with `-Dwarnings`. - `cargo doc --no-deps -p quickwit-parquet-engine` warning-free. - `cargo fmt --all -- --check` (nightly via PATH override). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * style(streaming): drop stray blank line before tests section header Newer nightly rustfmt (2026-05-17) flags the extra blank line that crept into the test module between the F4 fixture helper and the "Heterogeneous-output regressions" section header. Single-line gap is what nightly fmt wants. No behaviour change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(streaming): roll over chunk-assignment before first chunk after split Codex P1 on PR #6428: the previous "Recompute split budget after rolling over" fix (commit 56e773f, #6424) handled the split *decision* but not the split *assignment*. When the previous region fills the current output exactly and the next region enters the `needs_split` path, the chunk-assignment loop in `process_split_region_col_outer`'s setup initializes from the stale `current_output_idx` / `current_output_rows`. Its inner `needs_new_writer` check guards on `!chunk_assignments.is_empty()`, so the first iteration cannot roll over: the first sub-region is appended to the already-full output and only the second one advances. Output K ends up at 2× target while subsequent outputs are short or empty. Fix: initialize `active_output_idx` / `active_rows` from the `will_roll_over` case before the loop. The inner `needs_new_writer` check then works for both the first and subsequent iterations (on the first iteration `active_rows = 0 < target` so it correctly doesn't re-roll). The `can_reuse_current` check in the writer- materialization loop already handles "first chunk's output_idx doesn't match current_writer" by finalizing the current output (which is correct: it's full, close it) and opening a fresh writer at the next index. Regression test `test_split_chunk_assignment_rolls_over_before_first_chunk`: prefix_len=1, two metrics of 200 + 400 rows = 600 total, `num_outputs = 3` → `target_per_output = 200`. Region A fills output 0 exactly; region B needs splitting. Pre-fix the merge produced 2 outputs of 400 + 200 (output 0 overfilled, output 2 empty); post-fix it produces 3 outputs of ~200 rows each. 502 lib tests pass (+1); workspace clippy + nightly fmt clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

1 parent ead4848 commit 24ec1afCopy full SHA for 24ec1af

1 file changed

quickwit/quickwit-parquet-engine/src/merge
- streaming.rs

Comments

(0)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Commit 24ec1af

File tree

0 commit comments