Commit 24ec1af
feat(merge): adversarial-review test coverage (F4/F5/F7) + F14 sub-region engine fix (#6428)
* feat(merge): close F4/F5/F7/F14 from the adversarial review
Three test additions plus one engine fix surfaced by the F4 tests.
The existing MS-7 test proved the per-input page-cache bound for one
input and one region. F4 extends the coverage:
- `test_ms7_per_input_bound_across_num_inputs` sweeps `num_inputs ∈ {1, 3, 8}` × `rows_per_input ∈
{3 000, 30 000}` and asserts the per-input peak stays bounded. Cross-axis growth check: going
from 1 input to 8 must not push the peak up.
- `test_ms7_per_input_bound_across_sub_regions_does_not_scale_with_rows` runs the prefix_len=0
multi-output sub-region path at 3 000 vs 30 000 rows and asserts peak doesn't scale with input
row count. **This test surfaced F14 (below) — without the engine fix, the sub-region path's peak
grew ~9× when rows grew 10×.**
Tests serialize via `ms7_serial_lock` because
`PEAK_BODY_COL_PAGE_CACHE_LEN` is process-global; concurrent tests
would pollute each other's readings.
Parquet streams emit pages in column-major order (all of col 0,
then all of col 1, ...). The old sub-region-outer / col-inner
ordering meant that while processing sub-region 0's col K, the
stream emitted cols 0..K-1's remaining pages first to reach col K —
those skipped pages got cached under their own col_idx for later
sub-regions to consume, and the cache scaled with input row count.
Fix: new `process_split_region_col_outer` function for the
`needs_split` path. Cols iterate in the outer loop, sub-regions in
the inner. Each parquet col chunk is fully consumed from the stream
across all sub-regions before col K+1 starts. Cache for col K is
empty before col K+1's pages arrive.
Mechanics: pre-determine writer assignments for the region's
sub-regions (a top-level region's sub-regions may span multiple
output writers; consecutive sub-regions on the same writer get
coalesced into one combined Region so each writer holds one RG
concurrently — RGs on the same writer are sequential, so coalescing
keeps the parquet writer's single-active-RG constraint intact).
Single-region path stays on the existing `process_region`.
`prop_merge_prefix_aligned_streaming` sweeps `(num_inputs ∈ 1..=3,
per-input RG specs, num_outputs ∈ 1..=3)` with prefix_len=1 and
asserts MC-1 (rows preserved), MC-3 (sorted_series monotone within
each output), MS-3 (num_row_groups matches footer), PA-1+PA-3
(`assert_unique_rg_prefix_keys`), and CS-1 (metastore prefix_len ==
KV) on every generated case. 32 cases capped to keep runtime under
a second.
Fixture: `make_prefix_len_one_input` writes one RG per
`(metric_name, rows)` entry by calling `writer.flush()` between
batches. `sorted_series` encodes
`metric_base + row_offset_within_metric`, mirroring production's
storekey property that different metric_names produce
non-overlapping `sorted_series` byte ranges.
Plus a focused unit test `test_f5_single_input_two_metrics_minimal`
that pins one specific case for fast iteration.
`test_f7_production_shape_multi_input_multi_rg_multi_output`: 5
inputs × ~4 prefix-aligned RGs each × 4 outputs × prefix_len=1.
Asserts the full invariant bundle (MC-1, MS-3, PA-1+PA-3, MS-5
cross-output sorted_series monotonicity, CS-1) — the corner the
adversarial review flagged as "untested production case".
MS-5 is "across adjacent outputs, sorted_series is monotone
non-decreasing." A single metric CAN span outputs (the engine
splits at sorted_series transitions inside an overflowing region),
so the cross-output invariant is sorted_series monotonicity, not
"each metric in one output."
- `cargo test -p quickwit-parquet-engine --lib` — 498 unit tests pass.
- `cargo clippy -p quickwit-parquet-engine --tests --all-features` with `-Dwarnings`.
- `cargo doc --no-deps -p quickwit-parquet-engine` warning-free.
- `cargo fmt --all -- --check` (nightly via PATH override).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* style(streaming): drop stray blank line before tests section header
Newer nightly rustfmt (2026-05-17) flags the extra blank line that
crept into the test module between the F4 fixture helper and the
"Heterogeneous-output regressions" section header. Single-line
gap is what nightly fmt wants.
No behaviour change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(streaming): roll over chunk-assignment before first chunk after split
Codex P1 on PR #6428: the previous "Recompute split budget after
rolling over" fix (commit 56e773f, #6424) handled the split
*decision* but not the split *assignment*. When the previous region
fills the current output exactly and the next region enters the
`needs_split` path, the chunk-assignment loop in
`process_split_region_col_outer`'s setup initializes from the stale
`current_output_idx` / `current_output_rows`. Its inner
`needs_new_writer` check guards on `!chunk_assignments.is_empty()`,
so the first iteration cannot roll over: the first sub-region is
appended to the already-full output and only the second one
advances. Output K ends up at 2× target while subsequent outputs
are short or empty.
Fix: initialize `active_output_idx` / `active_rows` from the
`will_roll_over` case before the loop. The inner `needs_new_writer`
check then works for both the first and subsequent iterations (on
the first iteration `active_rows = 0 < target` so it correctly
doesn't re-roll). The `can_reuse_current` check in the writer-
materialization loop already handles "first chunk's output_idx
doesn't match current_writer" by finalizing the current output
(which is correct: it's full, close it) and opening a fresh writer
at the next index.
Regression test
`test_split_chunk_assignment_rolls_over_before_first_chunk`:
prefix_len=1, two metrics of 200 + 400 rows = 600 total,
`num_outputs = 3` → `target_per_output = 200`. Region A fills
output 0 exactly; region B needs splitting. Pre-fix the merge
produced 2 outputs of 400 + 200 (output 0 overfilled, output 2
empty); post-fix it produces 3 outputs of ~200 rows each.
502 lib tests pass (+1); workspace clippy + nightly fmt clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>1 parent ead4848 commit 24ec1af
1 file changed
Lines changed: 1432 additions & 16 deletions
0 commit comments