Commit 347b7bd
feat(legacy-adapter): synthesize prefix-aligned row groups
The legacy adapter previously consolidated multi-RG legacy inputs
into a single oversized row group and left `rg_partition_prefix_len`
at the original's (typically `0`). The streaming merge engine then
sent these single-RG/prefix=0 inputs through the new sub-region
splitting path — correct, but it forfeits the prefix-aware fast path
for outputs derived from legacy inputs and gives up the row-group
pruning that prefix alignment enables.
After consolidating, the adapter now slices the resulting record
batch at first-sort-col transitions (typically `metric_name`) and
emits one parquet row group per slice, stamping the re-encoded file
with `qh.rg_partition_prefix_len = 1`. The merge engine then reads
it through the prefix-aware fast path: one region per metric_name,
the existing duplicate-prefix invariant on read validates uniqueness.
Fallback: if the original file has no `qh.sort_fields` KV, the
sort-fields string fails to parse, the first column can't be
resolved in the arrow schema, or the consolidated batch is empty,
the adapter reverts to a single-RG re-encode without claiming any
prefix alignment. That input still works — the engine's
prefix_len=0 sub-region splitting path picks it up. This keeps the
adapter robust for files written by very early versions of the
indexer that may pre-date the standard KV layout.
Implementation: `reencode_prefix_aligned` replaces
`reencode_as_single_row_group` and either dispatches to the new
multi-RG writer or to the legacy single-RG writer based on whether
the first sort col is resolvable. `RowConverter` handles the
prefix-value equality check uniformly across dictionary, utf8, and
primitive types. The KV injection helper replaces (rather than
appends) any existing `qh.rg_partition_prefix_len` so re-runs and
files mistakenly carrying a stale value still land at the freshly
synthesized prefix.
Tests:
- `test_legacy_input_with_sort_fields_produces_prefix_aligned_multi_rg`
— 3 metrics × 40 rows, multi-RG input → 3 prefix-aligned output
RGs and `qh.rg_partition_prefix_len = 1` KV.
- `test_legacy_input_single_metric_yields_one_rg_with_prefix_kv` —
one metric → one RG, prefix KV still stamped (vacuously aligned).
- `test_legacy_input_without_sort_fields_falls_back_to_single_rg` —
fallback path preserved when sort-fields KV is missing.
- All existing tests pass unchanged (they use empty KVs or
unparseable sort-fields strings, both of which exercise the
fallback path).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>1 parent fb85855 commit 347b7bd
1 file changed
Lines changed: 377 additions & 19 deletions
0 commit comments