feat: compute per-output split metadata in merge engine #6359

Open
g-talbot wants to merge 2 commits into gtt/parquet-merge-policy-config from gtt/merge-output-split-metadata
@g-talbot
Contributor

Summary

Stacked on #6351 (Phase 2). Should be merged before #6352 (Phase 3a).

The merge engine now extracts metric_names, time_range, and low_cardinality_tags (service) from each output file's actual rows during the merge write pass.

Problem: MergeOutputFile previously carried only physical metadata (num_rows, size_bytes, row_keys, zonemaps). The downstream metadata_aggregation function inferred logical metadata by unioning all input splits. This is incorrect when num_outputs > 1: each output contains a different subset of the globally sorted rows, so its metadata should reflect only its own data.
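
To make the over-approximation concrete, here is a minimal sketch of union-based inference. It is not the actual metadata_aggregation code: `LogicalMeta`, `union_of_inputs`, and `meta` are hypothetical names, and the `u64` timestamp type is an assumption.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the logical metadata of one split.
#[derive(Debug, Clone, PartialEq)]
pub struct LogicalMeta {
    pub metric_names: BTreeSet<String>,
    /// Inclusive (min, max) of timestamp_secs. The u64 type is an assumption.
    pub time_range: (u64, u64),
}

/// Union-based inference: combine the metadata of every input split.
/// Returns None when there are no inputs.
pub fn union_of_inputs(inputs: &[LogicalMeta]) -> Option<LogicalMeta> {
    let mut iter = inputs.iter();
    let first = iter.next()?.clone();
    Some(iter.fold(first, |mut acc, m| {
        // Every metric seen in any input ends up in the result.
        acc.metric_names.extend(m.metric_names.iter().cloned());
        // The range widens to cover every input.
        acc.time_range = (
            acc.time_range.0.min(m.time_range.0),
            acc.time_range.1.max(m.time_range.1),
        );
        acc
    }))
}

/// Small constructor used only to keep the example short.
pub fn meta(names: &[&str], time_range: (u64, u64)) -> LogicalMeta {
    LogicalMeta {
        metric_names: names.iter().map(|s| s.to_string()).collect(),
        time_range,
    }
}
```

With inputs `meta(&["cpu"], (0, 100))` and `meta(&["mem"], (50, 200))`, the union is `{cpu, mem}` over `(0, 200)`. If that result is attached to every output, an output that holds only the `cpu` rows still advertises `mem` and a time range twice as wide as its data.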

Fix: Each MergeOutputFile now carries per-output logical metadata extracted from the sorted_batch before writing. This reuses extract_metric_names, extract_service_names, and extract_time_range from split_writer, which are now pub(crate).
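
A rough sketch of the per-output approach, using plain structs rather than the engine's real batch type. `Row`, `OutputMeta`, `extract_output_meta`, and `merge_into_outputs` are hypothetical names. The sort key (metric name, then timestamp) and the even chunking into outputs are assumptions for illustration only and may not match how the merge engine orders or partitions rows.

```rust
use std::collections::BTreeSet;

/// Hypothetical row; the real engine works on sorted batches, not this struct.
#[derive(Debug, Clone)]
pub struct Row {
    pub metric_name: String,
    pub service: String,
    pub timestamp_secs: u64,
}

/// Hypothetical per-output metadata, mirroring the fields named in this PR.
#[derive(Debug, Clone, PartialEq)]
pub struct OutputMeta {
    pub num_rows: usize,
    pub metric_names: BTreeSet<String>,
    /// None when the output has no rows.
    pub time_range: Option<(u64, u64)>,
    pub service_names: BTreeSet<String>,
}

/// Derive logical metadata from exactly the rows that go into one output.
pub fn extract_output_meta(rows: &[Row]) -> OutputMeta {
    let metric_names = rows.iter().map(|r| r.metric_name.clone()).collect();
    let service_names = rows.iter().map(|r| r.service.clone()).collect();
    let min = rows.iter().map(|r| r.timestamp_secs).min();
    let max = rows.iter().map(|r| r.timestamp_secs).max();
    OutputMeta {
        num_rows: rows.len(),
        metric_names,
        time_range: min.zip(max),
        service_names,
    }
}

/// Sort all rows globally, cut them into at most `num_outputs` contiguous
/// chunks, and compute metadata per chunk (assumed partitioning scheme).
pub fn merge_into_outputs(mut rows: Vec<Row>, num_outputs: usize) -> Vec<OutputMeta> {
    assert!(num_outputs > 0, "num_outputs must be positive");
    if rows.is_empty() {
        return Vec::new();
    }
    rows.sort_by(|a, b| {
        a.metric_name
            .cmp(&b.metric_name)
            .then(a.timestamp_secs.cmp(&b.timestamp_secs))
    });
    // Ceiling division so that no more than `num_outputs` chunks are produced.
    let chunk_len = (rows.len() + num_outputs - 1) / num_outputs;
    rows.chunks(chunk_len).map(extract_output_meta).collect()
}
```

Merging two `cpu` rows and two `mem` rows into 2 outputs under these assumptions yields one output whose metric_names contains only `cpu` and another containing only `mem`, each with a time_range covering just its own timestamps.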

Test plan

  • New test test_merge_per_output_metadata_from_actual_rows — verifies 2-output merge has correct per-output metric_names and time_range
  • Updated test_merge_multiple_outputs with per-output metadata assertions
  • All 66 merge tests pass (including proptests)
  • cargo clippy clean

🤖 Generated with Claude Code

g-talbot force-pushed the gtt/merge-output-split-metadata branch from fc6f90a to 720560d on April 29, 2026 20:53
g-talbot changed the base branch from gtt/parquet-merge-policy to gtt/parquet-merge-policy-config on April 29, 2026 20:53
g-talbot and others added 2 commits April 29, 2026 22:29
The merge engine now extracts metric_names, time_range, and
low_cardinality_tags from each output file's actual rows during the
merge write pass.

Previously, MergeOutputFile only contained physical metadata (num_rows,
size_bytes, row_keys, zonemaps). The downstream metadata_aggregation
function inferred logical metadata by unioning all input splits — which
is incorrect when num_outputs > 1, since each output contains only a
subset of the globally sorted rows.

Now each MergeOutputFile carries:
- metric_names: distinct metrics in this output's rows
- time_range: min/max timestamp_secs from this output's rows
- low_cardinality_tags: service names from this output's rows

Reuses existing extract_metric_names, extract_service_names, and
extract_time_range from split_writer (made pub(crate)).

Includes test that verifies per-output metadata is computed from actual
rows when merging 2 inputs into 2 outputs with different metric names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
g-talbot force-pushed the gtt/merge-output-split-metadata branch from 720560d to 3e61af4 on April 30, 2026 02:29