feat: compute per-output split metadata in merge engine (#6359)
* feat: add configurable ParquetMergePolicyConfig to index settings
Adds `parquet_merge_policy` section to `IndexingSettings`, making the
Parquet merge policy configurable per-index via YAML. Parameters:
- merge_factor (default 10): min splits to trigger a merge
- max_merge_factor (default 12): max splits per merge
- max_merge_ops (default 4): bounds write amplification
- target_split_size_bytes (default 256 MiB): target output size
- maturation_period (default 48h): split maturity timeout
- max_finalize_merge_operations (default 3): cold-window shutdown limit
Mirrors the existing merge_policy config pattern for logs/traces.
Updates index-config.md documentation with the new section.
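The parameters and defaults listed above could be modeled roughly as follows. This is a minimal sketch, not the actual definition: the field types, the derive list, and the `validate` helper are assumptions made for illustration; only the parameter names and default values come from this message.

```rust
use std::time::Duration;

/// Illustrative mirror of the `parquet_merge_policy` section.
/// Field names and defaults follow the commit message; the types are assumed.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct ParquetMergePolicyConfig {
    /// Minimum number of splits needed to trigger a merge.
    pub merge_factor: usize,
    /// Maximum number of splits combined in a single merge.
    pub max_merge_factor: usize,
    /// Upper bound on merge operations, which bounds write amplification.
    pub max_merge_ops: usize,
    /// Target size of a merged output split, in bytes.
    pub target_split_size_bytes: u64,
    /// Time after which a split is considered mature.
    pub maturation_period: Duration,
    /// Limit on merge operations during cold-window shutdown.
    pub max_finalize_merge_operations: usize,
}

impl Default for ParquetMergePolicyConfig {
    fn default() -> Self {
        Self {
            merge_factor: 10,
            max_merge_factor: 12,
            max_merge_ops: 4,
            target_split_size_bytes: 256 * 1024 * 1024, // 256 MiB
            maturation_period: Duration::from_secs(48 * 3600), // 48h
            max_finalize_merge_operations: 3,
        }
    }
}

impl ParquetMergePolicyConfig {
    /// Hypothetical sanity check (not taken from the commit): the maximum
    /// merge factor should not be smaller than the minimum one.
    pub fn validate(&self) -> Result<(), String> {
        if self.merge_factor < 2 {
            return Err("merge_factor must be at least 2".to_string());
        }
        if self.max_merge_factor < self.merge_factor {
            return Err("max_merge_factor must be >= merge_factor".to_string());
        }
        Ok(())
    }
}
```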
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add ParquetIndexingConfig with sort_fields and window_duration_secs
Adds `parquet_indexing` section to `IndexingSettings` for per-index
Parquet pipeline configuration:
- `sort_fields`: sort schema override (Husky-style pipe-delimited
syntax with /V2 suffix). Controls row ordering, query pruning,
compression locality, and compaction scope. When omitted, uses
the product-type default.
- `window_duration_secs`: time window for split partitioning
(default 900s / 15 min). Must evenly divide 3600.
Updates docs/configuration/index-config.md with:
- "Parquet indexing settings" section explaining both parameters
- Full sort schema syntax reference (column types, direction
overrides, & LSM cutoff marker)
- Examples showing minimal, custom, and advanced configurations
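The "must divide 3600" constraint on `window_duration_secs` can be sketched as a small validator. The function names and error messages below are hypothetical, and the `window_start` helper assumes windows are aligned to the Unix epoch, which this message does not state.

```rust
/// Default window from the commit message: 900 s (15 min).
pub const DEFAULT_WINDOW_DURATION_SECS: u32 = 900;

/// Hypothetical validator: accepts only durations that evenly divide one hour.
pub fn validate_window_duration_secs(secs: u32) -> Result<u32, String> {
    if secs == 0 {
        return Err("window_duration_secs must be greater than 0".to_string());
    }
    if 3600 % secs != 0 {
        return Err(format!(
            "window_duration_secs ({secs}) must evenly divide 3600"
        ));
    }
    Ok(secs)
}

/// Start of the window that contains `timestamp_secs`.
/// Assumes epoch-aligned windows (an assumption for illustration only).
pub fn window_start(timestamp_secs: i64, window_duration_secs: u32) -> i64 {
    let width = i64::from(window_duration_secs);
    // div_euclid rounds toward negative infinity, so pre-epoch timestamps
    // still land in the window that starts at or before them.
    timestamp_secs.div_euclid(width) * width
}
```

Under that alignment assumption, a duration that divides 3600 guarantees that no window straddles an hour boundary.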
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: update indexing service fingerprint constants and nightly fmt
Adding ParquetMergePolicyConfig and ParquetIndexingConfig to
IndexingSettings changes the Hash output, which changes the pipeline
params fingerprints. Updated the hardcoded test constants.
Added a comment explaining how to recompute them when IndexingSettings
fields change.
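Why the constants move can be illustrated with a toy fingerprint. This sketch uses `std`'s `DefaultHasher` as a stand-in; the hasher the indexing service actually uses, and the struct and field names below, are assumptions for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative settings struct before the new section was added.
#[derive(Hash)]
pub struct SettingsBefore {
    pub commit_timeout_secs: u64,
}

/// The same struct with one extra hashed field.
#[derive(Hash)]
pub struct SettingsAfter {
    pub commit_timeout_secs: u64,
    pub window_duration_secs: u32,
}

/// Toy fingerprint: feeds the derived `Hash` implementation into a hasher.
pub fn fingerprint<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}
```

Because a derived `Hash` feeds every field to the hasher, adding a field changes the byte stream and therefore the fingerprint, even when all pre-existing values are unchanged.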
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: compute per-output split metadata in merge engine
The merge engine now extracts metric_names, time_range, and
low_cardinality_tags from each output file's actual rows during the
merge write pass.
Previously, MergeOutputFile only contained physical metadata (num_rows,
size_bytes, row_keys, zonemaps). The downstream metadata_aggregation
function inferred logical metadata by unioning all input splits, which
is incorrect when num_outputs > 1, since each output contains only a
subset of the globally sorted rows.
Now each MergeOutputFile carries:
- metric_names: distinct metrics in this output's rows
- time_range: min/max timestamp_secs from this output's rows
- low_cardinality_tags: service names from this output's rows
Reuses existing extract_metric_names, extract_service_names, and
extract_time_range from split_writer (made pub(crate)).
Includes a test that verifies per-output metadata is computed from actual
rows when merging 2 inputs into 2 outputs with different metric names.
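The difference between per-output metadata and a union over inputs can be sketched as below. This is a simplified in-memory model, not the merge engine: `Row`, `OutputMetadata`, `metadata_from_rows`, and `merge_into_outputs` are hypothetical names, the sort key (metric name, then timestamp) is assumed, and the real engine presumably streams a sorted merge rather than collecting and sorting all rows.

```rust
use std::collections::BTreeSet;

/// Simplified row: only the columns the metadata is derived from.
#[derive(Debug, Clone)]
pub struct Row {
    pub metric_name: String,
    pub service: String,
    pub timestamp_secs: i64,
}

/// Logical metadata computed from one output file's own rows.
#[derive(Debug, PartialEq, Eq)]
pub struct OutputMetadata {
    pub metric_names: BTreeSet<String>,
    pub low_cardinality_tags: BTreeSet<String>,
    /// (min, max) of timestamp_secs, or None for an empty output.
    pub time_range: Option<(i64, i64)>,
}

pub fn metadata_from_rows(rows: &[Row]) -> OutputMetadata {
    let min = rows.iter().map(|r| r.timestamp_secs).min();
    let max = rows.iter().map(|r| r.timestamp_secs).max();
    OutputMetadata {
        metric_names: rows.iter().map(|r| r.metric_name.clone()).collect(),
        low_cardinality_tags: rows.iter().map(|r| r.service.clone()).collect(),
        time_range: min.zip(max),
    }
}

/// Sorts all input rows globally, cuts them into up to `num_outputs`
/// contiguous chunks, and computes metadata per chunk.
pub fn merge_into_outputs(
    inputs: &[Vec<Row>],
    num_outputs: usize,
) -> Vec<(Vec<Row>, OutputMetadata)> {
    let mut all: Vec<Row> = inputs.iter().flatten().cloned().collect();
    if all.is_empty() || num_outputs == 0 {
        return Vec::new();
    }
    all.sort_by(|a, b| {
        (&a.metric_name, a.timestamp_secs).cmp(&(&b.metric_name, b.timestamp_secs))
    });
    // Ceiling division so that at most `num_outputs` chunks are produced.
    let chunk_len = (all.len() + num_outputs - 1) / num_outputs;
    all.chunks(chunk_len)
        .map(|chunk| (chunk.to_vec(), metadata_from_rows(chunk)))
        .collect()
}
```

With two inputs that each hold both a `cpu` and a `mem` row, a union over inputs would report both metrics for every output, whereas the per-output pass reports only the metric that actually landed in each file.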
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: nightly rustfmt import ordering
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>