Skip to content

feat: add configurable ParquetMergePolicyConfig to index settings#6362

Open
g-talbot wants to merge 10 commits intomatthew.kim/metrics-partitioningfrom
gtt/parquet-merge-policy-config
Open

feat: add configurable ParquetMergePolicyConfig to index settings#6362
g-talbot wants to merge 10 commits intomatthew.kim/metrics-partitioningfrom
gtt/parquet-merge-policy-config

Conversation

@g-talbot
Copy link
Copy Markdown
Contributor

Summary

Stacked on #6351 (Phase 2 merge policy). Addresses review feedback to make the Parquet merge policy configurable per-index.

Adds parquet_merge_policy section to IndexingSettings, exposing all merge policy parameters via YAML:

indexing_settings:
  parquet_merge_policy:
    merge_factor: 10
    max_merge_factor: 12
    max_merge_ops: 4
    target_split_size_bytes: 268435456  # 256 MiB
    maturation_period: 48h
    max_finalize_merge_operations: 3

Mirrors the existing merge_policy config pattern for logs/traces (StableLog/ConstWriteAmplification).

Changes

  • quickwit-config: New ParquetMergePolicyConfig struct with serde + human-readable durations
  • IndexingSettings: New parquet_merge_policy field (defaults match current hardcoded values)
  • docs/configuration/index-config.md: New "Parquet merge policy" documentation section

Follow-up

The downstream wiring (converting this config to Arc<dyn ParquetMergePolicy> in quickwit-indexing) is done in the Phase 3 pipeline PRs.

Test plan

  • 95 existing config tests pass
  • cargo clippy clean
  • Documentation updated with parameter table and YAML example

🤖 Generated with Claude Code

g-talbot and others added 9 commits April 28, 2026 11:47
Adds a constant write amplification merge policy for Parquet splits,
adapted from the existing ConstWriteAmplificationMergePolicy but using
byte size instead of document count as the primary size metric. This is
Phase 2 of the Parquet compaction project — the decision layer that
determines which splits to merge within each compaction scope.

Key components:
- ParquetMergePolicy trait mirroring the MergePolicy interface
- CompactionScope grouping by (index_uid, sort_fields, window_start)
- ConstWriteAmplificationParquetMergePolicy with bounded write amp
- finalize_operations() for cold window compaction
- 33 tests: unit, proptest (MC-CONSERVE/LEVEL/WA/IDEMPOTENT), simulation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Now that ParquetSplitMetadata has partition_id (from Matt's PR #6340),
include it in CompactionScope so splits with different partitions are
never merged together. Adds 2 new scope tests for partition isolation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
finalize_operations() was running single_merge_operation over all young
splits without grouping by num_merge_ops level. This could merge level-0
and level-1 splits together, stamping the output with max(levels) + 1
and prematurely maturing lower-level data.

Fix: group by num_merge_ops in finalize just like operations() does,
then apply the lower merge factor within each level independently.

Added test_finalize_respects_mc_level_invariant (unit) and
proptest_finalize_respects_mc_level (property test) — both caught the
bug before the fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CompactionScope only used window_start_secs, so splits with the same
start but different durations (e.g. after a window config change) would
be grouped together. The merge engine requires all inputs to agree on
both window_start and window_duration, so merging across durations would
fail validation.

Added test_different_window_duration which caught the bug before the fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add three merge-policy invariants to the shared verification layer
(quickwit-dst) with check_invariant! enforcement in
ParquetMergeOperation::new():

- MP-1: all splits share the same num_merge_ops level
- MP-2: merge op has at least 2 input splits
- MP-3: all splits share the same compaction scope (sort_fields + window)

Shared pure functions in quickwit_dst::invariants::merge_policy are the
single source of truth, usable by Stateright models and production code.
Debug builds panic on violation; all builds emit metrics via the
invariant recorder.

Tests written first (4 should_panic tests failed before adding checks,
pass after). Plus 1 positive test and 3 unit tests for shared functions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…line

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds `parquet_merge_policy` section to `IndexingSettings`, making the
Parquet merge policy configurable per-index via YAML. Parameters:

- merge_factor (default 10): min splits to trigger a merge
- max_merge_factor (default 12): max splits per merge
- max_merge_ops (default 4): bounds write amplification
- target_split_size_bytes (default 256 MiB): target output size
- maturation_period (default 48h): split maturity timeout
- max_finalize_merge_operations (default 3): cold-window shutdown limit

Mirrors the existing merge_policy config pattern for logs/traces.
Updates index-config.md documentation with the new section.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…secs

Adds `parquet_indexing` section to `IndexingSettings` for per-index
Parquet pipeline configuration:

- `sort_fields`: sort schema override (Husky-style pipe-delimited
  syntax with /V2 suffix). Controls row ordering, query pruning,
  compression locality, and compaction scope. When omitted, uses
  the product-type default.
- `window_duration_secs`: time window for split partitioning
  (default 900s / 15 min). Must divide 3600.

Updates docs/configuration/index-config.md with:
- "Parquet indexing settings" section explaining both parameters
- Full sort schema syntax reference (column types, direction
  overrides, & LSM cutoff marker)
- Examples showing minimal, custom, and advanced configurations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Base automatically changed from gtt/parquet-merge-policy to matthew.kim/metrics-partitioning April 29, 2026 21:04
Adding ParquetMergePolicyConfig and ParquetIndexingConfig to
IndexingSettings changes the Hash output, which changes the pipeline
params fingerprints. Updated the hardcoded test constants.

Added a comment explaining how to recompute them when IndexingSettings
fields change.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@g-talbot g-talbot requested review from guilload and mattmkim and removed request for guilload and mattmkim April 30, 2026 02:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant