Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fdafdcb627
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```rust
DataType::Dictionary(_, value_type) if value_type.as_ref() == &DataType::Utf8 => {
    let dict = column
        .as_any()
        .downcast_ref::<arrow::array::DictionaryArray<Int32Type>>()
        .expect("dictionary column should be DictionaryArray<Int32>");
```
Avoid panicking on non-Int32 dictionary keys
When a referenced routing column arrives as a valid Arrow Dictionary with a non-Int32 key type, this match accepts it because it only checks the value type, but the subsequent downcast to DictionaryArray<Int32Type> returns None and the expect panics. This can take down the doc processor for otherwise string-valued tag columns encoded with UInt8, Int16, etc.; either match DataType::Dictionary(Int32, Utf8) explicitly or handle the supported key types instead of treating all dictionaries as Int32.
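The shape of the suggested fix — checking the dictionary key type explicitly instead of only the value type — can be sketched with a self-contained stand-in for arrow's `DataType` enum (the real crate isn't pulled in here, so this models only the match logic):

```rust
// Simplified stand-in for arrow::datatypes::DataType, to illustrate
// matching the key type explicitly instead of accepting any dictionary.
#[derive(PartialEq)]
enum DataType {
    Utf8,
    Int32,
    UInt8,
    Dictionary(Box<DataType>, Box<DataType>),
}

/// Returns true only for dictionary columns the downcast can handle:
/// Int32 keys with Utf8 values. Other key widths (UInt8, Int16, ...)
/// are rejected here instead of falling through to an `expect` panic.
fn is_supported_dict(dt: &DataType) -> bool {
    matches!(
        dt,
        DataType::Dictionary(key, value)
            if **key == DataType::Int32 && **value == DataType::Utf8
    )
}

fn main() {
    let ok = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
    let bad = DataType::Dictionary(Box::new(DataType::UInt8), Box::new(DataType::Utf8));
    assert!(is_supported_dict(&ok));
    assert!(!is_supported_dict(&bad));
    println!("ok");
}
```

The alternative mentioned in the comment — handling several key types — would add further guarded arms (or a generic helper per key width) rather than widening this single check.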
g-talbot
left a comment
LGTM'ing to keep you unblocked, but there's a codex comment and it's a real one that I had to deal with in the merger too.
💡 Codex Review
https://github.com/quickwit-oss/quickwit/blob/0ec77d339ba7d04f2f6466a2b029421167aa0124/quickwit-indexing/src/actors/metrics_pipeline/parquet_doc_processor.rs#L414-L418
Preserve partition ids across Arrow batches
When one workbench receives multiple Arrow IPC batches, applying this max_num_partitions cap independently inside partition_batch can misroute rows for partitions that are already open in the indexer. For example with max_num_partitions = 2, if the first IPC batch opens partitions A and B, and a later IPC batch first sees C, D, then A, the A rows are converted to OTHER_PARTITION_ID here before the indexer sees them; the indexer can no longer recover that they belonged to A, so the same routing key is split between its real partition and OTHER. The cap needs to be enforced with workbench/global partition state, or the doc processor should forward the true partition ids and let the indexer do the overflow routing.
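One way to meet this is to enforce the cap with state that outlives a single IPC batch, so a routing key that already has an open partition keeps its id. A self-contained sketch, assuming a hypothetical `PartitionTracker` held at the workbench level (the type, ids, and `OTHER_PARTITION_ID` value are illustrative, not Quickwit's actual API):

```rust
use std::collections::HashMap;

const OTHER_PARTITION_ID: u64 = u64::MAX;

/// Hypothetical partition-id assignment that lives across Arrow IPC
/// batches: once a routing key has opened a partition, later batches
/// reuse that id instead of folding it into OTHER_PARTITION_ID.
struct PartitionTracker {
    max_num_partitions: usize,
    open: HashMap<String, u64>, // routing key -> assigned partition id
}

impl PartitionTracker {
    fn partition_id(&mut self, routing_key: &str) -> u64 {
        if let Some(&id) = self.open.get(routing_key) {
            return id; // already open: reuse, even in a later batch
        }
        if self.open.len() < self.max_num_partitions {
            let id = self.open.len() as u64;
            self.open.insert(routing_key.to_string(), id);
            id
        } else {
            OTHER_PARTITION_ID // overflow only for genuinely new keys
        }
    }
}

fn main() {
    let mut tracker = PartitionTracker { max_num_partitions: 2, open: HashMap::new() };
    // Batch 1 opens partitions for A and B.
    assert_eq!(tracker.partition_id("A"), 0);
    assert_eq!(tracker.partition_id("B"), 1);
    // Batch 2 sees C (overflow), then A again: A keeps its original id,
    // which is exactly what the per-batch cap in the review comment loses.
    assert_eq!(tracker.partition_id("C"), OTHER_PARTITION_ID);
    assert_eq!(tracker.partition_id("A"), 0);
    println!("ok");
}
```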
* feat: add Parquet merge policy for compaction (Phase 2)

  Adds a constant write amplification merge policy for Parquet splits, adapted from the existing ConstWriteAmplificationMergePolicy but using byte size instead of document count as the primary size metric. This is Phase 2 of the Parquet compaction project — the decision layer that determines which splits to merge within each compaction scope.

  Key components:
  - ParquetMergePolicy trait mirroring the MergePolicy interface
  - CompactionScope grouping by (index_uid, sort_fields, window_start)
  - ConstWriteAmplificationParquetMergePolicy with bounded write amp
  - finalize_operations() for cold window compaction
  - 33 tests: unit, proptest (MC-CONSERVE/LEVEL/WA/IDEMPOTENT), simulation

* fix: add partition_id to CompactionScope, rebase on #6340

  Now that ParquetSplitMetadata has partition_id (from Matt's PR #6340), include it in CompactionScope so splits with different partitions are never merged together. Adds 2 new scope tests for partition isolation.

* fix: rustfmt nightly comment wrapping and import ordering

* fix: finalize_operations must respect MC-LEVEL invariant

  finalize_operations() was running single_merge_operation over all young splits without grouping by num_merge_ops level. This could merge level-0 and level-1 splits together, stamping the output with max(levels) + 1 and prematurely maturing lower-level data. Fix: group by num_merge_ops in finalize just like operations() does, then apply the lower merge factor within each level independently. Added test_finalize_respects_mc_level_invariant (unit) and proptest_finalize_respects_mc_level (property test) — both caught the bug before the fix.

* fix: include window_duration_secs in CompactionScope

  CompactionScope only used window_start_secs, so splits with the same start but different durations (e.g. after a window config change) would be grouped together. The merge engine requires all inputs to agree on both window_start and window_duration, so merging across durations would fail validation. Added test_different_window_duration which caught the bug before the fix.

* feat: add MP-1/MP-2/MP-3 runtime invariant checks for merge operations

  Add three merge-policy invariants to the shared verification layer (quickwit-dst) with check_invariant! enforcement in ParquetMergeOperation::new():
  - MP-1: all splits share the same num_merge_ops level
  - MP-2: merge op has at least 2 input splits
  - MP-3: all splits share the same compaction scope (sort_fields + window)

  Shared pure functions in quickwit_dst::invariants::merge_policy are the single source of truth, usable by Stateright models and production code. Debug builds panic on violation; all builds emit metrics via the invariant recorder. Tests written first (4 should_panic tests failed before adding checks, pass after). Plus 1 positive test and 3 unit tests for shared functions.

* fix: nightly rustfmt — merge imports, unwrap short line, trailing newline

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
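The scope grouping these commits describe — splits are merge candidates only when every scope field matches, so differing partition ids or window durations are never merged — can be sketched with a self-contained stand-in (field names follow the commit messages; the struct and values are illustrative, not the actual Quickwit types):

```rust
use std::collections::HashMap;

/// Illustrative CompactionScope key: deriving Eq + Hash over all fields
/// means splits land in the same group only if index, sort order, window
/// start, window duration, AND partition id all agree.
#[derive(Clone, PartialEq, Eq, Hash)]
struct CompactionScope {
    index_uid: String,
    sort_fields: Vec<String>,
    window_start_secs: i64,
    window_duration_secs: i64,
    partition_id: u64,
}

/// Group (scope, split_id) pairs; merge candidates come from one group.
fn group_by_scope(splits: Vec<(CompactionScope, String)>) -> HashMap<CompactionScope, Vec<String>> {
    let mut groups: HashMap<CompactionScope, Vec<String>> = HashMap::new();
    for (scope, split_id) in splits {
        groups.entry(scope).or_default().push(split_id);
    }
    groups
}

fn main() {
    let base = CompactionScope {
        index_uid: "metrics-index".into(),
        sort_fields: vec!["timestamp".into()],
        window_start_secs: 0,
        window_duration_secs: 3600,
        partition_id: 0,
    };
    // Same scope except for partition_id: must never merge with `base`.
    let other_partition = CompactionScope { partition_id: 1, ..base.clone() };
    let groups = group_by_scope(vec![
        (base.clone(), "s1".into()),
        (base.clone(), "s2".into()),
        (other_partition, "s3".into()),
    ]);
    // s1 and s2 share a scope; s3 is isolated by its partition_id.
    assert_eq!(groups.len(), 2);
    println!("ok");
}
```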
Description
Enables partitioning of metric (points and sketches) data. A `partition_id` will be added to split metadata (the `split_metadata_json` field).

Summary of functionality (basically mirrors logs):
- `partition_key` routing expression against RecordBatch rows.

This PR keeps the new Arrow-backed routing and metrics pipeline changes behind the existing `metrics` feature. `quickwit-doc-mapper` now exposes its Arrow `RoutingExprContext` only when `quickwit-doc-mapper/metrics` is enabled, and `quickwit-indexing/metrics` wires that feature in alongside its existing optional `arrow` and `quickwit-parquet-engine` dependencies.

For @g-talbot:
`partition_id` is not a column in Postgres, it's in a JSON column, so we can't effectively filter/group by `partition_id` in Postgres. But looking at how logs does compaction, they don't filter by `partition_id` in Postgres either; they just group files in memory by `partition_id` after getting all the metadata back.

How was this PR tested?
Describe how you tested this PR.
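The in-memory grouping by `partition_id` mentioned above (group after fetching all metadata, since Postgres can't filter on a field inside the JSON column) could look like this sketch, with a hypothetical minimal split-metadata type:

```rust
use std::collections::HashMap;

/// Minimal stand-in for split metadata. In the real system partition_id
/// lives inside the split_metadata_json column, so grouping happens in
/// memory after fetching all rows rather than via a SQL GROUP BY.
struct SplitMeta {
    split_id: String,
    partition_id: u64,
}

fn group_by_partition(splits: Vec<SplitMeta>) -> HashMap<u64, Vec<SplitMeta>> {
    let mut groups: HashMap<u64, Vec<SplitMeta>> = HashMap::new();
    for split in splits {
        groups.entry(split.partition_id).or_default().push(split);
    }
    groups
}

fn main() {
    let splits = vec![
        SplitMeta { split_id: "s1".into(), partition_id: 0 },
        SplitMeta { split_id: "s2".into(), partition_id: 1 },
        SplitMeta { split_id: "s3".into(), partition_id: 0 },
    ];
    let groups = group_by_partition(splits);
    // Partition 0 holds s1 and s3; partition 1 holds only s2.
    assert_eq!(groups[&0].len(), 2);
    assert_eq!(groups[&1].len(), 1);
    println!("ok");
}
```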