Commit 1fbece5

g-talbot and claude committed
feat: add ParquetIndexingConfig with sort_fields and window_duration_secs
Adds `parquet_indexing` section to `IndexingSettings` for per-index Parquet pipeline configuration:

- `sort_fields`: sort schema override (Husky-style pipe-delimited syntax with /V2 suffix). Controls row ordering, query pruning, compression locality, and compaction scope. When omitted, uses the product-type default.
- `window_duration_secs`: time window for split partitioning (default 900s / 15 min). Must divide 3600.

Updates docs/configuration/index-config.md with:

- "Parquet indexing settings" section explaining both parameters
- Full sort schema syntax reference (column types, direction overrides, & LSM cutoff marker)
- Examples showing minimal, custom, and advanced configurations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent ea9adea commit 1fbece5

3 files changed

Lines changed: 133 additions & 3 deletions

docs/configuration/index-config.md

Lines changed: 52 additions & 0 deletions
@@ -596,6 +596,7 @@ This section describes indexing settings for a given index.
 | `split_num_docs_target` | Target number of docs per split. | `10000000` |
 | `merge_policy` | Describes the strategy used to trigger split merge operations for logs/traces (see [Merge policies](#merge-policies) section below). |
 | `parquet_merge_policy` | Describes the merge policy for Parquet (metrics/sketches) splits (see [Parquet merge policy](#parquet-merge-policy) section below). |
+| `parquet_indexing` | Parquet-specific indexing settings: sort schema, window duration (see [Parquet indexing settings](#parquet-indexing-settings) section below). |
 | `resources.heap_size` | Indexer heap size per source per index. | `2000000000` |
 | `docstore_compression_level` | Level of compression used by zstd for the docstore. Lower values may increase ingest speed, at the cost of index size | `8` |
 | `docstore_blocksize` | Size of blocks in the docstore, in bytes. Lower values may improve doc retrieval speed, at the cost of index size | `1000000` |
@@ -688,6 +689,57 @@ indexing_settings:
     type: "no_merge"
 ```

+### Parquet indexing settings
+
+*For indexes using the Parquet indexing pipeline (metrics, sketches).*
+
+These settings control how the Parquet pipeline sorts, windows, and writes incoming data. They affect both ingest-time performance and downstream query/compaction efficiency.
+
+```yaml
+version: 0.7
+index_id: "my-metrics-index"
+# ...
+indexing_settings:
+  parquet_indexing:
+    sort_fields: "metric_name|service|env|host|timeseries_id|timestamp_secs/V2"
+    window_duration_secs: 900
+```
+
+| Variable | Description | Default value |
+| ------------- | ------------- | ------------- |
+| `sort_fields` | Sort schema for row ordering in Parquet files (see syntax below). When omitted, the product-type default is used. | `metric_name\|service\|env\|datacenter\|region\|host\|timeseries_id\|timestamp_secs/V2` |
+| `window_duration_secs` | Time window duration in seconds for split partitioning. Must evenly divide 3600. Larger values = fewer splits but coarser time pruning. | `900` (15 minutes) |
+
+#### Sort schema syntax
+
+The sort schema uses pipe-delimited column names with a `/V2` version suffix:
+
+```text
+column1|column2|...|timestamp_secs/V2
+```
+
+**Column types** are inferred from name suffixes:
+- `__s` → string (e.g., `custom_tag__s`)
+- `__i` → int64 (e.g., `priority__i`)
+- Well-known names like `metric_name`, `service`, `env`, `host`, `timestamp_secs`, and `timeseries_id` have built-in type mappings and don't need suffixes.
+
+**Sort direction** defaults to ascending for most columns and descending for timestamp columns. Override with `+` (ascending) or `-` (descending) as a prefix or suffix on the column name:
+
+```text
+# Explicit descending timestamp
+metric_name|host|-timestamp_secs/V2
+
+# Ascending host (default), descending timestamp (default)
+metric_name|host|timestamp_secs/V2
+```
+
+**How the sort schema affects behavior:**
+- **Query pruning**: queries filtering on leading columns (e.g., `metric_name`) can skip entire splits whose row key ranges don't match.
+- **Compression**: grouping similar values together (e.g., all rows for the same metric name) improves columnar compression ratios.
+- **Compaction scope**: splits with different sort schemas are never merged together. Changing the sort schema on an existing index creates a new compaction scope; old splits are not re-sorted.
+
+**The `&` marker** (advanced) sets the LSM comparison cutoff: columns after `&` are used for sort order but not for compaction locality decisions. For example, `metric_name|&host|timestamp_secs/V2` sorts by metric_name then host, but only metric_name determines which splits can be merged.
+
 #### Parquet merge policy

 *For indexes using the Parquet indexing pipeline (metrics, sketches).*
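
Note on the sort schema syntax documented in this file: the rules for the `/V2` suffix, the `+`/`-` direction overrides, and the `&` cutoff marker can be illustrated with a small parser. This is a minimal sketch under stated assumptions, not the parser Quickwit uses. `SortColumn` and `parse_sort_fields` are hypothetical names. Treating `&` as a prefix on the first column after the cutoff follows the documented example, and detecting timestamp columns by a `timestamp` name prefix is an assumption made only for illustration.

```rust
/// One column of a parsed sort schema (hypothetical type, for illustration).
#[derive(Debug, PartialEq)]
struct SortColumn {
    name: String,
    /// True when the column sorts in descending order.
    descending: bool,
    /// True for columns at or after the `&` LSM cutoff marker.
    after_cutoff: bool,
}

/// Parses a schema such as `metric_name|&host|-timestamp_secs/V2`.
fn parse_sort_fields(schema: &str) -> Result<Vec<SortColumn>, String> {
    // The version suffix is mandatory in the documented syntax.
    let body = schema
        .strip_suffix("/V2")
        .ok_or_else(|| format!("missing `/V2` version suffix: {schema}"))?;

    let mut columns = Vec::new();
    let mut after_cutoff = false;
    for raw in body.split('|') {
        let mut token = raw.trim();
        // Assumption: `&` is written as a prefix of the first column after the cutoff.
        if let Some(rest) = token.strip_prefix('&') {
            after_cutoff = true;
            token = rest;
        }
        // `+` / `-` may appear as a prefix or a suffix on the column name.
        let (name, explicit_desc) = if let Some(rest) = token.strip_prefix('+') {
            (rest, Some(false))
        } else if let Some(rest) = token.strip_prefix('-') {
            (rest, Some(true))
        } else if let Some(rest) = token.strip_suffix('+') {
            (rest, Some(false))
        } else if let Some(rest) = token.strip_suffix('-') {
            (rest, Some(true))
        } else {
            (token, None)
        };
        if name.is_empty() {
            return Err(format!("empty column name in sort schema: {schema}"));
        }
        // Assumption: a `timestamp` name prefix marks a timestamp column,
        // which defaults to descending; every other column defaults to ascending.
        let descending = explicit_desc.unwrap_or_else(|| name.starts_with("timestamp"));
        columns.push(SortColumn {
            name: name.to_string(),
            descending,
            after_cutoff,
        });
    }
    Ok(columns)
}
```

Under these assumptions, parsing `metric_name|&host|timestamp_secs/V2` yields three columns, with `host` and `timestamp_secs` flagged as falling after the cutoff and `timestamp_secs` sorted descending by default.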

quickwit/quickwit-config/src/index_config/mod.rs

Lines changed: 77 additions & 0 deletions
@@ -123,10 +123,86 @@ pub struct IndexingSettings {
     /// indexes that use the Parquet indexing pipeline.
     #[serde(default)]
     pub parquet_merge_policy: ParquetMergePolicyConfig,
+    /// Parquet-specific indexing settings (sort schema, window duration,
+    /// compression). Only used by indexes that use the Parquet pipeline.
+    #[serde(default)]
+    pub parquet_indexing: ParquetIndexingConfig,
     #[serde(default)]
     pub resources: IndexingResources,
 }

+/// Configuration for the Parquet indexing pipeline (metrics, sketches).
+///
+/// Controls how incoming data is sorted, windowed, and compressed before
+/// writing to Parquet split files. These settings affect both ingest-time
+/// performance and downstream query/compaction efficiency.
+#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Hash, utoipa::ToSchema)]
+#[serde(deny_unknown_fields)]
+pub struct ParquetIndexingConfig {
+    /// Sort schema defining the physical sort order of rows in Parquet files.
+    ///
+    /// Uses Husky-style pipe-delimited syntax with a `/V2` version suffix.
+    /// Each column is sorted ascending by default; use `+` or `-` prefix/suffix
+    /// to override. Column types are inferred from well-known suffixes
+    /// (`__s` = string, `__i` = int64, `_secs` = uint64 timestamp).
+    ///
+    /// The sort order determines:
+    /// - **Query pruning**: queries that filter on leading sort columns can
+    ///   skip entire splits whose row key ranges don't match.
+    /// - **Compression**: columns with good locality (e.g., metric_name first)
+    ///   compress better in Parquet's columnar format.
+    /// - **Compaction scope**: splits with different sort schemas are never
+    ///   merged together.
+    ///
+    /// When `None`, the product-type default is used (see below).
+    ///
+    /// # Default (metrics/sketches)
+    /// ```text
+    /// metric_name|service|env|datacenter|region|host|timeseries_id|timestamp_secs/V2
+    /// ```
+    ///
+    /// # Examples
+    /// ```text
+    /// # Minimal: just metric name and timestamp
+    /// metric_name|timestamp_secs/V2
+    ///
+    /// # Custom tags in sort order
+    /// metric_name|service|cluster|host|timestamp_secs/V2
+    ///
+    /// # Explicit descending timestamp
+    /// metric_name|host|-timestamp_secs/V2
+    /// ```
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub sort_fields: Option<String>,
+
+    /// Time window duration in seconds for split partitioning.
+    ///
+    /// Incoming data is partitioned into time windows of this duration.
+    /// Splits within the same window may be compacted together; splits in
+    /// different windows are never merged. Must evenly divide 3600 (one hour).
+    ///
+    /// Larger values produce fewer, larger splits (better for bulk queries)
+    /// but coarser time-based pruning. Smaller values give finer pruning
+    /// but more splits to manage.
+    #[serde(default = "ParquetIndexingConfig::default_window_duration_secs")]
+    pub window_duration_secs: u32,
+}
+
+impl ParquetIndexingConfig {
+    fn default_window_duration_secs() -> u32 {
+        900
+    }
+}
+
+impl Default for ParquetIndexingConfig {
+    fn default() -> Self {
+        Self {
+            sort_fields: None,
+            window_duration_secs: Self::default_window_duration_secs(),
+        }
+    }
+}
+
 impl IndexingSettings {
     pub fn commit_timeout(&self) -> Duration {
         Duration::from_secs(self.commit_timeout_secs as u64)
@@ -166,6 +242,7 @@ impl Default for IndexingSettings {
             split_num_docs_target: Self::default_split_num_docs_target(),
             merge_policy: MergePolicyConfig::default(),
             parquet_merge_policy: ParquetMergePolicyConfig::default(),
+            parquet_indexing: ParquetIndexingConfig::default(),
             resources: IndexingResources::default(),
         }
     }
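
Note on `window_duration_secs`: the doc comment states that the value must evenly divide 3600, but this diff does not show where that constraint is enforced. The following is a minimal sketch of what such a check and the resulting window assignment could look like. `validate_window_duration_secs` and `window_start_secs` are hypothetical helpers, not functions from this commit, and aligning windows to multiples of the duration since the Unix epoch is an assumption.

```rust
/// Hypothetical validator: accepts only non-zero durations that evenly divide one hour.
fn validate_window_duration_secs(secs: u32) -> Result<(), String> {
    if secs == 0 || 3600 % secs != 0 {
        return Err(format!(
            "window_duration_secs must evenly divide 3600, got {secs}"
        ));
    }
    Ok(())
}

/// Hypothetical helper: start of the window containing `timestamp_secs`.
///
/// Assumes windows are aligned to multiples of the duration since the Unix
/// epoch. Callers are expected to validate the duration first; a zero
/// duration would panic on the remainder operation.
fn window_start_secs(timestamp_secs: u64, window_duration_secs: u32) -> u64 {
    let window = u64::from(window_duration_secs);
    timestamp_secs - timestamp_secs % window
}
```

With the default of 900 seconds, a timestamp of 3725 falls in the window starting at 3600. Because the duration divides 3600, window boundaries under this alignment always coincide with hour boundaries, which is one plausible reason for the constraint.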

quickwit/quickwit-config/src/lib.rs

Lines changed: 4 additions & 3 deletions
@@ -45,9 +45,9 @@ pub use cluster_config::ClusterConfig;
 // See #2048
 use index_config::serialize::{IndexConfigV0_8, VersionedIndexConfig};
 pub use index_config::{
-    IndexConfig, IndexingResources, IndexingSettings, IngestSettings, RetentionPolicy,
-    SearchSettings, build_doc_mapper, load_index_config_from_user_config, load_index_config_update,
-    prepare_doc_mapping_update,
+    IndexConfig, IndexingResources, IndexingSettings, IngestSettings, ParquetIndexingConfig,
+    RetentionPolicy, SearchSettings, build_doc_mapper, load_index_config_from_user_config,
+    load_index_config_update, prepare_doc_mapping_update,
 };
 pub use quickwit_doc_mapper::DocMapping;
 use serde::Serialize;
@@ -114,6 +114,7 @@ pub fn disable_ingest_v1() -> bool {
     KafkaSourceParams,
     KinesisSourceParams,
     MergePolicyConfig,
+    ParquetIndexingConfig,
     ParquetMergePolicyConfig,
     PubSubSourceParams,
     PulsarSourceAuth,
