memchr,https://github.com/BurntSushi/memchr,Unlicense OR MIT,"Andrew Gallant <jamslam@gmail.com>, bluss"
470
469
memmap2,https://github.com/RazrFalcon/memmap2-rs,MIT OR Apache-2.0,"Dan Burkert <dan@danburkert.com>, Yevhenii Reizner <razrfalcon@gmail.com>, The Contributors"
471
470
metrics,https://github.com/metrics-rs/metrics,MIT,Toby Lawrence <toby@nuclearfurnace.com>
472
-
metrics-exporter-dogstatsd,https://github.com/metrics-rs/metrics,MIT,Toby Lawrence <toby@nuclearfurnace.com>
473
471
metrics-exporter-otel,https://github.com/palindrom615/metrics,MIT,Whoemoon Jang <palindrom615@gmail.com>
474
472
metrics-exporter-prometheus,https://github.com/metrics-rs/metrics,MIT AND Apache-2.0,Toby Lawrence <toby@nuclearfurnace.com>
475
473
metrics-util,https://github.com/metrics-rs/metrics,MIT,Toby Lawrence <toby@nuclearfurnace.com>
rand,https://github.com/rust-random/rand,MIT OR Apache-2.0,"The Rand Project Developers, The Rust Project Developers"
658
654
rand_chacha,https://github.com/rust-random/rand,MIT OR Apache-2.0,"The Rand Project Developers, The Rust Project Developers, The CryptoCorrosion Contributors"
659
655
rand_core,https://github.com/rust-random/rand,MIT OR Apache-2.0,"The Rand Project Developers, The Rust Project Developers"
docs/internals/adr/001-parquet-data-model.md
+5 -7 (5 additions & 7 deletions)
@@ -93,11 +93,9 @@ This is a **schema-on-read** approach: the storage layer stores data in whatever
93
93
94
94
**Transition.** The current OTel map-based ingestion format is the starting point. The indexing pipeline can extract attributes into columns at write time, presenting the original OTel map interface at the API boundary while storing columnar data internally. This is transparent to ingest clients — they continue sending OTel-format data. Queries can access attributes either by the original map path (for compatibility) or by direct column access (for performance). The storage representation is an internal optimization, not a change to the external data model.
95
95
96
-
### 6. RLE/Dictionary Encoding and the Flurry Project
96
+
### 6. RLE/Dictionary Encoding Preservation
97
97
98
-
The point-per-row model's performance depends on columnar encodings being preserved through the query pipeline. Currently, RLE and dictionary encoding are decoded to plain arrays early in DataFusion's execution. There is significant ongoing investment in **Flurry** (the metrics equivalent of Bolt) to preserve these encodings through more operators.
99
-
100
-
As Flurry matures, the performance benefits of sorted point-per-row data increase: longer runs in sorted columns translate directly to better RLE compression ratios that are maintained through query execution. This makes point-per-row a bet that improves over time rather than a static trade-off.
98
+
The point-per-row model's performance depends on columnar encodings being preserved through the query pipeline. Currently, RLE and dictionary encoding are decoded to plain arrays early in DataFusion's execution. As DataFusion grows operator-level support for these encodings, the performance benefits of sorted point-per-row data increase: longer runs in sorted columns translate directly to better RLE compression ratios that are maintained through query execution. This makes point-per-row a bet that improves over time rather than a static trade-off.
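To illustrate why longer runs in sorted columns help RLE, here is a minimal sketch of run-length encoding over an in-memory column. It is not DataFusion's or Parquet's encoder; `rle_encode` is a hypothetical helper written only for this example.

```rust
/// Run-length encode a column into (value, run_length) pairs.
/// Hypothetical helper for illustration; not a DataFusion or Parquet API.
fn rle_encode<T: PartialEq + Clone>(column: &[T]) -> Vec<(T, usize)> {
    let mut runs: Vec<(T, usize)> = Vec::new();
    for value in column {
        // Extend the current run when the value repeats.
        if let Some((last, len)) = runs.last_mut() {
            if *last == *value {
                *len += 1;
                continue;
            }
        }
        // Otherwise start a new run of length 1.
        runs.push((value.clone(), 1));
    }
    runs
}
```

On an interleaved tag column such as `["a", "b", "a", "b", "a", "b"]` this produces six runs of length one; after sorting, the same values collapse into two runs of length three. An operator that works on the runs directly handles two entries instead of six, which is the benefit that is lost when encodings are decoded to plain arrays early.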
101
99
102
100
## Invariants
103
101
@@ -123,14 +121,14 @@ These invariants must hold across all code paths (ingestion, compaction, query).
123
121
124
122
### Negative
125
123
126
-
- **Tag redundancy.** Every row for the same timeseries repeats all tag values. In timeseries-per-row, tags are stored once per series. With good columnar encoding on sorted data, this redundancy compresses away, but it is still present in the uncompressed representation and affects memory usage during query execution until Flurry-style encoding preservation is complete.
124
+
- **Tag redundancy.** Every row for the same timeseries repeats all tag values. In timeseries-per-row, tags are stored once per series. With good columnar encoding on sorted data, this redundancy compresses away, but it is still present in the uncompressed representation and affects memory usage during query execution until DataFusion preserves dictionary/RLE encoding through more operators.
127
125
- **OTel map attributes defeat columnar benefits.** The current OTel ingest schema stores attributes as key-value maps. Until schema-on-read column extraction is implemented, attributes cannot participate in sorting, page-level pruning, or efficient columnar compression. This is the most significant near-term limitation of the data model.
128
126
- **No intra-series locality guarantee.** Without `timeseries_id` in the sort schema, points from the same series may be interleaved with points from other series that share the same sort-column values. This is a configuration choice, not an inherent limitation.
129
127
- **Duplicate points are stored.** Without LWW or per-point dedup, retried ingestion or overlapping sources can produce duplicate points. Existing batch-level dedup (WAL checkpoints, file-level tracking) prevents most duplicates, but cross-request duplicates are possible. See [GAP-005](./gaps/005-no-per-point-deduplication.md).
130
128
131
129
### Risks
132
130
133
-
- **Flurry dependency for performance parity.** Until RLE/dictionary encoding is preserved through DataFusion, point-per-row may scan more data than timeseries-per-row for series-centric queries (e.g., "plot CPU for host X"). The magnitude depends on the encoding preservation timeline.
131
+
- **Encoding-preservation dependency for performance parity.** Until RLE/dictionary encoding is preserved through DataFusion, point-per-row may scan more data than timeseries-per-row for series-centric queries (e.g., "plot CPU for host X"). The magnitude depends on the encoding-preservation timeline.
134
132
- **Wide tables (future research).** Metrics from the same source share nearly identical tags. Multiple metric names could be stored as separate value columns in a single wide row (e.g., `k8s.cpu.usage`, `k8s.cpu.limit`, `k8s.mem.usage` as columns sharing one tag set). This is the approach taken by TimescaleDB's hypertables. It would amortize tag storage further but requires significant compactor changes. Worth investigating as future research; it is compatible with point-per-row as an evolution, not a replacement.
135
133
136
134
## Signal Generalization
@@ -147,7 +145,7 @@ The no-LWW and no-storage-interpolation decisions are universal across signals.
147
145
| Date | Decision | Rationale |
148
146
|------|----------|-----------|
149
147
| 2026-02-19 | Initial ADR created | Establish foundational data model for Parquet metrics pipeline |
150
-
| 2026-02-19 | Point-per-row chosen over timeseries-per-row | Simpler compaction, no LWW, standard DataFusion operators. Performance parity via columnar encoding + Flurry |
148
+
| 2026-02-19 | Point-per-row chosen over timeseries-per-row | Simpler compaction, no LWW, standard DataFusion operators. Performance parity via columnar encoding and dictionary/RLE preservation through more operators |
151
149
| 2026-02-19 | No LWW semantics | Eliminates sticky routing and series-level dedup. Simplifies ingestion and compaction |
152
150
| 2026-02-19 | Dedup clarified: batch-level exists, per-point does not | WAL checkpoints provide exactly-once at the batch level. File-level dedup for queue sources. Per-point dedup not implemented; identified as GAP-005 if needed |
153
151
| 2026-02-19 | timeseries_id defined as optional synthetic column | Provides intra-group locality tiebreaker without adding complexity to the core data model |
docs/internals/adr/002-sort-schema-parquet-splits.md
+3 -4 (3 additions & 4 deletions)
@@ -20,7 +20,7 @@ Sorting rows within each split by a schema aligned with common query predicates
20
20
1. **Compression improvement.** Columnar formats like Parquet compress data by encoding runs of similar values. When rows are sorted by metric name and tags, the columns for those fields contain long runs of identical or similar values, benefiting RLE, dictionary encoding, and general-purpose compression (ZSTD). In Husky Phase 1, this yielded ~33% size reduction for APM data and ~25% for Logs data.
21
21
2. **Query efficiency.** Parquet's column index (format v2) stores min/max statistics per page within each column chunk. When data is sorted, pages within each column naturally have non-overlapping value ranges for the sort columns. DataFusion supports page index pruning, allowing it to skip pages that cannot match a query predicate.
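The pruning effect described in the second point can be sketched with a simplified model of per-page min/max statistics. This is an illustration under assumptions, not the Parquet column index format or DataFusion's pruning code; `PageStats`, `page_stats`, and `pages_to_scan` are hypothetical names.

```rust
/// Min/max statistics for one page of a column.
/// Hypothetical, simplified stand-in for a Parquet column index entry.
#[derive(Debug, Clone, Copy)]
struct PageStats {
    min: i64,
    max: i64,
}

/// Split a column into fixed-size pages and compute min/max per page.
/// `page_size` must be non-zero (`chunks` panics on zero).
fn page_stats(column: &[i64], page_size: usize) -> Vec<PageStats> {
    column
        .chunks(page_size)
        .map(|page| PageStats {
            // `chunks` never yields an empty slice, so min/max always exist.
            min: *page.iter().min().unwrap(),
            max: *page.iter().max().unwrap(),
        })
        .collect()
}

/// Indices of pages whose [min, max] range may contain `value`.
/// Every other page can be skipped for an equality predicate.
fn pages_to_scan(stats: &[PageStats], value: i64) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, s)| s.min <= value && value <= s.max)
        .map(|(i, _)| i)
        .collect()
}
```

With the column `[3, 1, 2, 1, 3, 2, 1, 2, 3]` and three values per page, every page spans the range 1 to 3, so a lookup for `2` must scan all three pages. After sorting, the pages cover `[1, 1]`, `[2, 2]`, and `[3, 3]`, and only the middle page qualifies.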
22
22
23
-
Matthew Kim's implementation added a fixed sort on `(MetricName, TagService, TagEnv, TagDatacenter, TagRegion, TagHost, TimestampSecs)` in the Parquet writer (`quickwit-parquet-engine/src/storage/writer.rs`), demonstrating that sorting is feasible and inexpensive. However, this sort order is hardcoded in `ParquetField::sort_order()` and cannot be customized per index or deployment. Different workloads have different high-value columns; a metrics index tracking Kubernetes containers benefits from sorting by `pod` and `namespace`, while an infrastructure metrics index benefits from `host` and `datacenter`.
23
+
An initial implementation added a fixed sort on `(MetricName, TagService, TagEnv, TagDatacenter, TagRegion, TagHost, TimestampSecs)` in the Parquet writer (`quickwit-parquet-engine/src/storage/writer.rs`), demonstrating that sorting is feasible and inexpensive. However, this sort order is hardcoded in `ParquetField::sort_order()` and cannot be customized per index or deployment. Different workloads have different high-value columns; a metrics index tracking Kubernetes containers benefits from sorting by `pod` and `namespace`, while an infrastructure metrics index benefits from `host` and `datacenter`.
24
24
25
25
This ADR formalizes the sort schema as a configurable, per-index property stored in the metastore.
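As a rough sketch of what a configurable sort means in practice, the example below sorts rows by a caller-supplied list of tag columns, with the timestamp as the final tiebreaker. It does not reflect the actual writer or `ParquetField::sort_order()`; `Row`, `row`, and `sort_rows` are hypothetical, and placing rows with a missing tag first is an arbitrary choice made for this sketch.

```rust
use std::cmp::Ordering;
use std::collections::BTreeMap;

/// Hypothetical in-memory row: string tags plus a timestamp.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    tags: BTreeMap<String, String>,
    timestamp_secs: i64,
}

/// Convenience constructor used only by this example.
fn row(tags: &[(&str, &str)], timestamp_secs: i64) -> Row {
    Row {
        tags: tags
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect(),
        timestamp_secs,
    }
}

/// Sort rows by the given tag columns in order, then by timestamp.
/// A missing tag compares as `None`, which orders before any present value.
fn sort_rows(rows: &mut [Row], sort_columns: &[&str]) {
    rows.sort_by(|a, b| {
        for col in sort_columns {
            let ord = a.tags.get(*col).cmp(&b.tags.get(*col));
            if ord != Ordering::Equal {
                return ord;
            }
        }
        a.timestamp_secs.cmp(&b.timestamp_secs)
    });
}
```

A Kubernetes-oriented index could call `sort_rows(&mut rows, &["namespace", "pod"])`, while an infrastructure index could pass `&["datacenter", "host"]`; the comparison logic stays the same and only the column list changes.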
26
26
@@ -169,7 +169,7 @@ Phase 4 of the locality compaction roadmap extends sorting to the Tantivy pipeli
169
169
170
170
| Component | Location | Status |
171
171
|-----------|----------|--------|
172
-
| Fixed sort at ingestion | `quickwit-parquet-engine/src/storage/writer.rs` | Done (Matthew Kim). Replaced by configurable sort in PR #6287 |
172
+
| Fixed sort at ingestion | `quickwit-parquet-engine/src/storage/writer.rs` | Done. Replaced by configurable sort in PR #6287 |
@@ -49,7 +48,7 @@ Quickwit tracks architectural change through three lenses. See **[EVOLUTION.md](
49
48
50
49
### Characteristics (What we need)
51
50
52
-
Product requirements and capabilities we must have. See [ADR-004](./004-cloud-native-storage-characteristics.md) for the full characteristic status matrix.
51
+
Product requirements and capabilities we must have.
docs/internals/adr/gaps/002-fixed-sort-schema.md
+1 -1 (1 addition & 1 deletion)
@@ -2,7 +2,7 @@
2
2
3
3
**Status**: Partially resolved
4
4
**Discovered**: 2026-02-19
5
-
**Context**: Codebase analysis during Phase 1 locality compaction design. Sort implementation by Matthew Kim provides the foundation but is not configurable.
5
+
**Context**: Codebase analysis during Phase 1 locality compaction design. The initial sort implementation provides the foundation but is not configurable.
6
6
**Resolution**: PRs #6287–#6292 replaced the hardcoded sort with a configurable `TableConfig` + sort schema parser. Remaining: per-index metastore storage, pipeline propagation, null ordering fix.