diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
Lines changed: 16 additions & 0 deletions
diff --git a/LICENSE-3rdparty.csv b/LICENSE-3rdparty.csv
Lines changed: 0 additions & 4 deletions
diff --git a/docs/internals/UPSTREAM-CANDIDATES.md b/docs/internals/UPSTREAM-CANDIDATES.md
Lines changed: 0 additions & 30 deletions
diff --git a/docs/internals/adr/001-parquet-data-model.md b/docs/internals/adr/001-parquet-data-model.md
Lines changed: 5 additions & 7 deletions
diff --git a/docs/internals/adr/002-sort-schema-parquet-splits.md b/docs/internals/adr/002-sort-schema-parquet-splits.md
Lines changed: 3 additions & 4 deletions
diff --git a/docs/internals/adr/003-time-windowed-sorted-compaction.md b/docs/internals/adr/003-time-windowed-sorted-compaction.md
Lines changed: 0 additions & 1 deletion
diff --git a/docs/internals/adr/README.md b/docs/internals/adr/README.md
Lines changed: 1 addition & 2 deletions
diff --git a/docs/internals/adr/gaps/002-fixed-sort-schema.md b/docs/internals/adr/gaps/002-fixed-sort-schema.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/internals/adr/gaps/006-no-independent-auto-scaling.md b/docs/internals/adr/gaps/006-no-independent-auto-scaling.md
Lines changed: 1 addition & 5 deletions
@@ -0,0 +1,16 @@
+# CODEOWNERS — see https://docs.github.com/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners
+#
+# Last matching rule per file wins. Approval from any listed owner satisfies
+# the requirement. quickwit-dev is listed on every rule so it can always
+# approve; an additional team is listed on the metrics Parquet pipeline paths
+# so PRs scoped to those paths can be approved by either team.
+
+# Default: quickwit-core owns everything
+*                                                        @quickwit-oss/quickwit-core
+
+# byoc-metrics paths — owned by byoc-metrics
+/quickwit/quickwit-parquet-engine/                       @quickwit-oss/byoc-metrics
+/quickwit/quickwit-datafusion/                           @quickwit-oss/byoc-metrics
+/quickwit/quickwit-df-core/                              @quickwit-oss/byoc-metrics
+/quickwit/quickwit-dst/                                  @quickwit-oss/byoc-metrics
+/quickwit/quickwit-indexing/src/actors/metrics_pipeline/ @quickwit-oss/byoc-metrics
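
Note on the CODEOWNERS hunk above: the "last matching rule per file wins" behavior stated in the header comment can be sketched as below. This is a minimal illustration, not GitHub's matcher. It assumes only the two pattern shapes used in this file (the bare `*` default and root-anchored directory patterns ending in `/`). The helper names `matches` and `owners_for` and the example file paths are hypothetical. Under this rule, a path that matches a later, more specific line takes its owners from that line only.

```python
# Rules copied from the CODEOWNERS hunk above, in file order.
RULES = [
    ("*", ["@quickwit-oss/quickwit-core"]),
    ("/quickwit/quickwit-parquet-engine/", ["@quickwit-oss/byoc-metrics"]),
    ("/quickwit/quickwit-datafusion/", ["@quickwit-oss/byoc-metrics"]),
    ("/quickwit/quickwit-df-core/", ["@quickwit-oss/byoc-metrics"]),
    ("/quickwit/quickwit-dst/", ["@quickwit-oss/byoc-metrics"]),
    ("/quickwit/quickwit-indexing/src/actors/metrics_pipeline/",
     ["@quickwit-oss/byoc-metrics"]),
]


def matches(pattern: str, path: str) -> bool:
    """Simplified matcher (assumption): handles only '*' and '/dir/' patterns."""
    if pattern == "*":
        return True
    if pattern.startswith("/") and pattern.endswith("/"):
        # Root-anchored directory: matches every file underneath it.
        return ("/" + path).startswith(pattern)
    raise ValueError(f"pattern shape not covered by this sketch: {pattern}")


def owners_for(path: str, rules=RULES) -> list[str]:
    """Return the owners of the last rule that matches `path`."""
    owners: list[str] = []
    for pattern, rule_owners in rules:
        if matches(pattern, path):
            owners = rule_owners  # later matches override earlier ones
    return owners


# Hypothetical file paths, used only to exercise the rules.
print(owners_for("README.md"))
print(owners_for("quickwit/quickwit-datafusion/src/lib.rs"))
print(owners_for("quickwit/quickwit-indexing/src/actors/merge_planner.rs"))
```

In this sketch, only files under the listed metrics directories resolve to `@quickwit-oss/byoc-metrics`; everything else, including the rest of `quickwit-indexing`, falls back to the `*` rule.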
@@ -284,7 +284,6 @@ embedded-io,https://github.com/rust-embedded/embedded-hal,MIT OR Apache-2.0,The
 ena,https://github.com/rust-lang/ena,MIT OR Apache-2.0,Niko Matsakis <niko@alum.mit.edu>
 encode_unicode,https://github.com/tormol/encode_unicode,Apache-2.0 OR MIT,Torbjørn Birch Moltu <t.b.moltu@lyse.net>
 encoding_rs,https://github.com/hsivonen/encoding_rs,(Apache-2.0 OR MIT) AND BSD-3-Clause,Henri Sivonen <hsivonen@hsivonen.fi>
-endian-type,https://github.com/Lolirofle/endian-type,MIT,Lolirofle <lolipopple@hotmail.com>
 enum-iterator,https://github.com/stephaneyfx/enum-iterator,0BSD,Stephane Raux <stephaneyfx@gmail.com>
 enum-iterator-derive,https://github.com/stephaneyfx/enum-iterator,0BSD,Stephane Raux <stephaneyfx@gmail.com>
 env_filter,https://github.com/rust-cli/env_logger,MIT OR Apache-2.0,The env_filter Authors
@@ -469,7 +468,6 @@ measure_time,https://github.com/PSeitz/rust_measure_time,MIT,Pascal Seitz <pasca
 memchr,https://github.com/BurntSushi/memchr,Unlicense OR MIT,"Andrew Gallant <jamslam@gmail.com>, bluss"
 memmap2,https://github.com/RazrFalcon/memmap2-rs,MIT OR Apache-2.0,"Dan Burkert <dan@danburkert.com>, Yevhenii Reizner <razrfalcon@gmail.com>, The Contributors"
 metrics,https://github.com/metrics-rs/metrics,MIT,Toby Lawrence <toby@nuclearfurnace.com>
-metrics-exporter-dogstatsd,https://github.com/metrics-rs/metrics,MIT,Toby Lawrence <toby@nuclearfurnace.com>
 metrics-exporter-otel,https://github.com/palindrom615/metrics,MIT,Whoemoon Jang <palindrom615@gmail.com>
 metrics-exporter-prometheus,https://github.com/metrics-rs/metrics,MIT AND Apache-2.0,Toby Lawrence <toby@nuclearfurnace.com>
 metrics-util,https://github.com/metrics-rs/metrics,MIT,Toby Lawrence <toby@nuclearfurnace.com>
@@ -491,7 +489,6 @@ murmurhash32,https://github.com/quickwit-inc/murmurhash32,MIT,Paul Masurel <paul
 native-tls,https://github.com/rust-native-tls/rust-native-tls,MIT OR Apache-2.0,Steven Fackler <sfackler@gmail.com>
 new_debug_unreachable,https://github.com/mbrubeck/rust-debug-unreachable,MIT,"Matt Brubeck <mbrubeck@limpet.net>, Jonathan Reem <jonathan.reem@gmail.com>"
 new_string_template,https://github.com/hasezoey/new_string_template,MIT,hasezoey <hasezoey@gmail.com>
-nibble_vec,https://github.com/michaelsproul/rust_nibble_vec,MIT,Michael Sproul <micsproul@gmail.com>
 nix,https://github.com/nix-rust/nix,MIT,The nix-rust Project Developers
 no-std-net,https://github.com/dunmatt/no-std-net,MIT,M@ Dunlap <mattdunlap@gmail.com>
 nohash-hasher,https://github.com/paritytech/nohash-hasher,Apache-2.0 OR MIT,Parity Technologies <admin@parity.io>
@@ -653,7 +650,6 @@ quinn-udp,https://github.com/quinn-rs/quinn,MIT OR Apache-2.0,The quinn-udp Auth
 quote,https://github.com/dtolnay/quote,MIT OR Apache-2.0,David Tolnay <dtolnay@gmail.com>
 quoted_printable,https://github.com/staktrace/quoted-printable,0BSD,Kartikaya Gupta <kats@seldon.staktrace.com>
 r-efi,https://github.com/r-efi/r-efi,MIT OR Apache-2.0 OR LGPL-2.1-or-later,The r-efi Authors
-radix_trie,https://github.com/michaelsproul/rust_radix_trie,MIT,Michael Sproul <micsproul@gmail.com>
 rand,https://github.com/rust-random/rand,MIT OR Apache-2.0,"The Rand Project Developers, The Rust Project Developers"
 rand_chacha,https://github.com/rust-random/rand,MIT OR Apache-2.0,"The Rand Project Developers, The Rust Project Developers, The CryptoCorrosion Contributors"
 rand_core,https://github.com/rust-random/rand,MIT OR Apache-2.0,"The Rand Project Developers, The Rust Project Developers"
 
@@ -93,11 +93,9 @@ This is a **schema-on-read** approach: the storage layer stores data in whatever
 
 **Transition.** The current OTel map-based ingestion format is the starting point. The indexing pipeline can extract attributes into columns at write time, presenting the original OTel map interface at the API boundary while storing columnar data internally. This is transparent to ingest clients — they continue sending OTel-format data. Queries can access attributes either by the original map path (for compatibility) or by direct column access (for performance). The storage representation is an internal optimization, not a change to the external data model.
 
-### 6. RLE/Dictionary Encoding and the Flurry Project
+### 6. RLE/Dictionary Encoding Preservation
 
-The point-per-row model's performance depends on columnar encodings being preserved through the query pipeline. Currently, RLE and dictionary encoding are decoded to plain arrays early in DataFusion's execution. There is significant ongoing investment in **Flurry** (the metrics equivalent of Bolt) to preserve these encodings through more operators.
-
-As Flurry matures, the performance benefits of sorted point-per-row data increase: longer runs in sorted columns translate directly to better RLE compression ratios that are maintained through query execution. This makes point-per-row a bet that improves over time rather than a static trade-off.
+The point-per-row model's performance depends on columnar encodings being preserved through the query pipeline. Currently, RLE and dictionary encoding are decoded to plain arrays early in DataFusion's execution. As DataFusion grows operator-level support for these encodings, the performance benefits of sorted point-per-row data increase: longer runs in sorted columns translate directly to better RLE compression ratios that are maintained through query execution. This makes point-per-row a bet that improves over time rather than a static trade-off.
 
 ## Invariants
 
@@ -123,14 +121,14 @@ These invariants must hold across all code paths (ingestion, compaction, query).
 
 ### Negative
 
-- **Tag redundancy.** Every row for the same timeseries repeats all tag values. In timeseries-per-row, tags are stored once per series. With good columnar encoding on sorted data, this redundancy compresses away, but it is still present in the uncompressed representation and affects memory usage during query execution until Flurry-style encoding preservation is complete.
+- **Tag redundancy.** Every row for the same timeseries repeats all tag values. In timeseries-per-row, tags are stored once per series. With good columnar encoding on sorted data, this redundancy compresses away, but it is still present in the uncompressed representation and affects memory usage during query execution until DataFusion preserves dictionary/RLE encoding through more operators.
 - **OTel map attributes defeat columnar benefits.** The current OTel ingest schema stores attributes as key-value maps. Until schema-on-read column extraction is implemented, attributes cannot participate in sorting, page-level pruning, or efficient columnar compression. This is the most significant near-term limitation of the data model.
 - **No intra-series locality guarantee.** Without `timeseries_id` in the sort schema, points from the same series may be interleaved with points from other series that share the same sort-column values. This is a configuration choice, not an inherent limitation.
 - **Duplicate points are stored.** Without LWW or per-point dedup, retried ingestion or overlapping sources can produce duplicate points. Existing batch-level dedup (WAL checkpoints, file-level tracking) prevents most duplicates, but cross-request duplicates are possible. See [GAP-005](./gaps/005-no-per-point-deduplication.md).
 
 ### Risks
 
-- **Flurry dependency for performance parity.** Until RLE/dictionary encoding is preserved through DataFusion, point-per-row may scan more data than timeseries-per-row for series-centric queries (e.g., "plot CPU for host X"). The magnitude depends on the encoding preservation timeline.
+- **Encoding-preservation dependency for performance parity.** Until RLE/dictionary encoding is preserved through DataFusion, point-per-row may scan more data than timeseries-per-row for series-centric queries (e.g., "plot CPU for host X"). The magnitude depends on the encoding-preservation timeline.
 - **Wide tables (future research).** Metrics from the same source share nearly identical tags. Multiple metric names could be stored as separate value columns in a single wide row (e.g., `k8s.cpu.usage`, `k8s.cpu.limit`, `k8s.mem.usage` as columns sharing one tag set). This is the approach taken by TimescaleDB's hypertables. It would amortize tag storage further but requires significant compactor changes. Worth investigating as future research; it is compatible with point-per-row as an evolution, not a replacement.
 
 ## Signal Generalization
@@ -147,7 +145,7 @@ The no-LWW and no-storage-interpolation decisions are universal across signals.
 | Date | Decision | Rationale |
 |------|----------|-----------|
 | 2026-02-19 | Initial ADR created | Establish foundational data model for Parquet metrics pipeline |
-| 2026-02-19 | Point-per-row chosen over timeseries-per-row | Simpler compaction, no LWW, standard DataFusion operators. Performance parity via columnar encoding + Flurry |
+| 2026-02-19 | Point-per-row chosen over timeseries-per-row | Simpler compaction, no LWW, standard DataFusion operators. Performance parity via columnar encoding and dictionary/RLE preservation through more operators |
 | 2026-02-19 | No LWW semantics | Eliminates sticky routing and series-level dedup. Simplifies ingestion and compaction |
 | 2026-02-19 | Dedup clarified: batch-level exists, per-point does not | WAL checkpoints provide exactly-once at the batch level. File-level dedup for queue sources. Per-point dedup not implemented; identified as GAP-005 if needed |
 | 2026-02-19 | timeseries_id defined as optional synthetic column | Provides intra-group locality tiebreaker without adding complexity to the core data model |
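
Note on the ADR-001 hunks above: the claim that sorted point-per-row data yields longer runs, and therefore better RLE, can be illustrated with a toy column. This is a sketch under stated assumptions: the rows, metric names, and host values are invented for illustration, and `rle` is a hypothetical helper, not Parquet's or DataFusion's encoder.

```python
from itertools import groupby


def rle(values):
    """Run-length encode a sequence into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


# Invented point-per-row data: (metric_name, host, timestamp_secs).
rows = [
    ("cpu.usage", "host-a", 3),
    ("mem.usage", "host-b", 1),
    ("cpu.usage", "host-b", 2),
    ("mem.usage", "host-a", 2),
    ("cpu.usage", "host-a", 1),
    ("cpu.usage", "host-b", 1),
]

# Sorting by (metric_name, host, timestamp) groups equal tag values together.
sorted_rows = sorted(rows)

unsorted_runs = rle([row[0] for row in rows])
sorted_runs = rle([row[0] for row in sorted_rows])

print(len(unsorted_runs), "runs in arrival order")
print(len(sorted_runs), "runs after sorting:", sorted_runs)
```

Fewer, longer runs in the sort columns are what makes the repeated tag values in point-per-row storage cheap to encode. The memory benefit during query execution still depends on those encodings surviving through DataFusion operators, as the ADR text notes.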
 
@@ -20,7 +20,7 @@ Sorting rows within each split by a schema aligned with common query predicates
 1. **Compression improvement.** Columnar formats like Parquet compress data by encoding runs of similar values. When rows are sorted by metric name and tags, the columns for those fields contain long runs of identical or similar values, benefiting RLE, dictionary encoding, and general-purpose compression (ZSTD). In Husky Phase 1, this yielded ~33% size reduction for APM data and ~25% for Logs data.
 2. **Query efficiency.** Parquet's column index (format v2) stores min/max statistics per page within each column chunk. When data is sorted, pages within each column naturally have non-overlapping value ranges for the sort columns. DataFusion supports page index pruning, allowing it to skip pages that cannot match a query predicate.
 
-Matthew Kim's implementation added a fixed sort on `(MetricName, TagService, TagEnv, TagDatacenter, TagRegion, TagHost, TimestampSecs)` in the Parquet writer (`quickwit-parquet-engine/src/storage/writer.rs`), demonstrating that sorting is feasible and inexpensive. However, this sort order is hardcoded in `ParquetField::sort_order()` and cannot be customized per index or deployment. Different workloads have different high-value columns; a metrics index tracking Kubernetes containers benefits from sorting by `pod` and `namespace`, while an infrastructure metrics index benefits from `host` and `datacenter`.
+An initial implementation added a fixed sort on `(MetricName, TagService, TagEnv, TagDatacenter, TagRegion, TagHost, TimestampSecs)` in the Parquet writer (`quickwit-parquet-engine/src/storage/writer.rs`), demonstrating that sorting is feasible and inexpensive. However, this sort order is hardcoded in `ParquetField::sort_order()` and cannot be customized per index or deployment. Different workloads have different high-value columns; a metrics index tracking Kubernetes containers benefits from sorting by `pod` and `namespace`, while an infrastructure metrics index benefits from `host` and `datacenter`.
 
 This ADR formalizes the sort schema as a configurable, per-index property stored in the metastore.
 
@@ -169,7 +169,7 @@ Phase 4 of the locality compaction roadmap extends sorting to the Tantivy pipeli
 
 | Component | Location | Status |
 |-----------|----------|--------|
-| Fixed sort at ingestion | `quickwit-parquet-engine/src/storage/writer.rs` | Done (Matthew Kim). Replaced by configurable sort in PR #6287 |
+| Fixed sort at ingestion | `quickwit-parquet-engine/src/storage/writer.rs` | Done. Replaced by configurable sort in PR #6287 |
 | Configurable sort schema | `quickwit-parquet-engine/src/table_config.rs` | Done (PR #6287). `TableConfig` with `effective_sort_fields()` override; `ParquetWriter` resolves sort fields dynamically |
 | Sort schema parser | `quickwit-parquet-engine/src/sort_fields/parser.rs` | Done (PR #6290). Parses `column\|...\|&metadata\|timestamp/V2` with directions, LSM cutoff, version |
 | Per-column sort direction | `sort_fields/parser.rs` + `storage/writer.rs` | Done (PR #6290 + #6287). Parser extracts `+`/`-` suffix; writer respects `descending` flag |
@@ -197,5 +197,4 @@ Phase 4 of the locality compaction roadmap extends sorting to the Tantivy pipeli
 - [Compaction Architecture](../compaction-architecture.md) — current compaction system description
 - [ADR-001: Parquet Data Model](./001-parquet-data-model.md) — point-per-row data model and timeseries_id
 - [ADR-003: Time-Windowed Sorted Compaction](./003-time-windowed-sorted-compaction.md) — compaction that depends on sort schema
-- [Husky Phase 1: Locality of Reference](https://docs.google.com/document/d/1x9BO1muCTo1TmfhPYBdIxZ-59aU0ECSiEaGPUcDZkPs/edit) — prior art
-- [Husky Storage Compaction Blog Post](https://www.datadoghq.com/blog/engineering/husky-storage-compaction/)
+- [Husky Storage Compaction Blog Post](https://www.datadoghq.com/blog/engineering/husky-storage-compaction/) — prior art
@@ -302,4 +302,3 @@ Phase 4 of the locality compaction roadmap extends time-windowed sorted compacti
 - [StableLogMergePolicy](../../quickwit/quickwit-indexing/src/merge_policy/stable_log_merge_policy.rs) — existing merge policy
 - [Merge Planner](../../quickwit/quickwit-indexing/src/actors/merge_planner.rs) — existing merge planner (Tantivy)
 - [Husky Storage Compaction Blog Post](https://www.datadoghq.com/blog/engineering/husky-storage-compaction/)
-- [Husky Phase 2: Locality of Reference](https://docs.google.com/document/d/1vax-vv0wbhfddo4n5obhlVJxsmUa9N_62tKs5ZmYC6k/edit)
@@ -24,7 +24,6 @@ ADRs will be created here as we implement new systems. Start with the metrics pi
 | [001](./001-parquet-data-model.md) | Parquet Metrics Data Model | Proposed | `storage`, `metrics`, `parquet`, `data-model` | quickwit-parquet-engine |
 | [002](./002-sort-schema-parquet-splits.md) | Configurable Sort Schema for Parquet Splits | Proposed | `storage`, `metrics`, `compaction`, `parquet`, `sorting` | quickwit-parquet-engine, quickwit-indexing |
 | [003](./003-time-windowed-sorted-compaction.md) | Time-Windowed Sorted Compaction for Parquet | Proposed | `storage`, `metrics`, `compaction`, `parquet`, `time-windowing` | quickwit-parquet-engine, quickwit-indexing, quickwit-metastore |
-| [004](./004-cloud-native-storage-characteristics.md) | Cloud-Native Storage Characteristics | Proposed | `architecture`, `storage`, `cloud-native`, `observability` | all |
 
 ## Supplements & Roadmaps
 
@@ -49,7 +48,7 @@ Quickwit tracks architectural change through three lenses. See **[EVOLUTION.md](
 
 ### Characteristics (What we need)
 
-Product requirements and capabilities we must have. See [ADR-004](./004-cloud-native-storage-characteristics.md) for the full characteristic status matrix.
+Product requirements and capabilities we must have.
 
 ### Gaps (What we learned)
 
 
@@ -2,7 +2,7 @@
 
 **Status**: Partially resolved
 **Discovered**: 2026-02-19
-**Context**: Codebase analysis during Phase 1 locality compaction design. Sort implementation by Matthew Kim provides the foundation but is not configurable.
+**Context**: Codebase analysis during Phase 1 locality compaction design. The initial sort implementation provides the foundation but is not configurable.
 **Resolution**: PRs #6287–#6292 replaced the hardcoded sort with a configurable `TableConfig` + sort schema parser. Remaining: per-index metastore storage, pipeline propagation, null ordering fix.
 
 ## Problem
 
@@ -2,7 +2,7 @@
 
 **Status**: Open
 **Discovered**: 2026-02-19
-**Context**: Cloud-native storage characteristics analysis ([ADR-004](../004-cloud-native-storage-characteristics.md), characteristics C1, C17)
+**Context**: Cloud-native storage characteristics analysis (independent scaling, burst handling)
 
 ## Problem
 
@@ -45,7 +45,3 @@ All signals equally affected. Independent scaling is signal-agnostic.
 - [ ] Evaluate separating the merge pipeline into a standalone compactor service
 - [ ] Design auto-scaling policies for each workload type (ingest QPS, query QPS, file backlog)
 - [ ] Investigate burst handling for ingest (overflow buffer, backpressure, burst lane)
-
-## References
-
-- [ADR-004: Cloud-Native Storage Characteristics](../004-cloud-native-storage-characteristics.md)