feat: compute deterministic timeseries_id column at ingest #6286

Open
g-talbot wants to merge 3 commits into gtt/parquet-column-ordering from gtt/sorted-series-column

g-talbot commented Apr 9, 2026

Summary

  • Adds a timeseries_id column (Int64) to the metrics Arrow batch, computed as a deterministic SipHash-2-4 of the series identity columns
  • The hash covers metric_name, metric_type, and all tags. It excludes the temporal columns (timestamp_secs, start_timestamp_secs, timestamp) and the value columns (value, plus the DDSketch components from #6257, "[metrics] Support DDSketch in the parquet pipeline": count, sum, min, max, flags, keys, counts)
  • The column is already declared in the metrics default sort schema (metric_name|service|env|datacenter|region|host|timeseries_id|timestamp_secs/V2), so the writer automatically sorts by it and places it in the correct physical position
  • Adds a TimeseriesId variant to the ParquetField enum and updates SORT_ORDER
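
For readers without the diff open, here is a minimal sketch of the kind of computation the summary describes. It is not the PR's compute_timeseries_id: the function name timeseries_id_sketch, the key constants, the length-prefixed field encoding, and the u64-to-i64 cast are all assumptions made for illustration. It uses the standard library's SipHasher (SipHash-2-4, deprecated but still available) so the example needs no external crate.

```rust
use std::hash::Hasher;
#[allow(deprecated)]
use std::hash::SipHasher; // std's SipHash-2-4 implementation

// Hypothetical fixed keys. The PR's real keys are not shown on this page.
const K0: u64 = 0x0123_4567_89ab_cdef;
const K1: u64 = 0xfedc_ba98_7654_3210;

// Columns left out of the hash, as listed in the PR summary.
const EXCLUDED: &[&str] = &[
    "timestamp_secs", "start_timestamp_secs", "timestamp",
    "value", "count", "sum", "min", "max", "flags", "keys", "counts",
];

// Feed one field as (little-endian length, bytes) so that adjacent fields
// cannot run together. This framing is an assumption, not the PR's format.
fn write_field(hasher: &mut impl Hasher, field: &str) {
    hasher.write(&(field.len() as u64).to_le_bytes());
    hasher.write(field.as_bytes());
}

#[allow(deprecated)]
fn timeseries_id_sketch(metric_name: &str, metric_type: &str, tags: &[(&str, &str)]) -> i64 {
    // Drop excluded columns, then sort so tag order does not affect the hash.
    let mut kept: Vec<(&str, &str)> = tags
        .iter()
        .copied()
        .filter(|(key, _)| !EXCLUDED.contains(key))
        .collect();
    kept.sort_unstable();

    // Fixed keys make the result identical across processes and restarts.
    let mut hasher = SipHasher::new_with_keys(K0, K1);
    write_field(&mut hasher, metric_name);
    write_field(&mut hasher, metric_type);
    for (key, value) in kept {
        write_field(&mut hasher, key);
        write_field(&mut hasher, value);
    }
    // Reinterpret the 64-bit hash as a signed value for an Int64 column.
    hasher.finish() as i64
}
```

Under these assumptions, calling timeseries_id_sketch with the same tags in a different order, or with an extra timestamp_secs entry, returns the same id, while changing any tag key or value changes it.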

Design reference

Sorted Series Column for QW Parquet Pipeline — this PR implements the Timeseries ID component; the full Sorted Series composite key is a follow-up.

Test plan

  • 8 unit tests for compute_timeseries_id (determinism, exclusions, order independence, key/value non-interchangeability)
  • All 200 existing tests in quickwit-parquet-engine and quickwit-opentelemetry pass with updated column counts
  • Clippy reports no new warnings (the one remaining warning, in reorder_columns, predates this PR)

🤖 Generated with Claude Code

g-talbot and others added 3 commits April 9, 2026 17:15
Add a timeseries_id column (Int64) to the metrics Arrow batch,
computed as a SipHash-2-4 of the series identity columns (metric_name,
metric_type, and all tags excluding temporal/value columns). The hash
uses fixed keys for cross-process determinism.

The column is already declared in the metrics default sort schema
(between host and timestamp_secs), so the parquet writer now
automatically sorts by it and places it in the correct physical
position.
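
To make the "between host and timestamp_secs" placement concrete, the following sketch splits the sort schema string quoted in the PR summary into column names. The parsing rule (columns separated by '|', with a trailing '/V2' version suffix) is inferred from that one string, and sort_columns is a hypothetical helper, not a function from quickwit-parquet-engine.

```rust
// Hypothetical helper: split a sort schema such as
// "metric_name|...|timestamp_secs/V2" into its ordered column names.
// Assumes '|' separates columns and an optional "/<version>" suffix follows.
fn sort_columns(schema: &str) -> Vec<&str> {
    // Strip the version suffix if one is present.
    let columns = schema
        .rsplit_once('/')
        .map_or(schema, |(columns, _version)| columns);
    columns.split('|').collect()
}

// Return the index of a column within the sort order, if it is declared.
fn sort_position(schema: &str, column: &str) -> Option<usize> {
    sort_columns(schema).iter().position(|name| *name == column)
}
```

With the schema from the summary, sort_position reports timeseries_id at index 6, directly after host and directly before timestamp_secs.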

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The timeseries_id hash is persisted to Parquet files, so any change to
how it is computed would silently corrupt compaction and queries. Add:

- 3 pinned stability tests with hardcoded expected hash values
- 3 proptest properties (order independence, excluded tag immunity,
  extra-tag discrimination) each running 256 random cases
- Boundary ambiguity test ({"ab":"c"} vs {"a":"bc"})
- Same-series-different-timestamp invariant test
- All-excluded-tags coverage (every EXCLUDED_TAGS entry verified)
- Edge cases: empty strings, unicode, 100-tag cardinality
- Module-level doc explaining the stability contract

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
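
The boundary ambiguity case from the second commit ({"ab":"c"} vs {"a":"bc"}) can be illustrated without any hashing. The sketch below compares a naive concatenation of tag keys and values with a length-prefixed encoding. Both functions are hypothetical and the length-prefix scheme is an assumption; the PR may separate fields differently.

```rust
// Naive encoding: key and value bytes are simply concatenated, so the
// boundary between them is lost.
fn naive_encoding(tags: &[(&str, &str)]) -> Vec<u8> {
    let mut out = Vec::new();
    for (key, value) in tags {
        out.extend_from_slice(key.as_bytes());
        out.extend_from_slice(value.as_bytes());
    }
    out
}

// Framed encoding: each field is preceded by its length as 8 little-endian
// bytes, so different key/value splits produce different byte streams.
fn framed_encoding(tags: &[(&str, &str)]) -> Vec<u8> {
    let mut out = Vec::new();
    for (key, value) in tags {
        for field in [*key, *value] {
            out.extend_from_slice(&(field.len() as u64).to_le_bytes());
            out.extend_from_slice(field.as_bytes());
        }
    }
    out
}
```

Under the naive encoding both tag sets serialize to the bytes "abc", so any hash over them would collide. The framed encoding keeps them distinct, which is the property the boundary ambiguity test guards.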