
Commit 225bf62

Merge branch 'master' into query-eviction
Signed-off-by: Essam Eldaly <60357054+eeldaly@users.noreply.github.com>
2 parents b4b217b + 0ecd21b commit 225bf62

52 files changed

Lines changed: 1768 additions & 199 deletions


.github/workflows/test-build-deploy.yml

Lines changed: 6 additions & 6 deletions
Original file line number | Diff line number | Diff line change
@@ -17,7 +17,7 @@ jobs:
1717
lint:
1818
runs-on: ubuntu-24.04
1919
container:
20-
image: quay.io/cortexproject/build-image:master-ee0b97cc37
20+
image: quay.io/cortexproject/build-image:master-5f643d518c
2121
steps:
2222
- name: Checkout Repo
2323
uses: actions/checkout@8e8c483db84b4bee98b60c0593521ed34d9990e8 # v6.0.1
@@ -53,7 +53,7 @@ jobs:
5353
name: test (${{ matrix.name }})
5454
runs-on: ${{ matrix.runner }}
5555
container:
56-
image: quay.io/cortexproject/build-image:master-ee0b97cc37
56+
image: quay.io/cortexproject/build-image:master-5f643d518c
5757
steps:
5858
- name: Checkout Repo
5959
uses: actions/checkout@8e8c483db84b4bee98b60c0593521ed34d9990e8 # v6.0.1
@@ -81,7 +81,7 @@ jobs:
8181
name: test-no-race (${{ matrix.name }})
8282
runs-on: ${{ matrix.runner }}
8383
container:
84-
image: quay.io/cortexproject/build-image:master-ee0b97cc37
84+
image: quay.io/cortexproject/build-image:master-5f643d518c
8585
steps:
8686
- name: Checkout Repo
8787
uses: actions/checkout@8e8c483db84b4bee98b60c0593521ed34d9990e8 # v6.0.1
@@ -125,7 +125,7 @@ jobs:
125125
build:
126126
runs-on: ubuntu-24.04
127127
container:
128-
image: quay.io/cortexproject/build-image:master-ee0b97cc37
128+
image: quay.io/cortexproject/build-image:master-5f643d518c
129129
steps:
130130
- name: Checkout Repo
131131
uses: actions/checkout@8e8c483db84b4bee98b60c0593521ed34d9990e8 # v6.0.1
@@ -318,14 +318,14 @@ jobs:
318318
touch build-image/.uptodate
319319
MIGRATIONS_DIR=$(pwd)/cmd/cortex/migrations
320320
echo "Running configs integration tests on ${{ matrix.arch }}"
321-
make BUILD_IMAGE=quay.io/cortexproject/build-image:master-ee0b97cc37 TTY='' configs-integration-test
321+
make BUILD_IMAGE=quay.io/cortexproject/build-image:master-5f643d518c TTY='' configs-integration-test
322322
323323
deploy:
324324
needs: [build, test, lint, integration, integration-configs-db]
325325
if: (github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/tags/')) && github.repository == 'cortexproject/cortex'
326326
runs-on: ubuntu-24.04
327327
container:
328-
image: quay.io/cortexproject/build-image:master-ee0b97cc37
328+
image: quay.io/cortexproject/build-image:master-5f643d518c
329329
steps:
330330
- name: Checkout Repo
331331
uses: actions/checkout@8e8c483db84b4bee98b60c0593521ed34d9990e8 # v6.0.1

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -3,12 +3,15 @@
33
## master / unreleased
44
* [CHANGE] Querier: Make query time range configurations per-tenant: `query_ingesters_within`, `query_store_after`, and `shuffle_sharding_ingesters_lookback_period`. Uses `model.Duration` instead of `time.Duration` to support serialization but has minimum unit of 1ms (nanoseconds/microseconds not supported). #7160
55
* [CHANGE] Cache: Setting `-blocks-storage.bucket-store.metadata-cache.bucket-index-content-ttl` to 0 will disable the bucket-index cache. #7446
6+
* [CHANGE] HA Tracker: Move `-distributor.ha-tracker.failover-timeout` from a global config to a per-tenant runtime config. The flag name and default value (30s) remain the same. #7481
7+
* [FEATURE] Ingester: Add experimental active series tracker that counts active series by configurable label matchers (including regex) per tenant and exposes `cortex_ingester_active_series_per_tracker` metric. Configured via `active_series_trackers` in runtime config overrides. #7476
68
* [FEATURE] Ruler: Add per-tenant `ruler_alert_generator_url_template` runtime config option to customize alert generator URLs using Go templates. Supports Grafana Explore, Perses, and other UIs. #7302
79
* [FEATURE] Distributor: Add experimental `-distributor.enable-start-timestamp` flag for Prometheus Remote Write 2.0. When enabled, `StartTimestamp (ST)` is ingested. #7371
810
* [FEATURE] Memberlist: Add `-memberlist.cluster-label` and `-memberlist.cluster-label-verification-disabled` to prevent accidental cross-cluster gossip joins and support rolling label rollout. #7385
911
* [FEATURE] Querier: Add timeout classification to classify query timeouts as 4XX (user error) or 5XX (system error) based on phase timing. When enabled, queries that spend most of their time in PromQL evaluation return `422 Unprocessable Entity` instead of `503 Service Unavailable`. #7374
1012
* [FEATURE] Querier: Implement Resource Based Throttling in Querier. #7442
1113
* [FEATURE] Querier: Add resource-based query eviction that automatically cancels the heaviest running query when CPU or heap utilization exceeds configured thresholds. #7488
14+
* [ENHANCEMENT] Tenant Federation: Avoid purging the regex resolver LRU cache on user-sync ticks when the set of known users has not changed. #7489
1215
* [ENHANCEMENT] Parquet Converter: Add a ring status page to expose the ring status. #7455
1316
* [ENHANCEMENT] Ingester: Add WAL record metrics to help evaluate the effectiveness of WAL compression type (e.g. snappy, zstd): `cortex_ingester_tsdb_wal_record_part_writes_total`, `cortex_ingester_tsdb_wal_record_parts_bytes_written_total`, and `cortex_ingester_tsdb_wal_record_bytes_saved_total`. #7420
1417
* [ENHANCEMENT] Distributor: Introduce dynamic `Symbols` slice capacity pooling. #7398 #7401
@@ -19,6 +22,9 @@
1922
* [ENHANCEMENT] Compactor: Prevent partition compaction from compacting blocks marked for deletion. #7391
2023
* [ENHANCEMENT] Distributor: Optimize memory allocations by reusing the existing capacity of these pooled slices in the Prometheus Remote Write 2.0 path. #7392
2124
* [ENHANCEMENT] Upgrade gRPC from v1.71.2 to v1.79.3 to address CVE-2026-33186. #7460
25+
* [ENHANCEMENT] Query Frontend: Add `query_too_expensive` reason to QFE and `reason` field to query stats. #7479
26+
* [ENHANCEMENT] Distributor: Add HMAC-SHA256 stream authentication for `PushStream` via `-distributor.sign-write-requests-keys`. #7475
27+
* [ENHANCEMENT] Instrument Ingester CPU profile with source for read APIs. #7494
2228
* [BUGFIX] Querier: Fix queryWithRetry and labelsWithRetry returning (nil, nil) on cancelled context by propagating ctx.Err(). #7370
2329
* [BUGFIX] Metrics Helper: Fix non-deterministic bucket order in merged histograms by sorting buckets after map iteration, matching Prometheus client library behavior. #7380
2430
* [BUGFIX] Distributor: Return HTTP 401 Unauthorized when tenant ID resolution fails in the Prometheus Remote Write 2.0 path. #7389

build-image/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ RUN GOARCH=$(go env GOARCH) && \
1717
chmod +x shfmt && \
1818
mv shfmt /usr/bin
1919

20-
RUN curl -sfL https://raw.githubusercontent.com/golangci/golangci-lint/master/install.sh| sh -s -- -b /usr/bin v2.10.1
20+
RUN curl -sfL https://golangci-lint.run/install.sh| sh -s -- -b /usr/bin v2.12.2
2121

2222
RUN go install github.com/client9/misspell/cmd/misspell@v0.3.4 &&\
2323
go install github.com/golang/protobuf/protoc-gen-go@v1.3.1 &&\

docs/configuration/config-file-reference.md

Lines changed: 32 additions & 6 deletions
@@ -3101,12 +3101,6 @@ ha_tracker:
31013101
# CLI flag: -distributor.ha-tracker.update-timeout-jitter-max
31023102
[ha_tracker_update_timeout_jitter_max: <duration> | default = 5s]
31033103
3104-
# The timeout after which a new replica will be accepted if the currently
3105-
# elected replica stops sending data. This value must be greater than the
3106-
# update timeout plus the maximum jitter.
3107-
# CLI flag: -distributor.ha-tracker.failover-timeout
3108-
[ha_tracker_failover_timeout: <duration> | default = 30s]
3109-
31103104
# [Experimental] If enabled, fetches all tracked keys on startup to populate
31113105
# the local cache. This prevents duplicate GET calls for the same key while
31123106
# the cache is cold, but could cause a spike in GET requests during
@@ -3213,6 +3207,17 @@ ha_tracker:
32133207
# CLI flag: -distributor.sign-write-requests
32143208
[sign_write_requests: <boolean> | default = false]
32153209
3210+
# EXPERIMENTAL: Comma-separated list of HMAC-SHA256 keys authenticating
3211+
# PushStream connections between distributors and ingesters. The first key is
3212+
# used by the distributor to sign; all keys are accepted by the ingester. It
3213+
# only takes effect when -distributor.sign-write-requests is true. The key
3214+
# change procedure for zero downtime is: (1) redeploy ingesters first with
3215+
# 'newkey,oldkey' — ingester accepts both keys; (2) redeploy distributors with
3216+
# 'newkey,oldkey' — distributor signs with newkey; (3) once stable, redeploy
3217+
# both with 'newkey' to drop the old key.
3218+
# CLI flag: -distributor.sign-write-requests-keys
3219+
[sign_write_requests_keys: <string> | default = ""]
3220+
32163221
# EXPERIMENTAL: If enabled, distributor would use stream connection to send
32173222
# requests to ingesters.
32183223
# CLI flag: -distributor.use-stream-push
@@ -4069,6 +4074,12 @@ The `limits_config` configures default and per-tenant limits imposed by Cortex s
40694074
# CLI flag: -distributor.ha-tracker.max-clusters
40704075
[ha_max_clusters: <int> | default = 0]
40714076
4077+
# If the elected replica doesn't send samples in this time, the HA tracker will
4078+
# accept a new replica. This value must be greater than the update timeout plus
4079+
# the maximum jitter.
4080+
# CLI flag: -distributor.ha-tracker.failover-timeout
4081+
[ha_tracker_failover_timeout: <duration> | default = 30s]
4082+
40724083
# This flag can be used to specify label names to drop during sample
40734084
# ingestion within the distributor and can be repeated in order to drop multiple
40744085
# labels.
@@ -4195,6 +4206,10 @@ The `limits_config` configures default and per-tenant limits imposed by Cortex s
41954206
# [max_series]
41964207
[limits_per_label_set: <list of LimitsPerLabelSet> | default = []]
41974208
4209+
# List of active series tracker configurations. Each tracker counts active
4210+
# series matching its matchers and exposes the count as a metric.
4211+
[active_series_trackers: <list of ActiveSeriesTrackerConfig> | default = []]
4212+
41984213
# [EXPERIMENTAL] True to enable native histogram.
41994214
# CLI flag: -blocks-storage.tsdb.enable-native-histograms
42004215
[enable_native_histograms: <boolean> | default = false]
@@ -7007,6 +7022,17 @@ limits:
70077022
[label_set: <map of string (labelName) to string (labelValue)> | default = []]
70087023
```
70097024

7025+
### `ActiveSeriesTrackerConfig`
7026+
7027+
```yaml
7028+
# Name of the tracker, used as a label value in the emitted metric.
7029+
[name: <string> | default = ""]
7030+
7031+
# PromQL series selector (e.g. {__name__=~"api_.*"}). All matchers must match
7032+
# for a series to be counted.
7033+
[matchers: <string> | default = ""]
7034+
```
7035+
70107036
### `PriorityDef`
70117037

70127038
```yaml

docs/configuration/v1-guarantees.md

Lines changed: 4 additions & 0 deletions
@@ -114,6 +114,7 @@ Currently experimental features are:
114114
- `-store-gateway.query-protection.rejection`
115115
- Distributor/Ingester: Stream push connection
116116
- Enable stream push connection between distributor and ingester by setting `-distributor.use-stream-push=true` on Distributor.
117+
- Enable stream push authentication on Distributor/Ingester (`-distributor.sign-write-requests-keys`).
117118
- Add `__type__` and `__unit__` labels to OTLP and remote write v2 requests (`-distributor.enable-type-and-unit-labels`)
118119
- Handle StartTimestampMs (ST) for remote write v2 samples and histograms, using CreatedTimestamp (CT) as a fallback when ST is not set (`-distributor.enable-start-timestamp`)
119120
- Ingester: Series Queried Metric
@@ -136,3 +137,6 @@ Currently experimental features are:
136137
- `-querier.query-protection.eviction.cooldown-period` (int)
137138
- `-querier.query-protection.eviction.eviction-metric` (string)
138139
- `-querier.query-protection.eviction.min-query-age` (duration)
140+
- Ingester: Active Series Tracker
141+
- Per-tenant `active_series_trackers` configuration in runtime config overrides
142+
- Counts active series matching PromQL label matchers and exposes `cortex_ingester_active_series_per_tracker` metric
Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
1+
# Active Series Tracker
2+
3+
## Problem
4+
5+
AMP needs to monitor active series counts by configurable patterns (e.g., all series with `__name__=~"api_.*"`) for internal observability. The existing `LimitsPerLabelSet` feature is unsuitable because:
6+
7+
1. **No regex matching** — only supports exact `label=value` matching.
8+
2. **Default partition side-effects** — adding labelset buckets reduces the default partition count.
9+
3. **Coupled to limit enforcement** — designed for enforcing series limits, not pure monitoring.
10+
11+
## Requirements
12+
13+
- Track active series counts by configurable label matchers (including regex).
14+
- Expose counts as Prometheus metrics on the ingester (internal only, not vended to customers).
15+
- Configuration supports **per-tenant overrides** with a **default** fallback (same pattern as all other Limits fields).
16+
- **Runtime hot-reloadable** via the existing runtime config file mechanism.
17+
- **No limit enforcement** — purely observational.
18+
- **No default partition** — unmatched series are simply not tracked.
19+
- A series can match multiple tracker entries simultaneously.
20+
21+
## Design
22+
23+
### Configuration
24+
25+
Tracker config lives in the `Limits` struct, following the same per-tenant override pattern as `LimitsPerLabelSet`:
26+
27+
```yaml
# Default trackers (applied to all tenants without overrides)
limits:
  active_series_trackers:
    - name: api_metrics
      matchers: '{__name__=~"api_.*"}'

# Per-tenant overrides via runtime config
overrides:
  tenant-123:
    active_series_trackers:
      - name: api_metrics
        matchers: '{__name__=~"api_.*"}'
      - name: system_metrics
        matchers: '{__name__=~"node_.*|process_.*"}'
```
43+
44+
The `matchers` field uses standard PromQL matcher syntax parsed via `parser.ParseMetricSelector`.
45+
46+
### Runtime Reload
47+
48+
Tracker config is part of `Limits`, which is reloaded via the runtime config manager every `runtime-config.reload-period` (default 10s). Matchers are parsed and validated during YAML/JSON unmarshalling. If the new config is invalid, it is rejected and the existing config stays active.
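The validate-on-unmarshal behaviour could be sketched roughly as follows. This is an illustration under simplifying assumptions, not the code in this commit: `TrackerConfig`, its `name_regex` field, and `reload` are hypothetical names, and a single regex on the metric name stands in for a full PromQL selector so that the example needs only the standard library.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"regexp"
)

// TrackerConfig is a hypothetical, simplified tracker entry. It holds one
// regex on the metric name instead of a full PromQL selector.
type TrackerConfig struct {
	Name      string `json:"name"`
	NameRegex string `json:"name_regex"`

	compiled *regexp.Regexp // built once, while the config is decoded
}

// UnmarshalJSON validates the entry and compiles its regex during decoding,
// so a bad pattern fails the whole load.
func (t *TrackerConfig) UnmarshalJSON(data []byte) error {
	type plain TrackerConfig // same fields, no methods: avoids recursion
	var p plain
	if err := json.Unmarshal(data, &p); err != nil {
		return err
	}
	if p.Name == "" {
		return errors.New("tracker name must not be empty")
	}
	// Prometheus regex matchers are fully anchored; mirror that here.
	re, err := regexp.Compile("^(?:" + p.NameRegex + ")$")
	if err != nil {
		return fmt.Errorf("tracker %q: %w", p.Name, err)
	}
	*t = TrackerConfig(p)
	t.compiled = re
	return nil
}

// reload returns the new trackers when payload is valid. Otherwise it
// returns the current trackers unchanged together with the error.
func reload(current []TrackerConfig, payload []byte) ([]TrackerConfig, error) {
	var next []TrackerConfig
	if err := json.Unmarshal(payload, &next); err != nil {
		return current, err
	}
	return next, nil
}

func main() {
	active, err := reload(nil, []byte(`[{"name":"api_metrics","name_regex":"api_.*"}]`))
	fmt.Println("loaded:", len(active), "error:", err)

	// An unterminated character class is an invalid regex, so this reload
	// is rejected and the previously loaded trackers are kept.
	active, err = reload(active, []byte(`[{"name":"broken","name_regex":"api_["}]`))
	fmt.Println("still active:", active[0].Name, "rejected:", err != nil)
}
```

Because the regex is compiled inside `UnmarshalJSON`, the compiled form travels with the parsed config and is rebuilt only when a new config is decoded.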
49+
50+
### Metrics
51+
52+
A new gauge metric is emitted per ingester:
53+
54+
```
55+
cortex_ingester_active_series_per_tracker{user="<tenant>", name="<tracker_name>"} <count>
56+
```
57+
58+
### Matching Logic
59+
60+
On each active series metrics update tick (default 1min), for each tenant:
61+
1. Read the tenant's tracker config via `i.limits.ActiveSeriesTrackers(userID)`
62+
2. For each tracker, count active series whose labels satisfy all matchers
63+
3. Emit the gauge metric
64+
65+
A series can match multiple trackers. Tenants without configured trackers emit no tracker metrics.
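The counting step above can be sketched as follows. This is a minimal illustration, not the ingester code: `matcher`, `tracker`, `newRegexMatcher`, and `countPerTracker` are hypothetical names, only regex (`=~`) matchers are modelled, and series are represented as plain label maps. Reporting zero for a tracker with no matches is a choice made for this sketch.

```go
package main

import (
	"fmt"
	"regexp"
)

// matcher is a hypothetical stand-in for a compiled PromQL label matcher.
// Only the =~ (regex) form is modelled.
type matcher struct {
	label string
	re    *regexp.Regexp
}

// tracker pairs a name with the matchers a series must all satisfy.
type tracker struct {
	name     string
	matchers []matcher
}

// newRegexMatcher anchors the pattern, as Prometheus does for regex matchers.
func newRegexMatcher(label, pattern string) matcher {
	return matcher{label: label, re: regexp.MustCompile("^(?:" + pattern + ")$")}
}

// matchesAll reports whether the series labels satisfy every matcher.
// A missing label is treated as the empty string.
func matchesAll(series map[string]string, ms []matcher) bool {
	for _, m := range ms {
		if !m.re.MatchString(series[m.label]) {
			return false
		}
	}
	return true
}

// countPerTracker returns, for each tracker, how many active series match.
// A series may be counted by several trackers; unmatched series are ignored.
func countPerTracker(active []map[string]string, trackers []tracker) map[string]int {
	counts := make(map[string]int, len(trackers))
	for _, t := range trackers {
		counts[t.name] = 0 // in this sketch, a tracker with no matches reports zero
		for _, s := range active {
			if matchesAll(s, t.matchers) {
				counts[t.name]++
			}
		}
	}
	return counts
}

func main() {
	active := []map[string]string{
		{"__name__": "api_requests_total", "job": "web"},
		{"__name__": "api_errors_total", "job": "web"},
		{"__name__": "node_cpu_seconds_total", "job": "node"},
	}
	trackers := []tracker{
		{name: "api_metrics", matchers: []matcher{newRegexMatcher("__name__", "api_.*")}},
		{name: "system_metrics", matchers: []matcher{newRegexMatcher("__name__", "node_.*|process_.*")}},
	}
	for name, n := range countPerTracker(active, trackers) {
		fmt.Printf("cortex_ingester_active_series_per_tracker{name=%q} %d\n", name, n)
	}
}
```

The loop is linear in the number of trackers times the number of active series, which is consistent with running it once per update tick rather than on every ingested sample.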
66+
67+
### Performance Considerations
68+
69+
- Matching runs once per update period (default 1min), not on every sample ingestion.
70+
- The number of trackers is expected to be small (< 10).
71+
- Compiled matchers are cached in the parsed Limits and only recompiled on config change.
