Commit 8ec34d8

feat(ingester): Add active series tracker for pattern-based monitoring (#7476)
Add a new active series tracker feature that counts active series by
configurable label matchers (including regex) and exposes the counts as
Prometheus metrics. This is designed for internal monitoring without
enforcing any limits.

Key changes:
- Add ActiveSeriesTrackersConfig type in validation package with PromQL matcher syntax support
- Add ActiveSeriesTrackers field to Limits struct for per-tenant config with default fallback
- Add ActiveForMatchers() method to ActiveSeries for counting matching series across all stripes
- Add cortex_ingester_active_series_per_tracker gauge metric
- Integrate into updateActiveSeries() periodic tick
- Matchers are validated and compiled during config unmarshalling
- Runtime hot-reloadable via existing runtime config overrides

Configuration example:

    overrides:
      tenant-123:
        active_series_trackers:
          - name: api_metrics
            matchers: '{__name__=~"api_.*"}'

Metric emitted:

    cortex_ingester_active_series_per_tracker{user="tenant", name="api_metrics"} 42

Signed-off-by: Ben Ye <benye@amazon.com>
1 parent ce9e5dd commit 8ec34d8

14 files changed

Lines changed: 838 additions & 8 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@
 * [CHANGE] Querier: Make query time range configurations per-tenant: `query_ingesters_within`, `query_store_after`, and `shuffle_sharding_ingesters_lookback_period`. Uses `model.Duration` instead of `time.Duration` to support serialization but has minimum unit of 1ms (nanoseconds/microseconds not supported). #7160
 * [CHANGE] Cache: Setting `-blocks-storage.bucket-store.metadata-cache.bucket-index-content-ttl` to 0 will disable the bucket-index cache. #7446
 * [CHANGE] HA Tracker: Move `-distributor.ha-tracker.failover-timeout` from a global config to a per-tenant runtime config. The flag name and default value (30s) remain the same. #7481
+* [FEATURE] Ingester: Add experimental active series tracker that counts active series by configurable label matchers (including regex) per tenant and exposes `cortex_ingester_active_series_per_tracker` metric. Configured via `active_series_trackers` in runtime config overrides. #7476
 * [FEATURE] Ruler: Add per-tenant `ruler_alert_generator_url_template` runtime config option to customize alert generator URLs using Go templates. Supports Grafana Explore, Perses, and other UIs. #7302
 * [FEATURE] Distributor: Add experimental `-distributor.enable-start-timestamp` flag for Prometheus Remote Write 2.0. When enabled, `StartTimestamp (ST)` is ingested. #7371
 * [FEATURE] Memberlist: Add `-memberlist.cluster-label` and `-memberlist.cluster-label-verification-disabled` to prevent accidental cross-cluster gossip joins and support rolling label rollout. #7385

docs/configuration/config-file-reference.md

Lines changed: 15 additions & 0 deletions
@@ -4164,6 +4164,10 @@ The `limits_config` configures default and per-tenant limits imposed by Cortex s
 # [max_series]
 [limits_per_label_set: <list of LimitsPerLabelSet> | default = []]
 
+# List of active series tracker configurations. Each tracker counts active
+# series matching its matchers and exposes the count as a metric.
+[active_series_trackers: <list of ActiveSeriesTrackerConfig> | default = []]
+
 # [EXPERIMENTAL] True to enable native histogram.
 # CLI flag: -blocks-storage.tsdb.enable-native-histograms
 [enable_native_histograms: <boolean> | default = false]
@@ -6892,6 +6896,17 @@ limits:
 [label_set: <map of string (labelName) to string (labelValue)> | default = []]
 ```
 
+### `ActiveSeriesTrackerConfig`
+
+```yaml
+# Name of the tracker, used as a label value in the emitted metric.
+[name: <string> | default = ""]
+
+# PromQL series selector (e.g. {__name__=~"api_.*"}). All matchers must match
+# for a series to be counted.
+[matchers: <string> | default = ""]
+```
+
 ### `PriorityDef`
 
 ```yaml

docs/configuration/v1-guarantees.md

Lines changed: 3 additions & 0 deletions
@@ -130,3 +130,6 @@ Currently experimental features are:
 - `-validation.max-label-cardinality-for-unoptimized-regex` (int) - maximum label cardinality
 - `-validation.max-total-label-value-length-for-unoptimized-regex` (int) - maximum total length of all label values in bytes
 - HATracker: `-distributor.ha-tracker.enable-startup-sync` (bool) - If enabled, fetches all tracked keys on startup to populate the local cache.
+- Ingester: Active Series Tracker
+  - Per-tenant `active_series_trackers` configuration in runtime config overrides
+  - Counts active series matching PromQL label matchers and exposes `cortex_ingester_active_series_per_tracker` metric
Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
# Active Series Tracker

## Problem

AMP needs to monitor active series counts by configurable patterns (e.g., all series with `__name__=~"api_.*"`) for internal observability. The existing `LimitsPerLabelSet` feature is unsuitable because:

1. **No regex matching** — only supports exact `label=value` matching.
2. **Default partition side-effects** — adding labelset buckets reduces the default partition count.
3. **Coupled to limit enforcement** — designed for enforcing series limits, not pure monitoring.

## Requirements

- Track active series counts by configurable label matchers (including regex).
- Expose counts as Prometheus metrics on the ingester (internal only, not vended to customers).
- Configuration supports **per-tenant overrides** with a **default** fallback (same pattern as all other Limits fields).
- **Runtime hot-reloadable** via the existing runtime config file mechanism.
- **No limit enforcement** — purely observational.
- **No default partition** — unmatched series are simply not tracked.
- A series can match multiple tracker entries simultaneously.

## Design

### Configuration

Tracker config lives in the `Limits` struct, following the same per-tenant override pattern as `LimitsPerLabelSet`:

```yaml
# Default trackers (applied to all tenants without overrides)
limits:
  active_series_trackers:
    - name: api_metrics
      matchers: '{__name__=~"api_.*"}'

# Per-tenant overrides via runtime config
overrides:
  tenant-123:
    active_series_trackers:
      - name: api_metrics
        matchers: '{__name__=~"api_.*"}'
      - name: system_metrics
        matchers: '{__name__=~"node_.*|process_.*"}'
```

The `matchers` field uses standard PromQL matcher syntax parsed via `parser.ParseMetricSelector`.

### Runtime Reload

Tracker config is part of `Limits`, which is reloaded via the runtime config manager every `runtime-config.reload-period` (default 10s). Matchers are parsed and validated during YAML/JSON unmarshalling. Invalid configs are rejected (existing config stays active).
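
A minimal sketch of the reject-and-keep behaviour described above, using only the Go standard library. It is not the Cortex implementation: `trackerConfig`, `manager`, and `reload` are hypothetical names, JSON stands in for the runtime config file format, and `matchers` is treated as a bare regular expression instead of a PromQL selector. The point is only that compiling during unmarshalling lets one invalid entry fail the whole reload while the previous config stays active.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// trackerConfig is a simplified, hypothetical tracker entry. The real feature
// takes a PromQL selector in `matchers`; this sketch treats the field as a
// bare regular expression so that it needs only the standard library.
type trackerConfig struct {
	Name     string `json:"name"`
	Matchers string `json:"matchers"`

	compiled *regexp.Regexp // set during unmarshalling, reused afterwards
}

// UnmarshalJSON compiles the matcher while decoding, so one bad entry makes
// the whole decode fail.
func (t *trackerConfig) UnmarshalJSON(data []byte) error {
	type plain trackerConfig // same fields, no methods: avoids recursion
	var p plain
	if err := json.Unmarshal(data, &p); err != nil {
		return err
	}
	re, err := regexp.Compile("^(?:" + p.Matchers + ")$")
	if err != nil {
		return fmt.Errorf("tracker %q: invalid matchers: %w", p.Name, err)
	}
	p.compiled = re
	*t = trackerConfig(p)
	return nil
}

// manager holds the last configuration that decoded cleanly.
type manager struct {
	current []trackerConfig
}

// reload replaces the active config only when the new one is fully valid.
func (m *manager) reload(raw []byte) error {
	var next []trackerConfig
	if err := json.Unmarshal(raw, &next); err != nil {
		return err // m.current is left untouched
	}
	m.current = next
	return nil
}

func main() {
	m := &manager{}
	good := []byte(`[{"name": "api_metrics", "matchers": "api_.*"}]`)
	bad := []byte(`[{"name": "broken", "matchers": "api_["}]`)

	fmt.Println("good reload error:", m.reload(good))
	fmt.Println("bad reload error:", m.reload(bad))
	fmt.Println("active trackers:", len(m.current), m.current[0].Name)
}
```

After the failed second reload, the manager in this sketch still serves the `api_metrics` tracker from the first one, and its compiled regex is reused without recompilation.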

### Metrics

A new gauge metric emitted per ingester:

```
cortex_ingester_active_series_per_tracker{user="<tenant>", name="<tracker_name>"} <count>
```
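
The integration test in this commit expects the `node_metrics` series to stop being reported once that tracker is removed from the runtime config. One way such cleanup could work is sketched below with an in-memory map in place of a real Prometheus gauge vector. `trackerGauge`, `gaugeKey`, and `update` are hypothetical names, and how the ingester actually deletes stale series is not shown in this section.

```go
package main

import "fmt"

// gaugeKey mirrors the two labels on the metric: user and name.
type gaugeKey struct {
	user string
	name string
}

// trackerGauge is a hypothetical in-memory stand-in for a labelled gauge.
type trackerGauge struct {
	values map[gaugeKey]float64
}

func newTrackerGauge() *trackerGauge {
	return &trackerGauge{values: make(map[gaugeKey]float64)}
}

// update publishes the latest counts for one tenant and drops that tenant's
// series for trackers that are no longer configured.
func (g *trackerGauge) update(user string, counts map[string]int) {
	for k := range g.values {
		if k.user != user {
			continue
		}
		if _, ok := counts[k.name]; !ok {
			delete(g.values, k) // tracker was removed from the config
		}
	}
	for name, c := range counts {
		g.values[gaugeKey{user: user, name: name}] = float64(c)
	}
}

func main() {
	g := newTrackerGauge()
	g.update("user-1", map[string]int{"api_metrics": 2, "node_metrics": 1})
	g.update("user-1", map[string]int{"api_metrics": 2}) // node_metrics removed

	for k, v := range g.values {
		fmt.Printf("cortex_ingester_active_series_per_tracker{user=%q, name=%q} %v\n", k.user, k.name, v)
	}
}
```

In this sketch the second `update` leaves only the `api_metrics` series for `user-1`; series belonging to other tenants are never touched.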

### Matching Logic

On each active series metrics update tick (default 1min), for each tenant:
1. Read the tenant's tracker config via `i.limits.ActiveSeriesTrackers(userID)`
2. For each tracker, count active series whose labels satisfy all matchers
3. Emit the gauge metric

A series can match multiple trackers. Tenants without configured trackers emit no tracker metrics.
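
The per-tick steps above can be sketched as a plain counting loop. This is an illustration under simplifying assumptions, not the ingester code: `series`, `tracker`, `nameRegex`, and `countPerTracker` are hypothetical, label sets are plain maps, and each tracker is reduced to a single predicate on `__name__`. Reporting 0 for a configured tracker with no matches is a choice of this sketch; the source does not say what the ingester does in that case.

```go
package main

import (
	"fmt"
	"regexp"
)

// series is a simplified label set; Cortex uses labels.Labels instead.
type series map[string]string

// tracker is a hypothetical compiled tracker: a name plus a predicate that
// stands in for "all matchers match".
type tracker struct {
	name  string
	match func(series) bool
}

// nameRegex builds a predicate equivalent to {__name__=~"<pattern>"}.
// Prometheus regex matchers are fully anchored, hence the ^(?:...)$ wrapper.
func nameRegex(pattern string) func(series) bool {
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	return func(s series) bool { return re.MatchString(s["__name__"]) }
}

// countPerTracker is what one update tick does for one tenant. It returns a
// gauge value per configured tracker; with no trackers the result is empty,
// so nothing would be emitted.
func countPerTracker(active []series, trackers []tracker) map[string]int {
	counts := make(map[string]int, len(trackers))
	for _, t := range trackers {
		n := 0
		for _, s := range active {
			if t.match(s) {
				n++
			}
		}
		counts[t.name] = n
	}
	return counts
}

func main() {
	active := []series{
		{"__name__": "api_requests_total", "job": "test"},
		{"__name__": "api_errors_total", "job": "test"},
		{"__name__": "node_cpu_seconds", "job": "test"},
		{"__name__": "process_memory_bytes", "job": "test"},
	}
	trackers := []tracker{
		{name: "api_metrics", match: nameRegex("api_.*")},
		{name: "system_metrics", match: nameRegex("node_.*|process_.*")},
	}
	counts := countPerTracker(active, trackers)
	fmt.Println(counts["api_metrics"], counts["system_metrics"]) // prints: 2 2
}
```

The cost of this loop is one predicate call per tracker per active series, which is why it runs on the periodic tick and not on the ingestion path.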

### Performance Considerations

- Matching runs once per update period (default 1min), not on every sample ingestion.
- The number of trackers is expected to be small (< 10).
- Compiled matchers are cached in the parsed Limits and only recompiled on config change.
Lines changed: 160 additions & 0 deletions
@@ -0,0 +1,160 @@
//go:build requires_docker

package integration

import (
	"fmt"
	"path/filepath"
	"testing"
	"time"

	"github.com/prometheus/prometheus/model/labels"
	"github.com/prometheus/prometheus/prompb"
	"github.com/stretchr/testify/require"
	"gopkg.in/yaml.v3"

	"github.com/cortexproject/cortex/integration/e2e"
	e2edb "github.com/cortexproject/cortex/integration/e2e/db"
	"github.com/cortexproject/cortex/integration/e2ecortex"
)

func TestActiveSeriesTrackerPerTenant(t *testing.T) {
	s, err := e2e.NewScenario(networkName)
	require.NoError(t, err)
	defer s.Close()

	// Write runtime config with per-tenant active series trackers.
	runtimeConfig := map[string]interface{}{
		"overrides": map[string]interface{}{
			"user-1": map[string]interface{}{
				"active_series_trackers": []map[string]string{
					{"name": "api_metrics", "matchers": `{__name__=~"api_.*"}`},
					{"name": "node_metrics", "matchers": `{__name__=~"node_.*"}`},
				},
			},
		},
	}
	runtimeCfgYAML, err := yaml.Marshal(runtimeConfig)
	require.NoError(t, err)
	require.NoError(t, writeFileToSharedDir(s, runtimeConfigFile, runtimeCfgYAML))

	flags := BlocksStorageFlags()
	flags["-distributor.shard-by-all-labels"] = "true"
	flags["-ingester.active-series-metrics-enabled"] = "true"
	flags["-ingester.active-series-metrics-update-period"] = "2s"
	flags["-ingester.active-series-metrics-idle-timeout"] = "5m"
	flags["-runtime-config.file"] = filepath.Join(e2e.ContainerSharedDir, runtimeConfigFile)
	flags["-runtime-config.reload-period"] = "1s"
	flags["-alertmanager.web.external-url"] = "http://localhost/alertmanager"
	flags["-alertmanager-storage.backend"] = "local"
	flags["-alertmanager-storage.local.path"] = filepath.Join(e2e.ContainerSharedDir, "alertmanager_configs")

	require.NoError(t, writeFileToSharedDir(s, "alertmanager_configs", []byte{}))

	consul := e2edb.NewConsul()
	minio := e2edb.NewMinio(9000, flags["-blocks-storage.s3.bucket-name"])
	require.NoError(t, s.StartAndWaitReady(consul, minio))

	flags["-ring.store"] = "consul"
	flags["-consul.hostname"] = consul.NetworkHTTPEndpoint()

	cortex := e2ecortex.NewSingleBinary("cortex-1", flags, "")
	require.NoError(t, s.StartAndWaitReady(cortex))

	// Wait until the ring is ready.
	require.NoError(t, cortex.WaitSumMetrics(e2e.Equals(float64(512)), "cortex_ring_tokens_total"))

	c, err := e2ecortex.NewClient(cortex.HTTPEndpoint(), cortex.HTTPEndpoint(), "", "", "user-1")
	require.NoError(t, err)

	now := time.Now()
	for _, name := range []string{"api_requests_total", "api_errors_total", "node_cpu_seconds", "process_memory_bytes"} {
		series, _ := generateSeries(name, now, prompb.Label{Name: "job", Value: "test"})
		res, err := c.Push(series)
		require.NoError(t, err)
		require.Equal(t, 200, res.StatusCode, fmt.Sprintf("push %s failed", name))
	}

	// user-1 has trackers: api_metrics (matches 2), node_metrics (matches 1).
	require.NoError(t, cortex.WaitSumMetricsWithOptions(
		e2e.Equals(2),
		[]string{"cortex_ingester_active_series_per_tracker"},
		e2e.WithLabelMatchers(
			labels.MustNewMatcher(labels.MatchEqual, "user", "user-1"),
			labels.MustNewMatcher(labels.MatchEqual, "name", "api_metrics"),
		),
		e2e.WaitMissingMetrics,
	))

	require.NoError(t, cortex.WaitSumMetricsWithOptions(
		e2e.Equals(1),
		[]string{"cortex_ingester_active_series_per_tracker"},
		e2e.WithLabelMatchers(
			labels.MustNewMatcher(labels.MatchEqual, "user", "user-1"),
			labels.MustNewMatcher(labels.MatchEqual, "name", "node_metrics"),
		),
		e2e.WaitMissingMetrics,
	))

	// user-2 has no trackers configured — should have no tracker metrics.
	c2, err := e2ecortex.NewClient(cortex.HTTPEndpoint(), cortex.HTTPEndpoint(), "", "", "user-2")
	require.NoError(t, err)

	series2, _ := generateSeries("api_requests_total", now, prompb.Label{Name: "job", Value: "test"})
	res, err := c2.Push(series2)
	require.NoError(t, err)
	require.Equal(t, 200, res.StatusCode)

	// Wait for user-2 active series to be counted.
	require.NoError(t, cortex.WaitSumMetricsWithOptions(
		e2e.Equals(1),
		[]string{"cortex_ingester_active_series"},
		e2e.WithLabelMatchers(labels.MustNewMatcher(labels.MatchEqual, "user", "user-2")),
		e2e.WaitMissingMetrics,
	))

	// user-2 should have no tracker metrics.
	sum, err := cortex.SumMetrics(
		[]string{"cortex_ingester_active_series_per_tracker"},
		e2e.WithLabelMatchers(labels.MustNewMatcher(labels.MatchEqual, "user", "user-2")),
		e2e.SkipMissingMetrics,
	)
	require.NoError(t, err)
	require.Equal(t, 0.0, sum[0])

	// Now update runtime config: remove node_metrics tracker for user-1.
	runtimeConfig2 := map[string]interface{}{
		"overrides": map[string]interface{}{
			"user-1": map[string]interface{}{
				"active_series_trackers": []map[string]string{
					{"name": "api_metrics", "matchers": `{__name__=~"api_.*"}`},
				},
			},
		},
	}
	runtimeCfgYAML2, err := yaml.Marshal(runtimeConfig2)
	require.NoError(t, err)
	require.NoError(t, writeFileToSharedDir(s, runtimeConfigFile, runtimeCfgYAML2))

	// Wait for the stale node_metrics tracker metric to be removed.
	require.NoError(t, cortex.WaitSumMetricsWithOptions(
		e2e.Equals(0),
		[]string{"cortex_ingester_active_series_per_tracker"},
		e2e.WithLabelMatchers(
			labels.MustNewMatcher(labels.MatchEqual, "user", "user-1"),
			labels.MustNewMatcher(labels.MatchEqual, "name", "node_metrics"),
		),
		e2e.SkipMissingMetrics,
	))

	// api_metrics tracker should still work.
	require.NoError(t, cortex.WaitSumMetricsWithOptions(
		e2e.Equals(2),
		[]string{"cortex_ingester_active_series_per_tracker"},
		e2e.WithLabelMatchers(
			labels.MustNewMatcher(labels.MatchEqual, "user", "user-1"),
			labels.MustNewMatcher(labels.MatchEqual, "name", "api_metrics"),
		),
		e2e.WaitMissingMetrics,
	))
}

pkg/ingester/active_series.go

Lines changed: 10 additions & 0 deletions
@@ -248,3 +248,13 @@ func (s *activeSeriesStripe) getActiveNativeHistogram() int {
 
 	return s.activeNativeHistogram
 }
+
+// matchesAll returns true if the labels satisfy all given matchers.
+func matchesAll(lbs labels.Labels, matchers []*labels.Matcher) bool {
+	for _, m := range matchers {
+		if !m.Matches(lbs.Get(m.Name)) {
+			return false
+		}
+	}
+	return true
+}
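
Because `lbs.Get` returns an empty string for a label that is not present, `matchesAll` evaluates each matcher against `""` when the series lacks that label. The standalone sketch below reproduces that behaviour with standard-library stand-ins so the edge cases are easy to see. The `matcher` type and `newMatcher` helper are hypothetical simplifications (regex and negated regex only), not the Prometheus `labels.Matcher` API.

```go
package main

import (
	"fmt"
	"regexp"
)

// matcher is a hypothetical stand-in for *labels.Matcher, limited to regex
// (=~) and negated regex (!~) matching.
type matcher struct {
	name   string
	re     *regexp.Regexp
	negate bool
}

// newMatcher anchors the pattern on both ends, as Prometheus regex matchers do.
func newMatcher(name, pattern string, negate bool) matcher {
	return matcher{name: name, re: regexp.MustCompile("^(?:" + pattern + ")$"), negate: negate}
}

func (m matcher) matches(v string) bool {
	return m.re.MatchString(v) != m.negate
}

// matchesAll follows the function in the diff, with a map as the label set.
func matchesAll(lbs map[string]string, matchers []matcher) bool {
	for _, m := range matchers {
		if !m.matches(lbs[m.name]) { // a missing label reads as ""
			return false
		}
	}
	return true
}

func main() {
	s := map[string]string{"__name__": "api_requests_total", "job": "test"}

	api := newMatcher("__name__", "api_.*", false) // {__name__=~"api_.*"}
	notProd := newMatcher("env", "prod", true)     // {env!~"prod"}
	hasEnv := newMatcher("env", ".+", false)       // {env=~".+"}

	fmt.Println(matchesAll(s, []matcher{api}))          // true
	fmt.Println(matchesAll(s, []matcher{api, notProd})) // true: env is absent, and "" is not "prod"
	fmt.Println(matchesAll(s, []matcher{api, hasEnv}))  // false: env is absent, and ".+" needs a value
	fmt.Println(matchesAll(s, nil))                     // true: an empty matcher list matches everything
}
```

The last case follows directly from the loop in the diff. Whether config validation ever lets an empty matcher list reach `matchesAll` is not shown in this section.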
