CHANGELOG.md: 1 addition & 0 deletions
@@ -3,6 +3,7 @@
 ## master / unreleased
 * [FEATURE] Distributor: Add experimental `-distributor.enable-start-timestamp` flag for Prometheus Remote Write 2.0. When enabled, `StartTimestamp (ST)` is ingested. #7371
 * [FEATURE] Memberlist: Add `-memberlist.cluster-label` and `-memberlist.cluster-label-verification-disabled` to prevent accidental cross-cluster gossip joins and support rolling label rollout. #7385
+* [FEATURE] Querier: Add timeout classification to classify query timeouts as 4XX (user error) or 5XX (system error) based on phase timing. When enabled, queries that spend most of their time in PromQL evaluation return `422 Unprocessable Entity` instead of `503 Service Unavailable`. #7374
 * [ENHANCEMENT] Ingester: Add WAL record metrics to help evaluate the effectiveness of WAL compression type (e.g. snappy, zstd): `cortex_ingester_tsdb_wal_record_part_writes_total`, `cortex_ingester_tsdb_wal_record_parts_bytes_written_total`, and `cortex_ingester_tsdb_wal_record_bytes_saved_total`. #7420
 * [ENHANCEMENT] Metrics Helper: Add native histogram support for aggregating and merging, including dual-format histogram handling that exposes both native and classic bucket formats. #7359

Timeout classification lets Cortex distinguish between query timeouts caused by expensive user queries (4XX) and those caused by system issues (5XX). When enabled, queries that spend most of their time in PromQL evaluation are returned as `422 Unprocessable Entity` instead of `503 Service Unavailable`, giving callers a clear signal to simplify the query rather than retry.
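
From a caller's point of view, the distinction might be used as in the following Go sketch. It is only an illustration: `shouldRetry` is a hypothetical client-side helper, not part of Cortex, and the retry policy itself is an assumption about how a client could react to the two status classes.

```go
package main

import (
	"fmt"
	"net/http"
)

// shouldRetry is a hypothetical client-side policy: retry system errors,
// but do not retry a 422, which signals that the query itself is too expensive.
func shouldRetry(status int) bool {
	switch {
	case status == http.StatusUnprocessableEntity:
		return false // simplify the query instead of retrying
	case status >= 500:
		return true // system error, a retry may succeed
	default:
		return false
	}
}

func main() {
	fmt.Println("retry on 422:", shouldRetry(http.StatusUnprocessableEntity))
	fmt.Println("retry on 503:", shouldRetry(http.StatusServiceUnavailable))
}
```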

## How It Works

When an instant or range query arrives at the querier (other APIs are unchanged), the feature:

1. Subtracts any time the query spent waiting in the scheduler queue from the configured deadline.
2. Sets a proactive context timeout using the remaining budget, so the querier cancels the query slightly before the PromQL engine's own timeout fires.
3. On timeout, inspects phase timings (storage fetch time vs. total time) to compute eval time.
4. If eval time exceeds the configured threshold, classifies the timeout as a user error (4XX). Otherwise it remains a system error (5XX).

This means expensive queries that burn their budget in PromQL evaluation get a `422`, while all other timeouts remain a `5XX`.

* Note that, because query shards do not all return at the same time, the first shard to time out decides whether the query is converted to a 4XX.

## Configuration

Enable the feature and set the deadline and eval threshold:

```yaml
querier:
  timeout_classification_enabled: true
  timeout_classification_deadline: 1m59s
  timeout_classification_eval_threshold: 1m30s
```

| Flag | Default | Description |
|---|---|---|
| `timeout_classification_enabled` | `false` | Enable 5XX-to-4XX conversion based on phase timing. |
| `timeout_classification_deadline` | `1m59s` | Proactive cancellation deadline. Set this slightly below the querier timeout. |
| `timeout_classification_eval_threshold` | `1m30s` | Eval time above which a timeout is classified as a user error (4XX). Must be ≤ the deadline. |

### Constraints

- `timeout_classification_deadline` must be positive and strictly less than `querier.timeout`.
- `timeout_classification_eval_threshold` must be positive and ≤ `timeout_classification_deadline`.
- Query stats must be enabled (`query_stats_enabled: true` on the frontend handler) for classification to work.

## Tuning

- The deadline should be close to but below the querier timeout so the proactive cancellation fires first. A gap of 1–2 seconds is typical.
- The eval threshold controls sensitivity. A lower threshold classifies more timeouts as user errors; a higher threshold is more conservative. Start with the default and adjust based on your workload.
- Monitor the `decision` field in the timeout classification log line (`query shard timed out with classification`) to see how queries are being classified before enabling the conversion.

## Observability

When a query times out and query stats are enabled, the querier emits a warning-level log line containing:

- `queue_wait_time`: time spent in the scheduler queue
- `query_storage_wall_time`: time spent fetching data from storage
- `eval_time`: computed as `total_time - query_storage_wall_time`
- `decision`: `0` for 5XX (system), `1` for 4XX (user)
- `conversion_enabled`: whether the status code conversion is active

These fields are logged regardless of whether conversion is enabled, so you can observe classification behavior in dry-run mode by setting `timeout_classification_enabled: false` and reviewing the logs.
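
For a dry run, the `decision` values can be tallied from collected log lines. The Go sketch below assumes the lines are logfmt-style `key=value` pairs, which is an assumption about the log encoding; `countDecisions` is a hypothetical helper and the sample lines in `main` are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// countDecisions tallies decision=0 (system) and decision=1 (user) across
// timeout classification log lines. It assumes space-separated key=value
// pairs; adapt the parsing if your logs use another encoding such as JSON.
func countDecisions(lines []string) (system, user int) {
	for _, line := range lines {
		if !strings.Contains(line, "query shard timed out with classification") {
			continue // not a classification log line
		}
		for _, field := range strings.Fields(line) {
			switch field {
			case "decision=0":
				system++
			case "decision=1":
				user++
			}
		}
	}
	return system, user
}

func main() {
	// Hypothetical sample lines, not real Cortex output.
	lines := []string{
		`level=warn msg="query shard timed out with classification" decision=1 conversion_enabled=false`,
		`level=warn msg="query shard timed out with classification" decision=0 conversion_enabled=false`,
		`level=info msg="unrelated line" decision=1`,
	}
	system, user := countDecisions(lines)
	fmt.Printf("system=%d user=%d\n", system, user)
}
```

A high share of `decision=1` lines before enabling conversion suggests that many timeouts would become `422` responses with the current threshold.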

pkg/api/queryapi/util.go: 4 additions & 3 deletions
@@ -16,9 +16,10 @@ import (
 )

 var (
-	ErrEndBeforeStart = httpgrpc.Errorf(http.StatusBadRequest, "%s", "end timestamp must not be before start time")
-	ErrNegativeStep = httpgrpc.Errorf(http.StatusBadRequest, "%s", "zero or negative query resolution step widths are not accepted. Try a positive integer")
-	ErrStepTooSmall = httpgrpc.Errorf(http.StatusBadRequest, "%s", "exceeded maximum resolution of 11,000 points per timeseries. Try decreasing the query resolution (?step=XX)")
+	ErrEndBeforeStart = httpgrpc.Errorf(http.StatusBadRequest, "%s", "end timestamp must not be before start time")
+	ErrNegativeStep = httpgrpc.Errorf(http.StatusBadRequest, "%s", "zero or negative query resolution step widths are not accepted. Try a positive integer")
+	ErrStepTooSmall = httpgrpc.Errorf(http.StatusBadRequest, "%s", "exceeded maximum resolution of 11,000 points per timeseries. Try decreasing the query resolution (?step=XX)")