Commit 5c996e0
fix: prevent gunicorn worker recycling from corrupting histogram aggregation
Gunicorn worker recycling causes in-memory Prometheus counters to reset.
The OTel aggregation pipeline strips worker.name and sums all workers
into a single cumulative counter via groupbyattrs. When a worker recycles,
its counter resets to 0, decreasing the aggregate.
This manifests as a "hidden counter reset" in Prometheus: if the recycled
worker's final le=+Inf value coincidentally equals the new worker's
starting value (e.g. both are 1 because the new worker immediately handled
a slow request), Prometheus does not detect the reset for le=+Inf. But
le=1000 does drop, so only that bucket gets reset compensation. This
inflates rate(le=1000) relative to rate(le=+Inf), producing SLI ratios
greater than 1.
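
The hidden reset described above can be reproduced with a small simulation. This is a sketch under stated assumptions, not the project's code: the worker names, bucket keys, and sample values are hypothetical, and reset_aware_increase is a simplified model of reset-aware counter handling (no extrapolation, fixed sample list).

```python
# Hypothetical per-scrape cumulative bucket counts, keyed by worker.
SCRAPES = [
    {"a": {"le_1000": 1, "le_inf": 1}, "b": {"le_1000": 10, "le_inf": 10}},
    # Worker "a" was recycled; its replacement has served one slow request.
    {"a2": {"le_1000": 0, "le_inf": 1}, "b": {"le_1000": 10, "le_inf": 10}},
    {"a2": {"le_1000": 1, "le_inf": 2}, "b": {"le_1000": 11, "le_inf": 11}},
]


def summed(bucket):
    """Sum one bucket across all workers at each scrape (worker label stripped)."""
    return [sum(w[bucket] for w in scrape.values()) for scrape in SCRAPES]


def reset_aware_increase(samples):
    """Simplified increase: a drop is treated as a reset to 0, so the
    post-reset value counts in full."""
    increase = 0
    for prev, cur in zip(samples, samples[1:]):
        increase += cur if cur < prev else cur - prev
    return increase


fast = summed("le_1000")          # [11, 10, 12]: the drop is visible
all_requests = summed("le_inf")   # [11, 11, 13]: the drop is hidden
ratio = reset_aware_increase(fast) / reset_aware_increase(all_requests)
print(fast, all_requests, ratio)  # ratio is 6.0, far above 1
```

In this model only two fast requests and three requests in total arrived after the first scrape, so the true ratio is 2/3; the compensation applied to le=1000 alone pushes the computed ratio to 6.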
Fix: insert cumulativetodelta before worker aggregation so we sum
per-worker deltas (always non-negative) instead of cumulative totals.
Worker recycles produce a 0-delta rather than a negative value that
corrupts the aggregate. Add deltatocumulative after groupbyattrs to
convert the aggregate delta back to a cumulative counter for the
Prometheus exporter.
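
The delta-then-sum ordering can be sketched as follows. This is an illustrative model only: aggregate_via_deltas is a hypothetical helper, the sample data is invented, and the rule that a first sighting or a drop yields a zero delta mirrors the description in this message rather than the processors' exact behavior.

```python
# Hypothetical per-scrape cumulative bucket counts, keyed by worker.
SCRAPES = [
    {"a": {"le_1000": 1, "le_inf": 1}, "b": {"le_1000": 10, "le_inf": 10}},
    # Worker "a" was recycled; its replacement has served one slow request.
    {"a2": {"le_1000": 0, "le_inf": 1}, "b": {"le_1000": 10, "le_inf": 10}},
    {"a2": {"le_1000": 1, "le_inf": 2}, "b": {"le_1000": 11, "le_inf": 11}},
]


def aggregate_via_deltas(scrapes):
    """Per-worker delta, then sum across workers, then re-accumulate."""
    last = {}     # (worker, bucket) -> last cumulative value seen
    running = {}  # bucket -> re-accumulated aggregate
    out = []
    for scrape in scrapes:
        step = {}
        for worker, buckets in scrape.items():
            for bucket, value in buckets.items():
                prev = last.get((worker, bucket))
                # Assumption: a first sighting or a drop yields a zero delta.
                delta = value - prev if prev is not None and value >= prev else 0
                last[(worker, bucket)] = value
                step[bucket] = step.get(bucket, 0) + delta
        for bucket, delta in step.items():
            running[bucket] = running.get(bucket, 0) + delta
        out.append(dict(running))
    return out


series = aggregate_via_deltas(SCRAPES)
print(series)
```

In this model the aggregate never decreases and le_1000 never exceeds le_inf, so the ratio stays at or below 1. The trade-off is that whatever a new worker counted before its first observed point is not included in the aggregate.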
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 7b1841c commit 5c996e0
1 file changed
Lines changed: 11 additions & 1 deletion
[Diff content not preserved. Hunk 1: 8 lines added at new lines 46-53. Hunk 2: original line 64 replaced by new line 72; lines added at new lines 75 and 79.]