Conversation
The OTel metrics/aggregation pipeline stripped worker.name and summed per-worker cumulative counters via groupbyattrs. When gunicorn recycled workers (--max-requests), the aggregate could decrease, causing hidden counter resets that made rate(le=1000)/rate(le=+Inf) exceed 1 in SLO queries. Two prior fixes failed: - cumulativetodelta+deltatocumulative: deltatocumulative not in Red Hat OTel 0.107.0 build - SDK delta temporality: the Prometheus exporter does not accumulate delta histograms (known upstream limitation, issue #19153) This fix removes worker aggregation entirely so each worker is its own Prometheus time series. Prometheus rate() handles per-worker counter resets correctly, and sum(rate(...)) in SLO queries aggregates across workers after rate computation. To offset the cardinality increase from keeping worker_name, a transform processor normalizes http_target from hundreds of URL patterns to just "upload" or "other". SLO queries in app-interface need a corresponding update from regex filters to equality filters on http_target. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor
Reviewer's guide (collapsed on small PRs)Reviewer's GuideRemoves worker-level metric aggregation and OTEL delta temporality to fix invalid API latency SLI ratios, and introduces an OTEL transform to coarsen http_target labels into low-cardinality buckets while wiring it into the metrics pipeline and cleaning up now-unneeded processors and env vars. Sequence diagram for metrics flow with worker_name and simplified http_targetsequenceDiagram
participant PulpWorker1
participant PulpWorker2
participant OTelSDK
participant OTelCollector as OTelCollectorProcessors
participant Prometheus
participant SLOQueryEngine
PulpWorker1->>OTelSDK: record api.request_duration with worker_name and http_target
PulpWorker2->>OTelSDK: record api.request_duration with worker_name and http_target
OTelSDK->>OTelCollector: export metrics with cumulative histogram
OTelCollector->>OTelCollector: apply memory_limiter
OTelCollector->>OTelCollector: apply filter_filter_pulp_api_request_duration
OTelCollector->>OTelCollector: apply transform_simplify_http_target
Note over OTelCollector: http_target set to upload or other
OTelCollector->>OTelCollector: apply batch_api_aggregation
OTelCollector->>Prometheus: expose metrics with worker_name and simplified http_target
SLOQueryEngine->>Prometheus: execute rate(api.request_duration) per worker_name and http_target
Prometheus-->>SLOQueryEngine: per worker series
SLOQueryEngine->>SLOQueryEngine: sum per worker rates to compute SLI ratio
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Contributor
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- The
transform/simplify_http_targetprocessor currently runs against all metrics; consider scoping themetric_statementsto just theapi.request_durationseries (or relevant metrics) to avoid unintentionally rewritinghttp.targeton unrelated data. - The
IsMatch(attributes["http.target"], ".*upload.*")pattern is very broad; if only specific upload endpoints should be grouped, consider tightening the regex to avoid accidentally classifying unrelated paths containing the substringuploadas"upload".
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The `transform/simplify_http_target` processor currently runs against all metrics; consider scoping the `metric_statements` to just the `api.request_duration` series (or relevant metrics) to avoid unintentionally rewriting `http.target` on unrelated data.
- The `IsMatch(attributes["http.target"], ".*upload.*")` pattern is very broad; if only specific upload endpoints should be grouped, consider tightening the regex to avoid accidentally classifying unrelated paths containing the substring `upload` as `"upload"`.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The
APIWriteRequestLatencySLO reports values > 1 (impossible for a valid histogram ratio). The OTel pipeline'sattributes/remove_worker_name+groupbyattrs/api_aggregationsums per-worker cumulative counters into a single series. When gunicorn recycles workers (--max-requests), the aggregate drops, causing hidden counter resets that inflaterate(le=1000)relative torate(le=+Inf).Two prior fixes failed:
deltatocumulativeis not included in the Red Hat OTel Collector 0.107.0 buildFix
Remove the worker aggregation entirely. Each worker keeps its own
worker_namelabel in Prometheus, sorate()handles per-worker counter resets correctly (this is what it's designed for). The existingsum(rate(...))pattern in SLO queries and recording rules computesrate()first (per worker), thensumaggregates — which is mathematically correct.To offset the cardinality increase from keeping
worker_name, atransformprocessor normalizeshttp_targetfrom hundreds of unique URL patterns to just"upload"or"other".Changes
attributes/remove_worker_nameprocessor definition and pipeline referencegroupbyattrs/api_aggregationprocessor definition and pipeline referenceOTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=deltafrom all three deployments (reverts fix: prevent gunicorn worker recycling from corrupting histogram aggregation #1068)transform/simplify_http_targetprocessor that normalizeshttp_targetto"upload"or"other"Companion change required
SLO queries in app-interface need updating to match the new
http_targetvalues:http_target!~".*/upload/"→http_target!="upload"http_target=~".*/upload/"→http_target="upload"Test plan
worker_namelabel present,http_targetvalues are"upload"or"other"🤖 Generated with Claude Code
Summary by Sourcery
Adjust metrics pipeline configuration to avoid incorrect API latency SLI ratios by preserving per-worker series and simplifying HTTP target labels.
New Features:
Enhancements: