Skip to content

fix: remove worker aggregation and normalize http_target to fix SLI > 1#1127

Merged
dkliban merged 1 commit intopulp:mainfrom
dkliban:fix-otel-histogram-worker-aggregation
Apr 26, 2026
Merged

fix: remove worker aggregation and normalize http_target to fix SLI > 1#1127
dkliban merged 1 commit intopulp:mainfrom
dkliban:fix-otel-histogram-worker-aggregation

Conversation

@dkliban
Copy link
Copy Markdown
Member

@dkliban dkliban commented Apr 26, 2026

Problem

The APIWriteRequestLatency SLO reports values > 1 (impossible for a valid histogram ratio). The OTel pipeline's attributes/remove_worker_name + groupbyattrs/api_aggregation sums per-worker cumulative counters into a single series. When gunicorn recycles workers (--max-requests), the aggregate drops, causing hidden counter resets that inflate rate(le=1000) relative to rate(le=+Inf).

Two prior fixes failed:

Fix

Remove the worker aggregation entirely. Each worker keeps its own worker_name label in Prometheus, so rate() handles per-worker counter resets correctly (this is what it's designed for). The existing sum(rate(...)) pattern in SLO queries and recording rules computes rate() first (per worker), then sum aggregates — which is mathematically correct.

To offset the cardinality increase from keeping worker_name, a transform processor normalizes http_target from hundreds of unique URL patterns to just "upload" or "other".

Changes

  • Removed: attributes/remove_worker_name processor definition and pipeline reference
  • Removed: groupbyattrs/api_aggregation processor definition and pipeline reference
  • Removed: OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=delta from all three deployments (reverts fix: prevent gunicorn worker recycling from corrupting histogram aggregation #1068)
  • Added: transform/simplify_http_target processor that normalizes http_target to "upload" or "other"

Companion change required

SLO queries in app-interface need updating to match the new http_target values:

  • http_target!~".*/upload/"http_target!="upload"
  • http_target=~".*/upload/"http_target="upload"

Test plan

  • Deploy to ephemeral — confirm collector starts, worker_name label present, http_target values are "upload" or "other"
  • Deploy to stage — confirm SLI ratio ≤ 1 with 1h window
  • Deploy to production — monitor for 24h

🤖 Generated with Claude Code

Summary by Sourcery

Adjust metrics pipeline configuration to avoid incorrect API latency SLI ratios by preserving per-worker series and simplifying HTTP target labels.

New Features:

  • Introduce a transform processor that normalizes http.target metric attributes into coarse "upload" or "other" categories.

Enhancements:

  • Remove worker-name stripping and aggregation processors from the metrics pipeline so Prometheus retains per-worker time series.
  • Revert use of delta temporality for OTLP metrics export by dropping the OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE environment variable from all deployments.

The OTel metrics/aggregation pipeline stripped worker.name and summed
per-worker cumulative counters via groupbyattrs. When gunicorn recycled
workers (--max-requests), the aggregate could decrease, causing hidden
counter resets that made rate(le=1000)/rate(le=+Inf) exceed 1 in SLO
queries.

Two prior fixes failed:
- cumulativetodelta+deltatocumulative: deltatocumulative not in Red Hat
  OTel 0.107.0 build
- SDK delta temporality: the Prometheus exporter does not accumulate
  delta histograms (known upstream limitation, issue #19153)

This fix removes worker aggregation entirely so each worker is its own
Prometheus time series. Prometheus rate() handles per-worker counter
resets correctly, and sum(rate(...)) in SLO queries aggregates across
workers after rate computation.

To offset the cardinality increase from keeping worker_name, a transform
processor normalizes http_target from hundreds of URL patterns to just
"upload" or "other". SLO queries in app-interface need a corresponding
update from regex filters to equality filters on http_target.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sourcery-ai
Copy link
Copy Markdown
Contributor

sourcery-ai Bot commented Apr 26, 2026

Reviewer's guide (collapsed on small PRs)

Reviewer's Guide

Removes worker-level metric aggregation and OTEL delta temporality to fix invalid API latency SLI ratios, and introduces an OTEL transform to coarsen http_target labels into low-cardinality buckets while wiring it into the metrics pipeline and cleaning up now-unneeded processors and env vars.

Sequence diagram for metrics flow with worker_name and simplified http_target

sequenceDiagram
  participant PulpWorker1
  participant PulpWorker2
  participant OTelSDK
  participant OTelCollector as OTelCollectorProcessors
  participant Prometheus
  participant SLOQueryEngine

  PulpWorker1->>OTelSDK: record api.request_duration with worker_name and http_target
  PulpWorker2->>OTelSDK: record api.request_duration with worker_name and http_target

  OTelSDK->>OTelCollector: export metrics with cumulative histogram

  OTelCollector->>OTelCollector: apply memory_limiter
  OTelCollector->>OTelCollector: apply filter_filter_pulp_api_request_duration
  OTelCollector->>OTelCollector: apply transform_simplify_http_target
  Note over OTelCollector: http_target set to upload or other
  OTelCollector->>OTelCollector: apply batch_api_aggregation

  OTelCollector->>Prometheus: expose metrics with worker_name and simplified http_target

  SLOQueryEngine->>Prometheus: execute rate(api.request_duration) per worker_name and http_target
  Prometheus-->>SLOQueryEngine: per worker series
  SLOQueryEngine->>SLOQueryEngine: sum per worker rates to compute SLI ratio
Loading

File-Level Changes

Change Details Files
Replace worker-name stripping and group-by aggregation with a transform that normalizes http_target and keeps per-worker series for correct rate() behavior.
  • Remove attributes/remove_worker_name processor definition and its attachment to the metrics pipeline so worker.name labels are preserved in Prometheus.
  • Delete groupbyattrs/api_aggregation processor and stop using it in the metrics pipeline, leaving only batch/api_aggregation for batching.
  • Add transform/simplify_http_target processor that rewrites http.target datapoint attributes to either "upload" or "other" using IsMatch-based conditions.
  • Wire the new transform/simplify_http_target processor into the metrics/collector pipeline in place of attributes/remove_worker_name and groupbyattrs/api_aggregation.
deploy/clowdapp.yaml
Revert the OTLP metrics temporality override so the exporter uses its default behavior instead of forced delta temporality that is not correctly handled by the Prometheus exporter.
  • Remove OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=delta from the main application deployment environment.
  • Remove OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=delta from the additional worker/beat deployments that send metrics.
  • Ensure all three deployments rely on the collector/exporter defaults for histogram temporality.
deploy/clowdapp.yaml

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link
Copy Markdown
Contributor

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've left some high level feedback:

  • The transform/simplify_http_target processor currently runs against all metrics; consider scoping the metric_statements to just the api.request_duration series (or relevant metrics) to avoid unintentionally rewriting http.target on unrelated data.
  • The IsMatch(attributes["http.target"], ".*upload.*") pattern is very broad; if only specific upload endpoints should be grouped, consider tightening the regex to avoid accidentally classifying unrelated paths containing the substring upload as "upload".
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The `transform/simplify_http_target` processor currently runs against all metrics; consider scoping the `metric_statements` to just the `api.request_duration` series (or relevant metrics) to avoid unintentionally rewriting `http.target` on unrelated data.
- The `IsMatch(attributes["http.target"], ".*upload.*")` pattern is very broad; if only specific upload endpoints should be grouped, consider tightening the regex to avoid accidentally classifying unrelated paths containing the substring `upload` as `"upload"`.

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@dkliban dkliban merged commit 958e514 into pulp:main Apr 26, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant