Skip to content

fix(otel): custom metrics coexistence, FastAPI HTTP metrics, and export fixes#302

Draft
james-cardenas wants to merge 7 commits into
mainfrom
jamesc-agentex-otel-fixes
Draft

fix(otel): custom metrics coexistence, FastAPI HTTP metrics, and export fixes#302
james-cardenas wants to merge 7 commits into
mainfrom
jamesc-agentex-otel-fixes

Conversation

@james-cardenas

@james-cardenas james-cardenas commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Summary

Extracts the OTel fixes from #282 into a clean branch on current main, without the grant-perf changes or the reverted grant removal.

  • Shared MeterProvider coexistence: custom app metrics (auth_cache.*, db.client.*) attach to the OTel Operator's existing provider instead of replacing it (fixes missing auto-instrumentation HTTP metrics).
  • Opt-in FastAPI HTTP metrics: in-process instrumentation for http_server_* metrics when the operator auto-instrumentation path is unavailable, instrumented at import time.
  • Resource attributes: use OTELResourceDetector for custom metrics resource attributes; append PID to service.instance.id on standalone MeterProvider.
  • Export behavior: cumulative export for metrics; improved push/export logging.

Commits (cherry-picked from #282)

  1. Push metrics logging and auto-instrument fix
  2. Add opt-in in-process FastAPI OTel HTTP metrics instrumentation
  3. Fix FastAPI OTel HTTP metrics by instrumenting at import time
  4. Use OTELResourceDetector for custom metrics resource attributes
  5. Set Cumulative export for Metrics
  6. Append PID to service.instance.id on standalone MeterProvider resource

Excluded from #282: grant perf removal/revert commits and the duplicate coexistence commit already on main (#287).

Test plan

  • uv run pytest tests/unit/utils/test_otel_metrics.py — 35 passed
  • CI green
  • Deploy to staging and verify http_server_request_duration_seconds + custom metrics in Grafana

Made with Cursor

Greptile Summary

This PR extracts OTel observability fixes from #282 into a clean branch. It addresses three related problems: custom app metrics now attach to an existing OTel Operator MeterProvider without replacing it; in-process FastAPI HTTP metrics instrumentation is available as an opt-in feature and is applied at module-import time (before Starlette caches its middleware stack); and OTLP histogram export is switched to cumulative temporality for correct Mimir rate()/increase() queries.

  • Coexistence + standalone: init_otel_metrics checks for an existing SDK MeterProvider first; if absent and an OTLP endpoint is configured it installs one, gracefully handling a post-set_meter_provider race by shutting down the orphan.
  • FastAPI instrumentation: configure_app_metrics(fastapi_app) is now called at module-import time (replacing the lifespan call), with idempotency checks and a guard against double-instrumentation by the OTel Operator.
  • Resource detection: OTELResourceDetector replaces hand-coded Resource.create(...), and a per-process PID suffix is appended to service.instance.id to disambiguate uvicorn workers.

Confidence Score: 4/5

Safe to merge for the coexistence, standalone, and temporality changes; the FastAPI opt-in gate has a logic gap worth addressing before production rollout with AGENTEX_OTEL_HTTP_METRICS_ENABLED enabled.

The coexistence/standalone provider logic, orphan-shutdown race handling, and cumulative temporality fix are well-tested and sound. The deny-list in _http_metrics_enabled means an empty env-var value (e.g. a blank ConfigMap entry) would silently enable in-process HTTP instrumentation and cause double-instrumentation with the OTel Operator. The service_name/service_version/environment params are now dead in the resource-building path, which creates API confusion but no immediate runtime breakage since no current caller passes them.

agentex/src/utils/otel_metrics.py — specifically _http_metrics_enabled (deny-list) and the now-unused init_otel_metrics keyword arguments.

Important Files Changed

Filename Overview
agentex/src/utils/otel_metrics.py Core OTel metrics module significantly reworked: adds coexistence/standalone paths, cumulative temporality, OTELResourceDetector, FastAPI instrumentation. Two issues: deny-list in _http_metrics_enabled can unexpectedly enable the opt-in feature (e.g. on empty env var); service_name/service_version/environment params are no longer reflected in the resource but remain in the signature.
agentex/src/api/app.py Moves init_otel_metrics() call from lifespan to module-import time via configure_app_metrics(fastapi_app), correctly ensuring FastAPI middleware is instrumented before the ASGI server's first message.
agentex/tests/unit/utils/test_otel_metrics.py Comprehensive new tests covering coexistence, standalone, orphan-shutdown, resource detection, cumulative temporality, and FastAPI HTTP metrics. Notably adds reset of _METER_PROVIDER_SET_ONCE for proper inter-test isolation of the SDK proxy state.
agentex/pyproject.toml Adds opentelemetry-instrumentation-fastapi>=0.49b0 as a direct dependency for in-process HTTP metrics.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[configure_app_metrics called at module import] --> B[init_otel_metrics]
    B --> C{SDK MeterProvider already global?}
    C -->|Yes| D[coexistence mode: attach via get_meter]
    C -->|No| E{OTLP endpoint configured?}
    E -->|No| F[disabled mode: return None]
    E -->|Yes| G[Build resource via OTELResourceDetector + PID]
    G --> H[Create MeterProvider with cumulative temporality]
    H --> I[set_meter_provider]
    I --> J{Global now points to our provider?}
    J -->|Yes| K[standalone mode: _meter_provider = provider]
    J -->|No - race condition| L[orphan shutdown: use existing global]
    A --> M{AGENTEX_OTEL_HTTP_METRICS_ENABLED?}
    M -->|false - default| N[skip FastAPI instrumentation]
    M -->|true| O{App already instrumented?}
    O -->|Yes| P[skip - idempotent]
    O -->|No| Q[FastAPIInstrumentor.instrument_app]
    Q --> R[http.server.request.duration metrics recorded]
Loading

Comments Outside Diff (1)

  1. agentex/src/utils/otel_metrics.py, line 194-200 (link)

    P2 service_name, service_version, and environment args are now dead parameters

    Since the resource is built entirely by _build_resource() (which reads from OTELResourceDetector / env vars), the service_name, service_version, and environment arguments are no longer reflected in the metrics resource. resolved_service_name only appears in the log string, not in the Resource. A caller who passes init_otel_metrics(service_name="my-svc") will see "my-svc" in logs but not in Prometheus labels — a subtle mismatch. Either wire these args into _build_resource() or remove them from the signature (with a deprecation note if they're part of a public API).

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: agentex/src/utils/otel_metrics.py
    Line: 194-200
    
    Comment:
    **`service_name`, `service_version`, and `environment` args are now dead parameters**
    
    Since the resource is built entirely by `_build_resource()` (which reads from `OTELResourceDetector` / env vars), the `service_name`, `service_version`, and `environment` arguments are no longer reflected in the metrics resource. `resolved_service_name` only appears in the log string, not in the `Resource`. A caller who passes `init_otel_metrics(service_name="my-svc")` will see "my-svc" in logs but not in Prometheus labels — a subtle mismatch. Either wire these args into `_build_resource()` or remove them from the signature (with a deprecation note if they're part of a public API).
    
    How can I resolve this? If you propose a fix, please make it concise.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

    Fix in Cursor Fix in Claude Code Fix in Codex

Fix All in Cursor Fix All in Claude Code Fix All in Codex

Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
agentex/src/utils/otel_metrics.py:345-348
**Deny-list silently enables opt-in feature on unexpected values**

Any value not in `{"0", "false", "no", "off"}` — including an empty string — returns `True` and enables HTTP metrics. In Kubernetes it is common to inherit env vars from a `ConfigMap` where a key exists but its value is empty (e.g. `AGENTEX_OTEL_HTTP_METRICS_ENABLED: ""`). That would silently enable in-process HTTP instrumentation and trigger double-instrumentation with the OTel Operator, producing duplicate `http_server_*` timeseries and potential cardinality issues.

An allow-list approach is safer for an opt-in flag: `return flag in {"1", "true", "yes", "on"}`.

```suggestion
def _http_metrics_enabled() -> bool:
    """Return whether in-process FastAPI HTTP metrics should be installed."""
    flag = os.environ.get("AGENTEX_OTEL_HTTP_METRICS_ENABLED", "false").strip().lower()
    return flag in {"1", "true", "yes", "on"}
```

### Issue 2 of 2
agentex/src/utils/otel_metrics.py:194-200
**`service_name`, `service_version`, and `environment` args are now dead parameters**

Since the resource is built entirely by `_build_resource()` (which reads from `OTELResourceDetector` / env vars), the `service_name`, `service_version`, and `environment` arguments are no longer reflected in the metrics resource. `resolved_service_name` only appears in the log string, not in the `Resource`. A caller who passes `init_otel_metrics(service_name="my-svc")` will see "my-svc" in logs but not in Prometheus labels — a subtle mismatch. Either wire these args into `_build_resource()` or remove them from the signature (with a deprecation note if they're part of a public API).

Reviews (1): Last reviewed commit: "Append PID to service.instance.id on sta..." | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.

@james-cardenas james-cardenas requested a review from a team as a code owner June 11, 2026 19:58
@socket-security

socket-security Bot commented Jun 11, 2026

Copy link
Copy Markdown

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff Package Supply Chain
Security
Vulnerability Quality Maintenance License
Addedpypi/​opentelemetry-instrumentation-fastapi@​0.60b1100100100100100

View full report

@james-cardenas james-cardenas marked this pull request as draft June 11, 2026 20:00
Comment on lines +345 to +348
def _http_metrics_enabled() -> bool:
"""Return whether in-process FastAPI HTTP metrics should be installed."""
flag = os.environ.get("AGENTEX_OTEL_HTTP_METRICS_ENABLED", "false").strip().lower()
return flag not in {"0", "false", "no", "off"}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Deny-list silently enables opt-in feature on unexpected values

Any value not in {"0", "false", "no", "off"} — including an empty string — returns True and enables HTTP metrics. In Kubernetes it is common to inherit env vars from a ConfigMap where a key exists but its value is empty (e.g. AGENTEX_OTEL_HTTP_METRICS_ENABLED: ""). That would silently enable in-process HTTP instrumentation and trigger double-instrumentation with the OTel Operator, producing duplicate http_server_* timeseries and potential cardinality issues.

An allow-list approach is safer for an opt-in flag: return flag in {"1", "true", "yes", "on"}.

Suggested change
def _http_metrics_enabled() -> bool:
"""Return whether in-process FastAPI HTTP metrics should be installed."""
flag = os.environ.get("AGENTEX_OTEL_HTTP_METRICS_ENABLED", "false").strip().lower()
return flag not in {"0", "false", "no", "off"}
def _http_metrics_enabled() -> bool:
"""Return whether in-process FastAPI HTTP metrics should be installed."""
flag = os.environ.get("AGENTEX_OTEL_HTTP_METRICS_ENABLED", "false").strip().lower()
return flag in {"1", "true", "yes", "on"}
Prompt To Fix With AI
This is a comment left during a code review.
Path: agentex/src/utils/otel_metrics.py
Line: 345-348

Comment:
**Deny-list silently enables opt-in feature on unexpected values**

Any value not in `{"0", "false", "no", "off"}` — including an empty string — returns `True` and enables HTTP metrics. In Kubernetes it is common to inherit env vars from a `ConfigMap` where a key exists but its value is empty (e.g. `AGENTEX_OTEL_HTTP_METRICS_ENABLED: ""`). That would silently enable in-process HTTP instrumentation and trigger double-instrumentation with the OTel Operator, producing duplicate `http_server_*` timeseries and potential cardinality issues.

An allow-list approach is safer for an opt-in flag: `return flag in {"1", "true", "yes", "on"}`.

```suggestion
def _http_metrics_enabled() -> bool:
    """Return whether in-process FastAPI HTTP metrics should be installed."""
    flag = os.environ.get("AGENTEX_OTEL_HTTP_METRICS_ENABLED", "false").strip().lower()
    return flag in {"1", "true", "yes", "on"}
```

How can I resolve this? If you propose a fix, please make it concise.

Fix in Cursor Fix in Claude Code Fix in Codex

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant