Commit d65011a

fix(agentex): bootstrap OTel auto-instrumentation in uvicorn spawn workers (#305)
## Summary

- Bootstrap OpenTelemetry auto-instrumentation at import time so uvicorn **spawn** workers get HTTP/library instrumentation and custom metrics, not just the parent process
- Assign a per-worker `service.instance.id` by patching `OTEL_RESOURCE_ATTRIBUTES` before `initialize()`, fixing shared timeseries when multiple workers inherit the same pod-level resource attrs ([opentelemetry-python#4390](open-telemetry/opentelemetry-python#4390))
- Move `otel_metrics` to the first import in `app.py` so instrumentors patch FastAPI/httpx/SQLAlchemy at import time; `init_otel_metrics()` at lifespan startup attaches custom instruments to the existing global `MeterProvider`

## Problem

The OTel Operator injects auto-instrumentation via `sitecustomize`, which runs `initialize()` in the **parent** process and then strips auto-instrumentation from `PYTHONPATH`. Uvicorn **spawn** workers are fresh Python processes without `sitecustomize`, so they previously served plain `FastAPI` with no OTel HTTP middleware or metrics.

Separately, spawn workers share the pod-level `OTEL_RESOURCE_ATTRIBUTES` env var. Auto-instrumentation builds provider resources from env at `initialize()` time via `Resource.create()`, so without a per-process suffix all workers would emit on the same `service.instance.id` ([opentelemetry-python#4390](open-telemetry/opentelemetry-python#4390)).

## Solution

1. **`bootstrap_auto_instrumentation()`**: called from `app.py` immediately after `otel_metrics` is imported; syncs a pid-suffixed `service.instance.id` (`<id>.<pid>`) into env, then calls `initialize()` to create the global `TracerProvider`/`MeterProvider` and load instrumentors ([auto-instrumentation reference](https://opentelemetry.io/docs/languages/python/automatic/))
2. **`init_otel_metrics()`**: unchanged coexistence model. Attaches custom app metrics (`auth_cache_*`, `db_*`) to the bootstrap provider when present; creates a standalone OTLP pipeline only when no global provider exists
3. **Import order**: `app.py` imports `otel_metrics` first, before FastAPI and other auto-instrumented libraries

## References

- [OTel Python auto-instrumentation](https://opentelemetry.io/docs/languages/python/automatic/)
- [OTel Python manual SDK configuration](https://opentelemetry.io/docs/languages/python/getting-started/): `initialize()` creates providers via `Resource.create()` from env, not patching-only
- [opentelemetry-python#4390](open-telemetry/opentelemetry-python#4390): per-worker `service.instance.id` for multi-worker processes
- [Uvicorn deployment / workers](https://www.uvicorn.org/deployment/): spawn workers are separate processes

## Notes

- **Helm (single worker):** bootstrap runs in the worker process; operator k8s resource labels are preserved; duplicate `initialize()` on `--workers 1` only produces set-once warnings
- **Dockerfile (`--workers 4`):** each spawn worker bootstraps independently with a distinct pid-suffixed `service.instance.id`
- **ddtrace coexistence:** documented in module docstring; Helm uses `ddtrace-run` only when `datadog.env` is set
- Removed ineffective `NoOpMeterProvider` reset on shutdown: the OTel global `MeterProvider` is set-once and cannot be replaced

## Test plan

- [x] `pytest agentex/tests/unit/utils/test_otel_metrics.py` (24 tests)
- [ ] Deploy to cluster; confirm `_InstrumentedFastAPI` in spawn workers
- [ ] Confirm distinct `service.instance.id` per worker when `--workers > 1`
- [ ] Confirm FastAPI HTTP metrics include `http_route` and k8s resource labels from operator env
- [ ] Confirm custom metrics (`auth_cache_*`, `db_*`) export on the same provider resource

Made with [Cursor](https://cursor.com)

<!-- greptile_comment -->

<h3>Greptile Summary</h3>

This PR bootstraps OTel auto-instrumentation in uvicorn spawn workers by importing `otel_metrics` first in `app.py` and calling `bootstrap_auto_instrumentation()` before any auto-instrumented library is imported. It also assigns a per-worker `service.instance.id` by appending the process PID to avoid shared timeseries across workers.

- **Bootstrap flow**: `bootstrap_auto_instrumentation()` syncs a PID-suffixed `service.instance.id` into `OTEL_RESOURCE_ATTRIBUTES`, then calls `initialize()` to install global `TracerProvider`/`MeterProvider` and patch libraries; failures are caught and logged without crashing the service.
- **Custom metrics coexistence**: `init_otel_metrics()` attaches to the bootstrap provider when present, or creates a standalone OTLP pipeline when no global provider exists; the `NoOpMeterProvider` reset on shutdown is removed because OTel globals are set-once.
- **Test coverage**: 24 unit tests added for bootstrap, idempotency, failure recovery, and resource attribute construction.

<details><summary><h3>Confidence Score: 5/5</h3></summary>

Safe to merge: the bootstrap path is well-guarded with try/except and idempotency checks; the import-order change in app.py achieves the intended patching window.

Bootstrap failures are caught and logged without crashing the service. The per-worker service.instance.id logic correctly handles the idempotency case and the env-mutation is scoped to resource attribute construction before initialize(). The two observations noted are edge cases with no current practical impact given how init_otel_metrics is called today.

Files to review: agentex/src/utils/otel_metrics.py (_sync_instance_id_to_env comma splitting and the resource merge precedence in init_otel_metrics standalone mode).

</details>

<h3>Important Files Changed</h3>

| Filename | Overview |
|----------|----------|
| agentex/src/utils/otel_metrics.py | Adds bootstrap_auto_instrumentation() with proper exception handling and idempotency; introduces _sync_instance_id_to_env that splits OTEL_RESOURCE_ATTRIBUTES by bare comma without handling quoted values, and a resource merge pattern where env-detected attributes in the pid_resource can silently override explicit service_name/service_version/environment args passed to init_otel_metrics in standalone mode. |
| agentex/src/api/app.py | otel_metrics moved to first import with bootstrap_auto_instrumentation() call before FastAPI/httpx/SQLAlchemy; ruff E402 suppressed with a clear comment; no functional regressions. |
| agentex/tests/unit/utils/test_otel_metrics.py | Comprehensive bootstrap test coverage: import blocking, failure/retry semantics, idempotency, env mutation, and PID-suffixed resource attributes; autouse fixture correctly resets bootstrap flag between tests. |

<details><summary><h3>Sequence Diagram</h3></summary>

```mermaid
sequenceDiagram
    participant UV as uvicorn spawn worker
    participant APP as app.py
    participant OTEL as otel_metrics.py
    participant SDK as OTel SDK
    participant LS as FastAPI lifespan
    UV->>APP: import app
    APP->>OTEL: import otel_metrics
    APP->>OTEL: bootstrap_auto_instrumentation()
    OTEL->>OTEL: _sync_instance_id_to_env with pid suffix
    OTEL->>SDK: initialize()
    SDK-->>OTEL: TracerProvider + MeterProvider set globally
    Note over SDK: FastAPI, httpx, SQLAlchemy patched
    OTEL-->>APP: True
    APP->>APP: import FastAPI and other libraries
    UV->>LS: startup lifespan
    LS->>OTEL: init_otel_metrics()
    alt Bootstrap provider exists
        OTEL-->>LS: attach custom instruments to bootstrap provider
    else No global provider standalone mode
        OTEL->>SDK: MeterProvider with pid-suffixed resource
        OTEL-->>LS: new standalone MeterProvider
    end
    UV->>LS: shutdown lifespan
    LS->>OTEL: shutdown_otel_metrics()
    Note over OTEL: Only shuts down module-created provider
```

</details>

<details><summary>Prompt To Fix All With AI</summary>

`````markdown
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
agentex/src/utils/otel_metrics.py:110-114
The comma-split on `OTEL_RESOURCE_ATTRIBUTES` doesn't account for quoted values (e.g., `some.key="hello,world"`). The OTel spec allows commas inside quoted values, and the `OTELResourceDetector` handles them correctly. This bare `split(",")` would shatter a quoted entry into fragments and reassemble a corrupted string. If the operator ever injects a label whose value contains a comma, the attribute will be silently mangled in the environment for every subsequent call that reads it (including `initialize()` itself).

```suggestion
    import re

    # Split only on commas that are NOT inside double-quoted values.
    raw = os.environ.get("OTEL_RESOURCE_ATTRIBUTES", "")
    parts = [
        part.strip()
        for part in re.split(r',(?=(?:[^"]*"[^"]*")*[^"]*$)', raw)
        if part.strip() and not part.strip().startswith(f"{key}=")
    ]
```

### Issue 2 of 2
agentex/src/utils/otel_metrics.py:273-277
**Resource merge silently drops explicit constructor args**

`Resource.create({"service.instance.id": ...})` runs `OTELResourceDetector` internally, so the second resource carries ALL env attributes at "env" priority. Because `Resource.merge` gives the `other` argument (the pid resource) precedence, any env value for `service.name`, `service.version`, or `deployment.environment` in `OTEL_RESOURCE_ATTRIBUTES` will silently override the `service_name`, `service_version`, and `environment` keyword arguments passed to `init_otel_metrics`. Today's call site passes no args so the values are identical, but a caller passing `service_name="custom"` with `OTEL_SERVICE_NAME=env-name` set would find `custom` discarded. Consider creating the pid resource with only the one explicit attribute and merging the detected env resource separately so precedence is unambiguous.
`````

</details>

<sub>Reviews (5): Last reviewed commit: ["Merge branch 'main' into jamesc-fix-auto..."](490a715)</sub>

<!-- /greptile_comment -->
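
The comma-splitting concern from the review's first issue can be reproduced with the standard library alone. The sketch below is hypothetical: `replace_instance_id` is an illustrative helper, not the function in `otel_metrics.py`, and the sample value `team="a, b"` is made up. It compares the bare `split(",")` with the regex from the review suggestion. Whether the OTel SDK's own detector accepts quoted commas is not verified here; the sketch only shows what each splitting strategy does to the string.

```python
import re

KEY = "service.instance.id"

# Regex from the review suggestion: split only on commas that are followed by
# an even number of double quotes, i.e. commas outside double-quoted values.
_OUTSIDE_QUOTES = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')


def replace_instance_id(raw: str, instance_id: str, quote_aware: bool) -> str:
    """Drop any existing service.instance.id entry and append a new one."""
    pieces = _OUTSIDE_QUOTES.split(raw) if quote_aware else raw.split(",")
    kept = [
        piece.strip()
        for piece in pieces
        if piece.strip() and not piece.strip().startswith(f"{KEY}=")
    ]
    kept.append(f"{KEY}={instance_id}")
    return ",".join(kept)


# Hypothetical env value with a comma (and a space) inside a quoted value.
raw = 'team="a, b",service.instance.id=pod-1'

naive = replace_instance_id(raw, "pod-1.42", quote_aware=False)
aware = replace_instance_id(raw, "pod-1.42", quote_aware=True)

print(naive)  # the quoted value loses its inner space
print(aware)  # the quoted value is preserved
```

In this sample the bare split happens to rejoin the fragments in order, so the visible damage is limited to the stripped whitespace inside the quotes. A fragment that itself begins with `service.instance.id=` would be dropped entirely.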
1 parent a039fc3 commit d65011a

3 files changed

Lines changed: 328 additions & 31 deletions

agentex/src/api/app.py

Lines changed: 12 additions & 1 deletion
@@ -1,3 +1,15 @@
+# ruff: noqa: E402
+# E402 suppressed: bootstrap_auto_instrumentation() must run before imports of
+# auto-instrumented libraries (FastAPI, httpx, SQLAlchemy, etc.).
+
+from src.utils.otel_metrics import (
+    bootstrap_auto_instrumentation,
+    init_otel_metrics,
+    shutdown_otel_metrics,
+)
+
+bootstrap_auto_instrumentation()
+
 import os
 from contextlib import asynccontextmanager
 from pathlib import Path
@@ -38,7 +50,6 @@
 from src.config.environment_variables import EnvVarKeys
 from src.domain.exceptions import GenericException
 from src.utils.logging import make_logger
-from src.utils.otel_metrics import init_otel_metrics, shutdown_otel_metrics

 logger = make_logger(__name__)

agentex/src/utils/otel_metrics.py

Lines changed: 144 additions & 25 deletions
@@ -1,19 +1,45 @@
11
"""
2-
OpenTelemetry metrics configuration for Agentex.
3-
4-
When auto-instrumentation (e.g. OTel Operator) has already installed a global
5-
MeterProvider, custom app metrics attach to it instead of replacing it.
6-
Otherwise this module creates its own provider with OTLP export when an endpoint
7-
is configured.
8-
9-
Environment Variables:
10-
OTEL_EXPORTER_OTLP_METRICS_ENDPOINT: Metrics OTLP endpoint (falls back to
2+
OpenTelemetry bootstrap and custom metrics for Agentex.
3+
4+
Two responsibilities:
5+
6+
1. **Auto-instrumentation** — call ``bootstrap_auto_instrumentation()`` from
7+
``app.py`` before importing FastAPI or other auto-instrumented libraries so
8+
``initialize()`` runs in each uvicorn spawn worker when contrib packages
9+
are installed.
10+
11+
2. **Custom app metrics** — ``init_otel_metrics()`` registers Agentex instruments
12+
(``auth_cache_*``, ``db_*``, etc.). Attaches to an existing global
13+
``MeterProvider`` from bootstrap/operator when present; otherwise creates a
14+
standalone OTLP pipeline when an endpoint is configured.
15+
16+
**Datadog ``ddtrace-run`` coexistence:** Neither OTel nor ddtrace detects the other's
17+
FastAPI patches. If both run in one process, ddtrace wraps the middleware stack
18+
first; OTel skips ``OpenTelemetryMiddleware`` with "unexpected middleware stack"
19+
and HTTP OTel metrics/traces are not emitted. Helm avoids this by using
20+
``ddtrace-run`` only when ``datadog.env`` is set (OTel-only otherwise). If both
21+
are required, set ``DD_TRACE_FASTAPI_ENABLED=false`` and
22+
``DD_TRACE_STARLETTE_ENABLED=false`` so OTel owns HTTP instrumentation.
23+
24+
**Per-worker ``service.instance.id``:** Uvicorn spawn workers share pod-level
25+
``OTEL_RESOURCE_ATTRIBUTES``, so auto-instrumentation would otherwise emit all
26+
workers on the same metric timeseries (see `OTel #4390
27+
<https://github.com/open-telemetry/opentelemetry-python/issues/4390>`_).
28+
``bootstrap_auto_instrumentation()`` appends ``.<pid>`` to ``service.instance.id``
29+
in ``OTEL_RESOURCE_ATTRIBUTES`` before ``initialize()``; standalone
30+
``init_otel_metrics()`` applies the same via ``Resource.merge``. With
31+
``--workers 1``, operator ``sitecustomize`` may have already called
32+
``initialize()``; bootstrap calls it again (OTel providers and instrumentors
33+
are set-once; duplicate calls only produce log warnings).
34+
35+
Environment variables (custom metrics / standalone mode):
36+
OTEL_EXPORTER_OTLP_METRICS_ENDPOINT: Metrics endpoint (falls back to
1137
OTEL_EXPORTER_OTLP_ENDPOINT)
12-
OTEL_EXPORTER_OTLP_METRICS_PROTOCOL: Metrics export protocol (falls back to
38+
OTEL_EXPORTER_OTLP_METRICS_PROTOCOL: Export protocol (falls back to
1339
OTEL_EXPORTER_OTLP_PROTOCOL; default: grpc)
1440
OTEL_EXPORTER_OTLP_ENDPOINT: General OTLP endpoint URL
15-
OTEL_EXPORTER_OTLP_HEADERS: Optional headers for authentication
16-
OTEL_SERVICE_NAME: Service name for metrics (default: agentex)
41+
OTEL_EXPORTER_OTLP_HEADERS: Passed through by OTLP exporters when set
42+
OTEL_SERVICE_NAME: Service name (default: agentex)
1743
OTEL_METRICS_EXPORT_INTERVAL_MS: Export interval in ms (default: 30000)
1844
"""
1945

@@ -23,7 +49,6 @@
2349
from typing import TYPE_CHECKING
2450

2551
from opentelemetry import metrics
26-
from opentelemetry.metrics import NoOpMeterProvider
2752
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import (
2853
OTLPMetricExporter as OTLPGrpcMetricExporter,
2954
)
@@ -32,7 +57,13 @@
3257
)
3358
from opentelemetry.sdk.metrics import MeterProvider
3459
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
35-
from opentelemetry.sdk.resources import SERVICE_NAME, SERVICE_VERSION, Resource
60+
from opentelemetry.sdk.resources import (
61+
SERVICE_NAME,
62+
SERVICE_VERSION,
63+
OTELResourceDetector,
64+
Resource,
65+
get_aggregated_resources,
66+
)
3667

3768
from src.utils.logging import make_logger
3869

@@ -42,15 +73,102 @@
4273

4374
logger = make_logger(__name__)
4475

45-
# Global state
46-
_meter_provider: MeterProvider | None = None # Set only when this module creates the provider
76+
# Module state
77+
_auto_instrumentation_bootstrapped = False
78+
_meter_provider: MeterProvider | None = None # Set only when this module creates the provider
4779
_initialized: bool = False
4880

49-
# Default configuration
5081
DEFAULT_SERVICE_NAME = "agentex"
5182
DEFAULT_EXPORT_INTERVAL_MS = 30000 # 30 seconds
5283

5384

85+
def _detected_resource() -> Resource:
86+
"""Resource attributes from OTEL_* env (operator-injected or local)."""
87+
return get_aggregated_resources([OTELResourceDetector()])
88+
89+
90+
def _unique_instance_id(resource: Resource) -> str:
91+
"""Worker-unique service.instance.id (OTel #4390)."""
92+
pid = os.getpid()
93+
existing = resource.attributes.get("service.instance.id")
94+
if existing:
95+
existing = str(existing)
96+
suffix = f".{pid}"
97+
return existing if existing.endswith(suffix) else f"{existing}{suffix}"
98+
service = (
99+
resource.attributes.get("service.name")
100+
or os.environ.get("OTEL_SERVICE_NAME")
101+
or "unknown"
102+
)
103+
pod = resource.attributes.get("k8s.pod.name") or "unknown"
104+
return f"{service}.{pod}.{pid}"
105+
106+
107+
def _sync_instance_id_to_env(instance_id: str) -> None:
108+
"""Write service.instance.id into OTEL_RESOURCE_ATTRIBUTES for auto-instrumentation."""
109+
key = "service.instance.id"
110+
parts = [
111+
part.strip()
112+
for part in os.environ.get("OTEL_RESOURCE_ATTRIBUTES", "").split(",")
113+
if part.strip() and not part.strip().startswith(f"{key}=")
114+
]
115+
parts.append(f"{key}={instance_id}")
116+
os.environ["OTEL_RESOURCE_ATTRIBUTES"] = ",".join(parts)
117+
118+
119+
# --- Auto-instrumentation bootstrap ---
120+
121+
122+
def bootstrap_auto_instrumentation() -> bool:
123+
"""Call ``initialize()`` once per process when auto-instrumentation is available.
124+
125+
Call from ``app.py`` before any auto-instrumented library (FastAPI, httpx,
126+
SQLAlchemy, etc.) — instrumentors patch at import time. Each uvicorn spawn
127+
worker imports ``app.py`` fresh, so one call per worker is enough.
128+
129+
Runs when: contrib packages are installed (no ``ImportError``).
130+
Skips when: bootstrap already succeeded in this process.
131+
On ``ImportError`` or ``initialize()`` failure, returns False and leaves
132+
the flag unset so a later call can retry.
133+
134+
Export config, ``OTEL_SDK_DISABLED``, and disabled instrumentations are
135+
handled inside ``initialize()`` — not gated here. Custom app metrics use
136+
``init_otel_metrics()`` separately.
137+
138+
Returns:
139+
True if ``initialize()`` completed; False if skipped or failed.
140+
"""
141+
global _auto_instrumentation_bootstrapped
142+
143+
if _auto_instrumentation_bootstrapped:
144+
return False
145+
146+
try:
147+
from opentelemetry.instrumentation.auto_instrumentation import initialize
148+
except ImportError:
149+
return False
150+
151+
try:
152+
_sync_instance_id_to_env(_unique_instance_id(_detected_resource()))
153+
initialize()
154+
except Exception:
155+
logger.warning(
156+
"OpenTelemetry auto-instrumentation bootstrap failed; continuing without it",
157+
exc_info=True,
158+
)
159+
return False
160+
161+
_auto_instrumentation_bootstrapped = True
162+
logger.debug(
163+
"OpenTelemetry auto-instrumentation bootstrapped (pid=%s)",
164+
os.getpid(),
165+
)
166+
return True
167+
168+
169+
# --- Custom application metrics ---
170+
171+
54172
def _global_meter_provider() -> MeterProvider | None:
55173
"""Return the global MeterProvider if installed, else None (proxy is ignored)."""
56174
provider = metrics.get_meter_provider()
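
The pid-suffix rule in `_unique_instance_id` can be exercised in isolation. The following is a simplified sketch, not the module's code: it takes a plain `dict` instead of a `Resource`, accepts the pid as a parameter so the result is deterministic, and omits the `OTEL_SERVICE_NAME` fallback. The name `unique_instance_id` and the sample values are illustrative.

```python
import os
from typing import Optional


def unique_instance_id(attrs: dict, pid: Optional[int] = None) -> str:
    """Return a worker-unique service.instance.id from resource attributes."""
    pid = os.getpid() if pid is None else pid
    existing = attrs.get("service.instance.id")
    if existing:
        suffix = f".{pid}"
        # Idempotent: an id that already ends with this pid is left alone.
        return existing if existing.endswith(suffix) else f"{existing}{suffix}"
    # No id supplied: fall back to "<service>.<pod>.<pid>".
    service = attrs.get("service.name") or "unknown"
    pod = attrs.get("k8s.pod.name") or "unknown"
    return f"{service}.{pod}.{pid}"


suffixed = unique_instance_id({"service.instance.id": "pod-1"}, pid=42)
repeated = unique_instance_id({"service.instance.id": suffixed}, pid=42)
fallback = unique_instance_id({"service.name": "agentex", "k8s.pod.name": "p"}, pid=7)
empty = unique_instance_id({}, pid=7)

print(suffixed, repeated, fallback, empty)
```

The idempotency branch matters because the bootstrap writes the suffixed id back into the environment, so a later call in the same process reads an already-suffixed value.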
@@ -152,6 +270,10 @@ def init_otel_metrics(
152270
"deployment.environment": environment
153271
or os.environ.get("ENVIRONMENT", "development"),
154272
}
273+
).merge(
274+
Resource.create(
275+
{"service.instance.id": _unique_instance_id(_detected_resource())}
276+
)
155277
)
156278
reader = PeriodicExportingMetricReader(
157279
exporter=_create_metric_exporter(endpoint, protocol),
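
The merge-precedence concern raised in the review for this hunk can be modelled with plain dictionaries. This is a toy sketch, assuming only what the review states: on key conflicts the `other` argument of a merge wins, and a resource created for the pid also carries env-detected attributes. The `merge` helper and all attribute values are hypothetical; no OTel SDK call is made.

```python
def merge(base: dict, other: dict) -> dict:
    """Toy model of the merge described in the review: `other` wins on conflicts."""
    merged = dict(base)
    merged.update(other)
    return merged


explicit = {"service.name": "custom"}  # keyword args from the caller
env_detected = {"service.name": "env-name", "k8s.pod.name": "pod-1"}
pid_only = {"service.instance.id": "pod-1.42"}

# Flagged pattern: the pid resource also carries every env attribute,
# so merging it last overrides the caller's explicit service.name.
flagged = merge(explicit, merge(env_detected, pid_only))

# One possible unambiguous ordering: env first, explicit args next,
# then only the single pid attribute.
unambiguous = merge(merge(env_detected, explicit), pid_only)

print(flagged["service.name"], unambiguous["service.name"])
```

Which ordering is right for Agentex is a design decision the review leaves open; the sketch only shows that the order of merges decides who wins.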
@@ -167,9 +289,11 @@ def init_otel_metrics(
167289
_meter_provider = provider
168290
_initialized = True
169291
logger.info(
170-
f"OpenTelemetry metrics initialized: endpoint={endpoint}, "
171-
f"protocol={protocol}, service={resolved_service_name}, "
172-
f"interval={resolved_export_interval_ms}ms"
292+
"OpenTelemetry metrics initialized: endpoint=%s, protocol=%s, service=%s, interval=%sms",
293+
endpoint,
294+
protocol,
295+
resolved_service_name,
296+
resolved_export_interval_ms,
173297
)
174298
return _meter_provider
175299

@@ -209,11 +333,6 @@ def shutdown_otel_metrics() -> None:
209333
except Exception:
210334
logger.exception("OpenTelemetry metrics shutdown failed")
211335
finally:
212-
if _meter_provider is not None:
213-
try:
214-
metrics.set_meter_provider(NoOpMeterProvider())
215-
except Exception:
216-
logger.exception("Failed to reset global MeterProvider after shutdown")
217336
_meter_provider = None
218337
_initialized = False
219338
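
The removal of the `NoOpMeterProvider` reset follows from the set-once behaviour the commit notes describe. A toy model of that behaviour, assuming only that the first registration sticks and later ones are ignored with a warning: `SetOnce` is a hypothetical class for illustration and is not the OTel API.

```python
import logging

logger = logging.getLogger("set_once_demo")


class SetOnce:
    """Toy set-once holder: the first value sticks, later sets are ignored."""

    def __init__(self) -> None:
        self._value = None
        self._is_set = False

    def set(self, value) -> bool:
        if self._is_set:
            # Mirrors the "only produces warnings" behaviour from the notes.
            logger.warning("Overriding the current value is not allowed")
            return False
        self._value = value
        self._is_set = True
        return True

    def get(self):
        return self._value


holder = SetOnce()
first = holder.set("sdk-provider")    # accepted
second = holder.set("noop-provider")  # ignored: a reset on shutdown has no effect

print(first, second, holder.get())
```

Under this model a shutdown-time reset changes nothing, so dropping it only removes a misleading code path.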
