fix(otel): custom metrics coexistence, FastAPI HTTP metrics, and export fixes#302
Draft
james-cardenas wants to merge 7 commits into
Draft
fix(otel): custom metrics coexistence, FastAPI HTTP metrics, and export fixes#302james-cardenas wants to merge 7 commits into
james-cardenas wants to merge 7 commits into
Conversation
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
Comment on lines
+345
to
+348
| def _http_metrics_enabled() -> bool: | ||
| """Return whether in-process FastAPI HTTP metrics should be installed.""" | ||
| flag = os.environ.get("AGENTEX_OTEL_HTTP_METRICS_ENABLED", "false").strip().lower() | ||
| return flag not in {"0", "false", "no", "off"} |
There was a problem hiding this comment.
Deny-list silently enables opt-in feature on unexpected values
Any value not in {"0", "false", "no", "off"} — including an empty string — returns True and enables HTTP metrics. In Kubernetes it is common to inherit env vars from a ConfigMap where a key exists but its value is empty (e.g. AGENTEX_OTEL_HTTP_METRICS_ENABLED: ""). That would silently enable in-process HTTP instrumentation and trigger double-instrumentation with the OTel Operator, producing duplicate http_server_* timeseries and potential cardinality issues.
An allow-list approach is safer for an opt-in flag: return flag in {"1", "true", "yes", "on"}.
Suggested change
| def _http_metrics_enabled() -> bool: | |
| """Return whether in-process FastAPI HTTP metrics should be installed.""" | |
| flag = os.environ.get("AGENTEX_OTEL_HTTP_METRICS_ENABLED", "false").strip().lower() | |
| return flag not in {"0", "false", "no", "off"} | |
| def _http_metrics_enabled() -> bool: | |
| """Return whether in-process FastAPI HTTP metrics should be installed.""" | |
| flag = os.environ.get("AGENTEX_OTEL_HTTP_METRICS_ENABLED", "false").strip().lower() | |
| return flag in {"1", "true", "yes", "on"} |
Prompt To Fix With AI
This is a comment left during a code review.
Path: agentex/src/utils/otel_metrics.py
Line: 345-348
Comment:
**Deny-list silently enables opt-in feature on unexpected values**
Any value not in `{"0", "false", "no", "off"}` — including an empty string — returns `True` and enables HTTP metrics. In Kubernetes it is common to inherit env vars from a `ConfigMap` where a key exists but its value is empty (e.g. `AGENTEX_OTEL_HTTP_METRICS_ENABLED: ""`). That would silently enable in-process HTTP instrumentation and trigger double-instrumentation with the OTel Operator, producing duplicate `http_server_*` timeseries and potential cardinality issues.
An allow-list approach is safer for an opt-in flag: `return flag in {"1", "true", "yes", "on"}`.
```suggestion
def _http_metrics_enabled() -> bool:
"""Return whether in-process FastAPI HTTP metrics should be installed."""
flag = os.environ.get("AGENTEX_OTEL_HTTP_METRICS_ENABLED", "false").strip().lower()
return flag in {"1", "true", "yes", "on"}
```
How can I resolve this? If you propose a fix, please make it concise.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Extracts the OTel fixes from #282 into a clean branch on current
main, without the grant-perf changes or the reverted grant removal.auth_cache.*,db.client.*) attach to the OTel Operator's existing provider instead of replacing it (fixes missing auto-instrumentation HTTP metrics).http_server_*metrics when the operator auto-instrumentation path is unavailable, instrumented at import time.OTELResourceDetectorfor custom metrics resource attributes; append PID toservice.instance.idon standaloneMeterProvider.Commits (cherry-picked from #282)
Excluded from #282: grant perf removal/revert commits and the duplicate coexistence commit already on main (#287).
Test plan
uv run pytest tests/unit/utils/test_otel_metrics.py— 35 passedhttp_server_request_duration_seconds+ custom metrics in GrafanaMade with Cursor
Greptile Summary
This PR extracts OTel observability fixes from #282 into a clean branch. It addresses three related problems: custom app metrics now attach to an existing OTel Operator
MeterProviderwithout replacing it; in-process FastAPI HTTP metrics instrumentation is available as an opt-in feature and is applied at module-import time (before Starlette caches its middleware stack); and OTLP histogram export is switched to cumulative temporality for correct Mimirrate()/increase()queries.init_otel_metricschecks for an existing SDKMeterProviderfirst; if absent and an OTLP endpoint is configured it installs one, gracefully handling a post-set_meter_providerrace by shutting down the orphan.configure_app_metrics(fastapi_app)is now called at module-import time (replacing the lifespan call), with idempotency checks and a guard against double-instrumentation by the OTel Operator.OTELResourceDetectorreplaces hand-codedResource.create(...), and a per-process PID suffix is appended toservice.instance.idto disambiguate uvicorn workers.Confidence Score: 4/5
Safe to merge for the coexistence, standalone, and temporality changes; the FastAPI opt-in gate has a logic gap worth addressing before production rollout with AGENTEX_OTEL_HTTP_METRICS_ENABLED enabled.
The coexistence/standalone provider logic, orphan-shutdown race handling, and cumulative temporality fix are well-tested and sound. The deny-list in
_http_metrics_enabledmeans an empty env-var value (e.g. a blank ConfigMap entry) would silently enable in-process HTTP instrumentation and cause double-instrumentation with the OTel Operator. Theservice_name/service_version/environmentparams are now dead in the resource-building path, which creates API confusion but no immediate runtime breakage since no current caller passes them.agentex/src/utils/otel_metrics.py — specifically
_http_metrics_enabled(deny-list) and the now-unusedinit_otel_metricskeyword arguments.Important Files Changed
_http_metrics_enabledcan unexpectedly enable the opt-in feature (e.g. on empty env var);service_name/service_version/environmentparams are no longer reflected in the resource but remain in the signature.init_otel_metrics()call from lifespan to module-import time viaconfigure_app_metrics(fastapi_app), correctly ensuring FastAPI middleware is instrumented before the ASGI server's first message._METER_PROVIDER_SET_ONCEfor proper inter-test isolation of the SDK proxy state.opentelemetry-instrumentation-fastapi>=0.49b0as a direct dependency for in-process HTTP metrics.Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[configure_app_metrics called at module import] --> B[init_otel_metrics] B --> C{SDK MeterProvider already global?} C -->|Yes| D[coexistence mode: attach via get_meter] C -->|No| E{OTLP endpoint configured?} E -->|No| F[disabled mode: return None] E -->|Yes| G[Build resource via OTELResourceDetector + PID] G --> H[Create MeterProvider with cumulative temporality] H --> I[set_meter_provider] I --> J{Global now points to our provider?} J -->|Yes| K[standalone mode: _meter_provider = provider] J -->|No - race condition| L[orphan shutdown: use existing global] A --> M{AGENTEX_OTEL_HTTP_METRICS_ENABLED?} M -->|false - default| N[skip FastAPI instrumentation] M -->|true| O{App already instrumented?} O -->|Yes| P[skip - idempotent] O -->|No| Q[FastAPIInstrumentor.instrument_app] Q --> R[http.server.request.duration metrics recorded]Comments Outside Diff (1)
agentex/src/utils/otel_metrics.py, line 194-200 (link)service_name,service_version, andenvironmentargs are now dead parametersSince the resource is built entirely by
_build_resource()(which reads fromOTELResourceDetector/ env vars), theservice_name,service_version, andenvironmentarguments are no longer reflected in the metrics resource.resolved_service_nameonly appears in the log string, not in theResource. A caller who passesinit_otel_metrics(service_name="my-svc")will see "my-svc" in logs but not in Prometheus labels — a subtle mismatch. Either wire these args into_build_resource()or remove them from the signature (with a deprecation note if they're part of a public API).Prompt To Fix With AI
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix All With AI
Reviews (1): Last reviewed commit: "Append PID to service.instance.id on sta..." | Re-trigger Greptile