perf(agentex): drop redundant owner-edge grant on task creation#282
Draft
NiteshDhanpal wants to merge 9 commits into
Draft
perf(agentex): drop redundant owner-edge grant on task creation#282NiteshDhanpal wants to merge 9 commits into
NiteshDhanpal wants to merge 9 commits into
Conversation
_get_or_create_task created a task and then called grant_with_retry(task), but create_task already registers the task as owner via register_resource(task, parent=agent). On both backends those write the same owner grant: - SGP: register_resource delegates to grant(permission="owner"); grant_with_retry also POSTs permission="owner" — byte-identical upsert to egp-api-backend /private/v5/agentex/permissions. - Spark: register_resource confers the OWNER role (and the parent edge); grant_with_retry maps create -> OWNER (_RESOURCE_ROLE_MAP), the same role with no parent — fully subsumed by register_resource. So every new task issued two identical authorization writes. Dropping the duplicate halves new-task authz write volume and removes a retrying round-trip from the message/send critical path. Also removes the now-dead grant_with_retry method and its sole-use imports (asyncio, random, AuthenticationServiceUnavailableError, AgentexResource). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ation (#287) - Fixes custom app metrics (`auth_cache.*`, `db.client.*`) coexisting with OTel Operator Python auto-instrumentation instead of replacing its global `MeterProvider`. - When a real SDK `MeterProvider` is already installed (operator), `init_otel_metrics()` attaches custom instruments to it and does not call `set_meter_provider()` or shut it down on exit. - In standalone mode (local dev, no operator), the module still creates its own provider when an OTLP endpoint is configured, with protocol-aware export (`grpc` vs `http/protobuf`) via `OTEL_EXPORTER_OTLP_METRICS_PROTOCOL` / `OTEL_EXPORTER_OTLP_PROTOCOL`. - Relates to: https://linear.app/scale-epd/issue/SGPINF-1718/execute-rocket-load-test-and-generate-report-for-10x-traffic The previous implementation always called `metrics.set_meter_provider()` with a gRPC exporter, which replaced the operator's HTTP/protobuf provider and dropped auto-instrumentation HTTP metrics (`http_server_request_duration_seconds`, etc.). `get_meter()` also returned `None` under operator injection because it only checked a module-local provider. - Detect an existing real `MeterProvider` via `_global_meter_provider()` and use it in shared mode. - Track `_meter_provider` only when this module creates the provider (for shutdown). - `get_meter()` works when either we created a provider or a global one is already installed. - `shutdown_otel_metrics()` only shuts down a provider this module created. <!-- greptile_comment --> <h3>Greptile Summary</h3> This PR fixes the conflict between custom app metrics (`auth_cache.*`, `db.client.*`) and the OTel Operator's Python auto-instrumentation by detecting an already-installed SDK `MeterProvider` and attaching to it rather than replacing it. In standalone mode the module still creates its own provider with protocol-aware export (`grpc` or `http/protobuf`), now with proper exception handling that prevents `PeriodicExportingMetricReader` thread leaks and ensures a `NoOpMeterProvider` is installed on shutdown to prevent a stale dead provider from being mistaken for a live operator provider on re-init. - **Shared mode**: `_global_meter_provider()` detects the operator's provider; `init_otel_metrics()` attaches to it with no `set_meter_provider()` call, and `shutdown_otel_metrics()` leaves it running. - **Standalone mode**: `set_meter_provider()` is wrapped in `try/except` that shuts down the newly-created provider on failure, and the `finally` block installs `NoOpMeterProvider()` to clear the global slot so re-init works correctly. - **Tests**: 11 new unit tests cover both modes, protocol selection, failure/retry paths, and cache-instrument attribute preservation, with private-SDK state injection wrapped in `pytest.skip` on `AttributeError`. <details><summary><h3>Confidence Score: 5/5</h3></summary> Safe to merge — the coexistence logic is correct, previously identified flag-ordering and thread-leak bugs are fixed, and the shutdown path now clears the global OTel slot to prevent stale-provider misdetection on re-init. The state-machine transitions across shared/standalone/disabled modes are all covered by tests, the try/finally exception handling in both init_otel_metrics and shutdown_otel_metrics is correct, and the invariant that _meter_provider is only set when this module owns the global slot is maintained throughout. No files require special attention. The one test isolation gap (cache_metrics state not restored in teardown) is latent and has no impact on the current test suite. </details> <h3>Important Files Changed</h3> | Filename | Overview | |----------|----------| | agentex/src/utils/otel_metrics.py | Core fix: detects existing SDK MeterProvider via _global_meter_provider(), attaches in shared mode without calling set_meter_provider(), adds try/except around set_meter_provider() to shut down leaked PeriodicExportingMetricReader threads, and installs NoOpMeterProvider in finally-block to clear the global slot on standalone shutdown. Previously flagged ordering bugs are corrected. | | agentex/tests/unit/utils/test_otel_metrics.py | New test file with 11 unit tests covering shared mode coexistence, standalone mode lifecycle, protocol selection, retry-after-failure, and cache instrument attribute preservation. The test_custom_metrics_preserve_instrument_attributes_in_shared_mode test mutates cache_metrics module state at setup but has no corresponding teardown, leaving _instruments_initialized=True and counters bound to the test provider after the test completes. | </details> <a href="https://app.greptile.com/api/ide/cursor?prompt=Fix%20the%20following%201%20code%20review%20issue.%20Work%20through%20them%20one%20at%20a%20time%2C%20proposing%20concise%20fixes.%0A%0A---%0A%0A%23%23%23%20Issue%201%20of%201%0Aagentex%2Ftests%2Funit%2Futils%2Ftest_otel_metrics.py%3A192-197%0AThe%20test%20resets%20%60cache_metrics%60%20module%20state%20before%20the%20test%20body%20but%20leaves%20%60_instruments_initialized%3DTrue%60%20and%20the%20counters%20bound%20to%20the%20test-local%20%60operator_provider%60%20after%20the%20test%20completes.%20The%20autouse%20fixture%20restores%20the%20OTel%20global%20slot%20but%20does%20not%20touch%20%60cache_metrics%60.%20Any%20future%20test%20that%20calls%20%60record_cache_access%60%2F%60record_cache_eviction%60%20after%20this%20test%20will%20find%20%60_instruments_initialized%3DTrue%60%2C%20skip%20re-initialization%2C%20and%20silently%20emit%20to%20the%20stale%20instruments.%20Restoring%20the%20three%20fields%20in%20teardown%20keeps%20isolation%20complete.%0A%0A&pr=287&platform=github"><picture><source media="(prefers-color-scheme: dark)" srcset="https://greptile-static-assets.s3.amazonaws.com/badges/FixAllInCursorDark.svg?v=3"><source media="(prefers-color-scheme: light)" srcset="https://greptile-static-assets.s3.amazonaws.com/badges/FixAllInCursor.svg?v=3"><img alt="Fix All in Cursor" src="https://greptile-static-assets.s3.amazonaws.com/badges/FixAllInCursor.svg?v=3" height="20"></picture></a> <a href="https://app.greptile.com/ide/claude-code?prompt=Fix%20the%20following%201%20code%20review%20issue.%20Work%20through%20them%20one%20at%20a%20time%2C%20proposing%20concise%20fixes.%0A%0A---%0A%0A%23%23%23%20Issue%201%20of%201%0Aagentex%2Ftests%2Funit%2Futils%2Ftest_otel_metrics.py%3A192-197%0AThe%20test%20resets%20%60cache_metrics%60%20module%20state%20before%20the%20test%20body%20but%20leaves%20%60_instruments_initialized%3DTrue%60%20and%20the%20counters%20bound%20to%20the%20test-local%20%60operator_provider%60%20after%20the%20test%20completes.%20The%20autouse%20fixture%20restores%20the%20OTel%20global%20slot%20but%20does%20not%20touch%20%60cache_metrics%60.%20Any%20future%20test%20that%20calls%20%60record_cache_access%60%2F%60record_cache_eviction%60%20after%20this%20test%20will%20find%20%60_instruments_initialized%3DTrue%60%2C%20skip%20re-initialization%2C%20and%20silently%20emit%20to%20the%20stale%20instruments.%20Restoring%20the%20three%20fields%20in%20teardown%20keeps%20isolation%20complete.%0A%0A&repo=scaleapi%2Fscale-agentex&pr=287&platform=github"><picture><source media="(prefers-color-scheme: dark)" srcset="https://greptile-static-assets.s3.amazonaws.com/badges/FixAllInClaudeDark.svg?v=3"><source media="(prefers-color-scheme: light)" srcset="https://greptile-static-assets.s3.amazonaws.com/badges/FixAllInClaude.svg?v=3"><img alt="Fix All in Claude Code" src="https://greptile-static-assets.s3.amazonaws.com/badges/FixAllInClaude.svg?v=3" height="20"></picture></a> <a href="https://app.greptile.com/api/ide/codex?prompt=IMPORTANT%3A%20Work%20in%20the%20repository%20%22scaleapi%2Fscale-agentex%22%20on%20the%20existing%20branch%20%22jamesc-fix-otel-shared-meterprovider%22.%20Checkout%20that%20branch%20%E2%80%94%20do%20NOT%20create%20a%20new%20branch%20or%20open%20a%20new%20PR.%20Push%20your%20changes%20to%20%22jamesc-fix-otel-shared-meterprovider%22.%0A%0AFix%20the%20following%201%20code%20review%20issue.%20Work%20through%20them%20one%20at%20a%20time%2C%20proposing%20concise%20fixes.%0A%0A---%0A%0A%23%23%23%20Issue%201%20of%201%0Aagentex%2Ftests%2Funit%2Futils%2Ftest_otel_metrics.py%3A192-197%0AThe%20test%20resets%20%60cache_metrics%60%20module%20state%20before%20the%20test%20body%20but%20leaves%20%60_instruments_initialized%3DTrue%60%20and%20the%20counters%20bound%20to%20the%20test-local%20%60operator_provider%60%20after%20the%20test%20completes.%20The%20autouse%20fixture%20restores%20the%20OTel%20global%20slot%20but%20does%20not%20touch%20%60cache_metrics%60.%20Any%20future%20test%20that%20calls%20%60record_cache_access%60%2F%60record_cache_eviction%60%20after%20this%20test%20will%20find%20%60_instruments_initialized%3DTrue%60%2C%20skip%20re-initialization%2C%20and%20silently%20emit%20to%20the%20stale%20instruments.%20Restoring%20the%20three%20fields%20in%20teardown%20keeps%20isolation%20complete.%0A%0A&repo=scaleapi%2Fscale-agentex&pr=287&platform=github"><picture><source media="(prefers-color-scheme: dark)" srcset="https://greptile-static-assets.s3.amazonaws.com/badges/FixAllInCodexDark.svg?v=3"><source media="(prefers-color-scheme: light)" srcset="https://greptile-static-assets.s3.amazonaws.com/badges/FixAllInCodex.svg?v=3"><img alt="Fix All in Codex" src="https://greptile-static-assets.s3.amazonaws.com/badges/FixAllInCodex.svg?v=3" height="20"></picture></a> <details><summary>Prompt To Fix All With AI</summary> `````markdown Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes. --- agentex/tests/unit/utils/test_otel_metrics.py:192-197 The test resets `cache_metrics` module state before the test body but leaves `_instruments_initialized=True` and the counters bound to the test-local `operator_provider` after the test completes. The autouse fixture restores the OTel global slot but does not touch `cache_metrics`. Any future test that calls `record_cache_access`/`record_cache_eviction` after this test will find `_instruments_initialized=True`, skip re-initialization, and silently emit to the stale instruments. Restoring the three fields in teardown keeps isolation complete. ````` </details> <sub>Reviews (8): Last reviewed commit: ["Merge branch 'main' into jamesc-fix-otel..."](51a270e) | [Re-trigger Greptile](https://app.greptile.com/api/retrigger?id=36313424)</sub> <!-- /greptile_comment -->
Reverts the grant_with_retry removal (8c90deb). With register_resource gated off per-account by fgac-tasks-dual-write (agentex-auth #358) and that flag unregistered in sgp-infra-staging, nothing writes the per-task owner edge — a regression from the pre-FGAC behavior where grant wrote it unconditionally on every task. register_resource alone is not sufficient while dual-write is off, so this restores the ungated grant so task ownership is recorded regardless of the dual-write flag state, until the FGAC dual-write rollout and the per-task-edge dependency question are resolved. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
_get_or_create_taskcreated a task and then calledgrant_with_retry(task)— butcreate_taskalready registers the task as owner viaregister_resource(task, parent=agent). Both calls write the same owner grant, so every new task issued two identical authorization writes toegp-api-backend /private/v5/agentex/permissions.This PR removes the duplicate (and the now-dead
grant_with_retrymethod + its sole-use imports). No behavior change on either backend.Why it's redundant (both backends verified)
register_resource(increate_task)grant_with_retryregister_resourcedelegates togrant(permission="owner")grant()also POSTspermission="owner"grant(create)→_RESOURCE_ROLE_MAP[create] = OWNER→ grants OWNER (no parent)authorization_service.granthardcodesoperation=create, which maps toOWNERon Spark and topermission="owner"on SGP — exactly whatregister_resourcealready established.register_resourceadditionally sets the parent edge thatgrant_with_retryomitted, so the registration path is strictly the more complete of the two.Impact
Observed on staging-infra: a load test drove ~433k
POST /private/v5/agentex/permissionswrites at ~595ms avg / ~2.1s p95, roughly 2× the new-task count because of this duplicate. Removing it:grant_with_retryretried up to 3× with backoff) from themessage/sendcritical path, helping tail latency.Changes
await self.grant_with_retry(task)from_get_or_create_task(create branch).grant_with_retrymethod (only call site was the line above; line 252 was its own recursive retry).asyncio,random,AuthenticationServiceUnavailableError,AgentexResource.Net: 1 file, 33 deletions, 0 additions.
Testing
ruff check— cleantests/unit/services/test_task_service.py+tests/unit/use_cases/test_agents_acp_use_case.py— 54 passedtests/integration/use_cases/test_task_authz_dual_write.py(directly exercises the authz registration path) — 5 passedForward-looking note
The SGP↔Spark equivalence relies on
authorization_service.granthardcodingoperation=create(→ OWNER). If a future change makesgrant_with_retryconfer a narrower operation, re-evaluate. The_RESOURCE_ROLE_MAPcomment ("Pending Port redesign that takes a role directly") flags this area as known-lossy, which is extra reason to remove the duplicate now.🤖 Generated with Claude Code
Greptile Summary
Removes the
grant_with_retrycall (and the method itself) from_get_or_create_taskinAgentsACPUseCase, becausetask_service.create_taskalready callsauthorization_service.register_resource(task, parent=agent)— a strictly more complete write that subsumes the duplicate owner grant.register_resourceincreate_taskestablishes both the OWNER role and the parent edge; the removedgrant_with_retryonly posted the OWNER role with no parent, so the retained path is the more complete of the two.grant_with_retrymethod, its exponential-backoff retry loop, and 4 now-unused imports (asyncio,random,AuthenticationServiceUnavailableError,AgentexResource) are all deleted — net 33-line reduction with no logic change.Confidence Score: 5/5
Safe to merge — the removed code was confirmed dead at both SGP and Spark backends, and the retained register_resource call in create_task is the more complete write.
The change is a pure deletion: create_task already performs register_resource(task, parent=agent) which grants OWNER and sets the parent edge. The removed grant_with_retry only re-posted OWNER without a parent — fully subsumed. No references to grant_with_retry remain anywhere in the codebase, the 4 removed imports are genuinely unused, and the PR description provides load-test evidence confirming the duplicate writes.
No files require special attention.
Important Files Changed
Sequence Diagram
sequenceDiagram participant UC as AgentsACPUseCase participant TS as TaskService participant AS as AuthorizationService Note over UC,AS: Before this PR UC->>TS: create_task(agent, ...) TS->>AS: "register_resource(task, parent=agent) OWNER + parent edge" TS-->>UC: task UC->>AS: grant_with_retry → grant(task) duplicate OWNER (no parent) Note over UC,AS: After this PR UC->>TS: create_task(agent, ...) TS->>AS: "register_resource(task, parent=agent) OWNER + parent edge" TS-->>UC: task Note right of UC: grant_with_retry removed — no duplicate writeReviews (1): Last reviewed commit: "perf(agentex): drop redundant owner-edge..." | Re-trigger Greptile