perf(agentex): drop redundant owner-edge grant on task creation by NiteshDhanpal · Pull Request #282 · scaleapi/scale-agentex

NiteshDhanpal · 2026-06-08T06:01:02Z

Summary

_get_or_create_task created a task and then called grant_with_retry(task) — but create_task already registers the task as owner via register_resource(task, parent=agent). Both calls write the same owner grant, so every new task issued two identical authorization writes to egp-api-backend /private/v5/agentex/permissions.

This PR removes the duplicate (and the now-dead grant_with_retry method + its sole-use imports). No behavior change on either backend.

Why it's redundant (both backends verified)

Backend	`register_resource` (in `create_task`)	`grant_with_retry`	Equivalent?
SGP (legacy)	`register_resource` delegates to `grant(permission="owner")`	`grant()` also POSTs `permission="owner"`	✅ byte-identical upsert
Spark (SpiceDB)	confers the OWNER role + sets the parent edge	`grant(create)` → `_RESOURCE_ROLE_MAP[create] = OWNER` → grants OWNER (no parent)	✅ same role, fully subsumed

authorization_service.grant hardcodes operation=create, which maps to OWNER on Spark and to permission="owner" on SGP — exactly what register_resource already established. register_resource additionally sets the parent edge that grant_with_retry omitted, so the registration path is strictly the more complete of the two.

Impact

Observed on staging-infra: a load test drove ~433k POST /private/v5/agentex/permissions writes at ~595ms avg / ~2.1s p95, roughly 2× the new-task count because of this duplicate. Removing it:

Halves new-task authorization write volume on the hot path.
Removes a retrying round-trip (grant_with_retry retried up to 3× with backoff) from the message/send critical path, helping tail latency.

Changes

Remove await self.grant_with_retry(task) from _get_or_create_task (create branch).
Remove the now-dead grant_with_retry method (only call site was the line above; line 252 was its own recursive retry).
Remove the 4 imports it solely used: asyncio, random, AuthenticationServiceUnavailableError, AgentexResource.

Net: 1 file, 33 deletions, 0 additions.

Testing

ruff check — clean
tests/unit/services/test_task_service.py + tests/unit/use_cases/test_agents_acp_use_case.py — 54 passed
tests/integration/use_cases/test_task_authz_dual_write.py (directly exercises the authz registration path) — 5 passed

Forward-looking note

The SGP↔Spark equivalence relies on authorization_service.grant hardcoding operation=create (→ OWNER). If a future change makes grant_with_retry confer a narrower operation, re-evaluate. The _RESOURCE_ROLE_MAP comment ("Pending Port redesign that takes a role directly") flags this area as known-lossy, which is extra reason to remove the duplicate now.

🤖 Generated with Claude Code

Greptile Summary

Removes the grant_with_retry call (and the method itself) from _get_or_create_task in AgentsACPUseCase, because task_service.create_task already calls authorization_service.register_resource(task, parent=agent) — a strictly more complete write that subsumes the duplicate owner grant.

Redundant write removed: register_resource in create_task establishes both the OWNER role and the parent edge; the removed grant_with_retry only posted the OWNER role with no parent, so the retained path is the more complete of the two.
Dead code cleanup: the grant_with_retry method, its exponential-backoff retry loop, and 4 now-unused imports (asyncio, random, AuthenticationServiceUnavailableError, AgentexResource) are all deleted — net 33-line reduction with no logic change.

Confidence Score: 5/5

Safe to merge — the removed code was confirmed dead at both SGP and Spark backends, and the retained register_resource call in create_task is the more complete write.

The change is a pure deletion: create_task already performs register_resource(task, parent=agent) which grants OWNER and sets the parent edge. The removed grant_with_retry only re-posted OWNER without a parent — fully subsumed. No references to grant_with_retry remain anywhere in the codebase, the 4 removed imports are genuinely unused, and the PR description provides load-test evidence confirming the duplicate writes.

No files require special attention.

Important Files Changed

Filename	Overview
agentex/src/domain/use_cases/agents_acp_use_case.py	Removes the redundant grant_with_retry method and its sole call-site; task_service.create_task already performs the equivalent (and more complete) register_resource write, so no authorization behavior is altered.

Sequence Diagram

sequenceDiagram
    participant UC as AgentsACPUseCase
    participant TS as TaskService
    participant AS as AuthorizationService

    Note over UC,AS: Before this PR
    UC->>TS: create_task(agent, ...)
    TS->>AS: "register_resource(task, parent=agent) OWNER + parent edge"
    TS-->>UC: task
    UC->>AS: grant_with_retry → grant(task) duplicate OWNER (no parent)

    Note over UC,AS: After this PR
    UC->>TS: create_task(agent, ...)
    TS->>AS: "register_resource(task, parent=agent) OWNER + parent edge"
    TS-->>UC: task
    Note right of UC: grant_with_retry removed — no duplicate write

_{Reviews (1): Last reviewed commit: "perf(agentex): drop redundant owner-edge..." | Re-trigger Greptile}

_get_or_create_task created a task and then called grant_with_retry(task), but create_task already registers the task as owner via register_resource(task, parent=agent). On both backends those write the same owner grant: - SGP: register_resource delegates to grant(permission="owner"); grant_with_retry also POSTs permission="owner" — byte-identical upsert to egp-api-backend /private/v5/agentex/permissions. - Spark: register_resource confers the OWNER role (and the parent edge); grant_with_retry maps create -> OWNER (_RESOURCE_ROLE_MAP), the same role with no parent — fully subsumed by register_resource. So every new task issued two identical authorization writes. Dropping the duplicate halves new-task authz write volume and removes a retrying round-trip from the message/send critical path. Also removes the now-dead grant_with_retry method and its sole-use imports (asyncio, random, AuthenticationServiceUnavailableError, AgentexResource). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ation (#287) - Fixes custom app metrics (`auth_cache.*`, `db.client.*`) coexisting with OTel Operator Python auto-instrumentation instead of replacing its global `MeterProvider`. - When a real SDK `MeterProvider` is already installed (operator), `init_otel_metrics()` attaches custom instruments to it and does not call `set_meter_provider()` or shut it down on exit. - In standalone mode (local dev, no operator), the module still creates its own provider when an OTLP endpoint is configured, with protocol-aware export (`grpc` vs `http/protobuf`) via `OTEL_EXPORTER_OTLP_METRICS_PROTOCOL` / `OTEL_EXPORTER_OTLP_PROTOCOL`. - Relates to: https://linear.app/scale-epd/issue/SGPINF-1718/execute-rocket-load-test-and-generate-report-for-10x-traffic The previous implementation always called `metrics.set_meter_provider()` with a gRPC exporter, which replaced the operator's HTTP/protobuf provider and dropped auto-instrumentation HTTP metrics (`http_server_request_duration_seconds`, etc.). `get_meter()` also returned `None` under operator injection because it only checked a module-local provider. - Detect an existing real `MeterProvider` via `_global_meter_provider()` and use it in shared mode. - Track `_meter_provider` only when this module creates the provider (for shutdown). - `get_meter()` works when either we created a provider or a global one is already installed. - `shutdown_otel_metrics()` only shuts down a provider this module created.  <h3>Greptile Summary</h3> This PR fixes the conflict between custom app metrics (`auth_cache.*`, `db.client.*`) and the OTel Operator's Python auto-instrumentation by detecting an already-installed SDK `MeterProvider` and attaching to it rather than replacing it. In standalone mode the module still creates its own provider with protocol-aware export (`grpc` or `http/protobuf`), now with proper exception handling that prevents `PeriodicExportingMetricReader` thread leaks and ensures a `NoOpMeterProvider` is installed on shutdown to prevent a stale dead provider from being mistaken for a live operator provider on re-init. - **Shared mode**: `_global_meter_provider()` detects the operator's provider; `init_otel_metrics()` attaches to it with no `set_meter_provider()` call, and `shutdown_otel_metrics()` leaves it running. - **Standalone mode**: `set_meter_provider()` is wrapped in `try/except` that shuts down the newly-created provider on failure, and the `finally` block installs `NoOpMeterProvider()` to clear the global slot so re-init works correctly. - **Tests**: 11 new unit tests cover both modes, protocol selection, failure/retry paths, and cache-instrument attribute preservation, with private-SDK state injection wrapped in `pytest.skip` on `AttributeError`. <details><summary><h3>Confidence Score: 5/5</h3></summary> Safe to merge — the coexistence logic is correct, previously identified flag-ordering and thread-leak bugs are fixed, and the shutdown path now clears the global OTel slot to prevent stale-provider misdetection on re-init. The state-machine transitions across shared/standalone/disabled modes are all covered by tests, the try/finally exception handling in both init_otel_metrics and shutdown_otel_metrics is correct, and the invariant that _meter_provider is only set when this module owns the global slot is maintained throughout. No files require special attention. The one test isolation gap (cache_metrics state not restored in teardown) is latent and has no impact on the current test suite. </details> <h3>Important Files Changed</h3> | Filename | Overview | |----------|----------| | agentex/src/utils/otel_metrics.py | Core fix: detects existing SDK MeterProvider via _global_meter_provider(), attaches in shared mode without calling set_meter_provider(), adds try/except around set_meter_provider() to shut down leaked PeriodicExportingMetricReader threads, and installs NoOpMeterProvider in finally-block to clear the global slot on standalone shutdown. Previously flagged ordering bugs are corrected. | | agentex/tests/unit/utils/test_otel_metrics.py | New test file with 11 unit tests covering shared mode coexistence, standalone mode lifecycle, protocol selection, retry-after-failure, and cache instrument attribute preservation. The test_custom_metrics_preserve_instrument_attributes_in_shared_mode test mutates cache_metrics module state at setup but has no corresponding teardown, leaving _instruments_initialized=True and counters bound to the test provider after the test completes. | </details> <a href="https://app.greptile.com/api/ide/cursor?prompt=Fix%20the%20following%201%20code%20review%20issue.%20Work%20through%20them%20one%20at%20a%20time%2C%20proposing%20concise%20fixes.%0A%0A---%0A%0A%23%23%23%20Issue%201%20of%201%0Aagentex%2Ftests%2Funit%2Futils%2Ftest_otel_metrics.py%3A192-197%0AThe%20test%20resets%20%60cache_metrics%60%20module%20state%20before%20the%20test%20body%20but%20leaves%20%60_instruments_initialized%3DTrue%60%20and%20the%20counters%20bound%20to%20the%20test-local%20%60operator_provider%60%20after%20the%20test%20completes.%20The%20autouse%20fixture%20restores%20the%20OTel%20global%20slot%20but%20does%20not%20touch%20%60cache_metrics%60.%20Any%20future%20test%20that%20calls%20%60record_cache_access%60%2F%60record_cache_eviction%60%20after%20this%20test%20will%20find%20%60_instruments_initialized%3DTrue%60%2C%20skip%20re-initialization%2C%20and%20silently%20emit%20to%20the%20stale%20instruments.%20Restoring%20the%20three%20fields%20in%20teardown%20keeps%20isolation%20complete.%0A%0A&pr=287&platform=github"><picture><source media="(prefers-color-scheme: dark)" srcset="https://greptile-static-assets.s3.amazonaws.com/badges/FixAllInCursorDark.svg?v=3"><source media="(prefers-color-scheme: light)" srcset="https://greptile-static-assets.s3.amazonaws.com/badges/FixAllInCursor.svg?v=3"><img alt="Fix All in Cursor" src="https://greptile-static-assets.s3.amazonaws.com/badges/FixAllInCursor.svg?v=3" height="20"></picture></a> <a href="https://app.greptile.com/ide/claude-code?prompt=Fix%20the%20following%201%20code%20review%20issue.%20Work%20through%20them%20one%20at%20a%20time%2C%20proposing%20concise%20fixes.%0A%0A---%0A%0A%23%23%23%20Issue%201%20of%201%0Aagentex%2Ftests%2Funit%2Futils%2Ftest_otel_metrics.py%3A192-197%0AThe%20test%20resets%20%60cache_metrics%60%20module%20state%20before%20the%20test%20body%20but%20leaves%20%60_instruments_initialized%3DTrue%60%20and%20the%20counters%20bound%20to%20the%20test-local%20%60operator_provider%60%20after%20the%20test%20completes.%20The%20autouse%20fixture%20restores%20the%20OTel%20global%20slot%20but%20does%20not%20touch%20%60cache_metrics%60.%20Any%20future%20test%20that%20calls%20%60record_cache_access%60%2F%60record_cache_eviction%60%20after%20this%20test%20will%20find%20%60_instruments_initialized%3DTrue%60%2C%20skip%20re-initialization%2C%20and%20silently%20emit%20to%20the%20stale%20instruments.%20Restoring%20the%20three%20fields%20in%20teardown%20keeps%20isolation%20complete.%0A%0A&repo=scaleapi%2Fscale-agentex&pr=287&platform=github"><picture><source media="(prefers-color-scheme: dark)" srcset="https://greptile-static-assets.s3.amazonaws.com/badges/FixAllInClaudeDark.svg?v=3"><source media="(prefers-color-scheme: light)" srcset="https://greptile-static-assets.s3.amazonaws.com/badges/FixAllInClaude.svg?v=3"><img alt="Fix All in Claude Code" src="https://greptile-static-assets.s3.amazonaws.com/badges/FixAllInClaude.svg?v=3" height="20"></picture></a> <a href="https://app.greptile.com/api/ide/codex?prompt=IMPORTANT%3A%20Work%20in%20the%20repository%20%22scaleapi%2Fscale-agentex%22%20on%20the%20existing%20branch%20%22jamesc-fix-otel-shared-meterprovider%22.%20Checkout%20that%20branch%20%E2%80%94%20do%20NOT%20create%20a%20new%20branch%20or%20open%20a%20new%20PR.%20Push%20your%20changes%20to%20%22jamesc-fix-otel-shared-meterprovider%22.%0A%0AFix%20the%20following%201%20code%20review%20issue.%20Work%20through%20them%20one%20at%20a%20time%2C%20proposing%20concise%20fixes.%0A%0A---%0A%0A%23%23%23%20Issue%201%20of%201%0Aagentex%2Ftests%2Funit%2Futils%2Ftest_otel_metrics.py%3A192-197%0AThe%20test%20resets%20%60cache_metrics%60%20module%20state%20before%20the%20test%20body%20but%20leaves%20%60_instruments_initialized%3DTrue%60%20and%20the%20counters%20bound%20to%20the%20test-local%20%60operator_provider%60%20after%20the%20test%20completes.%20The%20autouse%20fixture%20restores%20the%20OTel%20global%20slot%20but%20does%20not%20touch%20%60cache_metrics%60.%20Any%20future%20test%20that%20calls%20%60record_cache_access%60%2F%60record_cache_eviction%60%20after%20this%20test%20will%20find%20%60_instruments_initialized%3DTrue%60%2C%20skip%20re-initialization%2C%20and%20silently%20emit%20to%20the%20stale%20instruments.%20Restoring%20the%20three%20fields%20in%20teardown%20keeps%20isolation%20complete.%0A%0A&repo=scaleapi%2Fscale-agentex&pr=287&platform=github"><picture><source media="(prefers-color-scheme: dark)" srcset="https://greptile-static-assets.s3.amazonaws.com/badges/FixAllInCodexDark.svg?v=3"><source media="(prefers-color-scheme: light)" srcset="https://greptile-static-assets.s3.amazonaws.com/badges/FixAllInCodex.svg?v=3"><img alt="Fix All in Codex" src="https://greptile-static-assets.s3.amazonaws.com/badges/FixAllInCodex.svg?v=3" height="20"></picture></a> <details><summary>Prompt To Fix All With AI</summary> `````markdown Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes. --- agentex/tests/unit/utils/test_otel_metrics.py:192-197 The test resets `cache_metrics` module state before the test body but leaves `_instruments_initialized=True` and the counters bound to the test-local `operator_provider` after the test completes. The autouse fixture restores the OTel global slot but does not touch `cache_metrics`. Any future test that calls `record_cache_access`/`record_cache_eviction` after this test will find `_instruments_initialized=True`, skip re-initialization, and silently emit to the stale instruments. Restoring the three fields in teardown keeps isolation complete. ````` </details> <sub>Reviews (8): Last reviewed commit: ["Merge branch 'main' into jamesc-fix-otel..."](51a270e) | [Re-trigger Greptile](https://app.greptile.com/api/retrigger?id=36313424)</sub>

Reverts the grant_with_retry removal (8c90deb). With register_resource gated off per-account by fgac-tasks-dual-write (agentex-auth #358) and that flag unregistered in sgp-infra-staging, nothing writes the per-task owner edge — a regression from the pre-FGAC behavior where grant wrote it unconditionally on every task. register_resource alone is not sufficient while dual-write is off, so this restores the ungated grant so task ownership is recorded regardless of the dual-write flag state, until the FGAC dual-write rollout and the per-task-edge dependency question are resolved. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

NiteshDhanpal requested a review from a team as a code owner June 8, 2026 06:01

NiteshDhanpal marked this pull request as draft June 8, 2026 06:01

james-cardenas and others added 8 commits June 8, 2026 16:48

Push metrics logging and auto-instrument fix

9dccc8b

Add opt-in in-process FastAPI OTel HTTP metrics instrumentation

6f5ad12

Fix FastAPI OTel HTTP metrics by instrumenting at import time.

b59b92e

Use OTELResourceDetector for custom metrics resource attributes

b3586c0

Set Cumulative export for Metrics

6a8db54

Append PID to service.instance.id on standalone MeterProvider resource

1f44305

james-cardenas mentioned this pull request Jun 11, 2026

fix(otel): custom metrics coexistence, FastAPI HTTP metrics, and export fixes #302

Draft

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(agentex): drop redundant owner-edge grant on task creation#282

perf(agentex): drop redundant owner-edge grant on task creation#282
NiteshDhanpal wants to merge 9 commits into
mainfrom
perf/agentex-drop-redundant-task-grant

NiteshDhanpal commented Jun 8, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

NiteshDhanpal commented Jun 8, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why it's redundant (both backends verified)

Impact

Changes

Testing

Forward-looking note

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

NiteshDhanpal commented Jun 8, 2026 •

edited by greptile-apps Bot

Loading