Skip to content

perf(agentex): drop redundant owner-edge grant on task creation#282

Draft
NiteshDhanpal wants to merge 9 commits into
mainfrom
perf/agentex-drop-redundant-task-grant
Draft

perf(agentex): drop redundant owner-edge grant on task creation#282
NiteshDhanpal wants to merge 9 commits into
mainfrom
perf/agentex-drop-redundant-task-grant

Conversation

@NiteshDhanpal

@NiteshDhanpal NiteshDhanpal commented Jun 8, 2026

Copy link
Copy Markdown

Summary

_get_or_create_task created a task and then called grant_with_retry(task) — but create_task already registers the task as owner via register_resource(task, parent=agent). Both calls write the same owner grant, so every new task issued two identical authorization writes to egp-api-backend /private/v5/agentex/permissions.

This PR removes the duplicate (and the now-dead grant_with_retry method + its sole-use imports). No behavior change on either backend.

Why it's redundant (both backends verified)

Backend register_resource (in create_task) grant_with_retry Equivalent?
SGP (legacy) register_resource delegates to grant(permission="owner") grant() also POSTs permission="owner" ✅ byte-identical upsert
Spark (SpiceDB) confers the OWNER role + sets the parent edge grant(create)_RESOURCE_ROLE_MAP[create] = OWNER → grants OWNER (no parent) ✅ same role, fully subsumed

authorization_service.grant hardcodes operation=create, which maps to OWNER on Spark and to permission="owner" on SGP — exactly what register_resource already established. register_resource additionally sets the parent edge that grant_with_retry omitted, so the registration path is strictly the more complete of the two.

Impact

Observed on staging-infra: a load test drove ~433k POST /private/v5/agentex/permissions writes at ~595ms avg / ~2.1s p95, roughly 2× the new-task count because of this duplicate. Removing it:

  • Halves new-task authorization write volume on the hot path.
  • Removes a retrying round-trip (grant_with_retry retried up to 3× with backoff) from the message/send critical path, helping tail latency.

Changes

  • Remove await self.grant_with_retry(task) from _get_or_create_task (create branch).
  • Remove the now-dead grant_with_retry method (only call site was the line above; line 252 was its own recursive retry).
  • Remove the 4 imports it solely used: asyncio, random, AuthenticationServiceUnavailableError, AgentexResource.

Net: 1 file, 33 deletions, 0 additions.

Testing

  • ruff check — clean
  • tests/unit/services/test_task_service.py + tests/unit/use_cases/test_agents_acp_use_case.py54 passed
  • tests/integration/use_cases/test_task_authz_dual_write.py (directly exercises the authz registration path) — 5 passed

Forward-looking note

The SGP↔Spark equivalence relies on authorization_service.grant hardcoding operation=create (→ OWNER). If a future change makes grant_with_retry confer a narrower operation, re-evaluate. The _RESOURCE_ROLE_MAP comment ("Pending Port redesign that takes a role directly") flags this area as known-lossy, which is extra reason to remove the duplicate now.

🤖 Generated with Claude Code

Greptile Summary

Removes the grant_with_retry call (and the method itself) from _get_or_create_task in AgentsACPUseCase, because task_service.create_task already calls authorization_service.register_resource(task, parent=agent) — a strictly more complete write that subsumes the duplicate owner grant.

  • Redundant write removed: register_resource in create_task establishes both the OWNER role and the parent edge; the removed grant_with_retry only posted the OWNER role with no parent, so the retained path is the more complete of the two.
  • Dead code cleanup: the grant_with_retry method, its exponential-backoff retry loop, and 4 now-unused imports (asyncio, random, AuthenticationServiceUnavailableError, AgentexResource) are all deleted — net 33-line reduction with no logic change.

Confidence Score: 5/5

Safe to merge — the removed code was confirmed dead at both SGP and Spark backends, and the retained register_resource call in create_task is the more complete write.

The change is a pure deletion: create_task already performs register_resource(task, parent=agent) which grants OWNER and sets the parent edge. The removed grant_with_retry only re-posted OWNER without a parent — fully subsumed. No references to grant_with_retry remain anywhere in the codebase, the 4 removed imports are genuinely unused, and the PR description provides load-test evidence confirming the duplicate writes.

No files require special attention.

Important Files Changed

Filename Overview
agentex/src/domain/use_cases/agents_acp_use_case.py Removes the redundant grant_with_retry method and its sole call-site; task_service.create_task already performs the equivalent (and more complete) register_resource write, so no authorization behavior is altered.

Sequence Diagram

sequenceDiagram
    participant UC as AgentsACPUseCase
    participant TS as TaskService
    participant AS as AuthorizationService

    Note over UC,AS: Before this PR
    UC->>TS: create_task(agent, ...)
    TS->>AS: "register_resource(task, parent=agent) OWNER + parent edge"
    TS-->>UC: task
    UC->>AS: grant_with_retry → grant(task) duplicate OWNER (no parent)

    Note over UC,AS: After this PR
    UC->>TS: create_task(agent, ...)
    TS->>AS: "register_resource(task, parent=agent) OWNER + parent edge"
    TS-->>UC: task
    Note right of UC: grant_with_retry removed — no duplicate write
Loading

Reviews (1): Last reviewed commit: "perf(agentex): drop redundant owner-edge..." | Re-trigger Greptile

_get_or_create_task created a task and then called grant_with_retry(task),
but create_task already registers the task as owner via
register_resource(task, parent=agent). On both backends those write the
same owner grant:

- SGP: register_resource delegates to grant(permission="owner");
  grant_with_retry also POSTs permission="owner" — byte-identical upsert
  to egp-api-backend /private/v5/agentex/permissions.
- Spark: register_resource confers the OWNER role (and the parent edge);
  grant_with_retry maps create -> OWNER (_RESOURCE_ROLE_MAP), the same
  role with no parent — fully subsumed by register_resource.

So every new task issued two identical authorization writes. Dropping the
duplicate halves new-task authz write volume and removes a retrying
round-trip from the message/send critical path. Also removes the now-dead
grant_with_retry method and its sole-use imports (asyncio, random,
AuthenticationServiceUnavailableError, AgentexResource).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@NiteshDhanpal NiteshDhanpal requested a review from a team as a code owner June 8, 2026 06:01
@NiteshDhanpal NiteshDhanpal marked this pull request as draft June 8, 2026 06:01
james-cardenas and others added 8 commits June 8, 2026 16:48
…ation (#287)

- Fixes custom app metrics (`auth_cache.*`, `db.client.*`) coexisting
with OTel Operator Python auto-instrumentation instead of replacing its
global `MeterProvider`.
- When a real SDK `MeterProvider` is already installed (operator),
`init_otel_metrics()` attaches custom instruments to it and does not
call `set_meter_provider()` or shut it down on exit.
- In standalone mode (local dev, no operator), the module still creates
its own provider when an OTLP endpoint is configured, with
protocol-aware export (`grpc` vs `http/protobuf`) via
`OTEL_EXPORTER_OTLP_METRICS_PROTOCOL` / `OTEL_EXPORTER_OTLP_PROTOCOL`.
- Relates to:
https://linear.app/scale-epd/issue/SGPINF-1718/execute-rocket-load-test-and-generate-report-for-10x-traffic

The previous implementation always called `metrics.set_meter_provider()`
with a gRPC exporter, which replaced the operator's HTTP/protobuf
provider and dropped auto-instrumentation HTTP metrics
(`http_server_request_duration_seconds`, etc.). `get_meter()` also
returned `None` under operator injection because it only checked a
module-local provider.

- Detect an existing real `MeterProvider` via `_global_meter_provider()`
and use it in shared mode.
- Track `_meter_provider` only when this module creates the provider
(for shutdown).
- `get_meter()` works when either we created a provider or a global one
is already installed.
- `shutdown_otel_metrics()` only shuts down a provider this module
created.

<!-- greptile_comment -->

<h3>Greptile Summary</h3>

This PR fixes the conflict between custom app metrics (`auth_cache.*`,
`db.client.*`) and the OTel Operator's Python auto-instrumentation by
detecting an already-installed SDK `MeterProvider` and attaching to it
rather than replacing it. In standalone mode the module still creates
its own provider with protocol-aware export (`grpc` or `http/protobuf`),
now with proper exception handling that prevents
`PeriodicExportingMetricReader` thread leaks and ensures a
`NoOpMeterProvider` is installed on shutdown to prevent a stale dead
provider from being mistaken for a live operator provider on re-init.

- **Shared mode**: `_global_meter_provider()` detects the operator's
provider; `init_otel_metrics()` attaches to it with no
`set_meter_provider()` call, and `shutdown_otel_metrics()` leaves it
running.
- **Standalone mode**: `set_meter_provider()` is wrapped in `try/except`
that shuts down the newly-created provider on failure, and the `finally`
block installs `NoOpMeterProvider()` to clear the global slot so re-init
works correctly.
- **Tests**: 11 new unit tests cover both modes, protocol selection,
failure/retry paths, and cache-instrument attribute preservation, with
private-SDK state injection wrapped in `pytest.skip` on
`AttributeError`.

<details><summary><h3>Confidence Score: 5/5</h3></summary>

Safe to merge — the coexistence logic is correct, previously identified
flag-ordering and thread-leak bugs are fixed, and the shutdown path now
clears the global OTel slot to prevent stale-provider misdetection on
re-init.

The state-machine transitions across shared/standalone/disabled modes
are all covered by tests, the try/finally exception handling in both
init_otel_metrics and shutdown_otel_metrics is correct, and the
invariant that _meter_provider is only set when this module owns the
global slot is maintained throughout.

No files require special attention. The one test isolation gap
(cache_metrics state not restored in teardown) is latent and has no
impact on the current test suite.
</details>

<h3>Important Files Changed</h3>

| Filename | Overview |
|----------|----------|
| agentex/src/utils/otel_metrics.py | Core fix: detects existing SDK
MeterProvider via _global_meter_provider(), attaches in shared mode
without calling set_meter_provider(), adds try/except around
set_meter_provider() to shut down leaked PeriodicExportingMetricReader
threads, and installs NoOpMeterProvider in finally-block to clear the
global slot on standalone shutdown. Previously flagged ordering bugs are
corrected. |
| agentex/tests/unit/utils/test_otel_metrics.py | New test file with 11
unit tests covering shared mode coexistence, standalone mode lifecycle,
protocol selection, retry-after-failure, and cache instrument attribute
preservation. The
test_custom_metrics_preserve_instrument_attributes_in_shared_mode test
mutates cache_metrics module state at setup but has no corresponding
teardown, leaving _instruments_initialized=True and counters bound to
the test provider after the test completes. |

</details>

<a
href="https://app.greptile.com/api/ide/cursor?prompt=Fix%20the%20following%201%20code%20review%20issue.%20Work%20through%20them%20one%20at%20a%20time%2C%20proposing%20concise%20fixes.%0A%0A---%0A%0A%23%23%23%20Issue%201%20of%201%0Aagentex%2Ftests%2Funit%2Futils%2Ftest_otel_metrics.py%3A192-197%0AThe%20test%20resets%20%60cache_metrics%60%20module%20state%20before%20the%20test%20body%20but%20leaves%20%60_instruments_initialized%3DTrue%60%20and%20the%20counters%20bound%20to%20the%20test-local%20%60operator_provider%60%20after%20the%20test%20completes.%20The%20autouse%20fixture%20restores%20the%20OTel%20global%20slot%20but%20does%20not%20touch%20%60cache_metrics%60.%20Any%20future%20test%20that%20calls%20%60record_cache_access%60%2F%60record_cache_eviction%60%20after%20this%20test%20will%20find%20%60_instruments_initialized%3DTrue%60%2C%20skip%20re-initialization%2C%20and%20silently%20emit%20to%20the%20stale%20instruments.%20Restoring%20the%20three%20fields%20in%20teardown%20keeps%20isolation%20complete.%0A%0A&pr=287&platform=github"><picture><source
media="(prefers-color-scheme: dark)"
srcset="https://greptile-static-assets.s3.amazonaws.com/badges/FixAllInCursorDark.svg?v=3"><source
media="(prefers-color-scheme: light)"
srcset="https://greptile-static-assets.s3.amazonaws.com/badges/FixAllInCursor.svg?v=3"><img
alt="Fix All in Cursor"
src="https://greptile-static-assets.s3.amazonaws.com/badges/FixAllInCursor.svg?v=3"
height="20"></picture></a> <a
href="https://app.greptile.com/ide/claude-code?prompt=Fix%20the%20following%201%20code%20review%20issue.%20Work%20through%20them%20one%20at%20a%20time%2C%20proposing%20concise%20fixes.%0A%0A---%0A%0A%23%23%23%20Issue%201%20of%201%0Aagentex%2Ftests%2Funit%2Futils%2Ftest_otel_metrics.py%3A192-197%0AThe%20test%20resets%20%60cache_metrics%60%20module%20state%20before%20the%20test%20body%20but%20leaves%20%60_instruments_initialized%3DTrue%60%20and%20the%20counters%20bound%20to%20the%20test-local%20%60operator_provider%60%20after%20the%20test%20completes.%20The%20autouse%20fixture%20restores%20the%20OTel%20global%20slot%20but%20does%20not%20touch%20%60cache_metrics%60.%20Any%20future%20test%20that%20calls%20%60record_cache_access%60%2F%60record_cache_eviction%60%20after%20this%20test%20will%20find%20%60_instruments_initialized%3DTrue%60%2C%20skip%20re-initialization%2C%20and%20silently%20emit%20to%20the%20stale%20instruments.%20Restoring%20the%20three%20fields%20in%20teardown%20keeps%20isolation%20complete.%0A%0A&repo=scaleapi%2Fscale-agentex&pr=287&platform=github"><picture><source
media="(prefers-color-scheme: dark)"
srcset="https://greptile-static-assets.s3.amazonaws.com/badges/FixAllInClaudeDark.svg?v=3"><source
media="(prefers-color-scheme: light)"
srcset="https://greptile-static-assets.s3.amazonaws.com/badges/FixAllInClaude.svg?v=3"><img
alt="Fix All in Claude Code"
src="https://greptile-static-assets.s3.amazonaws.com/badges/FixAllInClaude.svg?v=3"
height="20"></picture></a> <a
href="https://app.greptile.com/api/ide/codex?prompt=IMPORTANT%3A%20Work%20in%20the%20repository%20%22scaleapi%2Fscale-agentex%22%20on%20the%20existing%20branch%20%22jamesc-fix-otel-shared-meterprovider%22.%20Checkout%20that%20branch%20%E2%80%94%20do%20NOT%20create%20a%20new%20branch%20or%20open%20a%20new%20PR.%20Push%20your%20changes%20to%20%22jamesc-fix-otel-shared-meterprovider%22.%0A%0AFix%20the%20following%201%20code%20review%20issue.%20Work%20through%20them%20one%20at%20a%20time%2C%20proposing%20concise%20fixes.%0A%0A---%0A%0A%23%23%23%20Issue%201%20of%201%0Aagentex%2Ftests%2Funit%2Futils%2Ftest_otel_metrics.py%3A192-197%0AThe%20test%20resets%20%60cache_metrics%60%20module%20state%20before%20the%20test%20body%20but%20leaves%20%60_instruments_initialized%3DTrue%60%20and%20the%20counters%20bound%20to%20the%20test-local%20%60operator_provider%60%20after%20the%20test%20completes.%20The%20autouse%20fixture%20restores%20the%20OTel%20global%20slot%20but%20does%20not%20touch%20%60cache_metrics%60.%20Any%20future%20test%20that%20calls%20%60record_cache_access%60%2F%60record_cache_eviction%60%20after%20this%20test%20will%20find%20%60_instruments_initialized%3DTrue%60%2C%20skip%20re-initialization%2C%20and%20silently%20emit%20to%20the%20stale%20instruments.%20Restoring%20the%20three%20fields%20in%20teardown%20keeps%20isolation%20complete.%0A%0A&repo=scaleapi%2Fscale-agentex&pr=287&platform=github"><picture><source
media="(prefers-color-scheme: dark)"
srcset="https://greptile-static-assets.s3.amazonaws.com/badges/FixAllInCodexDark.svg?v=3"><source
media="(prefers-color-scheme: light)"
srcset="https://greptile-static-assets.s3.amazonaws.com/badges/FixAllInCodex.svg?v=3"><img
alt="Fix All in Codex"
src="https://greptile-static-assets.s3.amazonaws.com/badges/FixAllInCodex.svg?v=3"
height="20"></picture></a>

<details><summary>Prompt To Fix All With AI</summary>

`````markdown
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

agentex/tests/unit/utils/test_otel_metrics.py:192-197
The test resets `cache_metrics` module state before the test body but leaves `_instruments_initialized=True` and the counters bound to the test-local `operator_provider` after the test completes. The autouse fixture restores the OTel global slot but does not touch `cache_metrics`. Any future test that calls `record_cache_access`/`record_cache_eviction` after this test will find `_instruments_initialized=True`, skip re-initialization, and silently emit to the stale instruments. Restoring the three fields in teardown keeps isolation complete.

`````

</details>

<sub>Reviews (8): Last reviewed commit: ["Merge branch &#39;main&#39;
into
jamesc-fix-otel..."](51a270e)
| [Re-trigger
Greptile](https://app.greptile.com/api/retrigger?id=36313424)</sub>

<!-- /greptile_comment -->
Reverts the grant_with_retry removal (8c90deb). With register_resource
gated off per-account by fgac-tasks-dual-write (agentex-auth #358) and that
flag unregistered in sgp-infra-staging, nothing writes the per-task owner
edge — a regression from the pre-FGAC behavior where grant wrote it
unconditionally on every task.

register_resource alone is not sufficient while dual-write is off, so this
restores the ungated grant so task ownership is recorded regardless of the
dual-write flag state, until the FGAC dual-write rollout and the
per-task-edge dependency question are resolved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants