Skip to content

Tracker intake hardening slice: rate-limit budget, spawn cap, sandbox boundary, and write-back port #2344

Description

@jleechan2015

Summary

PR #2325 (feat(tracker): GitHub issue intake) closes #2324 and lands the per-project poll-and-spawn path. This issue proposes the hardening slice that needs to land alongside (or as a follow-up fast-follow to) #2325 before issue intake is safe to enable on a real repo with more than a handful of eligible open issues.

This is not a duplicate of #2282 / #2324 / #2325 / #2288 — those define the intake product. This issue focuses on the durable-boundary, sandbox, and rate-limit budget mechanics that PR #2325's body says it does not cover (per-project 5-min backoff on failure does not bound parallelism on a healthy first cycle, and the read path goes through gh which is the active critical bottleneck).

If the maintainer prefers, this can be merged as a hardening checklist on PR #2325 itself rather than a separate issue; the goal is to surface the gaps, not to require a separate PR.

Gaps not currently addressed by #2325 (verified 2026-07-01 against PR #2325 body)

# Gap Source Why it matters for intake
1 No max_concurrent_intake_spawns global cap on workers started by trackerintake.Observer. Per-project 5-min backoff (#2325) bounds failure, not success-path parallelism. Reviewer C failure mode #1 (worker amplification storm). Related: #918 (maxConcurrentSessions enforcement, codex-scoped today). A noisy label or a long backlog + a clean first poll can spawn N agents in one 30-second cycle → SQLite WAL contention, loopback HTTP queue back-pressure, host OOM, GitHub 429s within minutes.
2 No backfill / first-poll flood guard. The poll loop fires every tick once intake is enabled; there is no WHERE remote_updated_at > cursor initial-bounds clause visible from #2325's description. Reviewer A tracker_sync_cursor recommendation; Reviewer C failure modes #1 + #7 (CDC durability gap on crash). After daemon restart, or when intake is first enabled on a project with N open issues, the loop can dispatch once per poll cycle for each eligible issue, exceeding #1 even at modest backlogs.
3 ao spawn does not reject closed/cancelled issues (#2063 is open). #2063 + #2064 open. Auto-intake amplifies #2063 from a one-off CLI footgun into a permanent rate-limit + worker-budget leak: every label flip on a closed issue triggers a fresh session.
4 Tracker reads go through gh CLI subprocesses; the dashboard read path is benchmarked as the dominant 20–40s bottleneck, not tmux/ps (#1885, priority: critical). #1885 critical. A 30-second poll cadence plus a 20–40s gh round-trip means the poller effectively cannot sustain 30s cadence once it has more than a few eligible issues per project. The intake loop will visibly fall behind the moment it is enabled.
5 No sandbox boundary at worker spawn. Issue body is rendered into the worker prompt and the worker inherits the daemon environment (which holds AO_GITHUB_TOKEN / gh auth token). Reviewer C top risk (prompt injection via issue body → secret exfiltration). A malicious issue body → prompt-injected worker → exfil. Read-only-ness on the tracker side does not protect you from the worker side.
6 No write-back path / no GitHub App. #2325 is explicitly read-only toward GitHub. The state-transition gap (issue #40, "in-progress" / "in-review" reverse-map label is read but never written) means status propagation from worker → issue is unsupported. Reviewer B GitHub-App-vs-OAuth note; existing tracker.go comment re #40. When an intake-spawned worker transitions its PR through ci_failed → review_pending → mergeable, the originating issue cannot be transitioned to in_progress (or done on merge) without a write-back path. For org rollouts the recommended path is a GitHub App, not user tokens.

Proposed v1 hardening (composes into PR #2325 or a fast-follow PR)

P1 — must have before any project enables intake in production:

  1. Durable tracker_sync_cursor row per project. Poller emits issues with updated_at > cursor only; the cursor advances atomically in the same SQLite transaction as the issue_observed_at write. (Closes C failure mode feat: notifier-composio plugin + integration tests for all plugins #7.)
  2. max_concurrent_intake_spawns knob, default 2, env-driven. Backing semaphore in Session Manager, decoupled from poll cadence. (Closes C failure modes feat: implement web dashboard with attention-zone UI and API routes #1 + feat: agent plugins, OpenCode plugin, integration tests, CI #5.)
  3. First-poll throttle. On the first N cycles after intake is enabled (or after a tracked downtime), cap dispatch_per_cycle = min(issues_per_cycle, ramp_limit); ramp limit configurable, default 1. (Closes C failure mode feat: implement web dashboard with attention-zone UI and API routes #1.)
  4. Closed/cancelled short-circuit at intake time — even before the observer dispatches. (ao spawn does not reject closed/cancelled issues — spawns full session on dead work #2063 follow-up becomes load-bearing.)
  5. Tracker reads via the direct REST adapter, not gh subprocess. Fall back to gh only when no token env is set. (Unblocks bug(core,web): dashboard read path makes per-session gh calls — gh is the dominant 20-40s bottleneck (benchmarked), not tmux/ps; #1858 is scoped to the wrong probes #1885 critical; required for any cadence below ~60s.)

P2 — must have before intake is a default-on recommendation:

  1. Sandbox boundary on spawned worker runtime. No GITHUB_TOKEN, no ~/.ssh, no daemon env inherited. Issue body pinned as untrusted data in the prompt template (quoted / delimited), never rendered as instructions. (Closes C top risk.)
  2. Optional write-back port TrackerWriter behind a separate interface, opt-in per project. v1 implementation: comment the PR link back to the issue on session start; on PR merge, optionally transition open → done. (Closes A's "separate write-only port" verdict; complements fix: recognize terminated/done session states and hide terminal for dead sessions #40.)
  3. tracker_intake_enabled = false by default. Document the recommendation that projects opt in one at a time so the rate-limit budget per token stays bounded.

P3 — nice to have, document the deferred scope explicitly:

  1. GitHub App migration plan (per-installation tokens, webhook ingest of issues.opened / issues.labeled) so org rollouts avoid the user-token RPS ceiling. Webhook is the de-facto pattern for org-wide write-back; the polling path remains the fallback health-check loop.
  2. Multi-provider (Linear, Jira, GitLab) parallel intake behind the same TrackerResolver seam. (Paired with feat(tracker): multi-provider issue intake (GitHub, Linear, Jira) #2288.)

Evidence bar for "done" (proposed merge checklist)

  • Unit test: issues.state_transition_at rejects every illegal flip, including machine-issued in_progress → done.
  • Integration test: an injected fixture of 50 eligible issues + a single cycle dispatches ≤ max_concurrent_intake_spawns workers; restart-of-daemon dispatches 0 (cursor + durable fact prevent dup spawn).
  • Load test: 10k issues in SQLite, 1-hour soak, sustained RSS ≤ baseline + 20%, zero GitHub 429s. Token remaining ≥ 10% at all times.
  • Security test: spawned worker process env contains no GITHUB_TOKEN, ~/.ssh unreadable inside worker mount, egress denied by default. An injection-laden issue body never reaches shell / agent context as code.
  • Crash test: kill daemon mid-poll, restart, assert no duplicate worker for issues already in in_progress.
  • Schema migration is additive (issue_observations table), no rewrite of pull_requests / pr_observations.
  • Sign-off: PR-side SCM Observer owner confirms the intake observer does not touch the PR poller's write path.

Context

/cc the GitHub issue templates request (#2210) — this issue would arguably have been filed into that template slot had it existed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions