agentic-backend-readiness: per-agent secret isolation, dormant cortex, structured worker outcomes#583
Conversation
Worker::run now returns Result<WorkerOutcome> with variants Success, Partial, Cancelled, Timeout, Failed — replaces the prior Result<String> so callers can distinguish a clean exit from a max-segments partial, a wall-clock timeout, a cancellation, or a hard failure without peeking at the result text. Adds CortexConfig.worker_wall_clock_timeout_secs (default 1800s), distinct from the existing supervisor idle-kill bound. Worker::run wraps run_inner in tokio::time::timeout; on expiry the inner future is dropped and we return Timeout, reading segment progress from a shared atomic so the caller knows how far the worker got. spawn_worker_task is now generic over Result<WorkerOutcome>; the OpenCode worker boundary wraps its text result into Success at the spawn site. Cortex pickup_one_ready_task path consolidates into a single Result<WorkerOutcome, WorkerCompletionError> dispatch and splits on success vs failure once. Lays the groundwork for phase 4 to add a Blocked variant for captcha / login-wall detection in the browser tool.
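A minimal sketch of the shape this describes — the `WorkerOutcome` variants and the `tokio::time::timeout` wrap are from the description above, while `run_inner`'s signature and the progress-atomic plumbing are assumptions:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicU32, Ordering};
use std::time::Duration;

/// Structured terminal state replacing the old `Result<String>`.
pub enum WorkerOutcome {
    Success(String),
    Partial { segments_completed: u32 },
    Cancelled,
    Timeout { segments_completed: u32 },
    Failed(String),
}

pub struct Worker;

impl Worker {
    // Stub standing in for the real segment loop, which would bump
    // `progress` after each completed segment.
    async fn run_inner(&mut self, progress: Arc<AtomicU32>) -> anyhow::Result<WorkerOutcome> {
        progress.fetch_add(1, Ordering::Relaxed);
        Ok(WorkerOutcome::Success("done".into()))
    }

    pub async fn run(&mut self, wall_clock_timeout_secs: u64) -> anyhow::Result<WorkerOutcome> {
        // The shared atomic survives the inner future being dropped on
        // expiry, so the caller still learns how far the worker got.
        let progress = Arc::new(AtomicU32::new(0));
        let budget = Duration::from_secs(wall_clock_timeout_secs);
        match tokio::time::timeout(budget, self.run_inner(Arc::clone(&progress))).await {
            Ok(outcome) => outcome,
            Err(_elapsed) => Ok(WorkerOutcome::Timeout {
                segments_completed: progress.load(Ordering::Relaxed),
            }),
        }
    }
}
```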
Adds WorkerOutcome::Blocked with structured BlockReason (Captcha, LoginWall, RateLimit, FraudDetect, Unknown) and BlockEvidence (final URL + truncated HTML snippet) so a worker that hits an external defense exits cleanly and surfaces enough for parent-app escalation without retry-looping on a dead end. Detection lives in a new pure module (tools/browser_detection.rs) operating on DOM markers — recaptcha / hcaptcha / Cloudflare Turnstile iframe srcs, two-marker Cloudflare interstitial heuristic, login-path final URL with optional same-host gate. Conservative on purpose: a single Cloudflare marker doesn't fire (would false-positive on every CF-hosted site); login-wall on a different host is treated as an intentional federation, not a block. Header-based heuristics deferred until chromiumoxide exposes response headers cleanly. The browser tools (navigate, click) run detection after their action and write to a worker-shared BlockSignal slot; the worker reads the slot at each loop iteration and short-circuits before the next LLM turn. Mirrors the SpacebotHook.outcome_signaled pattern so the worker loop stays the single arbiter of terminal state. Channel-level browser tool servers (cortex / interactive) pass None for the signal — only worker-driven sessions short-circuit.
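A sketch of the signal-slot handoff this describes; `BlockReason`, `BlockEvidence`, and the worker-shared slot are from the description, while the concrete slot type and helper names are assumptions:

```rust
use std::sync::{Arc, Mutex};

pub enum BlockReason { Captcha, LoginWall, RateLimit, FraudDetect, Unknown }

pub struct BlockEvidence {
    pub final_url: String,
    pub html_snippet: String, // truncated
}

/// Worker-shared slot: browser tools write, the worker loop reads.
pub type BlockSignal = Arc<Mutex<Option<(BlockReason, BlockEvidence)>>>;

// Tool side (navigate / click): after the action, run detection and
// record a positive result. Channel-level tool servers pass `None`,
// so only worker-driven sessions short-circuit.
fn record_detection(signal: &Option<BlockSignal>, hit: Option<(BlockReason, BlockEvidence)>) {
    if let (Some(slot), Some(hit)) = (signal, hit) {
        *slot.lock().unwrap() = Some(hit);
    }
}

// Worker side: checked at the top of each loop iteration, before the
// next LLM turn, so the loop stays the single arbiter of terminal state.
fn check_blocked(signal: &BlockSignal) -> Option<(BlockReason, BlockEvidence)> {
    signal.lock().unwrap().take()
}
```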
Introduces SecretScope { InstanceShared | Agent(id) } orthogonal to the
existing SecretCategory { System | Tool }. Worker subprocess env now
only carries InstanceShared(Tool) ∪ Agent(this_agent, Tool) — a tenant's
worker can no longer printenv another tenant's GH_TOKEN. System secrets
(LLM keys, messaging tokens) stay InstanceShared since their consumers
are singletons.
Storage layout: redb keys are now NUL-delimited "s\\x00<NAME>" or
"a\\x00<AGENT_ID>\\x00<NAME>". Pre-scope keys are detected and rewritten
to InstanceShared on first store open — idempotent, non-destructive.
Existing instances keep working without admin intervention; promotion
of tool secrets to per-agent is an explicit op.
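A sketch of the key encoding and legacy rewrite described above (the NUL-delimited prefixes are from this description; the function names are assumptions):

```rust
pub enum SecretScope {
    InstanceShared,
    Agent { id: String },
}

/// "s\x00<NAME>" for shared rows, "a\x00<AGENT_ID>\x00<NAME>" for
/// agent rows. NUL can't appear in names, so decoding is unambiguous.
fn encode_key(scope: &SecretScope, name: &str) -> String {
    match scope {
        SecretScope::InstanceShared => format!("s\x00{name}"),
        SecretScope::Agent { id } => format!("a\x00{id}\x00{name}"),
    }
}

/// Pre-scope keys have no NUL delimiter; on first store open they are
/// rewritten (idempotently, non-destructively) to the shared encoding.
fn is_legacy_key(raw: &str) -> bool {
    !raw.contains('\x00')
}

fn migrate_legacy_key(raw: &str) -> String {
    encode_key(&SecretScope::InstanceShared, raw)
}
```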
Sandbox now owns the agent_id and reads scoped secrets in tool_secrets().
Worker prompt rendering, the scrubber, the SecretSetTool, and the
browser_type secret-resolution path all flow agent_id through. Channel
and cortex tool servers pass None for the optional agent_id — those
contexts fall back to instance-shared only. The browser secret resolver
tries the agent scope first, then shared, so worker-scoped GH_TOKEN can
shadow a shared one without breaking the channel-level fallback.
API surface keeps the existing PUT/DELETE/INFO endpoints as admin-level
InstanceShared writes; list now returns scope per row. Per-agent secret
endpoints + dashboard work follow.
Tests: 3 new — per-agent isolation, shadow-shared precedence, and legacy
key migration. 874/874 lib tests pass.
Adds CortexConfig.cron_default_timeout_secs (Option<u64>, None = system default 1500s). Resolution chain in cron/scheduler is per-job override → per-agent default → DEFAULT_CRON_TIMEOUT_SECS. Lifts the previous hard-coded 1500s out of the scheduler so research-heavy agent classes can raise it without setting timeout_secs on every job, and short-task classes can lower it without a per-job hint.
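The resolution chain reads as a simple fallback; a sketch follows (the constant and config field are named in the description, the helper function is an assumption):

```rust
const DEFAULT_CRON_TIMEOUT_SECS: u64 = 1500;

/// Per-job override → per-agent default → system default.
fn resolve_cron_timeout(
    job_timeout_secs: Option<u64>,
    cron_default_timeout_secs: Option<u64>,
) -> u64 {
    job_timeout_secs
        .or(cron_default_timeout_secs)
        .unwrap_or(DEFAULT_CRON_TIMEOUT_SECS)
}
```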
Adds CortexMode { Active, Dormant } and lets agents skip every cortex
loop (warmup, tick, association, ready-task pickup) when dormant. Cuts
periodic LLM-backed bulletin generation to zero on idle agents — the
deployment shape this whole branch is aimed at.
Wake substrate (src/agent/wake.rs): a single mpsc-fed WakeManager task
runs at startup with a shared HashMap<AgentId, AgentDeps> registry.
Tools and cron fire wakes by sending the receiving agent's id over
AgentDeps.wake_tx; the manager looks up the receiver and runs
cortex::wake_one (one ready-task pickup pass). Active-mode agents are
also valid wake targets — wake_one races with the spawn_ready_task_loop
harmlessly because task_store.claim_next_ready is transactional.
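A sketch of that substrate — `AgentDeps`, the single mpsc feed, and the one-pass `wake_one` semantics follow the description; the struct contents and signatures are assumptions (the real registry would be shared, not owned by the task):

```rust
use std::collections::HashMap;
use tokio::sync::mpsc;

type AgentId = String;

struct AgentDeps {
    // per-agent handles needed by cortex::wake_one
}

/// Single manager task fed by one mpsc channel; tools and cron fire
/// wakes by sending the receiving agent's id.
async fn run_wake_manager(
    mut wake_rx: mpsc::Receiver<AgentId>,
    registry: HashMap<AgentId, AgentDeps>,
) {
    while let Some(agent_id) = wake_rx.recv().await {
        match registry.get(&agent_id) {
            // One ready-task pickup pass; safe to race with the
            // active-mode spawn loop because claiming is transactional.
            Some(deps) => wake_one(deps).await,
            None => tracing::warn!(%agent_id, "wake for unregistered agent dropped"),
        }
    }
}

async fn wake_one(_deps: &AgentDeps) {
    // cortex::wake_one: claim_next_ready + spawn a worker if a task exists.
}
```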
Trigger sites:
- send_agent_message post-task_store.create
- cron fire path
- POST /api/agents/{id}/wake (admin / debugging)
Memory janitor (src/agent/maintenance.rs): MemoryJanitorConfig.enabled
(default false) toggles a single instance-wide task that walks every
registered agent on interval_secs (default daily) and runs the existing
memory::maintenance machinery. Required when running dormant agents at
scale; opt-in. Idempotent on active-mode agents.
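A sketch of the config surface this adds — the field names and defaults follow the description; the serde derive shape is an assumption:

```rust
/// Opt-in, instance-wide maintenance pass for dormant deployments.
#[derive(Clone, Debug, serde::Deserialize)]
pub struct MemoryJanitorConfig {
    /// Default false: nothing runs unless explicitly enabled.
    #[serde(default)]
    pub enabled: bool,
    /// Default daily.
    #[serde(default = "default_janitor_interval_secs")]
    pub interval_secs: u64,
}

fn default_janitor_interval_secs() -> u64 {
    86_400
}
```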
Deferred from doc plan: bulletin caching with TTL (current minimum
viable wake just kicks pickup; the worker prompt rendering still pulls
the latest bulletin from memory state), messaging/webhook routing-loop
wake hooks (channels still spawn naturally — cortex loops are not on
that path).
Walkthrough
This PR implements five production-readiness features for multi-tenant agentic backend deployment: per-agent secret scoping with legacy migration, dormant cortex mode with wake-on-trigger infrastructure, wall-clock worker timeouts with structured outcomes, browser block/captcha detection, and per-agent cron timeout defaults. A design document outlines the phased rollout and cross-file dependencies.
Changes
Agentic Backend Readiness (Integrated Feature Delivery)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 5 passed
⚠️ Warning: there were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ast-grep (0.42.1): src/api/server.rs
Actionable comments posted: 15
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/config/load.rs (1)
2639-2663: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Whitelist the new `[memory_janitor]` top-level section. Adding `memory_janitor` here without updating `KNOWN_TOP_LEVEL_KEYS` means every valid `[memory_janitor]` config will still emit the generic "unknown top-level key" warning from `warn_unknown_config_keys`, which makes the new setting look ignored.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/config/load.rs` around lines 2639 - 2663, The new MemoryJanitorConfig (constructed as memory_janitor in load.rs) was added to Config but not added to the whitelist of known top-level keys, so warn_unknown_config_keys still flags a valid [memory_janitor] section; fix by adding "memory_janitor" to the KNOWN_TOP_LEVEL_KEYS collection (or equivalent constant/vec used by warn_unknown_config_keys) so that the new top-level key is recognized, and ensure any unit tests that verify known keys are updated to include MemoryJanitorConfig if present.
src/agent/worker.rs (1)
841-985: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Follow-up runs never consume `blocked_signal`. The initial task loop exits with `WorkerOutcome::Blocked`, but the interactive follow-up loop goes straight from tool/browser failures to a generic `Failed`. If a resumed or interactive worker hits a captcha/login wall during follow-up, phase 4 regresses to an opaque failure instead of the structured blocked outcome.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/agent/worker.rs` around lines 841 - 985, The follow-up loop currently turns tool/browser/captcha/login stalls into generic failures instead of honoring blocked_signal and returning WorkerOutcome::Blocked; update the error handling in the follow-up retry loop (the match that produces follow_up_result and the Err(error) arms) to check/consume the same blocked_signal used in the main task loop and, when present, set the worker state and outcome to the blocked path (mirroring WorkerOutcome::Blocked behavior) instead of treating it as a normal failure; specifically, when a tool-nudge/captcha/login condition is detected (e.g., via SpacebotHook::is_tool_nudge_reason or the same signal used elsewhere), call the same blocked_signal consumption logic and produce the blocked outcome (and avoid writing a generic failed state) so follow-up runs resume to the structured blocked handling rather than WorkerState::Failed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/design-docs/agentic-backend-readiness.md`:
- Line 11: Typo in the use-case bullet: change the word "pollluted" to
"polluted" in the sentence "Owns its entity's history forever (not pollluted by
other entities' context)"; update that exact bullet text so it reads "Owns its
entity's history forever (not polluted by other entities' context)".
- Around line 125-137: The fenced code block that begins with "```" and contains
the flow sketch starting "trigger fires → cortex::wake(agent_id, trigger) ..."
is missing a language tag; update that opening fence to include the language
"text" (i.e., change ``` to ```text) so markdownlint MD040 is satisfied and
syntax highlighting is preserved for the flow sketch.
In `@src/agent/cortex.rs`:
- Around line 3974-3982: The code treats task_store.update(...) which returns
Result<Option<_>, _> as a simple success/failure; explicitly handle the Ok(None)
case in both persistence branches where task_store.update(...) is awaited (the
lines around the current call to task_store.update(task.task_number,
UpdateTaskInput { status: Some(TaskStatus::Done), .. })) so that Ok(Some(_))
proceeds to emit TaskUpdated, WorkerComplete and delegation notifications but
Ok(None) is treated as “no row updated” and does not emit those events; update
the matching logic in the same blocks that already check Err(_) (and mirror the
timeout-branch matching that handles Ok(Some), Ok(None), Err) to return or log a
no-update path instead of treating Ok(None) as success.
- Around line 3947-3973: The current code flattens all non-success worker
completions into a generic retry path by using map_worker_completion(normalized)
and then requeuing to TaskStatus::Ready; instead, preserve and inspect the
actual WorkerOutcome (do not discard the Ok(outcome) variant in normalized) and
handle WorkerOutcome::Timeout and WorkerOutcome::Blocked specially (e.g., set
distinct TaskStatus variants like TimedOut/Blocked or route to separate
backoff/retry logic) rather than immediately marking them Ready. Concretely:
stop converting Ok(Ok(outcome)) into only a result_text/success boolean
flow—keep the outcome value (from normalized or the original worker_result),
call map_worker_completion only for text/notify but then match on the preserved
WorkerOutcome to choose the correct TaskStatus (handling Timeout and Blocked
separately), while keeping the existing failure handling for
WorkerCompletionError and panics.
In `@src/agent/maintenance.rs`:
- Around line 46-55: The per-agent maintenance loop currently awaits
run_maintenance_for_agent(&deps).await serially with no timeout, so a hung call
can stall the whole janitor; wrap each per-agent call in a Tokio timeout
(tokio::time::timeout) or spawn it as a cancellable task and await with timeout
so a slow/hung agent is aborted and the loop continues; specifically, replace
the direct await of run_maintenance_for_agent(&deps).await in the agents loop
(and the similar section covering lines 60–76) with a timeout-wrapped call that
logs a distinct timeout warning (including agent_id) and continues to the next
agent when the timeout elapses, while still logging other errors from
run_maintenance_for_agent.
- Around line 60-77: run_maintenance_for_agent currently discards the
maintenance report from memory::maintenance::run_maintenance so prunes/merges
never mark knowledge synthesis dirty; change it to capture the maintenance
report result, inspect it for pruning/merge indicators (e.g.,
pruned_count/merged_count or a modified flag returned by run_maintenance), and
when non-zero call into the cortex runtime (via deps.runtime_config.cortex) to
bump or mark knowledge_synthesis_version/knowledge_synthesis_dirty using the
same mechanism the other maintenance path uses so dormant agents see the update.
In `@src/api/secrets.rs`:
- Around line 102-116: list_secrets() currently returns both InstanceShared and
Agent-scoped secrets while put_secret(), delete_secret(), and secret_info() are
hard-coded to SecretScope::shared(), causing inconsistent visibility and
inability to act on agent-scoped rows; either (A) restrict list_secrets() to
only include shared secrets (filter metadata by SecretScope::shared()) so the
API remains shared-only, or (B) thread an explicit scope parameter through the
CRUD handlers (modify the HTTP handlers and signatures for put_secret,
delete_secret, secret_info to accept a scope and use it when calling store.put,
store.delete, store.info) and update SecretListItem/response handling to include
the scope so clients can target agent-scoped secrets. Ensure all usages of
list_secrets, put_secret, delete_secret, and secret_info are adjusted
consistently (including places referenced around the other occurrences) so
listing and mutation use the same scope semantics.
In `@src/config/load.rs`:
- Around line 199-204: The worker_wall_clock_timeout_secs value is currently
accepted as-is so a configured 0 would be allowed; compute the resolved value
(using
overrides.worker_wall_clock_timeout_secs.unwrap_or(defaults.worker_wall_clock_timeout_secs))
and validate it the same way maintenance_interval_secs is validated—ensure the
resolved worker_wall_clock_timeout_secs is >= 1 and return an Err (with a clear
message) from the config loader if it is < 1 before returning Ok(config); update
the resolution logic near the
cron_default_timeout_secs/worker_wall_clock_timeout_secs lines and reference
worker_wall_clock_timeout_secs, overrides, defaults, and the surrounding config
resolution function when making the change.
In `@src/cron/scheduler.rs`:
- Around line 1405-1409: The wake is currently fired too early (using
context.deps.wake_tx and crate::agent::wake::fire_wake) before the cron channel
produces any follow-up tasks, risking the one-shot dormant wake being consumed
before work exists; move the wake invocation out of the current pre-channel
location and either (a) invoke
crate::agent::wake::fire_wake(context.deps.wake_tx, &context.deps.agent_id)
immediately after the cron channel processing completes, or (b) call it from the
specific side-effect creation path that enqueues follow-up tasks so the dormant
wake is delivered only when new work/associations are actually created.
In `@src/main.rs`:
- Around line 2776-2784: The wake_registry created for spawn_wake_manager is
only populated in initialize_agents and becomes stale when agents are
added/removed later; move its ownership to live agent map scope (the same place
the runtime agent HashMap is stored) or add explicit register/unregister hooks
on the wake-manager that are invoked from the main-loop add/remove paths so new
agents call wake_manager.register(agent_id, AgentDeps) and removed agents call
wake_manager.unregister(agent_id); update uses of wake_registry,
spawn_wake_manager, api_state.wake_tx and the add/remove handling in the main
loop to call these hooks so the registry stays in sync with runtime agent
changes.
- Around line 3752-3764: The startup warmup pass must be gated by the same
dormant-mode check so dormant agents don't run warmup; update the code that
schedules run_warmup_once to first read each agent's cortex mode
(agent.deps.runtime_config.cortex.load().mode) and skip calling or scheduling
run_warmup_once when cortex_mode.is_dormant() is true (the same check used in
the loop that skips the recurring cortex loops), ensuring agents iteration
(agents, agent_id, agent) never triggers run_warmup_once for dormant agents.
In `@src/secrets/store.rs`:
- Around line 748-791: The tool_env_vars() two-pass merge currently leaves a
shared secret in place when an agent-scoped secret exists but read_value(scope,
name) fails; change the second-pass behavior so that for each agent-scoped entry
(match SecretScope::Agent { id } if id == agent_id.as_ref()) you either insert
the agent value on success or explicitly remove any existing entry from result
when read_value returns None, ensuring unreadable agent overrides do not fall
back to shared secrets; apply the same change to the other similar merge block
that follows the same pattern (the other tool-secret-merge routine using
list_metadata and read_value).
- Around line 592-624: The delete() method currently performs mutations without
acquiring the shared mutation_guard, allowing race conditions with
enable_encryption(), rotate_key(), and set(); fix it by acquiring and holding
the mutation_guard for the duration of the delete operation (before begin_write
and until after commit) so the delete's write_txn and its removals from
SECRETS_TABLE and METADATA_TABLE cannot interleave with encryption/rotation/set
operations; ensure you use the same mutation_guard used by enable_encryption(),
rotate_key(), and set() so the lock is taken at the start of delete() and
released only after write_txn.commit() completes.
In `@src/tools/browser_detection.rs`:
- Around line 102-130: detect_login_wall currently flags any final_url path
matching login_path_signals as BlockReason::LoginWall when the final host equals
navigation_target_host, which blocks intentional sign-in navigations; update
detect_login_wall to also consider the requested/navigation target path (or
accept an explicit allow_expected_login boolean) so it only classifies as
LoginWall when the navigation target path differs from the final path (i.e., an
unexpected redirect to a login page) or when callers do not mark the login as
expected; use the existing symbols (detect_login_wall, final_url,
navigation_target_host, login_path_signals, BlockReason::LoginWall) to locate
the logic and add the extra comparison or parameter and short-circuit
accordingly.
In `@src/tools/browser.rs`:
- Around line 851-860: When resolving secrets in Browser::get (the block using
self.agent_id, SecretScope::agent and store.get), only fall back to
SecretScope::shared() if the agent-scoped get returns a NotFound error; if
store.get(&SecretScope::agent(...), name) returns any other error (e.g.
decrypt/unreadable), propagate that error wrapped in BrowserError::new instead
of trying the shared scope. Concretely: replace the current unconditional
fall-through logic with a match on the agent-scoped store.get result —
Ok(secret) => Ok(secret), Err(err) => if err indicates NotFound then attempt
store.get(&SecretScope::shared(), name) and map errors into
BrowserError::new(format!(...)), else return
Err(BrowserError::new(format!(...err...))). Ensure you reference store.get,
SecretScope::agent, SecretScope::shared, self.agent_id and BrowserError::new
when making the change.
---
Outside diff comments:
In `@src/agent/worker.rs`:
- Around line 841-985: The follow-up loop currently turns
tool/browser/captcha/login stalls into generic failures instead of honoring
blocked_signal and returning WorkerOutcome::Blocked; update the error handling
in the follow-up retry loop (the match that produces follow_up_result and the
Err(error) arms) to check/consume the same blocked_signal used in the main task
loop and, when present, set the worker state and outcome to the blocked path
(mirroring WorkerOutcome::Blocked behavior) instead of treating it as a normal
failure; specifically, when a tool-nudge/captcha/login condition is detected
(e.g., via SpacebotHook::is_tool_nudge_reason or the same signal used
elsewhere), call the same blocked_signal consumption logic and produce the
blocked outcome (and avoid writing a generic failed state) so follow-up runs
resume to the structured blocked handling rather than WorkerState::Failed.
In `@src/config/load.rs`:
- Around line 2639-2663: The new MemoryJanitorConfig (constructed as
memory_janitor in load.rs) was added to Config but not added to the whitelist of
known top-level keys, so warn_unknown_config_keys still flags a valid
[memory_janitor] section; fix by adding "memory_janitor" to the
KNOWN_TOP_LEVEL_KEYS collection (or equivalent constant/vec used by
warn_unknown_config_keys) so that the new top-level key is recognized, and
ensure any unit tests that verify known keys are updated to include
MemoryJanitorConfig if present.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2c73ff7f-40cf-48a8-aeb3-d2dadb34c70f
📒 Files selected for processing (31)
- docs/design-docs/agentic-backend-readiness.md
- src/agent.rs
- src/agent/branch.rs
- src/agent/channel.rs
- src/agent/channel_dispatch.rs
- src/agent/cortex.rs
- src/agent/invariant_harness.rs
- src/agent/maintenance.rs
- src/agent/wake.rs
- src/agent/worker.rs
- src/api/agents.rs
- src/api/providers.rs
- src/api/secrets.rs
- src/api/server.rs
- src/api/state.rs
- src/config/load.rs
- src/config/toml_schema.rs
- src/config/types.rs
- src/cron/scheduler.rs
- src/lib.rs
- src/main.rs
- src/opencode/worker.rs
- src/sandbox.rs
- src/secrets/scrub.rs
- src/secrets/store.rs
- src/tools.rs
- src/tools/browser.rs
- src/tools/browser_detection.rs
- src/tools/secret_set.rs
- src/tools/send_agent_message.rs
- src/tools/spawn_worker.rs
👮 Files not reviewed due to content moderation or server errors (4)
- src/api/agents.rs
- src/config/types.rs
- src/agent/wake.rs
- src/agent/channel_dispatch.rs
> The parent application has many entities (customers, accounts, listings, projects — domain depends). Each entity benefits from a dedicated, persistent agent that:
>
> - Owns its entity's history forever (not pollluted by other entities' context)
Fix the typo in the use-case bullets.
pollluted should be polluted.
🧰 Tools
🪛 LanguageTool
[grammar] ~11-~11: Ensure spelling is correct
Context: ... Owns its entity's history forever (not pollluted by other entities' context) - Holds iso...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/design-docs/agentic-backend-readiness.md` at line 11, Typo in the
use-case bullet: change the word "pollluted" to "polluted" in the sentence "Owns
its entity's history forever (not pollluted by other entities' context)"; update
that exact bullet text so it reads "Owns its entity's history forever (not
polluted by other entities' context)".
```
trigger fires
→ cortex::wake(agent_id, trigger)
→ check cached bulletin: if within grace period, reuse
→ else: synthesize bulletin from memory store (no LLM tool calls — pure memory query + RRF + single LLM synthesis call)
→ cache bulletin with TTL = wake_grace_period_secs
→ hand off to trigger-appropriate process:
  ─ task triggers → spawn worker via existing ready-task path
  ─ messaging triggers → spawn channel
  ─ webhook triggers → dispatch into the existing handler
  ─ cron triggers → spawn ephemeral channel (existing cron pattern)
→ grace timer starts; on expiry without new triggers, return to fully dormant
```
Add a language tag to this fenced block.
The unlabeled fence trips markdownlint (MD040) and loses syntax highlighting. text would fit this flow sketch.
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 125-125: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/design-docs/agentic-backend-readiness.md` around lines 125 - 137, The
fenced code block that begins with "```" and contains the flow sketch starting
"trigger fires → cortex::wake(agent_id, trigger) ..." is missing a language tag;
update that opening fence to include the language "text" (i.e., change ``` to
```text) so markdownlint MD040 is satisfied and syntax highlighting is preserved
for the flow sketch.
```rust
// Convert the worker's structured outcome (or panic / error)
// into a single Result<WorkerOutcome, WorkerCompletionError>
// so the success / failure split below is driven by one
// notion of "did it work".
let normalized: std::result::Result<WorkerOutcome, WorkerCompletionError> =
    match worker_result {
        Ok(Ok(outcome)) => Ok(outcome),
        Ok(Err(error)) => {
            Err(WorkerCompletionError::failed(scrub(error.to_string())))
        }
        Err(panic_payload) => {
            let panic_message =
                crate::agent::panic_payload_to_string(&*panic_payload);
            tracing::error!(
                %worker_id,
                %panic_message,
                "cortex worker task panicked"
            );
            Err(WorkerCompletionError::failed(format!(
                "worker task panicked: {}",
                scrub(panic_message)
            )))
        }
    };
let (result_text, _notify, success) = map_worker_completion(normalized);
let result_text = scrub(result_text);
if success {
```
Don't flatten Timeout and Blocked into an immediate ready retry.
This branch now throws away the WorkerOutcome kind after map_worker_completion() and requeues every non-success outcome to TaskStatus::Ready. For detached tasks, that means WorkerOutcome::Timeout and WorkerOutcome::Blocked will hot-loop on the next pickup pass instead of taking a distinct path, so a task can repeatedly hit the same wall-clock timeout or CAPTCHA/login wall forever. Keep the structured outcome around here and route at least Timeout / Blocked separately from generic failures.
Also applies to: 4049-4059
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/agent/cortex.rs` around lines 3947 - 3973, The current code flattens all
non-success worker completions into a generic retry path by using
map_worker_completion(normalized) and then requeuing to TaskStatus::Ready;
instead, preserve and inspect the actual WorkerOutcome (do not discard the
Ok(outcome) variant in normalized) and handle WorkerOutcome::Timeout and
WorkerOutcome::Blocked specially (e.g., set distinct TaskStatus variants like
TimedOut/Blocked or route to separate backoff/retry logic) rather than
immediately marking them Ready. Concretely: stop converting Ok(Ok(outcome)) into
only a result_text/success boolean flow—keep the outcome value (from normalized
or the original worker_result), call map_worker_completion only for text/notify
but then match on the preserved WorkerOutcome to choose the correct TaskStatus
(handling Timeout and Blocked separately), while keeping the existing failure
handling for WorkerCompletionError and panics.
```rust
let db_updated = task_store
    .update(
        task.task_number,
        UpdateTaskInput {
            status: Some(TaskStatus::Done),
            ..Default::default()
        },
    )
    .await;
```
Handle TaskStore::update(...)=Ok(None) explicitly in both persistence branches.
TaskStore::update is clearly a Result<Option<_>> here—the timeout path below already matches Ok(Some(_)), Ok(None), and Err(_). In these new success/failure branches you only test Err(_), so Ok(None) is treated as success: you'll emit TaskUpdated, WorkerComplete, and delegation notifications even though no task row was actually updated.
Suggested shape
```diff
-let db_updated = task_store.update(...).await;
-if let Err(ref error) = db_updated {
-    ...
-} else {
-    ...
-}
+match task_store.update(...).await {
+    Ok(Some(_)) => {
+        ...
+    }
+    Ok(None) => {
+        tracing::warn!(task_number = task.task_number, "task update returned no row");
+        ...
+    }
+    Err(error) => {
+        ...
+    }
+}
```
Also applies to: 3984-4047, 4051-4060, 4062-4112
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/agent/cortex.rs` around lines 3974 - 3982, The code treats
task_store.update(...) which returns Result<Option<_>, _> as a simple
success/failure; explicitly handle the Ok(None) case in both persistence
branches where task_store.update(...) is awaited (the lines around the current
call to task_store.update(task.task_number, UpdateTaskInput { status:
Some(TaskStatus::Done), .. })) so that Ok(Some(_)) proceeds to emit TaskUpdated,
WorkerComplete and delegation notifications but Ok(None) is treated as “no row
updated” and does not emit those events; update the matching logic in the same
blocks that already check Err(_) (and mirror the timeout-branch matching that
handles Ok(Some), Ok(None), Err) to return or log a no-update path instead of
treating Ok(None) as success.
```rust
for deps in agents {
    let agent_id = deps.agent_id.clone();
    if let Err(error) = run_maintenance_for_agent(&deps).await {
        tracing::warn!(
            %agent_id,
            %error,
            "janitor maintenance pass failed for agent"
        );
    }
}
```
Bound each agent pass so one hung maintenance job can't stall the whole janitor.
This loop awaits maintenance serially for every agent, with no timeout or isolation. If one store or embedding call hangs, the janitor never reaches the remaining agents, which effectively disables dormant-mode maintenance instance-wide. The existing cortex path had timeout/cancellation protection; this replacement needs at least a per-agent timeout.
Also applies to: 60-76
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/agent/maintenance.rs` around lines 46 - 55, The per-agent maintenance
loop currently awaits run_maintenance_for_agent(&deps).await serially with no
timeout, so a hung call can stall the whole janitor; wrap each per-agent call in
a Tokio timeout (tokio::time::timeout) or spawn it as a cancellable task and
await with timeout so a slow/hung agent is aborted and the loop continues;
specifically, replace the direct await of run_maintenance_for_agent(&deps).await
in the agents loop (and the similar section covering lines 60–76) with a
timeout-wrapped call that logs a distinct timeout warning (including agent_id)
and continues to the next agent when the timeout elapses, while still logging
other errors from run_maintenance_for_agent.
```rust
// Start cortex warmup, runtime, and association loops for each agent.
// Skip every loop when the agent is in Dormant mode — those agents wake
// only on external triggers (cross-agent message, cron fire, admin API).
for (agent_id, agent) in agents.iter() {
    let cortex_mode = agent.deps.runtime_config.cortex.load().mode;
    if cortex_mode.is_dormant() {
        tracing::info!(
            agent_id = %agent_id,
            "cortex loops skipped: agent is in dormant mode"
        );
        continue;
    }
```
Dormant mode still performs startup warmup.
This block skips the recurring cortex loops, but Line 3143 still schedules run_warmup_once for every agent. Dormant agents will therefore do warmup work at boot anyway, which breaks the dormant-mode contract and can still touch model/embedding startup paths. Gate the startup warmup pass with the same mode check.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/main.rs` around lines 3752 - 3764, The startup warmup pass must be gated
by the same dormant-mode check so dormant agents don't run warmup; update the
code that schedules run_warmup_once to first read each agent's cortex mode
(agent.deps.runtime_config.cortex.load().mode) and skip calling or scheduling
run_warmup_once when cortex_mode.is_dormant() is true (the same check used in
the loop that skips the recurring cortex loops), ensuring agents iteration
(agents, agent_id, agent) never triggers run_warmup_once for dormant agents.
```rust
pub fn delete(&self, scope: &SecretScope, name: &str) -> Result<(), SecretsError> {
    let write_txn = self.db.begin_write().map_err(|error| {
        SecretsError::Other(anyhow::anyhow!(
            "failed to begin write transaction: {error}"
        ))
    })?;
    let key = encode_key(scope, name);
    {
        let mut secrets = write_txn.open_table(SECRETS_TABLE).map_err(|error| {
            SecretsError::Other(anyhow::anyhow!("failed to open secrets table: {error}"))
        })?;
        secrets.remove(key.as_str()).map_err(|error| {
            SecretsError::Other(anyhow::anyhow!(
                "failed to remove key '{name}' ({scope}): {error}"
            ))
        })?;

        let mut meta = write_txn.open_table(METADATA_TABLE).map_err(|error| {
            SecretsError::Other(anyhow::anyhow!("failed to open metadata table: {error}"))
        })?;
        meta.remove(key.as_str()).map_err(|error| {
            SecretsError::Other(anyhow::anyhow!(
                "failed to remove metadata for '{name}' ({scope}): {error}"
            ))
        })?;
    }
    write_txn.commit().map_err(|error| {
        SecretsError::Other(anyhow::anyhow!(
            "failed to commit delete for '{name}' ({scope}): {error}"
        ))
    })?;

    Ok(())
```
Guard delete() with mutation_guard too.
delete is now the one mutating path that can still interleave with enable_encryption(), rotate_key(), or set(). If it lands after one of those methods snapshots the table but before it writes back, the deleted secret can be resurrected or have fresh metadata written back over the delete.
Suggested fix
```diff
 pub fn delete(&self, scope: &SecretScope, name: &str) -> Result<(), SecretsError> {
+    let _guard = self.mutation_guard.lock().expect("mutation guard poisoned");
     let write_txn = self.db.begin_write().map_err(|error| {
         SecretsError::Other(anyhow::anyhow!(
             "failed to begin write transaction: {error}"
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
pub fn delete(&self, scope: &SecretScope, name: &str) -> Result<(), SecretsError> {
    let _guard = self.mutation_guard.lock().expect("mutation guard poisoned");
    let write_txn = self.db.begin_write().map_err(|error| {
        SecretsError::Other(anyhow::anyhow!(
            "failed to begin write transaction: {error}"
        ))
    })?;
    let key = encode_key(scope, name);
    {
        let mut secrets = write_txn.open_table(SECRETS_TABLE).map_err(|error| {
            SecretsError::Other(anyhow::anyhow!("failed to open secrets table: {error}"))
        })?;
        secrets.remove(key.as_str()).map_err(|error| {
            SecretsError::Other(anyhow::anyhow!(
                "failed to remove key '{name}' ({scope}): {error}"
            ))
        })?;
        let mut meta = write_txn.open_table(METADATA_TABLE).map_err(|error| {
            SecretsError::Other(anyhow::anyhow!("failed to open metadata table: {error}"))
        })?;
        meta.remove(key.as_str()).map_err(|error| {
            SecretsError::Other(anyhow::anyhow!(
                "failed to remove metadata for '{name}' ({scope}): {error}"
            ))
        })?;
    }
    write_txn.commit().map_err(|error| {
        SecretsError::Other(anyhow::anyhow!(
            "failed to commit delete for '{name}' ({scope}): {error}"
        ))
    })?;
    Ok(())
}
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/secrets/store.rs` around lines 592 - 624, The delete() method currently
performs mutations without acquiring the shared mutation_guard, allowing race
conditions with enable_encryption(), rotate_key(), and set(); fix it by
acquiring and holding the mutation_guard for the duration of the delete
operation (before begin_write and until after commit) so the delete's write_txn
and its removals from SECRETS_TABLE and METADATA_TABLE cannot interleave with
encryption/rotation/set operations; ensure you use the same mutation_guard used
by enable_encryption(), rotate_key(), and set() so the lock is taken at the
start of delete() and released only after write_txn.commit() completes.
```rust
/// Get all tool secrets visible to `agent_id` as name→value pairs for
/// `Sandbox::wrap()` injection. Visibility is the union of every
/// `InstanceShared(Tool)` row and every `Agent(agent_id, Tool)` row;
/// agent-scoped values shadow shared values when names collide.
///
/// Returns an empty map when locked (tool secrets unavailable).
pub fn tool_env_vars(&self, agent_id: &AgentId) -> HashMap<String, String> {
    if self.state() == StoreState::Locked {
        return HashMap::new();
    }

    let metadata = match self.list_metadata(None) {
        Ok(m) => m,
        Err(error) => {
            tracing::warn!(%error, "failed to list secret metadata for tool env vars");
            return HashMap::new();
        }
    };

    // Two-pass insert so agent-scoped values shadow same-named shared values.
    let mut result = HashMap::new();
    for ((scope, name), _meta) in metadata
        .iter()
        .filter(|(_, m)| m.category == SecretCategory::Tool)
    {
        if !scope.is_shared() {
            continue;
        }
        if let Some(value) = self.read_value(scope, name) {
            result.insert(name.clone(), value);
        }
    }
    for ((scope, name), _meta) in metadata
        .iter()
        .filter(|(_, m)| m.category == SecretCategory::Tool)
    {
        if !matches!(scope, SecretScope::Agent { id } if id == agent_id.as_ref()) {
            continue;
        }
        if let Some(value) = self.read_value(scope, name) {
            result.insert(name.clone(), value);
        }
    }

    result
}
```
Unreadable agent overrides currently fall back to the shared credential.
tool_env_vars() documents that agent-scoped values shadow shared ones, but the current two-pass merge only overwrites on a successful agent read. If an agent-scoped secret exists and read_value() fails, the shared value stays in the map and gets injected into that worker instead of surfacing the isolation problem.
Also applies to: 828-840
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/secrets/store.rs` around lines 748 - 791, The tool_env_vars() two-pass
merge currently leaves a shared secret in place when an agent-scoped secret
exists but read_value(scope, name) fails; change the second-pass behavior so
that for each agent-scoped entry (match SecretScope::Agent { id } if id ==
agent_id.as_ref()) you either insert the agent value on success or explicitly
remove any existing entry from result when read_value returns None, ensuring
unreadable agent overrides do not fall back to shared secrets; apply the same
change to the other similar merge block that follows the same pattern (the other
tool-secret-merge routine using list_metadata and read_value).
```rust
pub fn detect_login_wall(
    final_url: Option<&str>,
    navigation_target_host: Option<&str>,
) -> Option<BlockReason> {
    let url = final_url?;
    let parsed = url::Url::parse(url).ok()?;
    let path = parsed.path().to_lowercase();
    let login_path_signals = [
        "/login",
        "/signin",
        "/auth/",
        "/account/login",
        "/users/sign_in",
    ];
    let path_is_login = login_path_signals
        .iter()
        .any(|signal| path.contains(signal));
    if !path_is_login {
        return None;
    }
    if let Some(target_host) = navigation_target_host {
        let final_host = parsed.host_str().unwrap_or("");
        if final_host.eq_ignore_ascii_case(target_host) {
            return Some(BlockReason::LoginWall);
        }
    } else {
        return Some(BlockReason::LoginWall);
    }
    None
}
```
This will classify intentional sign-in flows as LoginWall.
The heuristic fires on any login-looking path once the final host matches the target host, and the caller only passes host context. A worker that deliberately navigates to /login or clicks a “Sign in” button will be short-circuited as blocked, which breaks normal auth flows.
Please compare against the requested URL/path, not just the host, or give callers a way to mark login as expected.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/tools/browser_detection.rs` around lines 102 - 130, detect_login_wall
currently flags any final_url path matching login_path_signals as
BlockReason::LoginWall when the final host equals navigation_target_host, which
blocks intentional sign-in navigations; update detect_login_wall to also
consider the requested/navigation target path (or accept an explicit
allow_expected_login boolean) so it only classifies as LoginWall when the
navigation target path differs from the final path (i.e., an unexpected redirect
to a login page) or when callers do not mark the login as expected; use the
existing symbols (detect_login_wall, final_url, navigation_target_host,
login_path_signals, BlockReason::LoginWall) to locate the logic and add the
extra comparison or parameter and short-circuit accordingly.
```rust
if let Some(agent_id) = self.agent_id.as_ref() {
    let agent_scope = crate::secrets::store::SecretScope::agent(agent_id);
    if let Ok(secret) = store.get(&agent_scope, name) {
        return Ok(secret);
    }
}
store
    .get(&crate::secrets::store::SecretScope::shared(), name)
    .map_err(|error| {
        BrowserError::new(format!("failed to resolve secret '{name}': {error}"))
    })
```
Only fall back to the shared scope on NotFound.
Right now any agent-scope read error falls through to the shared secret. If the agent-scoped secret exists but is unreadable or decrypting fails, browser_type will silently type the shared credential instead of surfacing the isolation problem. That breaks the new shadowing contract and can use the wrong tenant’s secret.
Suggested fix
```diff
 if let Some(agent_id) = self.agent_id.as_ref() {
     let agent_scope = crate::secrets::store::SecretScope::agent(agent_id);
-    if let Ok(secret) = store.get(&agent_scope, name) {
-        return Ok(secret);
+    match store.get(&agent_scope, name) {
+        Ok(secret) => return Ok(secret),
+        Err(crate::error::SecretsError::NotFound { .. }) => {}
+        Err(error) => {
+            return Err(BrowserError::new(format!(
+                "failed to resolve agent-scoped secret '{name}': {error}"
+            )));
+        }
     }
 }
```
🤖 Prompt for AI Agents
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/tools/browser.rs` around lines 851 - 860, When resolving secrets in
Browser::get (the block using self.agent_id, SecretScope::agent and store.get),
only fall back to SecretScope::shared() if the agent-scoped get returns a
NotFound error; if store.get(&SecretScope::agent(...), name) returns any other
error (e.g. decrypt/unreadable), propagate that error wrapped in
BrowserError::new instead of trying the shared scope. Concretely: replace the
current unconditional fall-through logic with a match on the agent-scoped
store.get result — Ok(secret) => Ok(secret), Err(err) => if err indicates
NotFound then attempt store.get(&SecretScope::shared(), name) and map errors
into BrowserError::new(format!(...)), else return
Err(BrowserError::new(format!(...err...))). Ensure you reference store.get,
SecretScope::agent, SecretScope::shared, self.agent_id and BrowserError::new
when making the change.
Summary
Five surgical changes from `docs/design-docs/agentic-backend-readiness.md` that take spacebot from "single-user instance" to "ready to be the agentic backend of a production multi-tenant application." Each phase is one commit; phases are independent except 3→4 (4 needs the `WorkerOutcome` enum 3 invents).
The deployment shape this enables: thousands of mostly-idle agents on shared infrastructure, each with isolated credentials, woken on external events instead of polling, with predictable per-task latency and clean failure surfacing for things humans need to escalate.
Phase 1 — per-agent secret isolation
`SecretScope { InstanceShared | Agent(id) }` orthogonal to the existing `SecretCategory { System | Tool }`. Worker subprocess env now only carries `InstanceShared(Tool) ∪ Agent(this_agent, Tool)` — a tenant's worker can no longer `printenv` another tenant's `GH_TOKEN`. System secrets (LLM keys, messaging tokens) stay `InstanceShared` since their consumers are singletons. redb keys are NUL-delimited; pre-scope keys auto-migrate to `InstanceShared` on first open. Browser secret resolver tries agent scope, then shared.
Phase 2 — dormant cortex mode
`CortexMode { Active, Dormant }`. Dormant agents skip all four cortex loops (warmup / tick / association / ready-task). External events deliver wakes via an mpsc-fed `WakeManager` — `send_agent_message` post-insert, cron fire, and `POST /api/agents/{id}/wake` (admin / debugging) are the trigger sites. Wake is a one-shot ready-task pickup; idempotent on active-mode agents (the spawn loop and wake race harmlessly). New `MemoryJanitorConfig` (opt-in) replaces the periodic maintenance the dormant cortex no longer runs.
Phase 3 — structured `WorkerOutcome` + wall-clock worker timeout
`Worker::run` returns `Result<WorkerOutcome>` instead of `Result<String>` — variants `Success`, `Partial`, `Cancelled`, `Timeout`, `Failed`, `Blocked` (added in phase 4). Wall-clock budget from new `CortexConfig.worker_wall_clock_timeout_secs` (default 1800s, distinct from the existing supervisor idle-kill `worker_timeout_secs`). `tokio::time::timeout` wraps `run_inner`; on expiry the inner future is dropped and we return `Timeout` with progress read from a shared atomic so the caller knows how far the worker got.
Phase 4 — browser captcha / login-wall / WAF detection
DOM-only detection (`tools/browser_detection.rs`) for recaptcha, hcaptcha, Cloudflare Turnstile, Cloudflare interstitial (two-marker heuristic to avoid false-positives on every CF-hosted site), and login-wall final-URL paths with optional same-host gate. The browser tool writes a positive detection into a worker-shared `BlockSignal` slot; the worker reads it at each loop iteration and short-circuits to `WorkerOutcome::Blocked` with structured `BlockReason` + `BlockEvidence`. Header-based heuristics deferred until chromiumoxide exposes response headers cleanly.
Phase 5 — per-agent cron defaults
`CortexConfig.cron_default_timeout_secs: Option<u64>`. Resolution: per-job override → per-agent default → 1500s system default. Lifts the previously hard-coded constant out of the scheduler.
What's new at the surface
What's deferred (called out as v1)
- Per-job `worker_wall_clock_timeout_secs` override — system + per-agent default sufficient for v1.
- Header-based block heuristics (`Retry-After`, WAF banners) — chromiumoxide doesn't expose response headers cleanly. DOM heuristics cover the high-value path.
- Surfacing `BlockEvidence` (HTML snippet, screenshot) into `ProcessEvent` — text result + worker logs sufficient for v1.

Test plan
- `cargo test --lib` — 874/874 pass (includes 3 new secret-scope tests, 1 new browser-detection test suite of 10, 4 new worker-outcome classification tests)
- Set `cortex.mode = "dormant"` for one agent, confirm no LLM bulletin generation in logs at idle
- `send_agent_message` to a dormant receiver; verify task is picked up immediately (not on next tick)
- `POST /api/agents/{id}/wake` against a dormant agent with a `ready` task
- `browser_navigate` to a known captcha-fronted URL; verify worker exits with `Blocked` instead of looping
- Set `Agent(alice, GH_TOKEN)` and `Agent(bob, GH_TOKEN)` with different values; spawn a worker for each; verify env contains the right one
- Pre-scope `secrets.redb` opens cleanly, all keys readable as `InstanceShared`