agentic-backend-readiness: per-agent secret isolation, dormant cortex, structured worker outcomes#583
Conversation
Worker::run now returns Result<WorkerOutcome> with variants Success, Partial, Cancelled, Timeout, Failed — replaces the prior Result<String> so callers can distinguish a clean exit from a max-segments partial, a wall-clock timeout, a cancellation, or a hard failure without peeking at the result text. Adds CortexConfig.worker_wall_clock_timeout_secs (default 1800s), distinct from the existing supervisor idle-kill bound. Worker::run wraps run_inner in tokio::time::timeout; on expiry the inner future is dropped and we return Timeout, reading segment progress from a shared atomic so the caller knows how far the worker got. spawn_worker_task is now generic over Result<WorkerOutcome>; the OpenCode worker boundary wraps its text result into Success at the spawn site. Cortex pickup_one_ready_task path consolidates into a single Result<WorkerOutcome, WorkerCompletionError> dispatch and splits on success vs failure once. Lays the groundwork for phase 4 to add a Blocked variant for captcha / login-wall detection in the browser tool.
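A minimal sketch of the shape this describes — the `WorkerOutcome` variants and the `tokio::time::timeout` wrap are from the description above, while `run_inner`'s signature and the progress-atomic plumbing are assumptions:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicU32, Ordering};
use std::time::Duration;

/// Structured terminal state replacing the old `Result<String>`.
pub enum WorkerOutcome {
    Success(String),
    Partial { segments_completed: u32 },
    Cancelled,
    Timeout { segments_completed: u32 },
    Failed(String),
}

pub struct Worker;

impl Worker {
    // Stub standing in for the real segment loop, which would bump
    // `progress` after each completed segment.
    async fn run_inner(&mut self, progress: Arc<AtomicU32>) -> anyhow::Result<WorkerOutcome> {
        progress.fetch_add(1, Ordering::Relaxed);
        Ok(WorkerOutcome::Success("done".into()))
    }

    pub async fn run(&mut self, wall_clock_timeout_secs: u64) -> anyhow::Result<WorkerOutcome> {
        // The shared atomic survives the inner future being dropped on
        // expiry, so the caller still learns how far the worker got.
        let progress = Arc::new(AtomicU32::new(0));
        let budget = Duration::from_secs(wall_clock_timeout_secs);
        match tokio::time::timeout(budget, self.run_inner(Arc::clone(&progress))).await {
            Ok(outcome) => outcome,
            Err(_elapsed) => Ok(WorkerOutcome::Timeout {
                segments_completed: progress.load(Ordering::Relaxed),
            }),
        }
    }
}
```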
Adds WorkerOutcome::Blocked with structured BlockReason (Captcha, LoginWall, RateLimit, FraudDetect, Unknown) and BlockEvidence (final URL + truncated HTML snippet) so a worker that hits an external defense exits cleanly and surfaces enough for parent-app escalation without retry-looping on a dead end. Detection lives in a new pure module (tools/browser_detection.rs) operating on DOM markers — recaptcha / hcaptcha / Cloudflare Turnstile iframe srcs, two-marker Cloudflare interstitial heuristic, login-path final URL with optional same-host gate. Conservative on purpose: a single Cloudflare marker doesn't fire (would false-positive on every CF-hosted site); login-wall on a different host is treated as an intentional federation, not a block. Header-based heuristics deferred until chromiumoxide exposes response headers cleanly. The browser tools (navigate, click) run detection after their action and write to a worker-shared BlockSignal slot; the worker reads the slot at each loop iteration and short-circuits before the next LLM turn. Mirrors the SpacebotHook.outcome_signaled pattern so the worker loop stays the single arbiter of terminal state. Channel-level browser tool servers (cortex / interactive) pass None for the signal — only worker-driven sessions short-circuit.
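A sketch of the signal-slot handoff this describes; `BlockReason`, `BlockEvidence`, and the worker-shared slot are from the description, while the concrete slot type and helper names are assumptions:

```rust
use std::sync::{Arc, Mutex};

pub enum BlockReason { Captcha, LoginWall, RateLimit, FraudDetect, Unknown }

pub struct BlockEvidence {
    pub final_url: String,
    pub html_snippet: String, // truncated
}

/// Worker-shared slot: browser tools write, the worker loop reads.
pub type BlockSignal = Arc<Mutex<Option<(BlockReason, BlockEvidence)>>>;

// Tool side (navigate / click): after the action, run detection and
// record a positive result. Channel-level tool servers pass `None`,
// so only worker-driven sessions short-circuit.
fn record_detection(signal: &Option<BlockSignal>, hit: Option<(BlockReason, BlockEvidence)>) {
    if let (Some(slot), Some(hit)) = (signal, hit) {
        *slot.lock().unwrap() = Some(hit);
    }
}

// Worker side: checked at the top of each loop iteration, before the
// next LLM turn, so the loop stays the single arbiter of terminal state.
fn check_blocked(signal: &BlockSignal) -> Option<(BlockReason, BlockEvidence)> {
    signal.lock().unwrap().take()
}
```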
Introduces SecretScope { InstanceShared | Agent(id) } orthogonal to the
existing SecretCategory { System | Tool }. Worker subprocess env now
only carries InstanceShared(Tool) ∪ Agent(this_agent, Tool) — a tenant's
worker can no longer printenv another tenant's GH_TOKEN. System secrets
(LLM keys, messaging tokens) stay InstanceShared since their consumers
are singletons.
Storage layout: redb keys are now NUL-delimited "s\\x00<NAME>" or
"a\\x00<AGENT_ID>\\x00<NAME>". Pre-scope keys are detected and rewritten
to InstanceShared on first store open — idempotent, non-destructive.
Existing instances keep working without admin intervention; promotion
of tool secrets to per-agent is an explicit op.
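A sketch of the key encoding and legacy rewrite described above (the NUL-delimited prefixes are from this description; the function names are assumptions):

```rust
pub enum SecretScope {
    InstanceShared,
    Agent { id: String },
}

/// "s\x00<NAME>" for shared rows, "a\x00<AGENT_ID>\x00<NAME>" for
/// agent rows. NUL can't appear in names, so decoding is unambiguous.
fn encode_key(scope: &SecretScope, name: &str) -> String {
    match scope {
        SecretScope::InstanceShared => format!("s\x00{name}"),
        SecretScope::Agent { id } => format!("a\x00{id}\x00{name}"),
    }
}

/// Pre-scope keys have no NUL delimiter; on first store open they are
/// rewritten (idempotently, non-destructively) to the shared encoding.
fn is_legacy_key(raw: &str) -> bool {
    !raw.contains('\x00')
}

fn migrate_legacy_key(raw: &str) -> String {
    encode_key(&SecretScope::InstanceShared, raw)
}
```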
Sandbox now owns the agent_id and reads scoped secrets in tool_secrets().
Worker prompt rendering, the scrubber, the SecretSetTool, and the
browser_type secret-resolution path all flow agent_id through. Channel
and cortex tool servers pass None for the optional agent_id — those
contexts fall back to instance-shared only. The browser secret resolver
tries the agent scope first, then shared, so worker-scoped GH_TOKEN can
shadow a shared one without breaking the channel-level fallback.
API surface keeps the existing PUT/DELETE/INFO endpoints as admin-level
InstanceShared writes; list now returns scope per row. Per-agent secret
endpoints + dashboard work follow.
Tests: 3 new — per-agent isolation, shadow-shared precedence, and legacy
key migration. 874/874 lib tests pass.
Adds CortexConfig.cron_default_timeout_secs (Option<u64>, None = system default 1500s). Resolution chain in cron/scheduler is per-job override → per-agent default → DEFAULT_CRON_TIMEOUT_SECS. Lifts the previous hard-coded 1500s out of the scheduler so research-heavy agent classes can raise it without setting timeout_secs on every job, and short-task classes can lower it without a per-job hint.
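The resolution chain reads as a simple fallback; a sketch follows (the constant and config field are named in the description, the helper function is an assumption):

```rust
const DEFAULT_CRON_TIMEOUT_SECS: u64 = 1500;

/// Per-job override → per-agent default → system default.
fn resolve_cron_timeout(
    job_timeout_secs: Option<u64>,
    cron_default_timeout_secs: Option<u64>,
) -> u64 {
    job_timeout_secs
        .or(cron_default_timeout_secs)
        .unwrap_or(DEFAULT_CRON_TIMEOUT_SECS)
}
```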
Adds CortexMode { Active, Dormant } and lets agents skip every cortex
loop (warmup, tick, association, ready-task pickup) when dormant. Cuts
periodic LLM-backed bulletin generation to zero on idle agents — the
deployment shape this whole branch is aimed at.
Wake substrate (src/agent/wake.rs): a single mpsc-fed WakeManager task
runs at startup with a shared HashMap<AgentId, AgentDeps> registry.
Tools and cron fire wakes by sending the receiving agent's id over
AgentDeps.wake_tx; the manager looks up the receiver and runs
cortex::wake_one (one ready-task pickup pass). Active-mode agents are
also valid wake targets — wake_one races with the spawn_ready_task_loop
harmlessly because task_store.claim_next_ready is transactional.
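A sketch of that substrate — `AgentDeps`, the single mpsc feed, and the one-pass `wake_one` semantics follow the description; the struct contents and signatures are assumptions (the real registry would be shared, not owned by the task):

```rust
use std::collections::HashMap;
use tokio::sync::mpsc;

type AgentId = String;

struct AgentDeps {
    // per-agent handles needed by cortex::wake_one
}

/// Single manager task fed by one mpsc channel; tools and cron fire
/// wakes by sending the receiving agent's id.
async fn run_wake_manager(
    mut wake_rx: mpsc::Receiver<AgentId>,
    registry: HashMap<AgentId, AgentDeps>,
) {
    while let Some(agent_id) = wake_rx.recv().await {
        match registry.get(&agent_id) {
            // One ready-task pickup pass; safe to race with the
            // active-mode spawn loop because claiming is transactional.
            Some(deps) => wake_one(deps).await,
            None => tracing::warn!(%agent_id, "wake for unregistered agent dropped"),
        }
    }
}

async fn wake_one(_deps: &AgentDeps) {
    // cortex::wake_one: claim_next_ready + spawn a worker if a task exists.
}
```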
Trigger sites:
- send_agent_message post-task_store.create
- cron fire path
- POST /api/agents/{id}/wake (admin / debugging)
Memory janitor (src/agent/maintenance.rs): MemoryJanitorConfig.enabled
(default false) toggles a single instance-wide task that walks every
registered agent on interval_secs (default daily) and runs the existing
memory::maintenance machinery. Required when running dormant agents at
scale; opt-in. Idempotent on active-mode agents.
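A sketch of the config surface this adds — the field names and defaults follow the description; the serde derive shape is an assumption:

```rust
/// Opt-in, instance-wide maintenance pass for dormant deployments.
#[derive(Clone, Debug, serde::Deserialize)]
pub struct MemoryJanitorConfig {
    /// Default false: nothing runs unless explicitly enabled.
    #[serde(default)]
    pub enabled: bool,
    /// Default daily.
    #[serde(default = "default_janitor_interval_secs")]
    pub interval_secs: u64,
}

fn default_janitor_interval_secs() -> u64 {
    86_400
}
```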
Deferred from doc plan: bulletin caching with TTL (current minimum
viable wake just kicks pickup; the worker prompt rendering still pulls
the latest bulletin from memory state), messaging/webhook routing-loop
wake hooks (channels still spawn naturally — cortex loops are not on
that path).
Walkthrough
This PR implements five production-readiness features for multi-tenant agentic backend deployment: per-agent secret scoping with legacy migration, dormant cortex mode with wake-on-trigger infrastructure, wall-clock worker timeouts with structured outcomes, browser block/captcha detection, and per-agent cron timeout defaults. A design document outlines the phased rollout and cross-file dependencies.
Changes
Agentic Backend Readiness (Integrated Feature Delivery)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 5 passed
⚠️ Warning: there were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ast-grep (0.42.1): src/api/server.rs
Actionable comments posted: 15
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/config/load.rs (1)
2639-2663: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Whitelist the new `[memory_janitor]` top-level section. Adding `memory_janitor` here without updating `KNOWN_TOP_LEVEL_KEYS` means every valid `[memory_janitor]` config will still emit the generic "unknown top-level key" warning from `warn_unknown_config_keys`, which makes the new setting look ignored.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/config/load.rs` around lines 2639 - 2663, The new MemoryJanitorConfig (constructed as memory_janitor in load.rs) was added to Config but not added to the whitelist of known top-level keys, so warn_unknown_config_keys still flags a valid [memory_janitor] section; fix by adding "memory_janitor" to the KNOWN_TOP_LEVEL_KEYS collection (or equivalent constant/vec used by warn_unknown_config_keys) so that the new top-level key is recognized, and ensure any unit tests that verify known keys are updated to include MemoryJanitorConfig if present.
src/agent/worker.rs (1)
841-985: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Follow-up runs never consume `blocked_signal`. The initial task loop exits with `WorkerOutcome::Blocked`, but the interactive follow-up loop goes straight from tool/browser failures to a generic `Failed`. If a resumed or interactive worker hits a captcha/login wall during follow-up, phase 4 regresses to an opaque failure instead of the structured blocked outcome.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/agent/worker.rs` around lines 841 - 985, The follow-up loop currently turns tool/browser/captcha/login stalls into generic failures instead of honoring blocked_signal and returning WorkerOutcome::Blocked; update the error handling in the follow-up retry loop (the match that produces follow_up_result and the Err(error) arms) to check/consume the same blocked_signal used in the main task loop and, when present, set the worker state and outcome to the blocked path (mirroring WorkerOutcome::Blocked behavior) instead of treating it as a normal failure; specifically, when a tool-nudge/captcha/login condition is detected (e.g., via SpacebotHook::is_tool_nudge_reason or the same signal used elsewhere), call the same blocked_signal consumption logic and produce the blocked outcome (and avoid writing a generic failed state) so follow-up runs resume to the structured blocked handling rather than WorkerState::Failed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/design-docs/agentic-backend-readiness.md`:
- Line 11: Typo in the use-case bullet: change the word "pollluted" to
"polluted" in the sentence "Owns its entity's history forever (not pollluted by
other entities' context)"; update that exact bullet text so it reads "Owns its
entity's history forever (not polluted by other entities' context)".
- Around line 125-137: The fenced code block that begins with "```" and contains
the flow sketch starting "trigger fires → cortex::wake(agent_id, trigger) ..."
is missing a language tag; update that opening fence to include the language
"text" (i.e., change ``` to ```text) so markdownlint MD040 is satisfied and
syntax highlighting is preserved for the flow sketch.
In `@src/agent/cortex.rs`:
- Around line 3974-3982: The code treats task_store.update(...) which returns
Result<Option<_>, _> as a simple success/failure; explicitly handle the Ok(None)
case in both persistence branches where task_store.update(...) is awaited (the
lines around the current call to task_store.update(task.task_number,
UpdateTaskInput { status: Some(TaskStatus::Done), .. })) so that Ok(Some(_))
proceeds to emit TaskUpdated, WorkerComplete and delegation notifications but
Ok(None) is treated as “no row updated” and does not emit those events; update
the matching logic in the same blocks that already check Err(_) (and mirror the
timeout-branch matching that handles Ok(Some), Ok(None), Err) to return or log a
no-update path instead of treating Ok(None) as success.
- Around line 3947-3973: The current code flattens all non-success worker
completions into a generic retry path by using map_worker_completion(normalized)
and then requeuing to TaskStatus::Ready; instead, preserve and inspect the
actual WorkerOutcome (do not discard the Ok(outcome) variant in normalized) and
handle WorkerOutcome::Timeout and WorkerOutcome::Blocked specially (e.g., set
distinct TaskStatus variants like TimedOut/Blocked or route to separate
backoff/retry logic) rather than immediately marking them Ready. Concretely:
stop converting Ok(Ok(outcome)) into only a result_text/success boolean
flow—keep the outcome value (from normalized or the original worker_result),
call map_worker_completion only for text/notify but then match on the preserved
WorkerOutcome to choose the correct TaskStatus (handling Timeout and Blocked
separately), while keeping the existing failure handling for
WorkerCompletionError and panics.
In `@src/agent/maintenance.rs`:
- Around line 46-55: The per-agent maintenance loop currently awaits
run_maintenance_for_agent(&deps).await serially with no timeout, so a hung call
can stall the whole janitor; wrap each per-agent call in a Tokio timeout
(tokio::time::timeout) or spawn it as a cancellable task and await with timeout
so a slow/hung agent is aborted and the loop continues; specifically, replace
the direct await of run_maintenance_for_agent(&deps).await in the agents loop
(and the similar section covering lines 60–76) with a timeout-wrapped call that
logs a distinct timeout warning (including agent_id) and continues to the next
agent when the timeout elapses, while still logging other errors from
run_maintenance_for_agent.
- Around line 60-77: run_maintenance_for_agent currently discards the
maintenance report from memory::maintenance::run_maintenance so prunes/merges
never mark knowledge synthesis dirty; change it to capture the maintenance
report result, inspect it for pruning/merge indicators (e.g.,
pruned_count/merged_count or a modified flag returned by run_maintenance), and
when non-zero call into the cortex runtime (via deps.runtime_config.cortex) to
bump or mark knowledge_synthesis_version/knowledge_synthesis_dirty using the
same mechanism the other maintenance path uses so dormant agents see the update.
In `@src/api/secrets.rs`:
- Around line 102-116: list_secrets() currently returns both InstanceShared and
Agent-scoped secrets while put_secret(), delete_secret(), and secret_info() are
hard-coded to SecretScope::shared(), causing inconsistent visibility and
inability to act on agent-scoped rows; either (A) restrict list_secrets() to
only include shared secrets (filter metadata by SecretScope::shared()) so the
API remains shared-only, or (B) thread an explicit scope parameter through the
CRUD handlers (modify the HTTP handlers and signatures for put_secret,
delete_secret, secret_info to accept a scope and use it when calling store.put,
store.delete, store.info) and update SecretListItem/response handling to include
the scope so clients can target agent-scoped secrets. Ensure all usages of
list_secrets, put_secret, delete_secret, and secret_info are adjusted
consistently (including places referenced around the other occurrences) so
listing and mutation use the same scope semantics.
In `@src/config/load.rs`:
- Around line 199-204: The worker_wall_clock_timeout_secs value is currently
accepted as-is so a configured 0 would be allowed; compute the resolved value
(using
overrides.worker_wall_clock_timeout_secs.unwrap_or(defaults.worker_wall_clock_timeout_secs))
and validate it the same way maintenance_interval_secs is validated—ensure the
resolved worker_wall_clock_timeout_secs is >= 1 and return an Err (with a clear
message) from the config loader if it is < 1 before returning Ok(config); update
the resolution logic near the
cron_default_timeout_secs/worker_wall_clock_timeout_secs lines and reference
worker_wall_clock_timeout_secs, overrides, defaults, and the surrounding config
resolution function when making the change.
In `@src/cron/scheduler.rs`:
- Around line 1405-1409: The wake is currently fired too early (using
context.deps.wake_tx and crate::agent::wake::fire_wake) before the cron channel
produces any follow-up tasks, risking the one-shot dormant wake being consumed
before work exists; move the wake invocation out of the current pre-channel
location and either (a) invoke
crate::agent::wake::fire_wake(context.deps.wake_tx, &context.deps.agent_id)
immediately after the cron channel processing completes, or (b) call it from the
specific side-effect creation path that enqueues follow-up tasks so the dormant
wake is delivered only when new work/associations are actually created.
In `@src/main.rs`:
- Around line 2776-2784: The wake_registry created for spawn_wake_manager is
only populated in initialize_agents and becomes stale when agents are
added/removed later; move its ownership to live agent map scope (the same place
the runtime agent HashMap is stored) or add explicit register/unregister hooks
on the wake-manager that are invoked from the main-loop add/remove paths so new
agents call wake_manager.register(agent_id, AgentDeps) and removed agents call
wake_manager.unregister(agent_id); update uses of wake_registry,
spawn_wake_manager, api_state.wake_tx and the add/remove handling in the main
loop to call these hooks so the registry stays in sync with runtime agent
changes.
- Around line 3752-3764: The startup warmup pass must be gated by the same
dormant-mode check so dormant agents don't run warmup; update the code that
schedules run_warmup_once to first read each agent's cortex mode
(agent.deps.runtime_config.cortex.load().mode) and skip calling or scheduling
run_warmup_once when cortex_mode.is_dormant() is true (the same check used in
the loop that skips the recurring cortex loops), ensuring agents iteration
(agents, agent_id, agent) never triggers run_warmup_once for dormant agents.
In `@src/secrets/store.rs`:
- Around line 748-791: The tool_env_vars() two-pass merge currently leaves a
shared secret in place when an agent-scoped secret exists but read_value(scope,
name) fails; change the second-pass behavior so that for each agent-scoped entry
(match SecretScope::Agent { id } if id == agent_id.as_ref()) you either insert
the agent value on success or explicitly remove any existing entry from result
when read_value returns None, ensuring unreadable agent overrides do not fall
back to shared secrets; apply the same change to the other similar merge block
that follows the same pattern (the other tool-secret-merge routine using
list_metadata and read_value).
- Around line 592-624: The delete() method currently performs mutations without
acquiring the shared mutation_guard, allowing race conditions with
enable_encryption(), rotate_key(), and set(); fix it by acquiring and holding
the mutation_guard for the duration of the delete operation (before begin_write
and until after commit) so the delete's write_txn and its removals from
SECRETS_TABLE and METADATA_TABLE cannot interleave with encryption/rotation/set
operations; ensure you use the same mutation_guard used by enable_encryption(),
rotate_key(), and set() so the lock is taken at the start of delete() and
released only after write_txn.commit() completes.
In `@src/tools/browser_detection.rs`:
- Around line 102-130: detect_login_wall currently flags any final_url path
matching login_path_signals as BlockReason::LoginWall when the final host equals
navigation_target_host, which blocks intentional sign-in navigations; update
detect_login_wall to also consider the requested/navigation target path (or
accept an explicit allow_expected_login boolean) so it only classifies as
LoginWall when the navigation target path differs from the final path (i.e., an
unexpected redirect to a login page) or when callers do not mark the login as
expected; use the existing symbols (detect_login_wall, final_url,
navigation_target_host, login_path_signals, BlockReason::LoginWall) to locate
the logic and add the extra comparison or parameter and short-circuit
accordingly.
In `@src/tools/browser.rs`:
- Around line 851-860: When resolving secrets in Browser::get (the block using
self.agent_id, SecretScope::agent and store.get), only fall back to
SecretScope::shared() if the agent-scoped get returns a NotFound error; if
store.get(&SecretScope::agent(...), name) returns any other error (e.g.
decrypt/unreadable), propagate that error wrapped in BrowserError::new instead
of trying the shared scope. Concretely: replace the current unconditional
fall-through logic with a match on the agent-scoped store.get result —
Ok(secret) => Ok(secret), Err(err) => if err indicates NotFound then attempt
store.get(&SecretScope::shared(), name) and map errors into
BrowserError::new(format!(...)), else return
Err(BrowserError::new(format!(...err...))). Ensure you reference store.get,
SecretScope::agent, SecretScope::shared, self.agent_id and BrowserError::new
when making the change.
---
Outside diff comments:
In `@src/agent/worker.rs`:
- Around line 841-985: The follow-up loop currently turns
tool/browser/captcha/login stalls into generic failures instead of honoring
blocked_signal and returning WorkerOutcome::Blocked; update the error handling
in the follow-up retry loop (the match that produces follow_up_result and the
Err(error) arms) to check/consume the same blocked_signal used in the main task
loop and, when present, set the worker state and outcome to the blocked path
(mirroring WorkerOutcome::Blocked behavior) instead of treating it as a normal
failure; specifically, when a tool-nudge/captcha/login condition is detected
(e.g., via SpacebotHook::is_tool_nudge_reason or the same signal used
elsewhere), call the same blocked_signal consumption logic and produce the
blocked outcome (and avoid writing a generic failed state) so follow-up runs
resume to the structured blocked handling rather than WorkerState::Failed.
In `@src/config/load.rs`:
- Around line 2639-2663: The new MemoryJanitorConfig (constructed as
memory_janitor in load.rs) was added to Config but not added to the whitelist of
known top-level keys, so warn_unknown_config_keys still flags a valid
[memory_janitor] section; fix by adding "memory_janitor" to the
KNOWN_TOP_LEVEL_KEYS collection (or equivalent constant/vec used by
warn_unknown_config_keys) so that the new top-level key is recognized, and
ensure any unit tests that verify known keys are updated to include
MemoryJanitorConfig if present.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2c73ff7f-40cf-48a8-aeb3-d2dadb34c70f
📒 Files selected for processing (31)
- docs/design-docs/agentic-backend-readiness.md
- src/agent.rs
- src/agent/branch.rs
- src/agent/channel.rs
- src/agent/channel_dispatch.rs
- src/agent/cortex.rs
- src/agent/invariant_harness.rs
- src/agent/maintenance.rs
- src/agent/wake.rs
- src/agent/worker.rs
- src/api/agents.rs
- src/api/providers.rs
- src/api/secrets.rs
- src/api/server.rs
- src/api/state.rs
- src/config/load.rs
- src/config/toml_schema.rs
- src/config/types.rs
- src/cron/scheduler.rs
- src/lib.rs
- src/main.rs
- src/opencode/worker.rs
- src/sandbox.rs
- src/secrets/scrub.rs
- src/secrets/store.rs
- src/tools.rs
- src/tools/browser.rs
- src/tools/browser_detection.rs
- src/tools/secret_set.rs
- src/tools/send_agent_message.rs
- src/tools/spawn_worker.rs
👮 Files not reviewed due to content moderation or server errors (4)
- src/api/agents.rs
- src/config/types.rs
- src/agent/wake.rs
- src/agent/channel_dispatch.rs
> The parent application has many entities (customers, accounts, listings, projects — domain depends). Each entity benefits from a dedicated, persistent agent that:
>
> - Owns its entity's history forever (not pollluted by other entities' context)
Fix the typo in the use-case bullets.
pollluted should be polluted.
🧰 Tools
🪛 LanguageTool
[grammar] ~11-~11: Ensure spelling is correct
Context: ... Owns its entity's history forever (not pollluted by other entities' context) - Holds iso...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/design-docs/agentic-backend-readiness.md` at line 11, Typo in the
use-case bullet: change the word "pollluted" to "polluted" in the sentence "Owns
its entity's history forever (not pollluted by other entities' context)"; update
that exact bullet text so it reads "Owns its entity's history forever (not
polluted by other entities' context)".
```
trigger fires
→ cortex::wake(agent_id, trigger)
→ check cached bulletin: if within grace period, reuse
→ else: synthesize bulletin from memory store (no LLM tool calls — pure memory query + RRF + single LLM synthesis call)
→ cache bulletin with TTL = wake_grace_period_secs
→ hand off to trigger-appropriate process:
  ─ task triggers → spawn worker via existing ready-task path
  ─ messaging triggers → spawn channel
  ─ webhook triggers → dispatch into the existing handler
  ─ cron triggers → spawn ephemeral channel (existing cron pattern)
→ grace timer starts; on expiry without new triggers, return to fully dormant
```
Add a language tag to this fenced block.
The unlabeled fence trips markdownlint (MD040) and loses syntax highlighting. text would fit this flow sketch.
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 125-125: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/design-docs/agentic-backend-readiness.md` around lines 125 - 137, The
fenced code block that begins with "```" and contains the flow sketch starting
"trigger fires → cortex::wake(agent_id, trigger) ..." is missing a language tag;
update that opening fence to include the language "text" (i.e., change ``` to
```text) so markdownlint MD040 is satisfied and syntax highlighting is preserved
for the flow sketch.
```rust
// Convert the worker's structured outcome (or panic / error)
// into a single Result<WorkerOutcome, WorkerCompletionError>
// so the success / failure split below is driven by one
// notion of "did it work".
let normalized: std::result::Result<WorkerOutcome, WorkerCompletionError> =
    match worker_result {
        Ok(Ok(outcome)) => Ok(outcome),
        Ok(Err(error)) => {
            Err(WorkerCompletionError::failed(scrub(error.to_string())))
        }
        Err(panic_payload) => {
            let panic_message =
                crate::agent::panic_payload_to_string(&*panic_payload);
            tracing::error!(
                %worker_id,
                %panic_message,
                "cortex worker task panicked"
            );
            Err(WorkerCompletionError::failed(format!(
                "worker task panicked: {}",
                scrub(panic_message)
            )))
        }
    };
let (result_text, _notify, success) = map_worker_completion(normalized);
let result_text = scrub(result_text);
if success {
```
Don't flatten Timeout and Blocked into an immediate ready retry.
This branch now throws away the WorkerOutcome kind after map_worker_completion() and requeues every non-success outcome to TaskStatus::Ready. For detached tasks, that means WorkerOutcome::Timeout and WorkerOutcome::Blocked will hot-loop on the next pickup pass instead of taking a distinct path, so a task can repeatedly hit the same wall-clock timeout or CAPTCHA/login wall forever. Keep the structured outcome around here and route at least Timeout / Blocked separately from generic failures.
Also applies to: 4049-4059
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/agent/cortex.rs` around lines 3947 - 3973, The current code flattens all
non-success worker completions into a generic retry path by using
map_worker_completion(normalized) and then requeuing to TaskStatus::Ready;
instead, preserve and inspect the actual WorkerOutcome (do not discard the
Ok(outcome) variant in normalized) and handle WorkerOutcome::Timeout and
WorkerOutcome::Blocked specially (e.g., set distinct TaskStatus variants like
TimedOut/Blocked or route to separate backoff/retry logic) rather than
immediately marking them Ready. Concretely: stop converting Ok(Ok(outcome)) into
only a result_text/success boolean flow—keep the outcome value (from normalized
or the original worker_result), call map_worker_completion only for text/notify
but then match on the preserved WorkerOutcome to choose the correct TaskStatus
(handling Timeout and Blocked separately), while keeping the existing failure
handling for WorkerCompletionError and panics.
```rust
let db_updated = task_store
    .update(
        task.task_number,
        UpdateTaskInput {
            status: Some(TaskStatus::Done),
            ..Default::default()
        },
    )
    .await;
```
Handle TaskStore::update(...)=Ok(None) explicitly in both persistence branches.
TaskStore::update is clearly a Result<Option<_>> here—the timeout path below already matches Ok(Some(_)), Ok(None), and Err(_). In these new success/failure branches you only test Err(_), so Ok(None) is treated as success: you'll emit TaskUpdated, WorkerComplete, and delegation notifications even though no task row was actually updated.
Suggested shape
```diff
-let db_updated = task_store.update(...).await;
-if let Err(ref error) = db_updated {
-    ...
-} else {
-    ...
-}
+match task_store.update(...).await {
+    Ok(Some(_)) => {
+        ...
+    }
+    Ok(None) => {
+        tracing::warn!(task_number = task.task_number, "task update returned no row");
+        ...
+    }
+    Err(error) => {
+        ...
+    }
+}
```
Also applies to: 3984-4047, 4051-4060, 4062-4112
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/agent/cortex.rs` around lines 3974 - 3982, The code treats
task_store.update(...) which returns Result<Option<_>, _> as a simple
success/failure; explicitly handle the Ok(None) case in both persistence
branches where task_store.update(...) is awaited (the lines around the current
call to task_store.update(task.task_number, UpdateTaskInput { status:
Some(TaskStatus::Done), .. })) so that Ok(Some(_)) proceeds to emit TaskUpdated,
WorkerComplete and delegation notifications but Ok(None) is treated as “no row
updated” and does not emit those events; update the matching logic in the same
blocks that already check Err(_) (and mirror the timeout-branch matching that
handles Ok(Some), Ok(None), Err) to return or log a no-update path instead of
treating Ok(None) as success.
```rust
for deps in agents {
    let agent_id = deps.agent_id.clone();
    if let Err(error) = run_maintenance_for_agent(&deps).await {
        tracing::warn!(
            %agent_id,
            %error,
            "janitor maintenance pass failed for agent"
        );
    }
}
```
Bound each agent pass so one hung maintenance job can't stall the whole janitor.
This loop awaits maintenance serially for every agent, with no timeout or isolation. If one store or embedding call hangs, the janitor never reaches the remaining agents, which effectively disables dormant-mode maintenance instance-wide. The existing cortex path had timeout/cancellation protection; this replacement needs at least a per-agent timeout.
Also applies to: 60-76
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/agent/maintenance.rs` around lines 46 - 55, The per-agent maintenance
loop currently awaits run_maintenance_for_agent(&deps).await serially with no
timeout, so a hung call can stall the whole janitor; wrap each per-agent call in
a Tokio timeout (tokio::time::timeout) or spawn it as a cancellable task and
await with timeout so a slow/hung agent is aborted and the loop continues;
specifically, replace the direct await of run_maintenance_for_agent(&deps).await
in the agents loop (and the similar section covering lines 60–76) with a
timeout-wrapped call that logs a distinct timeout warning (including agent_id)
and continues to the next agent when the timeout elapses, while still logging
other errors from run_maintenance_for_agent.
```rust
// Start cortex warmup, runtime, and association loops for each agent.
// Skip every loop when the agent is in Dormant mode — those agents wake
// only on external triggers (cross-agent message, cron fire, admin API).
for (agent_id, agent) in agents.iter() {
    let cortex_mode = agent.deps.runtime_config.cortex.load().mode;
    if cortex_mode.is_dormant() {
        tracing::info!(
            agent_id = %agent_id,
            "cortex loops skipped: agent is in dormant mode"
        );
        continue;
    }
```
Dormant mode still performs startup warmup.
This block skips the recurring cortex loops, but Line 3143 still schedules run_warmup_once for every agent. Dormant agents will therefore do warmup work at boot anyway, which breaks the dormant-mode contract and can still touch model/embedding startup paths. Gate the startup warmup pass with the same mode check.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/main.rs` around lines 3752 - 3764, The startup warmup pass must be gated
by the same dormant-mode check so dormant agents don't run warmup; update the
code that schedules run_warmup_once to first read each agent's cortex mode
(agent.deps.runtime_config.cortex.load().mode) and skip calling or scheduling
run_warmup_once when cortex_mode.is_dormant() is true (the same check used in
the loop that skips the recurring cortex loops), ensuring agents iteration
(agents, agent_id, agent) never triggers run_warmup_once for dormant agents.
```rust
pub fn delete(&self, scope: &SecretScope, name: &str) -> Result<(), SecretsError> {
    let write_txn = self.db.begin_write().map_err(|error| {
        SecretsError::Other(anyhow::anyhow!(
            "failed to begin write transaction: {error}"
        ))
    })?;
    let key = encode_key(scope, name);
    {
        let mut secrets = write_txn.open_table(SECRETS_TABLE).map_err(|error| {
            SecretsError::Other(anyhow::anyhow!("failed to open secrets table: {error}"))
        })?;
        secrets.remove(key.as_str()).map_err(|error| {
            SecretsError::Other(anyhow::anyhow!(
                "failed to remove key '{name}' ({scope}): {error}"
            ))
        })?;

        let mut meta = write_txn.open_table(METADATA_TABLE).map_err(|error| {
            SecretsError::Other(anyhow::anyhow!("failed to open metadata table: {error}"))
        })?;
        meta.remove(key.as_str()).map_err(|error| {
            SecretsError::Other(anyhow::anyhow!(
                "failed to remove metadata for '{name}' ({scope}): {error}"
            ))
        })?;
    }
    write_txn.commit().map_err(|error| {
        SecretsError::Other(anyhow::anyhow!(
            "failed to commit delete for '{name}' ({scope}): {error}"
        ))
    })?;

    Ok(())
```
Guard delete() with mutation_guard too.
delete is now the one mutating path that can still interleave with enable_encryption(), rotate_key(), or set(). If it lands after one of those methods snapshots the table but before it writes back, the deleted secret can be resurrected or have fresh metadata written back over the delete.
Suggested fix
```diff
 pub fn delete(&self, scope: &SecretScope, name: &str) -> Result<(), SecretsError> {
+    let _guard = self.mutation_guard.lock().expect("mutation guard poisoned");
     let write_txn = self.db.begin_write().map_err(|error| {
         SecretsError::Other(anyhow::anyhow!(
             "failed to begin write transaction: {error}"
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
pub fn delete(&self, scope: &SecretScope, name: &str) -> Result<(), SecretsError> {
    let _guard = self.mutation_guard.lock().expect("mutation guard poisoned");
    let write_txn = self.db.begin_write().map_err(|error| {
        SecretsError::Other(anyhow::anyhow!(
            "failed to begin write transaction: {error}"
        ))
    })?;
    let key = encode_key(scope, name);
    {
        let mut secrets = write_txn.open_table(SECRETS_TABLE).map_err(|error| {
            SecretsError::Other(anyhow::anyhow!("failed to open secrets table: {error}"))
        })?;
        secrets.remove(key.as_str()).map_err(|error| {
            SecretsError::Other(anyhow::anyhow!(
                "failed to remove key '{name}' ({scope}): {error}"
            ))
        })?;
        let mut meta = write_txn.open_table(METADATA_TABLE).map_err(|error| {
            SecretsError::Other(anyhow::anyhow!("failed to open metadata table: {error}"))
        })?;
        meta.remove(key.as_str()).map_err(|error| {
            SecretsError::Other(anyhow::anyhow!(
                "failed to remove metadata for '{name}' ({scope}): {error}"
            ))
        })?;
    }
    write_txn.commit().map_err(|error| {
        SecretsError::Other(anyhow::anyhow!(
            "failed to commit delete for '{name}' ({scope}): {error}"
        ))
    })?;
    Ok(())
}
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/secrets/store.rs` around lines 592 - 624, The delete() method currently
performs mutations without acquiring the shared mutation_guard, allowing race
conditions with enable_encryption(), rotate_key(), and set(); fix it by
acquiring and holding the mutation_guard for the duration of the delete
operation (before begin_write and until after commit) so the delete's write_txn
and its removals from SECRETS_TABLE and METADATA_TABLE cannot interleave with
encryption/rotation/set operations; ensure you use the same mutation_guard used
by enable_encryption(), rotate_key(), and set() so the lock is taken at the
start of delete() and released only after write_txn.commit() completes.
```rust
/// Get all tool secrets visible to `agent_id` as name→value pairs for
/// `Sandbox::wrap()` injection. Visibility is the union of every
/// `InstanceShared(Tool)` row and every `Agent(agent_id, Tool)` row;
/// agent-scoped values shadow shared values when names collide.
///
/// Returns an empty map when locked (tool secrets unavailable).
pub fn tool_env_vars(&self, agent_id: &AgentId) -> HashMap<String, String> {
    if self.state() == StoreState::Locked {
        return HashMap::new();
    }

    let metadata = match self.list_metadata(None) {
        Ok(m) => m,
        Err(error) => {
            tracing::warn!(%error, "failed to list secret metadata for tool env vars");
            return HashMap::new();
        }
    };

    // Two-pass insert so agent-scoped values shadow same-named shared values.
    let mut result = HashMap::new();
    for ((scope, name), _meta) in metadata
        .iter()
        .filter(|(_, m)| m.category == SecretCategory::Tool)
    {
        if !scope.is_shared() {
            continue;
        }
        if let Some(value) = self.read_value(scope, name) {
            result.insert(name.clone(), value);
        }
    }
    for ((scope, name), _meta) in metadata
        .iter()
        .filter(|(_, m)| m.category == SecretCategory::Tool)
    {
        if !matches!(scope, SecretScope::Agent { id } if id == agent_id.as_ref()) {
            continue;
        }
        if let Some(value) = self.read_value(scope, name) {
            result.insert(name.clone(), value);
        }
    }

    result
}
```
Unreadable agent overrides currently fall back to the shared credential.
tool_env_vars() documents that agent-scoped values shadow shared ones, but the current two-pass merge only overwrites on a successful agent read. If an agent-scoped secret exists and read_value() fails, the shared value stays in the map and gets injected into that worker instead of surfacing the isolation problem.
Also applies to: 828-840
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/secrets/store.rs` around lines 748 - 791, The tool_env_vars() two-pass
merge currently leaves a shared secret in place when an agent-scoped secret
exists but read_value(scope, name) fails; change the second-pass behavior so
that for each agent-scoped entry (match SecretScope::Agent { id } if id ==
agent_id.as_ref()) you either insert the agent value on success or explicitly
remove any existing entry from result when read_value returns None, ensuring
unreadable agent overrides do not fall back to shared secrets; apply the same
change to the other similar merge block that follows the same pattern (the other
tool-secret-merge routine using list_metadata and read_value).
```rust
pub fn detect_login_wall(
    final_url: Option<&str>,
    navigation_target_host: Option<&str>,
) -> Option<BlockReason> {
    let url = final_url?;
    let parsed = url::Url::parse(url).ok()?;
    let path = parsed.path().to_lowercase();
    let login_path_signals = [
        "/login",
        "/signin",
        "/auth/",
        "/account/login",
        "/users/sign_in",
    ];
    let path_is_login = login_path_signals
        .iter()
        .any(|signal| path.contains(signal));
    if !path_is_login {
        return None;
    }
    if let Some(target_host) = navigation_target_host {
        let final_host = parsed.host_str().unwrap_or("");
        if final_host.eq_ignore_ascii_case(target_host) {
            return Some(BlockReason::LoginWall);
        }
    } else {
        return Some(BlockReason::LoginWall);
    }
    None
}
```
This will classify intentional sign-in flows as LoginWall.
The heuristic fires on any login-looking path once the final host matches the target host, and the caller only passes host context. A worker that deliberately navigates to /login or clicks a “Sign in” button will be short-circuited as blocked, which breaks normal auth flows.
Please compare against the requested URL/path, not just the host, or give callers a way to mark login as expected.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/tools/browser_detection.rs` around lines 102 - 130, detect_login_wall
currently flags any final_url path matching login_path_signals as
BlockReason::LoginWall when the final host equals navigation_target_host, which
blocks intentional sign-in navigations; update detect_login_wall to also
consider the requested/navigation target path (or accept an explicit
allow_expected_login boolean) so it only classifies as LoginWall when the
navigation target path differs from the final path (i.e., an unexpected redirect
to a login page) or when callers do not mark the login as expected; use the
existing symbols (detect_login_wall, final_url, navigation_target_host,
login_path_signals, BlockReason::LoginWall) to locate the logic and add the
extra comparison or parameter and short-circuit accordingly.
```rust
if let Some(agent_id) = self.agent_id.as_ref() {
    let agent_scope = crate::secrets::store::SecretScope::agent(agent_id);
    if let Ok(secret) = store.get(&agent_scope, name) {
        return Ok(secret);
    }
}
store
    .get(&crate::secrets::store::SecretScope::shared(), name)
    .map_err(|error| {
        BrowserError::new(format!("failed to resolve secret '{name}': {error}"))
    })
```
Only fall back to the shared scope on NotFound.
Right now any agent-scope read error falls through to the shared secret. If the agent-scoped secret exists but is unreadable or decrypting fails, browser_type will silently type the shared credential instead of surfacing the isolation problem. That breaks the new shadowing contract and can use the wrong tenant’s secret.
Suggested fix
```diff
 if let Some(agent_id) = self.agent_id.as_ref() {
     let agent_scope = crate::secrets::store::SecretScope::agent(agent_id);
-    if let Ok(secret) = store.get(&agent_scope, name) {
-        return Ok(secret);
+    match store.get(&agent_scope, name) {
+        Ok(secret) => return Ok(secret),
+        Err(crate::error::SecretsError::NotFound { .. }) => {}
+        Err(error) => {
+            return Err(BrowserError::new(format!(
+                "failed to resolve agent-scoped secret '{name}': {error}"
+            )));
+        }
     }
 }
```
🤖 Prompt for AI Agents
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/tools/browser.rs` around lines 851 - 860, When resolving secrets in
Browser::get (the block using self.agent_id, SecretScope::agent and store.get),
only fall back to SecretScope::shared() if the agent-scoped get returns a
NotFound error; if store.get(&SecretScope::agent(...), name) returns any other
error (e.g. decrypt/unreadable), propagate that error wrapped in
BrowserError::new instead of trying the shared scope. Concretely: replace the
current unconditional fall-through logic with a match on the agent-scoped
store.get result — Ok(secret) => Ok(secret), Err(err) => if err indicates
NotFound then attempt store.get(&SecretScope::shared(), name) and map errors
into BrowserError::new(format!(...)), else return
Err(BrowserError::new(format!(...err...))). Ensure you reference store.get,
SecretScope::agent, SecretScope::shared, self.agent_id and BrowserError::new
when making the change.
Summary
Five surgical changes from `docs/design-docs/agentic-backend-readiness.md` that take spacebot from "single-user instance" to "ready to be the agentic backend of a production multi-tenant application." Each phase is one commit; phases are independent except 3→4 (4 needs the `WorkerOutcome` enum 3 invents).
The deployment shape this enables: thousands of mostly-idle agents on shared infrastructure, each with isolated credentials, woken on external events instead of polling, with predictable per-task latency and clean failure surfacing for things humans need to escalate.
Phase 1 — per-agent secret isolation
`SecretScope { InstanceShared | Agent(id) }` orthogonal to the existing `SecretCategory { System | Tool }`. Worker subprocess env now only carries `InstanceShared(Tool) ∪ Agent(this_agent, Tool)` — a tenant's worker can no longer `printenv` another tenant's `GH_TOKEN`. System secrets (LLM keys, messaging tokens) stay `InstanceShared` since their consumers are singletons. redb keys are NUL-delimited; pre-scope keys auto-migrate to `InstanceShared` on first open. Browser secret resolver tries agent scope, then shared.
Phase 2 — dormant cortex mode
`CortexMode { Active, Dormant }`. Dormant agents skip all four cortex loops (warmup / tick / association / ready-task). External events deliver wakes via an mpsc-fed `WakeManager` — `send_agent_message` post-insert, cron fire, and `POST /api/agents/{id}/wake` (admin / debugging) are the trigger sites. Wake is a one-shot ready-task pickup; idempotent on active-mode agents (the spawn loop and wake race harmlessly). New `MemoryJanitorConfig` (opt-in) replaces the periodic maintenance the dormant cortex no longer runs.
Phase 3 — structured `WorkerOutcome` + wall-clock worker timeout
`Worker::run` returns `Result<WorkerOutcome>` instead of `Result<String>` — variants `Success`, `Partial`, `Cancelled`, `Timeout`, `Failed`, `Blocked` (added in phase 4). Wall-clock budget from new `CortexConfig.worker_wall_clock_timeout_secs` (default 1800s, distinct from the existing supervisor idle-kill `worker_timeout_secs`). `tokio::time::timeout` wraps `run_inner`; on expiry the inner future is dropped and we return `Timeout` with progress read from a shared atomic so the caller knows how far the worker got.
Phase 4 — browser captcha / login-wall / WAF detection
DOM-only detection (`tools/browser_detection.rs`) for recaptcha, hcaptcha, Cloudflare Turnstile, Cloudflare interstitial (two-marker heuristic to avoid false-positives on every CF-hosted site), and login-wall final-URL paths with optional same-host gate. The browser tool writes a positive detection into a worker-shared `BlockSignal` slot; the worker reads it at each loop iteration and short-circuits to `WorkerOutcome::Blocked` with structured `BlockReason` + `BlockEvidence`. Header-based heuristics deferred until chromiumoxide exposes response headers cleanly.
Phase 5 — per-agent cron defaults
`CortexConfig.cron_default_timeout_secs: Option<u64>`. Resolution: per-job override → per-agent default → 1500s system default. Lifts the previously hard-coded constant out of the scheduler.
What's new at the surface
What's deferred (called out as v1)
- Per-job `worker_wall_clock_timeout_secs` override — system + per-agent default sufficient for v1.
- Header-based block heuristics (`Retry-After`, WAF banners) — chromiumoxide doesn't expose response headers cleanly. DOM heuristics cover the high-value path.
- Surfacing `BlockEvidence` (HTML snippet, screenshot) into `ProcessEvent` — text result + worker logs sufficient for v1.

Test plan
- `cargo test --lib` — 874/874 pass (includes 3 new secret-scope tests, 1 new browser-detection test suite of 10, 4 new worker-outcome classification tests)
- Set `cortex.mode = "dormant"` for one agent, confirm no LLM bulletin generation in logs at idle
- `send_agent_message` to a dormant receiver; verify task is picked up immediately (not on next tick)
- `POST /api/agents/{id}/wake` against a dormant agent with a `ready` task
- `browser_navigate` to a known captcha-fronted URL; verify worker exits with `Blocked` instead of looping
- Set `Agent(alice, GH_TOKEN)` and `Agent(bob, GH_TOKEN)` with different values; spawn a worker for each; verify env contains the right one
- Pre-scope `secrets.redb` opens cleanly, all keys readable as `InstanceShared`