v0.8.61 Implementation Designs — Batch 1 (runtime control plane)

Code-ready designs from the ultracode design wave (12 read-only agents), each building on the landed foundation modules. Ordered per the runtime-control-plane train priority. These guide the code waves; they are analysis, not yet implemented.

#3216 — effort: medium

Current state: The parent turn loop intentionally blocks on running children, and that block is the reported freeze. Concretely:

  • should_hold_turn_for_subagents(queued_completions, running_children) returns true whenever running_children > 0 (crates/tui/src/core/engine/turn_loop.rs:2494-2496). That is the policy gate the issue names.
  • When the model finishes a turn with no tool calls and there are no queued completions, the loop at turn_loop.rs:1090-1160 enters a loop { ... tokio::select! } that re-emits "Waiting on {running} sub-agent(s) to complete..." and will not let the turn end until a completion arrives, a steer arrives (which re-enters via continue 'turn_loop), or cancel fires. So a slow/timing-out child holds the whole parent turn "waiting for model response" exactly as reported.
  • Freeze-track nuance: this hold loop does NOT hold a lock across an .await: the self.subagent_manager.read().await guard at 1092-1095 is scoped and dropped before the select!. The TUI render/input task is a separate task fed by tx_event/rx_steer channels (the turn loop drains rx_steer opportunistically at turn_loop.rs:522-535 and 1123-1143; the UI consumes EngineEvents in crates/tui/src/tui/ui.rs:2314+). So "the entire TUI froze" is driven by (a) the turn never ending and (b) heavy in-process child construction at spawn time (each child clones the runtime/tool registry, per docs/AGENT_RUNTIME.md:74-99), not by a mutex on the UI thread. The diagnosis track should confirm this with instrumentation rather than assume a UI-thread lock.

What already exists to build on:

  • agent_eval already takes block (default false, default true only with continue/resume) at crates/tui/src/tools/subagent/mod.rs:3571-3604 — the explicit, deliberate join primitive the issue asks for in step 3 already exists; the tool default is already nonblocking. The blocking is purely the turn-loop hold policy.
  • The completion sentinel path works: children push SubAgentCompletion over tx_subagent_completion (engine.rs:530-534, 1830), drained opportunistically and surfaced as an internal user runtime_event message (turn_loop.rs:1086-1175, 1283-1302; builder at 2468-2492).
  • The sidebar already projects running workers from AgentSpawned/AgentProgress events and renders a running/done count (events.rs:205-214; ui.rs:2314+, 9808), so "background work visible without turning the turn into waiting-for-model" is mostly a matter of not holding the turn.
  • Fanout is already bounded: SubAgentManager::max_agents + a launch_gate semaphore (interactive_max_launch, default 4) queue excess direct children instead of running them all at once (mod.rs:1561-1607, 2045-2051, 2150).

The gap vs. AGENT_RUNTIME.md cutover: the engine holds only an in-process subagent_manager: SharedSubAgentManager (engine.rs:518, 747) and has NO fleet handle (no fleet/Fleet reference in engine.rs). agent_open (AgentSpawnTool) spawns an in-process SubAgentRuntime (mod.rs:3055-3057, 2165-2170) and builds the legacy AgentWorkerSpec/AgentWorkerToolProfile (mod.rs:2115-2140), NOT a WorkerRuntimeProfile. The fleet path (durable ledger, retry, receipts) launches out-of-process codewhale exec subprocesses (crates/tui/src/fleet/executor.rs:40-79; manager.rs:201-337) and is reachable only via runtime_api.rs and main.rs, never from a live turn. worker_profile.rs (derive_child intersection) is still #![allow(dead_code)] (worker_profile.rs:18): defined but unwired.

Design: Land in two layers. Layer A (release-blocker, small/surgical) fixes nonblocking + responsiveness and is fully testable now. Layer B (medium) does the worker_profile + fleet-enqueue wiring the AGENT_RUNTIME cutover wants, behind a feature flag, without a risky full cross-process cutover.

== Layer A — nonblocking parent turn + responsiveness (the actual fix for #3216) ==

A1. Change the hold policy. In turn_loop.rs:2494-2496, redefine: fn should_hold_turn_for_subagents(queued_completions: usize, running_children: usize) -> bool { queued_completions > 0 } i.e. drop the || running_children > 0 term. Keep the running_children: usize parameter (so call sites and the diagnostic surface still pass it) but stop holding on it. This directly satisfies issue steps 1-2: launching != joining; only queued completions (work already done that must be surfaced) resume the turn; running children become background work.
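As a minimal sketch of the A1 change, the two policies can be put side by side. The proposed function follows the signature quoted above; `legacy_should_hold_turn` is a hypothetical name used here only to show the current behavior for contrast.

```rust
/// Current policy as described above: hold while anything is queued OR running.
/// (Hypothetical name, for contrast only.)
fn legacy_should_hold_turn(queued_completions: usize, running_children: usize) -> bool {
    queued_completions > 0 || running_children > 0
}

/// Proposed A1 policy: only already-queued completions hold the turn.
fn should_hold_turn_for_subagents(queued_completions: usize, running_children: usize) -> bool {
    // Kept in the signature for call sites and diagnostics, but no longer consulted.
    let _ = running_children;
    queued_completions > 0
}
```

With this shape, `should_hold_turn_for_subagents(0, 6)` is false while `should_hold_turn_for_subagents(1, 0)` stays true, which is the property test 1 in the Layer A plan asserts.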

A2. Collapse the spin loop. At turn_loop.rs:1090-1160, the if completions.is_empty() { loop { ... } } block exists only to wait for running children. With A1 it never holds, so replace the whole loop/tokio::select! (1091-1159) with a single non-waiting check: drain any already-queued completions (the try_recv at 1087-1089 already did this), and if none, fall through to end the turn. Delete the "Waiting on N sub-agent(s)" status emission and the in-loop subagent_heartbeat_timeout cleanup branch; move stale-child cleanup to A4 so it is not on the turn-end path. Steers stay accepted via the existing pending_steers drain at 1066-1076 and the top-of-loop drain at 522-535.

A3. Update the call site at turn_loop.rs:1319-1326 (thinking-only status). With A1, should_hold_turn_for_subagents(0, running) is always false, so holding_for_subagents becomes false and the thinking-only status fires correctly when a reasoning-only turn ends with children still in the background. Add a one-line status when the turn ends with N children still running (e.g. "Turn ended; N sub-agent(s) still running in background — they will report when done") so the user knows work continues and results arrive via the completion sentinel (and agent_eval).

A4. Background reaper for stale children (replaces the in-loop cleanup at 1144-1157). Add a supervised background task (pattern: spawn_supervised, used at mod.rs:2165) owned by the engine that periodically calls subagent_manager.write().cleanup(stale_after) on config.subagent_heartbeat_timeout cadence and emits a status when it cancels stale agents. This keeps auto-cancel-stale behavior without coupling it to the parent turn ending. The completion channel already wakes the next turn when children finish, so the parent need not poll.
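The core of the reaper is a single cleanup pass that can run on its own cadence. The sketch below shows one such pass under simplifying assumptions: `Manager`, `ChildRecord`, and `ChildState` are hypothetical stand-ins for the real SubAgentManager types, and `now` is passed in explicitly so the pass is deterministic to test. In the engine this would presumably run inside the supervised task rather than being called directly.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ChildState {
    Running,
    Completed,
    Cancelled,
}

struct ChildRecord {
    state: ChildState,
    last_activity: Instant,
}

/// Hypothetical stand-in for the sub-agent manager.
#[derive(Default)]
struct Manager {
    children: HashMap<String, ChildRecord>,
}

impl Manager {
    /// Cancel running children that have been idle longer than `stale_after`.
    /// Returns the sorted ids that were cancelled so the caller can emit a status.
    fn cleanup(&mut self, now: Instant, stale_after: Duration) -> Vec<String> {
        let mut cancelled = Vec::new();
        for (id, child) in self.children.iter_mut() {
            let idle = now.saturating_duration_since(child.last_activity);
            if child.state == ChildState::Running && idle > stale_after {
                child.state = ChildState::Cancelled;
                cancelled.push(id.clone());
            }
        }
        cancelled.sort();
        cancelled
    }
}
```

Because the pass takes no part in the turn loop, a stale child is cancelled whether or not the parent turn is active, which is what test 5 checks.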

A5. Responsiveness instrumentation + watchdog (diagnosis track steps 2/6, release-blocking). Add crates/tui/src/runtime_liveness.rs: a cheap Arc<RuntimeLiveness> of AtomicU64 epoch-millis stamps — last_input_event, last_render_tick, last_engine_event, last_turn_progress — with touch_* setters and snapshot(). Stamp last_input_event where the UI reads keys, last_render_tick in the UI draw path, last_engine_event where the UI consumes EngineEvent (ui.rs:2314 region), last_turn_progress at top of the turn-loop iteration. Add a /debug agents command (or extend an existing debug command under crates/tui/src/commands) that dumps active workers (subagent_manager.read().list()/running_count), pending completion-channel depth, and the liveness snapshot deltas. This is the watchdog dump the issue requires and lets the stress test assert input is acknowledged within a bounded interval.
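A rough sketch of the liveness stamps follows. The field names come from the description above; the `touch_*` and `snapshot` signatures, the `LivenessAges` type, and passing `now_ms` explicitly (instead of reading the clock inside) are assumptions made so the example is deterministic. `fetch_max` is used so a late-arriving older stamp cannot move a field backwards.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Epoch-millis stamps, one per liveness signal. Zero means "never touched".
#[derive(Default)]
struct RuntimeLiveness {
    last_input_event: AtomicU64,
    last_render_tick: AtomicU64,
    last_engine_event: AtomicU64,
    last_turn_progress: AtomicU64,
}

/// Hypothetical snapshot type: age of each stamp relative to `now_ms`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct LivenessAges {
    input_ms: u64,
    render_ms: u64,
    engine_ms: u64,
    turn_ms: u64,
}

impl RuntimeLiveness {
    fn shared() -> Arc<Self> {
        Arc::new(Self::default())
    }

    fn touch_input(&self, now_ms: u64) {
        self.last_input_event.fetch_max(now_ms, Ordering::Relaxed);
    }
    fn touch_render(&self, now_ms: u64) {
        self.last_render_tick.fetch_max(now_ms, Ordering::Relaxed);
    }
    fn touch_engine_event(&self, now_ms: u64) {
        self.last_engine_event.fetch_max(now_ms, Ordering::Relaxed);
    }
    fn touch_turn_progress(&self, now_ms: u64) {
        self.last_turn_progress.fetch_max(now_ms, Ordering::Relaxed);
    }

    /// Deltas a debug dump or stress test can assert on.
    fn snapshot(&self, now_ms: u64) -> LivenessAges {
        let age = |stamp: &AtomicU64| now_ms.saturating_sub(stamp.load(Ordering::Relaxed));
        LivenessAges {
            input_ms: age(&self.last_input_event),
            render_ms: age(&self.last_render_tick),
            engine_ms: age(&self.last_engine_event),
            turn_ms: age(&self.last_turn_progress),
        }
    }
}
```

A stress test could then assert something like `snapshot(now).input_ms < 250` after each injected input, matching the bounded-interval check in test 6.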

A6. Gate /swarm (issue step 7). Find the /swarm registration (grep swarm under crates/tui/src/commands) and keep it behind the existing feature gate / mark experimental until #3216 + #3215 land. If /swarm is already unlisted, add a code comment + a test asserting it is not promoted.

== Layer B — worker_profile + fleet-backed enqueue (AGENT_RUNTIME cutover, medium, flagged) ==

B1. Wire WorkerRuntimeProfile into the spawn path (uses the landed foundation). In spawn_background_with_assignment_options (mod.rs:2033-2140), compute a child profile: parent profile from the live session posture, requested profile from WorkerRuntimeProfile::for_role(agent_type) (worker_profile.rs:132) plus agent_open inputs (model→ModelRoute, allowed_tools→ToolScope::Explicit, max_depth→max_spawn_depth), then parent.derive_child(&requested) (worker_profile.rs:180). Map the result onto the existing AgentWorkerSpec: permissions.write/network, shell (ShellPolicy)→legacy allow_shell/exec-policy, tools→AgentWorkerToolProfile, max_spawn_depth. This makes per-child caps real and enforces non-escalation (issue step 6 + worker_profile follow-up #3217). Remove the #![allow(dead_code)] once consumed.
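The non-escalation rule in B1 is an intersection of parent and requested profiles. The sketch below illustrates that intersection with simplified stand-in types: the field layout, the ceiling value of 5, and the depth rule (minimum of parent, request, and ceiling) are assumptions for illustration, not the contents of worker_profile.rs. Tool-scope narrowing is left out for brevity.

```rust
/// Ordered from least to most capable so `min` picks the stricter policy.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum ShellPolicy {
    None,
    ReadOnly,
    Full,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct PermissionSet {
    write: bool,
    network: bool,
}

/// Hypothetical value; the real ceiling lives in codewhale_config.
const MAX_SPAWN_DEPTH_CEILING: u8 = 5;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct WorkerRuntimeProfile {
    permissions: PermissionSet,
    shell: ShellPolicy,
    max_spawn_depth: u8,
}

impl WorkerRuntimeProfile {
    /// A child gets at most what both the parent holds and the request asks for.
    fn derive_child(&self, requested: &WorkerRuntimeProfile) -> WorkerRuntimeProfile {
        WorkerRuntimeProfile {
            permissions: PermissionSet {
                write: self.permissions.write && requested.permissions.write,
                network: self.permissions.network && requested.permissions.network,
            },
            shell: self.shell.min(requested.shell),
            max_spawn_depth: requested
                .max_spawn_depth
                .min(self.max_spawn_depth)
                .min(MAX_SPAWN_DEPTH_CEILING),
        }
    }
}
```

Under these assumptions, a read-only parent asked for a write-capable child still yields `write == false` and a shell no stronger than the parent's, which is the case test 8 exercises at the spawn boundary.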

B2. Add a fleet-enqueue launch mode for high fanout. Give the engine an optional fleet: Option<Arc<FleetManager>> built from the workspace (FleetManager::open, manager.rs:146) .with_sub_agent_manager(self.subagent_manager.clone()) (manager.rs:170) and .with_exec_config(...). In AgentSpawnTool: when a new config flag [subagents].durable_fanout=true (default false) is set OR running_count would exceed an in-process threshold, route the spawn through FleetManager::create_run with a single-task FleetTaskSpecDocument derived from the assignment + profile (manager.rs:201-257), returning the fleet worker id as the agent_id in the same SubAgentResult projection so the model still "sees a sub-agent." The fleet path already gives durable ledger state, retry, receipts, inspect/restart (manager.rs:388-573) — exactly the cutover rule in AGENT_RUNTIME.md:54-72. agent_eval/agent_close map onto inspect_worker/interrupt_worker for fleet-backed ids.
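The routing decision in B2 reduces to a small predicate. The following sketch assumes a hypothetical `LaunchRoute` enum and reads "would exceed" as "the count after this spawn is above the threshold"; the threshold itself is not specified above and is a parameter here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LaunchRoute {
    /// Existing in-process SubAgentRuntime path.
    InProcess,
    /// Durable fleet run (FleetManager::create_run in the design).
    Fleet,
}

/// Pick the launch route for one new child.
/// `durable_fanout` mirrors the proposed [subagents].durable_fanout flag.
fn choose_launch_route(
    durable_fanout: bool,
    running_count: usize,
    in_process_threshold: usize,
) -> LaunchRoute {
    let would_exceed = running_count.saturating_add(1) > in_process_threshold;
    if durable_fanout || would_exceed {
        LaunchRoute::Fleet
    } else {
        LaunchRoute::InProcess
    }
}
```

With the flag at its default of false, behavior only changes once the in-process count would cross the threshold, which keeps the default path unchanged for ordinary fanout.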

B3. Keep one observation surface. Bridge fleet ledger events to the existing AgentSpawned/AgentProgress events (events.rs:205-214) so the sidebar renders fleet workers identically to in-process children (no second UI). Satisfies AGENT_RUNTIME.md:117-143.

Sequencing: ship Layer A first (release-blocker, independently testable). Land B1 (pure intersection, low risk) before B2/B3 (fleet enqueue, behind default-off flag) so the cutover is dogfoodable without changing default behavior.

Builds on:

  • crates/tui/src/worker_profile.rs (WorkerRuntimeProfile + derive_child intersection): consumed by Layer B1 for per-child caps and non-escalation; the module doc names this as its follow-up (#3217).

  • crates/tui/src/goal_loop.rs decide_continuation, already wired via goal_continuation_message_if_needed (turn_loop.rs:1304-1317) — Layer A leaves it intact; once the hold is removed it keeps the parent productively working while children run in background (depends-on #3215).
  • crates/tui/src/fleet/* (manager.rs create_run/schedule_run/inspect/interrupt/restart; executor.rs codewhale-exec launch; ledger) — the durable substrate Layer B2/B3 enqueue onto. with_sub_agent_manager (manager.rs:170) is the existing hook.
  • crates/tui/src/resource_telemetry.rs + context_budget.rs (PressureLevel/budget math) — feed the "six workers only when budget/model availability permit" bound in issue step 6 / Layer B2 threshold.
  • Existing agent_eval block flag (mod.rs:3571-3604) — already the deliberate wait/finalize op from issue step 3; no new tool needed, just documented as the join primitive.
  • core/engine/turn_loop.rs freeze-cancel arm (interrupted_tool_result / between-batch cancel observation at 2512-2526) — already landed (#3216/#2211 partial); Layer A is the companion fix that stops the turn entering the wait in the first place.

Files: crates/tui/src/core/engine/turn_loop.rs, crates/tui/src/core/engine.rs, crates/tui/src/tools/subagent/mod.rs, crates/tui/src/worker_profile.rs, crates/tui/src/fleet/manager.rs, crates/tui/src/runtime_liveness.rs, crates/tui/src/tui/ui.rs, crates/tui/src/core/engine/tests.rs

Test plan: Layer A (must pass to call #3216 fixed):

  1. Unit, turn_loop.rs tests mod (~2782): replace turn_holds_open_for_running_or_completed_subagents with running_children_alone_do_not_hold_turn: assert !should_hold_turn_for_subagents(0, 1), !should_hold_turn_for_subagents(0, 6), still should_hold_turn_for_subagents(1, 0) (queued completion resumes). Proves issue step 1.
  2. Unit: should_emit_thinking_only_status with holding_for_subagents=false still fires when appropriate (guards A3).
  3. Engine turn-loop integration (core/engine/tests.rs, existing harness): drive a turn where manager.running_count()=2, zero queued completions; inject a steer on rx_steer; assert the turn ends (no spin on "Waiting on N") and the steer is accepted into the next turn within one iteration. The issue's "two workers running, no completions, steer accepted" test.
  4. Engine integration: push a SubAgentCompletion onto the channel mid-turn; assert the next turn surfaces the internal runtime_event message (sentinel path still works — guards A2).
  5. Reaper test (mod.rs tests): with a child whose last_activity exceeds heartbeat timeout, assert the background reaper cleanup cancels it WITHOUT the parent turn being in a wait loop.
  6. Stress/dogfood harness (new test in core/engine/tests.rs or a --stress-agents debug path): six fake workers — 3 fast success, 1 slow success, 1 provider-timeout-then-retry, 1 hard failure — while injecting synthetic steer/cancel; assert (a) parent turn never blocks on the slow/timeout worker, (b) each injected input updates RuntimeLiveness last_input_event within a bounded interval (e.g. <250ms), (c) success sentinels surface, (d) cancel interrupts promptly. The six-worker regression-acceptance test and the proof the UI stays live.
  7. Liveness unit (runtime_liveness.rs): touch_* updates stamps monotonically; snapshot deltas correct; /debug agents dump includes running workers + channel depth + liveness deltas.

Layer B:

  8. Profile-wiring unit (mod.rs): spawning a child of a read-only Explore parent with a write-requesting Implementer request yields an AgentWorkerSpec with write=false and clamped shell (mirrors worker_profile.rs::child_cannot_escalate_beyond_a_readonly_parent at the spawn boundary).
  9. Fanout-bound unit: with the cap below 6, the 5th/6th spawn queues via launch_gate rather than erroring/running.
  10. Fleet-enqueue test (gated): with durable_fanout=true, an agent_open spawn creates a fleet run (FleetManager::create_run) whose worker id round-trips as the SubAgentResult agent_id, and agent_eval(block=false) maps to inspect_worker without blocking the turn.

Gates: cargo fmt; cargo clippy -p codewhale-tui (remove worker_profile dead_code allow once B1 consumes it); cargo test -p codewhale-tui.

Risks:

  • Behavior change in default flow: removing the running-children hold means a parent turn can now END while children run, then RESUME on a later completion sentinel. If a consumer assumed "turn end == all children done," it breaks. Mitigation: completion sentinel + sidebar already model in-flight work; add the A3 status line; agent_eval(block=true) remains for explicit join. Audit callers that treat the turn outcome as a join barrier.

  • Deleting the spin loop (A2) also removes the in-loop stale-child cleanup; if A4's background reaper is not landed in the same change, stale children leak. Land A4 with A2.
  • Steer delivery: the deleted select arm had a steer branch that did continue 'turn_loop. Steers are still drained at 522-535 and 1066-1076, but verify no turn-shape relied solely on that arm (test 3 covers this).
  • Layer B2 fleet enqueue is genuinely cross-process (spawns codewhale exec): provider auth, cwd, workspace-trust must resolve in the child exactly as interactive. executor.rs already refuses secret-bearing env and resolves creds in-process, but model/provider parity (DeepSeek/GLM base-url env) must be verified. Keep B2 behind a default-off flag; do not flip default until dogfooded. This is why B is separated from the release-blocking A.
  • The "whole TUI froze" claim: if A5 instrumentation reveals a real UI-thread block (e.g. a synchronous fs/process wait on the render path in ui.rs or shell_dispatcher), that is a separate fix beyond the hold policy. The design includes the instrumentation to find it but cannot assume its absence; the stress test (6) is what proves liveness. Treat any A5 finding of a UI-path block as additional required work before closing the issue.
  • worker_profile removing #![allow(dead_code)]: if B1 does not consume every public item, dead_code may flag leftovers; consume or scope the allow narrowly.

#3096 — effort: large

Current state: The headless-worker substrate is largely LANDED but UNWIRED for the default in-process sub-agent path; this issue is the wiring + projection pass.

EXISTS:

  1. Durable fleet runtime (out-of-process). crates/tui/src/fleet/{executor,scheduler,ledger,manager,host,worker_runtime,task_spec}.rs. A fleet worker IS a codewhale exec --output-format stream-json subprocess (executor.rs:40 build_worker_exec_command; executor.rs:86 map_exec_stream_line). Scheduler owns backpressure: FleetSchedulerPolicy{max_workers_per_run/host/task_class,lease_seconds,heartbeat_timeout} (scheduler.rs:16-35), tick_run recover->launch->refresh (scheduler.rs:68). Canonical event vocabulary FleetWorkerEventPayload (crates/protocol/src/fleet.rs:610) already has Queued/Leased/Starting/Running/ModelWait/RunningTool/Heartbeat/Artifact/Completed/Failed/Cancelled/Interrupted/Stale/Restarted/Escalated. docs/AGENT_RUNTIME.md is the authoritative target (one detached runtime; everything else launches/observes it; cutover rule lines 55-73: in-process children allowed only as a latency optimization but must expose the same terminal states/retry/receipts).
  2. Capability foundation (NEW, dead_code). crates/tui/src/worker_profile.rs: WorkerRuntimeProfile::for_role() encodes per-role posture (explore/review read-only+ReadOnly shell; plan read-only+None; verifier read-only+Full shell; implementer/general/tool full+Full) and derive_child() intersects so a child never escalates (permissions AND-ed, shell min, tools narrowed, depth clamped to MAX_SPAWN_DEPTH_CEILING). Header lines 11-16 say wiring is the #3217 follow-up. No non-test importers.
  3. In-process worker ledger (landed v0.8.59) in crates/tui/src/tools/subagent/mod.rs: AgentWorkerSpec (661), AgentWorkerStatus (636: Queued/Starting/Running/ModelWait/RunningTool/Completed/Failed/Cancelled/Interrupted), AgentWorkerEvent (750), AgentWorkerRecord (765). Manager records via register_worker (1802)/record_worker_event (1850); exposes list_worker_records()/get_worker_record() (1818-1824). Runtime API already projects them (runtime_api.rs:1484). Fleet<->subagent status bridge at worker_runtime.rs:171.

GAP:

  A. Default sub-agent path is a heavy session clone, not a profile-bounded worker. SubAgentRuntime (mod.rs:1206-1254) clones client+ToolContext+manager+mailbox+fork_context+mcp_pool; agent_open execute (mod.rs:3368) calls background_runtime() then hand-patches model/cwd/depth; never builds/enforces WorkerRuntimeProfile.
  B. Build-everything-then-filter. SubAgentToolRegistry::new (mod.rs:6504) -> with_full_agent_surface (registry.rs:960; invoked mod.rs:6517) registers the full surface for ALL roles, then narrows only at emit via a disallowed blocklist in tools_for_model (mod.rs:6561-6595). worker_profile posture is never consulted.
  C. TUI owns lifecycle via cards, not the ledger. subagent_routing.rs drives DelegateCard/FanoutCard (widgets/agent_card.rs, 861 lines) from mailbox + subagent_cache; active_fanout_counts (subagent_routing.rs:29) reads card slots; reconcile_cards_with_snapshots (subagent_routing.rs:81) exists because cards miss terminal events. No worker_lifecycle_counts() projection over list_worker_records().
  D. Concurrency is a flat in-process semaphore. launch_gate seeded at DEFAULT_INTERACTIVE_LAUNCH_LIMIT=4 (config.rs:27), acquired only for spawn_depth==1 in run_subagent_task (mod.rs:4640); not host/provider-aware; a queued child never emits AgentWorkerStatus::Queued.

Design: Build ON the landed foundations; do not re-architect. Make the in-process default path satisfy AGENT_RUNTIME.md's "one substrate" rule (same role profiles, same lifecycle states, same projection) without forcing every child out-of-process. Five ordered, independently-landable steps.

STEP 1 - Wire WorkerRuntimeProfile into spawn + enforce non-escalation. Add SubAgentRuntime.profile: WorkerRuntimeProfile (root default for_role(General) with full perms) in mod.rs. Remove worker_profile.rs #![allow(dead_code)] as consumers land. In agent_open execute (mod.rs:~3368): replace bare background_runtime()+hand-patching with requested = WorkerRuntimeProfile::for_role(agent_type) (override .tools=Explicit on allowed_tools, .model=Fixed on model, .max_spawn_depth on max_depth) then child_profile = self.runtime.profile.derive_child(&requested); carry child_runtime.profile. Map onto existing knobs: allow_shell = shell!=None; pass tools/permissions into Step 2. This activates #414/#426/#1186.

STEP 2 - Build role tool surfaces directly. Add ToolRegistryBuilder::with_role_agent_surface(profile,client,model,manager,runtime,todo,plan) in registry.rs next to with_full_agent_surface (registry.rs:960). Compose only what the posture grants: always read-only file+search+git-history+diagnostics+note+handle+todo+plan; with_file_tools/with_patch_tools only if permissions.write; with_shell_tools/with_runtime_task_shell_tools only if shell==Full (read-only shell variant if ReadOnly); with_web_tools only if permissions.network; with_subagent_tools only if profile.can_spawn_child() (this lands "recursion is opt-in, leaf-by-default" and replaces the disallowed blocklist at mod.rs:6565). Point SubAgentToolRegistry::new (mod.rs:6504) at it; delete the post-hoc blocklist in tools_for_model; keep the Explicit allowlist only for legacy Custom.
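To make the "build directly, do not filter" idea concrete, here is a small sketch of a posture-driven composition. `Profile` is a simplified stand-in for WorkerRuntimeProfile, the returned strings are hypothetical group labels loosely named after the builder methods mentioned above, and `can_spawn_child` is assumed to mean a non-zero spawn depth.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ShellPolicy {
    None,
    ReadOnly,
    Full,
}

/// Simplified stand-in for the worker profile.
#[derive(Debug, Clone, Copy)]
struct Profile {
    write: bool,
    network: bool,
    shell: ShellPolicy,
    max_spawn_depth: u8,
}

impl Profile {
    /// Assumption: a child may spawn only with remaining depth budget.
    fn can_spawn_child(&self) -> bool {
        self.max_spawn_depth > 0
    }
}

/// Compose only the tool groups the posture grants, instead of
/// registering everything and blocklisting afterwards.
fn role_agent_surface(p: &Profile) -> Vec<&'static str> {
    let mut groups = vec!["read_only_core"];
    if p.write {
        groups.push("file_tools");
        groups.push("patch_tools");
    }
    match p.shell {
        ShellPolicy::Full => groups.push("shell_tools"),
        ShellPolicy::ReadOnly => groups.push("read_only_shell_tools"),
        ShellPolicy::None => {}
    }
    if p.network {
        groups.push("web_tools");
    }
    if p.can_spawn_child() {
        groups.push("subagent_tools");
    }
    groups
}
```

The design benefit is that a narrow role never has the write or spawn tools registered at all, so there is no blocklist to keep in sync with the registry.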

STEP 3 - First-class queueing + emit ModelWait/RunningTool in-process. In run_subagent_task (mod.rs:4633) emit AgentWorkerStatus::Queued before acquiring the launch permit, then Starting->Running on acquire (Queued exists at mod.rs:637 but is never emitted in-process). Ensure the inner run_subagent loop's create_message wait emits ModelWait and tool dispatch emits RunningTool{tool_name} via record_worker_progress (mod.rs:1888), extending the v0.8.59 model_wait heartbeat to the in-process path so the ledger matches fleet stream-json.
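The intended emission order can be sketched with a tiny in-memory ledger. `WorkerLedger` and `run_child_sketch` are hypothetical names, the status enum is trimmed to the variants this step touches, and the permit wait and model calls are represented only by comments.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
enum AgentWorkerStatus {
    Queued,
    Starting,
    Running,
    ModelWait,
    RunningTool { tool_name: String },
    Completed,
}

/// Hypothetical stand-in for the manager's per-worker event record.
#[derive(Default)]
struct WorkerLedger {
    events: Vec<AgentWorkerStatus>,
}

impl WorkerLedger {
    fn record(&mut self, status: AgentWorkerStatus) {
        self.events.push(status);
    }
}

/// Emit lifecycle states in the order Step 3 asks for.
fn run_child_sketch(ledger: &mut WorkerLedger, tool_calls: &[&str]) {
    // Recorded BEFORE waiting on the launch permit, so a queued child is visible.
    ledger.record(AgentWorkerStatus::Queued);
    // (the launch_gate permit would be awaited here)
    ledger.record(AgentWorkerStatus::Starting);
    ledger.record(AgentWorkerStatus::Running);
    for tool in tool_calls {
        // Waiting on the model's response for this step.
        ledger.record(AgentWorkerStatus::ModelWait);
        // The model asked for a tool; record which one is running.
        ledger.record(AgentWorkerStatus::RunningTool {
            tool_name: tool.to_string(),
        });
    }
    ledger.record(AgentWorkerStatus::Completed);
}
```

For a single tool call this produces Queued, Starting, Running, ModelWait, RunningTool, Completed, the sequence the lifecycle_states_emitted test expects.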

STEP 4 - Count projection + TUI becomes a renderer. Add SubAgentManager::worker_lifecycle_counts() -> WorkerLifecycleCounts (fold over list_worker_records()) plus recoverable_vs_terminal() for asleep/stale/retrying. Re-point active_fanout_counts (subagent_routing.rs:29) and running_agent_count (subagent_routing.rs:16) at it. Keep DelegateCard/FanoutCard (agent_card.rs) as a COMPATIBILITY projection fed by counts (issue requires cards keep working during migration); reconcile_cards_with_snapshots (subagent_routing.rs:81) becomes one-way render-from-records, not repair. Satisfies "renders counts not a wall of tool rows" without ripping out widgets.
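The count projection is a plain fold over worker statuses. In the sketch below the `WorkerLifecycleCounts` fields and the grouping of Starting/Running/ModelWait/RunningTool under "running" are assumptions; the real projection would fold over list_worker_records() rather than a slice of statuses.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AgentWorkerStatus {
    Queued,
    Starting,
    Running,
    ModelWait,
    RunningTool,
    Completed,
    Failed,
    Cancelled,
    Interrupted,
}

/// Hypothetical shape of the projection the TUI would render.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
struct WorkerLifecycleCounts {
    queued: usize,
    running: usize,
    completed: usize,
    failed: usize,
    cancelled: usize,
    interrupted: usize,
}

/// Fold worker statuses into compact counts; no per-card state is consulted.
fn worker_lifecycle_counts(statuses: &[AgentWorkerStatus]) -> WorkerLifecycleCounts {
    statuses
        .iter()
        .fold(WorkerLifecycleCounts::default(), |mut counts, status| {
            match status {
                AgentWorkerStatus::Queued => counts.queued += 1,
                // Every in-flight state renders as "running" in the compact view.
                AgentWorkerStatus::Starting
                | AgentWorkerStatus::Running
                | AgentWorkerStatus::ModelWait
                | AgentWorkerStatus::RunningTool => counts.running += 1,
                AgentWorkerStatus::Completed => counts.completed += 1,
                AgentWorkerStatus::Failed => counts.failed += 1,
                AgentWorkerStatus::Cancelled => counts.cancelled += 1,
                AgentWorkerStatus::Interrupted => counts.interrupted += 1,
            }
            counts
        })
}
```

Since the cards would read only this struct, the ledger remains the single writer and the card cannot drift from it.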

STEP 5 - Headless test harness. Add crates/tui/src/tools/subagent/headless_harness.rs (gated #[cfg(any(test, feature="test-support"))]): HeadlessWorkerHarness builds a SubAgentManager with no event_tx/mailbox, spawns via the SAME spawn_background_with_assignment_options, exposes drain_worker_records(); uses test_support.rs mock client for slow/cancel/timeout. Home of the agent_worker test module; proves TUI independence.

OPTIONAL STEP 6 (defer to #3159/#3154) - pressure-aware concurrency: derive launch_gate permits (mod.rs:1576) from provider route + host pressure (reuse provider_readiness.rs #3083, context_budget.rs::PressureLevel #3086). Not required for core acceptance; scheduler already owns this for fleet.

Builds on: PRIMARY: crates/tui/src/worker_profile.rs (WorkerRuntimeProfile/derive_child/for_role) — Steps 1+2 are exactly its #3217 wiring; its #![allow(dead_code)] comes off here; maps ShellPolicy/ToolScope/PermissionSet onto the legacy allow_shell bool + AgentWorkerToolProfile. ALSO: crates/protocol/src/fleet.rs FleetWorkerEventPayload + docs/AGENT_RUNTIME.md (canonical vocabulary + cutover rule the design follows; in-process AgentWorkerStatus stays 1:1 via worker_runtime.rs:171). The v0.8.59 AgentWorkerSpec/Event/Record ledger + list_worker_records() (mod.rs:636-798,1818) — Step 4 is a pure fold over these, no new persistence; Runtime API already projects them (runtime_api.rs:1484) so web/app-server come free. crates/tui/src/fleet/scheduler.rs + executor.rs — the durable out-of-process half is done; this issue only makes the in-process default expose matching states. codewhale_config DEFAULT_SPAWN_DEPTH/MAX_SPAWN_DEPTH_CEILING — the single recursion axis derive_child clamps to. Optional Step 6 reuses provider_readiness.rs (#3083) + context_budget.rs PressureLevel (#3086); resource_telemetry.rs (#2666) is the natural formatter if per-worker token/cost is later added to the projection. NOT needed for core acceptance: goal_loop.rs, model_registry.rs, request_tuning.rs, provider_adapter.rs.

Files: /Volumes/VIXinSSD/codewhale/crates/tui/src/tools/subagent/mod.rs, /Volumes/VIXinSSD/codewhale/crates/tui/src/tools/registry.rs, /Volumes/VIXinSSD/codewhale/crates/tui/src/worker_profile.rs, /Volumes/VIXinSSD/codewhale/crates/tui/src/tui/subagent_routing.rs, /Volumes/VIXinSSD/codewhale/crates/tui/src/tui/widgets/agent_card.rs, /Volumes/VIXinSSD/codewhale/crates/tui/src/tools/subagent/tests.rs

Test plan: cargo test -p codewhale-tui, matching the issue's named gates.

  1. agent_worker (NEW via headless_harness; satisfies cargo test -p codewhale-tui agent_worker):
  • profile_drives_tool_surface: Explore child registry has NO write_file/edit_file/apply_patch/exec_shell/web/agent_open (Step 2 builds directly, not filters); Implementer has write+patch and agent_open only when can_spawn_child().
  • child_cannot_escalate: read-only parent -> Implementer-requesting child has !write and shell<=ReadOnly (end-to-end through agent_open; mirrors worker_profile.rs unit test).
  • lifecycle_states_emitted: mock client stalls then runs a tool; assert record.events = Queued->Starting->Running->ModelWait->RunningTool->Completed in order (Step 3).
  • headless_no_tui: full child to completion with event_tx=None/mailbox=None; no panic + populated AgentWorkerRecord.
  • slow/cancel/timeout/queue: blocking mock provider -> Queued surfaces while launch_gate saturated; cancel mid-run -> Cancelled; step_api_timeout -> Failed{recoverable}.
  2. fanout (cargo test -p codewhale-tui fanout):
  • worker_lifecycle_counts_fold: 5 workers across states -> counts {queued:1,running:2,completed:2,...} (Step 4).
  • card_is_projection_not_truth: feed counts directly; FanoutCard renders matching running/done WITHOUT per-child mailbox terminal events (card no longer owns lifecycle).
  3. subagent (cargo test -p codewhale-tui subagent, stay green):
  • test_running_count_* (tests.rs:1845-1900) and test_subagent_tool_registry_reports_unavailable_tools (tests.rs:1282) still pass; update expected unavailable set for read-only roles; add assertion that an Explore child's emitted tools exclude shell.
  4. Regression guard: with_role_agent_surface(General) tool set == previous with_full_agent_surface set (default role behavior-preserving; only narrow roles shrink).

Manual (issue Verification): same multi-agent release-triage prompt in TUI shows compact "N queued / M running / K done" from worker_lifecycle_counts and no per-child transcript flood; interrupted child shows a distinct recoverable state; codewhale exec headless emits the same lifecycle states via stream-json.

Risks:

  1. Behavior change in tool availability is the highest-impact risk. Step 2 narrows what read-only roles (explore/plan/review/verifier) can call. If any existing prompt/workflow silently relied on, e.g., an Explore agent writing a scratch file, it will now fail. MITIGATION: the General role (the default) must be byte-for-byte tool-equivalent to today's with_full_agent_surface (explicit regression test #4); only the typed read-only roles shrink, which is the issue's intent.

  2. Recursion-as-opt-in changes a default. Gating with_subagent_tools behind can_spawn_child() means a role at depth==max can no longer spawn. This matches the issue ("ordinary subagents should be leaf workers by default") and the Kimi comment, but differs from today where the full surface always includes agent_*. MITIGATION: General/Implementer keep a non-zero default budget (DEFAULT_SPAWN_DEPTH=3), so common orchestration is unaffected; only leaf/cheap roles lose it.

  3. Card/ledger dual-write during migration. Step 4 keeps FanoutCard/DelegateCard as compatibility projections while making the ledger truth. If both are updated from different sources mid-migration, counts could disagree. MITIGATION: make the card a strict read-derivative of worker_lifecycle_counts() (one writer: the ledger); delete the mailbox->card terminal-state path and replace reconcile_cards_with_snapshots with render-from-records.

  4. Profile/legacy-knob reconciliation. worker_profile.rs intentionally left mapping ShellPolicy/ToolScope onto the legacy allow_shell bool + AgentWorkerToolProfile to the follow-up. ReadOnly shell has no enforcement point yet (command_safety treats shell as on/off). MITIGATION: Step 1 maps ReadOnly->allow_shell=true for now and files the command_safety read-only-shell enforcement as a tracked sub-task (do not block this issue on it; declare ReadOnly == intent, enforcement pending, exactly as worker_profile.rs already documents).

  5. Scope/size. This touches the 257KB subagent/mod.rs hot path and the tool registry used by BOTH the parent engine and children (registry.rs:951 notes they are kept in lockstep). A mistake in with_role_agent_surface could change the PARENT agent's surface. MITIGATION: parent engine continues to call with_full_agent_surface (unchanged); only SubAgentToolRegistry::new switches to with_role_agent_surface. Land steps independently behind the existing dead_code seams; Steps 1-2 and 4 can ship in separate PRs.

  6. Day-scale/planner-wakeup/persistence-restart criteria in the issue body are explicitly carved out to sibling slices (#3142 run ledger/receipts, #3159 leases/heartbeats/recovery, #3097 WhaleFlow authoring, #3178 /swarm). This design closes the CORE umbrella acceptance (headless worker contract, role profiles built directly, TUI-as-projection, headless test harness, scheduler-owned backpressure for the fleet path). Restart-from-persisted-ledger and planner-wakeup-debounce should be tracked as follow-ups, not attempted here, or effort balloons past large.


#3154 — effort: medium

Current state: Fleet backend landed v0.8.60, durable ledger-backed. fleet/manager.rs:145-573 FleetManager::open(workspace) gives sync status/inspect_worker/interrupt_worker/restart_worker/stop_all/stop_run/rebuild_state; runtime_state is None without a sub_agent_manager. fleet/ledger.rs is the source of truth so the TUI opens its own handle over app.workspace. Only consumer today is CLI run_fleet_command (main.rs:1482-1820); no /fleet slash command or TUI surface (confirmed by grep). FleetStatusSnapshot (manager.rs:82-98) and FleetWorkerInspection (100-117) are render-ready. Config scaffolds intent: ConfigSection Fleet (views/mod.rs:415,433), an experimental whaleflow row citing 3154/3178 (1194-1200), FleetConfigToml/FleetExecConfig (config/lib.rs:1069-1175). No Feature variant for fleet (features.rs:34-50), so register always-on like the task command with empty-state messaging.

Design: Open a FleetManager over app.workspace and project the durable ledger as an interactive overlay; no new actor.

  1. Register a fleet command in commands/groups/utility/mod.rs (FLEET_INFO plus FunctionCommand run_fleet) with a new utility/fleet.rs parser mirroring utility/task.rs:7-41 mapping status/inspect/interrupt/restart/stop to AppAction::Fleet(FleetUiAction); add MessageId CmdFleetDescription in localization.rs.
  2. In tui/app.rs at AppAction (5316) add Fleet(FleetUiAction) variants OpenStatus, Inspect, Interrupt, Restart, StopAll, StopRun.
  3. In tui/ui.rs beside the Task arms (6795-6862) add a Fleet arm calling async handle_fleet_ui_action (mirror handle_mcp_ui_action, 6861) that opens FleetManager::open(app.workspace).with_exec_config, calls status/inspect_worker/interrupt_worker/restart_worker/stop_all/stop_run, and pushes a FleetView or System cell; errors push a System cell.
  4. New tui/views/fleet.rs FleetView impl ModalView mirrors ConfigView (456-1609) holding a FleetStatusSnapshot plus worker rows from rebuild_state; new ModalKind Fleet (21-42); j/k/arrows select, Enter emits FleetInspect, i/r/s emit Interrupt/Restart/Stop, R refresh, Esc close; render header counts reusing print_status (main.rs:1571-1593) plus a worker table with ConfigView selection highlight; tick emits FleetRefresh.
  5. Add FleetInspect/Interrupt/Restart/Stop/Refresh to ViewEvent (views/mod.rs:91-215) and arms in handle_view_events (ui.rs:8068-8079); FleetRefresh reopens the manager and updates the live view via push_boxed (280) or as_any_mut (250).
  6. Zero runs renders a start-a-run hint; always-on, no flag; destructive actions go only through the Rust-owned manager methods; render only ledger projections, no secrets.

Order: land MessageId+parser+AppAction first; then a stub dispatch printing status via a System cell to prove the handle end-to-end; then ModalKind+FleetView render/keys; then ViewEvent arms; then tick refresh.
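The parser in the first design step could look roughly like the sketch below. The variant names come from the design; the payload types, the `stop-all`/`stop-run` spellings (taken from the test plan), the `parse_fleet_args` name, and the error strings are assumptions for illustration.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
enum FleetUiAction {
    OpenStatus,
    Inspect(String),
    Interrupt(String),
    Restart(String),
    StopAll,
    StopRun(String),
}

/// Parse the text after the fleet command name into a UI action.
/// A bare invocation is treated the same as `status`.
fn parse_fleet_args(args: &str) -> Result<FleetUiAction, String> {
    let mut parts = args.split_whitespace();
    let sub = parts.next().unwrap_or("status");

    // Pull the next argument or produce a usage error naming what is missing.
    let mut required = |what: &str| -> Result<String, String> {
        parts
            .next()
            .map(str::to_string)
            .ok_or_else(|| format!("usage: /fleet {} <{}>", sub, what))
    };

    let action = match sub {
        "status" => FleetUiAction::OpenStatus,
        "inspect" => FleetUiAction::Inspect(required("worker")?),
        "interrupt" => FleetUiAction::Interrupt(required("worker")?),
        "restart" => FleetUiAction::Restart(required("worker")?),
        "stop-all" => FleetUiAction::StopAll,
        "stop-run" => FleetUiAction::StopRun(required("run")?),
        other => return Err(format!("unknown fleet subcommand: {}", other)),
    };
    Ok(action)
}
```

Keeping the parser a pure function from `&str` to an action makes the unit tests in the test plan (bare, status, inspect w1, missing-arg) straightforward to write without a TUI.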

Builds on: The v0.8.60 fleet backend (fleet/manager.rs FleetManager::open plus status/inspect_worker/interrupt_worker/restart_worker/stop_all/stop_run/rebuild_state, fleet/ledger.rs, protocol/fleet.rs enums, config FleetExecConfig via with_exec_config). Reuses ModalView/ViewStack/ViewEvent and the handle_view_events loop patterned on ConfigView and the task command family. Mirrors CLI run_fleet_command (main.rs:1482-1820); lift worker_status_label/event_label/print_status/print_inspection into a shared projection module used by both CLI and FleetView. The codex/v0.8.61 foundations (worker_profile.rs, goal_loop.rs, model_registry.rs) are orthogonal. Future tie-in: with_sub_agent_manager to populate runtime_state (manager.rs:434).

Files: crates/tui/src/commands/groups/utility/fleet.rs, crates/tui/src/commands/groups/utility/mod.rs, crates/tui/src/tui/views/fleet.rs, crates/tui/src/tui/views/mod.rs, crates/tui/src/tui/app.rs, crates/tui/src/tui/ui.rs, crates/tui/src/localization.rs

Test plan: Unit: parser tests in utility/fleet.rs like utility/task.rs:43-100 (bare and status to OpenStatus; inspect w1 to Inspect; interrupt/restart/stop-all/stop-run/missing-arg). FleetView key tests under cfg(test): from a hand-made FleetStatusSnapshot assert j/k select, Enter emits FleetInspect with the selected worker, i/r/s emit the right events, Esc closes, empty renders the empty-state line, render-to-Buffer smoke, kind is ModalKind Fleet. Integration via the manager.rs temp-dir fake-binary harness (1353-1392): create_run plus schedule_run, assert status counts and a populated inspect_worker, assert a second independent FleetManager::open over the same path sees the same state, assert stop_all and stop_run flip statuses. Manual: run the fleet run command, open the TUI, run the fleet command, drill into a worker, press i/r/s and verify via the fleet status command, confirm live tick refresh. Gates: cargo test for the tui crate, cargo fmt, cargo clippy.

Risks: CLI/FleetView drift solved by sharing worker_status_label/event_label/print_status/print_inspection in one projection module. Live-refresh reopens the manager and replays the ledger each tick; debounce to at least 2500ms (the task-panel cadence at ui.rs:1394) and refresh only while FleetView is top. Stale view after control actions; recompute and update in place via push_boxed or as_any_mut. runtime_state always None here (no sub_agent_manager), so live per-worker step counts will not show, only ledger events. ModalKind/ViewEvent are matched in several places; grep consumers before adding Fleet to avoid non-exhaustive-match breaks. Concurrent writers: the executor may append while the TUI reads, but append-log plus rebuild makes reads safe snapshots; control writes go only through the manager methods.
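The live-refresh mitigation can be sketched as a small gate that the tick handler consults before reopening the manager. This is an illustration only: RefreshGate and FLEET_REFRESH_MIN_INTERVAL are hypothetical names, and the caller is assumed to know whether FleetView is the top modal.

```rust
use std::time::{Duration, Instant};

/// Hypothetical constant matching the 2500ms floor named in the risk note.
pub const FLEET_REFRESH_MIN_INTERVAL: Duration = Duration::from_millis(2500);

/// Decides whether a tick should trigger a ledger replay.
pub struct RefreshGate {
    last_refresh: Option<Instant>,
}

impl RefreshGate {
    pub fn new() -> Self {
        Self { last_refresh: None }
    }

    /// Returns true at most once per interval, and only while the
    /// fleet view is on top of the view stack.
    pub fn should_refresh(&mut self, fleet_view_is_top: bool, now: Instant) -> bool {
        if !fleet_view_is_top {
            return false;
        }
        match self.last_refresh {
            Some(prev) if now.saturating_duration_since(prev) < FLEET_REFRESH_MIN_INTERVAL => false,
            _ => {
                // Record the refresh time only when a refresh is granted.
                self.last_refresh = Some(now);
                true
            }
        }
    }
}
```

Passing `now` in explicitly keeps the gate testable without sleeping.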


#1812 — effort: medium

Current state: WHAT EXISTS. The TUI runs under #[tokio::main] (multi-thread runtime) — crates/tui/src/main.rs:1022. The render/event loop is a single async fn run_event_loop — crates/tui/src/tui/ui.rs:1221 — spawned/awaited at ui.rs:582. It is a hand-rolled poll loop (NOT tokio::select!). Each iteration: (a) non-blocking drains of async work — version check (ui.rs:1292), web-config (ui.rs:1303), translations (ui.rs:1307), and engine events via engine_handle.rx_event mpsc::Receiver<core::events::Event> drained with try_recv() (ui.rs:1428-1440); then (b) draw; then (c) BLOCKS on terminal input: if event::poll(poll_timeout)? { let evt = event::read()?; ... } (ui.rs:2859-2860), plus a resize-coalesce drain while event::poll(Duration::from_millis(0)) { event::read() } (ui.rs:2929-2930).

THE GAP. crossterm::event::poll/read are SYNCHRONOUS/blocking. Calling them inside an async task blocks the tokio worker thread that is polling the future. ui.rs:2851-2857 already documents this hazard (#549) and bolts on tokio::task::yield_now().await before the poll as a band-aid so the engine task gets a scheduler turn — confirming the loop knowingly does a blocking poll on an async thread. Poll timeout is normally tiny (UI_IDLE_POLL_MS=48ms, UI_ACTIVE_POLL_MS=24ms; ui.rs:154-155; clamped >=1ms by clamp_event_poll_timeout ui.rs:9545), so on most platforms the block is bounded. The freeze (#1812) is the Windows pathology: per the issue's PID 141448 thread dump the main thread is wedged in Wait - UserRequest = WaitForSingleObject(console_input_handle, INFINITE) inside crossterm's Windows poll, which NEVER returns after rapid exec_shell child-process exits corrupt the console input buffer — the poll_timeout argument is ignored at the OS boundary. The tokio runtime stays alive (it spawned 6 new worker threads 5h post-freeze), proving only the thread parked in event::poll is dead. Because input poll and async drains share one thread/future, that wedge also starves engine-event draining: the model's MessageStarted is produced by the engine but never rendered. There is NO dedicated input thread and NO event-stream feature today: crates/tui/Cargo.toml:36 pins crossterm = "0.28" with default features only (Cargo.lock 0.28.1); ratatui 0.30 over CrosstermBackend (color_compat.rs:57). PRECEDENT for the fix already exists: ui.rs:299-317 wraps the one-shot blocking enable_raw_mode() in tokio::task::spawn_blocking + tokio::time::timeout. Note a name clash to handle: crossterm input events are crossterm::event::Event (aliased Event, ui.rs:20) while engine events are also Event from core::events but referred to in the loop as EngineEvent (ui.rs:1448).

Design: GOAL: move ALL blocking event::poll/event::read off the tokio worker thread onto a dedicated OS input thread that forwards decoded crossterm::event::Events over a tokio mpsc channel. The async loop then NEVER blocks on the console; a wedged Windows console handle parks only the throwaway input thread, leaving engine-event draining and rendering fully alive (degrades freeze -> input-only stall, not whole-UI freeze). This is the dedicated-input-thread half of the requested "spawn_blocking / dedicated-input-thread fix"; chosen over per-iteration spawn_blocking(event::read) because spawn_blocking tasks cannot be cancelled and a wedged one would leak a blocking-pool thread every shutdown — a dedicated thread we explicitly detach is cleaner and matches the mpsc engine-event pattern already in the loop.

NEW MODULE: crates/tui/src/tui/input_thread.rs.

  • pub enum TerminalInput { Event(crossterm::event::Event), PollError(String) } (forward read errors instead of ?-propagating from another thread).
  • pub struct InputThreadHandle { rx: tokio::sync::mpsc::Receiver<TerminalInput>, _shutdown: Arc<AtomicBool> } with pub fn try_recv(&mut self) -> Result<TerminalInput, TryRecvError>, pub async fn recv(&mut self) -> Option<TerminalInput>, and pub fn shutdown(&self) (sets the AtomicBool).
  • pub fn spawn() -> InputThreadHandle: build mpsc::channel(256) (mirrors engine rx_event capacity, engine.rs:676); std::thread::Builder::new().name("codewhale-tui-input".into()).spawn(move || loop { if shutdown.load(Relaxed) { break } match event::poll(Duration::from_millis(POLL_TICK_MS)) { Ok(true) => match event::read() { Ok(ev) => if tx.blocking_send(TerminalInput::Event(ev)).is_err() { break }, Err(e) => { let _ = tx.blocking_send(TerminalInput::PollError(e.to_string())); } }, Ok(false) => continue, Err(e) => { let _ = tx.blocking_send(TerminalInput::PollError(e.to_string())); /* brief backoff */ } } }). Builder::name takes a String, hence the .into(). POLL_TICK_MS ~50ms so the thread periodically re-checks shutdown and stays responsive; the OS-level INFINITE-wait pathology now consumes only this thread. Use tx.blocking_send (sync thread context). The thread is intentionally detached/leaked on shutdown: never join() it (it may be unkillably parked in the Windows console wait). Instead set the AtomicBool, drop the receiver (which closes the channel so blocking_send returns Err and the thread exits if/when poll ever returns), and let the process tear it down. #![allow(clippy::print_stdout)] not needed; no stdout writes.
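The module shape above can be sketched with only the standard library. This is not the proposed implementation: std::sync::mpsc::sync_channel stands in for the tokio mpsc channel, FakeEvent stands in for crossterm::event::Event, and the PollSource trait is a hypothetical seam replacing the blocking event::poll/event::read pair so the loop can run without a TTY.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{sync_channel, Receiver, TryRecvError};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Stand-in for `crossterm::event::Event` (assumption for this sketch).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FakeEvent { Key(char), Resize(u16, u16) }

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TerminalInput { Event(FakeEvent), PollError(String) }

/// Hypothetical seam over the blocking poll + read calls.
pub trait PollSource: Send + 'static {
    /// Blocks up to `tick`; Ok(None) means no input arrived in that window.
    fn poll_read(&mut self, tick: Duration) -> Result<Option<FakeEvent>, String>;
}

pub struct InputThreadHandle {
    rx: Receiver<TerminalInput>,
    shutdown: Arc<AtomicBool>,
}

impl InputThreadHandle {
    pub fn try_recv(&mut self) -> Result<TerminalInput, TryRecvError> { self.rx.try_recv() }
    pub fn shutdown(&self) { self.shutdown.store(true, Ordering::Relaxed); }
}

pub fn spawn_input_thread<S: PollSource>(mut source: S) -> InputThreadHandle {
    let (tx, rx) = sync_channel(256);
    let shutdown = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&shutdown);
    let handle = thread::Builder::new()
        .name("tui-input-sketch".into())
        .spawn(move || {
            while !flag.load(Ordering::Relaxed) {
                let item = match source.poll_read(Duration::from_millis(50)) {
                    Ok(Some(ev)) => TerminalInput::Event(ev),
                    Ok(None) => continue,
                    Err(e) => TerminalInput::PollError(e),
                };
                // A send error means the receiver was dropped, so exit.
                if tx.send(item).is_err() { break; }
            }
        })
        .expect("failed to spawn input thread");
    // Detach on purpose: the thread is never joined.
    drop(handle);
    InputThreadHandle { rx, shutdown }
}

/// Test double: replays scripted results, then reports idle ticks.
pub struct Scripted(pub Vec<Result<Option<FakeEvent>, String>>);

impl PollSource for Scripted {
    fn poll_read(&mut self, tick: Duration) -> Result<Option<FakeEvent>, String> {
        if self.0.is_empty() {
            thread::sleep(tick.min(Duration::from_millis(1)));
            return Ok(None);
        }
        self.0.remove(0)
    }
}
```

Read errors travel through the channel as PollError values, so the consumer decides how to react instead of the thread unwinding.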

WIRE INTO run_event_loop (ui.rs:1221):

  1. Just before loop { (after the startup block ~ui.rs:1286), spawn once: let mut input = crate::tui::input_thread::spawn();.
  2. REPLACE the blocking input section ui.rs:2859-2891 (the if event::poll(poll_timeout)? { let evt = event::read()?; ... } block) with non-blocking channel drains. Keep the existing per-iteration cadence by making the loop sleep on a timer that is interrupted by input: convert the bottom of the loop into tokio::select! { _ = input_ready(&mut input) => drain_input(), _ = tokio::time::sleep(poll_timeout) => {} } where one arm awaits the next TerminalInput and the other is the existing computed poll_timeout (so animation/redraw cadence, paste-burst flush, quit-arm expiry, autoscroll ticks — all the poll_timeout.min(...) adjustments at ui.rs:2829-2851 — are preserved unchanged). After the select wakes, drain ALL currently-buffered input with while let Ok(item) = input.try_recv() { handle(item) } so a burst is processed in one frame (replaces the old "poll then read one" semantics; matches the existing engine try_recv drain style at ui.rs:1430). Set app.needs_redraw = true when any TerminalInput::Event is handled (preserves ui.rs:2861). On TerminalInput::PollError(msg), tracing::warn! once and continue (do not kill the UI; previously ? would abort the whole loop — this is a resilience upgrade).
  3. The body that currently runs per crossterm event (paste handling ui.rs:2864-2891, focus/viewport recapture ui.rs:2899-2911, Resize handling ui.rs:2912+) moves verbatim into a small local closure/inline match keyed on the crossterm::event::Event pulled from the channel. The only change is the event SOURCE (channel vs event::read()); the per-event logic is unchanged.
  4. RESIZE COALESCE (ui.rs:2929-2947): the while event::poll(0) { event::read() } drain can no longer call crossterm directly. Replace it with a drain of the input channel that matches on every item: while let Ok(item) = input.try_recv() { match item { TerminalInput::Event(Event::Resize(w, h)) => { final_w = w; final_h = h }, other => handle(other) } }. Do not put the Resize pattern in the while let itself: the first non-resize item would fail the match after try_recv() had already removed it from the channel, silently dropping it. Because the input thread forwards events in arrival order, and ratatui/crossterm still coalesce at source, keeping only the last Resize preserves the "act on final size only" behavior (#65). Non-resize items pulled during the drain must NOT be dropped (the old code admitted it dropped them at ui.rs:2935-2944); instead handle them in the same frame via the same event handler, which is strictly better.
  5. SHUTDOWN: at the existing teardown after run_event_loop returns (ui.rs:592+), call input.shutdown() and drop the handle. Do not join. The emergency/panic restore path (emergency_restore_terminal ui.rs:9141) is unaffected — it only resets terminal modes, and the detached input thread dies with the process.
  6. tokio::task::yield_now().await (ui.rs:2857) and its #549 comment can be REMOVED — its purpose (give the engine task a turn before blocking) is moot once the loop no longer blocks on input; removal is optional and low-risk but should be called out so a reviewer knows the band-aid is being retired by the real fix.
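The per-frame drain described in items 2 and 4 can be sketched as a pure function over a channel receiver. The types here are stand-ins: Input replaces TerminalInput wrapping a crossterm event, std::sync::mpsc replaces the tokio channel, and DrainOutcome and drain_frame are hypothetical names. The sketch drains everything buffered, keeps only the last resize, counts poll errors instead of aborting, and never drops a non-resize item.

```rust
use std::sync::mpsc::Receiver;

/// Stand-in for TerminalInput (assumption for this sketch).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Input {
    Key(char),
    Resize(u16, u16),
    PollError(String),
}

/// What one frame's drain produced.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct DrainOutcome {
    /// Non-resize events, in arrival order, to be handled this frame.
    pub handled: Vec<Input>,
    /// The last Resize seen this frame, if any.
    pub final_size: Option<(u16, u16)>,
    /// Number of poll errors observed; the caller would log and continue.
    pub poll_errors: usize,
}

/// Drain every buffered item without blocking.
pub fn drain_frame(rx: &Receiver<Input>) -> DrainOutcome {
    let mut out = DrainOutcome::default();
    // Match inside the loop body so an item that is not a Resize is
    // routed to the handler instead of being consumed and lost.
    while let Ok(item) = rx.try_recv() {
        match item {
            Input::Resize(w, h) => out.final_size = Some((w, h)),
            Input::PollError(_) => out.poll_errors += 1,
            other => out.handled.push(other),
        }
    }
    out
}
```

In the real loop the caller would set app.needs_redraw when handled is non-empty or final_size is Some, then apply the single final size.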

PLATFORM/FEATURE: no new crossterm feature required (we drive sync event::poll/read from our own thread; we deliberately avoid the event-stream/EventStream async API because it internally still spawns the same kind of reader and would not let us detach-on-wedge as cleanly, and it would add a feature flag + mio dependency surface). Keep crossterm = "0.28" as-is.

OPTIONAL DEFENSE-IN-DEPTH (only if cheap; not required to close the freeze): after the input thread isolates the wedge, an idle stall is now observable as "input channel silent for >Ns while engine has pending work." A future watchdog could surface a toast, but the core fix does not need it — the engine and renderer no longer freeze. Do NOT pursue the issue-comment's FlushConsoleInputBuffer/watchdog branches here; they target a different (shell descendant-pipe) freeze mode already handled by PR #2498.

Builds on: None of the listed v0.8.61 foundations (worker_profile, goal_loop, model_registry, provider_readiness, context_budget, provider_adapter, resource_telemetry, request_tuning, record_thread_goal_usage) are in this code path — this is a pure TUI runtime/terminal-IO fix and is orthogonal to all of them. It is adjacent to the landed turn_loop cancellation work (core/engine/turn_loop.rs:81-82, 408-409 — the freeze cancel arm, #3216/#2211): that change ensures the ENGINE observes cancellation between tool batches; this change ensures the TUI keeps draining engine events and accepting input even when the Windows console wedges. The two are complementary halves of the freeze story (engine-side responsiveness + UI-side responsiveness) and should be cross-referenced in the PR. The fix reuses the existing in-repo pattern of offloading blocking terminal work via tokio (spawn_blocking precedent at crates/tui/src/tui/ui.rs:299-317) and the existing mpsc-drain pattern used for engine events (ui.rs:1428-1440).

Files: crates/tui/src/tui/input_thread.rs (NEW: dedicated input thread + InputThreadHandle + TerminalInput), crates/tui/src/tui/mod.rs (register mod input_thread;), crates/tui/src/tui/ui.rs (spawn input thread before the loop; replace blocking event::poll/read at 2859-2891 with channel drain inside a tokio::select! against the existing poll_timeout; rewrite resize-coalesce drain 2929-2947; remove yield_now band-aid 2857; call input.shutdown() in teardown after 592), crates/tui/src/tui/ui/tests.rs (unit tests for the channel-drain/handler refactor), crates/tui/Cargo.toml (no change expected — note explicitly that no new crossterm feature is added)

Test plan: UNIT (crates/tui/src/tui/ui/tests.rs, following its existing style; mock_engine_handle is already imported there):

  1. input_thread_forwards_events_over_channel: the real input thread is hard to drive without a TTY, so instead extract the per-event handling into a testable fn handle_terminal_input(app: &mut App, ev: crossterm::event::Event, ...) and assert: an Event::Paste routes into the composer (mirror existing paste tests), Event::Resize updates app size, and Event::FocusGained sets force_repaint (reuse focus_gained_forces_terminal_viewport_recapture at tests.rs:150).
  2. resize_coalesce_uses_last_size_from_channel: feed Resize(80,24),Resize(100,30) into a mock InputThreadHandle (construct one from a test-only mpsc::channel + pre-pushed items) and assert handle_resize is called with (100,30) and intermediate non-resize events are still handled, not dropped.
  3. poll_error_does_not_abort_loop: push TerminalInput::PollError("x") and assert the drain returns normally (no Err propagation) and emits a warn (can assert via a returned enum rather than tracing capture).
  4. shutdown_sets_flag_and_closes_channel: call handle.shutdown(), then drop the handle; assert the AtomicBool is set and a subsequent blocking_send on the paired tx returns Err (the channel closes when the receiver is dropped, not when the flag is set). This proves the thread will exit if poll ever returns.

EXISTING REGRESSION: run cargo test -p codewhale-tui and confirm the keyboard/mouse/paste/selection/resize tests (tests.rs:121-1049+) still pass after the event-source refactor; they exercise the same handlers, now fed from the channel.

GATES: cargo fmt --check; cargo clippy -p codewhale-tui -- -D warnings (module denies print_stdout, so ensure input_thread.rs writes nothing to stdout); cargo check -p codewhale-tui on Windows target if available.

MANUAL (Windows 11, the repro from the issue): build release, run --yolo, fire 5 rapid exec_shell commands like the Event #2 timeline, then a findstr-style command; confirm the UI keeps rendering the model's streamed response (engine events keep draining) and remains responsive, i.e., the whole-UI freeze no longer reproduces. As a non-Windows smoke check, confirm normal typing/paste/resize/Ctrl+C behavior is unchanged on macOS/Linux.

Risks: 1. Latency/throughput: input now hops one mpsc channel before handling. Negligible (sub-ms) and the same pattern engine events already use, but verify fast typing/IME bursts feel identical by draining the whole channel each frame (do not handle one-per-frame). 2. Coalescing semantics: the old code dropped non-resize events seen during the resize-coalesce window (ui.rs:2935); the new drain must handle them instead — getting this wrong could double-handle or drop a keypress during a drag-resize. Covered by test #2. 3. Detached-thread leak: by design the input thread is never joined and may stay parked in the wedged Windows console wait forever; this is the correct trade (one parked throwaway thread vs whole-UI freeze) but means a clean process exit relies on the OS reaping it — acceptable for a CLI, and the runtime already detaches blocking work. 4. spawn_blocking alternative rejected: if a reviewer prefers per-iteration spawn_blocking(|| event::poll/read), note it cannot be aborted, so a wedged call leaks a blocking-pool thread on EVERY frame until the pool saturates — strictly worse than one dedicated thread. 5. Backpressure: if the async loop stalls for other reasons, blocking_send on a full 256 channel will park the input thread; that only delays input, never deadlocks the renderer; capacity 256 matches engine and is ample for human input. 6. Two Event types in scope (crossterm vs core::events) — the new channel must be typed crossterm::event::Event explicitly to avoid the alias clash noted in current_state. 7. Test reachability: the real thread needs a TTY; the design deliberately makes the per-event handler and the handle drain testable without spawning the thread, so CI coverage is meaningful.


#1737 — effort: medium

Current state: WORKTREE NOTE: the agent cwd /Volumes/VIXinSSD/codewhale/npm/codewhale is a subdir; the real repo root is /Volumes/VIXinSSD/codewhale on branch codex/v0.8.61. All paths below are repo-root-relative under /Volumes/VIXinSSD/codewhale/. Foundations confirmed landed via git log (worker_profile.rs, goal_loop.rs, resource_telemetry.rs, etc. all present at crates/tui/src/).

The issue has FOUR acceptance criteria. Tracing the code, substantial infrastructure already exists; the gaps are narrower than the issue implies.

(1) DUPLICATE task_shell_wait ROWS — mostly already solved, one hole. crates/tui/src/tui/sidebar.rs:1670-1684 collapses repeated polls via shell_wait_groups, keyed by shell_wait_poll_key (sidebar.rs:1824) which parses task_id: out of the row summary. summarize_tool_args (crates/tui/src/tui/history.rs:2384) DOES emit task_id: <val> truncated to 40 chars, and generic_tool_sidebar_summary (sidebar.rs:1544) joins it into the summary, so the common case collapses (proven by test tasks_panel_collapses_repeated_shell_waits_for_same_job sidebar.rs:3880). HOLES: (a) is_shell_wait_poll_row (sidebar.rs:1820) only matches status==Running && name=="wait Bash" — a poll that surfaces as Failed (transient) or whose task_id got truncated past 40 chars escapes and falls back to normalize_activity_text(&row.summary) which folds in the CHANGING output_summary, defeating dedup; (b) the live ("Live tools") and "Recent tools" sections run editorial_tool_rows independently (sidebar.rs:1220 oldest-first, :1287 newest-first) so the same wait can appear once in each section.

(2) FAILED COMMAND → INDEFINITE in_progress — the real architectural gap. runtime_turn_status is binary: set to "in_progress" at TurnStarted (crates/tui/src/tui/ui.rs:1807) and only resolved to completed/interrupted/failed at TurnComplete (ui.rs:1880-1888). The Tasks sidebar turn row renders this raw string verbatim: format!("Turn {} ({status})", ...) (sidebar.rs:874-878). So while the model keeps re-calling task_shell_wait after a foreground failure, the turn is genuinely still in_progress and the sidebar says only "in_progress" with no hint that the visible command already failed or that nothing useful is happening. The footer already classifies this better via stall_reason (crates/tui/src/tui/footer_ui.rs:123-159) after 30s ("background jobs running" / "waiting - no recent activity") and there's a ready predicate active_turn_has_running_tool (ui.rs:4924) — but none of this reaches the sidebar turn row. There is NO last_tool_failed signal in App state (grep confirms absent).

(3) CANCEL/CLEANUP STALE JOB — largely exists. /jobs cancel <id> and /jobs cancel-all are wired (crates/tui/src/commands/groups/utility/jobs.rs:51-59 → ShellJobAction in ui.rs:7202 kill_running). The sidebar already makes a running job's detail row clickable to /jobs cancel <id> (sidebar.rs:819-820) and shows a Ctrl+K -> /jobs cancel-all hint (sidebar.rs:956-962). Gap: a STALE job (running but no output for minutes) is not visually distinguished from a healthy running job, so the user doesn't know which to cancel.

(4) STALE-JOB WARNING AFTER N SECONDS — no infrastructure. BackgroundShell (crates/tui/src/tools/shell.rs, struct ~line 730) has started_at: Instant but NO last-output timestamp. The stale: bool on ShellJobSnapshot (shell.rs:113) is set ONLY by remember_stale_job on session restart (shell.rs:1682); job_snapshot always sets stale: false (shell.rs:719). job_status_rank (shell.rs:1749) already ranks stale=4, so the plumbing to surface staleness exists but nothing ever computes elapsed-since-output staleness. Additionally, the task_panel rebuild (ui.rs:1006-1017) only pushes background jobs where status==Running, dropping failed/killed/completed jobs entirely — correct for the panel, but means a still-Running orphan whose parent foreground command failed legitimately stays shown (this is the "background job running for several minutes" the issue describes).

Design: Build incrementally on the existing sidebar dedup framework, the /jobs command, and stall_reason. Four coordinated changes, smallest-blast-radius first.

STEP A — Elapsed-based staleness on BackgroundShell (crates/tui/src/tools/shell.rs). (i) Add last_output_at: Arc<Mutex<Instant>> (or a simpler AtomicU64 ms-since-start) to BackgroundShell, initialized to started_at. In the stdout/stderr reader threads, at the sites where bytes are appended to stdout_buffer/stderr_buffer (the append sites in the spawned threads, rather than take_delta_from_buffer near shell.rs:1701), and on each successful poll() that observes new bytes, update last_output_at = Instant::now(). (ii) Add const STALE_NO_OUTPUT_AFTER: Duration = Duration::from_secs(60);. (iii) In job_snapshot (shell.rs:694-720), compute let stale = self.status == ShellStatus::Running && self.last_output_at.elapsed() >= STALE_NO_OUTPUT_AFTER; and set stale from it (replacing the hardcoded stale: false at :719). Also add elapsed_since_output_ms: u64 to ShellJobSnapshot (shell.rs:100-115) so the UI can show "no output 2m". Keep remember_stale_job's restart path setting stale=true. list_jobs already sorts by job_status_rank(status, stale) so stale jobs auto-rank down (rank 4) without further change.
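The AtomicU64 variant of Step A can be sketched as a small lock-free clock. OutputClock and its method names are hypothetical; only STALE_NO_OUTPUT_AFTER and the 60 second threshold come from the design. The current time is passed in as a parameter so the staleness rule can be tested without waiting.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

pub const STALE_NO_OUTPUT_AFTER: Duration = Duration::from_secs(60);

/// Hypothetical helper: stores the time of the last output as
/// milliseconds since `base`, so writers need no lock.
pub struct OutputClock {
    base: Instant,
    last_output_ms: AtomicU64,
}

impl OutputClock {
    /// Starts with "last output" equal to the job start time.
    pub fn new(started_at: Instant) -> Self {
        Self { base: started_at, last_output_ms: AtomicU64::new(0) }
    }

    /// Called when new bytes were observed. fetch_max keeps the clock
    /// monotonic even if two observers race.
    pub fn mark_output(&self, now: Instant) {
        let ms = now.saturating_duration_since(self.base).as_millis() as u64;
        self.last_output_ms.fetch_max(ms, Ordering::Relaxed);
    }

    pub fn elapsed_since_output(&self, now: Instant) -> Duration {
        let last = self.base + Duration::from_millis(self.last_output_ms.load(Ordering::Relaxed));
        now.saturating_duration_since(last)
    }

    /// Only a running job can be stale under the elapsed-time rule.
    pub fn is_stale(&self, running: bool, now: Instant) -> bool {
        running && self.elapsed_since_output(now) >= STALE_NO_OUTPUT_AFTER
    }
}
```

In job_snapshot the same value would feed both stale and elapsed_since_output_ms, so the two fields cannot disagree.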

STEP B — Surface staleness in the Tasks panel (crates/tui/src/tui/ui.rs + crates/tui/src/tui/sidebar.rs). (i) Extend TaskPanelEntry (find def via grep; it lives in crates/tui/src/tui/app.rs) with stale: bool and elapsed_since_output_ms: Option<u64>. (ii) In the task_panel rebuild (ui.rs:1010-1016) carry job.stale and the new elapsed field into the pushed entry. (iii) In background_task_labels (sidebar.rs:1181) when task.stale, append a warning suffix e.g. format!("{} · stale, no output {}", ...) using format_duration_ms, and in task_panel_rows color (sidebar.rs:922-929) map stale running jobs to theme.warning instead of the normal running color, with the detail row already clickable to /jobs cancel <id> (no action change needed — sidebar.rs:819 already does this). (iv) When ANY background row is stale, change the cancel hint (sidebar.rs:956-962) text to call it out, e.g. "Ctrl+K -> cancel stale job".

STEP C — Sidebar turn-row classification (crates/tui/src/tui/sidebar.rs:865-883). Replace the raw ({status}) render with a derived sub-status when status=="in_progress". Add a small pure helper fn turn_progress_hint(app: &App) -> Option<&'static str> mirroring footer stall_reason precedence but with no 30s gate for the structural cases: if running_agent_count>0 → None (genuinely working); else if app.task_panel.iter().any(|t| t.status=="running") → "waiting on background job"; else if active_turn_has_running_tool(app) (reuse ui.rs:4924, made pub(crate)) → None; else if the most-recent active-cell tool entry is Failed AND no running tool/job remains → "last command failed, retrying". Render as format!("Turn {} (in_progress · {hint})", n) when hint is Some. This keeps runtime_turn_status itself untouched, so footer/composer/loading logic that gates on Some("in_progress") is unaffected (the call sites at ui.rs:1354/4810/4868/4919, footer_ui.rs:156/239, composer_ui.rs:26/29 all want the raw flag), and the change is purely presentational. decide_continuation/goal_loop is NOT used here: that is cross-turn goal logic, orthogonal to this within-turn UI reconciliation.
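The precedence in Step C can be sketched as a pure function. The design's helper takes &App; here a hypothetical TurnSignals struct flattens the four inputs it would read, so the ordering of the checks is visible and testable in isolation. The hint strings are the ones named in the design.

```rust
/// Hypothetical flattened view of the App state the helper reads.
#[derive(Debug, Clone, Copy, Default)]
pub struct TurnSignals {
    pub running_agent_count: usize,
    pub background_job_running: bool,
    pub active_tool_running: bool,
    pub last_tool_failed: bool,
}

/// Derive an optional sub-status for an in-progress turn.
/// Order matters: earlier checks win over later ones.
pub fn turn_progress_hint(s: &TurnSignals) -> Option<&'static str> {
    if s.running_agent_count > 0 {
        return None; // sub-agents are genuinely working
    }
    if s.background_job_running {
        return Some("waiting on background job");
    }
    if s.active_tool_running {
        return None; // a foreground tool is still executing
    }
    if s.last_tool_failed {
        return Some("last command failed, retrying");
    }
    None
}

/// Render the sidebar turn row; the raw status string is never modified.
pub fn turn_row_label(turn: u32, status: &str, s: &TurnSignals) -> String {
    match (status, turn_progress_hint(s)) {
        ("in_progress", Some(hint)) => format!("Turn {turn} (in_progress · {hint})"),
        _ => format!("Turn {turn} ({status})"),
    }
}
```

Because the hint is computed at render time from existing state, nothing that gates on the raw in_progress flag needs to change.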

STEP D — Close the dedup holes (crates/tui/src/tui/sidebar.rs). (i) Broaden is_shell_wait_poll_row (sidebar.rs:1820) to also match the friendly name regardless of Failed/Running and to cover both wait Bash and exec_shell_wait framings: row.name == "wait Bash" || row.name == "exec_shell_wait" and accept any status (so transient-failed polls collapse too). (ii) Harden shell_wait_poll_key (sidebar.rs:1824): when the task_id: marker is absent, do NOT fall back to the full output-bearing summary; instead fall back to the row NAME only (stable across polls) so output deltas can't split the group. (iii) Cross-section dedup: in recent_tool_rows (sidebar.rs:1279) filter out any row whose shell_wait_poll_key matches a key already present in the live active_tool_rows so a single in-flight wait never shows in both "Live tools" and "Recent tools". Thread the active keys in via a param.
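The three Step D changes can be sketched together. ToolRow is a hypothetical reduction of the sidebar row to the two fields the key depends on, the "task:" key prefix is an assumption of this sketch, and dedup_recent_against_live is an illustrative name for the cross-section filter. The real functions live in sidebar.rs and may parse the summary differently.

```rust
use std::collections::HashSet;

/// Hypothetical reduction of a sidebar tool row.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ToolRow {
    pub name: String,
    pub summary: String,
}

/// (i) Match wait polls by name only, whatever their status.
pub fn is_shell_wait_poll_row(row: &ToolRow) -> bool {
    row.name == "wait Bash" || row.name == "exec_shell_wait"
}

/// (ii) Key on the task id when present, else on the stable row name.
/// The output-bearing summary is never used as the key.
pub fn shell_wait_poll_key(row: &ToolRow) -> String {
    const MARKER: &str = "task_id:";
    if let Some(idx) = row.summary.find(MARKER) {
        let rest = row.summary[idx + MARKER.len()..].trim_start();
        let id = rest
            .split(|c: char| c.is_whitespace() || c == ',' || c == ';')
            .next()
            .unwrap_or("");
        if !id.is_empty() {
            return format!("task:{id}");
        }
    }
    row.name.clone()
}

/// (iii) Drop recent wait rows whose key is already shown as live.
pub fn dedup_recent_against_live(live: &[ToolRow], recent: Vec<ToolRow>) -> Vec<ToolRow> {
    let live_keys: HashSet<String> = live
        .iter()
        .filter(|r| is_shell_wait_poll_row(r))
        .map(shell_wait_poll_key)
        .collect();
    recent
        .into_iter()
        .filter(|r| !(is_shell_wait_poll_row(r) && live_keys.contains(&shell_wait_poll_key(r))))
        .collect()
}
```

Only wait rows are filtered, so an unrelated tool whose summary happens to mention the same task id still appears in the recent section.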

ORDER: A → B (B depends on A's snapshot fields) can land together; C and D are independent and can land in either order. All four are additive/presentational except A which adds a timestamp write in the reader path (cheap).

Builds on: Primarily builds on PRE-EXISTING TUI infrastructure already on the branch, NOT the listed v0.8.61 foundations (those are provider/goal/budget substrate, orthogonal to this shell-lifecycle UI bug):

  • Existing sidebar dedup framework: editorial_tool_rows / shell_wait_groups / is_shell_wait_poll_row / shell_wait_poll_key (sidebar.rs:1615-1838).
  • Existing job staleness plumbing that was half-built: ShellJobSnapshot.stale + job_status_rank(status, stale) (shell.rs:113/1749) — Step A finally feeds it from elapsed time.
  • Existing /jobs cancel|cancel-all command + ShellJobAction + kill_running (jobs.rs / ui.rs:7202) and the already-clickable per-job cancel detail row (sidebar.rs:819).
  • Existing footer stall_reason (footer_ui.rs:123) and active_turn_has_running_tool (ui.rs:4924): Step C reuses the same precedence logic and predicate for the sidebar turn row.

LANDED FOUNDATION touch-point (minor): crates/tui/src/resource_telemetry.rs token/time/duration formatting can be reused for the "no output Nm" staleness label in Step B to keep duration formatting consistent; otherwise reuse the local format_duration_ms (sidebar.rs:1907). goal_loop.rs::decide_continuation is intentionally NOT used: it governs cross-turn goal continuation, not within-turn poll/stall reconciliation.

Files: crates/tui/src/tools/shell.rs, crates/tui/src/tui/sidebar.rs, crates/tui/src/tui/ui.rs, crates/tui/src/tui/app.rs, crates/tui/src/tools/shell/tests.rs, crates/tui/src/tui/footer_ui.rs

Test plan: Rust unit tests, colocated as the repo already does (sidebar tests in crates/tui/src/tui/sidebar.rs #[cfg(test)], shell tests in crates/tui/src/tools/shell/tests.rs). Run cargo test -p codewhale-tui (confirm crate name from crates/tui/Cargo.toml; binary is "codewhale").

STEP A (shell.rs/tests.rs): (1) background_shell_marks_stale_after_no_output — start a fake long job (or directly set last_output_at to an Instant 61s in the past via a test-only setter), call job_snapshot, assert stale==true and elapsed_since_output_ms>=60_000. (2) fresh_running_job_is_not_stale — recent output → stale==false. (3) list_jobs_ranks_stale_after_running — one fresh + one stale running job, assert stale sorts after (uses existing job_status_rank).

STEP B (sidebar.rs tests, extend existing patterns near :3328-3460): (4) stale_background_job_row_shows_warning_and_warning_color — build an App with a TaskPanelEntry{status:"running", stale:true,..}, render task_panel_lines, assert a line contains "stale" / "no output" and the cancel hint changes. (5) extend task_panel_actions_make_single_background_job_clickable to assert the stale job's detail action is still /jobs cancel <id>.

STEP C (sidebar.rs tests): (6) turn_row_shows_failed_command_hint_when_no_work_running — App with runtime_turn_status=in_progress, an active_cell whose last tool entry is Failed, no running tool/job → assert turn line contains "last command failed". (7) turn_row_shows_background_wait_hint — in_progress + a running background TaskPanelEntry → assert "waiting on background job". (8) turn_row_unchanged_when_genuinely_running — in_progress + a running tool entry → assert plain "in_progress" with no spurious hint. (9) Guard: a footer test asserting footer_working_strip_active/stall_reason behavior is unchanged (the raw runtime_turn_status is untouched).

STEP D (sidebar.rs tests, extend tasks_panel_collapses_repeated_shell_waits_for_same_job :3880): (10) collapses_shell_waits_even_when_output_differs_between_polls — two task_shell_wait Generic cells, same input_summary:"task_id: shell_x" but DIFFERENT output_summary per poll → assert exactly one "wait Bash" row + "2 waits collapsed" (this fails before Step D's key-hardening). (11) collapses_shell_waits_when_task_id_marker_absent — input_summary without task_id: → assert still one row (name-only fallback). (12) shell_wait_not_duplicated_across_live_and_recent_sections — same wait present as active AND in history → assert it renders once.

MANUAL FIXTURE / SMOKE (acceptance criterion #5 — add to v0.8.40 release smoke checklist, file likely under docs/ or workflows/; grep "smoke" to locate): script task_shell_start a sleep 300, then run a foreground exec_shell that exits 1, then observe: sidebar turn row shows "in_progress · last command failed, retrying" (or "waiting on background job"), the background job row turns warning-colored "stale, no output …" after 60s, and clicking it (or /jobs cancel <id>) cleans it up. Verify no duplicate "wait Bash" rows accumulate while polling.

GATES: cargo fmt --all, cargo clippy -p codewhale-tui -- -D warnings, cargo test -p codewhale-tui (confirm exact package name first via head crates/tui/Cargo.toml).

Risks:

  • STALENESS THRESHOLD FALSE POSITIVES: a legitimately silent job (e.g. sleep, a server idling, a build between phases) will be flagged stale after 60s. Mitigation: 60s is conservative; the label says "no output", not "stuck", and the action is a non-destructive suggestion (the job keeps running until the user cancels). Consider making the const configurable later; do NOT auto-kill.
  • last_output_at WRITE CONTENTION: updating a timestamp on every byte append in the reader thread adds a lock/atomic write in a hot path. Mitigation: use an AtomicU64 (ms since a base Instant) for lock-free update, or only update in poll() when take_delta observed new bytes (coarser but adequate for a 60s threshold) — the latter avoids touching the reader threads at all and is the recommended low-risk path.
  • TURN-ROW HINT REGRESSION: Step C must NOT change runtime_turn_status itself, because many call sites gate on Some("in_progress") (ui.rs:1354/4810/4868/4919, footer_ui.rs:156/239, composer_ui.rs:26/29, session.rs:191). The change is render-only inside task_panel_rows; a test (guard #9) and a grep audit of those sites are required before merge.
  • DEDUP OVER-COLLAPSE: broadening is_shell_wait_poll_row to accept Failed status (Step D-i) could collapse a genuine wait failure with running polls. Mitigation: the group already preserves the latest non-empty summary and shows "N waits collapsed"; a true terminal failure surfaces via the background-job row + turn-row hint, so collapsing the poll spam is acceptable. Keep the existing MAX_VISIBLE_FAILURES path for non-wait tools untouched.
  • CROSS-SECTION DEDUP ORDERING (Step D-iii): filtering recent rows against live keys requires computing live keys first; ensure active_tool_rows is computed before recent_tool_rows in task_panel_rows (it already is, sidebar.rs:885 vs :967) and thread the key set through without recomputing divergently.
  • TaskPanelEntry is serialized for session persistence (grep task_panel in persistence path) — adding stale/elapsed_since_output_ms should be #[serde(default)] to avoid breaking old saved sessions.
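
The lock-free option in the last_output_at mitigation above can be sketched as follows. This is a minimal illustration using only std, not the project's code: `OutputClock`, `mark_output`, and `STALE_AFTER` are hypothetical names, and treating a job as stale at exactly 60s (rather than strictly after) is an assumption.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

/// Hypothetical threshold mirroring the 60s value discussed above.
const STALE_AFTER: Duration = Duration::from_secs(60);

/// Lock-free "last output" clock: milliseconds since a fixed base `Instant`.
/// A fresh clock counts the base instant itself as the last output.
struct OutputClock {
    base: Instant,
    last_output_ms: AtomicU64,
}

impl OutputClock {
    fn new(base: Instant) -> Self {
        Self { base, last_output_ms: AtomicU64::new(0) }
    }

    /// Record that new bytes were observed at `now`
    /// (for example when a poll saw a non-empty delta).
    fn mark_output(&self, now: Instant) {
        let ms = now.saturating_duration_since(self.base).as_millis() as u64;
        self.last_output_ms.store(ms, Ordering::Relaxed);
    }

    /// Time elapsed between the last recorded output and `now`.
    fn elapsed_since_output(&self, now: Instant) -> Duration {
        let last_ms = self.last_output_ms.load(Ordering::Relaxed);
        let last = self.base + Duration::from_millis(last_ms);
        now.saturating_duration_since(last)
    }

    /// Render-only hint: the job keeps running, it is just labelled stale.
    fn is_stale(&self, now: Instant) -> bool {
        self.elapsed_since_output(now) >= STALE_AFTER
    }
}
```

Passing `now` in explicitly keeps the staleness rule testable without sleeping; whether the timestamp is written from the reader thread or only from poll() is the trade-off the bullet above describes.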

#1786 — effort: medium

Current state: The issue has two halves; one is already fixed, the other is a real, citable gap.

ALREADY FIXED (cross-session orphan, the owner's 2026-05-27 comment is now stale): crates/tui/src/task_manager.rs:1516-1547 (load_state) unconditionally reaps every durable task with status == Running on session load -> Failed with error "Interrupted by process restart; prior process is not attached", finalizes any Running tool calls, and appends a recovered timeline entry. The queue is then filtered to only Queued tasks (task_manager.rs:1562-1566). There is NO pid field on TaskRecord (task_manager.rs:182-223) and none is needed — reaping is unconditional, so a prior session's running task can no longer block startup. Test running_tasks_are_not_requeued_after_restart (task_manager.rs:1816+) already covers this. #2468 also clears turn_started_at on inconsistent-busy recovery.
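
The reaping rule described above can be illustrated with a reduced model. This is a sketch under stated assumptions, not the contents of task_manager.rs: `TaskStatus`, the three-field `TaskRecord`, and `reap_and_requeue` are simplified stand-ins, and the tool-call finalization and timeline entry are omitted. Only the error string is taken from the text.

```rust
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum TaskStatus {
    Queued,
    Running,
    Completed,
    Failed,
}

/// Reduced stand-in for the durable record; note there is no pid field.
#[derive(Debug, Clone)]
struct TaskRecord {
    id: String,
    status: TaskStatus,
    error: Option<String>,
}

const RESTART_ERROR: &str = "Interrupted by process restart; prior process is not attached";

/// Unconditionally fail every task that was Running when the state was saved,
/// then return the ids that are still eligible for the queue.
fn reap_and_requeue(tasks: &mut [TaskRecord]) -> Vec<String> {
    for task in tasks.iter_mut() {
        if task.status == TaskStatus::Running {
            task.status = TaskStatus::Failed;
            task.error = Some(RESTART_ERROR.to_string());
        }
    }
    tasks
        .iter()
        .filter(|t| t.status == TaskStatus::Queued)
        .map(|t| t.id.clone())
        .collect()
}
```

Because the reap does not consult any process id, a record left Running by a crashed session can never re-enter the queue, which is the property the existing restart test guards.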

THE REMAINING GAP (the issue's PRIMARY report: shell PID hang -> premature/incorrect LIVE state during a live turn): the in-turn liveness watchdog deliberately stands down whenever a tool cell is Running, so a hung tool produces a permanently-wrong LIVE state with no recovery.

  • The watchdog reconcile_turn_liveness (crates/tui/src/tui/ui.rs:4764-4847) runs each tick (ui.rs:2719). Branch 3 (the 300s TURN_STALL_WATCHDOG_TIMEOUT, ui.rs:163, 4809-4844) is explicitly gated by && !active_turn_has_running_tool(app) (ui.rs:4813). active_turn_has_running_tool (ui.rs:4924-4949) returns true while any active-cell tool is ToolStatus::Running. A hung exec_shell keeps its cell Running until ToolCallComplete (ui.rs:1715) arrives — which never comes — so the watchdog can never fire.
  • Liveness is tracked by turn_last_activity_at (tui/app.rs:1734), refreshed by record_turn_activity (ui.rs:1463, body 4913-4922) on EVERY engine event while loading. The footer projects this: "tools executing" vs "waiting - no recent activity" (tui/footer_ui.rs:148-159), and idle seconds via provider_wait_idle_secs (footer_ui.rs:163-168). So the header animation/"LIVE" feel and the work-list both key off events the engine is no longer producing.
  • The engine awaits each tool with NO per-tool heartbeat and NO per-tool timeout: serial at core/engine/turn_loop.rs:2129 and parallel at turn_loop.rs:1773, both calling execute_tool_with_lock (core/engine/tool_execution.rs:290-351). Only batch-boundary cancellation exists (turn_loop.rs:1700-1721, the #3216 freeze-cancel arm). While one tool future is mid-await, zero events flow, so turn_last_activity_at goes stale.
  • EngineEvent::ToolCallProgress { id, output } (core/events.rs:76-78) is #[allow(dead_code)] — nothing emits it — yet it is fully plumbed: record_turn_activity refreshes activity on it and ui.rs:2557-2559 renders it. This is the unused channel the fix should light up.
  • The foreground shell path is itself bounded and cancel-aware: execute_foreground_via_background (tools/shell.rs:1951-2031) clamps timeout to 1s..600s (shell.rs:1960), polls every 100ms observing cancel_token (shell.rs:1991-2002) and the deadline (shell.rs:2019-2027) -> returns TimedOut. So a single foreground exec_shell self-recovers in <=600s. The hangs that escape are: (a) background:true / task_shell_start jobs the model *_waits on, (b) interactive/PTY held by a grandchild, (c) the engine awaiting a tool whose own internal wait never returns — during ALL of which the UI emits nothing and the watchdog is muzzled by the running-tool guard.
  • ToolContext (crates/tui/src/tools/spec.rs) carries cancel_token but NOT tx_event nor the in-flight tool id, so a heartbeat cannot cleanly originate inside the shell tool; it must originate at the engine await site where the tool id, tx_event, and cancel_token are all in scope.

Design: Two-layer liveness/health probe. Layer A keeps the UI honest while a healthy-but-slow tool runs (emit progress so LIVE state and idle math stay accurate). Layer B guarantees recovery when a tool is genuinely hung (the watchdog must no longer be permanently muzzled). Both are additive and engine-side; the shell tool is untouched.

LAYER A — engine emits a tool-liveness heartbeat (lights up the dead ToolCallProgress channel)

  1. In core/engine/tool_execution.rs, add async fn execute_tool_with_heartbeat(...) wrapping the existing execute_tool_with_lock(...). Signature mirrors the call sites plus tool_id: String. Body:
    • let fut = Self::execute_tool_with_lock(...); tokio::pin!(fut);
    • let mut ticker = tokio::time::interval(TOOL_HEARTBEAT_INTERVAL); ticker.tick().await; (consume immediate first tick)
    • loop tokio::select! { biased; r = &mut fut => return r; _ = ticker.tick() => { let _ = tx_event.send(Event::ToolCallProgress { id: tool_id.clone(), output: String::new() }).await; } }
    • Add const TOOL_HEARTBEAT_INTERVAL: Duration = Duration::from_secs(10); (well under the 300s watchdog and the 30s dispatch watchdog; cheap).
    • Empty output is intentional: record_turn_activity only needs the event to exist to refresh turn_last_activity_at; ui.rs:2557 renders "Tool {id}: ..." and summarize_tool_output("") stays benign. Optionally pass a short elapsed string for nicer UX.
  2. Repoint BOTH await sites to the heartbeat wrapper, threading the tool id already in scope: serial turn_loop.rs:2129 (tool_id local) and the parallel task turn_loop.rs:1773 (plan.id). Remove #[allow(dead_code)] from ToolCallProgress (core/events.rs:77) since it now has a producer. Effect: during any in-flight tool (shell foreground poll, background *_wait, MCP, sub-agent fan-out), a heartbeat lands every 10s -> turn_last_activity_at stays fresh -> footer shows truthful "tools executing"/elapsed instead of drifting to "waiting - no recent activity" and the header animation/LIVE feel stays correct. This directly fixes the user-visible "TUI exits LIVE state due to prolonged inactivity / notifies Work list as done while Agent waits."
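
The shape of the Layer A wrapper can be tried out with std alone. The sketch below is a blocking analogue, not the tokio implementation the steps describe: it swaps `tokio::select!` plus `interval` for a worker thread and `recv_timeout`, and `run_with_heartbeat` and the one-variant `Event` enum are hypothetical stand-ins.

```rust
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::thread;
use std::time::Duration;

/// Stand-in for the engine event enum; only the progress variant is modelled.
#[derive(Debug, PartialEq)]
enum Event {
    ToolCallProgress { id: String, output: String },
}

/// Run `work` on a worker thread. While waiting for its result, send one
/// empty progress event each time `interval` elapses without a result.
/// Returns `None` only if the worker ended without sending (e.g. it panicked).
fn run_with_heartbeat<T, F>(
    tool_id: &str,
    interval: Duration,
    tx_event: &Sender<Event>,
    work: F,
) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx_result, rx_result) = mpsc::channel();
    thread::spawn(move || {
        let _ = tx_result.send(work());
    });
    loop {
        match rx_result.recv_timeout(interval) {
            // The result wins over the timer, so a fast tool emits nothing.
            Ok(result) => return Some(result),
            Err(RecvTimeoutError::Timeout) => {
                // Empty output on purpose: the event only needs to exist.
                let _ = tx_event.send(Event::ToolCallProgress {
                    id: tool_id.to_string(),
                    output: String::new(),
                });
            }
            Err(RecvTimeoutError::Disconnected) => return None,
        }
    }
}
```

The two properties the test plan asks for fall out of the loop structure: a tool that finishes before the first interval produces zero events, and no event can be sent after the result has been returned.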

LAYER B - make the watchdog able to recover a genuinely hung tool (bounded, never punishes a healthy long tool)

The fix is NOT to delete the !active_turn_has_running_tool guard (that would kill healthy long builds). Instead: a Running tool that has produced NO heartbeat AND no activity for a long, hard ceiling is treated as hung and recovered.

  3. Add a tool-level staleness ceiling distinct from the turn watchdog: const TOOL_HANG_WATCHDOG_TIMEOUT: Duration = Duration::from_secs(900); in tui/ui.rs (3x the heartbeat-fed turn watchdog; only reachable if Layer-A heartbeats have ALSO stopped, i.e. the engine task itself is wedged or the tool future is truly stuck with the runtime starved).
  4. Add Branch 4 to reconcile_turn_liveness (ui.rs, after the existing Branch 3, ~line 4844): same recovery body as Branch 3 (finalize streaming/active cells as interrupted, reset streaming state, clear is_loading/turn_started_at/turn_last_activity_at/runtime_turn_status/runtime_turn_id/dispatch_started_at, error toast "Tool stalled with no progress for 15m - recovered; the command may still be running in the background. Use exec_shell_cancel or retry.") but the trigger is: is_loading && status in_progress && !has_running_agents && active_turn_has_running_tool(app) && turn_last_activity_at (or turn_started_at) older than TOOL_HANG_WATCHDOG_TIMEOUT. This is the mirror of Branch 3 for the running-tool case. Because Layer-A heartbeats refresh turn_last_activity_at every 10s, a healthy long tool NEVER reaches this ceiling; only a tool whose heartbeat has also died does.
  5. Keep cancellation as the fast manual path: Esc already cancels the turn (turn_loop.rs:1700 batch-boundary arm + the foreground loop's cancel observation at shell.rs:1991-2002), so users retain an immediate out; Branch 4 is the unattended safety net.
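
The Branch 4 trigger can be written as a pure predicate, which makes the boundary cases easy to test. This is a sketch only: `TurnLiveness` is a hypothetical flattened view of the app fields named in the trigger, and reading "older than" as a strict comparison is an assumption.

```rust
use std::time::{Duration, Instant};

const TOOL_HANG_WATCHDOG_TIMEOUT: Duration = Duration::from_secs(900);

/// Hypothetical flattened view of the fields the Branch 4 trigger reads.
struct TurnLiveness {
    is_loading: bool,
    status_in_progress: bool,
    has_running_agents: bool,
    has_running_tool: bool,
    last_activity_at: Option<Instant>,
    turn_started_at: Option<Instant>,
}

/// True only for a loading, in-progress turn with a Running tool, no running
/// agents, and no activity for longer than the hang ceiling.
fn tool_hang_should_recover(s: &TurnLiveness, now: Instant) -> bool {
    let gated = s.is_loading
        && s.status_in_progress
        && !s.has_running_agents
        && s.has_running_tool;
    if !gated {
        return false;
    }
    // Fall back to the turn start when no activity has been recorded yet.
    match s.last_activity_at.or(s.turn_started_at) {
        Some(t) => now.saturating_duration_since(t) > TOOL_HANG_WATCHDOG_TIMEOUT,
        None => false,
    }
}
```

Keeping the decision separate from the recovery body means the existing Branch 3 path stays untouched and the new branch can be exercised with synthetic instants.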

Ordering: implement Layer A first (small, makes the common case correct and is independently shippable), then Layer B (the hard-hang net). Layer A also de-risks Layer B by making the 15m ceiling effectively unreachable for healthy work.

Notes for the implementer: do not add a per-tool tokio::time::timeout that kills the tool future — shells already self-bound (600s) and a hard kill of a legitimately long build/test would regress; the heartbeat + watchdog approach surfaces and recovers UI state without severing in-flight work. No pid liveness probe is needed: TaskRecord reaping (task_manager.rs:1516) already handles cross-session orphans unconditionally; the in-turn problem is a UI/engine liveness-signal gap, not a PID-tracking gap.

Builds on: resource_telemetry.rs (#2666) PressureLevel/elapsed formatting can format the heartbeat's optional elapsed string (e.g. reuse its duration formatter so the ToolCallProgress text reads like the rest of the footer). goal_loop.rs decide_continuation (#3215) is the conceptual sibling — both bound a loop so it cannot spin forever — but the goal loop bounds CONTINUATIONS, not in-flight tool liveness, so this is a parallel, independent watchdog (cite it for design symmetry, no code dependency). The #3216 freeze-cancel arm at turn_loop.rs:1700 is the prior art this extends: that recovers at batch boundaries on explicit cancel; Layer B adds the unattended-timeout recovery for the mid-tool case it doesn't cover. No dependency on worker_profile/model_registry/provider_readiness/context_budget/provider_adapter/request_tuning.

Files: /Volumes/VIXinSSD/codewhale/crates/tui/src/core/engine/tool_execution.rs, /Volumes/VIXinSSD/codewhale/crates/tui/src/core/engine/turn_loop.rs, /Volumes/VIXinSSD/codewhale/crates/tui/src/core/events.rs, /Volumes/VIXinSSD/codewhale/crates/tui/src/tui/ui.rs, /Volumes/VIXinSSD/codewhale/crates/tui/src/tui/ui/tests.rs

Test plan: Layer A (heartbeat) — new unit tests in core/engine (alongside tool_execution.rs or an engine tests module):

  1. heartbeat_emitted_for_slow_tool: wrap a future that sleeps > TOOL_HEARTBEAT_INTERVAL behind execute_tool_with_heartbeat with a tokio mpsc tx; assert >=1 Event::ToolCallProgress with the supplied id arrives before the result, and the final Ok(result) is returned unchanged. Use tokio::time::pause()/advance to make it deterministic.
  2. no_heartbeat_for_fast_tool: a future that resolves immediately emits zero ToolCallProgress events.
  3. heartbeat_stops_on_completion: after the result resolves, no further ToolCallProgress events are sent (ticker dropped).

Layer B (watchdog) - extend crates/tui/src/tui/ui/tests.rs (mirrors existing reconcile_turn_liveness tests at lines 3160-3280):

  4. running_tool_with_recent_heartbeat_does_not_recover: is_loading + in_progress + a Running tool cell + turn_last_activity_at = now-10s -> reconcile_turn_liveness returns false (healthy long tool protected). Asserts Layer A keeps it alive.
  5. running_tool_stale_past_hang_ceiling_recovers: same but turn_last_activity_at = now - TOOL_HANG_WATCHDOG_TIMEOUT - 1s -> returns true, is_loading cleared, runtime_turn_status cleared, an error toast pushed, and the active tool cell finalized as interrupted (assert via active_turn_has_running_tool(app) == false afterward).
  6. running_tool_just_under_ceiling_does_not_recover: now - TOOL_HANG_WATCHDOG_TIMEOUT + 5s -> false (boundary).
  7. existing Branch-3 tests (no running tool) must still pass unchanged: a regression guard that the new branch didn't alter the no-tool path.

Integration / manual (matches issue repro): run a tdd-style skill whose step backgrounds a long command and *_waits, or exec_shell background:true "sleep 600" then a wait; observe the footer stays "tools executing" with advancing elapsed (Layer A) instead of going stale, and that Esc still cancels promptly. Simulate a true hang (e.g. SIGSTOP the child / drop the heartbeat) and confirm Branch 4 recovers the UI at the 15m ceiling with the toast.

Gates: cargo test -p codewhale-tui (or the crate's package name from crates/tui/Cargo.toml) for the new tests; cargo clippy -p codewhale-tui (the wrapper must not introduce a hot-loop/clippy lint); cargo fmt --check.

Risks:

  • Heartbeat noise: emitting ToolCallProgress every 10s updates app.status_message (ui.rs:2558) and is fed to record_turn_activity. Mitigation: empty/short output, 10s cadence is far below any redraw budget; ToolCallProgress is already in the redraw-permitted set (ui.rs:8540) and agent_progress style throttling exists (ui.rs:4854). Verify it doesn't spam the transcript: it only sets status_message, not a HistoryCell, so it won't.

  • Parallel batch: the heartbeat wrapper is applied per tool future in the FuturesUnordered (turn_loop.rs:1767-1784); multiple concurrent tools each tick independently -> multiple ToolCallProgress with distinct ids. That's correct (each refreshes activity) but means N events/10s; fine at MAX_PARALLEL_SHELL_EXEC scale.
  • Watchdog false-positive on a legitimately 15m+ silent tool that ALSO stopped heartbeating: only possible if the engine task itself is wedged (heartbeat is a tokio timer on the same task that awaits the tool — if that task is starved, both stop together, which is exactly the hang we want to recover). Recovery message says "may still be running in the background" and points to exec_shell_cancel, so it's non-destructive to the actual process; it only corrects UI/turn state. Acceptable.
  • Heartbeat is on the same task as the tool await, so a fully CPU-blocked (non-yielding) synchronous tool on the runtime thread would also block the timer. Existing code already routes blocking work via spawn_blocking (e.g. hooks at turn_loop.rs:1498, shell wait_timeout note at turn_loop.rs:1495); shells run via async background polling. So the heartbeat fires for the realistic hang shapes. If a future tool blocks the thread synchronously, Layer B (15m) is still the backstop. Document this boundary in the wrapper doc-comment.
  • Constant tuning: TOOL_HEARTBEAT_INTERVAL (10s) must stay < TURN_STALL_WATCHDOG_TIMEOUT (300s) and < DISPATCH_WATCHDOG_TIMEOUT (30s); TOOL_HANG_WATCHDOG_TIMEOUT (900s) must stay > TURN_STALL_WATCHDOG_TIMEOUT. Encode the relationship with a debug_assert or a comment so a later edit can't invert them.
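
One way to encode the ordering so a later edit cannot invert it is a compile-time assertion, offered here as an alternative to the debug_assert or comment suggested above. The constant names and values are the ones quoted in this section; declaring them together in one module is an assumption made so the sketch is self-contained.

```rust
use std::time::Duration;

const TOOL_HEARTBEAT_INTERVAL: Duration = Duration::from_secs(10);
const DISPATCH_WATCHDOG_TIMEOUT: Duration = Duration::from_secs(30);
const TURN_STALL_WATCHDOG_TIMEOUT: Duration = Duration::from_secs(300);
const TOOL_HANG_WATCHDOG_TIMEOUT: Duration = Duration::from_secs(900);

// Evaluated at compile time: an edit that breaks any of these relationships
// fails the build instead of silently muting or over-triggering a watchdog.
const _: () = {
    assert!(TOOL_HEARTBEAT_INTERVAL.as_secs() < DISPATCH_WATCHDOG_TIMEOUT.as_secs());
    assert!(TOOL_HEARTBEAT_INTERVAL.as_secs() < TURN_STALL_WATCHDOG_TIMEOUT.as_secs());
    assert!(TOOL_HANG_WATCHDOG_TIMEOUT.as_secs() > TURN_STALL_WATCHDOG_TIMEOUT.as_secs());
};
```

Unlike debug_assert, this check also covers release builds and needs no code path to run. If the constants stay in separate modules, the assertion would have to live wherever all of them are visible.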

#3212 — effort: medium

Current state: The substrate the issue describes is largely present on codex/v0.8.61; the missing slice is (a) safer wait defaults, (b) "returns immediately / you'll be notified" wording, and (c) an automatic completion-notification path for background shell/verifier jobs (today only sub-agents get this).

What EXISTS:

  • exec_shell supports background=true, returns a task_id immediately (crates/tui/src/tools/shell.rs:2058 schema, :2415-2422 result text), and is input-aware for parallel read-only commands (shell.rs:1918-1936 exec_shell_input_is_parallel_readonly; :2099-2117 approval_requirement_for/is_read_only_for/supports_parallel_for/starts_detached_for). Parallel read-only concurrency is already proven (shell/tests.rs:151 exec_shell_parallel_flags_are_input_aware).
  • task_shell_start wraps background shell + tags durable task (tasks.rs:397-460). run_verifiers already has background=true that starts gates as background shell jobs and returns task_ids (verifier.rs:100,263-267,325-326,392 start_background_gates; tests at verifier.rs:1135,1251).
  • Cancel-safe waits already hold: exec_shell_wait via wait_for_shell_delta_cancellable returns on cancel WITHOUT killing the job and sets wait_canceled (shell.rs:2623-2694, :2912-2924). execute_foreground_via_background supports foreground->background detach on request_foreground_background (shell.rs:1951-2031; ShellManager flag at shell.rs:763,830-842).
  • Timeout recovery already prefers background, not "longer timeout": FOREGROUND_TIMEOUT_RECOVERY_HINT (shell.rs:1784-1786) plus metadata.foreground_timeout_recovery (shell.rs:2482-2495).
  • agent_eval already defaults block=false (subagent/mod.rs:3571-3573 schema, :3601-3604 logic).
  • Automatic completion injection EXISTS for sub-agents: the end-of-turn tool_uses.is_empty() block drains rx_subagent_completion and injects subagent_completion_runtime_message runtime events, then resumes the turn (turn_loop.rs:1065,1078-1175; message builder at :2468-2492; channel at engine.rs:530-534,680,836,1830).

The GAPS (what to build):

  1. exec_shell_wait defaults wait=true (shell.rs:2871-2874 schema, :2894 logic) — opposite of the issue's "safer default = nonblocking snapshot". It is also not visibly distinguished from a barrier. (task_shell_wait already defaults wait=false, tasks.rs:477.)
  2. No tool result/description states the contract "returns immediately; you will be notified automatically when done; do not poll/wait unless you need early output." Background-start text is terse ("Background task started: ", shell.rs:2421).
  3. NO automatic transcript event when a background shell/verifier job finishes. ShellManager.list_jobs() only feeds the sidebar and /jobs (ui.rs:1006,1031,7171; sidebar.rs:908-912). The engine holds shell_manager: SharedShellManager (engine.rs:519) and tx_event but never drains finished jobs into the turn the way it drains sub-agent completions, so the model is forced to poll, which is the exact UX pain.

Design: Five focused changes. Order: 1-2 are independent text/default flips; 3 is the core notification path; 4-5 wire it in and adjust the timeout hint.

  1. Flip exec_shell_wait to a nonblocking snapshot default + barrier wording (shell.rs).
  • In ShellWaitTool::input_schema (shell.rs:2859) change wait default to false; description: "Snapshot the latest background output and return immediately (default). Background jobs notify the transcript automatically on completion, so you normally do NOT need to wait. Set wait=true ONLY for a deliberate barrier at a true dependency or final gate."
  • In execute (shell.rs:2894) change optional_bool(&input, "wait", true) -> ..., false.
  • Update the tool description string (shell.rs:2855-2857) to say it inspects progress by default and only blocks with wait=true. (task_shell_wait already routes through ShellWaitTool::new("exec_shell_wait") at tasks.rs:496 and passes its own wait default through input, so it stays correct.)
  2. Add the "returns immediately / auto-notify" contract to background-start results (shell.rs).
  • In ExecShellTool::execute, the ShellStatus::Running branches (shell.rs:2415-2422): append to both messages a sentence: "Returns immediately; you will be notified in the transcript when it finishes. Keep working; only call exec_shell_wait if you need early output, or with wait=true at a true dependency." Set metadata["auto_notify_on_completion"]=true and metadata["background_policy"]="nonblocking" next to metadata["backgrounded"] (shell.rs:2481).
  • Mirror the same metadata flag in run_verifiers start_background_gates output (verifier.rs:392+, near verifier_background at :504) and in RunVerifiersBackgroundOutput.summary (verifier.rs:473): add "...you will be notified automatically when each gate finishes; continue inspecting/implementing while they run."
  • Tighten the background description in exec_shell schema (shell.rs:2060) to add "Returns immediately and notifies on completion; default for builds, test suites, servers, broad searches, polling, sleeps, and long diagnostics."
  3. Core: emit a transcript event when a background shell/verifier job finishes (shell.rs).
  • Add field completion_reported: bool to BackgroundShell (shell.rs:475-495); init false at construction (shell.rs:1408-1427, beside stdout_cursor: 0).
  • Add a public type ShellCompletionEvent { task_id: String, command: String, status: ShellStatus, exit_code: Option<i32>, duration_ms: u64, stdout_tail: String, stderr_tail: String, linked_task_id: Option<String> } (near ShellJobSnapshot, shell.rs:99).
  • Add ShellManager::drain_finished_jobs(&mut self) -> Vec<ShellCompletionEvent>: poll all processes (reuse the loop in list_jobs, shell.rs:1636-1639), then, for each job whose status != Running and !completion_reported, set completion_reported = true and build a ShellCompletionEvent from job_snapshot() data (tails via existing tail_from_buffer). This dedupes so a job is reported exactly once. Place it as a ShellManager method right after list_jobs (shell.rs:1655). Crucially this is additive: list_jobs, /jobs, and the sidebar continue to read snapshots unchanged.
  4. Drain finished shell jobs into the turn, mirroring sub-agent completions (turn_loop.rs + engine.rs).
  • Add a free fn shell_completion_runtime_message(events: &[ShellCompletionEvent]) -> Message next to subagent_completion_runtime_message (turn_loop.rs:2468). Reuse the exact role="user" + <codewhale:runtime_event kind="shell_completion" visibility="internal"> envelope (same template-compat reasoning). Body: a compact list "task () -> exit= in ms" plus the stderr/stdout tail when failed, and the instruction "Use this to continue; it replaces manual exec_shell_wait polling."
  • In the end-of-turn tool_uses.is_empty() block (turn_loop.rs:1065), BEFORE the sub-agent drain at :1086, add: let shell_done = { if let Ok(mut m) = self.shell_manager.lock() { m.drain_finished_jobs() } else { Vec::new() } }; and if non-empty, self.add_session_message(shell_completion_runtime_message(&shell_done)).await; turn.next_step(); continue; (so a finished build/test/verifier resumes the turn automatically). Also drain once more in the late-completion path (turn_loop.rs:1279-1297) so a job finishing during inference isn't missed.
  • engine.rs already exposes self.shell_manager (engine.rs:519) and add_session_message, so no new channel is required — drain-on-pull is simpler and avoids a producer thread. (Optional future: a tx channel for mid-turn notification; not needed for AC.)
  5. Keep the timeout hint as-is (already correct, shell.rs:1784-1786, :2482-2495): it already says "restart in background and continue," satisfying that AC. Just verify the new wording in (2) doesn't duplicate it.
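
The report-once behaviour at the core of step 3 can be sketched in isolation. The types below are reduced stand-ins, not the definitions in shell.rs: the `Failed` variant is a hypothetical placeholder for any non-Running terminal status, the event carries only three of the proposed fields, and process polling is left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ShellStatus {
    Running,
    Completed,
    Failed, // hypothetical stand-in for any non-Running terminal status
}

/// Reduced stand-in for the background job record.
struct BackgroundShell {
    task_id: String,
    status: ShellStatus,
    exit_code: Option<i32>,
    completion_reported: bool,
}

/// Reduced stand-in for the proposed completion event.
#[derive(Debug, PartialEq)]
struct ShellCompletionEvent {
    task_id: String,
    status: ShellStatus,
    exit_code: Option<i32>,
}

#[derive(Default)]
struct ShellManager {
    jobs: Vec<BackgroundShell>,
}

impl ShellManager {
    /// Return each finished job exactly once; later calls skip it because
    /// the flag is flipped in the same pass that builds the event.
    fn drain_finished_jobs(&mut self) -> Vec<ShellCompletionEvent> {
        let mut events = Vec::new();
        for job in self.jobs.iter_mut() {
            if job.status != ShellStatus::Running && !job.completion_reported {
                job.completion_reported = true;
                events.push(ShellCompletionEvent {
                    task_id: job.task_id.clone(),
                    status: job.status,
                    exit_code: job.exit_code,
                });
            }
        }
        events
    }
}
```

Since the method takes `&mut self`, every caller has to hold the manager's lock while draining, so calling it from both the end-of-turn block and the late-completion path cannot inject the same job twice.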

Non-goals guarded: do NOT change agent_eval block default (already false, subagent/mod.rs:3604); do NOT make foreground verifier gates the default (run_verifiers stays foreground unless background=true). Permission profiles (worker_profile.rs background field at :124, intersected in derive_child :212) are unchanged; this issue is execution defaults, #3211 owns profile gating.

Builds on:

  • worker_profile.rs WorkerRuntimeProfile.background (:124) + derive_child intersection (:180-212): the new auto-notify path respects the child's foreground/background preference; no escalation introduced. (#3217/#3211)

  • goal_loop.rs decide_continuation / goal_continuation_message_if_needed: the new shell_completion runtime message resumes the turn the same way sub-agent completions do, feeding the existing goal-continuation logic rather than bypassing it. (#3215)
  • resource_telemetry.rs / context_budget.rs PressureLevel: the completion event carries duration_ms; can later be folded into telemetry rows, but not required here. (#2666/#3086)
  • Existing sub-agent completion machinery (turn_loop.rs:1078-1175,2468; engine.rs:530-534) is the direct template for the new shell-completion injection.

Files: crates/tui/src/tools/shell.rs, crates/tui/src/tools/verifier.rs, crates/tui/src/core/engine/turn_loop.rs, crates/tui/src/tools/shell/tests.rs, crates/tui/src/tools/tasks.rs

Test plan: Unit (crates/tui/src/tools/shell/tests.rs — has env_lock, sleep/echo command helpers, ToolContext::new, wait_for_completed_shell):

  1. exec_shell_wait_defaults_to_snapshot: assert ShellWaitTool::new("exec_shell_wait").input_schema()["properties"]["wait"]["default"] == false; start a sleep job via ExecShellTool background=true, immediately call exec_shell_wait with no wait arg, assert it returns status Running quickly (does not block) and metadata.stream_delta true.
  2. exec_shell_wait_barrier_blocks: same job, call with wait=true and a small timeout; assert it blocks then returns.
  3. background_start_advertises_auto_notify: ExecShellTool background=true result metadata has auto_notify_on_completion==true and content contains "notified". Same assertion for run_verifiers background=true (extend verifier.rs:1251 test to check metadata.auto_notify_on_completion and summary wording).
  4. drain_finished_jobs_reports_once: ShellManager::new in tmp; spawn sleep 0/echo background; poll until completed (wait_for_completed_shell); first drain_finished_jobs() returns 1 event with correct task_id+status Completed; second call returns empty (dedupe via completion_reported).
  5. drain_finished_jobs_ignores_running: spawn a long sleep; drain returns empty while Running.
  6. cancel_does_not_kill_then_completes: existing cancel-safe behavior (shell.rs:2912) — assert wait_canceled job is still Running, then later appears exactly once in drain_finished_jobs.
  7. parallel readonly unchanged: keep/extend shell/tests.rs:151 to confirm flips didn't alter approval/parallel flags.

Engine/turn (turn_loop.rs tests, mirror subagent_completion_handoff_is_internal_user_message at :2762): shell_completion_runtime_message produces role=="user" with kind="shell_completion" and visibility="internal" and includes the task id + status; a unit test asserting envelope shape (full turn-resume is covered structurally by reusing the proven sub-agent path).

Gates: cargo fmt; cargo clippy -p codewhale-tui -- -D warnings; cargo test -p codewhale-tui tools::shell tools::verifier core::engine::turn_loop.

Risks:

  • Double-notification: if drain_finished_jobs is called from multiple turn-loop sites without the completion_reported flag, a job could be injected twice. Mitigation: the flag is set inside drain under the ShellManager mutex, so all call sites are deduped.

  • Missed notification: list_jobs already evicts completed jobs >1h (cleanup, shell.rs:1689-1698). A job that finishes and is evicted before the next end-of-turn drain would be lost. Risk is low (turns end far more often than 1h) but the drain must run on every turn end AND in the late-completion path (turn_loop.rs:1279). Optionally skip cleanup until after first drain.
  • Changing exec_shell_wait default from wait=true to false is a behavior change for any caller that relied on implicit blocking. task_shell_wait passes wait through and already defaults false, so the durable path is unaffected; the model is explicitly taught (description) to set wait=true for barriers. Existing tests that assumed blocking must be updated.
  • Lock contention: drain_finished_jobs polls all processes under the mutex on every turn end; this is the same cost already paid by list_jobs in the UI loop, and uses tail (not full_output), so no new O(total_bytes) hazard.
  • Prompt-channel discipline: the new runtime_event must reuse role="user" + visibility="internal" (not "system") to avoid the vLLM/Qwen "system message must be first" 400 documented at turn_loop.rs:2469-2476.

#1541 — effort: small

Current state: The cancellation-reason groundwork is ALREADY LANDED on branch codex/v0.8.61 (working tree under crates/tui/src — note the repo root is /Volumes/VIXinSSD/codewhale, NOT npm/codewhale; "core/engine" in the prompt = crates/tui/src/core/engine/). What exists:

  • The taxonomy. CancelReason { User, External, Preempted, Internal } + describe() is defined at crates/tui/src/core/engine.rs:455-481. Every variant except User is #[allow(dead_code)] — they have NO call site yet. describe() strings already match the acceptance criteria verbatim ("request cancelled by external caller" :476, "request was preempted by a new turn" :477).
  • The plumbing. EngineHandle::cancel_with_reason(reason) latches the reason then cancels the token; cancel() delegates to cancel_with_reason(CancelReason::User) (crates/tui/src/core/engine/handle.rs:26-42). Public cancel() signature is preserved. The reason latch is stored twice and mirrored: handle side engine.rs:495, engine side engine.rs:541; it is CLEARED on each fresh turn in reset_cancel_token (engine.rs:596-602). So the latch is read-once-per-turn and self-resets.
  • The only consumer today. The latched reason is surfaced ONLY in the approval / user-input waits, via cancel_reason_suffix() (crates/tui/src/core/engine/approval.rs:60-69), appended at approval.rs:78 (await_tool_approval) and approval.rs:125 (await_user_input). It produces " (reason: )" on the "Request cancelled while awaiting approval/user input" error.

THE GAP (what #1541 asks for):

  1. External not wired. The runtime HTTP interrupt path calls the plain (User) cancel: RuntimeThreadManager::interrupt_turn does active_thread.engine.cancel() at crates/tui/src/runtime_threads.rs:1832. The engine field IS a crate::core::engine::EngineHandle (declared runtime_threads.rs:755, imported :31), so cancel_with_reason is already in scope. HTTP entry point: interrupt_thread_turn (crates/tui/src/runtime_api.rs:2449-2459) routed as POST /v1/threads/{id}/turns/{turn_id}/interrupt (route table runtime_api.rs ~647). This is the prime, one-line External fix.
  2. Internal not wired. Op::CancelRequest cancels the raw token with no reason: engine.rs:1177-1180 (self.cancel_token.cancel(); self.reset_cancel_token();). Engine teardown/drop and channel-close also use the bare token. None stamp Internal.
  3. Preempted has no trigger. The TUI does NOT preempt — Enter-while-busy parks on app.queued_messages, drained AT TurnComplete (ui.rs:7510 comment + ui.rs:5103-5109 drain). So in the TUI there is no live preemption to stamp. A genuine preemption exists only if the runtime starts a turn on a thread with an active turn (runtime_threads start-turn path). Honest call: wire Preempted there if such a guard exists, else leave it reserved and documented.
  4. task_manager is a SEPARATE subsystem. task_manager.rs:949 cancels a standalone CancellationToken pulled from running_cancel: HashMap<String, CancellationToken> (task_manager.rs:719) — NOT an EngineHandle. So "task manager stop" in the issue body does NOT currently flow through CancelReason and cannot be fixed by a one-liner; it needs the EngineHandle (or a reason-carrying token) threaded into the task runner. Treat as out-of-scope/follow-up unless that handle is available.
  5. UI cancels are correctly User. Ctrl+C (ui.rs:3755), Esc (ui.rs:3824, 3832), Abort-from-review (ui.rs:8507), and the TurnStarted re-assert race (ui.rs:1454) are all genuine user actions — they must keep calling .cancel(). No change.
  6. Reason never reaches the transcript. Cancellation yields TurnOutcomeStatus::Interrupted with error: None at every turn_loop arm (turn_loop.rs:83, 410, 900, 1114, 1351, 2382). The main turn end folds (status, error) into Event::TurnComplete at engine.rs:1918-1949. The UI consumes it at ui.rs:1816-1900: it maps Interrupted -> the string "interrupted" (ui.rs:1880-1888) and flatly labels cells "[interrupted]" (app.rs:3396 finalize_active_cell_as_interrupted, app.rs:5053-5062 finalize_streaming_assistant_as_interrupted), DISCARDING any cause. So even once External/Internal are stamped, the only place a non-approval user sees the reason today is the approval-wait error string. CancelReason is single-crate (no copies in protocol/core/tui-core/app-server — verified).

Design: Build strictly ON the landed groundwork (the enum, the latch, cancel_with_reason, cancel_reason_suffix). Do not re-architect. Land in 4 ordered, independently-reviewable slices. Slices A and B satisfy the acceptance criteria; C is the UI surfacing the issue title calls for; D is honest scoping.

SLICE A — External (the headline fix; ~1 line + test). File: crates/tui/src/runtime_threads.rs.

  • In interrupt_turn (around :1819-1846) replace active_thread.engine.cancel(); (:1832) with active_thread.engine.cancel_with_reason(crate::core::engine::CancelReason::External);. Import CancelReason alongside the existing use crate::core::engine::{EngineConfig, EngineHandle, spawn_engine}; at :31.
  • Effect: when the engine is awaiting approval/input, cancel_reason_suffix() now appends " (reason: request cancelled by external caller)" exactly as the AC requires. No new types are needed: the latch and the suffix already do the work.
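
The latch-plus-suffix mechanism that Slice A relies on can be sketched in isolation. This is a minimal std-only illustration, not the code in engine.rs or handle.rs: HandleSketch is a hypothetical stand-in for EngineHandle, the real handle also fires a cancellation token, and only the External text is quoted in this design (the other describe() strings are assumed placeholders).

```rust
use std::sync::{Arc, Mutex};

/// Why a turn was cancelled. Variant names follow the design text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CancelReason {
    User,
    External,
    Internal,
    Preempted,
}

impl CancelReason {
    pub fn describe(self) -> &'static str {
        match self {
            CancelReason::User => "request cancelled by user", // assumed wording
            CancelReason::External => "request cancelled by external caller",
            CancelReason::Internal => "request cancelled by engine shutdown", // assumed wording
            CancelReason::Preempted => "request preempted by a newer turn", // assumed wording
        }
    }
}

/// Hypothetical stand-in for EngineHandle: a single-slot latch shared with the engine.
#[derive(Clone, Default)]
pub struct HandleSketch {
    reason: Arc<Mutex<Option<CancelReason>>>,
}

impl HandleSketch {
    /// A bare cancel stays a user cancel, so existing callers keep their meaning.
    pub fn cancel(&self) {
        self.cancel_with_reason(CancelReason::User);
    }

    /// Stamp the reason. The real handle would also cancel the token here.
    pub fn cancel_with_reason(&self, reason: CancelReason) {
        *self.reason.lock().unwrap() = Some(reason);
    }

    /// Suffix appended to the approval-wait error; empty when nothing is latched.
    pub fn cancel_reason_suffix(&self) -> String {
        match *self.reason.lock().unwrap() {
            Some(reason) => format!(" (reason: {})", reason.describe()),
            None => String::new(),
        }
    }
}
```

With this shape, the interrupt_turn change is a single call: handle.cancel_with_reason(CancelReason::External) instead of handle.cancel().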

SLICE B — Internal (teardown/channel-close). File: crates/tui/src/core/engine.rs.

  • At the Op::CancelRequest handler (engine.rs:1177-1180): this Op is the in-process cancel mailbox. It is NOT teardown per se, and the issue scopes Internal to "drop, channel close, shutdown". Decision: leave Op::CancelRequest as-is (it is the generic cancel mailbox, and callers who want a reason use cancel_with_reason on the handle). Instead, stamp Internal at the genuine teardown sites. Concretely: add a tiny private helper on Engine, fn latch_cancel_reason(&self, reason: CancelReason) (mirrors the lock pattern in reset_cancel_token at engine.rs:599-602) that writes *self.cancel_reason.lock() = Some(reason). Call it immediately BEFORE the bare-token cancels that represent teardown, i.e. the engine-side self.cancel_token.cancel() paths that fire on shutdown/drop. The await_tool_approval channel-closed arm (approval.rs:84-91) and await_user_input channel-closed arm (approval.rs:145-149) already emit a teardown-specific message. They could optionally be routed through CancelReason::Internal.describe() for consistency, but they are reached via recv()==None (not the token), so they do NOT read the latch. Leave their bespoke strings: they are already correct and clearer than the generic Internal text. Net: Internal is stamped at the shutdown token-cancel; channel-close races keep their existing distinct messages (which is exactly the AC's "where it can be distinguished from user cancellation").
  • Honest note for implementer: confirm the exact shutdown site. The cancel at engine.rs:1178 is reset immediately (:1179) and is the user/runtime CancelRequest mailbox, so it should NOT be force-stamped Internal (that would mislabel a runtime cancel that arrives as an Op). The Internal stamp belongs only on Drop/Shutdown. If no distinct Drop path cancels the token (search shows none beyond reset_cancel_token and the Op), then Internal stays reserved and you document that the channel-closed arms already cover the teardown-race case — do not invent a stamp just to use the variant.

SLICE C — Surface the reason in the transcript/UI (the issue title: "Track ... through engine call sites"). This is the piece that makes External/Internal visible to a non-approval user.

  • Carry the reason out of the turn. In crates/tui/src/core/engine/turn_loop.rs, the cancel arms return (TurnOutcomeStatus::Interrupted, None). Add a private method on Engine fn interrupted_outcome(&self) -> (TurnOutcomeStatus, Option<String>) that reads the latch (reuse cancel_reason_suffix logic, or read self.cancel_reason.lock()) and returns (Interrupted, Some(reason.describe().to_string())) when a non-User reason is latched, else (Interrupted, None) (User cancellation stays quiet — current behavior preserved). Replace the six literal return (TurnOutcomeStatus::Interrupted, None); sites (turn_loop.rs:83, 410, 900, 1114, 1351, 2382) with return self.interrupted_outcome();. The value flows untouched through engine.rs:1918 into Event::TurnComplete.error (engine.rs:1942-1948).
  • Render it. In crates/tui/src/tui/ui.rs TurnComplete handler (:1816-1900): where Interrupted is handled (the finalize_*_as_interrupted calls at :1846/:1851 and the status map at :1880-1888), if error is Some(reason) on an Interrupted turn, set app.status_message = Some(format!("Request cancelled: {reason}")) (or thread it into the cell label). Minimal change: one if let Some(reason) = &error { app.status_message = Some(...) } in the Interrupted branch. Optionally enrich app.finalize_streaming_assistant_as_interrupted (app.rs:5053-5062) to accept an optional reason and render "[interrupted: ...]" (the reason text after the colon) instead of bare "[interrupted]", but keep the no-reason path byte-identical so User cancels are unchanged.
  • Runtime API surfacing: interrupt_turn returns a TurnRecord (runtime_api.rs:2449-2459). No change needed for the AC, but the emitted turn.interrupt_requested event (runtime_threads.rs:1836-1843) MAY carry the reason in its json payload for API observability — additive, optional.
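
The Slice C mapping can be sketched as two pure functions. This is an illustration under assumptions: the real interrupted_outcome is described as a method on Engine that reads the latch itself, whereas here the latched value is passed in; TurnOutcomeStatus is reduced to two variants; interrupted_status_message is a hypothetical helper name; and only the External describe() text comes from this design.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CancelReason {
    User,
    External,
    Internal,
    Preempted,
}

impl CancelReason {
    pub fn describe(self) -> &'static str {
        match self {
            CancelReason::User => "request cancelled by user", // assumed wording
            CancelReason::External => "request cancelled by external caller",
            CancelReason::Internal => "request cancelled by engine shutdown", // assumed wording
            CancelReason::Preempted => "request preempted by a newer turn", // assumed wording
        }
    }
}

/// Reduced stand-in for the engine's turn outcome status.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TurnOutcomeStatus {
    Completed,
    Interrupted,
}

/// User cancels (and an empty latch) stay quiet; every other reason is carried out.
pub fn interrupted_outcome(latched: Option<CancelReason>) -> (TurnOutcomeStatus, Option<String>) {
    let error = match latched {
        None | Some(CancelReason::User) => None,
        Some(reason) => Some(reason.describe().to_string()),
    };
    (TurnOutcomeStatus::Interrupted, error)
}

/// UI side: only produce a status message when a reason was carried out.
pub fn interrupted_status_message(error: Option<&str>) -> Option<String> {
    error.map(|reason| format!("Request cancelled: {reason}"))
}
```

Keeping the User arm next to None in one match arm makes the "quiet user cancel" rule explicit and hard to break by accident.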

SLICE D — Preempted + task_manager (scope honestly).

  • Preempted: only stamp it where a real preemption exists. Audit the runtime start-turn path (runtime_threads.rs start/steer around the active_turn guard). If starting a turn while active_turn.is_some() cancels the prior turn, stamp CancelReason::Preempted there. If (as the TUI does) the system queues instead of preempting, DO NOT fabricate a call site — leave Preempted reserved and add a doc comment at the enum (engine.rs:463-466) pointing to where it would be wired. Do not touch the TUI queue path.
  • task_manager: explicitly out of scope for the latch (it owns a standalone CancellationToken, not an EngineHandle — task_manager.rs:719, 949). Document as a follow-up: to honor "task manager stop -> External" the task runner must hold the EngineHandle (or carry a reason-aware token wrapper). Note it in the issue; do not force a fake bridge.

Ordering: A (trivial, high-value, unblocks the External AC) -> C (makes A/Internal visible) -> B (Internal, only if a real teardown site exists) -> D (doc/scoping). A and C are the must-haves.

Builds on: Directly builds on the landed cancellation-reason groundwork on codex/v0.8.61: the CancelReason enum + describe() (crates/tui/src/core/engine.rs:455-481), EngineHandle::cancel_with_reason + the User-preserving cancel() (crates/tui/src/core/engine/handle.rs:26-42), the mirrored cancel_reason latch with per-turn reset (engine.rs:495/541/596-602), and cancel_reason_suffix() in approval.rs:60-69. It also rides the turn_loop freeze-cancel / between-batch cancellation observation from #3216/#2211 (turn_loop.rs:1688-1711) — that work already returns Interrupted at the cancel arms this design enriches. No dependency on goal_loop, worker_profile, model_registry, provider_readiness, context_budget, provider_adapter, resource_telemetry, or request_tuning — those foundations are orthogonal to cancellation-reason propagation.

Files: crates/tui/src/runtime_threads.rs, crates/tui/src/core/engine/turn_loop.rs, crates/tui/src/tui/ui.rs, crates/tui/src/core/engine.rs, crates/tui/src/core/engine/tests.rs, crates/tui/src/tui/app.rs

Test plan: All tests are Rust unit/integration tests in-tree; run with cargo test -p codewhale-tui (workspace is the npm-adjacent root; the crate is codewhale-tui per target/package naming). Honesty: no file edits performed here — this is the test list a follow-up implementer should add/run.

  1. External stamp (Slice A) — extend crates/tui/src/runtime_threads.rs tests. There is already interrupt_turn_marks_interrupted_after_cleanup (runtime_threads.rs:4457) and an interrupt test at :4518 asserting the turn.interrupt_requested event (:4531). Add a sibling that, after interrupt_turn, asserts the engine's latched reason is External. Easiest hook: assert via a new test-only accessor mirroring handle.is_cancelled (handle.rs:47) — add #[cfg(test)] fn cancel_reason(&self) -> Option<CancelReason> on EngineHandle reading the latch — then assert == Some(CancelReason::External). Mirrors existing engine_handle_cancel_tracks_latest_turn_token (tests.rs:297-308).

  2. Reason suffix end-to-end (Slice A + existing approval path) — in crates/tui/src/core/engine/tests.rs, drive an engine to the await_tool_approval state, call handle.cancel_with_reason(CancelReason::External), and assert the returned ToolError message contains "request cancelled by external caller". This proves describe() -> suffix -> approval.rs:78 wiring with the External variant (the User path is already implicitly covered).

  3. Interrupted carries reason (Slice C) — unit-test the new Engine::interrupted_outcome: with no latch -> (Interrupted, None); with External latched -> (Interrupted, Some("request cancelled by external caller")); with User latched -> (Interrupted, None) (quiet). Place in core/engine/tests.rs.

  4. User cancel stays quiet/regression — assert the existing engine_handle_cancel_tracks_latest_turn_token (tests.rs:297) still passes and that a User cancel produces TurnComplete.error == None (no behavior change). Guards the AC "Existing public EngineHandle::cancel() behavior remains user-cancel compatible".

  5. UI surfacing (Slice C) — in crates/tui/src/tui/ui/tests.rs add a test that feeds Event::TurnComplete { status: Interrupted, error: Some("request cancelled by external caller"), .. } and asserts app.status_message reflects the reason; and a companion with error: None asserting the message is the unchanged "interrupted"/empty path. If the cell-label enrichment is taken, extend the existing finalize_streaming_assistant test (app.rs:7760) to assert "[interrupted: ...]" vs bare "[interrupted]".

  6. Internal (Slice B, only if a real teardown site is wired) — assert that the shutdown/drop path latches Internal; if Internal stays reserved, no test, and the channel-closed arms keep their existing assertions.

Gates: cargo test -p codewhale-tui, cargo clippy -p codewhale-tui --all-targets (the enum currently has #[allow(dead_code)]; once External/Internal are constructed, REMOVE the now-unnecessary allow on those variants — clippy will confirm they are live), and cargo fmt.

Risks:

  • Mislabeling a runtime cancel as Internal. The biggest trap: Op::CancelRequest (engine.rs:1177) is the shared cancel mailbox; a runtime/external cancel may arrive through it. Do NOT blanket-stamp Internal there or you will overwrite External and violate the AC. Internal belongs ONLY on Drop/Shutdown sites that are provably not user/runtime-driven. If no such distinct site exists, leave Internal reserved (the channel-closed arms in approval.rs already give a teardown-distinct message).

  • Latch is single-slot and reset per turn (engine.rs:596-602). If two cancels race (e.g., user Esc then runtime interrupt), last-writer-wins. Acceptable today (cancellation is terminal for the turn) but note it; do not add complexity to serialize.
  • Preempted may be unbuildable in the TUI (it queues, ui.rs:7510) — forcing a call site would be dishonest and could mislabel a queue-drain as a preemption. Keep it reserved unless the runtime start-turn path genuinely preempts.
  • task_manager bridge is non-trivial (separate token, no EngineHandle — task_manager.rs:719/949). Pretending the one-liner covers it would fail the AC silently. Scope it out explicitly.
  • Surfacing the reason on EVERY Interrupted turn risks adding noise to ordinary user Esc cancels. Mitigation baked into the design: User cancel keeps error: None (quiet); only External/Internal/Preempted emit text. Preserves current UX.
  • Remove #[allow(dead_code)] from each variant once it is constructed, so the dead_code lint can confirm the variant is live; a stale allow is silent and would hide a later regression. Conversely, a variant that remains unused (Preempted) must keep its allow. Scope the attribute per variant, or the build warns.

#3146 — effort: medium

Current state: NOTE ON LOCATION: cwd /Volumes/VIXinSSD/codewhale/npm/codewhale is only the npm publish wrapper (JS installer, 15 files, no Rust). The actual Rust source on branch codex/v0.8.61 is its grandparent /Volumes/VIXinSSD/codewhale (has crates/, all foundations confirmed present). All paths below are absolute against that root; cite as crates/....

The transcript-collapse substrate is already built and tested (#2692). The gap is exactly the three Cursor-style affordances the issue names: edit-group diff stats, per-file rows, in-flight/waiting rows.

What EXISTS (crates/tui/src/tui/history.rs):

  • ToolRun { start, count, tool_families, activity } (history.rs:782-792) + ToolRunActivitySummary { files, searches, commands, edits, delegates, metadata, other } (794-803). Buckets are per-cell COUNTS only — no diff stats, no per-file detail.
  • detect_tool_runs[_from_slices] (837-896): walks contiguous is_collapsible_tool_cell cells, builds the activity summary, emits a ToolRun when count >= min_size. Works over committed history + active in-flight tail via cell_at_virtual_index (898-906).
  • is_collapsible_tool_cell (908-910) = tool.is_success() && !tool.is_collapsible_guard().
  • is_collapsible_guard (734-746): GUARDS running, failed, AND every Exec, PatchSummary, Review, DiffPreview, PlanUpdate typed cell, plus Generic whose name hits generic_tool_name_is_collapse_guard (912-926: patch/write/edit/delete/remove/commit/push/review) or is_diff. THIS IS THE KEY GAP: shell + edit cells never collapse, so the issue's "Edited 3 files +X -Y" / "Ran N commands" / "Running ... cd, cargo, rg" rows can never appear today.
  • tool_run_summary (1030-1075): emits "Explored N files, M searches", "ran K commands: ", "edited N files", "delegated N tasks", else "Updated metadata", with counted() pluralization (1094-1097) + sentence_case_activity (1099). The commands/edits clauses are effectively DEAD CODE for typed cells (guarded out); they only fire for generic-named command/edit tools. Edit clause has NO +X -Y.

Rendering (crates/tui/src/tui/widgets/mod.rs):

  • render() builds tool_runs (133-141, gated by app.tool_collapse_active()), computes collapsed_run_starts / collapsed_tool_indices (142-154), then in the slow path (185-263) replaces each run's first cell with tool_run_summary_cell(run) and hides offsets 1..count, recording collapsed_cell_map (filtered->original).
  • tool_run_summary_cell (390-401): a single GenericToolCell { name:"activity_group", input_summary: Some(tool_run_summary(run)) } — ONE muted line, no per-file rows. tool_run_summary_revision (403-418) hashes member revisions for cache invalidation.

App state/expansion (crates/tui/src/tui/app.rs):

  • expanded_tool_runs: HashSet<usize> (1526) keyed by run.start; rebased/pruned on history shift/truncate (2928, 3048, 3061, 3100). tool_collapse_threshold + tool_collapse_mode (ToolCollapseMode Compact/Expanded/Calm, 401-437; from settings.rs:207).
  • toggle_tool_run_expansion_at (3122-3134) via tool_run_start_for_history_index (3111-3120). Keyboard ui.rs:3571/3602, mouse mouse_ui.rs:65.
  • Alt+V/detail: cell_has_detail_target (3208-3214), detail_cell_index_for_viewport (3219-3252), tool_detail_record_for_cell (3195-3202) via original_cell_index_for_rendered + collapsed_cell_map. The summary cell maps back to run.start ONLY — Alt+V reaches the FIRST original cell of a group, not every member (acceptable but worth a test).

Diff data (recoverable but discarded):

  • PatchSummaryCell { path:String, summary:String, status, error } (history.rs:1390-1395) — NO structured add/remove/per-file counts.
  • At construction tool_routing.rs:165-180 calls parse_patch_summary (1009-1051) which ALREADY computes count_patch_changes -> (adds,removes) (1135-1149) and extract_patch_paths (1053+), then collapses them into the free-text summary ("Changes: +12 / -3") and a single path/"N files" label. Structured numbers are thrown away.
  • diff_render.rs ALREADY has the exact output primitives: diff_summary_label -> "N files +X -Y" (192-204) and render_diff_summary per-file rows " path +A -D N hunks" (221-251).
  • Patch completion tool_routing.rs:590-602 flips status and overwrites summary; raw patch JSON survives in ToolDetailRecord { input: Value } (app.rs:1884-1889).

In-flight/waiting: running cells are guards (split runs) and render as full cards. command_header_summary (history.rs:2997) and exploring_header_summary (3007) already extract a short "purpose" from a command; no aggregated in-flight metadata row exists.

Existing tests to extend: history.rs:5982-6131 (detect_tool_runs_*, tool_run_summary_*); widgets/mod.rs:2849-2982 (add_dense_tool_run, chat_widget_collapses_dense_tool_runs_by_default, chat_widget_expands_dense_tool_runs_on_demand, detail_highlight_uses_original_index_map_for_collapsed_rows).

Design: Build ON the existing #2692 substrate; do not rebuild it. Four ordered changes, each independently testable.

STEP 1 — Capture structured diff stats on edit cells (crates/tui/src/tui/history.rs + crates/tui/src/tui/tool_routing.rs).

  • Add fields to PatchSummaryCell (history.rs:1390): added: usize, removed: usize, files: Vec<PatchFileStat> where PatchFileStat { path: String, added: usize, removed: usize } (new small struct). Keep path/summary for back-compat rendering.
  • In tool_routing.rs: refactor parse_patch_summary (1009) to also return the structured (added, removed, Vec<PatchFileStat>) data derived from count_patch_changes (1135) + extract_patch_paths (1053). Easiest: add fn parse_patch_file_stats(input)->Vec<PatchFileStat> that walks the patch file by file, starting a new entry at each +++ header (reuse the +/- counting loop, but reset the counts per file). Populate the new fields at the construction site (165-180). On completion (590-602), if a richer diff/patch is in the result, recompute; otherwise keep construction-time stats.
  • This is the load-bearing data change: everything downstream reads these numbers instead of re-parsing strings.
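
A per-file walk of the kind Step 1 describes might look like the sketch below. It assumes the patch text is a unified diff with `--- old` / `+++ new` header pairs and an optional `b/` path prefix; the actual input handled by parse_patch_summary may differ, and the struct fields simply follow the PatchFileStat shape proposed above.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PatchFileStat {
    pub path: String,
    pub added: usize,
    pub removed: usize,
}

/// Count added/removed lines per file, starting a fresh entry at each header pair.
pub fn parse_patch_file_stats(patch: &str) -> Vec<PatchFileStat> {
    let mut stats: Vec<PatchFileStat> = Vec::new();
    let mut lines = patch.lines().peekable();
    while let Some(line) = lines.next() {
        let next_line: Option<&str> = lines.peek().copied();
        // A `--- old` line immediately followed by `+++ new` is treated as a file header.
        if line.starts_with("--- ") {
            if let Some(path) = next_line.and_then(|next| next.strip_prefix("+++ ")) {
                let path = path.strip_prefix("b/").unwrap_or(path);
                stats.push(PatchFileStat { path: path.to_string(), added: 0, removed: 0 });
                lines.next(); // consume the `+++` header so it is not counted as an addition
                continue;
            }
        }
        // Lines before the first header (for example `diff --git`) are ignored.
        let Some(current) = stats.last_mut() else { continue };
        if line.starts_with('+') {
            current.added += 1;
        } else if line.starts_with('-') {
            current.removed += 1;
        }
    }
    stats
}
```

The header check is a heuristic: a removed line whose content begins with `-- ` followed by an added line beginning with `++ ` would be misread as a header, which is one reason to treat the numbers as indicative.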

STEP 2 — Make successful edit + shell groups collapsible while keeping danger prominent (crates/tui/src/tui/history.rs).

  • Narrow is_collapsible_guard (734-746): REMOVE the blanket ToolCell::Exec(_) and ToolCell::PatchSummary(_) arms from the typed-guard match, so SUCCESSFUL exec/patch cells can join runs. KEEP guarding running/failed (the is_running()||is_failed() head stays), KEEP Review/DiffPreview/PlanUpdate, KEEP generic_tool_name_is_collapse_guard for push/commit/publish/delete/remove/review names and cell.is_diff.
  • Add a focused safety predicate so destructive/sensitive SHELL and PATCH cells still never collapse. New fn exec_is_collapse_guard(cell: &ExecCell) -> bool returning true when the command string matches a sensitive set (git push, git commit, rm -rf, publish, npm/cargo publish, force flags, gh release, kubectl/terraform apply, curl|sh) — reuse/extend the keyword spirit of generic_tool_name_is_collapse_guard. New fn patch_is_collapse_guard(cell): keep prominent if cell.error.is_some() (already covered by is_failed) — patches are otherwise safe to summarize. Wire both into is_collapsible_guard's Exec/PatchSummary arms (guard only when sensitive). Net effect: routine edits/shell collapse; pushes/destructive stay full cards. This is the single highest-risk change — gate it behind the predicate, not a blanket removal.
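
One possible shape for the sensitive-command check is sketched below. It works on the raw command string rather than an ExecCell, so the name command_is_collapse_guard is hypothetical, and the fragment list is an assumption that spells out the sensitive set named above; a real list should err toward more entries.

```rust
/// Fragments that keep a shell cell as a full card (assumed spellings).
const SENSITIVE_FRAGMENTS: &[&str] = &[
    "git push",
    "git commit",
    "rm -rf",
    "npm publish",
    "cargo publish",
    "--force",
    "gh release",
    "kubectl apply",
    "terraform apply",
];

/// Returns true when the command should never be folded into a collapsed group.
pub fn command_is_collapse_guard(command: &str) -> bool {
    // Collapse runs of whitespace and lowercase so "rm   -rf" still matches.
    let normalized = command
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
        .to_lowercase();
    if SENSITIVE_FRAGMENTS.iter().any(|fragment| normalized.contains(fragment)) {
        return true;
    }
    // A download piped straight into a shell (curl ... | sh) is also guarded.
    let pipes_into_shell = normalized
        .split('|')
        .skip(1)
        .any(|stage| matches!(stage.trim().split(' ').next(), Some("sh" | "bash")));
    normalized.contains("curl") && pipes_into_shell
}
```

Substring matching is deliberately coarse: a false positive only costs a full card, while a false negative hides a destructive command.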

STEP 3 — Edit-group diff stats + optional per-file rows in the summary (crates/tui/src/tui/history.rs + crates/tui/src/tui/widgets/mod.rs).

  • Extend ToolRunActivitySummary (794-803): add edit_added: usize, edit_removed: usize, and edit_files: Vec<PatchFileStat> (capped, e.g. 6). In record() (816-828) when the cell is ToolCell::PatchSummary, add its added/removed and extend edit_files.
  • Upgrade tool_run_summary edit clause (1053-1058): when activity.edits>0, emit format!("edited {} {}", counted(edits,"file","files"), diff_stat_suffix(added,removed)) where suffix renders +{added} -{removed} (reuse the exact +A -D shape from diff_render.rs:200-201). Keep "edited N files" when both are 0. Pluralization tests extend the existing tool_run_summary_* tests.
  • Per-file rows: change the summary cell from one line to a small multi-line cell. Replace tool_run_summary_cell (widgets/mod.rs:390-401) construction: introduce a dedicated typed cell ToolCell::ActivityGroup(ActivityGroupCell { headline: String, detail_rows: Vec<String> }) (new variant in history.rs ToolCell enum at 666-677) OR, lower-risk, keep GenericToolCell but set output: Some(rows.join("\n")) so existing Generic rendering shows the per-file rows under the headline. Recommend the typed ActivityGroupCell for clean muted styling (render via render_tool_header_with_summary for the headline + render_compact_kv/muted wrap_plain_line per file row using palette::TEXT_MUTED, mirroring render_diff_summary's per-file row format). Per-file rows are gated: only emit when activity.edits>0 && edit_files.len()<=cap; otherwise headline only. Update tool_display_name (942-955), status() (handle new variant -> Success), and render dispatch (764-777) for the new variant. tool_run_summary_revision already hashes members, so cache stays correct.
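
The headline and per-file row formatting from Step 3 can be sketched with plain string helpers. The functions below are illustrative re-implementations, not the existing counted or sentence_case_activity in history.rs; edit_clause, sentence_case, and per_file_rows are hypothetical names, and the row format is the simple "path +A -D" shape used in the test plan.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PatchFileStat {
    pub path: String,
    pub added: usize,
    pub removed: usize,
}

/// "1 file" / "3 files".
pub fn counted(count: usize, singular: &str, plural: &str) -> String {
    let noun = if count == 1 { singular } else { plural };
    format!("{count} {noun}")
}

/// " +A -D", or nothing when there are no stats to show.
pub fn diff_stat_suffix(added: usize, removed: usize) -> String {
    if added == 0 && removed == 0 {
        String::new()
    } else {
        format!(" +{added} -{removed}")
    }
}

pub fn edit_clause(edits: usize, added: usize, removed: usize) -> String {
    format!("edited {}{}", counted(edits, "file", "files"), diff_stat_suffix(added, removed))
}

/// Uppercase the first character, leaving the rest untouched.
pub fn sentence_case(text: &str) -> String {
    let mut chars = text.chars();
    match chars.next() {
        Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
        None => String::new(),
    }
}

/// Per-file rows are emitted only when the file list fits under the cap.
pub fn per_file_rows(files: &[PatchFileStat], cap: usize) -> Vec<String> {
    if files.is_empty() || files.len() > cap {
        return Vec::new();
    }
    files
        .iter()
        .map(|file| format!("{} +{} -{}", file.path, file.added, file.removed))
        .collect()
}
```

Putting the leading space inside diff_stat_suffix keeps the zero-stat headline free of a trailing space.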

STEP 4 — In-flight / waiting metadata rows (crates/tui/src/tui/history.rs + crates/tui/src/tui/widgets/mod.rs).

  • This is the only NEW grouping (the rest reuse the collapse pipeline). Running cells are guards, so they don't enter detect_tool_runs. Add a parallel detector fn detect_inflight_group(history, active_entries) -> Option<InflightGroup> over the TAIL: scan trailing contiguous is_running() tool cells; classify each via the existing classify_tool_run_activity; produce InflightGroup { running: usize, command_families: Vec<String>, waiting_background: usize, purpose: Option<String> }. Purpose: reuse command_header_summary (2997) on the first running exec, or exploring_header_summary for reads.
  • Add fn inflight_summary(group) -> String: "Running {purpose} {families.join(", ")}" (e.g. "Running fmt check cd, cargo, rg") and when waiting_background>0 "Waiting for {counted(n,"command","commands")} to finish" with a trailing "Run in background" affordance hint (mirror existing "Ctrl+B backgrounds this command", history.rs:1192). Keep approval-required/sensitive running commands as their OWN full card (do not fold into the count) — reuse exec_is_collapse_guard from Step 2 to decide.
  • Render: in widgets/mod.rs render(), after building tool_runs, if app.tool_collapse_active() and the tail is a running group of size >= threshold (or >=2 for in-flight, configurable constant), append/replace the active tail entries with one muted ActivityGroupCell headline = inflight_summary. This row is transient (recomputed each frame from the active tail); it does not touch expanded_tool_runs. Expansion to see the live commands = the existing per-cell Alt+V on the underlying running cells, which already render live output.
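
The in-flight row text from Step 4 could be produced roughly as follows. This sketch returns the rows as a Vec<String> rather than a single String, which is an assumption about how the headline and the waiting line are combined; inflight_rows is a hypothetical name, and the "Run in background" affordance hint is left out.

```rust
/// Fields follow the InflightGroup shape proposed above.
pub struct InflightGroup {
    pub running: usize,
    pub command_families: Vec<String>,
    pub waiting_background: usize,
    pub purpose: Option<String>,
}

/// Build the muted rows for a running tail: a headline plus an optional waiting row.
pub fn inflight_rows(group: &InflightGroup) -> Vec<String> {
    let mut headline = String::from("Running");
    if let Some(purpose) = group.purpose.as_deref() {
        headline.push(' ');
        headline.push_str(purpose);
    }
    if !group.command_families.is_empty() {
        headline.push(' ');
        headline.push_str(&group.command_families.join(", "));
    }
    let mut rows = vec![headline];
    if group.waiting_background > 0 {
        let noun = if group.waiting_background == 1 { "command" } else { "commands" };
        rows.push(format!(
            "Waiting for {} {} to finish",
            group.waiting_background, noun
        ));
    }
    rows
}
```

Because the rows are derived purely from the group value, they can be recomputed every frame without touching expanded_tool_runs.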

CROSS-CUTTING (preserve existing acceptance criteria):

  • Expansion (keyboard ui.rs:3571/3602, mouse mouse_ui.rs:65) is unchanged — toggle_tool_run_expansion_at still keys on run.start; edit/shell groups now produce runs so they expand identically.
  • Alt+V/detail: cell_has_detail_target/detail_cell_index_for_viewport already resolve via collapsed_cell_map to run.start. Add ONE improvement so Alt+V from a collapsed edit/shell group can reach ANY member: extend detail_cell_index_for_viewport (or a helper) so when the targeted cell is an ActivityGroupCell/run-start, it returns the run.start but the detail pager can page across run.start..run.start+count (verify the pager already supports next/prev; if not, leave run.start targeting and add a test documenting first-cell behavior — matches today's contract).
  • The tool_collapse_mode Expanded path (is_active=false) must still leave everything visible — Steps 2-4 only act when tool_collapse_active() is true.

Builds on: Directly on the #2692 transcript-collapse substrate (ToolRun / ToolRunActivitySummary / detect_tool_runs / tool_run_summary / tool_run_summary_cell / expanded_tool_runs / collapsed_cell_map / Alt+V detail targeting). On diff_render.rs formatting primitives (diff_summary_label, render_diff_summary) for the +X -Y shape and per-file row format. On tool_routing.rs parse_patch_summary/count_patch_changes/extract_patch_paths which already compute the diff numbers (Step 1 just stops discarding them). On command_header_summary/exploring_header_summary for in-flight purpose text.

It does NOT need the v0.8.61 foundations listed in the brief (worker_profile, goal_loop, model_registry, provider_readiness, context_budget, provider_adapter, resource_telemetry, request_tuning) — this is a transcript-rendering feature, orthogonal to runtime/profile/budget. The only shared touchpoint is palette styling. Confirmed by reading all eight foundation files' purpose; none feed the history/render path. Flagging explicitly so the implementer does not wire unrelated foundations.

Files: crates/tui/src/tui/history.rs, crates/tui/src/tui/tool_routing.rs, crates/tui/src/tui/widgets/mod.rs, crates/tui/src/tui/diff_render.rs, crates/tui/src/tui/app.rs

Test plan: All in-crate #[cfg(test)] (existing pattern); no new files needed.

history.rs (extend the tests mod near 5982-6131):

  • patch_cell_carries_structured_diff_stats: build PatchSummaryCell from a 2-file patch JSON; assert added/removed/files populated (Step 1).
  • edit_group_summary_includes_diff_stats: ToolRun with edits=3, edit_added=18, edit_removed=7 -> tool_run_summary == "Edited 3 files +18 -7"; singular case edits=1 -> "Edited 1 file +6 -1" (Step 3, pluralization).
  • edit_group_emits_per_file_rows_under_cap / _omits_rows_over_cap: assert ActivityGroupCell detail_rows match path +A -D and are empty when files exceed cap (Step 3).
  • successful_edit_and_shell_cells_now_collapse: extend detect_tool_runs_keeps_failed_running_and_shell_cells_visible (6030) — a run of 3 successful shell + 1 successful patch now yields ONE run; assert activity.commands/edits/edit_added correct.
  • destructive_shell_stays_visible: git push, rm -rf, cargo publish exec cells split the run / remain full cards (Step 2 exec_is_collapse_guard). failed_patch_stays_visible (already covered by is_failed, add explicit case).
  • inflight_group_summary_formats_running_and_waiting: InflightGroup{running:2, families:[cargo,rg], waiting_background:1} -> "Running ... cargo, rg" + "Waiting for 1 command to finish" (Step 4); approval-sensitive running command excluded from the count.

widgets/mod.rs (extend tests near 2849-2982, reuse add_dense_tool_run):

  • chat_widget_collapses_edit_group_with_diff_stats: history with dense successful patch cells -> rendered transcript contains "Edited N files +X -Y" and not the raw Patch cards.
  • chat_widget_renders_per_file_edit_rows: assert per-file path +A -D rows appear in the collapsed group lines.
  • chat_widget_renders_inflight_running_row: active tail of running exec cells -> "Running ..." row present (Step 4).
  • Reuse chat_widget_expands_dense_tool_runs_on_demand (2959) to prove edit/shell groups still expand to original cells.
  • Reuse detail_highlight_uses_original_index_map_for_collapsed_rows (2879) + add alt_v_from_collapsed_edit_group_reaches_member to lock the Alt+V detail-target contract.

Gate commands (run from /Volumes/VIXinSSD/codewhale): cargo test -p codewhale-tui tui::history, cargo test -p codewhale-tui tui::widgets, then cargo fmt --check and cargo clippy -p codewhale-tui. Map each issue acceptance checkbox to a named test above (collapsed render, expansion, detail targeting, command summary, edit summary+diff stats, per-file rows, running/waiting guard, failure/destructive visibility).

Risks:

  • HIGHEST: relaxing is_collapsible_guard for Exec/PatchSummary (Step 2). If the sensitive-command predicate is too loose, a git push/rm -rf/publish could be summarized away, violating the non-goal "Do not hide dangerous or failed actions." Mitigation: guard-by-default for the sensitive keyword set, collapse only the clearly-safe remainder; dedicated destructive_shell_stays_visible test; err toward MORE guarding. Failed/running already stay (untouched head of the predicate).

  • Diff-stat accuracy: count_patch_changes (tool_routing.rs:1135) counts raw +/- lines including context-adjacent edits and does not dedupe +++/--- per-file headers perfectly across multi-file patches; per-file attribution requires resetting counts at each +++ boundary. Mitigation: parse_patch_file_stats walks per-file; treat numbers as indicative (matches how diff_render already labels). Add a multi-file patch test.
  • Summary cell shape change (single-line Generic -> multi-line ActivityGroupCell): touches tool_display_name, status(), render dispatch, and the revision/cache path. Risk of cache staleness or panic on the new enum variant if a match arm is missed. Mitigation: exhaustive match (no wildcard) so the compiler flags every site; tool_run_summary_revision already folds member revisions.
  • In-flight row (Step 4) is transient and recomputed each frame from the active tail — risk of flicker or interaction with active_cell_revision caching, and of double-rendering if the tail also enters detect_tool_runs. Mitigation: in-flight detector only matches RUNNING cells (which detect_tool_runs excludes), so the two groupings are disjoint by construction; add a test asserting no overlap.
  • Alt+V reaching only run.start (first member) is a pre-existing contract, not a regression; if reviewers want per-member paging it is a small follow-up. Documented by test.
  • Settings: tool_collapse_mode Expanded must remain fully expanded — all four steps are inside tool_collapse_active() branches; add a regression assert reusing chat_widget_expanded_mode_leaves_dense_tool_runs_visible (2982).

#2054 — effort: medium

Current state: The composer's pending-input system is already well-structured but has interaction-level gaps the issue names. Code lives under /Volumes/VIXinSSD/codewhale/crates (the npm/codewhale cwd is a wrapper; git toplevel is /Volumes/VIXinSSD/codewhale on branch codex/v0.8.61 @ 505974d31).

WHAT EXISTS:

  • Per-row mode labels are ALREADY rendered. crates/tui/src/tui/widgets/pending_input_preview.rs:27-30 defines distinct prefixes: "Steer pending: " (PENDING_STEER_PREFIX), "Rejected steer: " (REJECTED_STEER_PREFIX), "Queued follow-up: " (QUEUED_MESSAGE_PREFIX), "Editing queued follow-up: " (EDITING_QUEUED_PREFIX). lines() (89-176) renders four buckets under one "Pending inputs" header, and when editing it adds an "Esc restores queued follow-up" hint (151-154) and suppresses the "edit last queued message" hint. Tests at 393-531 (editing_queued_message_renders_explicit_state_and_restore_hint, pending_input_rows_label_each_delivery_mode, all_pending_inputs_render_as_one_list) already assert this. So the "explicit, test-covered state label" criterion is largely met for the WIDGET; the steer-vs-queue distinction is the bucket each row lands in.
  • Backing state: crates/tui/src/tui/app.rs:1713-1728 holds queued_messages (VecDeque), queued_draft (Option), pending_steers, rejected_steers. Methods: pop_last_queued_into_draft (4967-4980), cancel_queued_draft_edit (4984-4992, re-appends draft to tail), remove_queued_message/queued_message_count (4951-4957), decide_submit_disposition (5028-5043: offline->Queue, idle->Immediate, busy+not-streaming->Steer, busy+streaming->Queue). SubmitDisposition enum at 1867-1880.
  • /queue command EXISTS with list|edit |drop |clear (crates/tui/src/commands/groups/core/queue.rs; registered crates/tui/src/commands/groups/core/mod.rs:97-102, usage string line 100). There is NO "send now".
  • Send-now plumbing EXISTS to reuse: AppAction enum (crates/tui/src/tui/app.rs:5316) has SendMessage(String); its handler (crates/tui/src/tui/ui.rs:6525-6527) routes build_queued_message -> submit_or_steer_message, which already steers-vs-queues by disposition (ui.rs:7317-7358). send_ctrl_s_queued_message_now (ui.rs:5378-5422) is a near-exact template: it steers if is_loading else dispatches, with offline/failed-steer fallbacks. Commands can defer async work via CommandResult::action(AppAction) (crates/tui/src/commands/mod.rs:28-63; consumed in ui.rs match at 6474+).
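
The disposition rule quoted above (offline->Queue, idle->Immediate, busy+not-streaming->Steer, busy+streaming->Queue) reduces to a small decision function. The sketch below restates that rule with boolean inputs; the real decide_submit_disposition is a method on App and reads its own state, so the parameter list here is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SubmitDisposition {
    Immediate,
    Queue,
    Steer,
}

/// Offline always queues; an idle engine sends immediately; a busy engine
/// steers unless it is currently streaming, in which case the message queues.
pub fn decide_submit_disposition(offline: bool, busy: bool, streaming: bool) -> SubmitDisposition {
    if offline {
        SubmitDisposition::Queue
    } else if !busy {
        SubmitDisposition::Immediate
    } else if streaming {
        SubmitDisposition::Queue
    } else {
        SubmitDisposition::Steer
    }
}
```

A row-level "send now" would reuse this same decision, so a queued message delivered mid-turn lands as a steer or a dispatch depending on the current state.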

THE GAP (citations):

  1. High-risk single-key entry. crates/tui/src/tui/ui.rs:3924-3937: a BARE Up (no modifiers) with empty input + cursor at 0 + non-empty queue silently calls pop_last_queued_into_draft(), pulling the last queued message into the editable composer and flipping into queued-draft edit mode — with no confirmation. During a loading turn a user pressing Up to recall prompt history instead enters this confusing "edit last queued message" state. This is the exact accidental path the issue reports.
  2. No row-level send/steer-now. /queue (queue.rs:20-25) offers edit/drop/clear only. There is no per-row "deliver this queued message into the current turn now" despite the disposition machinery existing.
  3. Footer hint not state-aware. crates/tui/src/tui/footer_ui.rs:962-972 (footer_state_label) returns the SAME static "draft" label for a normal typed draft and for queued-draft editing, and never advertises what Enter / Ctrl+Enter / Esc will do in the current disposition (steer vs queue vs send). Issue asks keyboard help to match the actual action.
  4. Status strings are scattered and inconsistent ("Queued: ... — ↑ to edit" ui.rs:7310; "{count} queued — ↑ to edit, /queue list" 7334; "Steering into current turn" 7349; queue.rs CmdQueueEditingMessage "press Enter to re-queue/send"). No single helper describes current disposition.

CONSTRAINT discovered: crates/tui/src/localization.rs — all 429 MessageId variants are matched EXHAUSTIVELY in all 7 locale functions (english + japanese/chinese_simplified/traditional_chinese/portuguese_brazil/spanish_latin_america/vietnamese), with ZERO wildcard arms. Every NEW MessageId requires 7 new match arms or the crate fails to compile. Non-English funcs return Option and fall back to English via translation() (1868-1878), but the arm must still exist.

Design: Goal: make the queued/steer/edit state legible and recoverable, add row-level send-now/drop/clear, and label live-steer vs queued — building on the already-correct per-row widget labels. Five focused changes, in order.

STEP 1 — Disarm the accidental single-key edit path (highest priority; the actual repro). In crates/tui/src/tui/ui.rs at the bare-Up arm (3924-3937):

  • Replace the silent pop_last_queued_into_draft() with a TWO-STEP confirm, OR gate it so it only fires when NOT loading. Recommended: introduce App state queue_edit_armed_until: Option<Instant>, mirroring the existing quit_armed_until pattern (app.rs:1807 and the arm_quit/quit_is_armed helpers near it). The first qualifying Up sets the arm and shows the status "Press ↑ again to edit last queued message"; a second Up within ~2s, while the arm is still valid, actually calls pop_last_queued_into_draft(). Add arm_queue_edit() / queue_edit_is_armed() to app.rs next to the quit-arm helpers. Also apply the guard !app.is_loading || armed at the call site, so that during a running turn Up never enters edit mode on the first press. This satisfies "avoid a high-risk single-key path ... unless the state is visually obvious."
  • Keep the existing guards (input empty, cursor 0, no draft, no menus, no attachment) intact.
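The two-step confirm can be sketched as follows. This is a minimal, self-contained illustration rather than the app.rs code: QueueEditArm, UpOutcome, and on_bare_up are hypothetical names, the 2-second window follows the "~2s" suggestion above, and the sketch assumes the confirm applies whether or not a turn is loading.

```rust
use std::time::{Duration, Instant};

/// How long a first Up press stays armed (assumed from the "~2s" suggestion).
const QUEUE_EDIT_ARM_WINDOW: Duration = Duration::from_secs(2);

/// What the bare-Up handler should do for one key press.
#[derive(Debug, PartialEq, Eq)]
enum UpOutcome {
    /// First press: show "Press ↑ again to edit last queued message".
    Armed,
    /// Second press inside the window: call pop_last_queued_into_draft().
    EditLastQueued,
}

/// Hypothetical stand-in for the queue_edit_armed_until field on App.
#[derive(Default)]
struct QueueEditArm {
    armed_until: Option<Instant>,
}

impl QueueEditArm {
    fn arm(&mut self, now: Instant) {
        self.armed_until = Some(now + QUEUE_EDIT_ARM_WINDOW);
    }

    fn is_armed(&self, now: Instant) -> bool {
        self.armed_until.is_some_and(|until| now <= until)
    }

    /// Handle one qualifying bare-Up press (empty input, cursor 0, non-empty queue).
    fn on_bare_up(&mut self, now: Instant) -> UpOutcome {
        if self.is_armed(now) {
            // Consume the arm so a third press starts the confirm over.
            self.armed_until = None;
            UpOutcome::EditLastQueued
        } else {
            // First press, or the previous arm expired: arm and do nothing else.
            self.arm(now);
            UpOutcome::Armed
        }
    }
}
```

Passing `now` in explicitly, instead of calling Instant::now() inside the helper, keeps the expiry path testable without sleeping.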

STEP 2 — Add /queue send <n> row-level action (reuses existing disposition plumbing).

  • crates/tui/src/tui/app.rs: add AppAction variant SendQueuedNow { index: usize } to the enum (5316+).
  • crates/tui/src/tui/ui.rs: add a match arm in the AppAction handler (near 6525) that: removes the queued message at index via app.remove_queued_message(index); if None -> status error and return; else route the SAME way send_ctrl_s_queued_message_now does — submit_or_steer_message(app, config, engine_handle, message) (which already chooses Steer when busy-waiting, Queue when streaming, Immediate when idle). Reuse, do not duplicate: factor the body of send_ctrl_s_queued_message_now (5378-5422) into async fn send_queued_message_now(app, config, engine_handle, message: QueuedMessage) and call it from both the Ctrl+S path and the new arm.
  • crates/tui/src/commands/groups/core/queue.rs: in queue() match (20-25) add "send" | "now" => send_queue(app, parts.next()). send_queue parses the index (reuse parse_index), validates against queued_messages.len() WITHOUT removing (the async arm removes), and returns CommandResult::action(AppAction::SendQueuedNow { index }). Guard: if queued_draft.is_some() and the user means the draft, return an instructive error pointing at Enter.
  • crates/tui/src/commands/groups/core/mod.rs:100: update usage string to "/queue [list|edit |drop |send |clear]".
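A sketch of the validate-only send_queue step is shown below. The types are simplified stand-ins: the real command returns a CommandResult with localized messages, and QueueSendError with its variants is hypothetical. The sketch also makes one simplifying assumption that the text leaves open: any /queue send issued while a queued draft is being edited is rejected.

```rust
/// Simplified mirror of the proposed AppAction variant.
#[derive(Debug, PartialEq, Eq)]
enum AppAction {
    SendQueuedNow { index: usize },
}

/// Hypothetical error cases standing in for the localized command errors.
#[derive(Debug, PartialEq, Eq)]
enum QueueSendError {
    MissingIndex,
    InvalidIndex,
    NotFound,
    DraftBeingEdited,
}

/// Validate `/queue send <n>` WITHOUT mutating the queue.
/// `arg` is the token after `send`; the user-facing index is 1-based.
fn send_queue(
    queue_len: usize,
    draft_active: bool,
    arg: Option<&str>,
) -> Result<AppAction, QueueSendError> {
    if draft_active {
        // Assumption: point the user at Enter instead of sending around the draft.
        return Err(QueueSendError::DraftBeingEdited);
    }
    let raw = arg.ok_or(QueueSendError::MissingIndex)?;
    let n: usize = raw
        .trim()
        .parse()
        .map_err(|_| QueueSendError::InvalidIndex)?;
    if n == 0 || n > queue_len {
        return Err(QueueSendError::NotFound);
    }
    // Removal is deferred to the async handler, so a dropped action loses nothing.
    Ok(AppAction::SendQueuedNow { index: n - 1 })
}
```

Because the function only reads the queue length, the queue is untouched until the async arm actually runs, which is the property the test plan asserts.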

STEP 3 — Make the footer hint disposition-aware (keyboard help matches action). In crates/tui/src/tui/footer_ui.rs:

  • footer_state_label (943-975) returns &'static str so it can't carry a dynamic disposition; instead add a NEW span builder footer_disposition_hint_spans(app) -> Vec<Span> that renders a compact hint only when relevant: if queued_draft.is_some() -> "editing queued · Enter re-queues · Esc restores"; else if !input.is_empty() && is_loading -> derive from decide_submit_disposition(): Steer -> "Enter steers turn · Ctrl+Enter steers"; Queue -> "Enter queues · Ctrl+Enter steers"; else if !input.is_empty() -> "Enter sends". Wire it into footer_auxiliary_spans (749-801) as the highest-priority cluster (so it survives width truncation), or into render_footer near the state_label assignment. Distinguish the queued-edit "draft" from a normal "draft" by returning "queued-edit" from footer_state_label when queued_draft.is_some() (962-964) instead of the generic "draft".
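The hint selection above can be sketched as a pure function. ComposerState is a hypothetical flattening of the App fields involved, the disposition rules are the ones summarized earlier for decide_submit_disposition (offline->Queue, idle->Immediate, busy+not-streaming->Steer, busy+streaming->Queue), and the function returns a plain string instead of Vec<Span> to stay self-contained.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SubmitDisposition {
    Immediate,
    Queue,
    Steer,
}

/// Hypothetical flattening of the App state the footer hint depends on.
#[derive(Debug, Clone, Copy)]
struct ComposerState {
    input_is_empty: bool,
    editing_queued_draft: bool,
    offline: bool,
    is_loading: bool,
    is_streaming: bool,
}

/// Disposition rules as summarized in the current-state notes.
fn decide_submit_disposition(s: &ComposerState) -> SubmitDisposition {
    if s.offline {
        SubmitDisposition::Queue
    } else if !s.is_loading {
        SubmitDisposition::Immediate
    } else if s.is_streaming {
        SubmitDisposition::Queue
    } else {
        SubmitDisposition::Steer
    }
}

/// Pick the footer hint text; None means "render no hint cluster".
fn footer_disposition_hint(s: &ComposerState) -> Option<&'static str> {
    if s.editing_queued_draft {
        return Some("editing queued · Enter re-queues · Esc restores");
    }
    if s.input_is_empty {
        return None;
    }
    if !s.is_loading {
        return Some("Enter sends");
    }
    Some(match decide_submit_disposition(s) {
        SubmitDisposition::Steer => "Enter steers turn · Ctrl+Enter steers",
        SubmitDisposition::Queue => "Enter queues · Ctrl+Enter steers",
        // Not reachable while loading under the rules above; kept for exhaustiveness.
        SubmitDisposition::Immediate => "Enter sends",
    })
}
```

Deriving the text from the same disposition function that Enter uses is what keeps the keyboard help from drifting away from the actual action.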

STEP 4 — Centralize status strings so the Up hint, queue count, and steer text are consistent. Add two small helper fns (in app.rs or ui.rs), queue_status_line(app) -> String and pending_input_summary(app), and use them from queue_follow_up (7306-7315), the submit_or_steer_message Queue arm (7327-7337), queue_current_draft_for_next_turn (5347-5363), and send_queued_message_now. This removes the divergent "↑ to edit" vs "/queue list" phrasings and lets every call site show the same text, including the new "/queue send <n>".

STEP 5 — Localization (REQUIRED for compile). In crates/tui/src/localization.rs add MessageIds and 7 arms each (english + 6 locales; non-English may mirror English text but the arm MUST exist or the crate won't build — all matches are exhaustive, no wildcards):

  • CmdQueueSent ("Sent queued message {index} into current turn"), CmdQueueSendUsage / update CmdQueueUsage to include send, CmdQueueTip update to mention send. Add the variants to the MessageId enum (331-346 region), the ALL iteration list (769-784), and english() + all 6 locale fns. Footer hint strings in STEP 3 can be plain &'static str (footer already mixes literals) to avoid widening the localization surface, but prefer MessageIds if you want zh/ja coverage.
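The exhaustive-match constraint is easier to see in miniature. The sketch below is a two-variant, two-locale reduction of the pattern described for localization.rs: the function shapes (english returning a string, a locale function returning Option, a fallback in the lookup) follow the description above, but the reduced enums, the tr signature, and the CmdQueueNotFound text are illustrative assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MessageId {
    CmdQueueSent,
    CmdQueueNotFound,
}

impl MessageId {
    /// Iteration list; a new variant must be added here as well.
    const ALL: [MessageId; 2] = [MessageId::CmdQueueSent, MessageId::CmdQueueNotFound];
}

#[derive(Debug, Clone, Copy)]
enum Locale {
    English,
    Japanese,
}

/// No wildcard arm: adding a MessageId variant without an arm here is a compile error.
fn english(id: MessageId) -> &'static str {
    match id {
        MessageId::CmdQueueSent => "Sent queued message {index} into current turn",
        // Illustrative text only, not the shipped string.
        MessageId::CmdQueueNotFound => "Queued message {index} not found",
    }
}

/// Non-English functions may return None, but every arm must still exist.
fn japanese(id: MessageId) -> Option<&'static str> {
    match id {
        MessageId::CmdQueueSent => None, // falls back to English for now
        MessageId::CmdQueueNotFound => None,
    }
}

/// Look up a message, falling back to English when the locale has no translation.
fn tr(locale: Locale, id: MessageId) -> &'static str {
    let translated = match locale {
        Locale::English => None,
        Locale::Japanese => japanese(id),
    };
    translated.unwrap_or_else(|| english(id))
}
```

Returning None from a locale arm is the cheapest way to satisfy the compiler for a new id while translations are pending.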

WHY this shape: the per-row mode labels the issue asks for already exist and are tested in the widget; the real fixes are (a) closing the accidental single-key trap, (b) giving rows an actionable send-now via the existing AppAction->submit_or_steer_message disposition path, and (c) making the footer/Esc affordances state-accurate. No engine/control-plane changes — this is purely TUI composer UX, aligned with #874 which owns consumption timing.

Builds on: None of the named landed foundations (worker_profile, goal_loop, model_registry, provider_readiness, context_budget, provider_adapter, resource_telemetry, request_tuning, turn_loop cancel) are in this composer-UX path — they are engine/provider/runtime concerns. This issue builds instead on the EXISTING (pre-foundation) composer scaffolding already on codex/v0.8.61: the pending_input_preview widget's mode-labeled buckets, the /queue command family, App.decide_submit_disposition + SubmitDisposition, the AppAction::SendMessage -> submit_or_steer_message deferred-action path, and send_ctrl_s_queued_message_now as the send-now template. It deliberately stays UI-only and links back to #874 (which owns queued-input consumption timing) per the issue's acceptance criteria.

Files: crates/tui/src/tui/ui.rs, crates/tui/src/tui/app.rs, crates/tui/src/commands/groups/core/queue.rs, crates/tui/src/commands/groups/core/mod.rs, crates/tui/src/tui/footer_ui.rs, crates/tui/src/localization.rs, crates/tui/src/tui/widgets/pending_input_preview.rs

Test plan: Unit + widget tests (cargo test -p codewhale-tui or the crate's package name; existing tests in these files show the harness):

  1. Single-key disarm (ui.rs / app.rs unit tests, mirror submit_disposition_* tests at app.rs:7584+): with is_loading=true + one queued message + empty input, first bare-Up does NOT set queued_draft and DOES arm queue_edit; assert queue_edit_is_armed(); a second Up within window sets queued_draft to the popped message. Assert that an expired arm + single Up is a no-op. Add a regression test named like accidental_up_during_loading_requires_confirm.
  2. /queue send (queue.rs tests, mirror test_queue_drop_success at 280-295): /queue send 1 on a 2-item queue returns CommandResult with action == Some(AppAction::SendQueuedNow { index: 0 }) and does NOT mutate the queue (removal happens in the async arm); /queue send with no index returns CmdQueueMissingIndex; out-of-range returns CmdQueueNotFound; with queued_draft set returns the instructive error.
  3. send_queued_message_now behaviour (ui.rs async test if one exists, else assert via decide_submit_disposition path): factored helper steers when is_loading + not streaming, queues when streaming, dispatches when idle, and re-queues on offline — assert status strings come from the centralized queue_status_line helper.
  4. Footer hint (footer_ui.rs unit test, mirror existing footer_state_label tests): footer_state_label returns "queued-edit" (not "draft") when queued_draft.is_some(); footer_disposition_hint_spans yields the steer-phrasing when is_loading + non-streaming + non-empty input, the queue-phrasing when streaming, and the editing/Esc-restores phrasing when queued_draft set. Assert the hint cluster is not dropped before lower-priority clusters at a representative width.
  5. Widget (pending_input_preview.rs) — existing tests already cover the four labels and the Esc-restore hint (393-531); add one asserting the steer bucket and queued bucket render with their distinct prefixes simultaneously (extends all_pending_inputs_render_as_one_list) to lock the live-vs-queued distinction.
  6. Localization: a test that tr(locale, CmdQueueSent/CmdQueueUsage) is non-empty for every Locale variant (guards the exhaustive-arm requirement). cargo build alone will catch missing arms.

Manual repro (documented in PR per acceptance criteria): start a long turn, type a draft, press Up, and confirm it no longer silently enters edit mode; use /queue send 1 to deliver a queued item live; press Esc while editing a queued draft and confirm the follow-up is restored and unrelated draft text is not dropped.

Risks:

  1. Localization is the main build-breaker: localization.rs matches all 429 MessageIds exhaustively in 7 locale functions with no wildcard (verified). Any new MessageId needs an arm in english() + japanese/chinese_simplified/traditional_chinese/portuguese_brazil/spanish_latin_america/vietnamese plus the enum (331-346) and ALL list (769-784). Mitigate by mirroring English text in non-English arms initially, or keep footer hint strings as plain &'static str literals (footer already mixes literals) to minimize new MessageIds.
  2. Up-key arm interaction: the bare-Up arm (ui.rs:3924-3937) sits among several other Up arms (attachment selection 3908-3923, history, slash/mention menus). The new arm/confirm must preserve arm ordering so it only triggers in the exact idle-empty-cursor0 state and never shadows attachment/history navigation. Reuse the proven quit-arm timeout pattern to avoid a new bug class.
  3. Async/sync boundary: queue.rs is sync and cannot dispatch; it MUST return CommandResult::action(AppAction::SendQueuedNow) and let the async ui.rs arm do the work (same pattern as other AppActions). Removing the message must happen in the async arm, not in send_queue, to avoid losing the message if the action is dropped.
  4. send_ctrl_s_queued_message_now refactor: factoring it into send_queued_message_now must keep the Ctrl+S path (ui.rs:4436) behaving identically (offline re-queue, failed-steer fallback, status text); cover this with the existing/extended tests.
  5. Footer width: adding a disposition-hint cluster competes for footer space; if not prioritized it may be truncated away exactly when needed (during loading). Place it ahead of cost/cache/git clusters in footer_auxiliary_spans.
  6. Scope discipline: do NOT touch engine consumption timing (owned by #874); keep changes TUI-only so this stays mergeable independently.


#3075 — effort: medium

Current state: /model opens ModelPickerView (crates/tui/src/tui/model_picker.rs). Today it is a two-pane "Model & thinking" modal with NO search and a provider-narrow row source:

  • Rows come from picker_model_rows_for_app (model_picker.rs:382-448): auto, then model_completion_names_for_provider(app.api_provider) (the ACTIVE provider only), then the active provider's saved model, then a sorted tail of saved models from OTHER providers in app.provider_models (#2596). Non-current-provider catalog models (kimi, qwen, glm, gpt, codex, trinity) are NOT discoverable unless a user previously saved them.
  • ModelPickerRow { id, provider: Option<ApiProvider>, hint } (model_picker.rs:62-67) has no provenance and no rich metadata; hint is a hard-coded match (picker_model_hint, model_picker.rs:465-485).
  • State is index-only: selected_model_idx, selected_effort_idx, focus, show_custom_model_row, model_rows (model_picker.rs:47-60). No query field.
  • handle_key (model_picker.rs:496-549) handles navigation/Tab/Enter/Esc but DROPS KeyCode::Char/Backspace/Delete (the final arm is _ => ViewAction::None), so typing does nothing.
  • Title is hard-coded " Model & thinking " (model_picker.rs:601); footer hints have no "type to search" (605-614).

What already works and must be reused unchanged (the apply path is DONE):

  • The event ViewEvent::ModelPickerApplied { model, provider: Option<ApiProvider>, effort, previous_model, previous_effort } (crates/tui/src/tui/views/mod.rs:150-156) already carries an optional cross-provider provider.
  • build_event (model_picker.rs:217-228) already sets provider = resolved_provider().filter(|p| *p != initial_provider), and resolved_provider/resolved_model (model_picker.rs:122-139) already read the highlighted row's provider.
  • The handler apply_model_picker_choice (crates/tui/src/tui/ui.rs:5972-6092) already calls switch_provider(...) (ui.rs:6139) when target_provider is Some, differs, and the model is not auto — so cross-provider switch + persist + engine respawn is fully wired. Tests at model_picker.rs:1081-1171 already prove cross-provider rows carry their provider.

The catalog backing data exists but has ZERO production consumers (grep confirms): crate::model_registry (crates/tui/src/model_registry.rs, declared in main.rs:55) exposes seeded_model_ids() -> Vec<&'static str> (line 226) and lookup(model) -> Option<ModelMetadata> (line 189). ModelMetadata (lines 80-93) has id, provider: ModelProvider, context_window: Option<u32>, max_output: Option<u32>, supports_reasoning: bool. Its own doc comment (line 224) names "#3075 a future provider-aware model picker" as the intended first consumer.

THE GAP: (1) rows are built only from the active provider; (2) there is no live search/filter; (3) ModelProvider (coarse registry hint, model_registry.rs:49-74) has no mapping to a concrete config::ApiProvider that can actually serve the model, which is required because ModelPickerApplied.provider must be an ApiProvider.

Design: Confine all changes to crates/tui/src/tui/model_picker.rs plus one small additive helper in model_registry.rs. The apply path, event shape, and switch_provider are reused unchanged.

STEP 1 — Add a ModelProvider->ApiProvider bridge (crates/tui/src/model_registry.rs, additive). Add pub fn serving_provider(p: ModelProvider) -> crate::config::ApiProvider returning the provider that actually serves that family by default: DeepSeek->Deepseek, Anthropic->Anthropic, OpenAi->Openai, OpenAiCodex->OpenaiCodex, Moonshot->Moonshot, Zai->Zai, Minimax->Minimax, XiaomiMimo->XiaomiMimo, Qwen->Openrouter, Arcee->Arcee, Other->(caller falls back to active provider). This is the load-bearing decision: Qwen/Other ids have no first-class provider, so they route through Openrouter / the current provider. Keep #![allow(dead_code)] already on the module. Unit-test the map in model_registry.rs.
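A sketch of the bridge follows, using local copies of the two enums with the variant names listed above. One deliberate departure, labeled as an assumption: the sketch returns Option<ApiProvider> so that Other can be expressed as "no default", with a hypothetical serving_provider_or helper applying the active-provider fallback. The signature as written above returns a bare ApiProvider, which cannot represent that case by itself.

```rust
/// Local copy of the coarse registry family (variant names from the mapping above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ModelProvider {
    DeepSeek, Anthropic, OpenAi, OpenAiCodex, Moonshot,
    Zai, Minimax, XiaomiMimo, Qwen, Arcee, Other,
}

/// Local copy of the concrete provider enum (only the variants the map needs).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ApiProvider {
    Deepseek, Anthropic, Openai, OpenaiCodex, Moonshot,
    Zai, Minimax, XiaomiMimo, Openrouter, Arcee,
}

/// Default serving provider per family; None means "caller decides".
fn serving_provider(family: ModelProvider) -> Option<ApiProvider> {
    // Exhaustive on purpose: a new family must be given an explicit route.
    match family {
        ModelProvider::DeepSeek => Some(ApiProvider::Deepseek),
        ModelProvider::Anthropic => Some(ApiProvider::Anthropic),
        ModelProvider::OpenAi => Some(ApiProvider::Openai),
        ModelProvider::OpenAiCodex => Some(ApiProvider::OpenaiCodex),
        ModelProvider::Moonshot => Some(ApiProvider::Moonshot),
        ModelProvider::Zai => Some(ApiProvider::Zai),
        ModelProvider::Minimax => Some(ApiProvider::Minimax),
        ModelProvider::XiaomiMimo => Some(ApiProvider::XiaomiMimo),
        // Judgment call from the design: Qwen has no first-class provider.
        ModelProvider::Qwen => Some(ApiProvider::Openrouter),
        ModelProvider::Arcee => Some(ApiProvider::Arcee),
        ModelProvider::Other => None,
    }
}

/// Hypothetical convenience wrapper: fall back to the active provider for Other.
fn serving_provider_or(family: ModelProvider, active: ApiProvider) -> ApiProvider {
    serving_provider(family).unwrap_or(active)
}
```

If the bare-ApiProvider signature is preferred, the same fallback could instead be done by passing the active provider into serving_provider directly.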

STEP 2 — Enrich the row type (model_picker.rs:62-67). Add to ModelPickerRow: provider_label: String (from ApiProvider::display_name() or "auto"), and source: RowSource where enum RowSource { Auto, Catalog, Saved, Custom }. Keep id, provider: Option<ApiProvider>, hint. The hint becomes metadata-driven (Step 4).

STEP 3 — Rebuild rows from the ALL-PROVIDER catalog (rewrite picker_model_rows_for_app, model_picker.rs:382-448). Build the full set once at new() and store it as all_rows: Vec<ModelPickerRow> on the view:

  a. Row 0 = auto (source Auto, provider None): always present, but ensure search never lets it dominate (it only matches when the query is a prefix of "auto"; never inject it into non-empty filtered results unless it matches).
  b. For each id in model_registry::seeded_model_ids(): lookup(id) -> meta; provider = serving_provider(meta.provider); push a Catalog row with provider_label = provider.display_name().
  c. For the active provider, also fold in model_completion_names_for_provider(app.api_provider) (covers ids the registry has not seeded yet, e.g. the long Minimax/OpenRouter lists) as Catalog rows tagged with the active provider, until #3071/#3072 fully supersede them (matches the issue's "bundled fallback rows until #3071/#3072 land").
  d. Saved rows: active-provider saved model + every parseable other-provider entry in app.provider_models (preserve the existing #2596 behavior and dedupe via push_model_row, model_picker.rs:450-463) tagged source Saved.
  e. Dedupe on (id, provider) (existing push_model_row already does this). Keep show_custom_model_row for the active model when it matches nothing.

STEP 4 — Metadata-driven hint. Replace the hard-coded picker_model_hint match with a builder that composes from existing facts: model_registry::lookup(id) for context window + supports_reasoning; crate::pricing::has_pricing_for_model(id) (pricing.rs:116) for "priced"/"price unknown". Render e.g. "1M ctx · reasoning · priced" / "262K ctx · price unknown". Keep the saved/custom suffix. Reuse context_window formatting already used elsewhere (round to K/M). Falls back gracefully to provider label when lookup returns None.
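A sketch of the hint builder is shown below. ModelFacts, format_context_window, and metadata_hint are hypothetical names, and the rounding rule (decimal thousands, one decimal place for non-whole millions) is an assumption, since the existing K/M formatter is not shown here. The saved/custom suffix is left out.

```rust
/// Hypothetical subset of the registry metadata the hint needs.
struct ModelFacts {
    context_window: Option<u32>,
    supports_reasoning: bool,
}

/// Round a token count to a compact K/M label (assumed rounding rule).
fn format_context_window(tokens: u32) -> String {
    if tokens >= 1_000_000 {
        let tenths = tokens / 100_000; // e.g. 1_500_000 -> 15
        if tenths % 10 == 0 {
            format!("{}M", tenths / 10)
        } else {
            format!("{}.{}M", tenths / 10, tenths % 10)
        }
    } else if tokens >= 1_000 {
        format!("{}K", tokens / 1_000)
    } else {
        tokens.to_string()
    }
}

/// Compose the row hint; degrade to the provider label when nothing is known.
fn metadata_hint(facts: Option<&ModelFacts>, priced: bool, provider_label: &str) -> String {
    let Some(facts) = facts else {
        return provider_label.to_string();
    };
    let mut parts: Vec<String> = Vec::new();
    if let Some(ctx) = facts.context_window {
        parts.push(format!("{} ctx", format_context_window(ctx)));
    }
    if facts.supports_reasoning {
        parts.push("reasoning".to_string());
    }
    parts.push(if priced { "priced" } else { "price unknown" }.to_string());
    parts.join(" · ")
}
```

In the real builder, `facts` would come from model_registry::lookup(id) and `priced` from has_pricing_for_model(id).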

STEP 5 — Add search state + live filter (mirror command_palette.rs exactly). Add query: String and filtered: Vec<usize> (indices into all_rows) to ModelPickerView. Add fn refilter(&mut self) modeled on command_palette.rs:605-627: lowercase, split_whitespace, every term must match. Match haystack = format!("{provider_label} {id} {source_tag}") lowercased, PLUS provider aliases so the issue's keywords work (provider scope token like openrouter qwen, moonshot kimi, codex gpt): port parse_section_term (command_palette.rs:360-378) to a provider-scope parser keyed on provider as_str()/aliases (kimi->Moonshot, glm->Zai, qwen->Openrouter, codex->OpenaiCodex). Keep auto out of non-empty result sets unless it literally matches. Replace selected_model_idx semantics so the highlighted item indexes into filtered (keep a helper current_row() returning all_rows[filtered[sel]]). Update resolved_model/resolved_provider/model_row_count/move_up/move_down/Home/End to operate over filtered.
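The filter rule (lowercase, split on whitespace, every term must match, auto kept out unless it literally matches) can be sketched as below. Row is a hypothetical reduction of ModelPickerRow, aliases are folded into the haystack instead of going through a ported parse_section_term, and the model ids in the usage note are made up for illustration.

```rust
/// Hypothetical reduction of a picker row to the searchable fields.
struct Row {
    id: &'static str,
    provider_label: &'static str,
    /// Extra search keywords for the row (e.g. "kimi" on a Moonshot row).
    aliases: &'static [&'static str],
}

/// Return indices into `rows` that match every whitespace-separated term.
fn refilter(rows: &[Row], query: &str) -> Vec<usize> {
    let query = query.to_lowercase();
    let terms: Vec<&str> = query.split_whitespace().collect();
    rows.iter()
        .enumerate()
        .filter(|(_, row)| {
            if terms.is_empty() {
                return true; // empty query shows everything, auto first
            }
            if row.id == "auto" {
                // auto only appears when the whole query is a prefix of "auto".
                return terms.len() == 1 && "auto".starts_with(terms[0]);
            }
            let haystack =
                format!("{} {} {}", row.provider_label, row.id, row.aliases.join(" "))
                    .to_lowercase();
            terms.iter().all(|term| haystack.contains(*term))
        })
        .map(|(index, _)| index)
        .collect()
}
```

With rows such as `auto`, a Moonshot row aliased `kimi`, and an OpenRouter row aliased `qwen`, the query "openrouter qwen" keeps only the OpenRouter row, because both terms must appear in the same haystack.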

STEP 6 — Custom escape hatch (acceptance: unknown typed id usable). When filtered is empty (no catalog/saved match for a non-empty query), synthesize a trailing custom row labeled Use "<query>" with <provider> where <provider> = the active provider, or the scoped provider if the query carried a provider: scope. Selecting it emits ModelPickerApplied { model: query, provider: Some(scoped_or_active) }. Preserve the existing show_custom_model_row (current active model) behavior for the empty-query case.

STEP 7 — handle_key: add typing (model_picker.rs:496-549). Before the final _ arm add: KeyCode::Char(c) (no CONTROL/ALT) -> push to query, refilter(), clamp selection; KeyCode::Backspace / Ctrl+H -> pop, then refilter(); KeyCode::Delete or Esc-with-nonempty-query -> clear query first (so Esc clears search, second Esc closes), matching command_palette behavior. Keep Up/Down/PageUp/PageDown/Tab/Enter/Esc. Note: Char navigation shortcuts must NOT steal keys (unlike palette's j/k) because every char is a search term here.

STEP 8 — Render (model_picker.rs:571-679). Add a one-line search input box above the Model pane showing Search: <query> (reuse palette's input styling). Change the modal title from " Model & thinking " to " Model · provider · thinking " (model_picker.rs:601). Add provider label as a styled prefix/column in each model row (extend picker_row_spans, model_picker.rs:308-341, to optionally show provider_label dimmed before the id). Add a footer hint "type to search" (model_picker.rs:605-614). Keep the Thinking pane logic untouched — current_efforts()/resolved_provider() already recompute efforts when the highlighted row's provider changes (model_picker.rs:151-166), so cross-provider effort normalization keeps working.

No changes needed to ui.rs, views/mod.rs, config.rs, or settings — the emitted event and its handler already do provider switch + model persist + engine respawn.

Builds on: model_registry (#3071/#3073): primary new data source via seeded_model_ids() and lookup() — this issue is its first production consumer (its doc names #3075). Adds the serving_provider(ModelProvider) -> ApiProvider bridge there. pricing.rs (has_pricing_for_model, from #3201/mvanhorn) and models.rs (model_supports_reasoning, context_window_for_model) feed the metadata hint. Reuses the already-landed cross-provider apply path: ViewEvent::ModelPickerApplied { provider: Option<ApiProvider> } + apply_model_picker_choice + switch_provider (ui.rs:5972/6139) — unchanged. provider_readiness.rs is NOT a hard dependency but its ModelProvenance enum (Default/Saved/Custom/Catalog) is the naming precedent for RowSource; a follow-up could badge rows with readiness, out of scope here. Search/filter mechanics are copied from command_palette.rs (refilter/parse_section_term/char-key handling).

Files: crates/tui/src/tui/model_picker.rs, crates/tui/src/model_registry.rs

Test plan: All new tests live in model_picker.rs mod tests (reuse create_test_app at model_picker.rs:733 and the lock_test_env guard) plus one in model_registry.rs. Verification commands from the issue: cargo test -p codewhale-tui model_picker, cargo test -p codewhale-tui model_metadata (registry), cargo test -p codewhale-tui config.

model_registry.rs:

  1. serving_provider_maps_each_family: every ModelProvider variant maps to the expected ApiProvider (Qwen->Openrouter, Moonshot->Moonshot, Zai->Zai, OpenAiCodex->OpenaiCodex, ...).

model_picker.rs (new):

  2. picker_lists_cross_provider_catalog_without_search: with active=Deepseek, visible rows include catalog ids from OTHER providers (e.g. a kimi/glm/qwen seeded id), each carrying its serving_provider, proving discovery no longer requires a saved entry. (Strengthens the existing #2596 test.)
  3. search_filters_rows_live: push KeyCode::Char for "kimi"; assert filtered rows all contain kimi and a moonshot/openrouter id is selectable; backspacing restores rows.
  4. provider_scoped_search_matches: query "openrouter qwen" yields only OpenRouter qwen rows; "codex gpt" yields only Codex gpt rows (proves alias/scope parsing).
  5. selecting_cross_provider_row_emits_provider: filter to a non-active-provider catalog row, Enter, assert ModelPickerApplied { model, provider: Some(that_provider) } (mirror existing model_picker.rs:1081-1128 assertion style).
  6. same_provider_selection_emits_no_provider: pick an active-provider row, assert provider: None (mirror model_picker.rs:1047-1078).
  7. no_match_offers_custom_row_with_provider: type an id absent from catalog/saved (e.g. "my-private/model-x"); assert a custom row exists and Enter emits { model: "my-private/model-x", provider: Some(active) }; with a provider: scope, the scoped provider is emitted.
  8. auto_does_not_dominate_search: non-empty query that does not match "auto" excludes the auto row; empty query keeps auto first.
  9. thinking_effort_still_available_and_normalizes_cross_provider: after highlighting a saved Codex row via search, Thinking pane shows codex tiers and remaps effort (re-uses logic proven at model_picker.rs:861-917).
  10. metadata_hint_reports_context_and_pricing: assert a known seeded id's hint contains its context window (e.g. "1M") and a reasoning/pricing token, and an unknown id degrades gracefully.
  11. Regression: keep ALL existing model_picker tests green (esp. saved-model #2596 tests 1081-1143, passthrough/ollama tests 988-1044, custom-row tests 920-927/1330-1341). Selection now indexes through filtered, so update those that assert literal selected_model_idx values to assert via resolved_model()/resolved_provider() instead.

Risks:

  1. Selection-index refactor: switching selected_model_idx from indexing model_rows to indexing filtered touches move_up/down/Home/End/resolved_model/resolved_provider/render; several existing tests assert literal index values (e.g. model_picker.rs:1179, 1252, 1295, 1310, 1337) and must be re-expressed via resolved_model()/resolved_provider(). Mitigate by keeping a current_row() accessor and making resolved_* the single source of truth.
  2. ModelProvider->ApiProvider correctness: mapping a family to a provider the user has no key for could switch to an unusable provider. But switch_provider (ui.rs:6139) already validates credentials and aborts the switch if missing (apply_model_picker_choice returns early when app.api_provider != target_provider), so worst case is a no-op + status message, which is acceptable. Qwen/Other -> Openrouter is the only judgment call; document it.
  3. auto dominating results: explicitly excluded from non-empty filtered sets; covered by test 8.
  4. Catalog vs active-provider list overlap (Step 3c) can create near-duplicate ids with different providers; the existing push_model_row dedupe is on (id, provider) so distinct-provider duplicates are intentional (same id servable by two providers). Verify ordering keeps the active provider first to preserve the #2596 ordering guarantee (model_picker.rs:1116-1127).
  5. Registry coverage gaps: seeded list is curated and smaller than the full provider lists; Step 3c (fold active-provider model_completion_names_for_provider) keeps parity until #3072 hydrates a live catalog. No behavior regression for existing providers.
  6. Scope: deliberately NOT touching the apply/persist path or adding live network model hydration (#3072), which keeps the change contained to two files.


#3072 — effort: medium

Current state: Model facts today come from the four hard-coded sites listed below. Capability lookups all converge on the same three models.rs functions, which is the key to a clean design:

  • crates/tui/src/models.rs:238 context_window_for_model, :325 max_output_tokens_for_model, :366 model_supports_reasoning — big match tables (known_context_window_for_model at models.rs:264 lists ~40 OpenRouter ids with literal windows; pricing literals live separately).
  • crates/tui/src/pricing.rs:144 known_pricing_for_model — hard-coded USD per-million rows whose comments literally say "mirror the curated OpenRouter catalog ... sourced from OpenRouter's per-token API fields."
  • crates/config/src/lib.rs:40-89 — curated *_MODEL id constants per provider.
  • crates/tui/src/tui/model_picker.rs:465 picker_model_hint — hard-coded match of UI hints; rows come from model_completion_names_for_provider (config.rs:954), not a refreshable catalog.

The convergence: config::provider_capability (config.rs:415) — the canonical capability resolver consumed by model_inventory.rs:76 and provider_readiness.rs — itself delegates to those same three models.rs fns (config.rs:422/424/426, 455/457/458, 498/511, etc.). model_registry.rs:99 ModelMetadata::seed also seeds from the same three fns and is guarded against drift (model_registry.rs:242 registry_context_window_matches_models_rs). So all 12 call sites (config.rs, model_registry.rs, context_report.rs, compaction.rs, core/engine/context.rs, client/chat.rs, client/anthropic.rs, prompts.rs) read facts through one chokepoint.

THE GAP (issue #3072): there is no catalog cache/provenance layer. Nothing can refresh context-length/pricing from OpenRouter /models; nothing records "this came from OpenRouter on date X" vs "bundled" vs "user override." Crucially, the foundation already RESERVED this work: provider_readiness.rs:106-117 defines ModelProvenance with a Catalog variant documented "Selected from a live-hydrated provider catalog (reserved; #3072)", label()=="catalog" (provider_readiness.rs:127), already a field on ProviderReadinessRow (provider_readiness.rs:153) and set by resolve_model (provider_readiness.rs:299) — but never populated. No model_catalog/catalog* module exists yet (confirmed no name clash).

Design: Add a metadata-override layer, NOT a rewrite of the lookups. Build a new module crates/tui/src/model_catalog.rs (declare mod model_catalog; in main.rs near mod model_registry; at line 55) plus a small override hook inside models.rs. Implement in this order:

(1) TYPES in model_catalog.rs:

  • pub enum MetadataProvenance { ProviderApi, Bundled, UserOverride, Unknown } with as_str() returning exactly provider_api/bundled/user_override/unknown (the acceptance-criteria strings). Map to existing provider_readiness::ModelProvenance::Catalog when ProviderApi.
  • #[derive(Serialize,Deserialize)] pub struct CatalogEntry { id, context_window: Option<u32>, max_output: Option<u32>, supports_reasoning: Option<bool>, input_usd_per_million: Option<f64>, output_usd_per_million: Option<f64>, modalities: Vec<String>, supported_parameters: Vec<String>, provider_model_id: Option<String>, provenance: MetadataProvenance }. NOTE: NO api-key/token/auth fields — the struct physically cannot carry secrets (satisfies "no secrets in cache").
  • #[derive(Serialize,Deserialize)] pub struct CatalogCache { schema_version: u32, source: String /* "openrouter" */, fetched_at: chrono::DateTime<Utc>, ttl_secs: u64, entries: BTreeMap<String,CatalogEntry> } with fn is_stale(&self, now) -> bool.
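A std-only sketch of the provenance enum and the staleness check is given below. To stay dependency-free it drops the serde derives and stores fetched_at as Unix seconds instead of chrono::DateTime<Utc>, and it omits the entries map. Both are simplifications for illustration, and treating a cache as stale once its age reaches the TTL is an assumption about the boundary.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MetadataProvenance {
    ProviderApi,
    Bundled,
    UserOverride,
    Unknown,
}

impl MetadataProvenance {
    /// The exact strings named in the acceptance criteria.
    fn as_str(self) -> &'static str {
        match self {
            MetadataProvenance::ProviderApi => "provider_api",
            MetadataProvenance::Bundled => "bundled",
            MetadataProvenance::UserOverride => "user_override",
            MetadataProvenance::Unknown => "unknown",
        }
    }
}

/// Simplified cache header (Unix seconds stand in for chrono::DateTime<Utc>).
struct CatalogCache {
    schema_version: u32,
    source: String,
    fetched_at_unix: u64,
    ttl_secs: u64,
}

impl CatalogCache {
    /// Stale once the cache age reaches the TTL. A clock that went backwards
    /// yields age 0 via saturating_sub, so the cache is treated as fresh.
    fn is_stale(&self, now_unix: u64) -> bool {
        now_unix.saturating_sub(self.fetched_at_unix) >= self.ttl_secs
    }
}
```

Taking `now` as a parameter mirrors the is_stale(&self, now) shape above and lets tests pin the clock.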

(2) OPENROUTER FETCH + NORMALIZE (model_catalog.rs): async fn fetch_openrouter_catalog(timeout_ms) -> Result<Vec<CatalogEntry>>. Use the established template at web_run.rs:778-793: crate::tls::reqwest_client_builder().timeout(Duration::from_millis(timeout_ms)).build()?.get("https://openrouter.ai/api/v1/models").send().await?.json::<OpenRouterModelsResponse>(). OpenRouter /models is UNAUTHENTICATED — send no Authorization header (reinforces no-secret path). Define serde structs for the OpenRouter shape: { data: [{ id, context_length, top_provider:{max_completion_tokens}, pricing:{prompt,completion}, architecture:{input_modalities,output_modalities}, supported_parameters:[..] }] }. Normalize: context_length->context_window; top_provider.max_completion_tokens->max_output; pricing strings are per-TOKEN USD → multiply by 1_000_000 for per-million (matches pricing.rs comment at :150); set provenance=ProviderApi. Filter to ids CodeWhale curates (intersect with model_registry::seeded_model_ids()) so the cache stays bounded and first-class.
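The normalize step can be sketched without the network or serde. OpenRouterModel below is a hand-flattened stand-in for the deserialized response item, CatalogFacts is a hypothetical subset of CatalogEntry, and rejecting negative or non-finite prices is a defensive assumption of this sketch, not something the design specifies.

```rust
/// Hand-flattened stand-in for one item of the /models response.
struct OpenRouterModel {
    id: String,
    context_length: Option<u32>,
    max_completion_tokens: Option<u32>,
    /// Per-token USD, delivered as strings.
    prompt_price: Option<String>,
    completion_price: Option<String>,
}

/// Hypothetical subset of CatalogEntry produced by normalization.
#[derive(Debug, PartialEq)]
struct CatalogFacts {
    id: String,
    context_window: Option<u32>,
    max_output: Option<u32>,
    input_usd_per_million: Option<f64>,
    output_usd_per_million: Option<f64>,
}

/// Convert a per-token USD string to USD per million tokens.
fn per_million(per_token: &str) -> Option<f64> {
    per_token
        .trim()
        .parse::<f64>()
        .ok()
        // Defensive assumption: treat negative or non-finite prices as unknown.
        .filter(|price| price.is_finite() && *price >= 0.0)
        .map(|price| price * 1_000_000.0)
}

/// Keep only curated ids and map provider field names onto catalog field names.
fn normalize(raw: Vec<OpenRouterModel>, curated_ids: &[&str]) -> Vec<CatalogFacts> {
    raw.into_iter()
        .filter(|model| curated_ids.contains(&model.id.as_str()))
        .map(|model| CatalogFacts {
            input_usd_per_million: model.prompt_price.as_deref().and_then(per_million),
            output_usd_per_million: model.completion_price.as_deref().and_then(per_million),
            id: model.id,
            context_window: model.context_length,
            max_output: model.max_completion_tokens,
        })
        .collect()
}
```

Keeping normalization a pure function over already-parsed data means it can be unit-tested against a captured response without touching the fetch path.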

(3) DISK CACHE with TTL + provenance (model_catalog.rs):

  • Path: config::ensure_state_dir("catalog")?.join("openrouter.json") (writes under ~/.codewhale/catalog/, satisfying "cache file under .codewhale/ or platform config dir"; resolvers at config/lib.rs:3469-3489).
  • fn load_cached() -> Option<CatalogCache>: read+serde_json::from_str; tolerate missing/parse errors by returning None (never panic, never block).
  • fn store(cache:&CatalogCache) -> Result<()>: serialize and crate::utils::write_atomic(path, bytes) (utils.rs:175, NamedTempFile+fsync+persist). Serializing only the secret-free struct is the no-token guarantee.
  • pub async fn refresh_openrouter(ttl_secs) -> Result<()>: fetch → build CatalogCache{fetched_at:Utc::now()} → store. Called opt-in (e.g. a /catalog refresh debug command and/or background task); NEVER on the startup hot path.

(4) BUNDLED OFFLINE SNAPSHOT (model_catalog.rs): const BUNDLED_CATALOG_JSON: &str = include_str!("../assets/model_catalog.bundled.json") — a committed snapshot covering the curated OpenRouter ids with provenance=Bundled. fn bundled() -> CatalogCache parses it via serde_json::from_str(...).expect(...) with a unit test that proves it parses (so a malformed snapshot fails CI, never users). This keeps picker/context/pricing working air-gapped.

(5) DETERMINISTIC MERGE + IN-PROCESS HANDLE (model_catalog.rs): fn active_catalog() -> &'static RwLock<MergedCatalog> (OnceLock). MergedCatalog::resolve(id) -> Option<&CatalogEntry> applies the issue's order: user_override > fresh provider cache (load_cached, only if !is_stale) > bundled > none. A stale provider cache is skipped for facts but its provenance can still be surfaced as "stale" in debug. User overrides come from an optional [[model_catalog.override]] table parsed from config (add to config.example.toml docs + a catalog_overrides: Vec<CatalogEntry> field threaded in at init; keep minimal — even an empty-vec default satisfies the merge-order requirement and the test).
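
The resolution order above (user override, then fresh provider cache, then bundled) can be illustrated with a small in-memory model. This is a hedged sketch: the Entry fields, the Provenance variants, and the boolean staleness flag are simplifications chosen for illustration, and precedence is applied per whole entry because resolve returns a single entry reference in the design.

```rust
use std::collections::HashMap;

/// Hypothetical provenance tag for where an entry's facts came from.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Provenance {
    UserOverride,
    ProviderApi,
    Bundled,
}

/// Hypothetical, heavily trimmed catalog entry.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Entry {
    pub context_window: Option<u32>,
    pub provenance: Provenance,
}

/// Three layers plus a staleness flag for the provider layer.
#[derive(Debug, Default)]
pub struct MergedCatalog {
    pub user_overrides: HashMap<String, Entry>,
    pub provider_cache: HashMap<String, Entry>,
    pub provider_cache_is_stale: bool,
    pub bundled: HashMap<String, Entry>,
}

impl MergedCatalog {
    /// user_override > fresh provider cache > bundled > none.
    pub fn resolve(&self, id: &str) -> Option<&Entry> {
        if let Some(entry) = self.user_overrides.get(id) {
            return Some(entry);
        }
        // A stale provider cache is skipped for facts entirely.
        if !self.provider_cache_is_stale {
            if let Some(entry) = self.provider_cache.get(id) {
                return Some(entry);
            }
        }
        self.bundled.get(id)
    }
}
```

Keeping the order in one function makes the merge deterministic and gives the merge-order and stale-cache tests a single place to exercise.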

(6) WIRE THE OVERRIDE INTO THE CHOKEPOINT (models.rs) — this is what makes every consumer benefit with one edit per fn. At the TOP of context_window_for_model (models.rs:238), max_output_tokens_for_model (:325), model_supports_reasoning (:366), consult model_catalog::resolved_context_window(model) / _max_output / _supports_reasoning first; if Some, return it; else fall through to today's literal tables unchanged. Because config::provider_capability and model_registry::seed already delegate here, the whole capability/registry/pricing/picker stack inherits hydrated values with zero changes at those sites. IMPORTANT: keep the bundled/override defaults numerically equal to the current literals for the drift-guarded sample, or the design degrades the registry guard (model_registry.rs:242) — verify and, if a bundled value intentionally differs, update that test's expectations in the same change.
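
The "consult the catalog first, else fall through" shape at the chokepoint could be sketched as follows. Everything here is illustrative: CatalogFacts, literal_context_window, the explicit catalog parameter, and the Option<u32> return type are assumptions made for the example, and the literal value is made up rather than copied from models.rs.

```rust
use std::collections::HashMap;

/// Hypothetical per-model facts. Each field is optional so a partial entry
/// overrides only what it actually carries.
#[derive(Debug, Default, Clone)]
pub struct CatalogFacts {
    pub context_window: Option<u32>,
    pub max_output: Option<u32>,
}

/// Stand-in for today's literal table (the value is invented for the sketch).
fn literal_context_window(model: &str) -> Option<u32> {
    match model {
        "example-model" => Some(128_000),
        _ => None,
    }
}

/// Catalog value wins only when the entry has that specific field;
/// otherwise behavior is identical to the literal table.
pub fn context_window_for_model(
    catalog: &HashMap<String, CatalogFacts>,
    model: &str,
) -> Option<u32> {
    catalog
        .get(model)
        .and_then(|facts| facts.context_window)
        .or_else(|| literal_context_window(model))
}
```

With an empty catalog the function returns exactly what the literal table returns, which is the property the no-catalog identity test is meant to lock in.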

(7) PRICING (pricing.rs): add a thin catalog_pricing_for_model(lower) -> Option<ModelPricing>, consulted at the top of known_pricing_for_model (pricing.rs:144), that returns usd_only_pricing(...) built from the catalog entry's per-million fields; fall through to the existing rows otherwise. This satisfies "OpenRouter metadata can populate/refresh pricing."

(8) PROVENANCE SURFACING (provider_readiness.rs + a debug command): in resolve_model (provider_readiness.rs:299), when the resolved model's facts came from the provider cache, set ModelProvenance::Catalog (the reserved variant at :114). Add the metadata-level MetadataProvenance::as_str() to whatever debug or /catalog status output exists so the UI and debug views can show provider_api|bundled|user_override|unknown (an acceptance criterion). Optionally, enrich model_picker::picker_model_hint to append a catalog-sourced context/price hint when present; that is non-blocking polish.

Builds on: the landed foundations. (a) provider_readiness.rs: ModelProvenance::Catalog was reserved specifically for #3072 (provider_readiness.rs:113-114), so populate it. (b) model_registry.rs is the canonical metadata registry; reuse seeded_model_ids() (model_registry.rs:226) to bound the catalog to curated ids, and respect its drift guard (model_registry.rs:242). (c) provider_adapter.rs CapabilityDescriptor::from_capability (provider_adapter.rs:92) and config::provider_capability (config.rs:415) automatically inherit hydrated facts because they delegate to the same models.rs chokepoint, so no edits are needed there. The design also reuses non-foundation primitives: crate::utils::write_atomic (utils.rs:175), crate::tls::reqwest_client_builder (web_run.rs:778 template), and config::ensure_state_dir/resolve_state_dir/codewhale_home (config/lib.rs:3437-3489).

Files:

  • crates/tui/src/model_catalog.rs (NEW module: types, OpenRouter fetch+normalize, TTL disk cache via write_atomic, bundled snapshot, deterministic merge)
  • crates/tui/assets/model_catalog.bundled.json (NEW committed offline snapshot, provenance=bundled)
  • crates/tui/src/main.rs (declare mod model_catalog; near line 55)
  • crates/tui/src/models.rs (consult catalog override at top of context_window_for_model:238, max_output_tokens_for_model:325, model_supports_reasoning:366)
  • crates/tui/src/pricing.rs (consult catalog at top of known_pricing_for_model:144)
  • crates/tui/src/provider_readiness.rs (set ModelProvenance::Catalog in resolve_model:299 when facts are catalog-sourced)
  • crates/config/src/lib.rs (optional [[model_catalog.override]] config field + parse; reuse ensure_state_dir)
  • config.example.toml (document catalog refresh TTL + override table)

Test plan: Match the issue's verification commands by naming the test module/tests accordingly:

  • cargo test -p codewhale-tui model_catalog — in model_catalog.rs mod tests: (1) openrouter_response_normalizes_context_and_pricing feeds a sample OpenRouter /models JSON literal and asserts context_window/max_output and per-MILLION pricing (per-token * 1e6); (2) bundled_snapshot_parses_and_is_nonempty parses include_str! snapshot; (3) merge_order_is_user_override_then_provider_then_bundled builds a MergedCatalog with conflicting values across the three layers and asserts the winner per the issue order; (4) stale_cache_is_ignored_for_facts — construct CatalogCache with fetched_at far in the past + small ttl, assert is_stale() true and resolve() skips it / falls back to bundled (acceptance: "stale cache behavior is tested"); (5) cache_roundtrip_writes_no_secret_fields — serialize a cache, assert the JSON string contains none of api_key|authorization|token|secret and round-trips (acceptance: "no secrets in cache"); (6) fetch_failure_falls_back_to_bundled — simulate fetch error (bad/unreachable URL or inject error) and assert resolve() still yields bundled facts and never panics (acceptance: "fetch failures never block").
  • cargo test -p codewhale-tui model_metadata — assert hydrated facts flow through: a test that, with a catalog override present for a curated id, models::context_window_for_model and config::provider_capability(...).context_window both reflect it; and a guard that with NO catalog present the values are byte-identical to today (protects model_registry.rs:242 drift guard — also re-run registry_context_window_matches_models_rs).
  • cargo test -p codewhale-tui pricing — existing pricing tests must still pass unchanged; add catalog_pricing_overrides_known_row_when_present and pricing_unchanged_without_catalog.
  • cargo test -p codewhale-tui model_picker — existing picker tests pass; add an assertion that a catalog-sourced model surfaces ModelProvenance::Catalog/"catalog" provenance through provider_readiness. Plus workspace gates: cargo fmt --all, cargo clippy -p codewhale-tui, full cargo test -p codewhale-tui. The OpenRouter live fetch itself stays out of CI (fed by JSON fixtures); only the normalizer/cache/merge are unit-tested.
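
The substring check behind cache_roundtrip_writes_no_secret_fields could be as small as the helper below. This is a sketch only: first_secret_marker and FORBIDDEN_MARKERS are hypothetical names, and the marker list is the one quoted in the test plan. Note that a bare token marker also matches field names such as max_output_tokens, so the sketch assumes the serialized cache uses field names that avoid that substring (for example max_output).

```rust
/// Markers taken from the test plan's no-secret assertion.
const FORBIDDEN_MARKERS: [&str; 4] = ["api_key", "authorization", "token", "secret"];

/// Return the first forbidden marker found in the serialized cache, if any.
/// The match is case-insensitive and purely substring-based.
pub fn first_secret_marker(serialized: &str) -> Option<&'static str> {
    let lowered = serialized.to_lowercase();
    FORBIDDEN_MARKERS
        .iter()
        .copied()
        .find(|marker| lowered.contains(*marker))
}
```

A test would serialize a cache, assert that this helper returns None, and then round-trip the JSON back into the struct.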

Risks:

1. DRIFT-GUARD REGRESSION: bundled snapshot values must equal the current literals for the drift sample, or model_registry.rs:242 fails. Mitigation: seed the bundled snapshot FROM today's known_context_window_for_model numbers; if intentionally diverging, update the guard's expectations in the same PR.
2. ORDERING SUBTLETY: models.rs consults the override first, so a buggy or empty override returning Some(wrong) would silently change every consumer. Mitigation: only return Some when an entry truly carries that field (Option fields); the no-catalog identity test in the test plan locks today's behavior.
3. SECRET LEAK: the cache must never contain keys. Mitigation: CatalogEntry has no auth fields by construction; OpenRouter /models is unauthenticated (no header sent); an explicit no-secret round-trip test covers it.
4. STARTUP/OFFLINE: refresh must be opt-in and async, never blocking. Mitigation: the load path is pure disk plus bundled fallback; the network is touched only in an explicit refresh_* call or background task, with a timeout (web_run.rs:779 pattern).
5. SCOPE CREEP across OpenAI-compatible providers: the issue says to start with OpenRouter only; keep the fetch path single-provider for now (an internal fn, not a broad trait) to avoid over-design.
6. UI HINT ENRICHMENT in model_picker is the riskiest-for-least-value piece; treat it as optional polish, not blocking.