.planning/REQUIREMENTS.md: 5 additions & 5 deletions
@@ -25,7 +25,7 @@ This file tracks the active milestone's requirements at the top, with previous m
### Worker Lifecycle (WLIFE)
-- [ ] **WLIFE-01:** Workers spawn lazily on the first claude-code fallback request for their model — no workers spawn at proxy boot.
+- [x] **WLIFE-01:** Workers spawn lazily on the first claude-code fallback request for their model — no workers spawn at proxy boot.
 - [x] **WLIFE-02:** An idle worker is evicted (subprocess exits, RAM freed) after a configurable idle timeout (default 30 min); a subsequent request lazily respawns it.
 - [x] **WLIFE-03:** A worker that exits unexpectedly is marked dead, its in-flight request is surfaced as RETRYABLE (not a hard error), and it respawns lazily on the next request — never auto-restarted in a tight loop.
 - [x] **WLIFE-04:** Client disconnect / request abort propagates to the worker — the in-flight stream-JSON request is cancelled (protocol cancel if supported, else SIGTERM + respawn) so a dead client never pins a concurrency-1 worker.
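
The lazy-spawn (WLIFE-01) and idle-eviction (WLIFE-02) requirements above can be pictured with a small sketch. This is an illustration only, not the proxy's code: `LazyPool`, `spawnFn`, `acquire`, and `evictIdle` are hypothetical names, the worker is assumed to expose an optional `dispose()`, and the clock is injected so the logic can be exercised without real timers.

```javascript
// Hypothetical sketch of lazy spawn (WLIFE-01) and idle eviction (WLIFE-02).
// All names here (LazyPool, spawnFn, acquire, evictIdle) are illustrative.
class LazyPool {
  constructor({ spawnFn, idleMs = 30 * 60 * 1000, now = Date.now }) {
    this.spawnFn = spawnFn;   // creates a worker for a model key
    this.idleMs = idleMs;     // idle timeout (30 min default, per WLIFE-02)
    this.now = now;           // injectable clock
    this.workers = new Map(); // key -> { worker, lastUsed }
  }

  // Nothing is spawned at construction; the first request for a key spawns.
  acquire(key) {
    let entry = this.workers.get(key);
    if (!entry) {
      entry = { worker: this.spawnFn(key), lastUsed: this.now() };
      this.workers.set(key, entry);
    }
    entry.lastUsed = this.now();
    return entry.worker;
  }

  // Drop every worker idle for at least idleMs; returns the evicted keys.
  evictIdle() {
    const evicted = [];
    for (const [key, entry] of this.workers) {
      if (this.now() - entry.lastUsed >= this.idleMs) {
        if (typeof entry.worker.dispose === 'function') entry.worker.dispose();
        this.workers.delete(key);
        evicted.push(key);
      }
    }
    return evicted;
  }
}
```

In this sketch a request that arrives after eviction simply finds no entry and spawns again, which is the "lazily respawns" half of WLIFE-02.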
@@ -59,10 +59,10 @@ This file tracks the active milestone's requirements at the top, with previous m
 | POOL-03 | Phase 62 | Complete |
 | POOL-04 | Phase 62 | Complete |
 | GUARD-01 | Phase 62 | Complete |
-| WLIFE-01 | Phase 63 | Live-pending (63-05 `--live` SC-1 cold-start case authored + mock-green; ROADMAP discharge gated on the operator `LLM_PROXY_LIVE=1` run — 63-05-SUMMARY § Operator Live-Run) |
-- [x] **Phase 63: Worker Lifecycle — Lazy Spawn, Idle Eviction, Crash Recovery & Cancellation** — lazy spawn on first fallback, idle-evict after configurable timeout (default 30 min), crash → RETRYABLE + lazy respawn (no spin-loop), client-disconnect aborts the in-flight stream-JSON request. (all 5 plans landed 2026-06-21; mechanisms UNIT-proven in Plans 01-04 and the `--live` SC-1..SC-4 verification suite authored + mock-green in Plan 05. **SC-1..SC-4 ROADMAP live discharge for WLIFE-01..04 is PENDING the operator `LLM_PROXY_LIVE=1` run** — Plan 05 is `autonomous: false`; see 63-05-SUMMARY § Operator Live-Run — PENDING. WLIFE-01 stays live-pending.)
+- [x] **Phase 63: Worker Lifecycle — Lazy Spawn, Idle Eviction, Crash Recovery & Cancellation** — lazy spawn on first fallback, idle-evict after configurable timeout (default 30 min), crash → RETRYABLE + lazy respawn (no spin-loop), client-disconnect aborts the in-flight stream-JSON request. (all 5 plans landed 2026-06-21; mechanisms UNIT-proven in Plans 01-04 and confirmed by the `--live` SC-1..SC-4 verification suite in Plan 05. **WLIFE-01..04 ROADMAP-discharged 2026-06-21** by the operator `LLM_PROXY_LIVE=1` run — 9/9 tests PASS, exit 0, zero orphaned workers; SC-1..SC-4 all PASS. See 63-05-SUMMARY § Operator Live-Run — PASSED.)
 - [ ] **Phase 64: Worker Hygiene — CLI Version Pinning & stderr Throttling** — record `claude --version` at boot, recycle worker on version drift to keep prompt-cache assumptions valid; drain + throttle worker stderr to once-per-minute-per-worker so persistent-worker CLI warnings don't flood logs.
 - [ ] **Phase 65: Steady-State Latency & Crash-Survival Acceptance** — warm-worker sonnet `say OK` probe completes ≤3s steady-state (cold first-spawn may still be ~10s); pool survives a worker SIGKILL without dropping subsequent same-model requests; idle-eviction observable via `ps`; escape hatch reverts cleanly.
 - [ ] **Phase 66: Dashboard Latency Observability** — the dashboard's claude-code/sonnet median latency column shows the ~14s → ≤3s drop within 24h of rollout.
**Core value:** A self-learning coding environment that captures every session, builds knowledge, prevents mistakes, and makes observations browsable -- across all AI coding agents.
-Plan: 5 of 5 (Plans 01, 02, 03, 04 complete; 05 AUTO complete — live run PENDING operator)
-Status: Plan 63-05 AUTO portion done (suite authored, mock-green, committed b40bc23 in rapid-llm-proxy). `autonomous: false` — WLIFE-01..04 ROADMAP live discharge gated on the operator `LLM_PROXY_LIVE=1` run (see 63-05-SUMMARY § Operator Live-Run — PENDING).
+Plan: 5 of 5 (all complete; 63-05 live-discharged)
+Status: Phase 63 done. Plan 63-05 live suite run by the operator 2026-06-21 with `LLM_PROXY_LIVE=1` — 9/9 tests PASS, exit 0, ~35.2s, zero orphaned `claude -p` workers; SC-1..SC-4 all PASS. WLIFE-01..04 ROADMAP-discharged (WLIFE-01 live-proven; WLIFE-02/03/04 live-confirmed). Test file committed b40bc23 in rapid-llm-proxy (external repo, not pushed). Next phase: 64 (worker hygiene — CLI version pinning + stderr throttle, GUARD-02/03).
 Last activity: 2026-06-21

## Performance Metrics
@@ -99,7 +99,7 @@ Last activity: 2026-06-21
 - [63-02]: Stray-result generation guard (D-02 / WLIFE-04) — per-worker monotonic `_generation` integer bumped each `_dispatch`, stamped on `_pending.generation`, echoed on the outgoing `_gen` envelope field; `_onEvent` drops a result whose `_gen` mismatches the live `_pending.generation`. `_pending===null` stays the primary defense (operative on the live CLI path, which never echoes `_gen`); the generation echo closes the narrow in-flight window and is exercised deterministically by the unit suite without adding request-id correlation to the protocol (Pattern 1: concurrency-1 needs none). Belt-and-suspenders behind Plan 04's dispose-on-cancel.
- [63-03]: Crash-cooldown respawn-storm guard (D-06 / WLIFE-03) — per-key crash-frequency `_crashesByKey` Map<key, number[]> + `_recordCrash(key)` (push `Date.now()` + prune to window) wired into the `_spawnWorker` `once('exit')` handler so EVERY exit on a freshly-spawned worker is a crash candidate; a cooldown gate at the TOP of `complete()` (after GUARD-01, before lazy-spawn) routes a key with >= `LLM_PROXY_WORKER_CRASH_THRESHOLD` (default 3) crashes within `LLM_PROXY_WORKER_CRASH_WINDOW_MS` (default 60s) straight to the execFile overflow (`overflowFn(body, abortSignal)` — same shape as the all-busy path), then lazily spawns again after the window (`_isInCooldown` prune-to-window lifts the gate). EARLY-EXIT HEURISTIC: record every exit, no per-worker served-ok flag — the window+threshold (not exit-site classification) filters healthy spaced churn from a broken key's burst. Suite 39/39 green (33 baseline + 6 new). Reuses the SAME `completeClaudeCodeViaCLI` route the milestone already uses for overflow, not a new degraded path.
- [63-04]: Client-disconnect cancellation (D-01/D-03/WLIFE-04) — the `complete()` abort handler now SIGTERM+disposes+synchronously-drops the IN-FLIGHT worker via `_disposeAndDrop` (D-08 reuse) and rejects it RETRYABLE (next same-key request cold-respawns), and dequeues+rejects ONLY a QUEUED job via new `ClaudeWorker.abortQueuedJob(job)` (worker + in-flight `_pending` untouched, FIFO preserved). Discrimination mechanism: `ClaudeWorker.writeTracked(content,timeoutMs) -> { promise, job }` exposes the job handle; the handler tests `abortQueuedJob(job)` (true iff `_queue.includes(job)`) then falls through to `_disposeAndDrop`. The dead Phase-62 `worker.cancel()` protocol interrupt (live-HANGS, 62-HUMAN-UAT test 6) is NEVER invoked on the disconnect path (only in comments). `write()` delegates to `writeTracked`; the pool falls back to `write()` for workers lacking it (existing fakes unaffected). server.mjs VERIFY-ONLY: `reqAbort.signal` already threads to `complete` (:1237-1238, :1639) — no change. Suite 46/46 green (39 baseline + 7 new). Commits 959f6d3 (RED) / a33629b (GREEN) on external main, not pushed.
-- [63-05]: `--live` lifecycle verification suite (SC-1..SC-4) authored in `tests/integration/worker-pool-live.test.mjs` — a `ps`/`pgrep` `countClaudeWorkers()` helper + an `afterEach` zero-orphan teardown (T-63-11), the SC-1 cold-start probe (closes the Phase-62 PARTIAL, D-09), SC-2 idle-evict (tiny `idleMs`), SC-3 crash (SIGKILL → WORKER_RETRYABLE, no respawn-storm), and SC-4 cancel (real `controller.abort()` → SIGTERM+dispose, new pid — the live inverse of the Phase-62 cancel HANG, replacing the old SAME-pid case). Reuses the existing `--live` gate; mock gate exits 0 with the live block skipped. Committed `b40bc23` on rapid-llm-proxy `main` (not pushed). `autonomous: false` — the `LLM_PROXY_LIVE=1` run is an OPERATOR checkpoint; WLIFE-01..04 ROADMAP discharge is gated on that live run (see 63-05-SUMMARY § Operator Live-Run — PENDING). WLIFE-01 stays live-pending; WLIFE-02/03/04 remain UNIT-proven (Plans 01..04) with live confirmation pending.
+- [63-05]: `--live` lifecycle verification suite (SC-1..SC-4) authored in `tests/integration/worker-pool-live.test.mjs` — a `ps`/`pgrep` `countClaudeWorkers()` helper + an `afterEach` zero-orphan teardown (T-63-11), the SC-1 cold-start probe (closes the Phase-62 PARTIAL, D-09), SC-2 idle-evict (tiny `idleMs`), SC-3 crash (SIGKILL → WORKER_RETRYABLE, no respawn-storm), and SC-4 cancel (real `controller.abort()` → SIGTERM+dispose, new pid — the live inverse of the Phase-62 cancel HANG, replacing the old SAME-pid case). Reuses the existing `--live` gate; mock gate exits 0 with the live block skipped. Committed `b40bc23` on rapid-llm-proxy `main` (not pushed). `autonomous: false` — the `LLM_PROXY_LIVE=1` run was an OPERATOR checkpoint; it was performed 2026-06-21 (9/9 tests PASS, exit 0, ~35.2s, zero orphaned `claude -p` workers — SC-1 5.0s, SC-2 7.8s, SC-3 5.6s, SC-4 7.2s). **WLIFE-01..04 ROADMAP-discharged 2026-06-21** — WLIFE-01 live-proven (cold-start), WLIFE-02/03/04 live-confirmed (UNIT-proven in Plans 01..04). Phase 63 complete (5/5 plans).
 - [v6.0 start]: Agent-agnostic architecture -- retrieval service is standalone HTTP API, each coding agent has its own adapter
 - [v6.0 start]: Use existing Qdrant instance for vector storage (not LibSQL vector)
 - [v6.0 start]: All four knowledge tiers as sources (observations, digests, insights, KG entities)
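
The stray-result generation guard described in the [63-02] note can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `GuardedWorker`, `dispatch`, `abandon`, and `onEvent` are hypothetical names standing in for the `_dispatch` / `_onEvent` internals the note mentions, and accepting a result that carries no `_gen` echo is an assumption drawn from the note's remark that the live CLI never echoes `_gen`.

```javascript
// Hypothetical sketch of a stray-result generation guard (cf. the [63-02] note).
// GuardedWorker, dispatch, abandon and onEvent are illustrative names.
class GuardedWorker {
  constructor() {
    this.generation = 0; // monotonic, bumped on every dispatch
    this.pending = null; // the single in-flight request (concurrency 1)
  }

  // Start a request; returns the envelope that would be sent to the worker.
  dispatch(content, onResult) {
    this.generation += 1;
    this.pending = { generation: this.generation, onResult };
    return { content, _gen: this.generation };
  }

  // Abandon the in-flight request (for example after a client disconnect).
  abandon() {
    this.pending = null;
  }

  // Handle a result event; returns true if delivered, false if dropped.
  onEvent(event) {
    if (this.pending === null) return false; // primary defense
    // A result that echoes a different generation is stale. A result with no
    // echo is accepted here, which is an assumption of this sketch.
    if (event._gen !== undefined && event._gen !== this.pending.generation) {
      return false;
    }
    const { onResult } = this.pending;
    this.pending = null;
    onResult(event.result);
    return true;
  }
}
```

The generation check only matters in the narrow window where a late result for an abandoned request arrives after the next request has already been dispatched; otherwise the `pending === null` check drops it first.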
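
The crash-cooldown gate described in the [63-03] note is a sliding-window counter per key. The sketch below is a hedged illustration, not the pool's code: `CrashCooldown` and its method bodies are assumptions, the clock is injected for testability, and only the defaults (3 crashes within a 60s window) are taken from the note.

```javascript
// Hypothetical sketch of a per-key crash-cooldown gate (cf. the [63-03] note).
// The class name and method bodies are assumptions; only the threshold and
// window defaults (3 crashes within 60s) come from the note.
class CrashCooldown {
  constructor({ threshold = 3, windowMs = 60000, now = Date.now } = {}) {
    this.threshold = threshold;
    this.windowMs = windowMs;
    this.now = now;
    this.crashesByKey = new Map(); // key -> array of crash timestamps
  }

  // Keep only the timestamps that still fall inside the window.
  prune(key) {
    const cutoff = this.now() - this.windowMs;
    const recent = (this.crashesByKey.get(key) || []).filter((ts) => ts > cutoff);
    if (recent.length > 0) this.crashesByKey.set(key, recent);
    else this.crashesByKey.delete(key);
    return recent;
  }

  // Record one worker exit for this key.
  recordCrash(key) {
    const recent = this.prune(key);
    recent.push(this.now());
    this.crashesByKey.set(key, recent);
  }

  // True while the key has reached the threshold inside the window.
  isInCooldown(key) {
    return this.prune(key).length >= this.threshold;
  }
}
```

A caller would check `isInCooldown(key)` before spawning and route the request to the overflow path while it returns true; because pruning happens on every check, the gate lifts by itself once the oldest crashes age out of the window.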