
Commit 0ebfae8

fwornle and claude committed
docs(63-05): record live UAT PASS (SC-1..SC-4) + discharge WLIFE-01..04
- Operator ran the --live lifecycle suite 2026-06-21 (LLM_PROXY_LIVE=1): 9 tests, 9 pass, 0 fail, 0 skipped, exit 0, ~35.2s, zero orphaned claude -p workers.
- SC-1 cold-start 5.0s, SC-2 idle-evict 7.8s, SC-3 crash 5.6s, SC-4 cancel 7.2s — all PASS.
- 63-05-SUMMARY: filled Operator Live-Run results table, status PENDING -> PASSED/COMPLETE, requirements-completed frontmatter = WLIFE-01..04.
- REQUIREMENTS: WLIFE-01 marked Complete (live-proven); WLIFE-02/03/04 live-confirmed.
- ROADMAP: Phase-63 line no longer live-pending/operator-gated; plan 63-05 complete (5/5).
- STATE: Phase 63 complete (5/5, 8/8 milestone plans, 40%); next phase 64.
- External rapid-llm-proxy test file (b40bc23) untouched; nothing pushed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent c59a698 commit 0ebfae8

4 files changed

Lines changed: 40 additions & 36 deletions

.planning/REQUIREMENTS.md

Lines changed: 5 additions & 5 deletions
@@ -25,7 +25,7 @@ This file tracks the active milestone's requirements at the top, with previous m

 ### Worker Lifecycle (WLIFE)

-- [ ] **WLIFE-01:** Workers spawn lazily on the first claude-code fallback request for their model — no workers spawn at proxy boot.
+- [x] **WLIFE-01:** Workers spawn lazily on the first claude-code fallback request for their model — no workers spawn at proxy boot.
 - [x] **WLIFE-02:** An idle worker is evicted (subprocess exits, RAM freed) after a configurable idle timeout (default 30 min); a subsequent request lazily respawns it.
 - [x] **WLIFE-03:** A worker that exits unexpectedly is marked dead, its in-flight request is surfaced as RETRYABLE (not a hard error), and it respawns lazily on the next request — never auto-restarted in a tight loop.
 - [x] **WLIFE-04:** Client disconnect / request abort propagates to the worker — the in-flight stream-JSON request is cancelled (protocol cancel if supported, else SIGTERM + respawn) so a dead client never pins a concurrency-1 worker.
@@ -59,10 +59,10 @@ This file tracks the active milestone's requirements at the top, with previous m
 | POOL-03 | Phase 62 | Complete |
 | POOL-04 | Phase 62 | Complete |
 | GUARD-01 | Phase 62 | Complete |
-| WLIFE-01 | Phase 63 | Live-pending (63-05 `--live` SC-1 cold-start case authored + mock-green; ROADMAP discharge gated on the operator `LLM_PROXY_LIVE=1` run — 63-05-SUMMARY § Operator Live-Run) |
-| WLIFE-02 | Phase 63 | Complete |
-| WLIFE-03 | Phase 63 | Complete (63-02 EPIPE-as-crash fold-in; 63-03 crash-cooldown respawn-storm guard) |
-| WLIFE-04 | Phase 63 | Complete (63-02 stray-result generation guard + 63-04 D-01/D-03 SIGTERM+dispose+drop in-flight / dequeue queued; commits 959f6d3/a33629b) |
+| WLIFE-01 | Phase 63 | Complete (live-proven 2026-06-21 — 63-05 `--live` SC-1 cold-start PASS in the operator `LLM_PROXY_LIVE=1` run; 9/9 exit 0, zero orphans) |
+| WLIFE-02 | Phase 63 | Complete (live-confirmed 2026-06-21 — 63-05 SC-2 idle-evict PASS) |
+| WLIFE-03 | Phase 63 | Complete (63-02 EPIPE-as-crash fold-in; 63-03 crash-cooldown respawn-storm guard; live-confirmed 2026-06-21 — 63-05 SC-3 crash PASS) |
+| WLIFE-04 | Phase 63 | Complete (63-02 stray-result generation guard + 63-04 D-01/D-03 SIGTERM+dispose+drop in-flight / dequeue queued; commits 959f6d3/a33629b; live-confirmed 2026-06-21 — 63-05 SC-4 cancel PASS) |
 | GUARD-02 | Phase 64 | Not started |
 | GUARD-03 | Phase 64 | Not started |
 | PERF-01 | Phase 65 | Not started |
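
WLIFE-01 and WLIFE-02 above describe a lazy-spawn and idle-eviction policy. The sketch below is one minimal way to express that policy in JavaScript, not the code in rapid-llm-proxy. The class name `LazyWorkerPool`, the injected `spawnFn`, the explicit `now` arguments and the `evictIdle` sweep are all hypothetical, chosen so the behaviour can be exercised without real subprocesses or timers.

```javascript
// Hypothetical sketch of lazy spawn (WLIFE-01) + idle eviction (WLIFE-02).
// None of these names are taken from the proxy's real worker pool.
class LazyWorkerPool {
  constructor(spawnFn, idleMs = 30 * 60 * 1000) {
    this.spawnFn = spawnFn;   // assumed factory: key -> worker object
    this.idleMs = idleMs;     // idle timeout; 30 min matches the stated default
    this.workers = new Map(); // key -> { worker, lastUsed }
  }

  // Nothing is spawned at construction; a worker appears only on first use of its key.
  acquire(key, now = Date.now()) {
    let entry = this.workers.get(key);
    if (!entry) {
      entry = { worker: this.spawnFn(key), lastUsed: now };
      this.workers.set(key, entry);
    }
    entry.lastUsed = now;
    return entry.worker;
  }

  // Kill and drop every worker idle for at least idleMs; returns the evicted keys.
  // A later acquire() for an evicted key spawns a fresh worker.
  evictIdle(now = Date.now()) {
    const evicted = [];
    for (const [key, entry] of this.workers) {
      if (now - entry.lastUsed >= this.idleMs) {
        entry.worker.kill?.(); // assumed worker method; skipped if absent
        this.workers.delete(key);
        evicted.push(key);
      }
    }
    return evicted;
  }
}
```

Passing `now` explicitly keeps the sketch deterministic under test; a real pool would more likely arm a per-worker timer and reset it on each request.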

.planning/ROADMAP.md

Lines changed: 1 addition & 1 deletion
@@ -1195,7 +1195,7 @@ Plans:
 ### Phases

 - [x] **Phase 62: Worker Pool Core & stream-JSON Transport** — persistent `claude -p --input-format stream-json --output-format stream-json` workers, per-model pinned, concurrency-1, serving ONLY the CLI-fallback path; `LLM_PROXY_DISABLE_WORKER_POOL=1` escape hatch wired first. (completed 2026-06-20)
-- [x] **Phase 63: Worker Lifecycle — Lazy Spawn, Idle Eviction, Crash Recovery & Cancellation** — lazy spawn on first fallback, idle-evict after configurable timeout (default 30 min), crash → RETRYABLE + lazy respawn (no spin-loop), client-disconnect aborts the in-flight stream-JSON request. (all 5 plans landed 2026-06-21; mechanisms UNIT-proven in Plans 01-04 and the `--live` SC-1..SC-4 verification suite authored + mock-green in Plan 05. **SC-1..SC-4 ROADMAP live discharge for WLIFE-01..04 is PENDING the operator `LLM_PROXY_LIVE=1` run**Plan 05 is `autonomous: false`; see 63-05-SUMMARY § Operator Live-Run — PENDING. WLIFE-01 stays live-pending.)
+- [x] **Phase 63: Worker Lifecycle — Lazy Spawn, Idle Eviction, Crash Recovery & Cancellation** — lazy spawn on first fallback, idle-evict after configurable timeout (default 30 min), crash → RETRYABLE + lazy respawn (no spin-loop), client-disconnect aborts the in-flight stream-JSON request. (all 5 plans landed 2026-06-21; mechanisms UNIT-proven in Plans 01-04 and confirmed by the `--live` SC-1..SC-4 verification suite in Plan 05. **WLIFE-01..04 ROADMAP-discharged 2026-06-21** by the operator `LLM_PROXY_LIVE=1` run — 9/9 tests PASS, exit 0, zero orphaned workers; SC-1..SC-4 all PASS. See 63-05-SUMMARY § Operator Live-Run — PASSED.)
 - [ ] **Phase 64: Worker Hygiene — CLI Version Pinning & stderr Throttling** — record `claude --version` at boot, recycle worker on version drift to keep prompt-cache assumptions valid; drain + throttle worker stderr to once-per-minute-per-worker so persistent-worker CLI warnings don't flood logs.
 - [ ] **Phase 65: Steady-State Latency & Crash-Survival Acceptance** — warm-worker sonnet `say OK` probe completes ≤3s steady-state (cold first-spawn may still be ~10s); pool survives a worker SIGKILL without dropping subsequent same-model requests; idle-eviction observable via `ps`; escape hatch reverts cleanly.
 - [ ] **Phase 66: Dashboard Latency Observability** — the dashboard's claude-code/sonnet median latency column shows the ~14s → ≤3s drop within 24h of rollout.

.planning/STATE.md

Lines changed: 10 additions & 10 deletions
@@ -3,15 +3,15 @@ gsd_state_version: 1.0
 milestone: v7.3
 milestone_name: LLM Proxy Performance — Claude CLI Worker Pool
 status: executing
-stopped_at: Phase 63 context gathered
-last_updated: "2026-06-21T07:48:22.905Z"
+stopped_at: Phase 63 complete (5/5 plans; WLIFE-01..04 live-discharged)
+last_updated: "2026-06-21T08:30:00.000Z"
 last_activity: 2026-06-21
 progress:
   total_phases: 5
-  completed_phases: 1
+  completed_phases: 2
   total_plans: 8
-  completed_plans: 7
-  percent: 20
+  completed_plans: 8
+  percent: 40
 ---

 # Project State
@@ -21,7 +21,7 @@ progress:
 See: .planning/PROJECT.md (updated 2026-04-24)

 **Core value:** A self-learning coding environment that captures every session, builds knowledge, prevents mistakes, and makes observations browsable -- across all AI coding agents.
-**Current focus:** Phase 63 — worker-lifecycle-lazy-spawn-idle-eviction-crash-recovery-can
+**Current focus:** Phase 63 COMPLETE (5/5 plans, WLIFE-01..04 live-discharged 2026-06-21) → next: Phase 64 (worker hygiene)

 **v7.1 milestone status (KM-Core unification — 10 of 10 phases done; one Phase 46 ONBOARDING.md operator UAT remains):**

@@ -53,9 +53,9 @@ Phase 50 ships the LSL primitives (`lib/lsl/window.mjs` + `lib/lsl/scan-and-conv

 ## Current Position

-Phase: 63 (worker-lifecycle-lazy-spawn-idle-eviction-crash-recovery-can) — EXECUTING
-Plan: 5 of 5 (Plans 01, 02, 03, 04 complete; 05 AUTO complete — live run PENDING operator)
-Status: Plan 63-05 AUTO portion done (suite authored, mock-green, committed b40bc23 in rapid-llm-proxy). `autonomous: false`WLIFE-01..04 ROADMAP live discharge gated on the operator `LLM_PROXY_LIVE=1` run (see 63-05-SUMMARY § Operator Live-Run — PENDING).
+Phase: 63 (worker-lifecycle-lazy-spawn-idle-eviction-crash-recovery-can) — COMPLETE (5/5 plans)
+Plan: 5 of 5 (all complete; 63-05 live-discharged)
+Status: Phase 63 done. Plan 63-05 live suite run by the operator 2026-06-21 with `LLM_PROXY_LIVE=1` — 9/9 tests PASS, exit 0, ~35.2s, zero orphaned `claude -p` workers; SC-1..SC-4 all PASS. WLIFE-01..04 ROADMAP-discharged (WLIFE-01 live-proven; WLIFE-02/03/04 live-confirmed). Test file committed b40bc23 in rapid-llm-proxy (external repo, not pushed). Next phase: 64 (worker hygiene — CLI version pinning + stderr throttle, GUARD-02/03).
 Last activity: 2026-06-21

 ## Performance Metrics
@@ -99,7 +99,7 @@ Last activity: 2026-06-21
 - [63-02]: Stray-result generation guard (D-02 / WLIFE-04) — per-worker monotonic `_generation` integer bumped each `_dispatch`, stamped on `_pending.generation`, echoed on the outgoing `_gen` envelope field; `_onEvent` drops a result whose `_gen` mismatches the live `_pending.generation`. `_pending===null` stays the primary defense (operative on the live CLI path, which never echoes `_gen`); the generation echo closes the narrow in-flight window and is exercised deterministically by the unit suite without adding request-id correlation to the protocol (Pattern 1: concurrency-1 needs none). Belt-and-suspenders behind Plan 04's dispose-on-cancel.
 - [63-03]: Crash-cooldown respawn-storm guard (D-06 / WLIFE-03) — per-key crash-frequency `_crashesByKey` Map<key, number[]> + `_recordCrash(key)` (push `Date.now()` + prune to window) wired into the `_spawnWorker` `once('exit')` handler so EVERY exit on a freshly-spawned worker is a crash candidate; a cooldown gate at the TOP of `complete()` (after GUARD-01, before lazy-spawn) routes a key with >= `LLM_PROXY_WORKER_CRASH_THRESHOLD` (default 3) crashes within `LLM_PROXY_WORKER_CRASH_WINDOW_MS` (default 60s) straight to the execFile overflow (`overflowFn(body, abortSignal)` — same shape as the all-busy path), then lazily spawns again after the window (`_isInCooldown` prune-to-window lifts the gate). EARLY-EXIT HEURISTIC: record every exit, no per-worker served-ok flag — the window+threshold (not exit-site classification) filters healthy spaced churn from a broken key's burst. Suite 39/39 green (33 baseline + 6 new). Reuses the SAME `completeClaudeCodeViaCLI` route the milestone already uses for overflow, not a new degraded path.
 - [63-04]: Client-disconnect cancellation (D-01/D-03/WLIFE-04) — the `complete()` abort handler now SIGTERM+disposes+synchronously-drops the IN-FLIGHT worker via `_disposeAndDrop` (D-08 reuse) and rejects it RETRYABLE (next same-key request cold-respawns), and dequeues+rejects ONLY a QUEUED job via new `ClaudeWorker.abortQueuedJob(job)` (worker + in-flight `_pending` untouched, FIFO preserved). Discrimination mechanism: `ClaudeWorker.writeTracked(content,timeoutMs) -> { promise, job }` exposes the job handle; the handler tests `abortQueuedJob(job)` (true iff `_queue.includes(job)`) then falls through to `_disposeAndDrop`. The dead Phase-62 `worker.cancel()` protocol interrupt (live-HANGS, 62-HUMAN-UAT test 6) is NEVER invoked on the disconnect path (only in comments). `write()` delegates to `writeTracked`; the pool falls back to `write()` for workers lacking it (existing fakes unaffected). server.mjs VERIFY-ONLY: `reqAbort.signal` already threads to `complete` (:1237-1238, :1639) — no change. Suite 46/46 green (39 baseline + 7 new). Commits 959f6d3 (RED) / a33629b (GREEN) on external main, not pushed.
-- [63-05]: `--live` lifecycle verification suite (SC-1..SC-4) authored in `tests/integration/worker-pool-live.test.mjs` — a `ps`/`pgrep` `countClaudeWorkers()` helper + an `afterEach` zero-orphan teardown (T-63-11), the SC-1 cold-start probe (closes the Phase-62 PARTIAL, D-09), SC-2 idle-evict (tiny `idleMs`), SC-3 crash (SIGKILL → WORKER_RETRYABLE, no respawn-storm), and SC-4 cancel (real `controller.abort()` → SIGTERM+dispose, new pid — the live inverse of the Phase-62 cancel HANG, replacing the old SAME-pid case). Reuses the existing `--live` gate; mock gate exits 0 with the live block skipped. Committed `b40bc23` on rapid-llm-proxy `main` (not pushed). `autonomous: false` — the `LLM_PROXY_LIVE=1` run is an OPERATOR checkpoint; WLIFE-01..04 ROADMAP discharge is gated on that live run (see 63-05-SUMMARY § Operator Live-Run — PENDING). WLIFE-01 stays live-pending; WLIFE-02/03/04 remain UNIT-proven (Plans 01..04) with live confirmation pending.
+- [63-05]: `--live` lifecycle verification suite (SC-1..SC-4) authored in `tests/integration/worker-pool-live.test.mjs` — a `ps`/`pgrep` `countClaudeWorkers()` helper + an `afterEach` zero-orphan teardown (T-63-11), the SC-1 cold-start probe (closes the Phase-62 PARTIAL, D-09), SC-2 idle-evict (tiny `idleMs`), SC-3 crash (SIGKILL → WORKER_RETRYABLE, no respawn-storm), and SC-4 cancel (real `controller.abort()` → SIGTERM+dispose, new pid — the live inverse of the Phase-62 cancel HANG, replacing the old SAME-pid case). Reuses the existing `--live` gate; mock gate exits 0 with the live block skipped. Committed `b40bc23` on rapid-llm-proxy `main` (not pushed). `autonomous: false` — the `LLM_PROXY_LIVE=1` run was an OPERATOR checkpoint; it was performed 2026-06-21 (9/9 tests PASS, exit 0, ~35.2s, zero orphaned `claude -p` workers — SC-1 5.0s, SC-2 7.8s, SC-3 5.6s, SC-4 7.2s). **WLIFE-01..04 ROADMAP-discharged 2026-06-21** — WLIFE-01 live-proven (cold-start), WLIFE-02/03/04 live-confirmed (UNIT-proven in Plans 01..04). Phase 63 complete (5/5 plans).
 - [v6.0 start]: Agent-agnostic architecture -- retrieval service is standalone HTTP API, each coding agent has its own adapter
 - [v6.0 start]: Use existing Qdrant instance for vector storage (not LibSQL vector)
 - [v6.0 start]: All four knowledge tiers as sources (observations, digests, insights, KG entities)
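
The [63-03] entry above describes a sliding-window crash counter: record every exit, prune to the window, and gate new spawns once the count reaches the threshold. The following is a minimal sketch of that idea only. The defaults (3 crashes, 60 s) come from the entry; the standalone `CrashCooldown` class, its constructor options and the explicit `now` arguments are assumptions made for illustration, and the real pool keeps this state on itself (`_crashesByKey`, `_recordCrash`, `_isInCooldown`) with details that may differ.

```javascript
// Hypothetical standalone version of the crash-cooldown gate described in [63-03].
class CrashCooldown {
  constructor({ threshold = 3, windowMs = 60_000 } = {}) {
    this.threshold = threshold;    // crashes within the window that trip the gate
    this.windowMs = windowMs;      // sliding window length in milliseconds
    this.crashesByKey = new Map(); // key -> array of crash timestamps (ms)
  }

  // Keep only timestamps still inside the window; drop the key when none remain.
  _prune(key, now) {
    const recent = (this.crashesByKey.get(key) ?? []).filter(
      (t) => now - t < this.windowMs
    );
    if (recent.length > 0) this.crashesByKey.set(key, recent);
    else this.crashesByKey.delete(key);
    return recent;
  }

  // Record every exit as a crash candidate; no per-worker "served ok" flag.
  recordCrash(key, now = Date.now()) {
    const recent = this._prune(key, now);
    recent.push(now);
    this.crashesByKey.set(key, recent);
  }

  // True while the key has at least `threshold` crashes inside the window.
  // Pruning on read is what lifts the gate once the window has passed.
  isInCooldown(key, now = Date.now()) {
    return this._prune(key, now).length >= this.threshold;
  }
}
```

A caller would check `isInCooldown(key)` before spawning and route the request to its overflow path while it returns true. Because the gate depends only on timestamps, healthy workers that exit at long intervals never accumulate enough entries to trip it.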
