Skip to content

Commit edae3a7

Browse files
claude-code-bestunraidclaude
authored
feat: harden autonomy lifecycle, OOM bounds, and provider-boundary finalization (#386)
* feat: harden autonomy lifecycle, OOM bounds, and provider-boundary finalization This PR consolidates a coordinated batch of fixes around autonomy run/flow lifecycle, scheduled task deduplication, provider-boundary state finalization, and matching memory-bound treatments for adjacent long-running subsystems (REPL fullscreen scrollback, skill-search/skill-learning runtime activation). All changes were developed and reviewed together because they touched the same lifecycle invariants and were uncovered by the same long-running session reproductions. ## Lifecycle correctness - Queued autonomy prompts are not injected unless the persisted run was successfully claimed; queued run claiming is now terminal-safe so a once-consumed/cancelled/failed run can not slip back into `queued`. - Autonomy run/flow finalization happens on completion, provider error, generator close, and cancellation — not just the happy path. New `src/__tests__/queryAutonomyProviderBoundary.test.ts` covers these provider-boundary transitions. - `requestManagedAutonomyFlowCancel` and `resumeManagedAutonomyFlowPrompt` carry `rootDir` and `currentDir` explicitly across detached async boundaries (proactive-tick, cron, daemon restart) instead of inferring from process state. - Active runs/flows are protected from janitor pruning so a running step can not be garbage-collected mid-flight (`src/utils/autonomyAuthority.ts`). - Heartbeat parser now ignores fenced code blocks; the two-phase commit window for autonomy state transitions is documented in `docs/internals/autonomy-jira.md`. ## Ownership and dedup - `src/utils/autonomyRuns.ts`: ownership stamping (run id + rootDir carried end-to-end), source-based dedup against active runs. - `src/hooks/useScheduledTasks.ts`: scheduled ticks deduplicate against runs already active on the same source label. - `src/utils/processUserInput/processSlashCommand.tsx`: forked slash commands now thread the autonomy `runId` so completion finalizers can find the originating run for deferred completion. - New `src/utils/autonomyQueueLifecycle.ts` and tests collect the queue-side lifecycle invariants in one place. ## Memory bounds (related, same review pass) - `src/screens/REPL.tsx`: caps fullscreen scrollback after the compact boundary and updates trailing progress rows in place. Long-running fullscreen sessions could otherwise retain thousands of post-compaction messages and duplicate progress rows, keeping Ink trees alive long after their useful context had moved on. - `src/services/skillSearch/*` and `src/services/skillLearning/*`: runtime activation is strictly opt-in via existing env toggles; session caches are capped so long-running processes can not grow them forever. Build presence is preserved so operators can still discover and opt into the slash commands. ## CI / test contract - `tests/integration/dependency-overrides.test.ts`: smoke test no longer drives Mermaid's browser renderer; it validates the package-resolution contract directly so CI does not regress on unrelated browser timing. - New `tests/integration/autonomy-lifecycle-user-flow.test.ts`: end-to-end CLI subprocess flow exercising `status --deep`, `flows`, `flow <id>`, `flow resume`, `flow cancel` against persisted state. - `src/entrypoints/cli.tsx`: `claude autonomy …` routes through an entrypoint fast path that reuses the slash-command formatter without booting the full interactive CLI. Stdout is flushed before forced exit so coverage subprocesses do not terminate with empty stdout. - `packages/builtin-tools/src/tools/RemoteTriggerTool/__tests__/RemoteTriggerTool.test.ts`: stabilized to prevent audit flake under coverage. ## Tests added - `src/__tests__/queryAutonomyProviderBoundary.test.ts` - `src/hooks/__tests__/useScheduledTasks.test.ts` - `src/utils/__tests__/autonomyAuthority.test.ts` - `src/utils/__tests__/autonomyFlows.test.ts` (extended) - `src/utils/__tests__/autonomyPersistence.test.ts` (extended) - `src/utils/__tests__/autonomyQueueLifecycle.test.ts` - `src/utils/__tests__/autonomyRuns.test.ts` (extended) - `src/utils/processUserInput/__tests__/processSlashCommand.test.ts` - `tests/integration/autonomy-lifecycle-user-flow.test.ts` ## Docs - `docs/agent/sur-loop-scheduled-oom.md`: System Understanding Report covering the scheduled/loop OOM problem, the call graphs investigated, and the lifecycle invariants this PR establishes. - `docs/agent/sur-skill-overflow-bugs.md`: SUR for the related skill-overflow context. - `docs/internals/autonomy-jira.md`: documents the two-phase commit window and ownership stamping invariants. - `docs/memory-leak-audit.md`: audit notes covering the REPL/scrollback and skill-search bounds. ## Invariants this PR establishes 1. Queued autonomy prompts are not injected unless the persisted run was successfully claimed. 2. Terminal run/flow states are terminal — completion, failure, and cancellation all finalize state regardless of which provider/error path triggered them. 3. Autonomy run/flow `rootDir` is carried explicitly across detached async boundaries instead of inferred from a shared singleton. 4. State-only CLI subcommands (`autonomy status|runs|flows|flow …`) bypass full interactive bootstrap so they do not hold unrelated handles open. 5. REPL fullscreen scrollback and skill-search/skill-learning session caches are explicitly bounded. ## Validation ```bash bun run typecheck CI=true GITHUB_ACTIONS=true bun test # 3996 pass / 0 fail across 305 files bun test src/__tests__/queryAutonomyProviderBoundary.test.ts \ src/hooks/__tests__/useScheduledTasks.test.ts \ src/utils/__tests__/autonomy{Runs,Flows,Authority,QueueLifecycle,Persistence}.test.ts \ src/utils/processUserInput/__tests__/processSlashCommand.test.ts \ tests/integration/autonomy-lifecycle-user-flow.test.ts ``` ## Origin This PR is the consolidated, upstream-targeted version of two fork-side review PRs (fix/loop-scheduled-autonomy-oom and fix/autonomy-lifecycle). The fork-side review history is preserved at amDosion#7 . The fork's own internal `chore: keep fork current with upstream` sync commits and the `docs: update contributors` automation are intentionally not included in this PR. The autonomy CLI handler `rootDir` threading that the fork added (78f64d8, 98d04dd) is intentionally omitted here because upstream `a2cfaf91` (fix: 修复 RemoteTriggerTool 和 autonomy 测试的全量运行失败) already performed the equivalent change with an additional `currentDir` option. Keeping the upstream version avoids regressing that improvement. * fixup: address CodeRabbit review on PR #386 Twelve actionable items (7 Major + 5 Minor) from the CodeRabbit review on #386: - docs/internals/autonomy-jira.md: typo "due input close" → "due to input close". - src/utils/autonomyRuns.ts: - selectPersistedAutonomyRuns no longer evicts active (queued/running) runs when the combined list exceeds AUTONOMY_RUNS_MAX. Active runs are kept in full and the inactive history is capped to the remaining budget so persisted ownership for live work survives. - isValidOwnerProcessId now allows pid <= 4_194_304 so a live run owned by the maximum Linux PID is not treated as stale. - src/utils/autonomyAuthority.ts: maskCodeFencedLines tracks the active fence length and only closes the fence when a same-character run of equal-or- greater length appears with no trailing content, so a nested ```yaml inside an outer ```` block no longer leaks fake `tasks:` entries into the parser. - src/cli/print.ts: late-shutdown branches in the cron and scheduled-task paths now call cancelQueuedAutonomyCommands({ commands: [command] }) instead of markAutonomyRunCancelled(...). Updating run state alone left the queue-side record orphaned for resume/recovery. - src/utils/processUserInput/processSlashCommand.tsx: scheduled-task-result notification is enqueued before finalizeAutonomyRunCompleted (which queues follow-up autonomy commands) so both at priority: 'later' land in order and the next autonomy step can not run before the worker's output is observed. - src/screens/REPL.tsx + src/utils/handlePromptSubmit.ts: - onQuery now returns Promise<boolean>: false from the concurrent-guard skip path, true otherwise. Other call sites use `void onQuery(...)` and are unaffected. handlePromptSubmit's onQuery prop type matches. - The autonomy-prompt callsite captures the executed flag, finalizes claim.claimedCommands as { type: 'completed' } only when onQuery actually ran, and runs the completed-finalize in its own try/catch so a failure there does not propagate into the outer catch and trigger a second finalize as { type: 'failed' } for the same commands. - Removed the unsafe `command.value as string` cast; createUserMessage already accepts `string | ContentBlockParam[]`. - createUserMessage mock in src/__tests__/handlePromptSubmit.test.ts now matches the new Promise<boolean> shape. - packages/builtin-tools/src/tools/RemoteTriggerTool/__tests__/ RemoteTriggerTool.test.ts: - Inline auth mock replaced with the shared tests/mocks/auth (added). - The full mock of src/constants/oauth.js is replaced by a narrow side-effect-only mock that overrides the env-reading helpers (getOauthConfig, fileSuffixForOauthConfig, MCP_CLIENT_METADATA_URL) and delegates pure data exports to the real module. - tests/integration/dependency-overrides.test.ts: - mermaid does not export `./package.json` in its exports map, so require.resolve('mermaid/package.json') throws ERR_PACKAGE_PATH_NOT_EXPORTED in runtimes that honor exports semantics. The test now resolves the package entry and walks up to the package root via a small findPackageJson helper. - readFileSync from node:fs is replaced with `await Bun.file(...).text()` to match the project's Bun-API requirement. Validation: - bun run typecheck (clean). - bun test → 3996 pass / 0 fail across 305 test files. Targets PRs: - amDosion#8 (fork-internal review) - #386 (upstream review, same head branch) * fixup: address CodeRabbit second-round review on PR #386 Four inline + one outside-diff actionable comment from the second CodeRabbit review on #386: - tests/mocks/auth.ts: align mock return contracts with src/utils/auth.ts. checkAndRefreshOAuthTokenIfNeeded resolves to a Promise<boolean> and getClaudeAIOAuthTokens returns the full token shape (refreshToken, expiresAt, scopes, subscriptionType, rateLimitTier) so tests that branch on these values can not silently drift away from production. - src/utils/handlePromptSubmit.ts (461-468): clear the freshly-published abortController before the early return when every claimed autonomy command was skipped as non-consumable, so this turn's stale controller does not leak into the next turn. - src/utils/handlePromptSubmit.ts (621-649): separate execution failure from finalizer failure. The turn body now writes to a `turnError` slot; a single pass after the inner try decides whether to finalize claimed commands as `completed` or `failed`, with each finalize call wrapped in its own try/catch so a failure inside finalize does not flip a successful turn into `failed` and double-finalize the same commands. The outer catch only rethrows the original turn error. - src/utils/processUserInput/processSlashCommand.tsx (228-276): wrap the post-success `finalizeDeferredAutonomyRunCompleted()` call in its own try/catch so a finalize failure no longer falls into the worker-failure catch path and emits a contradictory `<scheduled-task-result status="failed">` for a slash command that actually succeeded. Outside scope (not changed) — the CodeRabbit suggestion to add a `.ts` extension to the shared `tests/mocks/auth` import contradicts the project's existing convention: every other test imports the shared mocks without the extension (e.g. `tests/mocks/log`, `tests/mocks/debug`, `tests/mocks/file-system`), and the project's tsconfig does not enable `allowImportingTsExtensions`, so adding the extension fails typecheck. The import is kept extension-less to match the rest of the suite. Validation: - bun run typecheck (clean). - bun test → 3996 pass / 0 fail across 305 test files. * docs: 给 sur-skill-overflow-bugs 的代码块加 bash 标签 应用 PR #386 review 的剩余 nit。pid_max 边界、REPL cast、autonomy-jira typo 三处与远端 fixup (452a7e6) 内容相同,rebase 时已去重,本次提交仅包含 code fence 语言标签这一项。 * fixup: 处理 PR #386 review 中尚未覆盖的 4 项 - src/cli/print.ts: cron onFire 改用 createAutonomyQueuedPromptIfNoActiveSource 并以 prompt 文本作为 sourceId,避免同一定时提示在前一次 run 仍活跃时被重复 入队叠加;顺手移除 4 个已没人引用的 dead import (commitAutonomyQueuedPrompt / prepareAutonomyTurnPrompt / markAutonomyRunCancelled / createAutonomyQueuedPrompt) - src/services/compact/postCompactCleanup.ts: 在 void import().then() 处加 注释,明确 sweepFileContentCache 是有意的 fire-and-forget,函数对外保持 同步签名是设计而非疏忽 - src/utils/autonomyFlows.ts: 给 selectPersistedAutonomyFlows 的两阶段排序 加文档注释(先按 active+updatedAt 选 top-N,再统一按 updatedAt 重排) - tests/integration/autonomy-lifecycle-user-flow.test.ts: stderr 断言失败时 把实际 stderr 内容写进 message,方便 CI 失败时定位 * refactor: 简化/复用/防御 — 清理 PR #386 审计发现 简化 (S1, S2): - src/cli/print.ts: 抽出 dispatchHeadlessCronCommand 本地 helper,把 cron 三个入口(onFire / onFireTask agent / onFireTask 非-agent)共享的 「dedup-claim → input-close-recheck → onSuccess」管线集中到一处, 避免三个分支在「claim 与 dispatch 之间发生 inputClosed」的处理上漂移。 enqueueAndRun 再抽出来,使两个非-agent 分支共用一个 onSuccess 回调。 约 -55 行重复模板。 - src/utils/autonomyPersistence.ts: 新增 retainActiveFirst<T> 泛型 helper —— active 记录无条件保留(不参与 cap),inactive 按 timestamp desc 填满剩余预算;统一 selectPersistedAutonomyRuns / Flows 的两阶段 排序语义。 - src/utils/autonomyRuns.ts、autonomyFlows.ts: 改用 retainActiveFirst, 删掉重复的内联两阶段排序逻辑。 复用 (R1, review #8): - tests/mocks/file-system.ts: 新增 readTempFile / tempPathExists 两个 Bun.file 包装,补齐 Node fs.readFileSync / existsSync 在测试里的 Bun-only 等价物。 - src/utils/__tests__/autonomyRuns.test.ts: 把全部 Node fs/path 导入 (existsSync, readFileSync, mkdir, writeFile, path.join/resolve)替换为 tests/mocks/file-system 的共享 helper + node:path(带 node: 前缀)。 不再有 6 处 mkdir + writeFile 模板,统一用 writeTempFile(自带 mkdir-p)。 解决 review #8 (Major) 的 Bun-only 运行时契约违反。 防御 (D1, OOM 早期信号): - src/services/compact/postCompactCleanup.ts: 在 void import().then() 末尾 补 .catch(logError)。当前 attributionHooks 是 stub,但当真实现被恢复 且 sweepFileContentCache 抛错时,这个 .catch 阻止它变成 unhandled rejection(函数返回值是 void,调用者无从观察异步失败)。 - src/utils/autonomyRuns.ts: 给 active runs 加 100 条软上限 + 一次性 warn。selectPersistedAutonomyRuns 仍然永不淘汰 active 记录,但跨过 阈值时 logError 一次,作为 finalize-leak 早期信号——避免 active 无限 增长悄悄使 AUTONOMY_RUNS_MAX 失效。 --------- Co-authored-by: unraid <local@unraid.local> Co-authored-by: Claude <noreply@anthropic.com>
2 parents 4f1649e + 7a6e65c commit edae3a7

53 files changed

Lines changed: 5130 additions & 1048 deletions

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

docs/agent/sur-loop-scheduled-oom.md

Lines changed: 492 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 91 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,91 @@
1+
# System Understanding Report — Skill Search / Skill Learning Overflow Bugs
2+
3+
- **Flow id**: `recurring-bug-skill-overflow` (sibling pilot to `recurring-bug-loop-oom`)
4+
- **Branch**: `fix/loop-scheduled-autonomy-oom` (folded into the OOM PR — same audit-and-cap pattern)
5+
- **Trigger**: post-merge review of the autonomy OOM fix surfaced unbounded module-level state in adjacent `EXPERIMENTAL_SKILL_SEARCH` and `SKILL_LEARNING` subsystems. The user explicitly asked for a `肯定也有同类溢出` audit.
6+
7+
---
8+
9+
## 1. Problem
10+
11+
The autonomy OOM bug came from unbounded module-level state (run records, scheduler queues, heartbeat timestamps) growing for the lifetime of the process. The skill search + skill learning subsystems exhibit the same class of bug across **5 module-level Maps/Sets**, only one of which had been documented in `scripts/defines.ts` ("projectContext cache 无淘汰机制(非 GB 级主因)").
12+
13+
These bugs were latent because:
14+
15+
- `EXPERIMENTAL_SKILL_SEARCH` / `SKILL_LEARNING` were enabled-by-default in `DEFAULT_BUILD_FEATURES`, but tests pass because they exercise short paths.
16+
- None of the unbounded caches grow per-tool-call; they grow per **distinct query** / **distinct cwd** / **distinct skill name** / **distinct gap signal** / **distinct promotion**, which is sub-linear in session length but monotone forever.
17+
- A long-running daemon-style process (KAIROS sessions, multi-day worktrees) would observe the growth.
18+
19+
## 2. Module-level state audit
20+
21+
| File:Line | Symbol | Pre-fix bound | Pre-fix evict |
22+
|---|---|---|---|
23+
| `intentNormalize.ts:52` | `cache: Map<query, keywords>` | none | only `clearIntentNormalizeCache()` for tests |
24+
| `prefetch.ts:17` | `discoveredThisSession: Set<skillName>` | none | none |
25+
| `prefetch.ts:18` | `recordedGapSignals: Set<gapKey>` | none | none |
26+
| `projectContext.ts:48` | `contextCache: Map<cwd, ProjectContext>` | none | only `resetProjectContextCacheForTest()` |
27+
| `promotion.ts:26` | `sessionPromotedIds: Set<instinctId>` | none | only `resetPromotionBookkeeping()` for tests |
28+
| `runtimeObserver.ts:61` | `lastProcessedMessageIds: Set<msgKey>` | **MAX 1000** | FIFO trim ✓ already bounded |
29+
| `toolEventObserver.ts:50` | `emittedTurns: Map<sid, Set<turn>>` | **MAP_MAX 50, SET_MAX 100** | LRU prune via `pruneEmittedTurns()` called inside `markTurn` ✓ already bounded |
30+
| `observerBackend.ts:21` | `registry: Map<name, Backend>` | fixed N | n/a — registry pattern, finite ✓ |
31+
32+
**5 unbounded out of 8 module-level mutables.** All 5 are addressed in this PR.
33+
34+
## 3. Severity rationale
35+
36+
Per-entry cost is small (key strings + small objects), so OOM in days is unlikely on a normal workstation. But the canary scenarios:
37+
38+
- **`intentNormalize.cache`**: every distinct Chinese query → Haiku call → cached. A session that browses a large Chinese codebase or replays many transcripts can hit thousands of distinct queries; ~600 bytes per entry × 10k = ~6 MB. Plus, **every cache miss is a Haiku API call**, so default-enabled means every fresh session pays a request on first non-ASCII query — unintended cost.
39+
- **`projectContext.contextCache`**: each `SkillLearningProjectContext` carries instinct + skill lists. Multi-worktree orchestrators (this very repo!) blow past the typical "1 cwd per session" assumption.
40+
- **`prefetch` Sets**: in chatty sessions thousands of skill discovery names accumulate.
41+
- **`sessionPromotedIds`**: smallest practical risk (single-digit promotions per session normally), but a long-lived sandbox could push it; a defensive cap is cheap.
42+
43+
The fix bounds all 5 with FIFO/LRU eviction at sensible sizes (200–1000 entries). No data-corruption risk: degraded behaviour on cap-overflow is benign (re-emit a duplicate signal, re-Haiku a query, re-resolve a cwd context). Same risk profile as the autonomy stale-recovery design.
44+
45+
## 4. Fix surface
46+
47+
| File | Change |
48+
|---|---|
49+
| `src/services/skillSearch/intentNormalize.ts` | `setCachedQueryIntent()` helper, `CACHE_MAX_ENTRIES=200` / `CACHE_TRIM_TO=150`, LRU touch on hit |
50+
| `src/services/skillSearch/prefetch.ts` | `addBoundedSessionEntry()` helper, `SESSION_TRACKING_MAX=1000` / `TRIM_TO=750`; `discoveredThisSession` and `recordedGapSignals` route through it |
51+
| `src/services/skillLearning/projectContext.ts` | `setProjectContextCache()` helper, `PROJECT_CONTEXT_CACHE_MAX=32` / `TRIM_TO=24`, LRU touch on hit |
52+
| `src/services/skillLearning/promotion.ts` | `recordSessionPromoted()` helper, `SESSION_PROMOTED_IDS_MAX=256` / `TRIM_TO=192` |
53+
| `src/services/skillSearch/featureCheck.ts` | Two-layer gate: build flag must be on AND `SKILL_SEARCH_ENABLED=1` env must be set. Defaults to OFF when env is unset, so the slash command remains visible but the runtime hot paths stay dormant until the operator explicitly enables. |
54+
| `src/services/skillLearning/featureCheck.ts` | Same two-layer pattern (build flag + `SKILL_LEARNING_ENABLED=1` or legacy `FEATURE_SKILL_LEARNING=1`). |
55+
| `scripts/defines.ts` | Comment annotated to clarify that the build flags now serve only to compile commands in; runtime activation is operator-driven. |
56+
57+
## 5. Why default-off (without removing from build)?
58+
59+
Three reasons aside from the unbounded-cache concern:
60+
61+
1. **Implicit cost**: `intentNormalize` calls Haiku on cache miss. Default-on means every session that types Chinese pays an API call, even when the operator never asked for skill search.
62+
2. **Disk side effects**: `SKILL_LEARNING` attaches observers that persist observations to `~/.claude` storage. Storage volume should be opt-in, not background.
63+
3. **Experimental status**: the flag is literally named `EXPERIMENTAL_*`. Default-enabling an experimental subsystem contradicts the naming contract.
64+
65+
**The fix is NOT to remove the flags from `DEFAULT_BUILD_FEATURES`** — doing so would also strip the `/skill-search` and `/skill-learning` slash commands from the build, leaving operators with no UI to opt in. Instead the activation logic in `featureCheck.ts` was changed to a two-layer gate:
66+
67+
- **Layer 1 (compile-time)**: `feature('EXPERIMENTAL_SKILL_SEARCH')` / `feature('SKILL_LEARNING')` must be on. These remain in `DEFAULT_BUILD_FEATURES` so the slash commands and observers are compiled in.
68+
- **Layer 2 (runtime)**: `SKILL_SEARCH_ENABLED=1` / `SKILL_LEARNING_ENABLED=1` (or `FEATURE_SKILL_LEARNING=1`) env var must be set. Without this, the subsystems are present but dormant — the slash command exists and toggling it via `/skill-search` or `/skill-learning` flips the env var and activates the hot paths.
69+
70+
Net result: operators see the toggle in the UI but the subsystem is **off until they flip it**.
71+
72+
## 6. Out of scope (filed for follow-up)
73+
74+
- **Test failures on CI** (`prefetch.test.ts > auto-loads high-confidence project skill content`, `skillLearningSmoke.test.ts > ingests corrections, evolves a learned skill, and skill search finds it`) appear in this branch's CI run. Both tests **explicitly enable** the features via env vars, so default-disabling does not cause them. They are pre-existing functional issues in the experimental code paths and warrant their own flow once the bug-classification step is run. Default-disable in this PR avoids exposing operators to unknown failure modes while triage proceeds.
75+
- **Persistence-layer bounds** (observation files, instinct registry): `observationStore.ts` already has 30-day purge and 1MB archive thresholds; `skillGapStore.ts` uses a finite-state lifecycle. Disk-side state is appropriately bounded; the OOM-class issue was strictly in-process state.
76+
77+
## 7. Verification
78+
79+
Local checks (full suite covers cap behaviour via existing tests; the caps degrade gracefully so no test should break):
80+
81+
```bash
82+
bun run typecheck # 0 errors
83+
bun test src/services/skillSearch/__tests__/intentNormalize.test.ts
84+
bun test src/services/skillSearch/__tests__/prefetch.extractQuery.test.ts
85+
bun test src/services/skillLearning/__tests__/projectContext.test.ts
86+
bun test src/services/skillLearning/__tests__/promotion.test.ts
87+
bun run lint
88+
bun run build
89+
```
90+
91+
The new caps are observable behaviour: under sustained load the Map/Set sizes plateau at the configured maxima rather than monotone-growing.

0 commit comments

Comments
 (0)