fix(serve): post-merge P2 corrections from Codex review on #4282#4297
fix(serve): post-merge P2 corrections from Codex review on #4282#4297doudouOUC wants to merge 7 commits into
Conversation
Follow-up to PR #4282 (Wave 4 PR 17) addressing four P2 issues flagged by Codex's `/review` after the squash-merge to main: P2-1 — Read the workspace context filename for init `qwen serve` parent never goes through `loadCliConfig`, so the process-global `getCurrentGeminiMdFilename()` stays on the default `QWEN.md` even when the workspace configures `context.fileName: 'AGENTS.md'`. `runQwenServe` now snapshots the workspace's merged setting at boot and forwards via `BridgeOptions.contextFilename`, so init writes the same file the ACP child reads. P2-2 — Restart MCP servers with a fresh disabledTools snapshot `Config.disabledTools` was frozen at construction time; `setWorkspaceToolEnabled` only updated settings.json. The documented "toggle + restart" workflow re-registered just-disabled tools because rediscovery still saw the bootstrap snapshot. Added `Config.setDisabledTools()` plus a re-read at the ACP restart handler so `discoverMcpToolsForServer` honors the latest set. P2-3 — Match the SDK timeout to the daemon's restart budget Bridge waits up to 300s for stdio MCP discovery; SDK helper used the client-wide 30s default and aborted valid slow restarts. Added a per-call `timeoutMs` plumbed through `fetchWithTimeout`, defaulting `restartMcpServer` to 5 minutes. P2-4 — Reject symlinked parent directories before init writes `lstat(target)` only checked the final component; a symlinked parent (e.g. `docs -> /tmp` with `context.fileName: 'docs/QWEN.md'`) would let `writeFile` follow the link and create / truncate outside `boundWorkspace`. Added `canonicalizeExistingAncestor` (walks up through ENOENT to the deepest extant ancestor, then `realpath`s) and verifies the canonical parent stays within the canonical workspace. 5 new tests (4 bridge / 2 SDK): - contextFilename snapshot honored - parent-symlink escape rejected - nested real subdir accepted - restartMcpServer survives 1.2s response with 1s default timeout - restartMcpServer honors a 50ms caller override Typecheck clean across cli / sdk-typescript / core. 1604/1604 unit tests pass.
📋 Review SummaryThis PR addresses four P2-severity issues identified by Codex's 🔍 General Feedback
🎯 Specific Feedback🟡 High
🟢 Medium
🔵 Low
✅ Highlights
|
There was a problem hiding this comment.
Pull request overview
This PR is a post-merge follow-up to #4282 that fixes several correctness issues in the qwen serve mutation routes and aligns SDK behavior with daemon budgets, with accompanying unit tests.
Changes:
- Snapshot and plumb workspace
context.fileNameintoPOST /workspace/initso init writes the same context file the ACP child reads. - Refresh
Config.disabledToolsfrom workspace settings before MCP rediscovery so “toggle + restart” respects the latest disabled set. - Add a per-call fetch timeout override in the TypeScript SDK and default
restartMcpServerto the daemon’s 5-minute restart budget; add tests. - Harden init writes against intermediate parent-directory symlink escapes; add tests.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/sdk-typescript/src/daemon/DaemonClient.ts | Adds per-call timeout plumbing and a 5-minute default for restartMcpServer. |
| packages/sdk-typescript/test/unit/DaemonClient.test.ts | Adds tests intended to cover slow restart and per-call timeout override behavior. |
| packages/core/src/config/config.ts | Makes disabledTools re-syncable via setDisabledTools(). |
| packages/cli/src/acp-integration/acpAgent.ts | Re-reads workspace tools.disabled before MCP restart rediscovery to honor toggles. |
| packages/cli/src/serve/runQwenServe.ts | Snapshots merged context.fileName at boot and forwards it to the bridge. |
| packages/cli/src/serve/httpAcpBridge.ts | Uses bridged contextFilename and rejects init writes that escape via symlinked parents. |
| packages/cli/src/serve/httpAcpBridge.test.ts | Adds unit tests for contextFilename plumbing and parent-symlink escape rejection/acceptance. |
Comments suppressed due to low confidence (1)
packages/sdk-typescript/test/unit/DaemonClient.test.ts:1506
- This test doesn’t currently verify that the 50ms
timeoutMsoverride is honored: the stub rejects after 500ms regardless of whether the request was aborted by the client timeout, and it also doesn’t wire the rejection toinit.signal. This can pass even iftimeoutMsis ignored. Consider wiringinit.signalto reject immediately on abort and asserting the thrown error is the timeout (e.g.TimeoutError) rather than a generic rejection.
const slowFetch = vi.fn(
() =>
new Promise<Response>((_resolve, reject) => {
// Never resolve — wait for the caller's timeout to kick in.
// The AbortSignal coming through `init.signal` triggers
// this rejection so the timer doesn't leak.
setTimeout(
() => reject(new DOMException('aborted', 'AbortError')),
500,
);
}),
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Follow-up addressing the 8 unresolved review threads opened on PR #4282 after its squash-merge to main. Stacked on the P2 fixes shipping in this same #4297; addresses correctness gaps + missing test coverage that would otherwise let regressions ride into main. Behavior fix: - broadcastWorkspaceEvent gains a `skipSessionId` parameter; when `setSessionApprovalMode` runs with `persist:true`, the broadcast skips the requesting session so it doesn't receive the same `approval_mode_changed` event twice (once via session-scoped publish + once via broadcast). The SDK reducer's `approvalModeChangedCount` now increments by 1, not 2, on the requesting client (peers still see 1 via the broadcast). Addresses #3260501134. Observability + posture: - broadcastWorkspaceEvent now mirrors PR 16's publishWorkspaceEvent member: per-entry success/failure accounting + an "ALL buses dropped" stderr elevation. The previous local helper silently swallowed every publish failure. Addresses #3260501126. - WorkspaceInitPathEscapeError + WorkspaceInitSymlinkError typed classes for the two boundary guards in initWorkspace, mapped to HTTP 400 by sendBridgeError. Previous generic `Error` fell through to the 500 handler, telling operators "daemon broken" when the actual fix was workspace-config correction. Addresses #3260501161. Public surface symmetry: - Re-export McpServerNotFoundError, McpServerRestartFailedError, WorkspaceInitPathEscapeError, WorkspaceInitSymlinkError from the serve barrel. External embeds matching these via `instanceof` no longer need deep imports. Addresses #3260501163. Test coverage: - restartMcpServer bridge tests (5): success + event broadcast, soft-skip + refused event, McpServerNotFoundError translation, McpServerRestartFailedError translation, originator clientId stamping. Addresses #3260501141. - sendBridgeError mapping tests (4): McpServerNotFoundError → 404, McpServerRestartFailedError → 502, WorkspaceInitPathEscapeError → 400, WorkspaceInitSymlinkError → 400. Addresses #3260501148. - initWorkspace boundary guard tests (2 added): symlink-at-target rejected, contextFilename '../outside.md' rejected. Addresses #3260501157. - TrustGateError tests assert the typed class via `.toThrow(TrustGateError)`, not just message text. Addresses #3260501165. Also updates the existing fold-in 4 S2 broadcast test to reflect the new no-duplicate semantics on the requesting session. Typecheck clean across cli / sdk-typescript / core. 1615/1615 unit tests pass.
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
Round-2 reviewer adoption on the same PR:
Critical fixes:
- `restartMcpServer` JSDoc documents `timeoutMs: 0` as "disable the
timeout entirely", but the `> 0` guard in `fetchWithTimeout`
rejected `0` and silently fell back to the 30s client default.
Loosened the guard to `>= 0` so `0` flows through to the
no-timeout branch via the existing truthiness check; NaN /
negative inputs still coerce to the client default. Addresses
duplicate reports from copilot (#3260577538) and wenshao
(#3260661833).
- TS2322 in the slow-fetch test stub: `resolveResponse` was typed
against `import('undici-types').Response` but assigned a
`(v: Response) => void`. Re-typed against the global `Response`
throughout. Caught only by tsc runs that include the test
files. Addresses #3260663072.
Test fidelity:
- Slow-fetch stub now observes `init.signal` and rejects on abort,
so a regression that drops the per-call `timeoutMs` override
will reliably fail the test instead of resolving after the
timer fired (false-negative coverage). Addresses #3260577600.
- New test pinning the `timeoutMs: 0` semantics: 1ms client
default + a stub that resolves after 50ms. Without the `>= 0`
fix, the call would abort at 1ms; with it, the explicit
`0` disables the timer and the call completes.
Bug fixes:
- `runQwenServe.contextFilenameForInit` previously called
`String(arr[0])` on the array branch, producing a literal
`"[object Object]"` filename for hand-edited bad data. Now
validates each element with `typeof === 'string'` and falls
back to `undefined` (so the bridge uses its
`getCurrentGeminiMdFilename()` default) when no string is
found. Addresses #3260577641.
Documentation drift:
- `Config.getDisabledTools()` JSDoc rewritten to describe the
mutable-via-`setDisabledTools()` semantics introduced by P2-2,
and the "registration-time only / no retroactive unregister"
contract that pairs with it. Old comment claimed the set was
frozen at construction. Addresses #3260577677.
Observability:
- `acpAgent` MCP-restart `loadSettings` failure now surfaces a
stderr line naming the server + the underlying error, instead
of silently swallowing it. The documented "toggle + restart"
workflow used to break with zero diagnostic when settings.json
was corrupted or unreadable. Addresses #3260663303.
Code organization:
- Moved `canonicalizeExistingAncestor` after `describeStatKind` so
the latter's JSDoc is no longer orphaned (TypeScript only
associates the last `/** ... */` block before a declaration).
Addresses #3260668618.
Typecheck clean across cli / sdk-typescript / core.
1616/1616 unit tests pass.
Critical bug from wenshao review (#3260725526) on PR #4297: the P2-2 acpAgent re-read narrowed `Config.disabledTools` to `SettingScope.Workspace` alone, dropping User / System scope entries. The bootstrap Config received `merged.tools?.disabled` (union of all scopes), so user-level / system-level disables worked at boot — but the first `mcp restart` would replace the in-memory set with the workspace scope alone, silently re-enabling any tool that was disabled at a higher scope but absent from the workspace file. The asymmetry vs. the persist-write path is deliberate and documented: - Reads (here): merged — match the bootstrap Config snapshot, preserve user/system policy. - Writes (`runQwenServe.persistDisabledTools`): workspace scope — don't bake higher-scope entries into the workspace file (per-#4282 fold-in 1 H2 fix). Two paths look alike but answer different questions. Typecheck clean across cli / sdk-typescript / core. 1616/1616 unit tests pass.
Critical follow-up from wenshao (#3260810242) on PR #4297: the new `timeoutMs: 0` regression test (added in fold-in 2) inherited the same flaw it was meant to prevent — the slow-fetch stub didn't observe `init.signal`, so a regression that ignored the `0` override would fire the AbortController at the 1ms client default but the stub would keep the promise pending. The 50ms `resolveResponse` would win, the test would still pass, and the documented "0 disables timeout" contract would be unprotected. Mirrored the listener pattern already used by the two sibling tests in fold-in 2 — `init.signal.addEventListener('abort', () => reject(...))`. Now a regression that re-rejects `0` triggers the abort, the stub rejects, the test fails. 8/8 restartMcpServer SDK tests pass; SDK typecheck clean.
wenshao
left a comment
There was a problem hiding this comment.
[Suggestion] toDeviceFlowStateBody (server.ts:1996) unconditionally exposes initiatorClientId to any bearer-token holder via GET /workspace/auth/device-flow/:id, while the sibling toDeviceFlowStartResponseBody (line 1968-1972) gates it behind a callerClientId === view.initiatorClientId check. Thread callerClientId into toDeviceFlowStateBody and apply the same matching logic. (Not in this PR's diff, but noticed during review.)
Two new critical reviews from wenshao on PR #4297: C1 — TOCTOU between lstat and writeFile (#3260836305): The `lstat(target)` symlink check and the subsequent `writeFile` were two separate syscalls, leaving a race window where a local attacker with workspace write access could substitute a symlink between them. With `force: true`, `writeFile` would follow the link and truncate an external target. The `action === 'created'` path now uses `fs.open(target, 'wx')` (O_WRONLY|O_CREAT|O_EXCL), which atomically refuses any pre-existing inode (regular file, dir, OR symlink) at the target path. EEXIST after the absence check most plausibly means a race-created symlink, so we throw `WorkspaceInitSymlinkError(kind: 'target')` — same typed class the route maps to 400. The `force: true` overwrite path retains the existing TOCTOU as a documented limitation; closing it requires `O_NOFOLLOW`-aware open which the post-PR18 `WorkspaceFileSystem` migration will provide. C2 — P2-2 zero test coverage (#3260836302): The `setDisabledTools` runtime sync was the only Wave-4 P2 fix without a dedicated test. Added 5 Config-level tests: - Initializes from `disabledTools` ConfigParameters - Defaults to empty set when omitted - `setDisabledTools` replaces the live snapshot - Defensive copy: caller-set mutations don't leak into the live snapshot - Accepts an empty set (clears live snapshot) Plus a TOCTOU regression test in httpAcpBridge.test.ts that spies fs.lstat / fs.readFile to simulate the race window: pre-creates a symlink, makes lstat lie about it, asserts the 'wx' open catches the racing inode and throws the typed `WorkspaceInitSymlinkError(kind: 'target')`. 1622/1622 unit tests pass; typecheck clean across cli / sdk-typescript / core.
wenshao
left a comment
There was a problem hiding this comment.
Critical
1. restartMcpServer throws misleading SessionNotFoundError when no ACP child process is running. packages/cli/src/serve/httpAcpBridge.ts (restartMcpServer method): When liveChannelInfo() returns undefined, the code throws new SessionNotFoundError('mcp:' + serverName) which maps to HTTP 404 with "No session with id \"mcp:docs\"". The real problem is "no ACP child process available", not "session not found." An oncall engineer would waste time investigating the wrong thing. Consider introducing a dedicated error type that maps to HTTP 503.
Suggestions
2. No stderr logging on bridge restartMcpServer failure. The catch block in the bridge's restartMcpServer method has zero writeStderrLine calls. Structured ACP errors are re-thrown; unstructured errors fall through with throw err. No diagnostic reaches stderr. The ACP-side handler in acpAgent.ts already logs disabledTools sync failures — the bridge should match this pattern.
3. setSessionApprovalMode persist failure indistinguishable from non-persist. When persist: true but persistApprovalMode throws, the route returns 200 OK with {persisted: false}. The caller cannot tell a disk write failure from "no callback wired." After daemon restart the old approval mode is restored — silent data loss.
4. canonicalizeExistingAncestor duplicates ancestor-walking logic from packages/cli/src/serve/fs/paths.ts (resolveWithinWorkspace, ~L376). The existing implementation has MAX_ANCESTOR_HOPS, inode loop detection, and full symlink chain resolution. This PR's simpler while(true) loop handles only ENOENT/ENOTDIR. Code acknowledges this is intentional (waiting for PR 18), but two parallel implementations risk divergence. Consider at minimum adding a hop limit.
— DeepSeek/deepseek-v4-pro via Qwen Code /review
DeepSeek review on #4297 (#3261079572): `broadcastWorkspaceEvent` unconditionally subtracted 1 from the `eligible` recipient count whenever `skipSessionId` was set, even when the id matched zero live sessions (caller mistake, stale id, or the matching session was just torn down between resolution and broadcast). In a single-session workspace that's the difference between `eligible = 0` (alarm suppressed) and `eligible = 1` (alarm fires when the publish failed) — silently losing the all-dropped breadcrumb the telemetry was meant to surface. Today's call sites pass real session ids so the bug doesn't manifest in practice, but the defensive shape is small: track `skippedCount` inside the loop and subtract that, so the alarm condition is self-consistent regardless of how the caller mis-uses the param. 162/162 bridge tests pass; CLI typecheck clean.
Summary
Follow-up to PR #4282 (Wave 4 PR 17 —
feat(serve): approval / tools / init / MCP-restart mutation routes) addressing four P2-severity issues flagged by Codex's/reviewafter the squash-merge to main. All four are real correctness bugs in the code shipped via #4282; none can be reproduced by the existing test suite, so the fixes ship with five new unit tests.Issues fixed
P2-1 — Read workspace
context.fileNameforPOST /workspace/init. The daemon parent never callsloadCliConfig, so the process-globalgetCurrentGeminiMdFilename()stays on the defaultQWEN.mdeven when the workspace's merged settings configurecontext.fileName: 'AGENTS.md'.runQwenServenow snapshots the value at boot and forwards viaBridgeOptions.contextFilename, so init writes the same file the ACP child reads. (packages/cli/src/serve/{httpAcpBridge,runQwenServe}.ts)P2-2 — Restart MCP servers with a fresh
disabledToolssnapshot.Config.disabledToolswasprivate readonlyand frozen at construction time;setWorkspaceToolEnabledonly updatedsettings.json. The documented "toggle + restart" workflow re-registered the just-disabled MCP tool becausediscoverMcpToolsForServerwalksToolRegistry.registerTool, which consults the bootstrap snapshot. AddedConfig.setDisabledTools()plus a re-read at the ACP restart handler so the next discovery honors the latest set. Already-registered tools still aren't retroactively unregistered (matches the existing documented contract). (packages/core/src/config/config.ts,packages/cli/src/acp-integration/acpAgent.ts)P2-3 — Match the SDK
restartMcpServertimeout to the daemon's 5-minute restart budget. Bridge-sideMCP_RESTART_TIMEOUT_MSis 300s; SDK helper used the client-wide 30sfetchTimeoutMsdefault. Slow but valid stdio reconnects appeared to fail client-side while the daemon kept working. Plumbed a per-calltimeoutMsthroughfetchWithTimeout, defaultingrestartMcpServerto 5 minutes; callers can pass tighter or0-to-disable when their threat model differs. (packages/sdk-typescript/src/daemon/DaemonClient.ts)P2-4 — Reject symlinked parent directories before init writes.
lstat(target)only checked the final path component. Withcontext.fileName: 'docs/AGENTS.md'anddocs -> /tmpas a symlink,writeFilewould follow the parent link and create or truncate outsideboundWorkspace. AddedcanonicalizeExistingAncestor(walks up through ENOENT/ENOTDIR to the deepest extant ancestor, thenrealpaths) and verifies the canonical parent stays within the canonical workspace. (packages/cli/src/serve/httpAcpBridge.ts)Tests added
packages/cli/src/serve/httpAcpBridge.test.tshonors contextFilename from BridgeOptions (P2-1)packages/cli/src/serve/httpAcpBridge.test.tsrejects writes when a parent directory symlinks outside the workspace (P2-4)packages/cli/src/serve/httpAcpBridge.test.tsaccepts writes when a parent directory is a real subdir (P2-4)packages/sdk-typescript/test/unit/DaemonClient.test.tssurvives a slow daemon response longer than the client default timeout (P2-3)packages/sdk-typescript/test/unit/DaemonClient.test.tshonors a caller-provided timeoutMs override (P2-3)P2-2 is exercised structurally by the new
Config.setDisabledToolssetter plus the existingacpAgentextMethod test surface; the live re-read path is acceptance-tested via the daemon end-to-end harness.Test plan
npm run typecheck --workspace packages/cli --workspace packages/sdk-typescript --workspace packages/core— cleannpx vitest run packages/cli/src/serve/ packages/cli/src/acp-integration/ packages/sdk-typescript/test/unit packages/core/src/tools/tool-registry.test.ts packages/core/src/config/config.test.ts— 1604/1604context.fileName: 'AGENTS.md', hitPOST /workspace/init, verifyAGENTS.md(notQWEN.md) is createdPOST /workspace/tools/Bash/enable {enabled:false}thenPOST /workspace/mcp/<server>/restart— verify the disabled tool stays unregisteredRoll-forward / roll-back
contextFilename,timeoutMs) — passingundefinedreproduces the pre-fix shape exactly.Config.setDisabledToolsis additive; pre-existing callers are unaffected.git revert <commit>away.🤖 Generated with Qwen Code