Skip to content

Commit c2bd6d0

Browse files
cliffhallclaude
andcommitted
fix(servers): close GET migration race; reverse keychain/disk order (#1356)
Addresses code review: 1. **Concurrency**: GET /api/servers was reading + migrating outside the write lock, then writing inside it — a concurrent POST/PUT/DELETE between the read and the lock acquisition could be clobbered by the migration write. Also affected the first-launch seed write. Fix: take a fast unlocked peek; if the file is absent (seed branch) or carries plaintext to migrate (migration branch), re-read inside the write lock and decide there. Added `hasPlaintextSecrets()` as a pure predicate so the common no-migration path stays lock-free. 2. **Write ordering**: POST and PUT wrote disk first, then keychain. A keychain `set` failure mid-write left a half-configured entry on disk and trapped retries at 409. Reversed to keychain-first so a 503 leaves both stores in their pre-write state. PUT in-place now sets new fields, writes disk, then deletes obsolete fields (a failed disk write leaves orphan keychain entries — recoverable — rather than premature deletions of fields the user still expects to see). 3. **Tolerance simplification**: dropped the no-entry regex in `KeyringSecretStore.delete` — both "no entry" and "keychain unavailable" collapse to the same desired outcome (the entry isn't there anymore), so a uniform swallow is clearer. 4. **Parallelism**: `readKeychainEntriesFor` now `Promise.all`s the per-field `get` calls. macOS Keychain round-trips are 10-50ms; serializing 5+ env vars per server × N servers is a meaningful wall-clock cost on GET. 5. **Spec doc**: reflects the tolerance contract (only `set` hard-fails) and the keychain-before-disk write ordering rationale. New regression test in servers-route.test.ts covers the retry-after-503 contract directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 8941935 commit c2bd6d0

4 files changed

Lines changed: 179 additions & 56 deletions

File tree

clients/web/src/test/integration/mcp/remote/servers-route.test.ts

Lines changed: 48 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1479,6 +1479,54 @@ describe("/api/servers routes", () => {
14791479
}
14801480
});
14811481

1482+
it("POST 503 leaves no disk entry — retry is not trapped at 409", async () => {
1483+
// Regression for the write-ordering review comment. If keychain
1484+
// set ran AFTER the disk write, a failed `set` would leave the
1485+
// entry on disk and the user's retry would hit 409. With the
1486+
// reversed order, a failed `set` leaves disk untouched and the
1487+
// retry can proceed.
1488+
const u = await startUnavailableHarness();
1489+
try {
1490+
const first = await fetch(`${u.baseUrl}/api/servers`, {
1491+
method: "POST",
1492+
headers: { "Content-Type": "application/json" },
1493+
body: JSON.stringify({
1494+
id: "retry-test",
1495+
config: { type: "streamable-http", url: "https://x.test/mcp" },
1496+
settings: {
1497+
headers: [],
1498+
metadata: [],
1499+
connectionTimeout: 0,
1500+
requestTimeout: 0,
1501+
oauthClientId: "cid",
1502+
oauthClientSecret: "shh",
1503+
},
1504+
}),
1505+
});
1506+
expect(first.status).toBe(503);
1507+
1508+
// Disk: no entry persisted. A GET should not see "retry-test".
1509+
const cfg = existsSync(u.configPath)
1510+
? (JSON.parse(readFileSync(u.configPath, "utf-8")) as MCPConfig)
1511+
: { mcpServers: {} };
1512+
expect(cfg.mcpServers).not.toHaveProperty("retry-test");
1513+
1514+
// Retry with no secret should now succeed (no 409).
1515+
const second = await fetch(`${u.baseUrl}/api/servers`, {
1516+
method: "POST",
1517+
headers: { "Content-Type": "application/json" },
1518+
body: JSON.stringify({
1519+
id: "retry-test",
1520+
config: { type: "stdio", command: "node" },
1521+
}),
1522+
});
1523+
expect(second.status).toBe(200);
1524+
} finally {
1525+
await new Promise<void>((r) => u.server.close(() => r()));
1526+
rmSync(u.tempDir, { recursive: true });
1527+
}
1528+
});
1529+
14821530
it("DELETE succeeds (sweep silently no-ops)", async () => {
14831531
const u = await startUnavailableHarness();
14841532
try {

core/auth/node/secret-store.ts

Lines changed: 8 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -121,12 +121,14 @@ export class KeyringSecretStore implements SecretStore {
121121
const entry = new AsyncEntry(SERVICE_NAME, buildAccount(serverId, field));
122122
try {
123123
await entry.deleteCredential();
124-
} catch (err) {
125-
// `deleteCredential` throws NoEntry for missing entries — treat as success.
126-
const msg = err instanceof Error ? err.message : String(err);
127-
if (/no entry|no matching|not found/i.test(msg)) return;
128-
// Keychain unavailable → silently no-op. We couldn't have written
129-
// anything either, so there's nothing to clean up.
124+
} catch {
125+
// Both reasons for a throw collapse to the same desired outcome
126+
// ("the entry isn't there anymore"): `deleteCredential` raises
127+
// NoEntry for a missing credential, and the native binding
128+
// raises a runtime error when the keychain itself is unavailable.
129+
// We treat both as success — there's no value to lose either
130+
// way, and `set` is the operation that hard-fails when the
131+
// keychain is actually down.
130132
}
131133
}
132134

core/mcp/remote/node/server.ts

Lines changed: 120 additions & 49 deletions
Original file line numberDiff line numberDiff line change
@@ -1162,18 +1162,38 @@ export function createRemoteApp(
11621162
// stripped shape into the shape today's browser code expects. Missing
11631163
// fields are simply absent from the result (so `mergeSecretsIntoStored`
11641164
// leaves the disk value untouched).
1165+
// Issue many keychain reads in parallel. On macOS each round-trip to
1166+
// Keychain Services is 10-50ms; a server with 20 entries × 5 env vars
1167+
// each would otherwise serialize to ~5s of rehydration on every GET.
1168+
// `@napi-rs/keyring`'s async APIs release the event loop, so
1169+
// Promise.all is a real win, not just stylistic.
11651170
const readKeychainEntriesFor = async (
11661171
id: string,
11671172
fields: string[],
11681173
): Promise<Record<string, string>> => {
1174+
const values = await Promise.all(
1175+
fields.map(async (field) => [field, await secretStore.get(id, field)] as const),
1176+
);
11691177
const out: Record<string, string> = {};
1170-
for (const field of fields) {
1171-
const v = await secretStore.get(id, field);
1178+
for (const [field, v] of values) {
11721179
if (v !== null) out[field] = v;
11731180
}
11741181
return out;
11751182
};
11761183

1184+
// Cheap structural check used by the GET handler to decide whether
1185+
// to enter the write lock for migration. Pure — no keychain or disk
1186+
// access. The actual migration (`migratePlaintextSecrets`) writes
1187+
// to the keychain, so we must not run it speculatively outside the
1188+
// lock; this predicate lets us keep the fast path lock-free.
1189+
const hasPlaintextSecrets = (config: MCPConfig): boolean => {
1190+
for (const stored of Object.values(config.mcpServers)) {
1191+
const { secrets } = extractSecretsFromStored(stored);
1192+
if (Object.keys(secrets).length > 0) return true;
1193+
}
1194+
return false;
1195+
};
1196+
11771197
// Migrate plaintext secrets in a freshly-read mcp.json into the
11781198
// keychain. Idempotent: when the keychain already has a value for
11791199
// `(id, field)`, that value wins and the disk plaintext is dropped
@@ -1238,24 +1258,31 @@ export function createRemoteApp(
12381258
return out;
12391259
};
12401260

1241-
// Reconcile keychain state for a single server on the write path:
1242-
// delete any field the prior entry held but the new entry doesn't
1243-
// (e.g. an env key the user just removed), then write the new
1244-
// secrets. Order matters — we delete obsolete first so that a
1245-
// pathological case where a field is "renamed" (delete + add a
1246-
// different one) doesn't leave the old value behind.
1247-
const reconcileKeychainEntry = async (
1248-
id: string,
1261+
// Fields the previous entry held that the new entry doesn't —
1262+
// candidates for deletion after the disk write succeeds. Returned
1263+
// as an array so the caller can decide where in the write sequence
1264+
// to perform the destructive operation (we want it after the disk
1265+
// write so a failed disk write doesn't leave the user with their
1266+
// old config on disk but missing keychain entries).
1267+
const computeObsoleteFields = (
12491268
previousFields: Set<string>,
12501269
nextSecrets: Record<string, string>,
1251-
): Promise<void> => {
1270+
): string[] => {
12521271
const nextFieldSet = new Set(Object.keys(nextSecrets));
1272+
const obsolete: string[] = [];
12531273
for (const field of previousFields) {
1254-
if (!nextFieldSet.has(field)) {
1255-
await secretStore.delete(id, field);
1256-
}
1274+
if (!nextFieldSet.has(field)) obsolete.push(field);
1275+
}
1276+
return obsolete;
1277+
};
1278+
1279+
const deleteKeychainFields = async (
1280+
id: string,
1281+
fields: string[],
1282+
): Promise<void> => {
1283+
for (const field of fields) {
1284+
await secretStore.delete(id, field);
12571285
}
1258-
await writeKeychainEntriesFor(id, nextSecrets);
12591286
};
12601287

12611288
const keychainErrorResponse = (
@@ -1270,33 +1297,55 @@ export function createRemoteApp(
12701297

12711298
app.get("/api/servers", async (c) => {
12721299
try {
1300+
// Fast path: peek at the file without taking the write lock. Most
1301+
// GETs land here — file exists, no plaintext to migrate. The
1302+
// unlocked read is safe because we don't write anything in this
1303+
// branch; a concurrent mutation just means the response is a
1304+
// snapshot from a moment ago.
12731305
const raw = await readStoreFile(mcpConfigPath);
1274-
let onDisk: MCPConfig;
1275-
if (raw === null) {
1276-
// First-launch seed. The seed has no secrets so no migration to
1277-
// run; persist verbatim and respond. We rehydrate anyway so the
1278-
// GET path stays uniform (no-op on a seed without env values).
1279-
await writeMcpAndTrackMtime(serializeStore(DEFAULT_SEED_CONFIG));
1280-
onDisk = DEFAULT_SEED_CONFIG;
1281-
} else {
1306+
if (raw !== null) {
12821307
const parsed = parseStore(raw) as { mcpServers?: unknown } | null;
1283-
onDisk = { mcpServers: normalizeMcpServers(parsed?.mcpServers) };
1308+
const onDisk: MCPConfig = {
1309+
mcpServers: normalizeMcpServers(parsed?.mcpServers),
1310+
};
1311+
if (!hasPlaintextSecrets(onDisk)) {
1312+
return c.json(await rehydrateConfig(onDisk));
1313+
}
12841314
}
12851315

1286-
// Migrate plaintext secrets that may have been written by an
1287-
// older Inspector build or hand-edited into mcp.json. Idempotent:
1288-
// on a clean file with no plaintext secrets it's a no-op. The
1289-
// rewrite happens inside the same write lock as POST/PUT/DELETE
1290-
// so concurrent mutations can't race with our cleanup.
1291-
const { migrated, changed } = await migratePlaintextSecrets(onDisk);
1292-
if (changed) {
1293-
await withWriteLock(async () => {
1316+
// Slow path: either the file is missing (first-launch seed) or it
1317+
// contains plaintext we need to migrate. Both branches write the
1318+
// file, so the entire read + decide + write sequence must happen
1319+
// inside the write lock — otherwise a concurrent POST/PUT/DELETE
1320+
// could land between our unlocked read and our locked write, and
1321+
// we'd clobber it. Re-read once we hold the lock so the decision
1322+
// is based on the same snapshot we're about to mutate.
1323+
const settled = await withWriteLock(async () => {
1324+
const rawInside = await readStoreFile(mcpConfigPath);
1325+
if (rawInside === null) {
1326+
// Still absent after taking the lock → no concurrent POST
1327+
// beat us to it; seed the file ourselves. The seed has no
1328+
// secrets so nothing else to do.
1329+
await writeMcpAndTrackMtime(serializeStore(DEFAULT_SEED_CONFIG));
1330+
return DEFAULT_SEED_CONFIG;
1331+
}
1332+
const parsedInside = parseStore(rawInside) as
1333+
| { mcpServers?: unknown }
1334+
| null;
1335+
const inside: MCPConfig = {
1336+
mcpServers: normalizeMcpServers(parsedInside?.mcpServers),
1337+
};
1338+
// A concurrent POST may have run between the unlocked peek and
1339+
// the lock acquisition — if the file now has no plaintext to
1340+
// migrate, leave it alone and return the latest snapshot.
1341+
if (!hasPlaintextSecrets(inside)) return inside;
1342+
const { migrated, changed } = await migratePlaintextSecrets(inside);
1343+
if (changed) {
12941344
await writeMcpAndTrackMtime(serializeStore(migrated));
1295-
});
1296-
}
1297-
1298-
const rehydrated = await rehydrateConfig(migrated);
1299-
return c.json(rehydrated);
1345+
}
1346+
return migrated;
1347+
});
1348+
return c.json(await rehydrateConfig(settled));
13001349
} catch (error) {
13011350
const keychainResp = keychainErrorResponse(c, error);
13021351
if (keychainResp) return keychainResp;
@@ -1346,15 +1395,19 @@ export function createRemoteApp(
13461395
}
13471396
const built = buildStoredEntry(id, body.config, postSettings);
13481397
// Split secret values out of the new entry — the stripped shape
1349-
// goes to disk, the values go to the keychain. We sweep any
1350-
// leftover keychain entries for this id first to handle the
1351-
// edge case where a previous DELETE failed midway and left
1352-
// orphans under the same id the user is now reusing.
1398+
// goes to disk, the values go to the keychain.
13531399
const { stripped, secrets } = extractSecretsFromStored(built);
1400+
// Order: sweep → keychain → disk. The keychain write is the
1401+
// only step that can hard-fail (KeychainUnavailableError on
1402+
// `set`); doing it before the disk write means a 503 leaves no
1403+
// disk entry behind, so a retry POST isn't trapped at 409. The
1404+
// initial sweep handles the case where a previous DELETE failed
1405+
// midway and left orphans under the same id the user is now
1406+
// reusing; it's a silent no-op when the keychain is unavailable.
13541407
await secretStore.deleteAllForServer(id);
1408+
await writeKeychainEntriesFor(id, secrets);
13551409
current.mcpServers[id] = stripped;
13561410
await writeMcpAndTrackMtime(serializeStore(current));
1357-
await writeKeychainEntriesFor(id, secrets);
13581411
return c.json({ ok: true });
13591412
});
13601413
} catch (error) {
@@ -1486,17 +1539,35 @@ export function createRemoteApp(
14861539
next.mcpServers[key] = val;
14871540
}
14881541
}
1489-
await writeMcpAndTrackMtime(serializeStore(next));
1490-
// Keychain reconcile. On rename, every field under the old id
1491-
// is obsolete by definition — sweep then write under the new
1492-
// id. Otherwise diff against the previous field set so we only
1493-
// touch entries that actually need to change.
1542+
// Ordering: write the new keychain entries first, then the
1543+
// disk file, then clean up obsolete keychain entries.
1544+
//
1545+
// - Keychain set is the only hard-fail step (it raises 503 on
1546+
// `KeychainUnavailableError`). Doing it first means a failed
1547+
// set leaves both disk and keychain in their pre-PUT state;
1548+
// the user retries and nothing is half-applied.
1549+
// - The disk write happens after the keychain is fully primed,
1550+
// so a successful disk write is also a fully-consistent end
1551+
// state.
1552+
// - Obsolete deletion comes last because it's destructive: if
1553+
// we deleted first and then the disk write failed, the user
1554+
// would still see the old config on disk but with missing
1555+
// keychain values. With the current order a failed disk
1556+
// write leaves orphan keychain entries — recoverable on the
1557+
// next reconcile or `deleteAllForServer` sweep.
14941558
if (newId !== originalId) {
1495-
await secretStore.deleteAllForServer(originalId);
14961559
await writeKeychainEntriesFor(newId, secrets);
1560+
await writeMcpAndTrackMtime(serializeStore(next));
1561+
await secretStore.deleteAllForServer(originalId);
14971562
} else {
1563+
// In-place update: same id, possibly different fields. Set
1564+
// the new values first, then write disk, then drop obsolete
1565+
// fields (env keys the user removed, OAuth secret cleared).
14981566
const previousFields = new Set(expectedSecretFields(existing));
1499-
await reconcileKeychainEntry(newId, previousFields, secrets);
1567+
const obsolete = computeObsoleteFields(previousFields, secrets);
1568+
await writeKeychainEntriesFor(newId, secrets);
1569+
await writeMcpAndTrackMtime(serializeStore(next));
1570+
await deleteKeychainFields(newId, obsolete);
15001571
}
15011572
return c.json({ ok: true });
15021573
});

specification/v2_servers_file.md

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -245,7 +245,9 @@ Each server entry may carry these Inspector-extension fields at the top level:
245245
- **First-connect contract**: settings apply on the *first* outbound request after the entry loads from disk — no need to open the settings form. The browser sends `settings` to the backend in the `/api/mcp/connect` body; the backend reads it from `RemoteConnectRequest` and threads it into `createTransportNode`.
246246
- **Secret storage (#1356)**: `oauth.clientSecret` and stdio `env` values are persisted in the OS keychain (macOS Keychain Services / Windows Credential Manager / Linux libsecret via `@napi-rs/keyring`), keyed by `(serverId, field)` under the service name `mcp-inspector`. Field names: `oauth-client-secret`, `env:<KEY>` (one per stdio env variable). The on-disk `mcp.json` is stripped of these values — `oauth.clientSecret` is omitted entirely, stdio env keys are preserved with empty-string placeholders (`"env": { "API_KEY": "" }`) so the file still documents the env interface the server expects. The wire shape returned by `GET /api/servers` is unchanged from before #1356: the handler rehydrates values from the keychain so browser code sees the same JSON it has always seen. The keychain interactions live in `core/auth/node/secret-store.ts` behind a `SecretStore` interface; `KeyringSecretStore` is the production impl and `InMemorySecretStore` is the test double the integration suite injects via `RemoteServerOptions.secretStore`.
247247
- **Migration**: on every `GET /api/servers`, the handler walks the freshly-read config and, for any entry that still carries plaintext secrets (older Inspector builds, hand-edited files, files imported from another tool), lifts each value into the keychain and rewrites the file with the stripped shape. The migration is idempotent — when the keychain already holds a value for `(serverId, field)`, the keychain wins and the disk plaintext is dropped unread. After the rewrite the disk file no longer contains the secret material.
248-
- **Linux without libsecret**: `@napi-rs/keyring` returns a hard error on any operation. The handlers translate that to a `503` response with a message pointing at libsecret/gnome-keyring install. macOS and Windows always have a working keychain so this only realistically fires on minimal Linux installs.
248+
- **Linux without libsecret**: `KeyringSecretStore` is *tolerant* — only the `set` operation throws `KeychainUnavailableError` (translated to a `503` by the handlers); `get` returns `null` and the destructive operations silently no-op. The result is that no-secret flows (creating a stdio server with no env values, deleting an entry, reading the list, the defensive sweep on POST) all work normally on a minimal Linux box without libsecret. Only the moments where a secret would actually be lost — saving an OAuth client secret, saving a stdio env value, or migrating a plaintext value into the keychain — surface a clear error. macOS and Windows always have a working keychain so this only matters on minimal Linux installs.
249+
- **Migration tolerance**: when migration encounters `KeychainUnavailableError`, the GET handler logs a warning, leaves the on-disk plaintext untouched, and serves the (still-plaintext) response. Subsequent reads retry — installing libsecret later lifts the secrets on the next GET without any user action.
250+
- **Write ordering on POST/PUT**: keychain writes happen before the disk write, and obsolete-field deletions happen after. The intent is that a `set` failure (the only hard-fail path) leaves both stores in their pre-write state — no half-applied entry on disk that would trap a retry POST at `409`, and no premature deletion of an obsolete field whose disk write later fails.
249251
- **Out of scope for this PR**: the OAuth handshake itself still runs in the browser via the MCP SDK, so during the token exchange the secret transits the wire (browser → MCP SDK → OAuth provider's token endpoint). The on-disk win this PR delivers is that the secret is no longer in the shareable / symlinked `mcp.json` and is no longer the source-of-truth on the filesystem. Moving the token exchange to the Node side is tracked separately.
250252
- **Hard-cutover legacy behavior (per #1358 decision 4)**: files written by the one pre-#1358 build of v2/main have a nested `settings` block. `normalizeMcpServers` drops the node on read and logs a one-line warn including the server id; the persisted headers / metadata / timeouts / OAuth credentials are intentionally lost on first read. Users re-enter them via the settings form (or hand-edit the file into the flat shape). v2 has not shipped a stable release with the nested shape, so the blast radius is the small set of v2/main dogfooders who edited per-server settings between #1353 merging and this change.
251253
- **UI**: `ServerSettingsModal` is opened from the server card's settings affordance. Saving routes through `useServers.updateServerSettings(id, settings)` which issues a settings-only `PUT /api/servers/:id` with `{ id, settings }` — the route preserves the on-disk transport config inside its write lock. Conversely, `useServers.updateServer` (driven by the basic-config modal) issues a config-only PUT with `{ id, config }` and the route preserves the on-disk settings fields. Edits in either modal cannot silently wipe the other half.

0 commit comments

Comments
 (0)