Skip to content

fix(config): atomic JSON writes to prevent half-read races during startup#965

Open
dongsheng123132 wants to merge 1 commit into
ValueCell-ai:mainfrom
dongsheng123132:upstream-pr/atomic-write-config
Open

fix(config): atomic JSON writes to prevent half-read races during startup#965
dongsheng123132 wants to merge 1 commit into
ValueCell-ai:mainfrom
dongsheng123132:upstream-pr/atomic-write-config

Conversation

@dongsheng123132
Copy link
Copy Markdown

Summary

writeJsonFile (in electron/utils/openclaw-auth.ts, used for openclaw.json, auth-profiles.json, models.json) calls fs.writeFile directly. Several startup paths run concurrently and read these files at the same time: sanitizeOpenClawConfig, listAccounts / listLegacyProviders, channel.runtime, capability-monitor, etc. When a writer interleaves with a reader\u0027s readFile + JSON.parse, the reader can observe a half-written file and crash with ```Unexpected end of JSON input```.

On a Windows portable install running off a USB drive (slower flush latency), this is reproducible on every gateway restart that happens to coincide with a sanitize or readback. Customer log signature:

```
Failed to read OpenClaw config: SyntaxError: Unexpected end of JSON input
at JSON.parse ...
```

Fix

Introduce writeFileAtomic in a new helper electron/utils/atomic-write.ts:

  1. Write to a unique tmp file `.tmp..` so concurrent writers never collide.
  2. Rename tmp into place. Same-disk rename is atomic on POSIX and NTFS — readers always see the complete old or new file, never a partial one.
  3. On Windows the rename can fail with EPERM/EBUSY/EACCES if the target file is held by another process (especially common on USB drives). Retry up to 5x with linear backoff, then fall back to a direct in-place write so we never lose data — the reader\u0027s existing try/catch absorbs the half-read window in that fallback case (same behavior as before this helper existed).

Routed every writeJsonFile call through the helper, and also routed the one direct writeFile call (modelsPath cleanup) through it so the same atomicity guarantee covers per-agent model registry writes too.

Verification

  • `pnpm typecheck` clean.
  • `pnpm exec vitest run tests/unit/atomic-write.test.ts` — 5/5 pass.
  • 5 unit tests cover: basic write, EPERM retry, fallback after 5 retries, concurrent writer non-collision via unique tmp paths, and correct encoding pass-through.
  • End-to-end on Windows portable USB build for several days: the "Unexpected end of JSON input" crash signature has not reappeared in customer logs.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

Pre-Submission Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly the new helper which explains the Windows EPERM/USB context inline
  • I have added tests that prove my fix is effective (5 unit tests, see tests/unit/atomic-write.test.ts)
  • My changes generate no new warnings (typecheck and vitest pass)

Related

Companion to #964 (gateway port-deadlock) — both bugs surface mostly under Windows portable + USB workloads but apply across platforms. Discovered while running our downstream portable fork (dongsheng123132/clawx-portable@experimental).

`writeJsonFile` (used for openclaw.json, auth-profiles.json, models.json)
calls `fs.writeFile` directly. Several startup paths run concurrently
and read these files at the same time:

  - sanitizeOpenClawConfig (this module)
  - listAccounts / listLegacyProviders (provider registry)
  - channel routes (channel.runtime)
  - capability-monitor (gateway)

When a writer interleaves with a reader's `readFile + JSON.parse`, the
reader can observe a half-written file and crash with
"Unexpected end of JSON input". On a Windows portable install running
off a USB drive (slower flush latency), this is reproducible on every
gateway restart that happens to coincide with a sanitize or readback.

Fix: introduce `writeFileAtomic(filePath, data, encoding?)` in
`electron/utils/atomic-write.ts`:

  1. Write to a unique tmp file `<file>.tmp.<pid>.<counter>` to avoid
     concurrent writers stomping each other.
  2. Rename tmp → real path. Same-disk rename is atomic on POSIX and
     NTFS — readers always see the complete old or new file.
  3. On Windows the rename can fail with EPERM/EBUSY/EACCES if the
     target file is held by another process (especially common on USB
     drives). Retry up to 5× with linear backoff, then fall back to a
     direct in-place write so we never lose data.

Routed all `writeJsonFile` calls through the helper. Also routed the
one direct `writeFile` (modelsPath cleanup at line 914) through it so
the same atomicity guarantee covers per-agent model registry writes.

Tests: 5 unit tests cover the basic write, EPERM retry, fallback after
5 retries, concurrent writer non-collision via unique tmp paths, and
correct encoding pass-through.

`pnpm typecheck` clean. `pnpm exec vitest run tests/unit/atomic-write.test.ts`
5/5 pass.
Copy link
Copy Markdown
Contributor

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9f28dbef04

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +39 to +43
await writeFile(tmpPath, data, encoding);

for (let attempt = 0; attempt < 5; attempt++) {
try {
await rename(tmpPath, filePath);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve existing file mode when replacing config files

This atomic-write implementation creates a brand new temp file with default permissions and then renames it over the target, which resets the target's mode bits instead of preserving them. If openclaw.json/auth-profiles.json were previously locked down (for example 0600 from external tooling), a subsequent write here can silently widen access (commonly to 0644), exposing API keys/tokens to other local users. Please carry over the existing file mode (e.g., stat target and apply mode to temp before rename, or chmod after rename) when the destination already exists.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant