Skip to content

Commit e903003

Browse files
NagyViktNagyViktclaude
authored
feat(durability): atomic writes + registry lock for Theme N1 (#29)
- Add src/infra/fs/atomic-write.ts: temp-file write, fsync(fd), chmod before rename, rename, fsync(dir). Dir fsync gated on non-Windows per POSIX. - Add src/infra/fs/registry-lock.ts: O_EXCL lock with PID-liveness probe and 30 s wall-clock stale-lock heuristic. In-process callers block instead of reaping each other. - Make secureWriteFile a thin wrapper over atomicWriteFile so every snapshot, current, and sessions.json write fsyncs. - Add persistRegistryAtomic: single registry write path that reloads under the lock and unions account entries before writing. - Route AccountService.persistRegistry and useAccount's direct saveRegistry call through the locked atomic path; no public method signature changes. - Add src/tests/registry-durability.test.ts covering the rename-fails-mid- write window, in-process lock serialization, and stale-lock reaping. - Release notes in releases/v0.1.25.md include a Durability section. Refs: docs/future/17-ROADMAP.md Theme N1, docs/future/01-ARCHITECTURE.md §5. Co-authored-by: NagyVikt <nagy.viktordp@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 59247dd commit e903003

7 files changed

Lines changed: 560 additions & 35 deletions

File tree

releases/v0.1.25.md

Lines changed: 66 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,66 @@
1+
# authmux v0.1.25
2+
3+
## Added
4+
- `src/infra/fs/atomic-write.ts` — single durable file-write primitive with
5+
fsync-before-rename and POSIX dir-fsync after rename, plus mode applied
6+
before rename so 0600 files are never visible at a looser mode.
7+
- `src/infra/fs/registry-lock.ts` — advisory lock around `registry.json`
8+
with PID-liveness probing plus a 30 s wall-clock stale-lock heuristic.
9+
- `persistRegistryAtomic` in `src/lib/accounts/registry.ts` is now the single
10+
registry write path; it reloads under the lock and merges accounts so two
11+
concurrent writers (daemon + interactive) no longer race-lose each other's
12+
mutations.
13+
- `src/tests/registry-durability.test.ts` covers the SIGKILL-between-
14+
writeFile-and-rename window and the stale-lock-reaping path.
15+
16+
## Changed
17+
- `secureWriteFile` now delegates to `atomicWriteFile`; behavior is the same
18+
(atomic temp+rename, 0600 perms) but every snapshot, `current`, and
19+
`sessions.json` write now also fsyncs the file and the containing directory
20+
before returning.
21+
- `AccountService.useAccount` no longer calls `saveRegistry` directly; all
22+
registry writes go through the locked `persistRegistry` path.
23+
24+
## Fixed
25+
- Half-written `registry.json` after SIGKILL or laptop sleep no longer
26+
shadows the on-disk file: the rename is the commit point.
27+
- Concurrent `authmux daemon --watch` + `authmux use foo` mutations are
28+
serialized through the registry lock.
29+
30+
## Deprecated
31+
- None.
32+
33+
## Removed
34+
- None.
35+
36+
## Security
37+
- File mode 0600 is applied to the temp file BEFORE the rename, closing the
38+
brief window where the previous chmod-after-rename path could expose
39+
freshly-renamed files at the umask default.
40+
41+
## Durability
42+
43+
This release closes the two data-loss windows tracked under Theme N1 of
44+
`docs/future/17-ROADMAP.md`:
45+
46+
1. **Partial-write corruption.** `atomicWriteFile` writes to a temp file,
47+
`fsync`s the file fd, applies the final mode, renames over the target,
48+
and (on POSIX) `fsync`s the containing directory. A power loss between
49+
any of these steps leaves the previous content of the target intact.
50+
2. **Concurrent-writer lost mutations.** `withRegistryLock` serializes
51+
writers; `persistRegistryAtomic` reloads the registry under the lock and
52+
merges accounts before writing so neither writer silently discards the
53+
other.
54+
55+
Crash-safety guarantees promised after this release match the table in
56+
`docs/future/01-ARCHITECTURE.md` §5.3 rows 1, 2, and 4.
57+
58+
`fsync` adds latency on spinning disks; on SSDs it is dominated by the
59+
network/parse work the CLI was already doing and is not user-visible. The
60+
dir-fsync is gated on `process.platform !== "win32"` because Windows does
61+
not allow opening a directory as a file.
62+
63+
## Migration
64+
65+
No migration is required. The on-disk layout of `registry.json`, `current`,
66+
`sessions.json`, and the per-account snapshot files is unchanged.

src/infra/fs/atomic-write.ts

Lines changed: 120 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,120 @@
1+
import fsp from "node:fs/promises";
2+
import path from "node:path";
3+
import crypto from "node:crypto";
4+
5+
/**
6+
* Default mode for atomically-written files. Callers that want world-readable
7+
* content can override via the `mode` option (e.g. 0o644). For secrets we use
8+
* 0o600 and apply it BEFORE rename so the final inode is never visible at a
9+
* looser mode.
10+
*/
11+
export const DEFAULT_FILE_MODE = 0o600;
12+
13+
const IS_WINDOWS = process.platform === "win32";
14+
15+
export interface AtomicWriteOptions {
16+
/** Final file mode applied to the temp file before the rename. Default 0o600. */
17+
mode?: number;
18+
/**
19+
* Encoding for string payloads. Default "utf8". Buffer payloads ignore this.
20+
*/
21+
encoding?: BufferEncoding;
22+
/**
23+
* Skip fsync on the directory after the rename. Default false. Set true only
24+
* if the caller has already fsynced the dir (e.g. batch writes).
25+
*/
26+
skipDirSync?: boolean;
27+
}
28+
29+
/**
30+
* Write `data` to `target` atomically.
31+
*
32+
* Order: mkdir -> open temp -> writeFile -> fsync(file) -> chmod -> close ->
33+
* rename -> fsync(dir).
34+
*
35+
* fsync on the directory is the POSIX requirement to make a rename durable;
36+
* we skip it on Windows where opening a directory as a file is not allowed.
37+
*
38+
* This is the single durable write path for authmux. `secureWriteFile`
39+
* delegates here; callers should prefer this helper directly.
40+
*/
41+
export async function atomicWriteFile(
42+
target: string,
43+
data: string | Buffer,
44+
options: AtomicWriteOptions = {},
45+
): Promise<void> {
46+
const mode = options.mode ?? DEFAULT_FILE_MODE;
47+
const encoding = options.encoding ?? "utf8";
48+
const dir = path.dirname(target);
49+
const base = path.basename(target);
50+
51+
await fsp.mkdir(dir, { recursive: true });
52+
53+
const suffix = `${process.pid}.${crypto.randomBytes(4).toString("hex")}.tmp`;
54+
const tmp = path.join(dir, `.${base}.${suffix}`);
55+
56+
const handle = await fsp.open(tmp, "w", mode);
57+
let renamed = false;
58+
try {
59+
if (typeof data === "string") {
60+
await handle.writeFile(data, encoding);
61+
} else {
62+
await handle.writeFile(data);
63+
}
64+
// fsync the file fd so the bytes survive a power loss.
65+
try {
66+
await handle.sync();
67+
} catch {
68+
// Some filesystems (tmpfs, network mounts) reject fsync; not fatal.
69+
}
70+
} finally {
71+
await handle.close();
72+
}
73+
74+
// Apply final perms BEFORE rename so the final inode is never visible at a
75+
// looser mode (chmod-after-rename has a tiny window where the file is at the
76+
// umask default).
77+
if (!IS_WINDOWS) {
78+
try {
79+
await fsp.chmod(tmp, mode);
80+
} catch {
81+
// best-effort
82+
}
83+
}
84+
85+
try {
86+
await fsp.rename(tmp, target);
87+
renamed = true;
88+
} finally {
89+
if (!renamed) {
90+
// Clean up the temp file if the rename failed; do not mask the original
91+
// error.
92+
try {
93+
await fsp.unlink(tmp);
94+
} catch {
95+
// ignore
96+
}
97+
}
98+
}
99+
100+
// fsync the containing directory so the rename itself is durable on
101+
// ext4/xfs etc. POSIX-only; Windows does not let us open a directory as a
102+
// file. tmpfs / network filesystems may also reject this — non-fatal.
103+
if (!IS_WINDOWS && !options.skipDirSync) {
104+
let dirHandle: fsp.FileHandle | undefined;
105+
try {
106+
dirHandle = await fsp.open(dir, "r");
107+
await dirHandle.sync();
108+
} catch {
109+
// best-effort durability hint; not all FS / OS allow dir fsync.
110+
} finally {
111+
if (dirHandle) {
112+
try {
113+
await dirHandle.close();
114+
} catch {
115+
// ignore
116+
}
117+
}
118+
}
119+
}
120+
}

src/infra/fs/registry-lock.ts

Lines changed: 189 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,189 @@
1+
import fs from "node:fs";
2+
import fsp from "node:fs/promises";
3+
import path from "node:path";
4+
5+
import { resolveRegistryPath } from "../../lib/config/paths";
6+
7+
/**
8+
* Advisory lock for the accounts registry.
9+
*
10+
* Cooperating writers (interactive commands + daemon) call
11+
* `withRegistryLock(fn)` to serialize their reload-merge-write sequence. The
12+
* lock file is a sibling of `registry.json` named `registry.json.lock` and
13+
* contains the locking process's PID plus an ISO timestamp.
14+
*
15+
* Stale-lock reaping: if `O_EXCL` create fails, we read the file. If the
16+
* recorded PID is no longer alive, or the timestamp is older than
17+
* `STALE_LOCK_MS`, we steal the lock by overwriting it with our own PID.
18+
*
19+
* The lock is best-effort within a single host; it does NOT protect against
20+
* NFS / shared filesystems. Authmux state is per-user / local, so that is
21+
* acceptable.
22+
*/
23+
24+
export const LOCK_ACQUIRE_TIMEOUT_MS = 5000;
25+
export const STALE_LOCK_MS = 30_000;
26+
const INITIAL_BACKOFF_MS = 25;
27+
const MAX_BACKOFF_MS = 250;
28+
29+
export interface LockOptions {
30+
/** Override the default path used for the lock file (testing only). */
31+
lockPath?: string;
32+
/** Maximum wall-clock time to wait for the lock. Default 5000 ms. */
33+
timeoutMs?: number;
34+
/** Threshold past which a lock with a live-looking PID is still reaped. */
35+
staleMs?: number;
36+
}
37+
38+
interface LockPayload {
39+
pid: number;
40+
at: string;
41+
}
42+
43+
function delay(ms: number): Promise<void> {
44+
return new Promise((resolve) => setTimeout(resolve, ms));
45+
}
46+
47+
function isAlive(pid: number): boolean {
48+
if (!Number.isFinite(pid) || pid <= 0) return false;
49+
try {
50+
// Signal 0 is the standard POSIX liveness probe; on Windows Node maps it
51+
// to a process-exists check too.
52+
process.kill(pid, 0);
53+
return true;
54+
} catch (err) {
55+
const code = (err as NodeJS.ErrnoException).code;
56+
// EPERM means the process exists but we can't signal it (different user).
57+
return code === "EPERM";
58+
}
59+
}
60+
61+
function parsePayload(raw: string): LockPayload | null {
62+
try {
63+
const parsed = JSON.parse(raw) as { pid?: unknown; at?: unknown };
64+
const pid = typeof parsed.pid === "number" ? parsed.pid : NaN;
65+
const at = typeof parsed.at === "string" ? parsed.at : "";
66+
if (!Number.isFinite(pid)) return null;
67+
return { pid, at };
68+
} catch {
69+
return null;
70+
}
71+
}
72+
73+
async function tryCreateExclusive(lockPath: string, payload: string): Promise<boolean> {
74+
try {
75+
// wx = write + O_EXCL + O_CREAT — fails atomically if the file exists.
76+
// We deliberately do NOT go through atomicWriteFile here: an atomic
77+
// temp+rename would defeat the O_EXCL semantics that make the lock work.
78+
await fsp.writeFile(lockPath, payload, { flag: "wx", mode: 0o600 });
79+
return true;
80+
} catch (err) {
81+
const code = (err as NodeJS.ErrnoException).code;
82+
if (code === "EEXIST") return false;
83+
throw err;
84+
}
85+
}
86+
87+
async function readLock(lockPath: string): Promise<{ payload: LockPayload | null; mtimeMs: number } | null> {
88+
try {
89+
const [raw, stat] = await Promise.all([
90+
fsp.readFile(lockPath, "utf8"),
91+
fsp.stat(lockPath),
92+
]);
93+
return { payload: parsePayload(raw), mtimeMs: stat.mtimeMs };
94+
} catch (err) {
95+
const code = (err as NodeJS.ErrnoException).code;
96+
if (code === "ENOENT") return null;
97+
throw err;
98+
}
99+
}
100+
101+
function shouldReap(lock: { payload: LockPayload | null; mtimeMs: number }, staleMs: number): boolean {
102+
if (!lock.payload) return true; // unparseable -> reap
103+
// NOTE: do NOT reap on `pid === process.pid` — within a single Node process
104+
// multiple async callers legitimately share a PID, and they must block each
105+
// other rather than stomp the lock. Cross-process stale ownership is what
106+
// the liveness probe and wall-clock heuristic below cover.
107+
if (!isAlive(lock.payload.pid)) return true;
108+
const ageMs = Date.now() - lock.mtimeMs;
109+
return ageMs > staleMs;
110+
}
111+
112+
async function acquire(lockPath: string, timeoutMs: number, staleMs: number): Promise<void> {
113+
const deadline = Date.now() + timeoutMs;
114+
const payload = JSON.stringify({ pid: process.pid, at: new Date().toISOString() });
115+
let backoff = INITIAL_BACKOFF_MS;
116+
117+
await fsp.mkdir(path.dirname(lockPath), { recursive: true });
118+
119+
// Loop until we either acquire or time out.
120+
for (;;) {
121+
if (await tryCreateExclusive(lockPath, payload)) return;
122+
123+
const current = await readLock(lockPath);
124+
if (current && shouldReap(current, staleMs)) {
125+
// Steal the lock. Use an unlink-then-create dance; if another process
126+
// beat us to either step, we just loop again.
127+
try {
128+
await fsp.unlink(lockPath);
129+
} catch (err) {
130+
const code = (err as NodeJS.ErrnoException).code;
131+
if (code !== "ENOENT") {
132+
// Could not remove; fall through to retry.
133+
}
134+
}
135+
if (await tryCreateExclusive(lockPath, payload)) return;
136+
}
137+
138+
if (Date.now() >= deadline) {
139+
throw new Error(
140+
`registry lock at ${lockPath} held by pid=${current?.payload?.pid ?? "unknown"} ` +
141+
`(timed out after ${timeoutMs} ms)`,
142+
);
143+
}
144+
145+
await delay(backoff);
146+
backoff = Math.min(backoff * 2, MAX_BACKOFF_MS);
147+
}
148+
}
149+
150+
function release(lockPath: string): void {
151+
try {
152+
const raw = fs.readFileSync(lockPath, "utf8");
153+
const parsed = parsePayload(raw);
154+
if (parsed && parsed.pid !== process.pid) {
155+
// Not ours anymore (we were preempted by a stale-reap). Leave it alone.
156+
return;
157+
}
158+
fs.unlinkSync(lockPath);
159+
} catch {
160+
// best-effort; the next acquirer will reap a stale file if we crashed.
161+
}
162+
}
163+
164+
/**
165+
* Run `fn` under the registry lock. Reload-merge-write logic lives in the
166+
* callback so two writers cannot race-lose mutations.
167+
*/
168+
export async function withRegistryLock<T>(
169+
fn: () => Promise<T>,
170+
options: LockOptions = {},
171+
): Promise<T> {
172+
const lockPath = options.lockPath ?? `${resolveRegistryPath()}.lock`;
173+
const timeoutMs = options.timeoutMs ?? LOCK_ACQUIRE_TIMEOUT_MS;
174+
const staleMs = options.staleMs ?? STALE_LOCK_MS;
175+
176+
await acquire(lockPath, timeoutMs, staleMs);
177+
178+
// Best-effort: drop the lock on abnormal process exit. We attach once; the
179+
// listeners are idempotent (release() no-ops if the lock isn't ours).
180+
const cleanup = () => release(lockPath);
181+
process.once("exit", cleanup);
182+
183+
try {
184+
return await fn();
185+
} finally {
186+
process.removeListener("exit", cleanup);
187+
release(lockPath);
188+
}
189+
}

0 commit comments

Comments
 (0)