Skip to content

Commit 1bed9ea

Browse files
authored
fix(cache): atomic writes to prevent torn-read deletion of valid cache entries (#1056)
## Summary Fixes the failing `main` build caused by a ~10% flaky unit test (`response-cache.test.ts` → `--fresh round-trip: stale entry is replaced by fresh response`) that exposed a **real production bug** in the HTTP response cache. ## Root cause In `src/lib/response-cache.ts`, each cache write fires `cleanupCache()` **fire-and-forget** at 10% probability (`CLEANUP_PROBABILITY = 0.1`). A second write to the **same key** overwrote the file with a **non-atomic** `writeFile`. The first write's async cleanup could then read the file **mid-overwrite** (torn read), fail to `JSON.parse` the truncated content, and **delete it** as "corrupted" in `collectEntryMetadata` — silently losing a valid cache entry. **Proof:** A 300-iteration repro failed ~10% of the time (matching `CLEANUP_PROBABILITY`), and on every miss the cache directory was **empty** — the valid entry was deleted, not expired. This is a genuine correctness bug beyond tests: any two rapid writes to the same cache key (paginated fetches, concurrent CLI invocations sharing the cache dir) can race so cleanup's torn read deletes a valid entry. ## Why PR checks didn't catch it The flake is ~10%, so any single CI "Unit Tests" run passes ~90% of the time. PR #1051's Unit Tests check landed in the lucky 90%; the post-merge `main` run hit the unlucky 10%. The bug is **pre-existing** — the cleanup code predates #1051, which never touched `response-cache.ts`. ## Fix 1. **Atomic writes** — new `atomicWriteCacheFile()` writes to a unique temp file (`<key>.<pid>.<uuid>.tmp`) then `rename`s it into place. `rename` is atomic on POSIX (same filesystem) and near-atomic on Windows (same volume), so a concurrent reader sees a complete old or new file, never a torn one. 2. **Cleanup hardening** — `collectEntryMetadata()` now separates failure modes: - **transient read failure** (locking, AV scanner, ENOENT from a concurrent sweep) → **skip**, never delete. - **fully read but unparseable** → genuine corruption (atomic writes preclude torn reads), so the file is marked expired and reclaimed by `deleteExpiredEntries`. This keeps corrupt files visible to `MAX_CACHE_ENTRIES` eviction instead of leaking past the count. 3. **Temp-file sweep** — `cleanupCache()` removes orphaned `.tmp` files older than 60s, so a crash between `writeFile` and `rename` can't leak temp files. 4. **Diagnostics** — previously-silent catch blocks now `log.debug()` the suppressed error (per AGENTS.md no-silent-catch rule). ## Verification - Amplified repro: **300/300 pass** (was ~10% failing before). - Actual test file: **10/10 runs green** (36 tests, +2 new regression tests). - `pnpm run lint`: clean. `pnpm exec tsc --noEmit`: exit 0. - Full unit suite: 325 files, all passing (one unrelated pre-existing property flake noted below). ## Out of scope (follow-up) During full-suite runs I observed an unrelated, pre-existing flaky **property** test: `auto-paginate.property.test.ts:94` (counterexample `[393,392,98]` — an `autoPaginate` `nextCursor` trim bug when `total > limit`). It passes in isolation and is independent of this change. Tracked as a follow-up.
1 parent 41f8aa2 commit 1bed9ea

2 files changed

Lines changed: 151 additions & 8 deletions

File tree

src/lib/response-cache.ts

Lines changed: 101 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -14,12 +14,14 @@
1414
* @module
1515
*/
1616

17-
import { createHash } from "node:crypto";
17+
import { createHash, randomUUID } from "node:crypto";
1818
import {
1919
mkdir,
2020
readdir,
2121
readFile,
22+
rename,
2223
rm,
24+
stat,
2325
unlink,
2426
writeFile,
2527
} from "node:fs/promises";
@@ -30,8 +32,12 @@ import pLimit from "p-limit";
3032
import { getIdentityFingerprint } from "./db/auth.js";
3133
import { getConfigDir } from "./db/index.js";
3234
import { getEnv } from "./env.js";
35+
import { logger } from "./logger.js";
3336
import { recordCacheHit, withCacheSpan } from "./telemetry.js";
3437

38+
/** Tagged logger for diagnostic visibility into best-effort cache operations. */
39+
const log = logger.withTag("response-cache");
40+
3541
// ---------------------------------------------------------------------------
3642
// TTL tiers — used as fallback when the server sends no cache headers
3743
// ---------------------------------------------------------------------------
@@ -582,6 +588,38 @@ export async function storeCachedResponse(
582588
}
583589
}
584590

591+
/**
592+
* Atomically write a cache file by writing to a unique temp file in the same
593+
* directory and renaming it into place.
594+
*
595+
* A plain `writeFile` is not atomic: a concurrent reader (e.g. the probabilistic
596+
* {@link cleanupCache} sweep fired fire-and-forget by an earlier write) can read
597+
* the file mid-write, fail to `JSON.parse` the truncated content, and delete it
598+
* as "corrupted" — silently losing a valid cache entry. `rename` into place is
599+
* atomic on POSIX (same filesystem) and near-atomic on Windows (same volume), so
600+
* a concurrent reader sees either the complete old file or the complete new file
601+
* rather than a half-written one.
602+
*
603+
* Best-effort cleanup of the temp file on failure; the caller treats write
604+
* failures as non-fatal.
605+
*/
606+
async function atomicWriteCacheFile(
607+
finalPath: string,
608+
serialized: string
609+
): Promise<void> {
610+
const tmpPath = `${finalPath}.${process.pid}.${randomUUID()}.tmp`;
611+
try {
612+
await writeFile(tmpPath, serialized, "utf-8");
613+
await rename(tmpPath, finalPath);
614+
} catch (error) {
615+
await unlink(tmpPath).catch((cleanupError) => {
616+
// The temp file may never have been created (e.g. writeFile failed).
617+
log.debug("Failed to clean up cache temp file", cleanupError);
618+
});
619+
throw error;
620+
}
621+
}
622+
585623
/** Inputs for {@link writeResponseToCache}, bundled to stay under useMaxParams. */
586624
type WriteRequest = {
587625
key: string;
@@ -634,7 +672,7 @@ async function writeResponseToCache(req: WriteRequest): Promise<number> {
634672

635673
const serialized = JSON.stringify(entry);
636674
await mkdir(getCacheDir(), { recursive: true, mode: 0o700 });
637-
await writeFile(cacheFilePath(key), serialized, "utf-8");
675+
await atomicWriteCacheFile(cacheFilePath(key), serialized);
638676

639677
// Probabilistic cleanup to avoid unbounded cache growth
640678
if (Math.random() < CLEANUP_PROBABILITY) {
@@ -736,6 +774,14 @@ async function cleanupCache(): Promise<void> {
736774
throw error;
737775
}
738776

777+
// Sweep orphaned temp files left by a crash between writeFile and rename in
778+
// {@link atomicWriteCacheFile}. Done regardless of whether any .json files
779+
// exist so leaked temp files can never accumulate unbounded.
780+
const tmpFiles = files.filter((f) => f.endsWith(".tmp"));
781+
if (tmpFiles.length > 0) {
782+
await deleteStaleTempFiles(cacheDir, tmpFiles);
783+
}
784+
739785
const jsonFiles = files.filter((f) => f.endsWith(".json"));
740786
if (jsonFiles.length === 0) {
741787
return;
@@ -750,6 +796,36 @@ async function cleanupCache(): Promise<void> {
750796
]);
751797
}
752798

799+
/**
800+
* Age (ms) after which an orphaned `.tmp` file is considered abandoned.
801+
*
802+
* A live {@link atomicWriteCacheFile} writes then renames in well under a
803+
* second, so any `.tmp` file older than this was left by a crashed process and
804+
* is safe to remove. The generous threshold avoids racing a concurrent write.
805+
*/
806+
const STALE_TEMP_FILE_MS = 60_000;
807+
808+
/** Delete `.tmp` files older than {@link STALE_TEMP_FILE_MS}. Best-effort. */
809+
async function deleteStaleTempFiles(
810+
cacheDir: string,
811+
tmpFiles: string[]
812+
): Promise<void> {
813+
const cutoff = Date.now() - STALE_TEMP_FILE_MS;
814+
await cacheIO.map(tmpFiles, async (file) => {
815+
const filePath = join(cacheDir, file);
816+
try {
817+
const stats = await stat(filePath);
818+
if (stats.mtimeMs < cutoff) {
819+
await unlink(filePath).catch(() => {
820+
// Already gone — another sweep or the owning process removed it.
821+
});
822+
}
823+
} catch (error) {
824+
log.debug("Failed to inspect cache temp file during sweep", error);
825+
}
826+
});
827+
}
828+
753829
/** Metadata for a cache entry, used for cleanup decisions */
754830
type EntryMetadata = { file: string; createdAt: number; expired: boolean };
755831

@@ -769,20 +845,37 @@ async function collectEntryMetadata(
769845

770846
await cacheIO.map(jsonFiles, async (file) => {
771847
const filePath = join(cacheDir, file);
848+
849+
// Read and parse are handled separately so we can distinguish a transient
850+
// read failure (skip — never delete) from genuine corruption (parse failed
851+
// on a fully-read file — safe to delete). Atomic writes
852+
// ({@link atomicWriteCacheFile}) guarantee readers never see a half-written
853+
// file, so a parse failure here means real corruption, not a torn read —
854+
// deleting was previously a data-loss bug when writes were non-atomic.
855+
let raw: string;
856+
try {
857+
raw = await readFile(filePath, "utf-8");
858+
} catch (error) {
859+
// Transient read failure (locking, AV scanner, ENOENT from a concurrent
860+
// sweep). Skip — a later sweep will reconsider the file.
861+
log.debug("Skipping cache file with unreadable contents", error);
862+
return;
863+
}
864+
772865
try {
773-
const raw = await readFile(filePath, "utf-8");
774866
const entry = JSON.parse(raw) as CacheEntry;
775867
const expired =
776868
entry.expiresAt !== undefined
777869
? now >= entry.expiresAt
778870
: now - entry.createdAt >
779871
FALLBACK_TTL_MS[classifyUrl(entry.url ?? "")];
780872
entries.push({ file, createdAt: entry.createdAt, expired });
781-
} catch {
782-
// Unparseable file — delete it
783-
unlink(filePath).catch(() => {
784-
// Best-effort cleanup of corrupted file
785-
});
873+
} catch (error) {
874+
// Fully read but unparseable — genuine corruption. Mark expired so
875+
// {@link deleteExpiredEntries} reclaims it; this also keeps eviction
876+
// counts accurate (corrupt files are not invisible to MAX_CACHE_ENTRIES).
877+
log.debug("Reclaiming corrupt cache file during cleanup", error);
878+
entries.push({ file, createdAt: now, expired: true });
786879
}
787880
});
788881

test/lib/response-cache.test.ts

Lines changed: 50 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -470,6 +470,56 @@ describe("file structure", () => {
470470
expect(files.length).toBe(1);
471471
expect(files[0]).toMatch(/^[0-9a-f]{64}\.json$/);
472472
});
473+
474+
test("atomic write leaves no temp file behind on success", async () => {
475+
await storeCachedResponse(
476+
TEST_METHOD,
477+
TEST_URL,
478+
{},
479+
mockResponse(TEST_BODY)
480+
);
481+
482+
const cacheDir = join(getConfigDir(), "cache", "responses");
483+
const files = await readdir(cacheDir);
484+
// Exactly the final .json file — the temp file was renamed into place.
485+
expect(files.every((f) => f.endsWith(".json"))).toBe(true);
486+
expect(files.some((f) => f.endsWith(".tmp"))).toBe(false);
487+
});
488+
});
489+
490+
// ---------------------------------------------------------------------------
491+
// Atomic write / torn-read regression (getsentry/cli#1056)
492+
//
493+
// A write fires cleanupCache() fire-and-forget at 10% probability. Before
494+
// atomic writes, a second write to the same key could be read mid-overwrite by
495+
// that cleanup sweep, fail to JSON.parse, and be deleted as "corrupted" —
496+
// losing a valid entry. This loop exercises the exact store→overwrite→read
497+
// sequence many times so the probabilistic cleanup is virtually guaranteed to
498+
// fire; every iteration must still serve the fresh value.
499+
// ---------------------------------------------------------------------------
500+
501+
describe("atomic write regression", () => {
502+
test("repeated overwrite-then-read never loses the entry", async () => {
503+
for (let i = 0; i < 50; i++) {
504+
await storeCachedResponse(
505+
TEST_METHOD,
506+
TEST_URL,
507+
{},
508+
mockResponse({ data: `stale-${i}` })
509+
);
510+
const freshBody = { data: `fresh-${i}` };
511+
await storeCachedResponse(
512+
TEST_METHOD,
513+
TEST_URL,
514+
{},
515+
mockResponse(freshBody)
516+
);
517+
518+
const cached = await getCachedResponse(TEST_METHOD, TEST_URL, {});
519+
expect(cached).toBeDefined();
520+
expect(await cached!.json()).toEqual(freshBody);
521+
}
522+
});
473523
});
474524

475525
// ---------------------------------------------------------------------------

0 commit comments

Comments
 (0)