Summary
Both file-based caches write entries in place with no temp-file + atomic rename, and the read paths don't defend against partial/truncated entries. If a writer is interrupted (crash, OOM, kill) or a reader in another process observes a file mid-write, the entry is left partial and every subsequent read of that key raises instead of recomputing — the bad entry poisons the cache until it is deleted by hand. This matters for parallel Optuna workers and the HTTP/MCP server, which can hit the same cache concurrently.
Where (on dev)
Embeddings — src/autointent/_wrappers/embedder/sentence_transformers.py:
- write: np.save(embeddings_path, ...) straight to the final path (lines ~228–231)
- read: if embeddings_path.exists(): np.load(embeddings_path) with no try/except (lines ~183–185)
Structured outputs — src/autointent/_dump_tools/unit_dumpers.py, PydanticModelDumper.dump (lines 158–165) does mkdir then writes class_info.json and model_dump.json as two separate, non-atomic steps. The read path StructuredOutputCache._load_from_disk (src/autointent/generation/_cache.py) only catches ValidationError / ImportError, so a missing model_dump.json raises an uncaught FileNotFoundError.
Reproduce (no network)
Embeddings — a truncated .npy makes the next embed() raise ValueError (not a miss), permanently:
# after one successful embed(utts), truncate the cached .npy to half its bytes:
raw = embeddings_path.read_bytes()
embeddings_path.write_bytes(raw[: len(raw) // 2])
embedder.embed(utts) # -> raises ValueError; never recomputes
Structured — an entry directory missing model_dump.json (interrupted between the two writes) makes get() raise FileNotFoundError:
cache.set(msgs, Out, params, Out(label="x"))
(entry_dir / "model_dump.json").unlink() # simulate a crash between the two file writes
StructuredOutputCache(use_cache=True).get(msgs, Out, params) # -> raises FileNotFoundError
Both were reproduced in a benchmark: each cache raises on the next read and does not auto-recover.
Suggested fix
- Atomic writes: write to a temp path in the same directory, then os.replace() it into place (the rename is atomic on POSIX when source and destination are on the same filesystem). Do this for both np.save and the structured dump (write into a temp directory, then rename it into place).
- Self-healing reads: wrap np.load / PydanticModelDumper.load so a corrupt/partial entry is deleted and treated as a miss (recompute) instead of raising. For the structured cache, also catch OSError (which covers FileNotFoundError), not just ValidationError / ImportError.
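A minimal sketch of both points for the embeddings cache, to make the proposal concrete. The helper names atomic_save_npy and load_or_recompute are hypothetical, not existing AutoIntent functions, and the exception tuple is an assumption about what np.load raises on damaged files (ValueError is what the repro above observed).

```python
import contextlib
import os
import tempfile
from pathlib import Path
from typing import Callable

import numpy as np


def atomic_save_npy(path: Path, array: np.ndarray) -> None:
    """Hypothetical helper: publish `array` at `path` without exposing a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Stage next to the target: os.replace is only atomic within one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            np.save(f, array)  # a file object, so numpy does not append ".npy"
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, path)
    except BaseException:
        # Drop the staging file if anything failed before the rename.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise


def load_or_recompute(path: Path, compute: Callable[[], np.ndarray]) -> np.ndarray:
    """Hypothetical helper: treat a corrupt or partial entry as a cache miss."""
    if path.exists():
        try:
            return np.load(path)
        except (ValueError, EOFError, OSError):
            # Assumed failure modes for truncated/empty/unreadable files:
            # delete the bad entry and fall through to recompute.
            path.unlink(missing_ok=True)
    array = compute()
    atomic_save_npy(path, array)
    return array
```

With this shape, the truncation repro above would end in a recompute and a fresh, valid .npy rather than a permanent ValueError.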
Related: #326 (directory-aware deletion is needed for the cleanup path).
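For the structured cache, where an entry is a directory, the same idea could look roughly like the sketch below. It is standard library only and does not use PydanticModelDumper: write_entry, read_entry and remove_entry are hypothetical names, and plain dicts stand in for the real payloads. Only the two file names come from this issue.

```python
import json
import os
import shutil
import tempfile
from pathlib import Path
from typing import Dict, Optional

REQUIRED_FILES = ("class_info.json", "model_dump.json")


def remove_entry(path: Path) -> None:
    """Directory-aware deletion: works for a file, a directory, or nothing at all."""
    if path.is_dir() and not path.is_symlink():
        shutil.rmtree(path, ignore_errors=True)
    else:
        path.unlink(missing_ok=True)


def write_entry(entry_dir: Path, files: Dict[str, dict]) -> None:
    """Write every file into a staging directory, then rename it into place."""
    entry_dir.parent.mkdir(parents=True, exist_ok=True)
    staging = Path(tempfile.mkdtemp(dir=entry_dir.parent, prefix=entry_dir.name + ".tmp-"))
    try:
        for name, payload in files.items():
            (staging / name).write_text(json.dumps(payload), encoding="utf-8")
        # A directory cannot be renamed over a non-empty one, so drop any stale entry first.
        remove_entry(entry_dir)
        os.replace(staging, entry_dir)
    except BaseException:
        shutil.rmtree(staging, ignore_errors=True)
        raise


def read_entry(entry_dir: Path) -> Optional[Dict[str, dict]]:
    """Return the parsed entry, or None (a miss) after deleting a broken one."""
    try:
        return {
            name: json.loads((entry_dir / name).read_text(encoding="utf-8"))
            for name in REQUIRED_FILES
        }
    except (OSError, ValueError):
        # OSError covers FileNotFoundError; ValueError covers json.JSONDecodeError.
        remove_entry(entry_dir)
        return None
```

One caveat with this sketch: between remove_entry and os.replace a reader sees no entry, which is just a miss, but a second writer that publishes in that window makes os.replace fail. Here that error is re-raised after the staging directory is cleaned up; a real implementation might instead keep the other writer's entry.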
Severity
Medium-High under concurrency (parallel trials / long-running server).
How it was found
Robustness scenario of a benchmark of AutoIntent 0.3.1's caches.