
Cache writes are not atomic (embeddings + structured outputs) → an interrupted/concurrent write poisons future reads #335

Description

@voorhs

Summary

Both file-based caches write entries in place with no temp-file + atomic rename, and the read paths don't defend against partial/truncated entries. If a writer is interrupted (crash, OOM, kill) or a reader in another process observes a file mid-write, the entry is left partial and every subsequent read of that key raises instead of recomputing — the bad entry poisons the cache until it is deleted by hand. This matters for parallel Optuna workers and the HTTP/MCP server, which can hit the same cache concurrently.

Where (on dev)

Embeddings: src/autointent/_wrappers/embedder/sentence_transformers.py:

  • write: np.save(embeddings_path, ...) straight to the final path (lines ~228–231)
  • read: if embeddings_path.exists(): np.load(embeddings_path) with no try/except (lines ~183–185)

Structured outputs: in src/autointent/_dump_tools/unit_dumpers.py, PydanticModelDumper.dump (lines 158–165) calls mkdir and then writes class_info.json and model_dump.json as two separate, non-atomic steps. The read path, StructuredOutputCache._load_from_disk (src/autointent/generation/_cache.py), only catches ValidationError / ImportError, so a missing model_dump.json raises an uncaught FileNotFoundError.

Reproduce (no network)

Embeddings — a truncated .npy makes the next embed() raise ValueError (not a miss), permanently:

# after one successful embed(utts), truncate the cached .npy to half its bytes:
raw = embeddings_path.read_bytes()
embeddings_path.write_bytes(raw[: len(raw) // 2])
embedder.embed(utts)   # -> raises ValueError; never recomputes

Structured — an entry directory missing model_dump.json (interrupted between the two writes) makes get() raise FileNotFoundError:

cache.set(msgs, Out, params, Out(label="x"))
(entry_dir / "model_dump.json").unlink()   # simulate a crash between the two file writes
StructuredOutputCache(use_cache=True).get(msgs, Out, params)   # -> raises FileNotFoundError

Both were reproduced in a benchmark: each cache raises on the next read and does not auto-recover.

Suggested fix

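  1. Atomic writes: write to a temp path in the same directory as the target, then os.replace() it onto the final path (atomic on POSIX, provided both paths are on the same filesystem). Do this for both np.save and the structured dump (write into a temp directory, then rename it into place; note that renaming a directory fails if the destination already exists and is non-empty).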
  2. Self-healing reads: wrap np.load / PydanticModelDumper.load so a corrupt/partial entry is deleted and treated as a miss (recompute) instead of raising. For the structured cache, also catch OSError (which covers FileNotFoundError), not just ValidationError / ImportError.
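A minimal sketch of both points for a single-file entry such as the cached .npy, using only the standard library. The helper names atomic_write and load_or_miss are hypothetical and not part of AutoIntent; the default tuple of "corrupt" exceptions is an assumption and should be narrowed to whatever the real loader raises.

```python
from __future__ import annotations

import contextlib
import os
import tempfile
from pathlib import Path
from typing import BinaryIO, Callable, TypeVar

T = TypeVar("T")


def atomic_write(path: Path, write_fn: Callable[[BinaryIO], object]) -> None:
    """Hypothetical helper: publish a file at `path` only once it is fully written."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives next to the target so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as fh:
            write_fn(fh)
            fh.flush()
            os.fsync(fh.fileno())  # push the bytes to disk before publishing
        os.replace(tmp_name, path)  # readers see the old file or the new one
    except BaseException:
        # Interrupted or failed write: drop the temp file, leave `path` untouched.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise


def load_or_miss(
    path: Path,
    loader: Callable[[Path], T],
    corrupt_errors: tuple[type[BaseException], ...] = (ValueError, EOFError, OSError),
) -> T | None:
    """Hypothetical helper: return the loaded entry, or None for a cache miss."""
    try:
        return loader(path)
    except FileNotFoundError:
        return None  # ordinary miss, nothing to clean up
    except corrupt_errors:
        # Corrupt or partial entry: delete it so the caller recomputes.
        path.unlink(missing_ok=True)
        return None
```

For the embeddings cache this would presumably be called as atomic_write(embeddings_path, lambda fh: np.save(fh, embeddings)) and load_or_miss(embeddings_path, np.load). Passing an open file object to np.save also avoids NumPy appending ".npy" to the temp name. A reader that deletes a bad entry can race with a writer that has just replaced it; in that case the only cost is one extra recompute.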

Related: #326 (directory-aware deletion is needed for the cleanup path).
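For the directory-shaped structured entries, the same idea can be sketched as "build the whole entry in a temp directory, rename it into place, and remove the whole directory on a bad read". This is an illustration only: plain JSON stands in for PydanticModelDumper, and atomic_dump_entry / load_entry_or_none are hypothetical names. Only the two file names come from the description above.

```python
from __future__ import annotations

import json
import os
import shutil
import tempfile
from pathlib import Path

# File names taken from the issue text; everything else here is illustrative.
REQUIRED_FILES = ("class_info.json", "model_dump.json")


def atomic_dump_entry(entry_dir: Path, files: dict[str, dict]) -> None:
    """Hypothetical helper: publish an entry directory in a single rename."""
    entry_dir.parent.mkdir(parents=True, exist_ok=True)
    tmp_dir = Path(
        tempfile.mkdtemp(dir=entry_dir.parent, prefix=entry_dir.name + ".tmp-")
    )
    try:
        for name, payload in files.items():
            (tmp_dir / name).write_text(json.dumps(payload), encoding="utf-8")
        try:
            os.replace(tmp_dir, entry_dir)
        except OSError:
            # A non-empty destination already exists, e.g. a concurrent writer
            # published first. Keep the existing entry; if it is a stale partial
            # one, the read path below removes it.
            if not entry_dir.is_dir():
                raise
    finally:
        # No-op after a successful rename; otherwise discards the temp directory.
        shutil.rmtree(tmp_dir, ignore_errors=True)


def load_entry_or_none(entry_dir: Path) -> dict[str, dict] | None:
    """Hypothetical helper: return the entry, or None for a miss or bad entry."""
    try:
        return {
            name: json.loads((entry_dir / name).read_text(encoding="utf-8"))
            for name in REQUIRED_FILES
        }
    except (OSError, ValueError):
        # Missing file, unreadable file, or invalid JSON: remove the whole
        # directory (directory-aware deletion) and report a miss.
        shutil.rmtree(entry_dir, ignore_errors=True)
        return None
```

The real cache would additionally keep catching ValidationError / ImportError in the read path. Skipping the write when the destination already exists keeps concurrent writers from clobbering each other, at the cost of relying on the read path to clear stale partial entries.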

Severity

Medium-High under concurrency (parallel trials / long-running server).

How it was found

Found in the robustness scenario of a benchmark of AutoIntent 0.3.1's caches.
