Cache writes are not atomic (embeddings + structured outputs) → an interrupted/concurrent write poisons future reads

## Summary

Both file-based caches write entries **in place** with no temp-file + atomic rename, and the read paths don't defend against partial/truncated entries. If a writer is interrupted (crash, OOM, kill) or a reader in another process observes a file mid-write, the entry is left partial and **every subsequent read of that key raises** instead of recomputing — the bad entry poisons the cache until it is deleted by hand. This matters for parallel Optuna workers and the HTTP/MCP server, which can hit the same cache concurrently.

## Where (on `dev`)

**Embeddings** — `src/autointent/_wrappers/embedder/sentence_transformers.py`:
- write: `np.save(embeddings_path, ...)` straight to the final path (lines ~228–231)
- read: `if embeddings_path.exists(): np.load(embeddings_path)` with no `try/except` (lines ~183–185)

**Structured outputs** (`src/autointent/_dump_tools/unit_dumpers.py` and `src/autointent/generation/_cache.py`):
- write: `PydanticModelDumper.dump` in `unit_dumpers.py` (lines 158–165) does `mkdir`, then writes `class_info.json` and `model_dump.json` as two separate, non-atomic steps
- read: `StructuredOutputCache._load_from_disk` in `_cache.py` only catches `ValidationError` / `ImportError`, so a missing `model_dump.json` raises an uncaught `FileNotFoundError`

## Reproduce (no network)

**Embeddings** — a truncated `.npy` makes the next `embed()` raise `ValueError` (not a miss), permanently:

```python
# after one successful embed(utts), truncate the cached .npy to half its bytes:
raw = embeddings_path.read_bytes()
embeddings_path.write_bytes(raw[: len(raw) // 2])
embedder.embed(utts)   # -> raises ValueError; never recomputes
```

**Structured** — an entry directory missing `model_dump.json` (interrupted between the two writes) makes `get()` raise `FileNotFoundError`:

```python
cache.set(msgs, Out, params, Out(label="x"))
(entry_dir / "model_dump.json").unlink()   # simulate a crash between the two file writes
StructuredOutputCache(use_cache=True).get(msgs, Out, params)   # -> raises FileNotFoundError
```

Both were reproduced in a benchmark: each cache raises on the next read and does not auto-recover.

## Suggested fix

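1. **Atomic writes:** write to a temp path in the same directory as the target, then `os.replace()` it into place (the rename is atomic on POSIX when both paths are on the same filesystem). Apply this to both `np.save` and the structured dump; for the latter, write both JSON files into a temp directory, then rename the directory into place.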
2. **Self-healing reads:** wrap `np.load` / `PydanticModelDumper.load` so a corrupt/partial entry is deleted and treated as a miss (recompute) instead of raising. For the structured cache, also catch `OSError` / `FileNotFoundError`, not just `ValidationError` / `ImportError`.
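
A minimal sketch of the atomic-write half of this fix, using only the standard library. The helper names `atomic_write` and `atomic_write_dir` are hypothetical (they are not part of AutoIntent), and the "first writer wins" handling for an already-published directory is an assumption about the desired cache semantics, not something the current code does.

```python
import contextlib
import os
import shutil
import tempfile
from pathlib import Path
from typing import BinaryIO, Callable, Dict


def atomic_write(final_path: Path, write_fn: Callable[[BinaryIO], None]) -> None:
    """Hypothetical helper: write to a temp file next to final_path, then rename."""
    final_path.parent.mkdir(parents=True, exist_ok=True)
    # Same directory as the target, so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=final_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            write_fn(f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before publishing
        os.replace(tmp_name, final_path)  # readers see the old or the new file
    except BaseException:
        # Interrupted write: drop the temp file, leave any old entry untouched.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise


def atomic_write_dir(final_dir: Path, files: Dict[str, bytes]) -> None:
    """Hypothetical helper: build a multi-file entry in a temp dir, then rename it."""
    final_dir.parent.mkdir(parents=True, exist_ok=True)
    tmp_dir = Path(tempfile.mkdtemp(dir=final_dir.parent, prefix=final_dir.name + "."))
    try:
        for name, payload in files.items():
            (tmp_dir / name).write_bytes(payload)
        os.replace(tmp_dir, final_dir)
    except OSError:
        shutil.rmtree(tmp_dir, ignore_errors=True)
        # Assumption: if a complete entry already exists, another writer won
        # the race and its entry is kept; any other failure is re-raised.
        if not final_dir.is_dir():
            raise
    except BaseException:
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise
```

Because `np.save` accepts an open file object, the embeddings write could plausibly become `atomic_write(embeddings_path, lambda f: np.save(f, embeddings))`. The structured dump would pass the serialized `class_info.json` and `model_dump.json` payloads to `atomic_write_dir`, so a reader never sees a directory containing only one of the two files.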

Related: #326 (directory-aware deletion is needed for the cleanup path).
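
The self-healing read, including the directory-aware cleanup mentioned above, could look roughly like the following. This is a sketch under assumptions: `load_or_miss` is a hypothetical name, `None` stands in for a cache miss, and the default exception tuple is illustrative; the real call sites would pass whatever `np.load` and `PydanticModelDumper.load` actually raise.

```python
import shutil
from pathlib import Path
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def load_or_miss(
    entry: Path,
    loader: Callable[[Path], T],
    errors: Tuple[Type[BaseException], ...] = (OSError, ValueError, EOFError),
) -> Optional[T]:
    """Hypothetical helper: return the cached value, or None on a miss.

    A corrupt or partial entry is deleted and reported as a miss, so the
    caller recomputes instead of raising on every later read of that key.
    """
    if not entry.exists():
        return None
    try:
        return loader(entry)
    except errors:
        # Directory-aware cleanup: structured entries are directories,
        # embedding entries are single files.
        if entry.is_dir():
            shutil.rmtree(entry, ignore_errors=True)
        else:
            entry.unlink(missing_ok=True)
        return None
```

One design caveat: catching `OSError` broadly also covers transient failures such as permission errors, so a stricter variant might delete only on `FileNotFoundError` and on parse errors, and re-raise everything else.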

## Severity

Medium-High under concurrency (parallel trials / long-running server).

## How it was found

Found in the robustness scenario of a benchmark of AutoIntent 0.3.1's caches.


Cache writes are not atomic (embeddings + structured outputs) → an interrupted/concurrent write poisons future reads #335
