20 changes: 20 additions & 0 deletions CHANGELOG.md
@@ -10,6 +10,26 @@ and this project adheres to

## [Unreleased]

### Added

- **Polish prompt-cache hit-rate telemetry.** Each polish run now
tracks Anthropic prompt-cache token usage and logs a one-line
summary at the end of `attune-author regenerate`:
`Polish cache hit: 87% (1241 read / 1421 total tokens, 6 call(s))`.
A `WARNING` is appended when the run's hit rate falls below 50%,
surfacing silent cache regressions (prompt edits, model alias
drift). The hit rate is `read / (read + creation)`, computed over
cacheable input tokens.
- `attune_author.doc_gen._anthropic.call_anthropic` gains an optional
`on_cache_usage(creation, read, model)` callback; backward
compatible (the doc-gen path passes nothing).
- New in `attune_author.polish`: `PolishCacheStats`,
`polish_cache_stats()`, `format_polish_cache_summary()`,
`reset_polish_cache_telemetry()`. Telemetry follows the existing
in-process faithfulness-counter pattern (no new on-disk format).
- README: new "Cache hit rate" subsection under Polish cache.
- 16 new tests in `tests/unit/test_polish_cache_metrics.py`.

## [0.14.2] - 2026-05-27

### Fixed
27 changes: 27 additions & 0 deletions README.md
@@ -257,6 +257,33 @@ volatile frontmatter fields like `generated_at` stripped),
context, and model name. Changing the model automatically invalidates
all prior entries.

### Cache hit rate

Separately from the on-disk response cache above, each polish call
uses Anthropic's **prompt cache** for the ~6000-token system prompt.
After a regen run, `attune-author` logs a one-line summary at INFO:

```
Polish cache hit: 87% (1241 read / 1421 total tokens, 6 call(s))
```

The hit rate is `read / (read + creation)`: the fraction of cacheable
input tokens read back from the cache rather than written to it. Prompt caching
cuts input cost ~90% on the cached portion, so a healthy multi-template
run should settle well above 50% once the first call warms the cache.

- **High (>80%)** — expected steady state; the system prompt is being
reused across calls.
- **Low (<50%)** — triggers a `WARNING` in the summary. Usually means
the cache boundary broke: the system prompt changed between calls,
the model alias drifted, or only a single template was polished (no
reuse). Check recent edits to `polish_prompts.py` or `_POLISH_MODEL`.
- **"no cacheable tokens observed"** — the prompt fell below Anthropic's
caching threshold or caching is disabled (`POLISH_CACHE_SYSTEM`).

The metric is per-run (in-process); it is not persisted across
invocations.
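
As a worked example of the arithmetic above, the sketch below recomputes the sample summary and maps a rate onto the bands listed. The function name `classify_hit_rate` and the label used for the 50-80% range are assumptions made for illustration; neither is part of `attune-author`.

```python
def classify_hit_rate(read: int, creation: int) -> tuple[float, str]:
    """Return (hit_rate, band) for one run's cacheable input tokens."""
    total = read + creation
    if total == 0:
        # Corresponds to the "no cacheable tokens observed" case.
        return 0.0, "none"
    rate = read / total
    if rate > 0.80:
        band = "high"
    elif rate < 0.50:
        band = "low"  # the range that triggers a WARNING
    else:
        band = "unlabelled"  # assumption: no band is named for 50-80%
    return rate, band


# Sample summary: 1241 read out of 1421 total, so 180 creation tokens.
rate, band = classify_hit_rate(read=1241, creation=180)
print(f"{rate:.0%} -> {band}")
```

With the sample numbers this prints `87% -> high`, matching the summary line shown earlier.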

## Python API

```python
@@ -1,6 +1,12 @@
# Decisions — Polish prompt-cache hit-rate telemetry

**Status:** Draft (2026-05-11) — gated on briefing-followup batch
**Status:** ✅ DONE (2026-06-06) — shipped to [Unreleased]. The Draft
"gated on briefing-followup batch" note was superseded by this file's
own "Execution gate" ("Not blocking"). One deviation: attune-author has
no telemetry JSONL, so the metric uses the existing in-process
faithfulness-counter pattern (INFO summary at end of run) rather than a
new JSONL file; the threshold warning is current-run, not cross-run.
See `tasks.md` for the per-phase record.
**Owner:** Patrick

---
72 changes: 72 additions & 0 deletions docs/specs/archive/polish-cache-hit-metrics/tasks.md
@@ -0,0 +1,72 @@
# Tasks — Polish prompt-cache hit-rate telemetry

**Status:** ✅ DONE (2026-06-06) — shipped to [Unreleased]. See the
"Deviation" note under Phases 3–4: attune-author has no JSONL
telemetry, so the metric follows the existing in-process
faithfulness-counter pattern (reset at run start, INFO summary at run
end) instead of a new JSONL subsystem. Acceptance criteria in
`decisions.md` are all met.

## Phase 1 — Read the cache fields

- [x] **1.1** Captured via a new `on_cache_usage(creation, read, model)`
callback on `doc_gen._anthropic.call_anthropic` (polish can't see
`response.usage` directly — `call_anthropic` returns only text).
`_log_cache_usage` now returns `(creation, read)`.
- [x] **1.2** Compute hit rate: `read / max(read + creation, 1)`
(`PolishCacheStats.hit_rate`)
- [x] **1.3** `PolishCacheStats` dataclass added in `polish.py`
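
The callback shape in 1.1 and the formula in 1.2 can be sketched as follows. This is a minimal illustration, not the shipped `PolishCacheStats`: the class `CacheStatsSketch`, its `record` method, and the stand-in `fake_call` are hypothetical. Only the `(creation, read, model)` callback signature and the `read / max(read + creation, 1)` formula come from the notes above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CacheStatsSketch:
    """Hypothetical accumulator for one run's cache token counts."""

    calls: int = 0
    creation_tokens: int = 0
    read_tokens: int = 0

    def record(self, creation: int, read: int, model: str) -> None:
        """Callback compatible with on_cache_usage(creation, read, model)."""
        self.calls += 1
        self.creation_tokens += creation
        self.read_tokens += read

    @property
    def hit_rate(self) -> float:
        # max(..., 1) avoids ZeroDivisionError when nothing was cacheable.
        return self.read_tokens / max(self.read_tokens + self.creation_tokens, 1)


def fake_call(
    creation: int,
    read: int,
    on_cache_usage: Optional[Callable[[int, int, str], None]] = None,
) -> str:
    """Hypothetical stand-in for an API call that reports cache usage."""
    if on_cache_usage is not None:
        on_cache_usage(creation, read, "example-model")
    return "text"


stats = CacheStatsSketch()
fake_call(180, 0, on_cache_usage=stats.record)   # first call writes the cache
fake_call(0, 180, on_cache_usage=stats.record)   # second call reads it back
print(stats.calls, stats.hit_rate)
```

Passing a bound method as the callback keeps the accumulator out of the calling module, which is the separation of concerns the callback was introduced for.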

## Phase 2 — Surface to user

- [x] **2.1** End-of-run summary logged at INFO via
`format_polish_cache_summary()`:
`Polish cache hit: 87% (1241 read / 1421 total tokens, 6 call(s))`
- [x] **2.2** Graceful when both are zero:
`Polish cache: no cacheable tokens observed (cache not configured?)`
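
A formatter producing the two lines quoted in 2.1 and 2.2 could look roughly like this. `format_summary_sketch` is a hypothetical name, and rounding via the `.0%` format spec is an assumption; the body of the real `format_polish_cache_summary()` is not shown in this diff.

```python
def format_summary_sketch(calls: int, creation: int, read: int) -> str:
    """Build a one-line end-of-run summary from aggregate token counts."""
    total = read + creation
    if total == 0:
        # Graceful path: nothing cacheable was reported by any call.
        return "Polish cache: no cacheable tokens observed (cache not configured?)"
    rate = read / total
    return (
        f"Polish cache hit: {rate:.0%} "
        f"({read} read / {total} total tokens, {calls} call(s))"
    )


print(format_summary_sketch(calls=6, creation=180, read=1241))
print(format_summary_sketch(calls=2, creation=0, read=0))
```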

## Phase 3 — Log to telemetry *(deviation, see note)*

- [x] **3.1** ~~Append per-call to existing telemetry JSONL~~ →
**There is no telemetry JSONL in attune-author.** Adopted the
existing in-process counter idiom (`_polish_cache_telemetry()` +
`reset_polish_cache_telemetry()`, mirroring
`generator._faithfulness_telemetry`), surfaced via the INFO
end-of-run summary in `maintenance.py`. Building a JSONL
subsystem would contradict the spec's "low effort, single file"
scope and the codebase's telemetry pattern.
- [x] **3.2** Aggregate fields: calls, creation_tokens, read_tokens,
derived hit_rate, model (model accepted by the callback; per-model
breakdown explicitly out of scope per decisions.md).
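
The in-process counter idiom described in 3.1 might look roughly like this. All names here (`_STATE`, `record_usage`, `reset_telemetry`, `snapshot`) are hypothetical stand-ins; the real helpers are `_polish_cache_telemetry()` and `reset_polish_cache_telemetry()`, whose bodies are not shown in this diff.

```python
# Module-level state: lives only for the current process, nothing is written to disk.
_STATE: dict[str, int] = {"calls": 0, "creation_tokens": 0, "read_tokens": 0}


def reset_telemetry() -> None:
    """Zero the counters; intended to be called at the start of a run."""
    for key in _STATE:
        _STATE[key] = 0


def record_usage(creation: int, read: int) -> None:
    """Add one call's cache token counts to the running totals."""
    _STATE["calls"] += 1
    _STATE["creation_tokens"] += creation
    _STATE["read_tokens"] += read


def snapshot() -> dict[str, int]:
    """Return a copy so callers cannot mutate the live counters."""
    return dict(_STATE)


reset_telemetry()
record_usage(creation=180, read=0)
record_usage(creation=0, read=180)
first_run = snapshot()

reset_telemetry()  # a second run in the same process starts from zero
second_run = snapshot()
```

Resetting at run start is what keeps the metric per-run: without it, a long-lived process would blend counts from successive runs.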

## Phase 4 — Threshold warning *(deviation: current-run, not cross-run)*

- [x] **4.1–4.3** `format_polish_cache_summary()` appends a `WARNING`
when the **current run's** hit rate < 50% (`_CACHE_HIT_WARN_THRESHOLD`)
and ≥1 cacheable token was seen, with a pointer to the README.
Cross-run rolling history (last N records) is deferred — it would
require the persistent JSONL layer this spec deliberately avoided.

## Phase 5 — Test

- [x] **5.1** `tests/unit/test_polish_cache_metrics.py`: mocks Anthropic
responses with known cache_creation/cache_read values; asserts the
callback fires (incl. the zero case), hit-rate math, accumulator,
summary line, and threshold warning (16 tests).
- [ ] **5.2** Integration test (optional) — **skipped**: would require a
live API key (real prompt-cache hits can't be observed against a
mock). The unit tests cover the compute path; left optional as the
spec allowed.
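
For readers unfamiliar with the mocking approach in 5.1, here is a minimal sketch of the zero-case assertion. `call_and_report` is a hypothetical, heavily simplified stand-in for the function under test, and the fixture values are invented; the actual tests live in `tests/unit/test_polish_cache_metrics.py` and are not reproduced here.

```python
from types import SimpleNamespace
from unittest.mock import Mock


def call_and_report(client, model, on_cache_usage=None):
    """Hypothetical stand-in: one API call, then report cache usage."""
    response = client.messages.create(model=model, max_tokens=16, messages=[])
    creation = response.usage.cache_creation_input_tokens
    read = response.usage.cache_read_input_tokens
    if on_cache_usage is not None:
        on_cache_usage(creation, read, model)
    return response.content[0].text if response.content else ""


def test_callback_fires_even_when_both_counts_are_zero():
    # Fake client whose messages.create() returns a canned response.
    client = Mock()
    client.messages.create.return_value = SimpleNamespace(
        usage=SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=0),
        content=[SimpleNamespace(text="polished")],
    )
    callback = Mock()

    text = call_and_report(client, "example-model", on_cache_usage=callback)

    assert text == "polished"
    callback.assert_called_once_with(0, 0, "example-model")
```

Because no network call is made, a test like this only verifies the compute path, which is why 5.2 would still need a live API key.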

## Phase 6 — Docs

- [x] **6.1** README "Cache hit rate" subsection — meaning, healthy
ranges, what to do when it drops.
- [x] **6.2** CHANGELOG [Unreleased] entry added.

## Out of scope

- Per-stage cache breakdown (system / examples / messages)
- Cost-in-dollars tracking (token-level only)
- Cache strategy changes
- Cross-package telemetry aggregation
@@ -1,8 +1,27 @@
# Spec: Regen Pipeline — Design

> ## ⚠️ OBSOLETE — do not implement (reconciled 2026-06-06)
>
> This design was never built and conflicts with the shipped architecture. It
> assumes a single `corpus_root`, a React/JSX frontend (`App.jsx`,
> `CorpusSetup`), a polish+Haiku `_regen` pipeline, and WS-badge wiring — none
> of which exist. The shipped reality instead uses:
>
> - **Regen:** `sidecar/attune_gui/routes/living_docs.py` →
> `POST /api/living-docs/docs/{id}/regenerate` → Jobs registry
> (`attune_gui.jobs`) → `_regenerate_doc_executor` →
> `attune_author.generator.generate_feature_templates` + `load_manifest`.
> - **Corpus config:** multi-corpus registry (`attune_gui.editor_corpora`,
> `POST /api/corpus/register`) + workspace config (`attune_gui.workspace`,
> `living_docs.py` `get_config`/`set_config`).
> - **Frontend:** TypeScript (`editor-frontend/src/corpus-switcher.ts`), not React.
> - **Bulk:** `make regen-all` (Makefile), not `POST /api/templates/refresh-all`.
>
> Kept verbatim below for historical context only. See `requirements.md` banner.

## Phase 2: Design

**Status**: in-review
**Status**: obsolete — superseded by living-docs regen automation (was "in-review", never built)

---

@@ -5,9 +5,31 @@

---

> ## ⚠️ RECONCILED — satisfied-by-different-means (2026-06-06)
>
> This spec was previously marked "complete" with all tasks ✅, but a code
> audit found **none** of its named symbols ever shipped (`_regen`,
> `regen_template(corpus_root=…)`, `_resolve_corpus_root`, `atomic_write`,
> `_patch_summaries_json`) and the attune-gui pieces (`/api/config`,
> `refresh-all`, `CorpusSetup`) do not exist. The underlying need was instead
> met by a **more evolved architecture**. All three user stories are satisfied:
>
> | User story | Status | Actual implementation |
> |---|---|---|
> | US1 — badge click → regen → saved to disk | ✅ met | `POST /api/living-docs/docs/{id}/regenerate` → Jobs registry → `_regenerate_doc_executor` → `attune_author.generator.generate_feature_templates` (`sidecar/attune_gui/routes/living_docs.py`). Source-driven generation, not polish+Haiku. |
> | US2 — first-run corpus setup UI | ✅ exceeded | Multi-corpus registry: `editor_corpora.py`, `POST /api/corpus/register`, `editor-frontend/src/corpus-switcher.ts` (dropdown + "Add corpus…" modal). |
> | US3 — env auto-load on startup | ✅ met | Workspace config (`living_docs.py` `get_config`/`set_config`, `attune_gui.workspace`) + persisted corpus registry, replacing single `ATTUNE_CORPUS_ROOT`. |
>
> Bulk regen ships as the build-time `make regen-all` target (Makefile), not a
> runtime "Regen all stale" button. The frontend is **TypeScript**, not the
> React/JSX assumed by `design.md`.
>
> **No genuine product gaps remain.** This spec is retained for history; the
> `design.md` below is **obsolete** (see its banner). Do not implement it.

## Phase 1: Requirements

**Status**: approved
**Status**: reconciled — satisfied by living-docs regen automation + corpus registry (was falsely marked "approved/complete")

### Problem statement

@@ -2,9 +2,29 @@

## Phase 3: Tasks

**Status**: complete
**Status**: NOT done as written — reconciled 2026-06-06

> Shipped: `attune-author regenerate` CLI lives in `src/attune_author/cli.py:507` (handler) with the parser registered around line 154. Core logic in `maintenance.py` and `maintenance_batch.py`. CHANGELOG documents the batch variant.
> ## ⚠️ The task table below is INACCURATE
>
> A 2026-06-06 code audit found that **none** of the attune-author symbols in
> tasks 2–9 ever shipped (`_resolve_corpus_root`, `atomic_write`,
> `_patch_summaries_json`, `regen_template(corpus_root=…)`, `_regen`) and
> **none** of the attune-gui pieces in tasks 10–24 exist (`config.py`
> `ConfigState`, `/api/config`, `/api/templates/refresh-all`,
> `/api/browse/directory`, `CorpusSetup`, `App.jsx`). The "done" marks below are
> false. The earlier "Shipped" note conflated this spec with the unrelated
> hash-mismatch `attune-author regenerate` CLI — a different feature.
>
> **What actually satisfies the spec's user stories** (see `requirements.md`
> banner for the full mapping):
> - Single-doc regen → `POST /api/living-docs/docs/{id}/regenerate` (Jobs +
> `attune_author.generator.generate_feature_templates`).
> - Corpus config → corpus registry (`editor_corpora.py`,
> `/api/corpus/register`) + workspace config (`attune_gui.workspace`).
> - Bulk → `make regen-all` (Makefile).
>
> No code action is required: the product need is met. The table below is left
> intact only as a record of the original (unbuilt) plan.

### Implementation order

@@ -1,6 +1,6 @@
# Decisions — Regen / staleness hash mismatch

**Status:** root cause confirmed 2026-05-27 — original hypothesis (budget truncation of hash inputs) was wrong; actual cause is LLM-polished frontmatter laundering. Fix direction concrete. Implementation TBD.
**Status:** ✅ DONE — shipped in PR #48 (commit 1b1c7c5) / v0.14.2. Root cause was LLM-polished frontmatter laundering (not the original budget-truncation hypothesis). Fix: `apply_polish_results` re-injects deterministic frontmatter fields via `_replace_polished_frontmatter` (`generator.py:483`, `_DETERMINISTIC_FRONTMATTER_FIELDS`). Regression test: `tests/unit/test_polished_frontmatter_reinjection.py`. CHANGELOG documents it under [0.14.2]. Phase 3 release shipped; attune-gui can pin ≥0.14.2 to unblock its Phase 2.
**Owner:** Patrick
**Filed:** 2026-05-25 (handoff from attune-gui Phase 2 blockers; see [attune-gui docs/specs/living-docs-regen-automation/decisions.md](https://github.com/Smart-AI-Memory/attune-gui/blob/main/docs/specs/living-docs-regen-automation/decisions.md#phase-2-blockers-discovered-2026-05-23))

55 changes: 0 additions & 55 deletions docs/specs/polish-cache-hit-metrics/tasks.md

This file was deleted.

23 changes: 20 additions & 3 deletions src/attune_author/doc_gen/_anthropic.py
@@ -16,6 +16,8 @@
from typing import TYPE_CHECKING

if TYPE_CHECKING:
from collections.abc import Callable

from anthropic import Anthropic

logger = logging.getLogger(__name__)
@@ -102,6 +104,7 @@ def call_anthropic(
model: str,
max_tokens: int,
cache_system: bool = False,
on_cache_usage: Callable[[int, int, str], None] | None = None,
) -> str:
"""Make a single-turn ``messages.create`` call with retry/backoff.

@@ -125,6 +128,13 @@ def call_anthropic(
for sonnet/opus, 2048 for haiku); below that, the call
still works but no cache is used. Cache token usage is
emitted at INFO so callers can verify hits.
on_cache_usage: Optional callback invoked once per successful
call with ``(cache_creation_input_tokens,
cache_read_input_tokens, model)``. Lets a caller (e.g. the
polish pass) accumulate cache hit-rate telemetry without
this module owning that concern. Fired even when both
counts are zero so callers can distinguish "no cache
configured" from "never called".

Returns:
The first text block of the response, or the empty
@@ -164,7 +174,9 @@ def call_anthropic(
system=system_payload,
messages=[{"role": "user", "content": user_message}],
)
_log_cache_usage(response, model)
creation, read = _log_cache_usage(response, model)
if on_cache_usage is not None:
on_cache_usage(creation, read, model)
if response.content:
return response.content[0].text
return ""
@@ -182,16 +194,20 @@ def call_anthropic(
raise AnthropicCallError(_redact(str(last_exc))) from None


def _log_cache_usage(response: object, model: str) -> None:
def _log_cache_usage(response: object, model: str) -> tuple[int, int]:
"""Emit cache hit telemetry from an Anthropic response.

Reads ``cache_creation_input_tokens`` and ``cache_read_input_tokens``
from the response's usage object when present and logs them at INFO.
Older SDK responses without those fields are silently skipped.

Returns:
``(creation, read)`` token counts, defaulting to ``(0, 0)`` when
the response has no usage block or the SDK omits the fields.
"""
usage = getattr(response, "usage", None)
if usage is None:
return
return (0, 0)
creation = getattr(usage, "cache_creation_input_tokens", 0) or 0
read = getattr(usage, "cache_read_input_tokens", 0) or 0
if creation or read:
@@ -201,3 +217,4 @@ def _log_cache_usage(response: object, model: str) -> None:
creation,
read,
)
return (creation, read)
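
To illustrate the defaulting behaviour described in the docstring, here is a small standalone sketch that applies the same `getattr(..., 0) or 0` pattern to fake response objects. `read_cache_counts` and the `SimpleNamespace` fixtures are hypothetical; they stand in for `_log_cache_usage` and a real SDK response, and the logging step is omitted.

```python
from types import SimpleNamespace


def read_cache_counts(response: object) -> tuple[int, int]:
    """Return (creation, read), falling back to (0, 0) when data is missing."""
    usage = getattr(response, "usage", None)
    if usage is None:
        return (0, 0)
    # `or 0` also covers a field that is present but set to None.
    creation = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    return (creation, read)


warm = SimpleNamespace(
    usage=SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=1241)
)
fields_missing = SimpleNamespace(usage=SimpleNamespace(input_tokens=10))
fields_none = SimpleNamespace(
    usage=SimpleNamespace(cache_creation_input_tokens=None, cache_read_input_tokens=None)
)
no_usage = SimpleNamespace()

for name, resp in [
    ("warm", warm),
    ("fields_missing", fields_missing),
    ("fields_none", fields_none),
    ("no_usage", no_usage),
]:
    print(name, read_cache_counts(resp))
```

Always returning a tuple, rather than `None` on the early exit, is what lets the caller unpack `creation, read = ...` unconditionally.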