diff --git a/CHANGELOG.md b/CHANGELOG.md index 71c658e..c1931f8 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -10,6 +10,26 @@ and this project adheres to ## [Unreleased] +### Added + +- **Polish prompt-cache hit-rate telemetry.** Each polish run now + tracks Anthropic prompt-cache token usage and logs a one-line + summary at the end of `attune-author regenerate`: + `Polish cache hit: 87% (1241 read / 1421 total tokens, 6 call(s))`. + A `WARNING` is appended when the run's hit rate falls below 50%, + surfacing silent cache regressions (prompt edits, model alias + drift). Hit rate is `read / (read + creation)` cacheable input + tokens. + - `attune_author.doc_gen._anthropic.call_anthropic` gains an optional + `on_cache_usage(creation, read, model)` callback; backward + compatible (the doc-gen path passes nothing). + - New in `attune_author.polish`: `PolishCacheStats`, + `polish_cache_stats()`, `format_polish_cache_summary()`, + `reset_polish_cache_telemetry()`. Telemetry follows the existing + in-process faithfulness-counter pattern (no new on-disk format). + - README: new "Cache hit rate" subsection under Polish cache. + - 16 new tests in `tests/unit/test_polish_cache_metrics.py`. + ## [0.14.2] - 2026-05-27 ### Fixed diff --git a/README.md b/README.md index d12d498..4985c0f 100644 --- a/README.md +++ b/README.md @@ -257,6 +257,33 @@ volatile frontmatter fields like `generated_at` stripped), context, and model name. Changing the model automatically invalidates all prior entries. +### Cache hit rate + +Separately from the on-disk response cache above, each polish call +uses Anthropic's **prompt cache** for the ~6000-token system prompt. +After a regen run, `attune-author` logs a one-line summary at INFO: + +``` +Polish cache hit: 87% (1241 read / 1421 total tokens, 6 call(s)) +``` + +The hit rate is `read / (read + creation)` — the fraction of cacheable +input tokens served from cache rather than re-billed. 
Prompt caching +cuts input cost ~90% on the cached portion, so a healthy multi-template +run should settle well above 50% once the first call warms the cache. + +- **High (>80%)** — expected steady state; the system prompt is being + reused across calls. +- **Low (<50%)** — triggers a `WARNING` in the summary. Usually means + the cache boundary broke: the system prompt changed between calls, + the model alias drifted, or only a single template was polished (no + reuse). Check recent edits to `polish_prompts.py` or `_POLISH_MODEL`. +- **"no cacheable tokens observed"** — the prompt fell below Anthropic's + caching threshold or caching is disabled (`POLISH_CACHE_SYSTEM`). + +The metric is per-run (in-process); it is not persisted across +invocations. + ## Python API ```python diff --git a/docs/specs/polish-cache-hit-metrics/decisions.md b/docs/specs/archive/polish-cache-hit-metrics/decisions.md similarity index 84% rename from docs/specs/polish-cache-hit-metrics/decisions.md rename to docs/specs/archive/polish-cache-hit-metrics/decisions.md index 11d0be5..4423033 100644 --- a/docs/specs/polish-cache-hit-metrics/decisions.md +++ b/docs/specs/archive/polish-cache-hit-metrics/decisions.md @@ -1,6 +1,12 @@ # Decisions — Polish prompt-cache hit-rate telemetry -**Status:** Draft (2026-05-11) — gated on briefing-followup batch +**Status:** ✅ DONE (2026-06-06) — shipped to [Unreleased]. The Draft +"gated on briefing-followup batch" note was superseded by this file's +own "Execution gate" ("Not blocking"). One deviation: attune-author has +no telemetry JSONL, so the metric uses the existing in-process +faithfulness-counter pattern (INFO summary at end of run) rather than a +new JSONL file; the threshold warning is current-run, not cross-run. +See `tasks.md` for the per-phase record. 
**Owner:** Patrick --- diff --git a/docs/specs/archive/polish-cache-hit-metrics/tasks.md b/docs/specs/archive/polish-cache-hit-metrics/tasks.md new file mode 100644 index 0000000..d1a6c04 --- /dev/null +++ b/docs/specs/archive/polish-cache-hit-metrics/tasks.md @@ -0,0 +1,72 @@ +# Tasks — Polish prompt-cache hit-rate telemetry + +**Status:** ✅ DONE (2026-06-06) — shipped to [Unreleased]. See the +"Deviation" note under Phases 3–4: attune-author has no JSONL +telemetry, so the metric follows the existing in-process +faithfulness-counter pattern (reset at run start, INFO summary at run +end) instead of a new JSONL subsystem. Acceptance criteria in +`decisions.md` are all met. + +## Phase 1 — Read the cache fields + +- [x] **1.1** Captured via a new `on_cache_usage(creation, read, model)` + callback on `doc_gen._anthropic.call_anthropic` (polish can't see + `response.usage` directly — `call_anthropic` returns only text). + `_log_cache_usage` now returns `(creation, read)`. +- [x] **1.2** Compute hit rate: `read / max(read + creation, 1)` + (`PolishCacheStats.hit_rate`) +- [x] **1.3** `PolishCacheStats` dataclass added in `polish.py` + +## Phase 2 — Surface to user + +- [x] **2.1** End-of-run summary logged at INFO via + `format_polish_cache_summary()`: + `Polish cache hit: 87% (1241 read / 1421 total tokens, 6 call(s))` +- [x] **2.2** Graceful when both are zero: + `Polish cache: no cacheable tokens observed (cache not configured?)` + +## Phase 3 — Log to telemetry *(deviation, see note)* + +- [x] **3.1** ~~Append per-call to existing telemetry JSONL~~ → + **There is no telemetry JSONL in attune-author.** Adopted the + existing in-process counter idiom (`_polish_cache_telemetry()` + + `reset_polish_cache_telemetry()`, mirroring + `generator._faithfulness_telemetry`), surfaced via the INFO + end-of-run summary in `maintenance.py`. Building a JSONL + subsystem would contradict the spec's "low effort, single file" + scope and the codebase's telemetry pattern. 
+- [x] **3.2** Aggregate fields: calls, creation_tokens, read_tokens, + derived hit_rate, model (model accepted by the callback; per-model + breakdown explicitly out of scope per decisions.md). + +## Phase 4 — Threshold warning *(deviation: current-run, not cross-run)* + +- [x] **4.1–4.3** `format_polish_cache_summary()` appends a `WARNING` + when the **current run's** hit rate < 50% (`_CACHE_HIT_WARN_THRESHOLD`) + and ≥1 cacheable token was seen, with a pointer to the README. + Cross-run rolling history (last N records) is deferred — it would + require the persistent JSONL layer this spec deliberately avoided. + +## Phase 5 — Test + +- [x] **5.1** `tests/unit/test_polish_cache_metrics.py`: mocks Anthropic + responses with known cache_creation/cache_read values; asserts the + callback fires (incl. the zero case), hit-rate math, accumulator, + summary line, and threshold warning (16 tests). +- [ ] **5.2** Integration test (optional) — **skipped**: would require a + live API key (real prompt-cache hits can't be observed against a + mock). The unit tests cover the compute path; left optional as the + spec allowed. + +## Phase 6 — Docs + +- [x] **6.1** README "Cache hit rate" subsection — meaning, healthy + ranges, what to do when it drops. +- [x] **6.2** CHANGELOG [Unreleased] entry added. 
+ +## Out of scope + +- Per-stage cache breakdown (system / examples / messages) +- Cost-in-dollars tracking (token-level only) +- Cache strategy changes +- Cross-package telemetry aggregation diff --git a/docs/specs/polish-fact-check/decisions.md b/docs/specs/archive/polish-fact-check/decisions.md similarity index 100% rename from docs/specs/polish-fact-check/decisions.md rename to docs/specs/archive/polish-fact-check/decisions.md diff --git a/docs/specs/polish-fact-check/design.md b/docs/specs/archive/polish-fact-check/design.md similarity index 100% rename from docs/specs/polish-fact-check/design.md rename to docs/specs/archive/polish-fact-check/design.md diff --git a/docs/specs/polish-fact-check/requirements.md b/docs/specs/archive/polish-fact-check/requirements.md similarity index 100% rename from docs/specs/polish-fact-check/requirements.md rename to docs/specs/archive/polish-fact-check/requirements.md diff --git a/docs/specs/polish-fact-check/tasks.md b/docs/specs/archive/polish-fact-check/tasks.md similarity index 100% rename from docs/specs/polish-fact-check/tasks.md rename to docs/specs/archive/polish-fact-check/tasks.md diff --git a/docs/specs/regen-pipeline/design.md b/docs/specs/archive/regen-pipeline/design.md similarity index 88% rename from docs/specs/regen-pipeline/design.md rename to docs/specs/archive/regen-pipeline/design.md index 718abf3..e8a63fd 100644 --- a/docs/specs/regen-pipeline/design.md +++ b/docs/specs/archive/regen-pipeline/design.md @@ -1,8 +1,27 @@ # Spec: Regen Pipeline — Design +> ## ⚠️ OBSOLETE — do not implement (reconciled 2026-06-06) +> +> This design was never built and conflicts with the shipped architecture. It +> assumes a single `corpus_root`, a React/JSX frontend (`App.jsx`, +> `CorpusSetup`), a polish+Haiku `_regen` pipeline, and WS-badge wiring — none +> of which exist. 
The shipped reality instead uses: +> +> - **Regen:** `sidecar/attune_gui/routes/living_docs.py` → +> `POST /api/living-docs/docs/{id}/regenerate` → Jobs registry +> (`attune_gui.jobs`) → `_regenerate_doc_executor` → +> `attune_author.generator.generate_feature_templates` + `load_manifest`. +> - **Corpus config:** multi-corpus registry (`attune_gui.editor_corpora`, +> `POST /api/corpus/register`) + workspace config (`attune_gui.workspace`, +> `living_docs.py` `get_config`/`set_config`). +> - **Frontend:** TypeScript (`editor-frontend/src/corpus-switcher.ts`), not React. +> - **Bulk:** `make regen-all` (Makefile), not `POST /api/templates/refresh-all`. +> +> Kept verbatim below for historical context only. See `requirements.md` banner. + ## Phase 2: Design -**Status**: in-review +**Status**: obsolete — superseded by living-docs regen automation (was "in-review", never built) --- diff --git a/docs/specs/regen-pipeline/requirements.md b/docs/specs/archive/regen-pipeline/requirements.md similarity index 66% rename from docs/specs/regen-pipeline/requirements.md rename to docs/specs/archive/regen-pipeline/requirements.md index f94af8d..6afcda8 100644 --- a/docs/specs/regen-pipeline/requirements.md +++ b/docs/specs/archive/regen-pipeline/requirements.md @@ -5,9 +5,31 @@ --- +> ## ⚠️ RECONCILED — satisfied-by-different-means (2026-06-06) +> +> This spec was previously marked "complete" with all tasks ✅, but a code +> audit found **none** of its named symbols ever shipped (`_regen`, +> `regen_template(corpus_root=…)`, `_resolve_corpus_root`, `atomic_write`, +> `_patch_summaries_json`) and the attune-gui pieces (`/api/config`, +> `refresh-all`, `CorpusSetup`) do not exist. The underlying need was instead +> met by a **more evolved architecture**. 
All three user stories are satisfied: +> +> | User story | Status | Actual implementation | +> |---|---|---| +> | US1 — badge click → regen → saved to disk | ✅ met | `POST /api/living-docs/docs/{id}/regenerate` → Jobs registry → `_regenerate_doc_executor` → `attune_author.generator.generate_feature_templates` (`sidecar/attune_gui/routes/living_docs.py`). Source-driven generation, not polish+Haiku. | +> | US2 — first-run corpus setup UI | ✅ exceeded | Multi-corpus registry: `editor_corpora.py`, `POST /api/corpus/register`, `editor-frontend/src/corpus-switcher.ts` (dropdown + "Add corpus…" modal). | +> | US3 — env auto-load on startup | ✅ met | Workspace config (`living_docs.py` `get_config`/`set_config`, `attune_gui.workspace`) + persisted corpus registry, replacing single `ATTUNE_CORPUS_ROOT`. | +> +> Bulk regen ships as the build-time `make regen-all` target (Makefile), not a +> runtime "Regen all stale" button. The frontend is **TypeScript**, not the +> React/JSX assumed by `design.md`. +> +> **No genuine product gaps remain.** This spec is retained for history; the +> `design.md` below is **obsolete** (see its banner). Do not implement it. + ## Phase 1: Requirements -**Status**: approved +**Status**: reconciled — satisfied by living-docs regen automation + corpus registry (was falsely marked "approved/complete") ### Problem statement diff --git a/docs/specs/regen-pipeline/tasks.md b/docs/specs/archive/regen-pipeline/tasks.md similarity index 80% rename from docs/specs/regen-pipeline/tasks.md rename to docs/specs/archive/regen-pipeline/tasks.md index 9cc0f78..7e12159 100644 --- a/docs/specs/regen-pipeline/tasks.md +++ b/docs/specs/archive/regen-pipeline/tasks.md @@ -2,9 +2,29 @@ ## Phase 3: Tasks -**Status**: complete +**Status**: NOT done as written — reconciled 2026-06-06 -> Shipped: `attune-author regenerate` CLI lives in `src/attune_author/cli.py:507` (handler) with the parser registered around line 154. 
Core logic in `maintenance.py` and `maintenance_batch.py`. CHANGELOG documents the batch variant. +> ## ⚠️ The task table below is INACCURATE +> +> A 2026-06-06 code audit found that **none** of the attune-author symbols in +> tasks 2–9 ever shipped (`_resolve_corpus_root`, `atomic_write`, +> `_patch_summaries_json`, `regen_template(corpus_root=…)`, `_regen`) and +> **none** of the attune-gui pieces in tasks 10–24 exist (`config.py` +> `ConfigState`, `/api/config`, `/api/templates/refresh-all`, +> `/api/browse/directory`, `CorpusSetup`, `App.jsx`). The "done" marks below are +> false. The earlier "Shipped" note conflated this spec with the unrelated +> hash-mismatch `attune-author regenerate` CLI — a different feature. +> +> **What actually satisfies the spec's user stories** (see `requirements.md` +> banner for the full mapping): +> - Single-doc regen → `POST /api/living-docs/docs/{id}/regenerate` (Jobs + +> `attune_author.generator.generate_feature_templates`). +> - Corpus config → corpus registry (`editor_corpora.py`, +> `/api/corpus/register`) + workspace config (`attune_gui.workspace`). +> - Bulk → `make regen-all` (Makefile). +> +> No code action is required: the product need is met. The table below is left +> intact only as a record of the original (unbuilt) plan. ### Implementation order diff --git a/docs/specs/regen-staleness-hash-mismatch/decisions.md b/docs/specs/archive/regen-staleness-hash-mismatch/decisions.md similarity index 94% rename from docs/specs/regen-staleness-hash-mismatch/decisions.md rename to docs/specs/archive/regen-staleness-hash-mismatch/decisions.md index 9cbde17..7e015e4 100644 --- a/docs/specs/regen-staleness-hash-mismatch/decisions.md +++ b/docs/specs/archive/regen-staleness-hash-mismatch/decisions.md @@ -1,6 +1,6 @@ # Decisions — Regen / staleness hash mismatch -**Status:** root cause confirmed 2026-05-27 — original hypothesis (budget truncation of hash inputs) was wrong; actual cause is LLM-polished frontmatter laundering. 
Fix direction concrete. Implementation TBD. +**Status:** ✅ DONE — shipped in PR #48 (commit 1b1c7c5) / v0.14.2. Root cause was LLM-polished frontmatter laundering (not the original budget-truncation hypothesis). Fix: `apply_polish_results` re-injects deterministic frontmatter fields via `_replace_polished_frontmatter` (`generator.py:483`, `_DETERMINISTIC_FRONTMATTER_FIELDS`). Regression test: `tests/unit/test_polished_frontmatter_reinjection.py`. CHANGELOG documents it under [0.14.2]. Phase 3 release shipped; attune-gui can pin ≥0.14.2 to unblock its Phase 2. **Owner:** Patrick **Filed:** 2026-05-25 (handoff from attune-gui Phase 2 blockers; see [attune-gui docs/specs/living-docs-regen-automation/decisions.md](https://github.com/Smart-AI-Memory/attune-gui/blob/main/docs/specs/living-docs-regen-automation/decisions.md#phase-2-blockers-discovered-2026-05-23)) diff --git a/docs/specs/polish-cache-hit-metrics/tasks.md b/docs/specs/polish-cache-hit-metrics/tasks.md deleted file mode 100644 index dfc6bf9..0000000 --- a/docs/specs/polish-cache-hit-metrics/tasks.md +++ /dev/null @@ -1,55 +0,0 @@ -# Tasks — Polish prompt-cache hit-rate telemetry - -## Phase 1 — Read the cache fields - -- [ ] **1.1** In `attune_author/polish.py`, capture - `response.usage.cache_creation_input_tokens` and - `response.usage.cache_read_input_tokens` from each - Anthropic API call -- [ ] **1.2** Compute hit rate: - `read / max(read + creation, 1)` -- [ ] **1.3** Add a `PolishCacheStats` dataclass for - structured passing - -## Phase 2 — Surface to user - -- [ ] **2.1** Print a one-line summary at end of polish run: - `Polish complete · cache hit: 87% (1241 read / 1421 total tokens)` -- [ ] **2.2** Format gracefully when both are zero (no cache - configured) - -## Phase 3 — Log to telemetry - -- [ ] **3.1** Append per-call to existing telemetry JSONL - (wherever attune-author writes telemetry) -- [ ] **3.2** Fields: timestamp, model, hit_rate, - read_tokens, creation_tokens, polish_target - -## 
Phase 4 — Threshold warning - -- [ ] **4.1** When invoked, read last N (e.g., 10) telemetry - records -- [ ] **4.2** Compute rolling mean hit rate -- [ ] **4.3** If <50%, print a warning at end of run with - pointer to docs - -## Phase 5 — Test - -- [ ] **5.1** Unit test: mock an Anthropic response with known - cache_creation / cache_read values; assert hit rate - computed correctly -- [ ] **5.2** Integration test (optional): run polish twice - back-to-back; second run should report >0% cache hit - -## Phase 6 — Docs - -- [ ] **6.1** README section: "Cache hit rate" — what it means, - what good values look like, what to do if it drops -- [ ] **6.2** Link from CHANGELOG when feature ships - -## Out of scope - -- Per-stage cache breakdown (system / examples / messages) -- Cost-in-dollars tracking (token-level only) -- Cache strategy changes -- Cross-package telemetry aggregation diff --git a/src/attune_author/doc_gen/_anthropic.py b/src/attune_author/doc_gen/_anthropic.py index e47f801..d3cf208 100644 --- a/src/attune_author/doc_gen/_anthropic.py +++ b/src/attune_author/doc_gen/_anthropic.py @@ -16,6 +16,8 @@ from typing import TYPE_CHECKING if TYPE_CHECKING: + from collections.abc import Callable + from anthropic import Anthropic logger = logging.getLogger(__name__) @@ -102,6 +104,7 @@ def call_anthropic( model: str, max_tokens: int, cache_system: bool = False, + on_cache_usage: Callable[[int, int, str], None] | None = None, ) -> str: """Make a single-turn ``messages.create`` call with retry/backoff. @@ -125,6 +128,13 @@ def call_anthropic( for sonnet/opus, 2048 for haiku); below that, the call still works but no cache is used. Cache token usage is emitted at INFO so callers can verify hits. + on_cache_usage: Optional callback invoked once per successful + call with ``(cache_creation_input_tokens, + cache_read_input_tokens, model)``. Lets a caller (e.g. the + polish pass) accumulate cache hit-rate telemetry without + this module owning that concern. 
Fired even when both + counts are zero so callers can distinguish "no cache + configured" from "never called". Returns: The first text block of the response, or the empty @@ -164,7 +174,9 @@ def call_anthropic( system=system_payload, messages=[{"role": "user", "content": user_message}], ) - _log_cache_usage(response, model) + creation, read = _log_cache_usage(response, model) + if on_cache_usage is not None: + on_cache_usage(creation, read, model) if response.content: return response.content[0].text return "" @@ -182,16 +194,20 @@ def call_anthropic( raise AnthropicCallError(_redact(str(last_exc))) from None -def _log_cache_usage(response: object, model: str) -> None: +def _log_cache_usage(response: object, model: str) -> tuple[int, int]: """Emit cache hit telemetry from an Anthropic response. Reads ``cache_creation_input_tokens`` and ``cache_read_input_tokens`` from the response's usage object when present and logs them at INFO. Older SDK responses without those fields are silently skipped. + + Returns: + ``(creation, read)`` token counts, defaulting to ``(0, 0)`` when + the response has no usage block or the SDK omits the fields. """ usage = getattr(response, "usage", None) if usage is None: - return + return (0, 0) creation = getattr(usage, "cache_creation_input_tokens", 0) or 0 read = getattr(usage, "cache_read_input_tokens", 0) or 0 if creation or read: @@ -201,3 +217,4 @@ def _log_cache_usage(response: object, model: str) -> None: creation, read, ) + return (creation, read) diff --git a/src/attune_author/maintenance.py b/src/attune_author/maintenance.py index 13788b5..7634fe9 100644 --- a/src/attune_author/maintenance.py +++ b/src/attune_author/maintenance.py @@ -102,8 +102,10 @@ def run_maintenance( # Reset Phase 3 faithfulness telemetry so the end-of-run summary # reflects this regen rather than carrying state across runs. 
from attune_author.generator import reset_faithfulness_telemetry + from attune_author.polish import reset_polish_cache_telemetry reset_faithfulness_telemetry() + reset_polish_cache_telemetry() for entry in report.help_entries: if not entry.is_stale: @@ -147,6 +149,15 @@ def run_maintenance( telemetry["cost_usd"], ) + # Prompt-cache hit-rate summary. Logged at INFO so it rides the + # same default `attune-author regenerate` output as the + # faithfulness line; silent when polish never ran this regen. + from attune_author.polish import format_polish_cache_summary + + cache_summary = format_polish_cache_summary() + if cache_summary is not None: + logger.info("%s", cache_summary) + return result diff --git a/src/attune_author/polish.py b/src/attune_author/polish.py index d4f1e0d..03b11df 100644 --- a/src/attune_author/polish.py +++ b/src/attune_author/polish.py @@ -39,6 +39,7 @@ import os import re import time +from dataclasses import dataclass from pathlib import Path from attune_author.doc_gen._anthropic import ( @@ -469,6 +470,118 @@ def build_polish_prompt( POLISH_MAX_TOKENS = 4096 POLISH_CACHE_SYSTEM = True +#: Current-run hit rate below which the end-of-run summary appends a +#: warning. A healthy polish run that re-touches templates should +#: sit well above this once the system prompt is cached; a low +#: rate usually means the cache boundary broke (prompt edit, model +#: alias drift); see the README "Cache hit rate" section. +_CACHE_HIT_WARN_THRESHOLD = 0.5 + + +@dataclass(frozen=True) +class PolishCacheStats: + """Aggregate prompt-cache token usage across polish calls. + + ``creation_tokens`` are input tokens written into Anthropic's + prompt cache; ``read_tokens`` are input tokens served from it. + The hit rate is ``read / (read + creation)``, the fraction of + cacheable input that came from cache rather than being re-billed. 
+ """ + + calls: int = 0 + creation_tokens: int = 0 + read_tokens: int = 0 + + @property + def total_tokens(self) -> int: + return self.read_tokens + self.creation_tokens + + @property + def hit_rate(self) -> float: + """Cache read fraction in ``[0.0, 1.0]``; ``0.0`` when no + cacheable tokens were seen (avoids divide-by-zero).""" + return self.read_tokens / max(self.total_tokens, 1) + + def summary_line(self) -> str: + """One-line human summary for end-of-run output. + + Degrades gracefully when no cacheable tokens were seen (no + cache configured, or prompt below the caching threshold). + """ + if self.total_tokens == 0: + return "Polish cache: no cacheable tokens observed (cache not configured?)" + return ( + f"Polish cache hit: {self.hit_rate:.0%} " + f"({self.read_tokens} read / {self.total_tokens} total tokens, " + f"{self.calls} call(s))" + ) + + +def _polish_cache_telemetry() -> dict[str, int]: + """Per-process aggregate of prompt-cache token usage. + + Stored on the function as an attribute so the end-of-run summary + can read totals without module-level state — same idiom as + ``generator._faithfulness_telemetry``. Reset via + :func:`reset_polish_cache_telemetry`. + """ + state = getattr(_polish_cache_telemetry, "_state", None) + if state is None: + state = {"calls": 0, "creation": 0, "read": 0} + _polish_cache_telemetry._state = state # type: ignore[attr-defined] + return state + + +def reset_polish_cache_telemetry() -> None: + """Reset the per-process prompt-cache telemetry counters.""" + _polish_cache_telemetry._state = { # type: ignore[attr-defined] + "calls": 0, + "creation": 0, + "read": 0, + } + + +def _record_cache_usage(creation: int, read: int, model: str) -> None: + """Accumulate one polish call's cache token counts. + + Wired into :func:`call_anthropic` via its ``on_cache_usage`` + hook. ``model`` is accepted to match the callback signature but + not aggregated — per-model breakdown is out of scope (decisions.md). 
+ """ + state = _polish_cache_telemetry() + state["calls"] += 1 + state["creation"] += creation + state["read"] += read + + +def polish_cache_stats() -> PolishCacheStats: + """Snapshot the current per-process prompt-cache aggregate.""" + state = _polish_cache_telemetry() + return PolishCacheStats( + calls=state["calls"], + creation_tokens=state["creation"], + read_tokens=state["read"], + ) + + +def format_polish_cache_summary() -> str | None: + """End-of-run summary line, or ``None`` if polish never ran. + + Appends a low-hit-rate warning when the run's hit rate falls below + :data:`_CACHE_HIT_WARN_THRESHOLD` and at least one cacheable token + was seen. Scope is the current process: callers reset at run start. + """ + stats = polish_cache_stats() + if stats.calls == 0: + return None + line = stats.summary_line() + if stats.total_tokens > 0 and stats.hit_rate < _CACHE_HIT_WARN_THRESHOLD: + line += ( + f" — WARNING: below {_CACHE_HIT_WARN_THRESHOLD:.0%}; " + "the prompt cache may have regressed (see README 'Cache hit rate')" + ) + return line + def _call_llm( content: str, @@ -516,6 +629,7 @@ def _call_llm( model=_POLISH_MODEL, max_tokens=POLISH_MAX_TOKENS, cache_system=POLISH_CACHE_SYSTEM, + on_cache_usage=_record_cache_usage, ) return polished or content @@ -525,12 +639,16 @@ def _call_llm( # or from the wrapping polish layer. 
__all__ = [ "AnthropicCallError", + "PolishCacheStats", "PolishError", "STRICT_ENV_VAR", "_env_strict_default", "build_source_summary", "clear_cache", + "format_polish_cache_summary", + "polish_cache_stats", "polish_template", + "reset_polish_cache_telemetry", ] diff --git a/tests/test_maintenance_batch.py b/tests/test_maintenance_batch.py index f846f70..6cfb769 100644 --- a/tests/test_maintenance_batch.py +++ b/tests/test_maintenance_batch.py @@ -19,7 +19,7 @@ from __future__ import annotations import json -from datetime import datetime, timezone +from datetime import datetime, timedelta, timezone from pathlib import Path from types import SimpleNamespace from unittest.mock import MagicMock, patch @@ -53,11 +53,16 @@ def _state(submitted_at: datetime | None = None) -> BatchState: + # Default to "recently submitted" relative to now so the fixture stays + # inside the 29-day retention window for status/cancel paths that read + # without an injected ``now=``. (A hardcoded date silently expires and + # breaks these tests once it ages past the window.) + submitted_at = submitted_at or (datetime.now(timezone.utc) - timedelta(days=1)) return BatchState( schema_version=1, batch_id="msgbatch_test", - submitted_at=submitted_at or datetime(2026, 5, 8, 18, 35, tzinfo=timezone.utc), - expected_completion_at=datetime(2026, 5, 8, 18, 41, tzinfo=timezone.utc), + submitted_at=submitted_at, + expected_completion_at=submitted_at + timedelta(minutes=6), model="claude-sonnet-4-6", requests=( BatchStateRequest("feat__auth__concept", "auth", "concept"), diff --git a/tests/unit/test_polish_cache_metrics.py b/tests/unit/test_polish_cache_metrics.py new file mode 100644 index 0000000..a0ab06d --- /dev/null +++ b/tests/unit/test_polish_cache_metrics.py @@ -0,0 +1,182 @@ +"""Tests for polish prompt-cache hit-rate telemetry. 
+ +Covers the spec ``polish-cache-hit-metrics``: capturing +``cache_creation_input_tokens`` / ``cache_read_input_tokens`` from +Anthropic responses, the ``PolishCacheStats`` hit-rate math, the +per-process accumulator, the end-of-run summary line, and the +low-hit-rate threshold warning. +""" + +from __future__ import annotations + +from unittest.mock import MagicMock + +import pytest + +from attune_author import polish +from attune_author.doc_gen import _anthropic +from attune_author.polish import ( + PolishCacheStats, + format_polish_cache_summary, + polish_cache_stats, + reset_polish_cache_telemetry, +) + + +@pytest.fixture(autouse=True) +def _clean_telemetry(): + """Each test starts and ends with zeroed counters so the + per-process accumulator can't leak across tests.""" + reset_polish_cache_telemetry() + yield + reset_polish_cache_telemetry() + + +# --------------------------------------------------------------------------- +# PolishCacheStats hit-rate math +# --------------------------------------------------------------------------- + + +class TestPolishCacheStats: + def test_hit_rate_basic(self) -> None: + stats = PolishCacheStats(calls=1, creation_tokens=180, read_tokens=1241) + assert stats.total_tokens == 1421 + assert stats.hit_rate == pytest.approx(1241 / 1421) + + def test_hit_rate_full_hit(self) -> None: + stats = PolishCacheStats(calls=2, creation_tokens=0, read_tokens=2048) + assert stats.hit_rate == pytest.approx(1.0) + + def test_hit_rate_zero_tokens_is_safe(self) -> None: + """No cacheable tokens must not divide by zero.""" + stats = PolishCacheStats(calls=1, creation_tokens=0, read_tokens=0) + assert stats.total_tokens == 0 + assert stats.hit_rate == 0.0 + + def test_summary_line_with_tokens(self) -> None: + stats = PolishCacheStats(calls=3, creation_tokens=180, read_tokens=1241) + line = stats.summary_line() + assert "87%" in line # 1241/1421 == 0.873... 
+ assert "1241 read" in line + assert "1421 total" in line + assert "3 call(s)" in line + + def test_summary_line_graceful_when_zero(self) -> None: + stats = PolishCacheStats(calls=1, creation_tokens=0, read_tokens=0) + assert "no cacheable tokens" in stats.summary_line().lower() + + +# --------------------------------------------------------------------------- +# call_anthropic on_cache_usage callback (the capture path) +# --------------------------------------------------------------------------- + + +def _mock_client(creation: int, read: int) -> MagicMock: + client = MagicMock() + block = MagicMock() + block.text = "polished" + response = MagicMock() + response.content = [block] + response.usage.cache_creation_input_tokens = creation + response.usage.cache_read_input_tokens = read + client.messages.create.return_value = response + return client + + +class TestOnCacheUsageCallback: + def test_callback_fires_with_token_counts(self) -> None: + seen: list[tuple[int, int, str]] = [] + client = _mock_client(creation=1024, read=512) + + _anthropic.call_anthropic( + client, + system="s", + user_message="u", + model="claude-sonnet-4-6", + max_tokens=10, + on_cache_usage=lambda c, r, m: seen.append((c, r, m)), + ) + + assert seen == [(1024, 512, "claude-sonnet-4-6")] + + def test_callback_fires_even_when_zero(self) -> None: + """Caller must be able to tell 'no cache' from 'never called'.""" + seen: list[tuple[int, int, str]] = [] + client = _mock_client(creation=0, read=0) + + _anthropic.call_anthropic( + client, + system="s", + user_message="u", + model="m", + max_tokens=10, + on_cache_usage=lambda c, r, m: seen.append((c, r, m)), + ) + + assert seen == [(0, 0, "m")] + + def test_no_callback_is_fine(self) -> None: + """Omitting the callback (doc-gen path) must not error.""" + client = _mock_client(creation=10, read=10) + out = _anthropic.call_anthropic( + client, system="s", user_message="u", model="m", max_tokens=10 + ) + assert out == "polished" + + +# 
--------------------------------------------------------------------------- +# Accumulator + reset +# --------------------------------------------------------------------------- + + +class TestAccumulator: + def test_records_accumulate_across_calls(self) -> None: + polish._record_cache_usage(1000, 0, "m") # creation-only (cold) + polish._record_cache_usage(100, 900, "m") # mostly read (warm) + + stats = polish_cache_stats() + assert stats.calls == 2 + assert stats.creation_tokens == 1100 + assert stats.read_tokens == 900 + assert stats.total_tokens == 2000 + assert stats.hit_rate == pytest.approx(0.45) + + def test_reset_zeroes_counters(self) -> None: + polish._record_cache_usage(100, 100, "m") + reset_polish_cache_telemetry() + stats = polish_cache_stats() + assert (stats.calls, stats.creation_tokens, stats.read_tokens) == (0, 0, 0) + + +# --------------------------------------------------------------------------- +# End-of-run summary + threshold warning +# --------------------------------------------------------------------------- + + +class TestSummary: + def test_summary_none_when_polish_never_ran(self) -> None: + assert format_polish_cache_summary() is None + + def test_summary_present_after_calls(self) -> None: + polish._record_cache_usage(180, 1241, "m") + summary = format_polish_cache_summary() + assert summary is not None + assert "Polish cache hit" in summary + assert "WARNING" not in summary # 87% is healthy + + def test_low_hit_rate_appends_warning(self) -> None: + # 100 read / 1000 total == 10% — below the 50% threshold. 
+ polish._record_cache_usage(900, 100, "m") + summary = format_polish_cache_summary() + assert summary is not None + assert "WARNING" in summary + assert "below 50%" in summary + + def test_zero_token_run_warns_nothing(self) -> None: + """A run with calls but no cacheable tokens reports the + graceful line and no spurious threshold warning.""" + polish._record_cache_usage(0, 0, "m") + summary = format_polish_cache_summary() + assert summary is not None + assert "no cacheable tokens" in summary.lower() + assert "WARNING" not in summary
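The arithmetic these tests exercise can be sketched on its own, without importing `attune_author`. The snippet below is a minimal illustration, not the shipped code: `hit_rate`, `summarize`, and `WARN_THRESHOLD` are hypothetical stand-ins for `PolishCacheStats.hit_rate`, `format_polish_cache_summary()`, and `_CACHE_HIT_WARN_THRESHOLD`, and the warning wording is simplified.

```python
from __future__ import annotations

# Hypothetical stand-in for _CACHE_HIT_WARN_THRESHOLD in the diff above.
WARN_THRESHOLD = 0.5


def hit_rate(read: int, creation: int) -> float:
    """Fraction of cacheable input tokens served from cache.

    max(..., 1) keeps the zero-token case from dividing by zero,
    so a run with no cacheable tokens reports 0.0.
    """
    return read / max(read + creation, 1)


def summarize(calls: list[tuple[int, int]]) -> str | None:
    """Fold per-call (creation, read) pairs into one summary line.

    Returns None when no calls were recorded, a graceful message when
    no cacheable tokens were seen, and appends a warning when the
    run's hit rate is below WARN_THRESHOLD.
    """
    if not calls:
        return None
    creation = sum(c for c, _ in calls)
    read = sum(r for _, r in calls)
    total = creation + read
    if total == 0:
        return "Polish cache: no cacheable tokens observed (cache not configured?)"
    rate = hit_rate(read, creation)
    line = (
        f"Polish cache hit: {rate:.0%} "
        f"({read} read / {total} total tokens, {len(calls)} call(s))"
    )
    if rate < WARN_THRESHOLD:
        line += f" (WARNING: below {WARN_THRESHOLD:.0%})"
    return line


# A cold call followed by a warm one, as in test_records_accumulate_across_calls.
print(summarize([(1000, 0), (100, 900)]))
```

The cold first call writes the cache and contributes only creation tokens, so a short run can land under the threshold even when caching works. This matches the README's note that polishing a single template gives no reuse.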