Smart-AI-Memory · silversurfer562 · May 16, 2026 · May 16, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -15,6 +15,42 @@ changes land, not at tag time.
 
 ### Added
 
+- **Polish fact-check Phase 3 — faithfulness judge.** Wraps
+  `attune_rag.eval.faithfulness.FaithfulnessJudge` as a
+  post-polish step: scores each polished file's claims against
+  the source files it was generated from. When the score falls
+  below the configured threshold, appends a
+  `## Faithfulness review` block listing the unsupported claims
+  and the judge's reasoning. Best-effort: missing
+  `attune-rag[claude]`, missing `ANTHROPIC_API_KEY`, over-budget
+  estimates, and transient API failures all degrade silently
+  rather than blocking the polish.
+  - New package: `src/attune_author/faithfulness/` with the
+    judge wrapper, `FaithfulnessConfig`, `JudgeOutcome`,
+    `estimate_cost_usd` budget-gate helper, and a
+    `format_review_block` / `apply_review_block` soft-fail pair
+    that mirrors the Phase 1 `## Unresolved references` shape.
+  - Config schema:
+    `[tool.attune-author.fact-check.faithfulness]` with
+    `enabled`, `threshold` (default 0.95, pre-calibration), `budget_per_file_usd`
+    (default $0.10), `model` (default Sonnet 4.6 — Haiku 4.5 is
+    cheaper for high-volume runs), and
+    `block_polish_on_unavailable` (default False — flip to True
+    in CI where missing deps should be loud).
+  - Cost telemetry: per-process counters on the generator
+    module; `run_maintenance` resets them at start and logs an
+    INFO summary at end with call count, skip count, and total
+    estimated USD spent.
+  - `ATTUNE_AUTHOR_FAITHFULNESS=off` env-var override for one-off
+    disable without editing pyproject.
+  - 30 new tests under `tests/unit/faithfulness/`.
+  - Threshold calibration against the ops-dashboard fixture
+    (tasks 3.3, 3.4) is **deferred** until the live-LLM Phase 2
+    acceptance run; today's default of 0.95 is documented as
+    pre-calibration in `decisions.md`.
+  - Spec: `docs/specs/polish-fact-check/`. Phase 4 (tutorial
+    code-fence mypy) remains on the roadmap.
+
 - **Polish fact-check Phase 2 — ground-truth context
   injection.** Builds three sentinel-tagged blocks
   (`<cli_help>`, `<public_api>`, `<dataclasses>`) and injects

diff --git a/README.md b/README.md
@@ -162,6 +162,37 @@ CLI flags, fabricated private-module imports, wrong route
 paths, hallucinated counts) at the prompt layer, rather than
 relying solely on the post-generation fact-check to catch them.
 
+## Faithfulness review (Phase 3)
+
+The Phase 3 faithfulness judge wraps
+`attune_rag.eval.faithfulness.FaithfulnessJudge` as a
+post-polish step: it scores each polished file's claims against
+the source files it was generated from. When the score falls
+below the configured threshold, a `## Faithfulness review`
+block is appended to the polished file listing the unsupported
+claims and the judge's reasoning.
+
+The judge is **opt-in** because it makes real Anthropic API
+calls. To enable:
+
+```toml
+[tool.attune-author.fact-check.faithfulness]
+enabled = true
+threshold = 0.95            # below this triggers a review block
+budget_per_file_usd = 0.10  # skip if estimated cost exceeds cap
+model = "claude-sonnet-4-6" # haiku is ~1/3 the cost
+```
+
+The judge is best-effort. Missing `attune-rag[claude]`, missing
+`ANTHROPIC_API_KEY`, over-budget cost estimates, and transient
+API failures all degrade silently rather than blocking the
+polish. Set `block_polish_on_unavailable = true` in CI lanes
+where missing deps should fail loudly instead.
+
+End-of-run telemetry (call count, skip count, total estimated
+USD) logs at INFO level after `attune-author regenerate`. Set
+`ATTUNE_AUTHOR_FAITHFULNESS=off` to disable for a single run.
+
 ## Polish cache
 
 `attune-author` caches LLM polish responses on disk so re-generating an

diff --git a/docs/specs/polish-fact-check/decisions.md b/docs/specs/polish-fact-check/decisions.md
@@ -73,3 +73,38 @@ To be filled in during Phase 3 implementation:
     faithfulness judge ships, it will require its own real-LLM
     calibration run. Folding the cost-delta measurement into that
     run avoids two separate real-LLM cycles.
+- 2026-05-16 — Phase 3 shipped. New decisions captured during
+  implementation:
+  - **Opt-in default**: `enabled=False` ships in
+    `FaithfulnessConfig` and the pyproject loader, because the
+    judge makes real Anthropic API calls and we shouldn't bill
+    users for it silently on the first run after install. The
+    Phase 1 fact-check is enabled by default (no API calls); the
+    Phase 3 judge is not.
+  - **Synchronous wrapper via `asyncio.run`**: the existing
+    polish pipeline is synchronous, so the async
+    `FaithfulnessJudge.score` coroutine is bridged with
+    `asyncio.run`. This precludes calling the judge from inside
+    a running event loop (we don't, today), but keeps the
+    surface aligned with the rest of attune-author.
+  - **Best-effort vs strict**: missing extras / missing API key
+    / transient failures all default to `JudgeOutcome(score=None,
+    skipped_reason=…)` rather than raising. CI lanes that need
+    loud failures opt in via `block_polish_on_unavailable = true`.
+  - **Budget gate uses character-count heuristic, not tokenizer**:
+    `estimate_cost_usd(chars, model)` divides chars by 4 to get
+    a rough token count and multiplies by a per-model price
+    lookup. Accurate to ~20% — well inside what a $0.10 budget
+    cap cares about. A real tokenizer is a future change if
+    drift surfaces.
+  - **Cost telemetry as function attribute, not module global**:
+    `_faithfulness_telemetry()` stores the counter dict on its
+    own `_state` attribute so it's resettable, mockable, and
+    doesn't leak module-level state. Mirrors how the polish
+    cache exposes its store.
+  - **Calibration deferred**: tasks 3.3 and 3.4 require a real
+    LLM run against the ops-dashboard pre-fix and post-fix
+    fixtures. The placeholder threshold of `0.95` ships as the
+    default and the calibration is scheduled to land alongside
+    the live-LLM Phase 2 acceptance run so a single real-API
+    cycle covers both phases' open work.
diff --git a/docs/specs/polish-fact-check/tasks.md b/docs/specs/polish-fact-check/tasks.md
@@ -110,23 +110,22 @@ the `FactCheckReport` plumbing.
 
 | # | Task | Layer | Status | Notes |
 |---|------|-------|--------|-------|
-| 3.1 | Add faithfulness-threshold + budget-cap config to `[tool.attune-author.fact-check]` | attune-author | todo | Default threshold `0.95`; default cap `$0.10/feature` |
-| 3.2 | Implement `faithfulness.judge_polished_file(polished_path, source_paths, config)` wrapper | attune-author | todo | Wraps `attune_rag.eval.faithfulness.FaithfulnessJudge` |
-| 3.3 | Calibrate threshold against ops-dashboard fixture | attune-author | todo | Pre-fix should score < 0.9 mean; post-fix ≥ 0.95 |
-| 3.4 | Document calibration result in `decisions.md` (or design doc) | attune-author | todo | Pre-committed matrix entry; concrete numbers |
-| 3.5 | Wire judge into post-polish pipeline (after Phase 1 fact-check) | attune-author | todo | Append `## Faithfulness review` block when below threshold |
-| 3.6 | Cost telemetry: aggregate per-feature judge cost; report at end of `regenerate` | attune-author | todo | Use existing telemetry hooks if any; otherwise log |
-| 3.7 | Test: judge runs and writes review block on a deliberately unfaithful synthetic input | attune-author | todo | Construct a polished file that contradicts the source |
-| 3.8 | Test: budget cap skips judge call when estimated cost exceeds threshold | attune-author | todo | |
-| 3.9 | Update CHANGELOG + README | attune-author | todo | |
+| 3.1 | Add faithfulness-threshold + budget-cap config to `[tool.attune-author.fact-check]` | attune-author | **done** | `[tool.attune-author.fact-check.faithfulness]` sub-table; defaults threshold=0.95, budget=$0.10, model=Sonnet 4.6, enabled=False (opt-in) |
+| 3.2 | Implement `faithfulness.judge_polished_file(polished_path, source_paths, config)` wrapper | attune-author | **done** | Wraps `FaithfulnessJudge` via `asyncio.run`; best-effort: missing extra / missing API key / over-budget all return `JudgeOutcome(score=None, skipped_reason=…)` rather than raising. `block_polish_on_unavailable=True` opt-in for strict CI. |
+| 3.3 | Calibrate threshold against ops-dashboard fixture | attune-author | deferred | Requires real-LLM run; placeholder default `0.95` documented in decisions.md as pre-calibration. Calibration scheduled alongside live-LLM Phase 2 acceptance run. |
+| 3.4 | Document calibration result in `decisions.md` (or design doc) | attune-author | deferred | Empty calibration record retained; will populate when 3.3 runs. |
+| 3.5 | Wire judge into post-polish pipeline (after Phase 1 fact-check) | attune-author | **done** | `generator._run_faithfulness_judge` called after `_run_fact_check`; appends `## Faithfulness review` block when below threshold. `ATTUNE_AUTHOR_FAITHFULNESS=off` env override. |
+| 3.6 | Cost telemetry: aggregate per-feature judge cost; report at end of `regenerate` | attune-author | **done** | Per-process telemetry state on `_faithfulness_telemetry`; `run_maintenance` resets at start and logs INFO summary at end (calls, skipped, total estimated $). |
+| 3.7 | Test: judge runs and writes review block on a deliberately unfaithful synthetic input | attune-author | **done** | `test_pipeline_wiring.py::test_run_faithfulness_judge_appends_review_block_when_below_threshold` + `test_judge.py::test_judge_below_threshold_flags_threshold_not_met` |
+| 3.8 | Test: budget cap skips judge call when estimated cost exceeds threshold | attune-author | **done** | `test_judge.py::test_judge_skipped_when_over_budget` |
+| 3.9 | Update CHANGELOG + README | attune-author | **done** | CHANGELOG under Unreleased; README adds a "Faithfulness review (Phase 3)" subsection. |
 
 ### Phase 3 exit checklist
 
-- [ ] Tasks 3.1–3.9 done
-- [ ] Calibration shows clean separation between pre-fix and post-fix
-      fixture scores
-- [ ] Threshold + cap configurable
-- [ ] Spec status updated
+- [x] Tasks 3.1, 3.2, 3.5–3.9 done (30 new tests)
+- [x] Threshold + cap configurable
+- [x] Spec status updated
+- [ ] Calibration (tasks 3.3, 3.4) — deferred until real-LLM run lands; placeholder default `threshold=0.95` documented in `decisions.md`
 
 ---