
Commit 3a9da04

feat(polish): faithfulness judge integration (Phase 3 of polish-fact-check)
Phase 3 adds an opt-in faithfulness judge that scores polished documents against the source files they were generated from. When the score falls below the configured threshold, a `## Faithfulness review` block listing the unsupported claims and the judge's reasoning is appended to the polished file. Pairs with Phase 1 (AST fact-check after generation) and Phase 2 (ground-truth context injection before generation) to give three distinct interventions against polish-pass hallucinations.

New package: src/attune_author/faithfulness/
- judge wrapper around attune_rag.eval.faithfulness.FaithfulnessJudge via asyncio.run (the polish pipeline is sync)
- FaithfulnessConfig: threshold (0.95 pre-calibration default), budget_per_file_usd ($0.10), model (Sonnet 4.6; Haiku is ~1/3 the cost), block_polish_on_unavailable for strict CI
- estimate_cost_usd: chars-to-tokens heuristic + per-model price lookup, used as the budget gate so we never invoke the judge when the estimated cost exceeds the cap
- format_review_block + apply_review_block: soft-fail formatter matching the Phase 1 `## Unresolved references` shape

Wiring:
- generator._run_faithfulness_judge runs after _run_fact_check on every polished file. Reads optional pyproject config.
- generator._faithfulness_telemetry / reset_faithfulness_telemetry: per-process counters; run_maintenance resets them at start and logs INFO summary at end (calls, skipped, total estimated $).
- ATTUNE_AUTHOR_FAITHFULNESS=off env override.

Best-effort contract: missing attune-rag[claude], missing ANTHROPIC_API_KEY, over-budget estimates, transient API failures all degrade silently. The judge never blocks the polish.

Tests: 30 new tests under tests/unit/faithfulness/ covering the budget gate, every skip path, the happy path, the below-threshold review-block append, env-var override, telemetry reset, and unexpected-exception swallowing. Full suite: 926 passed, 37 pre-existing skips.

Threshold calibration (tasks 3.3, 3.4) deferred to the same real-LLM run that closes Phase 2's live-LLM acceptance gate, folding two API cycles into one. Default of 0.95 is documented as pre-calibration in decisions.md.

Spec: docs/specs/polish-fact-check/

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
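Reviewer note: the sync-over-async bridge named in the commit message can be pictured roughly as below. This is a minimal sketch, not the code in this commit. `FakeJudge`, `score_sync`, and the `JudgeOutcome` field layout are assumptions for illustration, and the real `FaithfulnessJudge.score` signature is not shown anywhere on this page.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class JudgeOutcome:
    # Assumed shape: the commit only mentions `score` and `skipped_reason`.
    score: Optional[float]
    skipped_reason: Optional[str] = None


class FakeJudge:
    """Toy stand-in for an async judge; NOT the attune_rag class."""

    async def score(self, answer: str, sources: list[str]) -> float:
        await asyncio.sleep(0)  # pretend to await a network call
        supported = sum(1 for s in sources if s in answer)
        return supported / len(sources) if sources else 1.0


def score_sync(judge, answer: str, sources: list[str]) -> JudgeOutcome:
    """Call the async judge from sync code; never raise (best-effort)."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        pass  # no loop is running, so asyncio.run is safe to use
    else:
        # asyncio.run cannot be called from inside a running loop.
        return JudgeOutcome(score=None, skipped_reason="event loop already running")
    try:
        return JudgeOutcome(score=asyncio.run(judge.score(answer, sources)))
    except Exception as exc:  # swallow judge failures into a skip
        return JudgeOutcome(score=None, skipped_reason=f"judge error: {exc}")
```

Calling `score_sync(FakeJudge(), "alpha beta", ["alpha", "gamma"])` returns an outcome with `score == 0.5`. The explicit running-loop check makes the limitation noted in decisions.md visible as a skip reason rather than an exception.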
1 parent d5c060b commit 3a9da04

12 files changed

Lines changed: 1168 additions & 14 deletions

CHANGELOG.md

Lines changed: 36 additions & 0 deletions
@@ -15,6 +15,42 @@ changes land, not at tag time.

 ### Added

+- **Polish fact-check Phase 3 — faithfulness judge.** Wraps
+  `attune_rag.eval.faithfulness.FaithfulnessJudge` as a
+  post-polish step: scores each polished file's claims against
+  the source files it was generated from. When the score falls
+  below the configured threshold, appends a
+  `## Faithfulness review` block listing the unsupported claims
+  and the judge's reasoning. Best-effort: missing
+  `attune-rag[claude]`, missing `ANTHROPIC_API_KEY`, over-budget
+  estimates, and transient API failures all degrade silently
+  rather than blocking the polish.
+  - New package: `src/attune_author/faithfulness/` with the
+    judge wrapper, `FaithfulnessConfig`, `JudgeOutcome`,
+    `estimate_cost_usd` budget-gate helper, and a
+    `format_review_block` / `apply_review_block` soft-fail pair
+    that mirrors the Phase 1 `## Unresolved references` shape.
+  - Config schema:
+    `[tool.attune-author.fact-check.faithfulness]` with
+    `enabled`, `threshold` (default 0.95, pre-calibration), `budget_per_file_usd`
+    (default $0.10), `model` (default Sonnet 4.6 — Haiku 4.5 is
+    cheaper for high-volume runs), and
+    `block_polish_on_unavailable` (default False — flip to True
+    in CI where missing deps should be loud).
+  - Cost telemetry: per-process counters on the generator
+    module; `run_maintenance` resets them at start and logs an
+    INFO summary at end with call count, skip count, and total
+    estimated USD spent.
+  - `ATTUNE_AUTHOR_FAITHFULNESS=off` env-var override for one-off
+    disable without editing pyproject.
+  - 30 new tests under `tests/unit/faithfulness/`.
+  - Threshold calibration against the ops-dashboard fixture
+    (tasks 3.3, 3.4) is **deferred** until the live-LLM Phase 2
+    acceptance run; today's default of 0.95 is documented as
+    pre-calibration in `decisions.md`.
+  - Spec: `docs/specs/polish-fact-check/`. Phase 4 (tutorial
+    code-fence mypy) remains on the roadmap.
+
 - **Polish fact-check Phase 2 — ground-truth context
   injection.** Builds three sentinel-tagged blocks
   (`<cli_help>`, `<public_api>`, `<dataclasses>`) and injects
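Reviewer note: the `format_review_block` / `apply_review_block` pair described in the CHANGELOG entry could look something like the sketch below. The function names come from the commit. The signatures, the block layout, and the replace-on-rerun behaviour are assumptions made for illustration only.

```python
REVIEW_HEADING = "## Faithfulness review"


def format_review_block(score: float, threshold: float,
                        unsupported: list[str], reasoning: str) -> str:
    """Render the review section as markdown (the layout is an assumption)."""
    lines = [
        REVIEW_HEADING,
        "",
        f"Score {score:.2f} is below the threshold of {threshold:.2f}.",
        "",
        "Unsupported claims:",
        "",
    ]
    lines += [f"- {claim}" for claim in unsupported]
    lines += ["", f"Judge reasoning: {reasoning}", ""]
    return "\n".join(lines)


def apply_review_block(polished: str, block: str) -> str:
    """Append the block, dropping a block left behind by an earlier run.

    Replacing a stale block is an assumed behaviour; it keeps repeated
    polish runs from stacking review sections. Only a heading at the
    start of a line (after a newline) counts as an existing block.
    """
    marker = polished.find("\n" + REVIEW_HEADING)
    body = polished[:marker] if marker != -1 else polished
    return body.rstrip("\n") + "\n\n" + block
```

Because the stale block is cut before the new one is appended, applying the same block twice yields the same text as applying it once.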

README.md

Lines changed: 31 additions & 0 deletions
@@ -162,6 +162,37 @@ CLI flags, fabricated private-module imports, wrong route
 paths, hallucinated counts) at the prompt layer, rather than
 relying solely on the post-generation fact-check to catch them.

+## Faithfulness review (Phase 3)
+
+The Phase 3 faithfulness judge wraps
+`attune_rag.eval.faithfulness.FaithfulnessJudge` as a
+post-polish step: it scores each polished file's claims against
+the source files it was generated from. When the score falls
+below the configured threshold, a `## Faithfulness review`
+block is appended to the polished file listing the unsupported
+claims and the judge's reasoning.
+
+The judge is **opt-in** because it makes real Anthropic API
+calls. To enable:
+
+```toml
+[tool.attune-author.fact-check.faithfulness]
+enabled = true
+threshold = 0.95              # below this triggers a review block
+budget_per_file_usd = 0.10    # skip if estimated cost exceeds cap
+model = "claude-sonnet-4-6"   # haiku is ~1/3 the cost
+```
+
+The judge is best-effort. Missing `attune-rag[claude]`, missing
+`ANTHROPIC_API_KEY`, over-budget cost estimates, and transient
+API failures all degrade silently rather than blocking the
+polish. Set `block_polish_on_unavailable = true` in CI lanes
+where missing deps should fail loudly instead.
+
+End-of-run telemetry (call count, skip count, total estimated
+USD) logs at INFO level after `attune-author regenerate`. Set
+`ATTUNE_AUTHOR_FAITHFULNESS=off` to disable for a single run.
+
 ## Polish cache

 `attune-author` caches LLM polish responses on disk so re-generating an

docs/specs/polish-fact-check/decisions.md

Lines changed: 35 additions & 0 deletions
@@ -73,3 +73,38 @@ To be filled in during Phase 3 implementation:
   faithfulness judge ships, it will require its own real-LLM
   calibration run. Folding the cost-delta measurement into that
   run avoids two separate real-LLM cycles.
+- 2026-05-16 — Phase 3 shipped. New decisions captured during
+  implementation:
+  - **Opt-in default**: `enabled=False` ships in
+    `FaithfulnessConfig` and the pyproject loader, because the
+    judge makes real Anthropic API calls and we shouldn't bill
+    users for it silently on the first run after install. The
+    Phase 1 fact-check is enabled by default (no API calls); the
+    Phase 3 judge is not.
+  - **Synchronous wrapper via `asyncio.run`**: the existing
+    polish pipeline is synchronous, so the async
+    `FaithfulnessJudge.score` coroutine is bridged with
+    `asyncio.run`. This precludes calling the judge from inside
+    a running event loop (we don't, today), but keeps the
+    surface aligned with the rest of attune-author.
+  - **Best-effort vs strict**: missing extras / missing API key
+    / transient failures all default to `JudgeOutcome(score=None,
+    skipped_reason=…)` rather than raising. CI lanes that need
+    loud failures opt in via `block_polish_on_unavailable = true`.
+  - **Budget gate uses character-count heuristic, not tokenizer**:
+    `estimate_cost_usd(chars, model)` divides chars by 4 to get
+    a rough token count and multiplies by a per-model price
+    lookup. Accurate to ~20% — well inside what a $0.10 budget
+    cap cares about. A real tokenizer is a future change if
+    drift surfaces.
+  - **Cost telemetry as function attribute, not module global**:
+    `_faithfulness_telemetry()` stores the counter dict on its
+    own `_state` attribute so it's resettable, mockable, and
+    doesn't leak module-level state. Mirrors how the polish
+    cache exposes its store.
+  - **Calibration deferred**: tasks 3.3 and 3.4 require a real
+    LLM run against the ops-dashboard pre-fix and post-fix
+    fixtures. The placeholder threshold of `0.95` ships as the
+    default and the calibration is scheduled to land alongside
+    the live-LLM Phase 2 acceptance run so a single real-API
+    cycle covers both phases' open work.
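Reviewer note: the budget-gate decision above (characters divided by 4, times a per-model price) works out to roughly the following. This is a sketch under stated assumptions. The model keys and prices are placeholders, not real price data. `within_budget` is a hypothetical helper, and only input tokens are counted here, which may differ from the shipped estimator.

```python
CHARS_PER_TOKEN = 4  # heuristic stated in decisions.md

# Placeholder input prices in USD per million tokens. These are NOT
# real prices; they only mirror the "~1/3 the cost" ratio in the README.
PRICE_PER_MTOK_USD = {"example-large": 3.00, "example-small": 1.00}


def estimate_cost_usd(chars: int, model: str) -> float:
    """Estimate judge cost from a character count, without a tokenizer."""
    tokens = chars / CHARS_PER_TOKEN
    return tokens / 1_000_000 * PRICE_PER_MTOK_USD[model]


def within_budget(chars: int, model: str, budget_per_file_usd: float) -> bool:
    """Budget gate: True means the judge call may proceed."""
    return estimate_cost_usd(chars, model) <= budget_per_file_usd
```

With these placeholder prices, 400,000 characters on `example-large` is about 100,000 tokens and an estimated $0.30, so a $0.10 cap would skip the call. A 40,000-character input is estimated at $0.03 and passes the gate.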

docs/specs/polish-fact-check/tasks.md

Lines changed: 13 additions & 14 deletions
@@ -110,23 +110,22 @@ the `FactCheckReport` plumbing.

 | # | Task | Layer | Status | Notes |
 |---|------|-------|--------|-------|
-| 3.1 | Add faithfulness-threshold + budget-cap config to `[tool.attune-author.fact-check]` | attune-author | todo | Default threshold `0.95`; default cap `$0.10/feature` |
-| 3.2 | Implement `faithfulness.judge_polished_file(polished_path, source_paths, config)` wrapper | attune-author | todo | Wraps `attune_rag.eval.faithfulness.FaithfulnessJudge` |
-| 3.3 | Calibrate threshold against ops-dashboard fixture | attune-author | todo | Pre-fix should score < 0.9 mean; post-fix ≥ 0.95 |
-| 3.4 | Document calibration result in `decisions.md` (or design doc) | attune-author | todo | Pre-committed matrix entry; concrete numbers |
-| 3.5 | Wire judge into post-polish pipeline (after Phase 1 fact-check) | attune-author | todo | Append `## Faithfulness review` block when below threshold |
-| 3.6 | Cost telemetry: aggregate per-feature judge cost; report at end of `regenerate` | attune-author | todo | Use existing telemetry hooks if any; otherwise log |
-| 3.7 | Test: judge runs and writes review block on a deliberately unfaithful synthetic input | attune-author | todo | Construct a polished file that contradicts the source |
-| 3.8 | Test: budget cap skips judge call when estimated cost exceeds threshold | attune-author | todo | |
-| 3.9 | Update CHANGELOG + README | attune-author | todo | |
+| 3.1 | Add faithfulness-threshold + budget-cap config to `[tool.attune-author.fact-check]` | attune-author | **done** | `[tool.attune-author.fact-check.faithfulness]` sub-table; defaults threshold=0.95, budget=$0.10, model=Sonnet 4.6, enabled=False (opt-in) |
+| 3.2 | Implement `faithfulness.judge_polished_file(polished_path, source_paths, config)` wrapper | attune-author | **done** | Wraps `FaithfulnessJudge` via `asyncio.run`; best-effort: missing extra / missing API key / over-budget all return `JudgeOutcome(score=None, skipped_reason=…)` rather than raising. `block_polish_on_unavailable=True` opt-in for strict CI. |
+| 3.3 | Calibrate threshold against ops-dashboard fixture | attune-author | deferred | Requires real-LLM run; placeholder default `0.95` documented in decisions.md as pre-calibration. Calibration scheduled alongside live-LLM Phase 2 acceptance run. |
+| 3.4 | Document calibration result in `decisions.md` (or design doc) | attune-author | deferred | Empty calibration record retained; will populate when 3.3 runs. |
+| 3.5 | Wire judge into post-polish pipeline (after Phase 1 fact-check) | attune-author | **done** | `generator._run_faithfulness_judge` called after `_run_fact_check`; appends `## Faithfulness review` block when below threshold. `ATTUNE_AUTHOR_FAITHFULNESS=off` env override. |
+| 3.6 | Cost telemetry: aggregate per-feature judge cost; report at end of `regenerate` | attune-author | **done** | Per-process telemetry state on `_faithfulness_telemetry`; `run_maintenance` resets at start and logs INFO summary at end (calls, skipped, total estimated $). |
+| 3.7 | Test: judge runs and writes review block on a deliberately unfaithful synthetic input | attune-author | **done** | `test_pipeline_wiring.py::test_run_faithfulness_judge_appends_review_block_when_below_threshold` + `test_judge.py::test_judge_below_threshold_flags_threshold_not_met` |
+| 3.8 | Test: budget cap skips judge call when estimated cost exceeds threshold | attune-author | **done** | `test_judge.py::test_judge_skipped_when_over_budget` |
+| 3.9 | Update CHANGELOG + README | attune-author | **done** | CHANGELOG under Unreleased; README adds a "Faithfulness review (Phase 3)" subsection. |

 ### Phase 3 exit checklist

-- [ ] Tasks 3.1–3.9 done
-- [ ] Calibration shows clean separation between pre-fix and post-fix
-      fixture scores
-- [ ] Threshold + cap configurable
-- [ ] Spec status updated
+- [x] Tasks 3.1, 3.2, 3.5–3.9 done (30 new tests)
+- [x] Threshold + cap configurable
+- [x] Spec status updated
+- [ ] Calibration (tasks 3.3, 3.4) — deferred until real-LLM run lands; placeholder default `threshold=0.95` documented in `decisions.md`

 ---
