feat(polish): faithfulness judge integration (Phase 3 of polish-fact-check)
Phase 3 adds an opt-in faithfulness judge that scores polished
documents against the source files they were generated from.
When the score falls below the configured threshold, a
`## Faithfulness review` block listing the unsupported claims and
the judge's reasoning is appended to the polished file.
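A minimal sketch of the soft-fail append described above. The function names `format_review_block` and `apply_review_block` come from this commit, but the signatures, the block layout, and the wording inside the block are assumptions made for illustration, not the package's actual API.

```python
from typing import Sequence

REVIEW_HEADING = "## Faithfulness review"


def format_review_block(score: float, threshold: float,
                        unsupported_claims: Sequence[str], reasoning: str) -> str:
    """Render the review block as markdown (hypothetical layout)."""
    lines = [
        REVIEW_HEADING,
        "",
        f"Score {score:.2f} is below the threshold {threshold:.2f}.",
        "",
        "Unsupported claims:",
    ]
    lines += [f"- {claim}" for claim in unsupported_claims]
    lines += ["", f"Judge reasoning: {reasoning}"]
    return "\n".join(lines) + "\n"


def apply_review_block(polished_text: str, score: float, threshold: float,
                       unsupported_claims: Sequence[str], reasoning: str) -> str:
    """Append the block only when the score falls below the threshold."""
    if score >= threshold:
        return polished_text  # faithful enough: leave the file untouched
    block = format_review_block(score, threshold, unsupported_claims, reasoning)
    return polished_text.rstrip("\n") + "\n\n" + block
```

Because the block is appended rather than raised as an error, the polished file is still written and a reviewer sees the flagged claims inline.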
Pairs with Phase 1 (AST fact-check after generation) and Phase 2
(ground-truth context injection before generation) to give three
distinct interventions against polish-pass hallucinations.
New package: src/attune_author/faithfulness/
- judge wrapper around attune_rag.eval.faithfulness.FaithfulnessJudge
  via asyncio.run (the polish pipeline is sync)
- FaithfulnessConfig: threshold (0.95 pre-calibration default),
  budget_per_file_usd ($0.10), model (Sonnet 4.6; Haiku is ~1/3
  the cost), block_polish_on_unavailable for strict CI
- estimate_cost_usd: chars-to-tokens heuristic + per-model price
  lookup, used as the budget gate so we never invoke the judge
  when the estimated cost exceeds the cap
- format_review_block + apply_review_block: soft-fail formatter
  matching the Phase 1 ## Unresolved references shape
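The budget gate can be sketched roughly as follows. The field names and defaults of `FaithfulnessConfig` follow this commit message; the four-characters-per-token ratio, the model ids, the prices, and the `within_budget` helper are placeholders assumed for illustration and do not reflect real pricing or the package's actual code.

```python
from dataclasses import dataclass

# Assumed heuristic: roughly 4 characters per token.
CHARS_PER_TOKEN = 4

# Placeholder input prices in USD per million tokens; NOT real pricing.
PRICE_PER_MTOK_USD = {"example-large": 3.00, "example-small": 1.00}


@dataclass(frozen=True)
class FaithfulnessConfig:
    threshold: float = 0.95            # pre-calibration default
    budget_per_file_usd: float = 0.10  # per-file spending cap
    model: str = "example-large"       # placeholder model id
    block_polish_on_unavailable: bool = False  # strict-CI opt-in


def estimate_cost_usd(text: str, model: str) -> float:
    """Estimate judge cost from character count and a per-model price."""
    tokens = len(text) / CHARS_PER_TOKEN
    return tokens / 1_000_000 * PRICE_PER_MTOK_USD[model]


def within_budget(text: str, config: FaithfulnessConfig) -> bool:
    """Hypothetical gate: True when the judge may be invoked."""
    return estimate_cost_usd(text, config.model) <= config.budget_per_file_usd
```

Checking the estimate before the call means an oversized input costs nothing: the judge is skipped rather than invoked and billed.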
Wiring:
- generator._run_faithfulness_judge runs after _run_fact_check
  on every polished file. Reads optional pyproject config.
- generator._faithfulness_telemetry / reset_faithfulness_telemetry:
  per-process counters; run_maintenance resets them at start and
  logs INFO summary at end (calls, skipped, total estimated $).
- ATTUNE_AUTHOR_FAITHFULNESS=off env override.
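One way the per-process counters and the env override might fit together is sketched below. `ATTUNE_AUTHOR_FAITHFULNESS` and `reset_faithfulness_telemetry` are named in this commit; the `FaithfulnessTelemetry` class, the `judge_enabled`, `record`, and `log_summary` helpers, and the summary format are hypothetical.

```python
import logging
import os
from dataclasses import dataclass

logger = logging.getLogger("faithfulness_sketch")


@dataclass
class FaithfulnessTelemetry:
    calls: int = 0
    skipped: int = 0
    total_estimated_usd: float = 0.0


# Module-level state: one set of counters per process.
_telemetry = FaithfulnessTelemetry()


def reset_faithfulness_telemetry() -> None:
    """Zero the counters in place at the start of a maintenance run."""
    _telemetry.calls = 0
    _telemetry.skipped = 0
    _telemetry.total_estimated_usd = 0.0


def judge_enabled() -> bool:
    """The judge runs unless the env var is set to 'off'."""
    return os.environ.get("ATTUNE_AUTHOR_FAITHFULNESS", "").strip().lower() != "off"


def record(estimated_usd: float, skipped: bool) -> None:
    """Count one judged or skipped file."""
    if skipped:
        _telemetry.skipped += 1
    else:
        _telemetry.calls += 1
        _telemetry.total_estimated_usd += estimated_usd


def log_summary() -> str:
    """Log and return the end-of-run INFO summary."""
    summary = (
        f"faithfulness judge: calls={_telemetry.calls} "
        f"skipped={_telemetry.skipped} "
        f"total_estimated_usd={_telemetry.total_estimated_usd:.4f}"
    )
    logger.info(summary)
    return summary
```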
Best-effort contract: a missing attune-rag[claude] extra, a missing
ANTHROPIC_API_KEY, over-budget estimates, and transient API failures
all degrade silently. By default the judge never blocks the polish;
block_polish_on_unavailable opts in to blocking for strict CI.
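The best-effort contract could look roughly like this. `JudgeOutcome(score=None, skipped_reason=…)` and the use of `asyncio.run` come from this commit; the `judge_best_effort` function, its parameters, and the reason strings are assumptions, and the real `FaithfulnessJudge` call is replaced by a caller-supplied coroutine function so the sketch stays self-contained.

```python
import asyncio
import os
from dataclasses import dataclass
from typing import Any, Callable, Coroutine, Optional


@dataclass(frozen=True)
class JudgeOutcome:
    score: Optional[float]
    skipped_reason: Optional[str] = None


def judge_best_effort(
    score_async: Callable[[], Coroutine[Any, Any, float]],
    estimated_cost_usd: float,
    budget_usd: float,
) -> JudgeOutcome:
    """Run an async judge from sync code; never raise to the caller."""
    if not os.environ.get("ANTHROPIC_API_KEY"):
        return JudgeOutcome(None, "missing_api_key")  # hypothetical reason string
    if estimated_cost_usd > budget_usd:
        return JudgeOutcome(None, "over_budget")  # judge is never invoked
    try:
        # The polish pipeline is sync, so drive the coroutine to completion here.
        return JudgeOutcome(asyncio.run(score_async()))
    except Exception as exc:  # transient API failures and anything unexpected
        return JudgeOutcome(None, f"judge_error: {type(exc).__name__}")
```

Returning an outcome with `score=None` instead of raising lets the caller decide whether an unavailable judge should be ignored or treated as a failure in strict CI.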
Tests: 30 new tests under tests/unit/faithfulness/ covering the
budget gate, every skip path, the happy path, the
below-threshold review-block append, env-var override, telemetry
reset, and unexpected-exception swallowing. Full suite: 926
passed, 37 pre-existing skips.
Threshold calibration (tasks 3.3, 3.4) is deferred to the same
real-LLM run that closes Phase 2's live-LLM acceptance gate,
folding two API cycles into one. The default of 0.95 is documented
as pre-calibration in decisions.md.
Spec: docs/specs/polish-fact-check/
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- | 3.6 | Cost telemetry: aggregate per-feature judge cost; report at end of `regenerate` | attune-author | todo | Use existing telemetry hooks if any; otherwise log |
- | 3.7 | Test: judge runs and writes review block on a deliberately unfaithful synthetic input | attune-author | todo | Construct a polished file that contradicts the source |
- | 3.8 | Test: budget cap skips judge call when estimated cost exceeds threshold | attune-author | todo | |
+ | 3.2 | Implement `faithfulness.judge_polished_file(polished_path, source_paths, config)` wrapper | attune-author | **done** | Wraps `FaithfulnessJudge` via `asyncio.run`; best-effort: missing extra / missing API key / over-budget all return `JudgeOutcome(score=None, skipped_reason=…)` rather than raising. `block_polish_on_unavailable=True` opt-in for strict CI. |
+ | 3.3 | Calibrate threshold against ops-dashboard fixture | attune-author | deferred | Requires real-LLM run; placeholder default `0.95` documented in decisions.md as pre-calibration. Calibration scheduled alongside live-LLM Phase 2 acceptance run. |
+ | 3.4 | Document calibration result in `decisions.md` (or design doc) | attune-author | deferred | Empty calibration record retained; will populate when 3.3 runs. |
+ | 3.5 | Wire judge into post-polish pipeline (after Phase 1 fact-check) | attune-author | **done** | `generator._run_faithfulness_judge` called after `_run_fact_check`; appends `## Faithfulness review` block when below threshold. `ATTUNE_AUTHOR_FAITHFULNESS=off` env override. |
+ | 3.6 | Cost telemetry: aggregate per-feature judge cost; report at end of `regenerate` | attune-author | **done** | Per-process telemetry state on `_faithfulness_telemetry`; `run_maintenance` resets at start and logs INFO summary at end (calls, skipped, total estimated $). |
+ | 3.7 | Test: judge runs and writes review block on a deliberately unfaithful synthetic input | attune-author | **done** | `test_pipeline_wiring.py::test_run_faithfulness_judge_appends_review_block_when_below_threshold` + `test_judge.py::test_judge_below_threshold_flags_threshold_not_met` |
+ | 3.8 | Test: budget cap skips judge call when estimated cost exceeds threshold | attune-author | **done** | `test_judge.py::test_judge_skipped_when_over_budget` |