feat(polish): faithfulness judge integration (Phase 3 of polish-fact-check) #36
Merged
This was referenced May 16, 2026
…check)
Phase 3 adds an opt-in faithfulness judge that scores polished
documents against the source files they were generated from.
When the score falls below the configured threshold, a
`## Faithfulness review` block listing the unsupported claims and
the judge's reasoning is appended to the polished file.
Pairs with Phase 1 (AST fact-check after generation) and Phase 2
(ground-truth context injection before generation) to give three
distinct interventions against polish-pass hallucinations.
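The below-threshold behaviour described above can be sketched roughly as follows. This is an illustration under assumptions, not the package's code: `JudgeResult`, `append_review_if_needed`, and the layout inside the block are hypothetical; only the `## Faithfulness review` heading and the "append when the score is below the threshold" rule come from the description.

```python
from dataclasses import dataclass, field


@dataclass
class JudgeResult:
    """Hypothetical container for what a judge call returns."""
    score: float
    unsupported_claims: list[str] = field(default_factory=list)
    reasoning: str = ""


def append_review_if_needed(polished: str, result: JudgeResult, threshold: float) -> str:
    """Return the polished text, adding a review block only when score < threshold."""
    if result.score >= threshold:
        return polished  # at or above threshold: file is left untouched
    lines = [
        "",
        "## Faithfulness review",
        "",
        f"Score: {result.score:.2f} (threshold {threshold:.2f})",
        "",
    ]
    lines += [f"- {claim}" for claim in result.unsupported_claims]
    if result.reasoning:
        lines += ["", result.reasoning]
    # Soft-fail: the original content is kept and the block is appended after it.
    return polished.rstrip("\n") + "\n" + "\n".join(lines) + "\n"
```

The polished content is never rewritten here; the block is purely additive, which is what makes the failure mode "soft".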
New package: src/attune_author/faithfulness/
- judge wrapper around attune_rag.eval.faithfulness.FaithfulnessJudge
via asyncio.run (the polish pipeline is sync)
- FaithfulnessConfig: threshold (0.95 pre-calibration default),
budget_per_file_usd ($0.10), model (Sonnet 4.6 — Haiku is ~1/3
the cost), block_polish_on_unavailable for strict CI
- estimate_cost_usd: chars-to-tokens heuristic + per-model price
lookup, used as the budget gate so we never invoke the judge
when the estimated cost exceeds the cap
- format_review_block + apply_review_block: soft-fail formatter
matching the Phase 1 ## Unresolved references shape
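A minimal sketch of the budget gate described in the list above, assuming a 4-characters-per-token heuristic and a made-up price table. The function signature, the model keys, and the prices are placeholders for illustration and are not taken from the package or from any real price list.

```python
# Hypothetical prices in USD per million input tokens (placeholder values).
PRICE_PER_MTOK_USD = {"model-large": 3.00, "model-small": 1.00}

# Assumed chars-to-tokens ratio; the real heuristic may differ.
CHARS_PER_TOKEN = 4


def estimate_cost_usd(total_chars: int, model: str) -> float:
    """Estimate the input cost of one judge call from a character count."""
    tokens = total_chars / CHARS_PER_TOKEN
    return tokens / 1_000_000 * PRICE_PER_MTOK_USD[model]


def within_budget(polished: str, sources: list[str], model: str, budget_usd: float) -> bool:
    """Gate: True only when the estimated cost does not exceed the per-file cap."""
    total_chars = len(polished) + sum(len(s) for s in sources)
    return estimate_cost_usd(total_chars, model) <= budget_usd
```

Because the estimate needs only string lengths, the gate can run before any network call, so an over-budget file costs nothing.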
Wiring:
- generator._run_faithfulness_judge runs after _run_fact_check
on every polished file. Reads optional pyproject config.
- generator._faithfulness_telemetry / reset_faithfulness_telemetry:
per-process counters; run_maintenance resets them at start and
logs INFO summary at end (calls, skipped, total estimated $).
- ATTUNE_AUTHOR_FAITHFULNESS=off env override.
Best-effort contract: missing attune-rag[claude], missing
ANTHROPIC_API_KEY, over-budget estimates, transient API failures
all degrade silently. The judge never blocks the polish.
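One way to picture the best-effort contract together with the sync bridge is the sketch below. `score_best_effort` and its `strict` flag are hypothetical names; `strict` stands in for the `block_polish_on_unavailable` option, and the real wrapper may distinguish failure kinds that this sketch lumps together.

```python
import asyncio
from typing import Any, Callable, Coroutine, Optional


def score_best_effort(
    make_coro: Callable[[], Coroutine[Any, Any, float]],
    strict: bool = False,
) -> Optional[float]:
    """Run an async scorer from synchronous code.

    Returns the score, or None when anything goes wrong, so the caller can
    carry on with the polish. With strict=True the error propagates instead.
    """
    try:
        # asyncio.run starts a fresh event loop for this single call.
        return asyncio.run(make_coro())
    except Exception:
        if strict:
            raise
        return None  # degrade silently: no score, no review block
```

`asyncio.run` cannot be called while another event loop is already running in the same thread, so this bridge fits a pipeline that is synchronous end to end.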
Tests: 30 new tests under tests/unit/faithfulness/ covering the
budget gate, every skip path, the happy path, the
below-threshold review-block append, env-var override, telemetry
reset, and unexpected-exception swallowing. Full suite: 926
passed, 37 pre-existing skips.
Threshold calibration (tasks 3.3, 3.4) deferred to the same
real-LLM run that closes Phase 2's live-LLM acceptance gate —
folding two API cycles into one. Default of 0.95 is documented as
pre-calibration in decisions.md.
Spec: docs/specs/polish-fact-check/
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
42f3aa6 to
3a9da04
4 tasks
silversurfer562
added a commit
that referenced
this pull request
May 22, 2026
#38)
Follow-up to the polish-fact-check Phase 3 PR (#36) that landed the faithfulness judge. This commit adds the local helpers used to smoke-test the judge end-to-end without paying for a full attune-author regenerate cycle.
Two changes:
1. scripts/test_faithfulness.py
   Tiny harness that picks the smallest feature in features.yaml (fewest source files) and regenerates its 3 core kinds (concept/task/reference) with telemetry-reset + summary-print + review-block detection. Cost on Haiku 4.5 ≈ $0.03 per run. Refuses to run without ANTHROPIC_API_KEY in env.
   Usage:
     uv run python scripts/test_faithfulness.py
     uv run python scripts/test_faithfulness.py <feature_name>
2. pyproject.toml: enable the judge for attune-author's own self-dogfood help regeneration. With this, anyone running `attune-author regenerate` against attune-author with auth available exercises the Phase 3 pipeline end-to-end. This matches the pattern attune-author already uses for the polish pass (live API calls during dogfood). Configured on Haiku 4.5 (~1/3 the cost of Sonnet 4.6) since the threshold + budget defaults are pre-calibration and a cheaper model is fine for the initial measurement pass.
Why ship this as a follow-up rather than baking it into #36: the Phase 3 PR was scoped to the implementation + tests; the spec defines `enabled=false` as the global default (opt-in, since the judge makes real API calls). Flipping it on for the attune-author repo itself is a per-project preference, not a default change. Same shape as how attune-author has always defaulted polish-strict on for its own dogfood while the package default is lenient.
Post-Phase-0 of the sibling-subscription-auth spec (attune-ai PR #406), this also exercises the subscription-routing path for Claude Code users. The wire-up to actually use claude_agent_sdk lives in Phase 1, which hasn't shipped yet, so today this still requires ANTHROPIC_API_KEY.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Stacked on #35 (Phase 2) — please review and merge that first.
- Integrates `attune_rag.eval.faithfulness.FaithfulnessJudge` as a post-polish step. Every polished file is scored against its source files; a below-threshold score appends a `## Faithfulness review` block listing the unsupported claims and the judge's reasoning.
- New `src/attune_author/faithfulness/` package: judge wrapper (sync-bridged via `asyncio.run`), `FaithfulnessConfig` dataclass, char-based cost estimator that gates the call before paying for it, and a `format_review_block` / `apply_review_block` pair matching the Phase 1 soft-fail shape.
- Wired into `generator.apply_polish_results` after the Phase 1 fact-check pass. Per-process cost telemetry counters are reset at `run_maintenance` start and summary-logged at end.
- Opt-in (default `enabled = false` in config). The judge is best-effort: missing `attune-rag[claude]`, missing `ANTHROPIC_API_KEY`, over-budget cost estimates, and transient API failures all degrade silently rather than blocking the polish. CI lanes can opt into strict mode with `block_polish_on_unavailable = true`.
Motivation
Phase 1 catches mistakes after generation (regex/AST matching for known-bad shapes: invented imports, unknown CLI flags, broken links, wrong counts). Phase 2 prevents them during generation (ground-truth context injection). Phase 3 covers the middle ground: an LLM-as-judge that catches shapes Phase 1 can't pattern-match (e.g. the missing-security-callout for `0.0.0.0` from attune-ai PR #351's fixture) and that Phase 2 can't fully prevent.
Configuration
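As a rough picture of the options involved, the sketch below models the settings named in this PR as a dataclass loaded from a mapping (for example, a table parsed out of `pyproject.toml`). The field names and the 0.95 / $0.10 / `enabled = false` defaults come from the description; the model placeholder, the `False` default for `block_polish_on_unavailable`, and the `from_mapping` helper are assumptions, and this is not the package's actual `FaithfulnessConfig`.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping


@dataclass(frozen=True)
class FaithfulnessConfig:
    """Sketch of the judge settings; not the real class."""
    enabled: bool = False                       # opt-in: the judge makes real API calls
    threshold: float = 0.95                     # pre-calibration default
    budget_per_file_usd: float = 0.10           # cap checked before each call
    model: str = "placeholder-model-id"         # assumption: real identifier not shown here
    block_polish_on_unavailable: bool = False   # assumed default; True for strict CI

    @classmethod
    def from_mapping(cls, raw: Mapping[str, Any]) -> "FaithfulnessConfig":
        """Build a config from a parsed table, ignoring keys this sketch does not know."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in raw.items() if k in known})
```

For example, `FaithfulnessConfig.from_mapping({"enabled": True})` turns the judge on while keeping every other default.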
End-of-run telemetry log:
Test plan
- 30 new tests under `tests/unit/faithfulness/` covering the budget gate, every skip path (disabled, missing file, no sources, over-budget, missing extra, missing key, transient failure), the happy path with mocked `FaithfulnessJudge`, the below-threshold review-block append, env-var override (`ATTUNE_AUTHOR_FAITHFULNESS=off`), telemetry counters + reset, and unexpected-exception swallowing
- `ruff check` clean across all touched files
- Default `threshold=0.95` is documented as pre-calibration in `decisions.md`.
Notes for review
- `uv.lock` is intentionally excluded: pre-existing drift (lockfile recorded `attune-author 0.6.1`), separate cleanup PR.
- … `pyproject.toml`.
- `asyncio.run`: the existing polish pipeline is synchronous and `FaithfulnessJudge.score` is async. The bridge is at one call site; expanding to native-async polish is a future change if drift surfaces.
- … `decisions.md`.
🤖 Generated with Claude Code