feat(polish): faithfulness judge integration (Phase 3 of polish-fact-check) by silversurfer562 · Pull Request #36 · Smart-AI-Memory/attune-author

silversurfer562 · 2026-05-16T07:01:48Z

Summary

Stacked on #35 (Phase 2) — please review and merge that first.

Ship Phase 3 of the polish-fact-check spec: wrap attune_rag.eval.faithfulness.FaithfulnessJudge as a post-polish step. Every polished file gets scored against its source files; below-threshold scores append a ## Faithfulness review block listing the unsupported claims and the judge's reasoning.
New src/attune_author/faithfulness/ package: judge wrapper (sync-bridged via asyncio.run), FaithfulnessConfig dataclass, char-based cost estimator that gates the call before paying for it, and a format_review_block / apply_review_block pair matching the Phase 1 soft-fail shape.
Wired into generator.apply_polish_results after the Phase 1 fact-check pass. Per-process cost telemetry counters reset at run_maintenance start and summary-logged at end.
Opt-in: the judge is disabled by default (enabled = false in config). It is best-effort: a missing attune-rag[claude] extra, a missing ANTHROPIC_API_KEY, an over-budget cost estimate, or a transient API failure each degrades silently rather than blocking the polish. CI lanes can opt into strict mode with block_polish_on_unavailable = true.
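
As a rough illustration of the below-threshold behaviour described above, here is a minimal sketch of a format/apply pair. The function names come from this PR, but the signatures, the `JudgeResult` shape, and the exact block layout are assumptions made for illustration, not the shipped implementation.

```python
from dataclasses import dataclass, field

REVIEW_HEADING = "## Faithfulness review"


@dataclass
class JudgeResult:
    """Hypothetical result shape; the real judge output may differ."""
    score: float
    unsupported_claims: list[str] = field(default_factory=list)
    reasoning: str = ""


def format_review_block(result: JudgeResult, threshold: float) -> str:
    """Render the review block as plain markdown text (assumed layout)."""
    lines = [
        REVIEW_HEADING,
        "",
        f"Score {result.score:.2f} is below the threshold {threshold:.2f}.",
        "",
        "Unsupported claims:",
    ]
    lines += [f"- {claim}" for claim in result.unsupported_claims]
    lines += ["", f"Judge reasoning: {result.reasoning}"]
    return "\n".join(lines) + "\n"


def apply_review_block(polished: str, result: JudgeResult, threshold: float) -> str:
    """Append the block only when the score falls below the threshold."""
    if result.score >= threshold:
        return polished  # soft-fail path not taken: file is left untouched
    return polished.rstrip("\n") + "\n\n" + format_review_block(result, threshold)
```

The polished text is never rewritten or rejected in this sketch; a low score only adds a trailing section for a human to review.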

Motivation

Phase 1 catches mistakes after generation (regex/AST matching for known-bad shapes — invented imports, unknown CLI flags, broken links, wrong counts). Phase 2 prevents them during generation (ground-truth context injection). Phase 3 covers the middle ground: an LLM-as-judge that catches shapes Phase 1 can't pattern-match (e.g. the missing-security-callout for 0.0.0.0 from attune-ai PR #351's fixture) and that Phase 2 can't fully prevent.

Configuration

[tool.attune-author.fact-check.faithfulness]
enabled = true                # opt-in; defaults to false
threshold = 0.95              # below this triggers a review block
budget_per_file_usd = 0.10    # skip if estimated cost exceeds cap
model = "claude-sonnet-4-6"   # haiku is ~1/3 the cost
block_polish_on_unavailable = false  # set true in strict CI lanes

End-of-run telemetry log:

INFO Faithfulness judge: 11 call(s), 2 skipped, estimated cost $0.0537
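
A log line of this shape could be produced by a small per-process counter object. The sketch below is an assumption about structure: the `FaithfulnessTelemetry` class and its method names are hypothetical, and only the message format is taken from the log line above.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("attune_author.faithfulness")  # assumed logger name


@dataclass
class FaithfulnessTelemetry:
    """Hypothetical per-process counters for judge calls."""
    calls: int = 0
    skipped: int = 0
    estimated_cost_usd: float = 0.0

    def record_call(self, cost_usd: float) -> None:
        self.calls += 1
        self.estimated_cost_usd += cost_usd

    def record_skip(self) -> None:
        self.skipped += 1

    def reset(self) -> None:
        """Called at the start of a run so counts do not leak between runs."""
        self.calls = 0
        self.skipped = 0
        self.estimated_cost_usd = 0.0

    def summary(self) -> str:
        return (
            f"Faithfulness judge: {self.calls} call(s), "
            f"{self.skipped} skipped, "
            f"estimated cost ${self.estimated_cost_usd:.4f}"
        )


telemetry = FaithfulnessTelemetry()
telemetry.record_call(0.01)
telemetry.record_skip()
logger.info(telemetry.summary())
```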

Test plan

Unit tests: 30 new tests under tests/unit/faithfulness/ covering the budget gate, every skip path (disabled, missing file, no sources, over-budget, missing extra, missing key, transient failure), the happy path with a mocked FaithfulnessJudge, the below-threshold review-block append, the env-var override (ATTUNE_AUTHOR_FAITHFULNESS=off), telemetry counters and reset, and unexpected-exception swallowing.
Full attune-author suite: 926 passed, 37 pre-existing skips
ruff check clean across all touched files
Live-LLM acceptance (gated): threshold calibration against the ops-dashboard pre-fix / post-fix fixture (tasks 3.3 + 3.4) — scheduled to land alongside Phase 2's live-LLM acceptance run so a single real-API cycle covers both phases' open items. The placeholder default threshold=0.95 is documented as pre-calibration in decisions.md.

Notes for review

uv.lock is intentionally excluded — pre-existing drift (lockfile recorded attune-author 0.6.1), separate cleanup PR.
Why opt-in: the judge makes real Anthropic API calls. We shouldn't bill users for it silently on first run after install. Phase 1 (no API calls) defaults on; Phase 3 (real API calls) defaults off. Easy to flip per-project via pyproject.toml.
Why asyncio.run: the existing polish pipeline is synchronous and FaithfulnessJudge.score is async. The bridge is at one call site; expanding to native-async polish is a future change if drift surfaces.
Why a char-based cost estimate rather than a real tokenizer: the budget gate only needs to be precise relative to its ~$0.10 cap, and a chars/4 token estimate is accurate to ~20%, which is well within that tolerance. Documented in decisions.md.
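
The last two notes can be sketched together: gate on a chars/4 cost estimate first, then bridge into the async judge with asyncio.run. This is illustrative only. `estimate_cost_usd` is named in this PR but its signature here is assumed; the price table uses placeholder model names and made-up prices; `StubJudge` and `score_sync` are hypothetical stand-ins, and the real `FaithfulnessJudge.score` signature may differ.

```python
import asyncio
from typing import Optional

# Placeholder prices in USD per million input tokens; NOT real pricing.
PRICE_PER_MTOK_USD = {"example-large": 3.00, "example-small": 1.00}
CHARS_PER_TOKEN = 4  # the chars/4 heuristic from the note above


def estimate_cost_usd(text_chars: int, model: str) -> float:
    """Estimate input cost from character count (assumed signature)."""
    tokens = text_chars / CHARS_PER_TOKEN
    return tokens / 1_000_000 * PRICE_PER_MTOK_USD[model]


class StubJudge:
    """Stand-in for an async judge; returns a fixed score."""

    async def score(self, polished: str, sources: list[str]) -> float:
        await asyncio.sleep(0)  # yield once, as a real network call would
        return 0.97


def score_sync(judge, polished: str, sources: list[str],
               model: str, budget_usd: float) -> Optional[float]:
    """Return a score, or None when the call is skipped or fails."""
    chars = len(polished) + sum(len(s) for s in sources)
    if estimate_cost_usd(chars, model) > budget_usd:
        return None  # over budget: skip before paying for the call
    try:
        # Single sync-to-async bridge; valid because no event loop is running.
        return asyncio.run(judge.score(polished, sources))
    except Exception:
        return None  # best-effort: failures degrade silently


print(score_sync(StubJudge(), "polished text", ["source text"], "example-large", 0.10))
```

One caveat with this design: asyncio.run raises a RuntimeError if called from a thread that already has a running event loop, so the bridge only works while the surrounding pipeline stays synchronous.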

🤖 Generated with Claude Code

…check)

Phase 3 adds an opt-in faithfulness judge that scores polished documents against the source files they were generated from. When the score falls below the configured threshold, a `## Faithfulness review` block listing the unsupported claims and the judge's reasoning is appended to the polished file.

Pairs with Phase 1 (AST fact-check after generation) and Phase 2 (ground-truth context injection before generation) to give three distinct interventions against polish-pass hallucinations.

New package: src/attune_author/faithfulness/
- judge wrapper around attune_rag.eval.faithfulness.FaithfulnessJudge via asyncio.run (the polish pipeline is sync)
- FaithfulnessConfig: threshold (0.95 pre-calibration default), budget_per_file_usd ($0.10), model (Sonnet 4.6; Haiku is ~1/3 the cost), block_polish_on_unavailable for strict CI
- estimate_cost_usd: chars-to-tokens heuristic + per-model price lookup, used as the budget gate so we never invoke the judge when the estimated cost exceeds the cap
- format_review_block + apply_review_block: soft-fail formatter matching the Phase 1 ## Unresolved references shape

Wiring:
- generator._run_faithfulness_judge runs after _run_fact_check on every polished file. Reads optional pyproject config.
- generator._faithfulness_telemetry / reset_faithfulness_telemetry: per-process counters; run_maintenance resets them at start and logs INFO summary at end (calls, skipped, total estimated $).
- ATTUNE_AUTHOR_FAITHFULNESS=off env override.

Best-effort contract: missing attune-rag[claude], missing ANTHROPIC_API_KEY, over-budget estimates, transient API failures all degrade silently. The judge never blocks the polish.

Tests: 30 new tests under tests/unit/faithfulness/ covering the budget gate, every skip path, the happy path, the below-threshold review-block append, env-var override, telemetry reset, and unexpected-exception swallowing. Full suite: 926 passed, 37 pre-existing skips.

Threshold calibration (tasks 3.3, 3.4) deferred to the same real-LLM run that closes Phase 2's live-LLM acceptance gate, folding two API cycles into one. Default of 0.95 is documented as pre-calibration in decisions.md.

Spec: docs/specs/polish-fact-check/

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

#38)

Follow-up to the polish-fact-check Phase 3 PR (#36) that landed the faithfulness judge. This commit adds the local helpers used to smoke-test the judge end-to-end without paying for a full attune-author regenerate cycle.

Two changes:

1. scripts/test_faithfulness.py

   Tiny harness that picks the smallest feature in features.yaml (fewest source files) and regenerates its 3 core kinds (concept/task/reference) with telemetry-reset + summary-print + review-block detection. Cost on Haiku 4.5 ≈ $0.03 per run. Refuses to run without ANTHROPIC_API_KEY in env.

   Usage:
       uv run python scripts/test_faithfulness.py
       uv run python scripts/test_faithfulness.py <feature_name>

2. pyproject.toml: enable the judge for attune-author's own self-dogfood help regeneration.

   With this, anyone running `attune-author regenerate` against attune-author with auth available exercises the Phase 3 pipeline end-to-end. This matches the pattern attune-author already uses for the polish pass (live API calls during dogfood). Configured on Haiku 4.5 (~1/3 the cost of Sonnet 4.6) since the threshold + budget defaults are pre-calibration and a cheaper model is fine for the initial measurement pass.

Why ship this as a follow-up rather than baking it into #36: the Phase 3 PR was scoped to the implementation + tests; the spec defines `enabled=false` as the global default (opt-in, since the judge makes real API calls). Flipping it on for the attune-author repo itself is a per-project preference, not a default change. Same shape as how attune-author has always defaulted polish-strict on for its own dogfood while the package default is lenient.

Post-Phase-0 of the sibling-subscription-auth spec (attune-ai PR #406), this also exercises the subscription-routing path for Claude Code users. However, the wire-up to actually use claude_agent_sdk lives in Phase 1, which hasn't shipped yet, so today this still requires ANTHROPIC_API_KEY.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

This was referenced May 16, 2026

feat(polish): tutorial code-fence static check (Phase 4 of polish-fact-check) #37

Merged

docs(specs): draft sibling-package subscription-auth spec Smart-AI-Memory/attune-ai#404

Merged

silversurfer562 changed the base branch from feat/polish-fact-check-phase-2 to main May 16, 2026 07:57

silversurfer562 force-pushed the feat/polish-fact-check-phase-3 branch from 42f3aa6 to 3a9da04 Compare May 16, 2026 07:59

silversurfer562 merged commit d3c5f3e into main May 16, 2026
12 checks passed

silversurfer562 deleted the feat/polish-fact-check-phase-3 branch May 16, 2026 07:59

silversurfer562 mentioned this pull request May 16, 2026

chore(faithfulness): smoke-test helper + enable judge for self-dogfood #38

Merged

4 tasks

silversurfer562 mentioned this pull request May 22, 2026

release: v0.14.0 — polish-fact-check Phases 2-4 + workspace_staleness helper #40

Merged

4 tasks

silversurfer562 merged 1 commit into main from feat/polish-fact-check-phase-3
