Commit 8d92e37

chore(faithfulness): smoke-test helper + enable judge for self-dogfood (#38)

Follow-up to the polish-fact-check Phase 3 PR (#36) that landed the faithfulness judge. This commit adds the local helpers used to smoke-test the judge end-to-end without paying for a full attune-author regenerate cycle.

Two changes:

1. scripts/test_faithfulness.py
   Tiny harness that picks the smallest feature in features.yaml (fewest source files) and regenerates its 3 core kinds (concept/task/reference) with telemetry-reset + summary-print + review-block detection. Cost on Haiku 4.5 ≈ $0.03 per run. Refuses to run without ANTHROPIC_API_KEY in env.

   Usage:
       uv run python scripts/test_faithfulness.py
       uv run python scripts/test_faithfulness.py <feature_name>

2. pyproject.toml: enable the judge for attune-author's own self-dogfood help regeneration. With this, anyone running `attune-author regenerate` against attune-author with auth available exercises the Phase 3 pipeline end-to-end — matches the pattern attune-author already uses for the polish pass (live API calls during dogfood). Configured on Haiku 4.5 (~1/3 the cost of Sonnet 4.6) since the threshold + budget defaults are pre-calibration and a cheaper model is fine for the initial measurement pass.

Why ship this as a follow-up rather than baking it into #36: the Phase 3 PR was scoped to the implementation + tests; the spec defines `enabled=false` as the global default (opt-in, since the judge makes real API calls). Flipping it on for the attune-author repo itself is a per-project preference, not a default change. Same shape as how attune-author has always defaulted polish-strict on for its own dogfood while the package default is lenient.

Post-Phase-0 of the sibling-subscription-auth spec (attune-ai PR #406), this also exercises the subscription-routing path for Claude Code users — though the wire-up to actually use claude_agent_sdk lives in Phase 1, which hasn't shipped yet, so today this still requires ANTHROPIC_API_KEY.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

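
The cost figures quoted in the message can be compared against the configured per-file budget with a little arithmetic. This is only a rough sketch: it assumes the ≈ $0.03 run cost is spread evenly across the three regenerated kinds, which the commit does not state, and the function names are made up for illustration.

```python
# Figures quoted in the commit message and the pyproject.toml change.
RUN_COST_USD = 0.03         # approximate cost of one smoke-test run on Haiku 4.5
KINDS_PER_FEATURE = 3       # concept, task, reference
BUDGET_PER_FILE_USD = 0.10  # budget_per_file_usd


def per_file_cost(run_cost: float, kinds: int) -> float:
    """Assumption: the run cost splits evenly across the generated kinds."""
    return run_cost / kinds


def budget_headroom(run_cost: float, kinds: int, budget: float) -> float:
    """How many times the estimated per-file cost fits into the per-file budget."""
    return budget / per_file_cost(run_cost, kinds)


if __name__ == "__main__":
    cost = per_file_cost(RUN_COST_USD, KINDS_PER_FEATURE)
    headroom = budget_headroom(RUN_COST_USD, KINDS_PER_FEATURE, BUDGET_PER_FILE_USD)
    print(f"estimated cost per file: ${cost:.4f}")
    print(f"budget headroom: {headroom:.1f}x")
```

Under that even-split assumption each file costs about one cent, roughly a tenth of the configured budget, so the budget is unlikely to be what stops a smoke-test run.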
1 parent 829f03f commit 8d92e37

2 files changed

Lines changed: 140 additions & 0 deletions

pyproject.toml

Lines changed: 8 additions & 0 deletions
@@ -115,6 +115,14 @@ source = ["attune_author"]
 branch = true
 omit = ["*/tests/*", "*/conftest.py"]
 
+# Faithfulness judge (Phase 3) — opt-in. Scores polished docs
+# against source. See docs/specs/polish-fact-check/.
+[tool.attune-author.fact-check.faithfulness]
+enabled = true
+threshold = 0.95
+budget_per_file_usd = 0.10
+model = "claude-haiku-4-5-20251001"
+
 [tool.coverage.report]
 show_missing = true
 skip_covered = false

scripts/test_faithfulness.py

Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
+"""Smoke-test the Phase 3 faithfulness judge on one feature.
+
+Picks the smallest feature in features.yaml (fewest source files),
+regenerates its 3 core kinds (concept/task/reference), and reports
+on whether the judge fired, what it scored, and whether any review
+blocks were appended.
+
+Designed for low cost — one feature × 3 kinds × Haiku 4.5 ≈ $0.03.
+
+Usage::
+
+    export ANTHROPIC_API_KEY=sk-ant-...
+    uv run python scripts/test_faithfulness.py [feature_name]
+
+Without a feature_name argument, the script picks the feature
+with the fewest source files.
+"""
+
+from __future__ import annotations
+
+import logging
+import os
+import sys
+from pathlib import Path
+
+from attune_author.generator import (
+    _faithfulness_telemetry,
+    generate_feature_templates,
+    reset_faithfulness_telemetry,
+)
+from attune_author.manifest import load_manifest
+
+REPO_ROOT = Path(__file__).resolve().parent.parent
+
+
+def _setup_logging() -> None:
+    logging.basicConfig(
+        level=logging.INFO,
+        format="%(asctime)s %(levelname)-7s %(name)s %(message)s",
+        datefmt="%H:%M:%S",
+    )
+    # Surface faithfulness module logs at INFO.
+    logging.getLogger("attune_author.faithfulness").setLevel(logging.INFO)
+    logging.getLogger("attune_author.generator").setLevel(logging.INFO)
+
+
+def _pick_smallest_feature(manifest) -> str:
+    candidates = sorted(
+        manifest.features.values(),
+        key=lambda f: len(f.files),
+    )
+    return candidates[0].name
+
+
+def main() -> int:
+    _setup_logging()
+
+    if not os.environ.get("ANTHROPIC_API_KEY"):
+        print("ERROR: ANTHROPIC_API_KEY not set", file=sys.stderr)
+        print(
+            "Export your key and re-run: export ANTHROPIC_API_KEY=sk-ant-...",
+            file=sys.stderr,
+        )
+        return 1
+
+    manifest = load_manifest(REPO_ROOT / ".help")
+    feature_name = sys.argv[1] if len(sys.argv) > 1 else _pick_smallest_feature(manifest)
+    feature = manifest.features.get(feature_name)
+    if feature is None:
+        print(f"ERROR: unknown feature {feature_name!r}", file=sys.stderr)
+        print(f"Available: {sorted(manifest.features)}", file=sys.stderr)
+        return 1
+
+    print("\n=== Phase 3 faithfulness smoke test ===")
+    print(f"Feature: {feature_name}")
+    print(f"Source files: {feature.files}")
+    print(f"Working dir: {REPO_ROOT}")
+    print()
+
+    reset_faithfulness_telemetry()
+
+    # Run generate from the repo root so cwd-relative resolution
+    # (matched_files, project_root) lines up.
+    original_cwd = Path.cwd()
+    try:
+        os.chdir(REPO_ROOT)
+        result = generate_feature_templates(
+            feature=feature,
+            help_dir=REPO_ROOT / ".help",
+            project_root=REPO_ROOT,
+            overwrite=True,
+        )
+    finally:
+        os.chdir(original_cwd)
+
+    telemetry = _faithfulness_telemetry()
+    print()
+    print("=== Judge telemetry ===")
+    print(f" Calls: {int(telemetry['calls'])}")
+    print(f" Skipped: {int(telemetry['skipped'])}")
+    print(f" Estimated cost: ${telemetry['cost_usd']:.4f}")
+    print()
+
+    review_blocks_found = 0
+    print("=== Polished templates ===")
+    for tmpl in result.templates:
+        text = tmpl.path.read_text(encoding="utf-8")
+        has_review = "## Faithfulness review" in text
+        marker = "REVIEW" if has_review else "clean"
+        print(f" [{marker}] {tmpl.path.relative_to(REPO_ROOT)}")
+        if has_review:
+            review_blocks_found += 1
+
+    print()
+    if telemetry["calls"] == 0 and telemetry["skipped"] == 0:
+        print("Judge did NOT run. Diagnostic checks:")
+        print(" - Is `enabled = true` set in " "[tool.attune-author.fact-check.faithfulness]?")
+        print(" - Is ATTUNE_AUTHOR_FAITHFULNESS=off in the environment?")
+        return 2
+
+    if review_blocks_found:
+        print(
+            f"{review_blocks_found} template(s) flagged below threshold — "
+            f"check the files above for the ## Faithfulness review block."
+        )
+    else:
+        print("All templates scored at or above threshold.")
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
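
The feature-selection step in `_pick_smallest_feature` can be tried in isolation. The sketch below assumes a hypothetical `Feature` dataclass as a stand-in for whatever type `load_manifest` actually returns. Only the `name` and `files` attributes are taken from the script above. It uses `min` with a `default` so that an empty manifest yields `None` instead of the `IndexError` that `candidates[0]` would raise.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Feature:
    """Hypothetical stand-in for a manifest feature (not attune-author's type)."""

    name: str
    files: list[str] = field(default_factory=list)


def pick_smallest_feature(features: dict[str, Feature]) -> str | None:
    """Return the name of the feature with the fewest files, or None if empty.

    On ties, min() keeps the first feature encountered, which matches what
    a stable sorted(...)[0] would select.
    """
    smallest = min(features.values(), key=lambda f: len(f.files), default=None)
    return smallest.name if smallest is not None else None


if __name__ == "__main__":
    demo = {
        "generator": Feature("generator", ["a.py", "b.py", "c.py"]),
        "manifest": Feature("manifest", ["m.py"]),
        "cli": Feature("cli", ["cli.py", "args.py"]),
    }
    print(pick_smallest_feature(demo))
    print(pick_smallest_feature({}))
```

Whether an empty manifest can occur in practice depends on how `load_manifest` behaves, which this commit does not show.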
