Merged
36 changes: 36 additions & 0 deletions CHANGELOG.md
@@ -15,6 +15,42 @@ changes land, not at tag time.

### Added

- **Polish fact-check Phase 3 — faithfulness judge.** Wraps
`attune_rag.eval.faithfulness.FaithfulnessJudge` as a
post-polish step: scores each polished file's claims against
the source files it was generated from. When the score falls
below the configured threshold, appends a
`## Faithfulness review` block listing the unsupported claims
and the judge's reasoning. Best-effort: missing
`attune-rag[claude]`, missing `ANTHROPIC_API_KEY`, over-budget
estimates, and transient API failures all degrade silently
rather than blocking the polish.
- New package: `src/attune_author/faithfulness/` with the
judge wrapper, `FaithfulnessConfig`, `JudgeOutcome`,
`estimate_cost_usd` budget-gate helper, and a
`format_review_block` / `apply_review_block` soft-fail pair
that mirrors the Phase 1 `## Unresolved references` shape.
- Config schema:
`[tool.attune-author.fact-check.faithfulness]` with
`enabled`, `threshold` (default 0.95, pre-calibration), `budget_per_file_usd`
(default $0.10), `model` (default Sonnet 4.6 — Haiku 4.5 is
cheaper for high-volume runs), and
`block_polish_on_unavailable` (default False — flip to True
in CI where missing deps should be loud).
- Cost telemetry: per-process counters on the generator
module; `run_maintenance` resets them at start and logs an
INFO summary at end with call count, skip count, and total
estimated USD spent.
- `ATTUNE_AUTHOR_FAITHFULNESS=off` env-var override to disable the
judge for a single run without editing `pyproject.toml`.
- 30 new tests under `tests/unit/faithfulness/`.
- Threshold calibration against the ops-dashboard fixture
(tasks 3.3, 3.4) is **deferred** until the live-LLM Phase 2
acceptance run; today's default of 0.95 is documented as
pre-calibration in `decisions.md`.
- Spec: `docs/specs/polish-fact-check/`. Phase 4 (tutorial
code-fence mypy) remains on the roadmap.

- **Polish fact-check Phase 2 — ground-truth context
injection.** Builds three sentinel-tagged blocks
(`<cli_help>`, `<public_api>`, `<dataclasses>`) and injects
31 changes: 31 additions & 0 deletions README.md
@@ -162,6 +162,37 @@ CLI flags, fabricated private-module imports, wrong route
paths, hallucinated counts) at the prompt layer, rather than
relying solely on the post-generation fact-check to catch them.

## Faithfulness review (Phase 3)

The Phase 3 faithfulness judge wraps
`attune_rag.eval.faithfulness.FaithfulnessJudge` as a
post-polish step: it scores each polished file's claims against
the source files it was generated from. When the score falls
below the configured threshold, a `## Faithfulness review`
block is appended to the polished file listing the unsupported
claims and the judge's reasoning.
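
As a rough illustration of the append step described above, the sketch below renders a review block and attaches it only when the score is below the threshold. The function names (`render_review_block`, `append_review_block`, `review_if_needed`), their signatures, and the block layout are assumptions made for this example; they stand in for, and are not, the project's `format_review_block` / `apply_review_block` pair.

```python
from typing import Sequence

REVIEW_HEADING = "## Faithfulness review"


def render_review_block(
    score: float,
    threshold: float,
    unsupported_claims: Sequence[str],
    reasoning: str,
) -> str:
    """Render the review section as markdown (layout is an assumption)."""
    lines = [
        REVIEW_HEADING,
        "",
        f"Score {score:.2f} is below the threshold {threshold:.2f}.",
        "",
        "Unsupported claims:",
        "",
    ]
    lines.extend(f"- {claim}" for claim in unsupported_claims)
    lines.extend(["", f"Judge reasoning: {reasoning}"])
    return "\n".join(lines) + "\n"


def append_review_block(polished_text: str, block: str) -> str:
    """Append the block after the polished text, separated by a blank line."""
    return polished_text.rstrip("\n") + "\n\n" + block


def review_if_needed(
    polished_text: str,
    score: float,
    threshold: float,
    unsupported_claims: Sequence[str],
    reasoning: str,
) -> str:
    """Return the text unchanged unless the score falls below the threshold."""
    if score >= threshold:
        return polished_text
    block = render_review_block(score, threshold, unsupported_claims, reasoning)
    return append_review_block(polished_text, block)
```

In this sketch a score equal to the threshold passes, matching the "falls below" wording; the polished content itself is never rewritten, only extended.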

The judge is **opt-in** because it makes real Anthropic API
calls. To enable:

```toml
[tool.attune-author.fact-check.faithfulness]
enabled = true
threshold = 0.95 # below this triggers a review block
budget_per_file_usd = 0.10 # skip if estimated cost exceeds cap
model = "claude-sonnet-4-6" # haiku is ~1/3 the cost
```

The judge is best-effort. Missing `attune-rag[claude]`, missing
`ANTHROPIC_API_KEY`, over-budget cost estimates, and transient
API failures all degrade silently rather than blocking the
polish. Set `block_polish_on_unavailable = true` in CI lanes
where missing deps should fail loudly instead.

End-of-run telemetry (call count, skip count, total estimated
USD) is logged at INFO level at the end of `attune-author regenerate`.
Set `ATTUNE_AUTHOR_FAITHFULNESS=off` to disable the judge for a single run.
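
The skip-or-fail behaviour described in this section can be pictured with the minimal sketch below. `SketchConfig`, `JudgeUnavailableError`, `skip_reason`, and the reason strings are hypothetical names introduced for illustration; only the config keys and the two environment variables come from the text above, and the order of the checks is an assumption.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class SketchConfig:
    """Hypothetical stand-in mirroring the documented config keys."""
    enabled: bool = False
    threshold: float = 0.95
    budget_per_file_usd: float = 0.10
    block_polish_on_unavailable: bool = False


class JudgeUnavailableError(RuntimeError):
    """Hypothetical error raised only in strict mode."""


def skip_reason(config: SketchConfig, env: Mapping[str, str]) -> Optional[str]:
    """Return why the judge should be skipped, or None if it may run.

    In strict mode (block_polish_on_unavailable=True) a missing API key
    raises instead of skipping quietly.
    """
    if not config.enabled:
        return "disabled"  # opt-in: nothing runs unless enabled
    if env.get("ATTUNE_AUTHOR_FAITHFULNESS") == "off":
        return "env_override"  # one-off disable for a single run
    if not env.get("ANTHROPIC_API_KEY"):
        if config.block_polish_on_unavailable:
            raise JudgeUnavailableError("ANTHROPIC_API_KEY is not set")
        return "missing_api_key"
    return None
```

A caller would typically pass `os.environ` as `env`; taking the mapping as a parameter keeps the check easy to exercise in tests without mutating the real environment.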

## Polish cache

`attune-author` caches LLM polish responses on disk so re-generating an
35 changes: 35 additions & 0 deletions docs/specs/polish-fact-check/decisions.md
@@ -73,3 +73,38 @@ To be filled in during Phase 3 implementation:
faithfulness judge ships, it will require its own real-LLM
calibration run. Folding the cost-delta measurement into that
run avoids two separate real-LLM cycles.
- 2026-05-16 — Phase 3 shipped. New decisions captured during
implementation:
- **Opt-in default**: `enabled=False` ships in
`FaithfulnessConfig` and the pyproject loader, because the
judge makes real Anthropic API calls and we shouldn't bill
users for it silently on the first run after install. The
Phase 1 fact-check is enabled by default (no API calls); the
Phase 3 judge is not.
- **Synchronous wrapper via `asyncio.run`**: the existing
polish pipeline is synchronous, so the async
`FaithfulnessJudge.score` coroutine is bridged with
`asyncio.run`. This precludes calling the judge from inside
a running event loop (we don't, today), but keeps the
surface aligned with the rest of attune-author.
- **Best-effort vs strict**: missing extras / missing API key
/ transient failures all default to `JudgeOutcome(score=None,
skipped_reason=…)` rather than raising. CI lanes that need
loud failures opt in via `block_polish_on_unavailable = true`.
- **Budget gate uses character-count heuristic, not tokenizer**:
`estimate_cost_usd(chars, model)` divides chars by 4 to get
a rough token count and multiplies by a per-model price
lookup. The estimate is accurate to roughly 20%, which is
tight enough for a $0.10 budget cap. Switching to a real
tokenizer is left as a future change if the estimate drifts.
- **Cost telemetry as function attribute, not module global**:
`_faithfulness_telemetry()` stores the counter dict on its
own `_state` attribute so it's resettable, mockable, and
doesn't leak module-level state. Mirrors how the polish
cache exposes its store.
- **Calibration deferred**: tasks 3.3 and 3.4 require a real
LLM run against the ops-dashboard pre-fix and post-fix
fixtures. The placeholder threshold of `0.95` ships as the
default and the calibration is scheduled to land alongside
the live-LLM Phase 2 acceptance run so a single real-API
cycle covers both phases' open work.
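
A minimal sketch of three of the decisions above, taken together: the chars-divided-by-4 budget gate, the counter dict stored as a function attribute, and the `asyncio.run` bridge from synchronous code into an async scorer. Everything here is illustrative: the price table values and model keys are placeholders, `_fake_score` stands in for the real judge call, and `judge_text`, `reset_telemetry`, and the local `JudgeOutcome` dataclass are hypothetical shapes rather than the shipped implementation.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional

# Placeholder prices in USD per million input tokens (assumed values,
# not the project's real lookup table).
PRICE_PER_MTOK_USD = {
    "claude-sonnet-4-6": 3.00,
    "claude-haiku-4-5": 1.00,
}


def estimate_cost_usd(chars: int, model: str) -> float:
    """Rough cost: chars / 4 approximates tokens, times a per-model price."""
    tokens = chars / 4
    return tokens / 1_000_000 * PRICE_PER_MTOK_USD[model]


def faithfulness_telemetry() -> dict:
    """Return the counter dict stored on this function's own attribute."""
    if not hasattr(faithfulness_telemetry, "_state"):
        reset_telemetry()
    return faithfulness_telemetry._state


def reset_telemetry() -> None:
    """Replace the counters with a fresh dict (e.g. at the start of a run)."""
    faithfulness_telemetry._state = {"calls": 0, "skipped": 0, "usd": 0.0}


@dataclass
class JudgeOutcome:
    score: Optional[float]
    skipped_reason: Optional[str] = None


async def _fake_score(text: str) -> float:
    """Stand-in for the async judge coroutine; returns a fixed score."""
    await asyncio.sleep(0)
    return 0.97


def judge_text(text: str, model: str, budget_usd: float) -> JudgeOutcome:
    """Synchronous wrapper: budget gate first, then bridge into async code."""
    stats = faithfulness_telemetry()
    cost = estimate_cost_usd(len(text), model)
    if cost > budget_usd:
        stats["skipped"] += 1
        return JudgeOutcome(score=None, skipped_reason="over_budget")
    # asyncio.run starts a new event loop, so this must not be called
    # from inside an already running loop.
    score = asyncio.run(_fake_score(text))
    stats["calls"] += 1
    stats["usd"] += cost
    return JudgeOutcome(score=score)
```

With these placeholder prices, a 4,000-character file is estimated at about 1,000 tokens, far under a $0.10 cap, while a 4,000,000-character input is estimated at roughly $3 and is skipped without any call being made.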
27 changes: 13 additions & 14 deletions docs/specs/polish-fact-check/tasks.md
@@ -110,23 +110,22 @@ the `FactCheckReport` plumbing.

| # | Task | Layer | Status | Notes |
|---|------|-------|--------|-------|
| 3.1 | Add faithfulness-threshold + budget-cap config to `[tool.attune-author.fact-check]` | attune-author | todo | Default threshold `0.95`; default cap `$0.10/feature` |
| 3.2 | Implement `faithfulness.judge_polished_file(polished_path, source_paths, config)` wrapper | attune-author | todo | Wraps `attune_rag.eval.faithfulness.FaithfulnessJudge` |
| 3.3 | Calibrate threshold against ops-dashboard fixture | attune-author | todo | Pre-fix should score < 0.9 mean; post-fix ≥ 0.95 |
| 3.4 | Document calibration result in `decisions.md` (or design doc) | attune-author | todo | Pre-committed matrix entry; concrete numbers |
| 3.5 | Wire judge into post-polish pipeline (after Phase 1 fact-check) | attune-author | todo | Append `## Faithfulness review` block when below threshold |
| 3.6 | Cost telemetry: aggregate per-feature judge cost; report at end of `regenerate` | attune-author | todo | Use existing telemetry hooks if any; otherwise log |
| 3.7 | Test: judge runs and writes review block on a deliberately unfaithful synthetic input | attune-author | todo | Construct a polished file that contradicts the source |
| 3.8 | Test: budget cap skips judge call when estimated cost exceeds threshold | attune-author | todo | |
| 3.9 | Update CHANGELOG + README | attune-author | todo | |
| 3.1 | Add faithfulness-threshold + budget-cap config to `[tool.attune-author.fact-check]` | attune-author | **done** | `[tool.attune-author.fact-check.faithfulness]` sub-table; defaults threshold=0.95, budget=$0.10, model=Sonnet 4.6, enabled=False (opt-in) |
| 3.2 | Implement `faithfulness.judge_polished_file(polished_path, source_paths, config)` wrapper | attune-author | **done** | Wraps `FaithfulnessJudge` via `asyncio.run`; best-effort: missing extra / missing API key / over-budget all return `JudgeOutcome(score=None, skipped_reason=…)` rather than raising. `block_polish_on_unavailable=True` opt-in for strict CI. |
| 3.3 | Calibrate threshold against ops-dashboard fixture | attune-author | deferred | Requires real-LLM run; placeholder default `0.95` documented in decisions.md as pre-calibration. Calibration scheduled alongside live-LLM Phase 2 acceptance run. |
| 3.4 | Document calibration result in `decisions.md` (or design doc) | attune-author | deferred | Empty calibration record retained; will populate when 3.3 runs. |
| 3.5 | Wire judge into post-polish pipeline (after Phase 1 fact-check) | attune-author | **done** | `generator._run_faithfulness_judge` called after `_run_fact_check`; appends `## Faithfulness review` block when below threshold. `ATTUNE_AUTHOR_FAITHFULNESS=off` env override. |
| 3.6 | Cost telemetry: aggregate per-feature judge cost; report at end of `regenerate` | attune-author | **done** | Per-process telemetry state on `_faithfulness_telemetry`; `run_maintenance` resets at start and logs INFO summary at end (calls, skipped, total estimated $). |
| 3.7 | Test: judge runs and writes review block on a deliberately unfaithful synthetic input | attune-author | **done** | `test_pipeline_wiring.py::test_run_faithfulness_judge_appends_review_block_when_below_threshold` + `test_judge.py::test_judge_below_threshold_flags_threshold_not_met` |
| 3.8 | Test: budget cap skips judge call when estimated cost exceeds threshold | attune-author | **done** | `test_judge.py::test_judge_skipped_when_over_budget` |
| 3.9 | Update CHANGELOG + README | attune-author | **done** | CHANGELOG under Unreleased; README adds a "Faithfulness review (Phase 3)" subsection. |

### Phase 3 exit checklist

- [ ] Tasks 3.1–3.9 done
- [ ] Calibration shows clean separation between pre-fix and post-fix
fixture scores
- [ ] Threshold + cap configurable
- [ ] Spec status updated
- [x] Tasks 3.1, 3.2, 3.5–3.9 done (30 new tests)
- [x] Threshold + cap configurable
- [x] Spec status updated
- [ ] Calibration (tasks 3.3, 3.4) — deferred until real-LLM run lands; placeholder default `threshold=0.95` documented in `decisions.md`

---
