Skip to content

Commit 93df30c

Browse files
spec(docs): polish-fact-check umbrella spec — 4 phases to reduce hallucinations (#27)
Adds docs/specs/polish-fact-check/ — an umbrella spec for a four-phase intervention ladder that shifts polish-pass verification work from human editorial review to automated checks. Motivated by a regression fixture from attune-ai PR #351 (Smart-AI-Memory/attune-ai#351), where the ops-dashboard regen produced six factual errors in a single feature's docs: - 1 hallucinated CLI flag (--allow-run, real flag is --read-only) - 2 hallucinated private module paths (attune.ops._readers) - 4 hallucinated cross-references - 1 hallucinated count (498 templates vs real 259) - 2 wrong route paths - 1 insecure example (0.0.0.0 without auth callout) Three of the six (CLI flag, private imports, wrong routes) would actively break readers who follow the docs literally. Four phases, each shipping as its own PR: Phase 1: AST-based post-generation fact-check (Python refs, CLI flags, Markdown links, numeric claims) — catches 5 of 6 fixture errors. Cheapest, no LLM cost. Phase 2: Ground-truth context injection into polish prompt (CLI --help output, __all__, dataclass fields). Phase 3: Adapt attune-rag faithfulness judge as a polish post-step. Catches the 6th fixture error (missing-security-callout). Phase 4: Static analysis of tutorial code samples (mypy + ast.parse). Execution tiers explicitly deferred to Phase 4.2 for security reasons. Files: - requirements.md — problem statement, scope, acceptance - design.md — architecture, per-phase API shapes, open design questions - tasks.md — numbered tasks per phase, exit checklists - decisions.md — pre-committed decision matrix (introduces a spec-file convention) Status: draft. Awaiting review/approval before Phase 1 implementation begins. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
1 parent d9b9f09 commit 93df30c

4 files changed

Lines changed: 763 additions & 0 deletions

File tree

Lines changed: 43 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,43 @@
1+
# Spec: Polish Fact-Check — Decisions
2+
3+
> Pre-committed decisions per the existing lesson "Pre-committed
4+
> decision matrices survive contact with data." Edits to this file
5+
> after Phase 1 ships require a follow-up PR with rationale.
6+
7+
---
8+
9+
## Decision matrix
10+
11+
| Decision | Choice | Rationale |
12+
|---|---|---|
13+
| Phase 1 default failure mode | Soft-fail (`## Unresolved references` block at bottom of file) | Lets us measure noise vs signal before tightening the gate. Mirrors test-quality-program's "measure first, gate later" rubric pattern. |
14+
| Phase 1 strict-fail escalation criterion | Move to strict-fail if soft-fail rate drops below 5% across two consecutive **weekly** regens | Weekly cadence matches the help-system's intended regen rhythm; monthly would delay the escalation decision unnecessarily. |
15+
| Phase 1 numeric-claim check | Required (not stretch) | Patrick tightened the acceptance gate from 4/6 to 5/6 errors caught at Phase 1 ship. Numeric claims are AST-pattern-detectable; only the "missing security callout" failure mode stays for Phase 3. |
16+
| Phase 1 CLI-ref version coupling | Acceptable, with proactive user messaging | When a flag isn't found in `attune <cmd> --help`, the finding message includes (a) the installed attune-ai version, (b) instructions to verify against the target version, (c) override snippet. See `design.md` § Check 2. |
17+
| Phase 3 default faithfulness threshold | `0.95` (mean across paragraphs in a single file) | Untested at spec-draft time. Will be **calibrated** against the ops-dashboard regression fixture in Phase 3 task #3.3 before defaulting. If calibration shows pre-fix mean ≥ 0.9 or post-fix mean < 0.95, the threshold gets re-decided. |
18+
| Phase 3 threshold override mechanism | `pyproject.toml` `[tool.attune-author.fact-check]` + per-invocation CLI flag | Two-level override: project-wide config for sustained policy, CLI flag for one-off runs. |
19+
| Phase 3 budget cap | Skip judge call if estimated cost > `$0.10` for a single feature regen | Hard cap protects against unexpected cost when regenerating a feature with many kinds. Configurable. |
20+
| Phase 4 default | Tier 0 (static analysis only); execution requires explicit opt-in | Static analysis catches the documented failure modes (e.g. `_readers` private-module hallucinations) without executing untrusted LLM-generated code. Execution tiers documented in design.md for Phase 4.2 follow-up. |
21+
| Phase 4 execution opt-in mechanism | `# attune-author: exec` frontmatter on individual code samples | Sample-level granularity, not file-level — keeps the human reviewer responsible for confirming each blessed sample has no side effects. |
22+
| Spec-file convention going forward | Include `decisions.md` alongside `requirements.md` / `design.md` / `tasks.md` | Patrick's call (2026-05-14). Extracts pre-committed decisions from the spec body so they're easy to audit and update independently. |
23+
24+
---
25+
26+
## Calibration record
27+
28+
To be filled in during Phase 3 implementation:
29+
30+
- [ ] **Phase 3 threshold calibration** — Phase 3 task #3.3 / #3.4
31+
- Pre-fix ops-dashboard mean faithfulness score: _TBD_
32+
- Post-fix ops-dashboard mean faithfulness score: _TBD_
33+
- Default threshold after calibration: _TBD_
34+
35+
---
36+
37+
## Decision-change log
38+
39+
> Append entries here when a decision above is revised. Reference the PR
40+
> that revised it.
41+
42+
- 2026-05-14 — Initial decisions captured during spec draft. Patrick
43+
approved.

0 commit comments

Comments
 (0)