
Commit d5c060b

feat(polish): ground-truth context injection (Phase 2 of polish-fact-check) (#35)

Phase 2 of the polish-fact-check spec changes what the model sees during the polish pass rather than catching mistakes after the fact (Phase 1's job). Three sentinel-tagged blocks carrying authoritative surface details are injected into the user message, and a short anchoring clause is appended to the system prompt instructing the model to only reference names that appear verbatim in those blocks.

Goal: prevent the six hallucination shapes documented in attune-ai PR #351's ops-dashboard editorial pass (invented CLI flags, fabricated _readers/_models imports, wrong route paths, hallucinated counts) at the prompt layer.

New package: src/attune_author/ground_truth/
- cli_help.py: subprocess (10s timeout) + LRU cache per (exe, sub, cwd)
- public_api.py: AST walk: __all__ + public function/class signatures
- dataclass_refs.py: AST walk: @dataclass field names + type strings (named to avoid shadowing the stdlib module)
- budget.py: 5KB cap; drop order: dataclasses, then public_api, then cli_help
- config.py: [tool.attune-author.context-injection] schema

Wiring:
- Feature.cli_command: optional field on the manifest model (legacy manifests round-trip; save omits the field when None).
- build_polish_prompt (used by both sync and batch paths) gains include_ground_truth_anchor. The cache key shifts when the flag is set, so old cached entries invalidate cleanly without bespoke plumbing.
- generator._maybe_polish and maintenance_batch._collect_polish_prompts each build the ground-truth string once per feature and prepend it to the RAG hook's existing augmented_context.

Tests: 60 new tests under tests/unit/ground_truth/ covering each extractor, budget drop order, config loading, build_context shape, and the polish-prompt integration (anchor clause + sentinel-tag embedding). Full suite: 896 passed, 37 pre-existing skips.

Decisions captured during impl (see decisions.md):
- Compose with RAG instead of replacing it.
- Anchor clause as system-prompt suffix.
- Cache-key participation via system-prompt change.
- CLI flags + live-LLM acceptance + cost-delta deferred to follow-ups.

Spec: docs/specs/polish-fact-check/

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 0d7e9b6 commit d5c060b

23 files changed

Lines changed: 1802 additions & 19 deletions

CHANGELOG.md

Lines changed: 32 additions & 0 deletions
@@ -13,6 +13,38 @@ and this project adheres to
 Work in progress for the next release. Add entries here as
 changes land, not at tag time.
 
+### Added
+
+- **Polish fact-check Phase 2 — ground-truth context
+injection.** Builds three sentinel-tagged blocks
+(`<cli_help>`, `<public_api>`, `<dataclasses>`) and injects
+them into the polish prompt above the existing source
+summary, with a short anchoring clause appended to the
+system prompt instructing the model to only reference names
+that appear verbatim in those blocks. Goal: prevent the
+six hallucination shapes documented in attune-ai PR #351's
+ops-dashboard editorial pass rather than catching them after
+the fact (Phase 1's job).
+- New package: `src/attune_author/ground_truth/` with
+`cli_help.py` (subprocess + LRU cache), `public_api.py`
+(AST walk for `__all__` + function/class signatures),
+`dataclass_refs.py` (AST walk for `@dataclass` field
+names/types), `budget.py` (5KB cap with documented drop
+order: dataclasses → public_api → cli_help), and
+`config.py` (`[tool.attune-author.context-injection]`
+schema).
+- New `Feature.cli_command` field on the manifest model;
+legacy manifests without this field continue to load. Save
+omits the field when `None`.
+- `build_polish_prompt` (used by both the synchronous and
+batch paths) accepts a new `include_ground_truth_anchor`
+flag; when True, the anchoring clause is appended to the
+system prompt and the prompt cache key shifts accordingly.
+- 60 new tests under `tests/unit/ground_truth/`.
+- Spec: `docs/specs/polish-fact-check/`. Phase 3
+(faithfulness judge integration) and Phase 4 (tutorial
+code-fence mypy) remain on the roadmap.
+
 ## [0.13.0] - 2026-05-15
 
 > **Note**: skipping `0.12.0`. The internal `release/v0.12.0`

README.md

Lines changed: 47 additions & 2 deletions
@@ -114,8 +114,53 @@ check_numeric_refs = true
 
 This is Phase 1 of the [polish-fact-check
 spec](docs/specs/polish-fact-check/). Phase 2 (ground-truth
-context injection), Phase 3 (faithfulness judge), and Phase 4
-(tutorial static check) are tracked in `tasks.md`.
+context injection) shipped alongside it. Phase 3 (faithfulness
+judge) and Phase 4 (tutorial static check) remain on the
+roadmap.
+
+## Ground-truth context (polish-prompt injection)
+
+Phase 2 of the polish-fact-check spec changes what the model
+sees during the polish pass: three sentinel-tagged blocks
+carrying authoritative surface details are injected into the
+user message before the source summary, and a short anchoring
+clause is appended to the system prompt instructing the model
+to only reference names that appear verbatim in those blocks.
+
+The three blocks:
+
+- `<cli_help>`: captured `<cli> <subcommand> --help` output.
+Driven by an optional `cli_command:` field on each feature in
+`features.yaml` (e.g., `cli_command: ops` for a feature whose
+primary UX is `attune ops`). Absence skips this block.
+- `<public_api>`: AST-extracted `__all__` lists plus signatures
+for every public function and class in the feature's source
+files.
+- `<dataclasses>`: AST-extracted field names + type annotations
+for every public `@dataclass` in the feature's source files.
+
+The combined block list is capped at 5 KB by default. When the
+budget is exceeded, blocks drop in this order: dataclasses,
+public_api, cli_help — the most authoritative anchor stays the
+longest.
+
+Configure via `pyproject.toml`:
+
+```toml
+[tool.attune-author.context-injection]
+enabled = true
+inject_cli_help = true
+inject_public_api = true
+inject_dataclasses = true
+budget_bytes = 5120
+cli_executable = "attune"
+```
+
+The goal is to prevent the six hallucination shapes documented
+in attune-ai PR #351's ops-dashboard editorial pass (invented
+CLI flags, fabricated private-module imports, wrong route
+paths, hallucinated counts) at the prompt layer, rather than
+relying solely on the post-generation fact-check to catch them.
 
 ## Polish cache

docs/specs/polish-fact-check/decisions.md

Lines changed: 32 additions & 0 deletions
@@ -41,3 +41,35 @@ To be filled in during Phase 3 implementation:
 
 - 2026-05-14 — Initial decisions captured during spec draft. Patrick
 approved.
+- 2026-05-16 — Phase 2 shipped. New decisions captured during
+implementation:
+- **Composition with RAG context**: ground-truth context is
+prepended to the RAG hook's existing `augmented_context` rather
+than replacing it. Rationale: the two carry orthogonal information
+(RAG retrieves similar templates; ground-truth pins names) so
+keeping both maximizes prompt utility within the budget.
+- **Anchor clause as system-prompt suffix**: the
+`ANCHORING_CLAUSE` appends to the existing per-template-type
+system prompt rather than replacing or wrapping it. Rationale:
+minimises drift from the existing polish system prompts, which are
+already large (~6KB) and cache-friendly; the suffix is short and
+behaviorally additive.
+- **Cache-key participation**: when the anchor clause is added,
+the system prompt changes — and the polish-cache key already
+includes the system prompt, so existing cached entries are
+invalidated cleanly without bespoke cache-key plumbing.
+- **CLI flags deferred (task 2.8)**: env-driven defaults via
+`[tool.attune-author.context-injection]` in `pyproject.toml`
+were sufficient for the first iteration. CLI flags can be added
+in a follow-up alongside Phase 3's `--faithfulness-threshold`
+flag.
+- **Live-LLM acceptance gate deferred**: task 2.10 splits into
+a unit-level part (assert sentinel blocks reach the user
+message + anchor clause reaches the system prompt — done) and
+a live-LLM part (actually polish ops-dashboard with Phase 2 on
++ Phase 1 off and observe 0/3 high-severity errors). The
+live-LLM part stays gated behind real-API-key availability.
+- **Cost-delta measurement deferred to Phase 3**: when the
+faithfulness judge ships, it will require its own real-LLM
+calibration run. Folding the cost-delta measurement into that
+run avoids two separate real-LLM cycles.
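The cache-key decision relies on the polish-cache key already covering the system prompt. The sketch below illustrates why that is sufficient: if the key hashes the full system prompt, appending an anchoring clause as a suffix yields a different key with no extra plumbing. `polish_cache_key`, `build_system_prompt`, the placeholder model name, and the JSON-then-SHA-256 scheme are assumptions for illustration; the real polish cache's key function is not shown in this commit.

```python
import hashlib
import json


def polish_cache_key(system_prompt: str, user_message: str, model: str) -> str:
    """Hash every prompt input; changing any one of them yields a new key."""
    payload = json.dumps(
        {"model": model, "system": system_prompt, "user": user_message},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_system_prompt(base: str, anchor_clause: str, include_anchor: bool) -> str:
    # "Suffix, not wrapper": the base prompt text is left untouched.
    return base + anchor_clause if include_anchor else base
```

Serialising the inputs as JSON rather than concatenating them keeps field boundaries unambiguous, so `("ab", "c")` and `("a", "bc")` cannot collide.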

docs/specs/polish-fact-check/tasks.md

Lines changed: 19 additions & 17 deletions
@@ -78,26 +78,28 @@ code).
 
 | # | Task | Layer | Status | Notes |
 |---|------|-------|--------|-------|
-| 2.1 | Add `cli_command` field to `Feature` (the manifest model) | attune-author | todo | Optional; absence skips CLI-help injection |
-| 2.2 | Implement `ground_truth.extract_cli_help(cli_cmd, subcommand, project_root)` | attune-author | todo | `subprocess.run(...)` with timeout; cache per (cmd, subcommand) pair |
-| 2.3 | Implement `ground_truth.extract_public_api(source_paths)` | attune-author | todo | AST-walk for `__all__` + non-underscore-prefixed defs |
-| 2.4 | Implement `ground_truth.extract_dataclasses(source_paths)` | attune-author | todo | AST-walk for `@dataclass`; collect field names + type strings |
-| 2.5 | Add `<cli_help>`, `<public_api>`, `<dataclasses>` sentinel blocks to polish prompt builder | attune-author | todo | Match existing context-block format |
-| 2.6 | Add system-prompt anchoring clause | attune-author | todo | "Ground-truth context blocks contain surface details — names you use must appear verbatim" |
-| 2.7 | Implement 5KB context budget enforcement with drop order | attune-author | todo | Log warning on drop; never fail |
-| 2.8 | Add `[tool.attune-author.context-injection]` config + CLI flags | attune-author | todo | Defaults: all three sources on, 5KB budget |
-| 2.9 | Test: ground-truth extractors produce expected output on ops-dashboard source | attune-author | todo | Snapshot tests |
-| 2.10 | Test: polishing ops-dashboard with Phase 2 on, Phase 1 off recurs 0/3 high-severity errors | attune-author | todo | The acceptance gate from `design.md` |
-| 2.11 | Test: budget enforcement drops sources in documented order | attune-author | todo | Artificial 1KB cap forces drops |
-| 2.12 | Cost-delta measurement: 3-feature regression set with vs without Phase 2 | attune-author | todo | Record in CHANGELOG; should be < 10% |
-| 2.13 | Update CHANGELOG + README | attune-author | todo | |
+| 2.1 | Add `cli_command` field to `Feature` (the manifest model) | attune-author | **done** | Optional; load/save preserve; defaults None |
+| 2.2 | Implement `ground_truth.extract_cli_help(cli_cmd, subcommand, project_root)` | attune-author | **done** | `subprocess.run(...)` with 10s timeout; `@lru_cache` per (exe, sub, cwd) |
+| 2.3 | Implement `ground_truth.extract_public_api(source_paths)` | attune-author | **done** | AST walk: `__all__` + public function/class signatures (incl. method bodies) |
+| 2.4 | Implement `ground_truth.extract_dataclasses(source_paths)` | attune-author | **done** | AST walk: `@dataclass` decorator + AnnAssign field collection. Module named `dataclass_refs` to avoid stdlib shadowing |
+| 2.5 | Add `<cli_help>`, `<public_api>`, `<dataclasses>` sentinel blocks to polish prompt builder | attune-author | **done** | Composed in `ground_truth.build_context`; prepended to RAG context when both exist |
+| 2.6 | Add system-prompt anchoring clause | attune-author | **done** | `ANCHORING_CLAUSE` exposed; appended via new `include_ground_truth_anchor` flag on `polish_template`/`build_polish_prompt`. Cache key shifts accordingly. |
+| 2.7 | Implement 5KB context budget enforcement with drop order | attune-author | **done** | `ground_truth.budget.enforce_budget`; drops dataclasses → public_api → cli_help; logs warning per drop |
+| 2.8 | Add `[tool.attune-author.context-injection]` config + CLI flags | attune-author | **done** | Config schema landed (enabled, per-source toggles, budget, executable); CLI flag deferred (env-driven defaults sufficient for first iteration) |
+| 2.9 | Test: ground-truth extractors produce expected output on ops-dashboard source | attune-author | **done** | 25 tests across `test_public_api.py` + `test_dataclass_refs.py` |
+| 2.10 | Test: polishing ops-dashboard with Phase 2 on, Phase 1 off recurs 0/3 high-severity errors | attune-author | **partial** | Unit-level: `test_polish_integration.py` asserts the sentinel blocks reach the user message and the anchor clause reaches the system prompt. Live-LLM acceptance run gated to a follow-up once an `ANTHROPIC_API_KEY` lane is available. |
+| 2.11 | Test: budget enforcement drops sources in documented order | attune-author | **done** | 8 tests in `test_budget.py` covering drop order, fallback, log emission |
+| 2.12 | Cost-delta measurement: 3-feature regression set with vs without Phase 2 | attune-author | deferred | Requires real-LLM run; defer to Phase 3 calibration when judge cost is also measured |
+| 2.13 | Update CHANGELOG + README | attune-author | **done** | CHANGELOG entry under Unreleased. README addition in same PR. |
 
 ### Phase 2 exit checklist
 
-- [ ] Tasks 2.1–2.13 done
-- [ ] 0/3 high-severity ops-dashboard errors recur in Phase-2-only polish
-- [ ] Cost delta < 10%
-- [ ] Spec status updated
+- [x] Tasks 2.1–2.11, 2.13 done (60 new tests)
+- [x] Spec status updated
+- [ ] Live acceptance: 0/3 high-severity ops-dashboard errors recur in
+Phase-2-only polish (requires real-LLM run — gated to a follow-up
+task once `ANTHROPIC_API_KEY` is available in a CI lane)
+- [ ] Cost delta < 10% (deferred to Phase 3 calibration run)
 
 ---
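Tasks 2.7 and 2.11 describe the budget rule: whole blocks are dropped in the order dataclasses, public_api, cli_help, a warning is logged per drop, and enforcement never fails. A minimal sketch of that rule follows, assuming the budget is measured over block bodies in UTF-8 bytes. `enforce_budget_sketch` is a hypothetical name; the real `enforce_budget` may also count the tags or header and may handle an oversized `cli_help` differently (the tests mention a "fallback" that is not described here).

```python
import logging

logger = logging.getLogger("ground_truth_sketch")

# Least authoritative first: these blocks are dropped before the others.
DROP_ORDER = ("dataclasses", "public_api", "cli_help")


def _size(blocks: list[tuple[str, str]]) -> int:
    # Assumption: only block bodies count toward the budget, in UTF-8 bytes.
    return sum(len(body.encode("utf-8")) for _tag, body in blocks)


def enforce_budget_sketch(
    blocks: list[tuple[str, str]], budget_bytes: int
) -> list[tuple[str, str]]:
    """Drop whole blocks in DROP_ORDER until the total fits; never raise."""
    kept = list(blocks)
    for tag in DROP_ORDER:
        if _size(kept) <= budget_bytes:
            break
        if any(t == tag for t, _body in kept):
            kept = [(t, body) for t, body in kept if t != tag]
            logger.warning(
                "ground-truth budget of %d bytes exceeded; dropped <%s>",
                budget_bytes,
                tag,
            )
    return kept
```

Dropping whole blocks instead of truncating them means the model never sees a half-listed API, which fits the "names must appear verbatim" anchor.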

src/attune_author/generator.py

Lines changed: 36 additions & 0 deletions
@@ -39,6 +39,8 @@ def _parallel_polish(
     feature: object,
     source_info: object,
     use_rag: bool,
+    matched_files: list[str] | None = None,
+    project_root: Path | None = None,
 ) -> dict[str, tuple[str, Path]]:
     """Polish a batch of rendered templates concurrently.
 
@@ -47,6 +49,12 @@ def _parallel_polish(
         feature: Feature being documented (read-only, thread-safe).
         source_info: Extracted source info (read-only, thread-safe).
         use_rag: Whether to use RAG grounding during polish.
+        matched_files: Source file paths (relative to ``project_root``)
+            for the feature, used by Phase 2 ground-truth context
+            injection. ``None`` skips that injection.
+        project_root: Consumer project root; required when
+            ``matched_files`` is supplied. Used to resolve relative
+            paths and to run the consumer's CLI for ``--help``.
 
     Returns:
         Mapping of depth -> (polished_content, out_path). Raises
@@ -60,6 +68,8 @@ def _task(depth: str, content: str, out_path: Path) -> tuple[str, str, Path]:
             source_info,  # type: ignore[arg-type]
             template_type=depth,
             use_rag=use_rag,
+            matched_files=matched_files,
+            project_root=project_root,
         )
         return depth, polished, out_path
 
@@ -307,6 +317,8 @@ def generate_feature_templates(
             prep.feature,
             prep.source_info,
             prep.use_rag,
+            matched_files=list(prep.matched_files),
+            project_root=Path(project_root),
         )
         polished_text: dict[str, str] = {depth: text for depth, (text, _path) in polished.items()}
 
@@ -512,6 +524,8 @@ def _maybe_polish(
     source_info: _SourceInfo,
     template_type: str = "generic",
     use_rag: bool = True,
+    matched_files: list[str] | None = None,
+    project_root: Path | None = None,
 ) -> str:
     """Run the LLM polish pass on rendered template content.
 
@@ -564,12 +578,34 @@ def _maybe_polish(
         template_type,
     )
 
+    # Phase 2 ground-truth context injection. The block carries
+    # authoritative surface details (CLI --help, public API, dataclass
+    # fields). Composed BEFORE the RAG block so the model reads the
+    # ground truth first; the anchor clause is added to the system
+    # prompt only when this block is actually present.
+    ground_truth_text: str | None = None
+    if matched_files and project_root is not None:
+        from attune_author.ground_truth import build_context as build_ground_truth
+
+        absolute_sources = [project_root / rel_path for rel_path in matched_files]
+        ground_truth_text = build_ground_truth(
+            feature,
+            absolute_sources,
+            project_root=project_root,
+        )
+
+    if ground_truth_text and augmented_context:
+        augmented_context = ground_truth_text + "\n" + augmented_context
+    elif ground_truth_text:
+        augmented_context = ground_truth_text
+
     return polish_template(
         content,
         feature.name,
         summary,
         template_type=template_type,
         augmented_context=augmented_context,
+        include_ground_truth_anchor=ground_truth_text is not None,
     )
 

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
+"""Ground-truth context injection for the polish pass.
+
+Phase 2 of the polish-fact-check spec
+(``docs/specs/polish-fact-check``). Builds and injects authoritative
+surface details (CLI ``--help`` output, public API signatures,
+dataclass fields) into the polish prompt so the LLM has to anchor
+on real names instead of inventing them.
+
+The fact-check pass (Phase 1) catches mistakes after the fact.
+This phase prevents them by changing what the model sees.
+"""
+
+from __future__ import annotations
+
+import logging
+from pathlib import Path
+
+from attune_author.manifest import Feature
+
+from .budget import enforce_budget
+from .cli_help import extract_cli_help
+from .config import GroundTruthConfig, load_config
+from .dataclass_refs import extract_dataclasses
+from .public_api import extract_public_api
+
+logger = logging.getLogger(__name__)
+
+
+#: System-prompt clause appended (by callers that inject context)
+#: instructing the model to anchor on the ground-truth blocks. The
+#: text is intentionally short and concrete — the existing polish
+#: system prompts are already long, so we keep this addition tight.
+ANCHORING_CLAUSE = (
+    "\n\nThe user message contains <cli_help>, <public_api>, and "
+    "<dataclasses> blocks with ground-truth surface details for this "
+    "feature. When you reference a CLI flag, public function, import "
+    "path, or dataclass field, it MUST appear verbatim in one of those "
+    "blocks. If you need to describe something not in the ground "
+    "truth, describe the behavior without inventing a specific name."
+)
+
+
+def build_context(
+    feature: Feature,
+    source_paths: list[Path],
+    *,
+    project_root: Path,
+    config: GroundTruthConfig | None = None,
+) -> str | None:
+    """Build a ground-truth context string for the polish prompt.
+
+    Args:
+        feature: The feature being documented. ``feature.cli_command``
+            drives the CLI ``--help`` block; absence skips that block.
+        source_paths: Source ``.py`` files matched by ``feature.files``.
+            Used for ``__all__``, public-API signatures, and dataclass
+            extraction.
+        project_root: Used to invoke the consumer's CLI for ``--help``.
+        config: Optional explicit config; ``None`` loads from the
+            project's ``pyproject.toml``.
+
+    Returns:
+        A context string with sentinel-tagged blocks, ready to pass as
+        ``augmented_context=`` to :func:`attune_author.polish.polish_template`.
+        Returns ``None`` if the feature is disabled or no source had any
+        extractable surface.
+    """
+    cfg = config if config is not None else load_config(project_root)
+    if not cfg.enabled:
+        return None
+
+    blocks: list[tuple[str, str]] = []
+
+    cli_help_text = ""
+    if cfg.inject_cli_help and feature.cli_command:
+        cli_help_text = extract_cli_help(
+            cfg.cli_executable,
+            feature.cli_command,
+            project_root=project_root,
+        )
+    if cli_help_text:
+        blocks.append(("cli_help", cli_help_text))
+
+    public_api_text = ""
+    if cfg.inject_public_api:
+        public_api_text = extract_public_api(source_paths)
+    if public_api_text:
+        blocks.append(("public_api", public_api_text))
+
+    dataclass_text = ""
+    if cfg.inject_dataclasses:
+        dataclass_text = extract_dataclasses(source_paths)
+    if dataclass_text:
+        blocks.append(("dataclasses", dataclass_text))
+
+    if not blocks:
+        return None
+
+    blocks = enforce_budget(blocks, cfg.budget_bytes)
+
+    parts: list[str] = ["## Ground-truth context\n"]
+    for tag, body in blocks:
+        parts.append(f"<{tag}>\n{body.rstrip()}\n</{tag}>\n")
+    return "\n".join(parts) + "\n"
+
+
+__all__ = [
+    "ANCHORING_CLAUSE",
+    "GroundTruthConfig",
+    "build_context",
+    "enforce_budget",
+    "extract_cli_help",
+    "extract_dataclasses",
+    "extract_public_api",
+    "load_config",
+]
