Commit 05b75c0

docs: add harness engineering guides and reorganize docs structure (#589)
Separate user-facing guides from developer-facing docs:
- Create docs/contributing/ for internal developer docs
- Move adding-a-language.md from guides/ to contributing/
- Add docs/contributing/harness-engineering.md (internal practices)
- Add docs/use-cases/harness-engineering.md (user-facing use case)
- Update references in README, SUPPORT, CONTRIBUTING
1 parent 5fff684 commit 05b75c0

6 files changed

Lines changed: 635 additions & 3 deletions

CONTRIBUTING.md

Lines changed: 1 addition & 1 deletion
@@ -248,7 +248,7 @@ recall will be questioned during review.
 Adding a new language is one of the most impactful contributions. We have a
 dedicated step-by-step guide:
 
-**[Adding a New Language](docs/guides/adding-a-language.md)**
+**[Adding a New Language](docs/contributing/adding-a-language.md)**
 
 This covers the full dual-engine workflow (WASM + native Rust), including every
 file to modify, code templates, and a verification checklist.

README.md

Lines changed: 1 addition & 1 deletion
@@ -791,7 +791,7 @@ npm install
 npm test
 ```
 
-Looking to add a new language? Check out **[Adding a New Language](docs/guides/adding-a-language.md)**.
+Looking to add a new language? Check out **[Adding a New Language](docs/contributing/adding-a-language.md)**.
 
 ## 📄 License
 
SUPPORT.md

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ Here are the best ways to get help with codegraph:
 - [README](README.md) — Quick start, commands, features, and configuration
 - [CONTRIBUTING.md](CONTRIBUTING.md) — Development setup and contribution guide
 - [Recommended Practices](docs/guides/recommended-practices.md) — Git hooks, CI/CD, AI agent integration
-- [Adding a New Language](docs/guides/adding-a-language.md) — Step-by-step language support guide
+- [Adding a New Language](docs/contributing/adding-a-language.md) — Step-by-step language support guide
 
 ### Questions & Discussions
 
File renamed without changes.
Lines changed: 294 additions & 0 deletions
@@ -0,0 +1,294 @@
# Harness AI Engineering

A practical guide to building systems that prevent AI coding agents from repeating mistakes.

---

## What is Harness Engineering?

The term was coined by Mitchell Hashimoto (creator of Terraform and Ghostty). The core principle:

> Every time an agent makes a mistake, you invest time engineering a solution so the agent never makes that mistake again.

The formula: **Model + Harness = Agent**. The harness is the set of constraints, tools, documentation, and feedback loops that keep an agent productive. A mediocre model with a great harness can outperform a great model with no harness.

This is not a one-time setup; it is a discipline that grows with every failure.

---

## The 4-Layer Defense Model

Based on the INNOQ model, quality control stacks in layers, and each layer catches what the previous one missed.

### Layer 1: Deterministic Guardrails

Automated checks that mechanically prevent bad code from landing.

**Pre-commit hooks** (fast, local):
- Unit tests, integration tests
- Architecture tests (dependency direction, cycle detection)
- Linting and formatting
- Blast radius thresholds

**CI pipeline** (thorough, remote):
- End-to-end tests
- Security scans
- Static analysis
- Change validation gates

Zero-tolerance enforcement: the agent cannot proceed until all checks pass. There are no warnings, only blocking failures that force self-correction.

```bash
# Example: codegraph pre-commit gate
codegraph build
codegraph check --staged --no-new-cycles --max-blast-radius 50 -T
```
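
One way to wire this gate into Git is a `pre-commit` hook. The sketch below is a hypothetical installer: the function name `install_precommit_gate` is an assumption made for illustration and is not part of codegraph, and it assumes `.git` is a regular directory. Only the two `codegraph` commands are taken from the example above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: writes a pre-commit hook that runs the codegraph gate.
# Only the two codegraph commands come from the guide; the rest is illustrative.

install_precommit_gate() {
  local repo_root="${1:-.}"
  local hook="$repo_root/.git/hooks/pre-commit"

  mkdir -p "$(dirname "$hook")"
  # The quoted heredoc delimiter keeps the hook body literal (no expansion here).
  cat > "$hook" <<'HOOK'
#!/usr/bin/env bash
# Block the commit as soon as any gate fails.
set -euo pipefail
codegraph build
codegraph check --staged --no-new-cycles --max-blast-radius 50 -T
HOOK
  chmod +x "$hook"
}
```

Calling `install_precommit_gate /path/to/repo` once per clone would be enough. Because the generated hook uses `set -e`, a failing gate exits non-zero and Git aborts the commit.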

### Layer 2: AI Review

A separate AI agent reviews the code independently. It examines requirement fulfillment, architecture compliance, and code smells that static analysis misses. This provides consistent, fast evaluation without human bottlenecks.

### Layer 3: Selective Human Review

Developers focus exclusively on core business logic and domain decisions. Standard patterns, boilerplate, and mapping code stay within the harness's scope. The shift is from "read every line" to "targeted attention based on risk."

### Layer 4: Product Testing

Functional verification asks whether the software works as intended: feature testing, behavior verification, and UX validation, with preview environments deployed per merge request.

**Accountability test:** "Would you ship this if you were on call tonight?" If no, the harness needs strengthening.

---

## Practice 1: AGENTS.md as Table of Contents

Your `CLAUDE.md` / `AGENTS.md` is the highest-leverage harness component. It's injected into the system prompt (roughly one-third of the instructions the agent can follow with consistency).

**Rules:**
- Keep it under ~100 lines. Every line should correspond to a specific observed failure.
- Use it as a pointer to deeper docs, not an encyclopedia.
- Never auto-generate it: LLM-generated instruction files increase cost ~20% with no accuracy improvement (ETH Zurich study). Human-written, failure-driven instructions are what work.

**Structure:**

```markdown
# CLAUDE.md

## Build
- Run full build: `npm run build`
- Run tests: `npm test`
- Run lint: `npm run lint`

## Architecture
- Dependency direction: Types -> Config -> Repo -> Service -> Runtime -> UI
- Never import from a layer to the right

## Coding rules
- All logging must be structured (JSON)
- Max file size: 500 lines

## When you finish a task
- Run tests before committing
- Write a descriptive commit message
- Update progress file
```

Start small. Add rules only when the agent fails repeatedly on the same point. The Ghostty project's `AGENTS.md` is deliberately terse: build commands, test commands, directory structure, and one anti-pattern rule. Each line earns its place by preventing a specific observed failure.
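
The ~100-line budget can itself be enforced mechanically rather than remembered. A minimal sketch, assuming a hypothetical `check_instruction_budget` helper with a default limit of 100 lines (both the name and the message wording are illustrative and do not come from any tool):

```bash
# Hypothetical guard: fail when an instruction file outgrows its line budget.
check_instruction_budget() {
  local file="$1"
  local max_lines="${2:-100}"
  local count

  if [ ! -f "$file" ]; then
    echo "Error: $file not found. Create it with build commands and failure-driven rules." >&2
    return 2
  fi

  # Arithmetic expansion strips the padding some wc implementations add.
  count=$(( $(wc -l < "$file") ))

  if [ "$count" -gt "$max_lines" ]; then
    echo "Error: $file has $count lines (limit $max_lines)." >&2
    echo "Move detail into deeper docs and keep only a pointer here." >&2
    return 1
  fi
}
```

Run as `check_instruction_budget CLAUDE.md` from a hook or CI step. It prints nothing when the file is within budget, and on failure the message tells the agent what to do next.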

---

## Practice 2: Remediation-Focused Linter Messages

OpenAI's key finding: custom linters with remediation-focused error messages are critical because **the error message becomes part of the agent's context when it fails**.

**Ineffective:**
```
Error: Invalid import
```

**Effective:**
```
Error: Service layer cannot import from UI layer.
Move this logic to a Provider or restructure the dependency.
See docs/ARCHITECTURE.md#layers
```

The remediation message teaches the agent how to fix the problem in-context, enabling self-correction without human intervention. Write linter messages as if they are instructions to an agent, because they are.

With codegraph, this is built-in:

```bash
# codegraph check provides actionable output
codegraph check --staged --no-new-cycles --max-blast-radius 50 -T
# Output: "Cycle detected: A -> B -> C -> A. Break the cycle by..."
# Output: "Blast radius 67 exceeds threshold 50. Function X affects..."
```

---

## Practice 3: Silent Success, Loud Failure

Running full test suites (thousands of passing tests) floods the context window. The agent loses track of its task and starts hallucinating about test files it just read.

**Rule:** Configure scripts so stdout on success is minimal. Only surface errors.

```bash
# Bad: 4,000 lines of passing tests flood context
npm test

# Good: swallow passing output, surface only failures
# (on failure this re-runs the suite so the failing output is shown)
npm test > /dev/null 2>&1 || npm test
```

With Claude Code hooks, this is the default pattern: a hook that exits 0 generally adds nothing to the agent's context, and only a failing hook surfaces its message to the agent.
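
The one-liner above runs the suite a second time whenever it fails. A small wrapper can capture the output once and replay it only on failure. This is a sketch under stated assumptions: `quiet_run` is a hypothetical helper name, not something provided by npm, codegraph, or Claude Code.

```bash
# Hypothetical wrapper: run a command once, stay silent on success,
# and replay the captured output only when the command fails.
quiet_run() {
  local log status
  log="$(mktemp)" || return 1

  if "$@" > "$log" 2>&1; then
    status=0
  else
    status=$?          # exit status of the wrapped command
    cat "$log" >&2     # surface the full output for the agent to read
  fi

  rm -f "$log"
  return "$status"
}
```

Usage would look like `quiet_run npm test`. The wrapped command's exit status is preserved, so the wrapper can sit in front of any gate without changing whether it blocks.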

---

## Practice 4: Mechanical Architecture Enforcement

Don't document "please follow this pattern"; enforce it mechanically. Agents replicate patterns that already exist in the repository, even suboptimal ones. Without mechanical enforcement, bad patterns compound quickly.

**Dependency direction:**
```
Types -> Config -> Repo -> Service -> Runtime -> UI
```

**Enforcement tools:**
- `codegraph check --no-boundary-violations`: blocks imports that violate layer direction
- `codegraph cycles`: detects circular dependencies
- Custom ESLint rules or `dependency-cruiser` for additional constraints
- CI gates that fail the build on violations

The agent cannot land an import that violates the direction. It doesn't need to "know" the rule, because the harness enforces it.
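
To make the direction rule concrete, here is a minimal sketch of the check itself. It is not how codegraph implements boundary checks; the function names `layer_rank` and `check_import` are hypothetical, and treating same-layer imports as allowed is an assumption. The layer names and their order come from the diagram above.

```bash
# Hypothetical layer check. Layer names and order come from the guide;
# function names and message wording are illustrative.
layer_rank() {
  case "$1" in
    Types)   echo 0 ;;
    Config)  echo 1 ;;
    Repo)    echo 2 ;;
    Service) echo 3 ;;
    Runtime) echo 4 ;;
    UI)      echo 5 ;;
    *)       return 1 ;;
  esac
}

# check_import FROM TO: succeed only if FROM may import from TO,
# i.e. TO is the same layer or sits to the left of FROM.
check_import() {
  local from="$1" to="$2" from_rank to_rank
  from_rank="$(layer_rank "$from")" || { echo "Error: unknown layer '$from'." >&2; return 2; }
  to_rank="$(layer_rank "$to")" || { echo "Error: unknown layer '$to'." >&2; return 2; }

  if [ "$to_rank" -gt "$from_rank" ]; then
    echo "Error: $from layer cannot import from $to layer." >&2
    echo "Move the shared logic into a layer at or to the left of $from, or restructure the dependency." >&2
    return 1
  fi
}
```

For example, `check_import UI Service` succeeds silently, while `check_import Service UI` fails with a message that names both layers and suggests a fix.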

---

## Practice 5: Sub-Agents as Context Firewalls

Sub-agents encapsulate discrete tasks in isolated context windows. The parent agent only sees the prompt sent and the final result, so no intermediate tool calls, file reads, or search results pollute the parent's context.

**Good uses for sub-agents:**
- Research and code exploration
- Implementation of isolated features
- Code review
- Test generation

**Cost optimization:** Use expensive models (Opus) for orchestration, cheaper models (Sonnet/Haiku) for sub-agents. Return format should be highly condensed with `filepath:line` citations.

**Anti-pattern:** Role-based agents ("frontend engineer" vs "backend engineer") don't work well. Task-based agents work.

---

## Practice 6: Progress Files for Long-Running Tasks

Anthropic documented this pattern for agents that work across many sessions. The core challenge: each new context window starts with no memory.

**Two-agent architecture:**

1. **Initializer agent** (runs once):
   - Creates `init.sh` (one-command environment setup)
   - Creates `progress.txt` (work history log)
   - Creates `features.json` (comprehensive feature breakdown with pass/fail status)
   - Makes an initial commit documenting everything

2. **Coding agent** (every subsequent session):
   - Reads git logs and progress files for context
   - Selects the single highest-priority incomplete feature
   - Implements it incrementally
   - Runs end-to-end verification
   - Commits and updates progress documentation

**Key details:**
- Use JSON for feature tracking (not markdown): agents are less likely to overwrite structured data
- Track failed approaches and why they didn't work: this prevents repeating dead ends
- One feature per session: scope creep across features degrades quality
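
Tracking failed approaches needs very little machinery. The sketch below assumes a tab-separated line format for `progress.txt` and two hypothetical helpers, `log_attempt` and `failed_approaches`. The file name comes from the list above, but the format and function names are assumptions, not Anthropic's implementation.

```bash
# Hypothetical progress log helpers. The file name progress.txt comes from the guide;
# the tab-separated line format is an assumption made for this sketch.
PROGRESS_FILE="${PROGRESS_FILE:-progress.txt}"

# log_attempt STATUS FEATURE NOTE: append one timestamped line
# (STATUS is "done" or "failed").
log_attempt() {
  printf '%s\t%s\t%s\t%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$3" >> "$PROGRESS_FILE"
}

# failed_approaches FEATURE: print the notes of earlier failed attempts at FEATURE.
failed_approaches() {
  [ -f "$PROGRESS_FILE" ] || return 0
  awk -F '\t' -v feature="$1" \
    '$2 == "failed" && $3 == feature { print $4 }' "$PROGRESS_FILE"
}
```

A coding agent could call `failed_approaches login` at the start of a session to see which dead ends to avoid, then record its own outcome with `log_attempt` before committing.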

---

## Practice 7: End-to-End Verification

Agents tend to mark features complete without adequate testing. Without explicit prompting, they use unit tests or curl commands but fail to verify end-to-end functionality.

**Solution:** Give the agent tools for end-to-end verification:
- Browser automation (Puppeteer MCP) for UI testing
- `codegraph diff-impact --staged` for structural impact verification
- Integration test suites that exercise real code paths

The agent must verify features work as a user would experience them, not just that the code compiles.

---

## Practice 8: Wrapper CLIs Over MCP Servers

MCP tool descriptions consume thousands of tokens from the system prompt. For simple integrations, a wrapper CLI with 5-6 usage examples in your AGENTS.md is cheaper and often more effective.

```markdown
## Issue tracking
Use `./scripts/issues.sh` to manage issues:
- `./scripts/issues.sh list --status open`: list open issues
- `./scripts/issues.sh get PROJ-123`: get issue details
- `./scripts/issues.sh update PROJ-123 --status done`: close an issue
```

Reserve MCP for tools that benefit from structured schema and dynamic discovery (like codegraph's 30+ tools). Use wrapper CLIs for simple CRUD operations.
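
A wrapper of this kind can be a thin dispatcher. The sketch below shows one possible shape for `./scripts/issues.sh`. The subcommands mirror the usage lines above, but the backend command `tracker` and its arguments are invented for illustration and do not refer to a real CLI.

```bash
# Hypothetical ./scripts/issues.sh. The subcommands mirror the usage lines above;
# the `tracker` backend command and its arguments are invented for illustration.
TRACKER_CMD="${TRACKER_CMD:-tracker}"

issues_usage() {
  echo "Usage: issues.sh list --status <status> | get <id> | update <id> --status <status>" >&2
}

issues_main() {
  local cmd="${1:-}"
  [ "$#" -gt 0 ] && shift

  case "$cmd" in
    list)
      [ "${1:-}" = "--status" ] && [ -n "${2:-}" ] || { issues_usage; return 2; }
      "$TRACKER_CMD" issue list --status "$2"
      ;;
    get)
      [ -n "${1:-}" ] || { issues_usage; return 2; }
      "$TRACKER_CMD" issue show "$1"
      ;;
    update)
      [ -n "${1:-}" ] && [ "${2:-}" = "--status" ] && [ -n "${3:-}" ] || { issues_usage; return 2; }
      "$TRACKER_CMD" issue update "$1" --status "$3"
      ;;
    *)
      issues_usage
      return 2
      ;;
  esac
}
```

As a script, the file would end with `issues_main "$@"`. Bad input prints a one-line usage message to stderr, so the agent sees the valid forms at the moment it gets one wrong.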

---

## Practice 9: Continuous Garbage Collection

Instead of periodic cleanup sprints, encode golden principles as lint rules and run background agent tasks on a regular cadence to auto-generate targeted refactoring PRs.

Human taste is captured once in the rule, then enforced continuously:

```bash
# Scheduled: find code that violates current standards
codegraph roles --role dead -T    # Find dead code
codegraph triage -T               # Risk-ranked priority queue
codegraph check -T                # Health gate violations
```

The engineering discipline shifts from code quality to **scaffolding quality**: the tooling, documentation, feedback loops, and architectural constraints that maintain coherence during autonomous code generation.

---

## Applying This to Codegraph Projects

Codegraph already implements most of these practices. Here's how they map:

| Harness Practice | Codegraph Implementation |
|---|---|
| Deterministic guardrails | `codegraph check` pre-commit gates, cycle detection, blast radius thresholds |
| Remediation-focused errors | `codegraph check` output includes what violated and where |
| Mechanical architecture | `codegraph check --no-boundary-violations`, `codegraph cycles` |
| Silent success / loud failure | Claude Code hooks exit silently on success |
| AGENTS.md | `CLAUDE.md` with codegraph workflow commands |
| Progress tracking | Titan Paradigm skills with state files |
| Sub-agent context isolation | Claude Code sub-agents with `/worktree` isolation |
| End-to-end verification | `codegraph diff-impact --staged` structural verification |
| Continuous garbage collection | `codegraph triage`, `codegraph roles --role dead` |

### Quick Start

To add harness engineering to an existing codegraph project:

1. **Create `CLAUDE.md`** with build commands and your top 5 failure-driven rules
2. **Add pre-commit hooks** using `codegraph check`:
   ```bash
   codegraph check --staged --no-new-cycles --max-blast-radius 50 -T
   ```
3. **Configure CI gates** with `codegraph check -T` in your pipeline
4. **Set up Claude Code hooks**: see [Claude Code Hooks Guide](../examples/claude-code-hooks/README.md) for ready-to-use scripts
5. **Add boundary rules** in `.codegraphrc.json` to enforce your architecture mechanically
6. **Iterate:** every time the agent makes a mistake, add a rule or a check. The harness grows with every failure.

---

## Sources

- [Mitchell Hashimoto: My AI Adoption Journey](https://mitchellh.com/writing/my-ai-adoption-journey)
- [Ghostty AGENTS.md](https://github.com/ghostty-org/ghostty/blob/main/AGENTS.md)
- [Anthropic: Effective Harnesses for Long-Running Agents](https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents)
- [OpenAI: Harness Engineering](https://openai.com/index/harness-engineering/)
- [INNOQ: From Vibe Coder to Code Owner](https://www.innoq.com/en/blog/2026/02/from-vibe-coder-to-code-owner/)
- [HumanLayer: Skill Issue: Harness Engineering for Coding Agents](https://www.humanlayer.dev/blog/skill-issue-harness-engineering-for-coding-agents)
- [Martin Fowler: Harness Engineering](https://martinfowler.com/articles/exploring-gen-ai/harness-engineering.html)
