| 1 | +# Harness AI Engineering |
| 2 | + |
| 3 | +A practical guide to building systems that prevent AI coding agents from repeating mistakes. |
| 4 | + |
| 5 | +--- |
| 6 | + |
| 7 | +## What is Harness Engineering? |
| 8 | + |
| 9 | +The term was coined by Mitchell Hashimoto (creator of Terraform and Ghostty). The core principle: |
| 10 | + |
| 11 | +> Every time an agent makes a mistake, you invest time engineering a solution so the agent never makes that mistake again. |
| 12 | + |
| 13 | +The formula: **Model + Harness = Agent**. The harness is the set of constraints, tools, documentation, and feedback loops that keep an agent productive. The claim behind it is that a mediocre model with a great harness can outperform a great model with no harness. |
| 14 | + |
| 15 | +This is not a one-time setup — it's a discipline that grows with every failure. |
| 16 | + |
| 17 | +--- |
| 18 | + |
| 19 | +## The 4-Layer Defense Model |
| 20 | + |
| 21 | +Based on the INNOQ model, quality control stacks in layers — each catches what the previous one missed. |
| 22 | + |
| 23 | +### Layer 1: Deterministic Guardrails |
| 24 | + |
| 25 | +Automated checks that mechanically prevent bad code from landing. |
| 26 | + |
| 27 | +**Pre-commit hooks** (fast, local): |
| 28 | +- Unit tests, integration tests |
| 29 | +- Architecture tests (dependency direction, cycle detection) |
| 30 | +- Linting and formatting |
| 31 | +- Blast radius thresholds |
| 32 | + |
| 33 | +**CI pipeline** (thorough, remote): |
| 34 | +- End-to-end tests |
| 35 | +- Security scans |
| 36 | +- Static analysis |
| 37 | +- Change validation gates |
| 38 | + |
| 39 | +Zero-tolerance enforcement: the agent cannot proceed until all checks pass. No warnings — only blocking failures that force self-correction. |
| 40 | + |
| 41 | +```bash |
| 42 | +# Example: codegraph pre-commit gate |
| 43 | +codegraph build |
| 44 | +codegraph check --staged --no-new-cycles --max-blast-radius 50 -T |
| 45 | +``` |
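A hook usually chains several gates rather than one. The sketch below is one possible shape, assuming a POSIX shell: `run_gates` is a hypothetical helper (not a codegraph or git feature) that runs one command per input line and blocks at the first failure.

```bash
#!/bin/sh
# Hypothetical helper for a pre-commit hook (not part of codegraph or git).
# Reads one gate command per line on stdin, runs the gates in order, and
# stops at the first failure so the commit is blocked.
run_gates() {
  while IFS= read -r gate; do
    # Skip blank lines and comment lines in the gate list.
    case "$gate" in ''|'#'*) continue ;; esac
    # </dev/null keeps the gate from consuming the rest of the gate list.
    if ! sh -c "$gate" </dev/null; then
      echo "BLOCKED: gate failed: $gate" >&2
      return 1
    fi
  done
  return 0
}
```

A `.git/hooks/pre-commit` script could then feed it the commands used elsewhere in this guide, for example `printf '%s\n' 'npm run lint' 'npm test' 'codegraph check --staged --no-new-cycles --max-blast-radius 50 -T' | run_gates || exit 1`. Putting the cheapest gate first keeps the feedback loop short.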
| 46 | + |
| 47 | +### Layer 2: AI Review |
| 48 | + |
| 49 | +A separate AI agent reviews the code independently. It examines requirement fulfillment, architecture compliance, and code smells that static analysis misses. This provides consistent, fast evaluation without human bottlenecks. |
| 50 | + |
| 51 | +### Layer 3: Selective Human Review |
| 52 | + |
| 53 | +Developers focus exclusively on core business logic and domain decisions. Standard patterns, boilerplate, and mapping code stay within the harness's scope. The shift is from "read every line" to "targeted attention based on risk." |
| 54 | + |
| 55 | +### Layer 4: Product Testing |
| 56 | + |
| 57 | +Functional verification: does the software work as intended? Feature testing, behavior verification, UX validation. Preview environments deployed per merge request. |
| 58 | + |
| 59 | +**Accountability test:** "Would you ship this if you were on call tonight?" If no, the harness needs strengthening. |
| 60 | + |
| 61 | +--- |
| 62 | + |
| 63 | +## Practice 1: AGENTS.md as Table of Contents |
| 64 | + |
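| 65 | +Your `CLAUDE.md` / `AGENTS.md` is the highest-leverage harness component. It is injected into the system prompt on every session, and the system prompt already uses roughly one-third of the instructions the agent can follow consistently, so every line you add competes for a limited budget. |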
| 66 | + |
| 67 | +**Rules:** |
| 68 | +- Keep it under ~100 lines. Every line should correspond to a specific observed failure. |
| 69 | +- Use it as a pointer to deeper docs, not an encyclopedia. |
| 70 | +- Never auto-generate it — LLM-generated instruction files increase cost ~20% with no accuracy improvement (ETH Zurich study). Human-written, failure-driven instructions are what work. |
| 71 | + |
| 72 | +**Structure:** |
| 73 | + |
| 74 | +```markdown |
| 75 | +# CLAUDE.md |
| 76 | + |
| 77 | +## Build |
| 78 | +- Run full build: `npm run build` |
| 79 | +- Run tests: `npm test` |
| 80 | +- Run lint: `npm run lint` |
| 81 | + |
| 82 | +## Architecture |
| 83 | +- Dependency direction: Types -> Config -> Repo -> Service -> Runtime -> UI |
| 84 | +- Never import from a layer to the right |
| 85 | + |
| 86 | +## Coding rules |
| 87 | +- All logging must be structured (JSON) |
| 88 | +- Max file size: 500 lines |
| 89 | + |
| 90 | +## When you finish a task |
| 91 | +- Run tests before committing |
| 92 | +- Write descriptive commit message |
| 93 | +- Update progress file |
| 94 | +``` |
| 95 | + |
| 96 | +Start small. Add rules only when the agent fails repeatedly on the same point. The Ghostty project's `AGENTS.md` is deliberately terse: build commands, test commands, directory structure, and one anti-pattern rule. Each line earns its place by preventing a specific observed failure. |
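The line budget itself can be enforced mechanically rather than by convention. The following is a minimal sketch, assuming a POSIX shell; `check_line_budget` is a hypothetical helper, and the default of 100 simply mirrors the rule of thumb above.

```bash
#!/bin/sh
# Hypothetical check: fail when an instruction file outgrows its line budget.
# The default budget of 100 mirrors the "~100 lines" rule of thumb; adjust it.
check_line_budget() {
  file=$1
  max=${2:-100}
  [ -r "$file" ] || { echo "cannot read $file" >&2; return 2; }
  # wc -l may pad the count with spaces on some platforms; strip them.
  lines=$(wc -l < "$file" | tr -d ' ')
  if [ "$lines" -gt "$max" ]; then
    echo "$file has $lines lines (budget: $max)." >&2
    echo "Move detail into linked docs and keep only failure-driven rules here." >&2
    return 1
  fi
  return 0
}
```

Called as `check_line_budget CLAUDE.md` from a pre-commit hook or CI step, it turns the guideline into a blocking check, and the message tells the agent what to do about it.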
| 97 | + |
| 98 | +--- |
| 99 | + |
| 100 | +## Practice 2: Remediation-Focused Linter Messages |
| 101 | + |
| 102 | +OpenAI's key finding: custom linters with remediation-focused error messages are critical because **the error message becomes part of the agent's context when it fails**. |
| 103 | + |
| 104 | +**Ineffective:** |
| 105 | +``` |
| 106 | +Error: Invalid import |
| 107 | +``` |
| 108 | + |
| 109 | +**Effective:** |
| 110 | +``` |
| 111 | +Error: Service layer cannot import from UI layer. |
| 112 | +Move this logic to a Provider or restructure the dependency. |
| 113 | +See docs/ARCHITECTURE.md#layers |
| 114 | +``` |
| 115 | + |
| 116 | +The remediation message teaches the agent how to fix the problem in-context, enabling self-correction without human intervention. Write linter messages as if they are instructions to an agent — because they are. |
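To make this concrete, here is a rough sketch of a custom check that emits the "effective" message shown earlier. It assumes a POSIX shell with a `grep` that supports `-r` (GNU and BSD both do), ES-module style imports, and a layout where service code reaches the UI via a path containing `/ui/`. The function name and directory layout are hypothetical.

```bash
#!/bin/sh
# Hypothetical lint: flags files under a service directory that import from a
# UI directory and prints a remediation-focused message for every hit.
# Assumptions: ES-module imports and a "/ui/" segment in the import path.
lint_service_imports() {
  service_dir=${1:-src/service}
  # grep -rnE prints "file:line:text" for each matching import line.
  hits=$(grep -rnE "^import .*from ['\"][^'\"]*/ui/" "$service_dir" 2>/dev/null || true)
  if [ -z "$hits" ]; then
    return 0
  fi
  printf '%s\n' "$hits" | while IFS=: read -r file line _; do
    echo "Error: $file:$line: Service layer cannot import from UI layer."
    echo "Move this logic to a Provider or restructure the dependency."
    echo "See docs/ARCHITECTURE.md#layers"
  done >&2
  return 1
}
```

The point is that the failure text names the file and line, states the rule, and gives the next step, so the agent can act on it without further lookup.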
| 117 | + |
| 118 | +With codegraph, this is built-in: |
| 119 | + |
| 120 | +```bash |
| 121 | +# codegraph check provides actionable output |
| 122 | +codegraph check --staged --no-new-cycles --max-blast-radius 50 -T |
| 123 | +# Output: "Cycle detected: A -> B -> C -> A. Break the cycle by..." |
| 124 | +# Output: "Blast radius 67 exceeds threshold 50. Function X affects..." |
| 125 | +``` |
| 126 | + |
| 127 | +--- |
| 128 | + |
| 129 | +## Practice 3: Silent Success, Loud Failure |
| 130 | + |
| 131 | +Running full test suites (thousands of passing tests) floods the context window. The agent loses track of its task and starts hallucinating about test files it just read. |
| 132 | + |
| 133 | +**Rule:** Configure scripts so stdout on success is minimal. Only surface errors. |
| 134 | + |
| 135 | +```bash |
| 136 | +# Bad: 4,000 lines of passing tests flood context |
| 137 | +npm test |
| 138 | + |
| 139 | +# Good: swallow passing output, surface only failures |
| 140 | +npm test > /dev/null 2>&1 || npm test |
| 141 | +``` |
| 142 | + |
| 143 | +Claude Code hooks follow the same pattern: for most hook events, a hook that exits 0 adds nothing to the agent's context, while a hook that exits with code 2 has its stderr fed back to the agent. |
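The same rule can be packaged as a reusable wrapper that runs the command only once, which matters for slow or flaky suites. This is a sketch under the assumption of a POSIX shell; `run_quiet` is a hypothetical name, not an existing tool.

```bash
#!/bin/sh
# Hypothetical wrapper: run a command once, stay silent if it succeeds, and
# replay its captured output plus the exit code only if it fails.
run_quiet() {
  log=$(mktemp) || return 1
  if "$@" >"$log" 2>&1; then
    rm -f "$log"
    return 0
  else
    status=$?
    cat "$log"
    echo "FAILED (exit $status): $*"
    rm -f "$log"
    return "$status"
  fi
}
```

For example, `run_quiet npm test` prints nothing on a green run and the full test output on a red one, while preserving the original exit code for the calling hook.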
| 144 | + |
| 145 | +--- |
| 146 | + |
| 147 | +## Practice 4: Mechanical Architecture Enforcement |
| 148 | + |
| 149 | +Don't document "please follow this pattern"; enforce it mechanically. Agents replicate patterns that already exist in the repository, even suboptimal ones, so without mechanical enforcement every copied bad pattern becomes a template for the next change. |
| 150 | + |
| 151 | +**Dependency direction:** |
| 152 | +``` |
| 153 | +Types -> Config -> Repo -> Service -> Runtime -> UI |
| 154 | +``` |
| 155 | + |
| 156 | +**Enforcement tools:** |
| 157 | +- `codegraph check --no-boundary-violations` — blocks imports that violate layer direction |
| 158 | +- `codegraph cycles` — detects circular dependencies |
| 159 | +- Custom ESLint rules or `dependency-cruiser` for additional constraints |
| 160 | +- CI gates that fail the build on violations |
| 161 | + |
| 162 | +With these gates in place, the agent cannot land an import that violates the direction. It doesn't need to "know" the rule; the harness rejects the change and the agent has to fix it. |
| 163 | + |
| 164 | +--- |
| 165 | + |
| 166 | +## Practice 5: Sub-Agents as Context Firewalls |
| 167 | + |
| 168 | +Sub-agents encapsulate discrete tasks in isolated context windows. The parent agent only sees the prompt sent and the final result — no intermediate tool calls, file reads, or search results pollute the parent's context. |
| 169 | + |
| 170 | +**Good uses for sub-agents:** |
| 171 | +- Research and code exploration |
| 172 | +- Implementation of isolated features |
| 173 | +- Code review |
| 174 | +- Test generation |
| 175 | + |
| 176 | +**Cost optimization:** Use expensive models (Opus) for orchestration, cheaper models (Sonnet/Haiku) for sub-agents. Return format should be highly condensed with `filepath:line` citations. |
| 177 | + |
| 178 | +**Anti-pattern:** Role-based agents ("frontend engineer" vs "backend engineer") don't work well. Task-based agents work. |
| 179 | + |
| 180 | +--- |
| 181 | + |
| 182 | +## Practice 6: Progress Files for Long-Running Tasks |
| 183 | + |
| 184 | +Anthropic documented this pattern for agents that work across many sessions. The core challenge: each new context window starts with no memory. |
| 185 | + |
| 186 | +**Two-agent architecture:** |
| 187 | + |
| 188 | +1. **Initializer agent** (runs once): |
| 189 | + - Creates `init.sh` (one-command environment setup) |
| 190 | + - Creates `progress.txt` (work history log) |
| 191 | + - Creates `features.json` (comprehensive feature breakdown with pass/fail status) |
| 192 | + - Makes initial commit documenting everything |
| 193 | + |
| 194 | +2. **Coding agent** (every subsequent session): |
| 195 | + - Read git logs and progress files for context |
| 196 | + - Select single highest-priority incomplete feature |
| 197 | + - Implement incrementally |
| 198 | + - Run end-to-end verification |
| 199 | + - Commit and update progress documentation |
| 200 | + |
| 201 | +**Key details:** |
| 202 | +- Use JSON for feature tracking (not markdown) — agents are less likely to overwrite structured data |
| 203 | +- Track failed approaches and why they didn't work — prevents repeating dead ends |
| 204 | +- One feature per session — scope creep across features degrades quality |
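The "track failed approaches" detail is easy to skip unless it is a one-line command. Below is a minimal sketch, assuming a POSIX shell and the `progress.txt` file named above. The helper names, the `PROGRESS_FILE` override, and the pipe-separated entry format are all assumptions made for illustration.

```bash
#!/bin/sh
# Hypothetical helpers for the coding agent's end-of-session step.
# The file name follows the progress.txt convention; the entry format
# "timestamp | feature | outcome | note" is an assumption.
log_progress() {
  feature=$1
  outcome=$2   # e.g. "done", "failed", "in-progress"
  note=$3
  printf '%s | %s | %s | %s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$feature" "$outcome" "$note" \
    >> "${PROGRESS_FILE:-progress.txt}"
}

# Print earlier failed attempts for a feature so a new session can avoid them.
failed_attempts() {
  grep -F " | $1 | failed | " "${PROGRESS_FILE:-progress.txt}" 2>/dev/null || true
}
```

A session might end with `log_progress login failed "cookie-based session broke the SSO redirect"` (a made-up feature and note), and the next session could start with `failed_attempts login` before choosing an approach.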
| 205 | + |
| 206 | +--- |
| 207 | + |
| 208 | +## Practice 7: End-to-End Verification |
| 209 | + |
| 210 | +Agents tend to mark features complete without adequate testing. Without explicit prompting, they use unit tests or curl commands but fail to verify end-to-end functionality. |
| 211 | + |
| 212 | +**Solution:** Give the agent tools for end-to-end verification: |
| 213 | +- Browser automation (Puppeteer MCP) for UI testing |
| 214 | +- `codegraph diff-impact --staged` for structural impact verification |
| 215 | +- Integration test suites that exercise real code paths |
| 216 | + |
| 217 | +The agent must verify features work as a user would experience them, not just that the code compiles. |
| 218 | + |
| 219 | +--- |
| 220 | + |
| 221 | +## Practice 8: Wrapper CLIs Over MCP Servers |
| 222 | + |
| 223 | +MCP tool descriptions consume thousands of tokens from the system prompt. For simple integrations, a wrapper CLI with 5-6 usage examples in your AGENTS.md is cheaper and often more effective. |
| 224 | + |
| 225 | +```markdown |
| 226 | +## Issue tracking |
| 227 | +Use `./scripts/issues.sh` to manage issues: |
| 228 | +- `./scripts/issues.sh list --status open` — list open issues |
| 229 | +- `./scripts/issues.sh get PROJ-123` — get issue details |
| 230 | +- `./scripts/issues.sh update PROJ-123 --status done` — close an issue |
| 231 | +``` |
| 232 | + |
| 233 | +Reserve MCP for tools that benefit from structured schema and dynamic discovery (like codegraph's 30+ tools). Use wrapper CLIs for simple CRUD operations. |
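For a sense of how small such a wrapper can be, here is a sketch in the shape of `./scripts/issues.sh`. It assumes a POSIX shell with `awk`, and, purely as a stand-in for a real tracker, stores issues in a local tab-separated file (id, status, title). The `ISSUES_DB` variable and the file format are hypothetical; a real wrapper would call the tracker's API or CLI in each branch.

```bash
#!/bin/sh
# Sketch of a wrapper CLI in the shape of ./scripts/issues.sh.
# Assumption: issues live in a local tab-separated file (id, status, title).
# A real wrapper would call the tracker's API or CLI in each branch instead.
ISSUES_DB=${ISSUES_DB:-.issues.tsv}

usage() {
  echo "usage: issues list --status STATUS | get ID | update ID --status STATUS" >&2
}

issues() {
  cmd=${1:-}
  [ $# -gt 0 ] && shift
  case "$cmd" in
    list)   # issues list --status open
      [ "${1:-}" = "--status" ] && [ -n "${2:-}" ] || { usage; return 2; }
      awk -F '\t' -v s="$2" '$2 == s { print $1 "\t" $3 }' "$ISSUES_DB"
      ;;
    get)    # issues get PROJ-123
      [ -n "${1:-}" ] || { usage; return 2; }
      awk -F '\t' -v id="$1" '$1 == id { found = 1; print } END { exit !found }' "$ISSUES_DB"
      ;;
    update) # issues update PROJ-123 --status done
      [ -n "${1:-}" ] && [ "${2:-}" = "--status" ] && [ -n "${3:-}" ] || { usage; return 2; }
      tmp=$(mktemp) || return 1
      awk -F '\t' -v OFS='\t' -v id="$1" -v s="$3" '$1 == id { $2 = s } { print }' \
        "$ISSUES_DB" > "$tmp" && mv "$tmp" "$ISSUES_DB"
      ;;
    *) usage; return 2 ;;
  esac
}
```

Saved as a script, the file would end with `issues "$@"`. The one-line usage message doubles as in-context documentation when the agent calls it incorrectly.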
| 234 | + |
| 235 | +--- |
| 236 | + |
| 237 | +## Practice 9: Continuous Garbage Collection |
| 238 | + |
| 239 | +Instead of periodic cleanup sprints, encode golden principles as lint rules and run background agent tasks on a regular cadence to auto-generate targeted refactoring PRs. |
| 240 | + |
| 241 | +Human taste is captured once in the rule, then enforced continuously: |
| 242 | + |
| 243 | +```bash |
| 244 | +# Scheduled: find code that violates current standards |
| 245 | +codegraph roles --role dead -T # Find dead code |
| 246 | +codegraph triage -T # Risk-ranked priority queue |
| 247 | +codegraph check -T # Health gate violations |
| 248 | +``` |
| 249 | + |
| 250 | +The engineering discipline shifts from code quality to **scaffolding quality** — the tooling, documentation, feedback loops, and architectural constraints that maintain coherence during autonomous code generation. |
| 251 | + |
| 252 | +--- |
| 253 | + |
| 254 | +## Applying This to Codegraph Projects |
| 255 | + |
| 256 | +Codegraph already implements most of these practices. Here's how they map: |
| 257 | + |
| 258 | +| Harness Practice | Codegraph Implementation | |
| 259 | +|---|---| |
| 260 | +| Deterministic guardrails | `codegraph check` pre-commit gates, cycle detection, blast radius thresholds | |
| 261 | +| Remediation-focused errors | `codegraph check` output includes what violated and where | |
| 262 | +| Mechanical architecture | `codegraph check --no-boundary-violations`, `codegraph cycles` | |
| 263 | +| Silent success / loud failure | Claude Code hooks exit silently on success | |
| 264 | +| AGENTS.md | `CLAUDE.md` with codegraph workflow commands | |
| 265 | +| Progress tracking | Titan Paradigm skills with state files | |
| 266 | +| Sub-agent context isolation | Claude Code sub-agents with `/worktree` isolation | |
| 267 | +| End-to-end verification | `codegraph diff-impact --staged` structural verification | |
| 268 | +| Continuous garbage collection | `codegraph triage`, `codegraph roles --role dead` | |
| 269 | + |
| 270 | +### Quick Start |
| 271 | + |
| 272 | +To add harness engineering to an existing codegraph project: |
| 273 | + |
| 274 | +1. **Create `CLAUDE.md`** with build commands and your top 5 failure-driven rules |
| 275 | +2. **Add pre-commit hooks** using codegraph check: |
| 276 | + ```bash |
| 277 | + codegraph check --staged --no-new-cycles --max-blast-radius 50 -T |
| 278 | + ``` |
| 279 | +3. **Configure CI gates** with `codegraph check -T` in your pipeline |
| 280 | +4. **Set up Claude Code hooks** — see [Claude Code Hooks Guide](../examples/claude-code-hooks/README.md) for ready-to-use scripts |
| 281 | +5. **Add boundary rules** in `.codegraphrc.json` to enforce your architecture mechanically |
| 282 | +6. **Iterate:** every time the agent makes a mistake, add a rule or a check. The harness grows with every failure. |
| 283 | + |
| 284 | +--- |
| 285 | + |
| 286 | +## Sources |
| 287 | + |
| 288 | +- [Mitchell Hashimoto — My AI Adoption Journey](https://mitchellh.com/writing/my-ai-adoption-journey) |
| 289 | +- [Ghostty AGENTS.md](https://github.com/ghostty-org/ghostty/blob/main/AGENTS.md) |
| 290 | +- [Anthropic — Effective Harnesses for Long-Running Agents](https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents) |
| 291 | +- [OpenAI — Harness Engineering](https://openai.com/index/harness-engineering/) |
| 292 | +- [INNOQ — From Vibe Coder to Code Owner](https://www.innoq.com/en/blog/2026/02/from-vibe-coder-to-code-owner/) |
| 293 | +- [HumanLayer — Skill Issue: Harness Engineering for Coding Agents](https://www.humanlayer.dev/blog/skill-issue-harness-engineering-for-coding-agents) |
| 294 | +- [Martin Fowler — Harness Engineering](https://martinfowler.com/articles/exploring-gen-ai/harness-engineering.html) |