Harness Engineering for Coding Agents
Your model is not the bottleneck. Your harness is.
if your agent only works when AGENTS.md grows:
you do not have alignment
you have prompt debt
fix:
smaller root router
harder verification gates
repo-owned evals
Stop shipping giant AGENTS files and calling it alignment. This Codex skill turns repo chaos into an enforceable operating model: short root router, hard proof, migration governance, and review-loop measurement grounded in real repo history.
- Read the skill
- See the benchmark
- Read the release page
- OpenClaw benchmark: 92.0 review-loop vs 47.3 single-pass
- Root router reduced from 229 lines to 66
- Migration verification now runs cross-platform, and remote D1 confirms 0091/0135 are historical registry exceptions
- Optional CI guard can verify remote registry state when Cloudflare secrets are present
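To make the migration bullets concrete, here is a minimal sketch of a registry audit. It assumes that "historical registry exceptions" means known, documented mismatches between local migration files and the remote registry. The function names, the result shape, and the two environment variable names are illustrative assumptions, not the skill's actual interface, and fetching the remote state from D1 is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditResult:
    unapplied: tuple   # on disk, absent from the registry, not excepted
    orphaned: tuple    # in the registry, absent from disk, not excepted
    excepted: tuple    # mismatches covered by a documented exception

    @property
    def critical(self):
        # Only undocumented mismatches count as critical.
        return len(self.unapplied) + len(self.orphaned)


def audit_migrations(local_ids, registry_ids, exceptions):
    """Compare local migration IDs with registry IDs (hypothetical helper)."""
    local, registry, allowed = set(local_ids), set(registry_ids), set(exceptions)
    mismatched = local ^ registry  # present on exactly one side
    return AuditResult(
        unapplied=tuple(sorted((local - registry) - allowed)),
        orphaned=tuple(sorted((registry - local) - allowed)),
        excepted=tuple(sorted(mismatched & allowed)),
    )


def remote_check_enabled(
    env, required=("CLOUDFLARE_API_TOKEN", "CLOUDFLARE_ACCOUNT_ID")
):
    """Return True only when every required secret is set and non-empty.

    The variable names are an assumption; use whatever your CI provides.
    """
    return all(env.get(name) for name in required)
```

With `exceptions=["0091", "0135"]`, a registry that lacks those two IDs still audits to zero critical findings, while any undocumented mismatch is counted. A CI job could call `remote_check_enabled(os.environ)` first and skip the remote comparison when the secrets are absent, which keeps the guard optional.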
The root context stops routing and starts narrating everything.
The diff looks good, but nothing meaningful actually ran.
Humans stop reviewing code and start reviewing confidence.
SQL files, remote registry, docs, and prod truth stop matching.
Teams compare models on public toy tasks while their own repo keeps failing locally.
- Root context acts as a router
- Real commands are explicit
- One truth source is named
- Legacy paths are classified
- Proof actually runs
- A reusable guardrail is encoded
- The next run inherits less ambiguity
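One way to encode the first few items of this checklist as a reusable guardrail is a small lint over the root file. This is a sketch under stated assumptions: the 80-line budget and the idea of required markers (for example a commands section or a named truth source) are hypothetical choices, not rules taken from the skill.

```python
def check_root_router(text, max_lines=80, required_markers=()):
    """Return a list of problems found in a root router file.

    max_lines and required_markers are illustrative defaults; tune both
    to your own repo.
    """
    problems = []
    lines = text.splitlines()
    if len(lines) > max_lines:
        problems.append(
            f"root router has {len(lines)} lines, budget is {max_lines}"
        )
    lowered = text.lower()
    for marker in required_markers:
        # Case-insensitive substring match keeps the check format-agnostic.
        if marker.lower() not in lowered:
            problems.append(f"missing required marker: {marker!r}")
    return problems
```

Run against the root file in CI, an empty list means the router is still within budget. A 229-line file would fail the default budget and a 66-line file would pass.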
Measured on a real repo, not benchmark theater
- 92.0 historical-review-loop
- 47.3 historical-single-pass
- 229 -> 66 root router shrink
- 0 critical migration audit after documented exceptions and gaps
Most teams still respond to agent failure with more prompt.
That is not a scaling strategy.
The durable pattern is:
- smaller root
- harder proof
- narrower autonomy
- repo-specific evals
This skill packages that pattern into something a repo can actually enforce.
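"Harder proof" can be as simple as a gate that refuses to pass unless named commands actually ran and succeeded. The sketch below is one possible shape, not the skill's implementation. The `run_gate` name and the `(name, argv)` input format are assumptions.

```python
import subprocess
import sys


def run_gate(commands):
    """Run each (name, argv) pair and report whether the gate passed.

    An empty command list fails on purpose: a gate where nothing ran
    is not proof.
    """
    results = []
    for name, argv in commands:
        proc = subprocess.run(argv, capture_output=True, text=True)
        results.append((name, proc.returncode))
    passed = bool(results) and all(code == 0 for _, code in results)
    return passed, results


if __name__ == "__main__":
    # Placeholder command; swap in your repo's real test and lint commands.
    ok, report = run_gate([("smoke", [sys.executable, "-c", "pass"])])
    for name, code in report:
        print(f"{name}: exit {code}")
    sys.exit(0 if ok else 1)
```

Treating "no commands configured" as a failure is a deliberate design choice: it blocks the case where the diff looks good but nothing meaningful ran.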
Harness engineering for coding agents: smaller root routers, hard verification gates, migration governance, and repo-specific evals that make review-loop discipline measurable.
If your agent only works when AGENTS.md keeps getting larger, you do not have alignment. You have prompt debt. This skill turns that debt into machinery: smaller root context, explicit repo maps, hard proof, migration governance, and benchmark runs grounded in your own git history.
Most teams do not have a model problem.
They have a harness problem.
When quality drops, they add more words to AGENTS.md, ship larger diffs, and hope review catches it.
We turned the opposite pattern into a Codex skill:
- root router instead of prompt sprawl
- hard proof instead of diff vibes
- repo-specific evals instead of benchmark theater
- migration governance instead of production folklore
On OpenClaw's first historical benchmark:
- review-loop governance: 92.0
- single-pass shipping: 47.3
That gap is the product.
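For reference, the gap can be computed directly from the two reported scores. The scoring scale is not stated here, so this sketch treats the scores as plain numbers. `score_gap` is a hypothetical helper.

```python
def score_gap(review_loop, single_pass):
    """Return (absolute gap, ratio) between two benchmark scores."""
    gap = round(review_loop - single_pass, 1)
    ratio = round(review_loop / single_pass, 2)
    return gap, ratio


gap, ratio = score_gap(92.0, 47.3)  # gap is 44.7
```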
- Show HN: Your model is not the bottleneck. Your harness is.
- Show HN: Harness Engineering for Coding Agents
- Show HN: We benchmarked review-loop governance vs single-pass agent shipping on a real repo
- Your model is not the bottleneck. Your harness is. I turned that into a Codex skill.
- I built a Codex skill for harness engineering and benchmarked it on real repo history.
- Short root router plus repo-specific evals beat giant AGENTS files on our coding-agent benchmark.
- Your model is not the bottleneck. Your harness is.
- Smaller root. Harder proof. Better agents.
- Stop scaling prompt debt.
- Measure the harness, not just the model.