Landing Copy

Hero

Title

Harness Engineering for Coding Agents

One-line verdict

Your model is not the bottleneck. Your harness is.

Side code block

if your agent only works when AGENTS.md grows:
  you do not have alignment
  you have prompt debt

fix:
  smaller root router
  harder verification gates
  repo-owned evals

Subhead

Stop shipping giant AGENTS files and calling it alignment. This Codex skill turns repo chaos into an enforceable operating model: short root router, hard proof, migration governance, and review-loop measurement grounded in real repo history.

CTA labels

  • Read the skill
  • See the benchmark
  • Read the release page

Proof Strip

  • OpenClaw benchmark: 92.0 review-loop vs 47.3 single-pass
  • Root router reduced from 229 lines to 66
  • Migration verification now runs cross-platform; remote D1 confirms that 0091 and 0135 are historical registry exceptions
  • Optional CI guard can verify remote registry state when Cloudflare secrets are present

Five Breakdown Modes

1. Context Debt

The root context stops routing and starts narrating everything.

2. Verification Theater

The diff looks good, but nothing meaningful actually ran.
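
One way to make this failure mode harder to hide is a gate that passes only when real commands ran and exited cleanly. The following is a minimal sketch under stated assumptions, not the skill's actual implementation: `run_gate` is a hypothetical helper, and the command in the usage section is a placeholder for a repo's own test and lint commands.

```python
import subprocess
import sys
from typing import List, Sequence, Tuple


def run_gate(commands: Sequence[Sequence[str]]) -> Tuple[bool, List[str]]:
    """Run each command; pass only if at least one ran and all exited 0."""
    if not commands:
        # An empty gate proves nothing, so treat it as a failure.
        return False, ["no verification commands configured"]
    failures: List[str] = []
    for cmd in commands:
        result = subprocess.run(list(cmd), capture_output=True, text=True)
        if result.returncode != 0:
            failures.append(f"{' '.join(cmd)} exited {result.returncode}")
    return not failures, failures


if __name__ == "__main__":
    # Placeholder command (assumption); substitute the repo's real commands.
    ok, problems = run_gate([[sys.executable, "-c", "print('proof ran')"]])
    print("PASS" if ok else "FAIL", problems)
```

Treating an empty command list as a failure is a deliberate choice in this sketch: a gate with nothing configured should not be able to report success.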

3. Review Collapse

Humans stop reviewing code and start reviewing confidence.

4. Migration Drift

SQL files, remote registry, docs, and prod truth stop matching.
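
Drift of this kind can be detected by comparing what is on disk with what the remote registry reports. The sketch below assumes migrations are identified by short string IDs and that both lists are already available; `migration_drift` and the sample IDs are hypothetical and do not describe any real registry.

```python
from typing import Dict, Iterable, Set


def migration_drift(
    local_files: Iterable[str],
    remote_applied: Iterable[str],
    documented_exceptions: Iterable[str] = (),
) -> Dict[str, Set[str]]:
    """Compare migration IDs on disk with IDs the remote registry reports."""
    local = set(local_files)
    remote = set(remote_applied)
    allowed = set(documented_exceptions)
    return {
        # Present on disk but not recorded remotely.
        "missing_remote": (local - remote) - allowed,
        # Recorded remotely but with no file on disk.
        "missing_local": (remote - local) - allowed,
    }


if __name__ == "__main__":
    # Illustrative IDs only (assumption).
    drift = migration_drift(
        local_files={"0001", "0002", "0003"},
        remote_applied={"0001", "0003", "0004"},
        documented_exceptions={"0002"},
    )
    print(drift)
```

Passing documented exceptions explicitly keeps known historical gaps visible in the call site instead of silently ignoring them.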

5. Benchmark Cosplay

Teams compare models on public toy tasks while their own repo keeps failing locally.

Seven Mandatory Checks

  1. Root context acts as a router
  2. Real commands are explicit
  3. One truth source is named
  4. Legacy paths are classified
  5. Proof actually runs
  6. A reusable guardrail is encoded
  7. The next run inherits less ambiguity
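
The first check can be made mechanical with a line budget on the root context file. This is a rough sketch, assuming the root router lives in `AGENTS.md` at the repo root; `router_within_budget` and the budget of 80 lines are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def router_within_budget(path: Path, max_lines: int) -> bool:
    """Return True when the root context file has at most max_lines lines."""
    text = path.read_text(encoding="utf-8")
    return len(text.splitlines()) <= max_lines


if __name__ == "__main__":
    root = Path("AGENTS.md")  # assumed location of the root router
    budget = 80  # hypothetical budget, not a value from the skill
    if root.exists():
        print("within budget" if router_within_budget(root, budget) else "over budget")
    else:
        print("AGENTS.md not found")
```

A line count is a crude proxy: it cannot tell whether the file routes or narrates, but it does make growth visible in review.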

Scoreboard Block

Section title

Measured on a real repo, not benchmark theater

Metrics

  • 92.0 historical-review-loop
  • 47.3 historical-single-pass
  • 229 -> 66 lines in the root router
  • 0 critical migration-audit findings after documented exceptions and gaps

Short Body Copy

Most teams still respond to agent failure with more prompt.

That is not a scaling strategy.

The durable pattern is:

  • smaller root
  • harder proof
  • narrower autonomy
  • repo-specific evals

This skill packages that pattern into something a repo can actually enforce.

GitHub Description

Harness engineering for coding agents: smaller root routers, hard verification gates, migration governance, and repo-specific evals that make review-loop discipline measurable.

Social Preview Text

If your agent only works when AGENTS.md keeps getting larger, you do not have alignment. You have prompt debt. This skill turns that debt into machinery: smaller root context, explicit repo maps, hard proof, migration governance, and benchmark runs grounded in your own git history.

Launch Post

Most teams do not have a model problem.

They have a harness problem.

When quality drops, they add more words to AGENTS.md, ship larger diffs, and hope review catches it.

We turned the opposite pattern into a Codex skill:

  • root router instead of prompt sprawl
  • hard proof instead of diff vibes
  • repo-specific evals instead of benchmark theater
  • migration governance instead of production folklore

On OpenClaw's first historical benchmark:

  • review-loop governance: 92.0
  • single-pass shipping: 47.3

That gap is the product.

HN Titles

  • Show HN: Your model is not the bottleneck. Your harness is.
  • Show HN: Harness Engineering for Coding Agents
  • Show HN: We benchmarked review-loop governance vs single-pass agent shipping on a real repo

Reddit Titles

  • Your model is not the bottleneck. Your harness is. I turned that into a Codex skill.
  • I built a Codex skill for harness engineering and benchmarked it on real repo history.
  • Short root router plus repo-specific evals beat giant AGENTS files on our coding-agent benchmark.

One-line Taglines

  • Your model is not the bottleneck. Your harness is.
  • Smaller root. Harder proof. Better agents.
  • Stop scaling prompt debt.
  • Measure the harness, not just the model.