Add ClawBench to Evals & Verification by reacher-z · Pull Request #34 · ai-boost/awesome-harness-engineering

reacher-z · 2026-05-21T00:04:21Z

Adds ClawBench to Evals & Verification, right next to Claw-Eval (a different project with an overlapping name and complementary scope).

ClawBench is a live-website browser-agent benchmark with two-stage scoring:

Stage 1: deterministic HTTP-request interception, matched against the per-task URL/method schema
Stage 2: LLM judge on the intercepted payload

It catches "right endpoint, wrong payload" errors that screenshot-judge benchmarks miss, and runs on real Uber Eats / Indeed / Craigslist / etc. (not Docker sandboxes).
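
The two-stage flow above can be sketched roughly as follows. This is only an illustration under stated assumptions, not ClawBench's actual code: `InterceptedRequest`, `TaskSchema`, `stage1_match`, and `score_task` are hypothetical names, the URL schema is assumed to be a regex, and the LLM judge is replaced by a plain callable so the example runs offline.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class InterceptedRequest:
    """One HTTP request captured while the agent browsed (hypothetical shape)."""
    method: str
    url: str
    payload: dict


@dataclass
class TaskSchema:
    """Per-task expectation: HTTP method plus a URL regex (an assumption)."""
    method: str
    url_pattern: str


def stage1_match(requests: Sequence[InterceptedRequest],
                 schema: TaskSchema) -> Optional[InterceptedRequest]:
    """Stage 1: deterministic check. Return the first request whose method
    and URL match the task schema, or None if the agent never hit it."""
    pattern = re.compile(schema.url_pattern)
    for req in requests:
        if req.method.upper() == schema.method.upper() and pattern.search(req.url):
            return req
    return None


def score_task(requests: Sequence[InterceptedRequest],
               schema: TaskSchema,
               judge: Callable[[dict], bool]) -> bool:
    """Stage 2 runs only if stage 1 found a matching request.
    `judge` stands in for the LLM judge and sees only the payload."""
    hit = stage1_match(requests, schema)
    if hit is None:
        return False
    return judge(hit.payload)


# Hypothetical task: order exactly one pizza.
schema = TaskSchema(method="POST", url_pattern=r"^https://example\.com/api/orders$")
reqs = [
    InterceptedRequest("GET", "https://example.com/menu", {}),
    InterceptedRequest("POST", "https://example.com/api/orders",
                       {"item": "pizza", "qty": 2}),
]


def judge(payload: dict) -> bool:
    # Stand-in for an LLM verdict on the intercepted payload.
    return payload.get("item") == "pizza" and payload.get("qty") == 1


# Right endpoint, wrong payload: stage 1 passes, stage 2 rejects.
result = score_task(reqs, schema, judge)
```

In this sketch the agent reaches the expected endpoint, so a check that only looks at the final page could plausibly pass it, while the payload check fails it because the quantity is wrong.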

arXiv: https://arxiv.org/abs/2604.08523
283 tasks (V1 153 + V2 130) across 163 live platforms · 15 life categories
Code: https://github.com/reacher-z/ClawBench · Live leaderboard: https://claw-bench.com

Sits next to Claw-Eval / SWE-bench / Inspect AI / tau-bench in the section. The focus differs from Claw-Eval: we're live-web-specific with payload interception, while they measure general agent capability with Pass^3. Happy to add a disambiguation note in the entry if helpful.

Affiliation: I'm one of the maintainers.

Add ClawBench to Evals & Verification

be05c5a

reacher-z wants to merge 1 commit into ai-boost:main from reacher-z:add-clawbench
