Add ClawBench to Evals & Verification #34

Open: reacher-z wants to merge 1 commit into ai-boost:main from reacher-z:add-clawbench


@reacher-z

Adds ClawBench to Evals & Verification, right next to Claw-Eval (a different project with an overlapping name and a complementary scope).

ClawBench is a live-website browser-agent benchmark with two-stage scoring:

  • Stage 1: deterministic HTTP-request interception, matching each intercepted request against the per-task URL/method schema
  • Stage 2: an LLM judge evaluates the intercepted payload

It catches "right endpoint, wrong payload" errors that screenshot-judge benchmarks miss, and it runs against real sites (Uber Eats, Indeed, Craigslist, etc.) rather than Docker sandboxes.
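To make the two-stage idea concrete, here is a minimal sketch of how such scoring could be wired together. It is not ClawBench's actual implementation: the names `InterceptedRequest`, `TaskSchema`, `stage1_matches`, and `score` are hypothetical, the schema is assumed to be an HTTP method plus a regex over the URL path, and the LLM judge is stood in for by a plain callable.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable
from urllib.parse import urlsplit


@dataclass
class InterceptedRequest:
    """One HTTP request captured while the agent drove the browser (hypothetical shape)."""
    method: str
    url: str
    body: dict


@dataclass
class TaskSchema:
    """Assumed per-task schema: expected method and a regex for the URL path."""
    method: str
    path_pattern: str


def stage1_matches(req: InterceptedRequest, schema: TaskSchema) -> bool:
    # Stage 1: deterministic check of method and URL path against the schema.
    path = urlsplit(req.url).path
    same_method = req.method.upper() == schema.method.upper()
    return same_method and re.fullmatch(schema.path_pattern, path) is not None


def score(
    requests: Iterable[InterceptedRequest],
    schema: TaskSchema,
    judge: Callable[[dict], bool],
) -> bool:
    # Stage 2 runs only on requests that passed Stage 1; `judge` stands in
    # for the LLM judge and returns True when the payload fulfils the task.
    candidates = [r for r in requests if stage1_matches(r, schema)]
    return any(judge(r.body) for r in candidates)


if __name__ == "__main__":
    schema = TaskSchema(method="POST", path_pattern=r"/api/orders")
    wrong_payload = InterceptedRequest(
        "POST", "https://example.com/api/orders", {"item": "burger", "qty": 2}
    )
    # Toy judge: the task asked for a pizza order.
    wants_pizza = lambda body: body.get("item") == "pizza"
    print(score([wrong_payload], schema, wants_pizza))
```

In this sketch the request hits the expected endpoint, so Stage 1 passes, but the stand-in judge rejects the payload, which is the "right endpoint, wrong payload" case described above.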

The entry sits next to Claw-Eval, SWE-bench, Inspect AI, and tau-bench in the section. Its focus differs from Claw-Eval's: we're live-web-specific with payload interception, while they measure general agent capability with Pass^3. Happy to add a disambiguation note in the entry if helpful.

Affiliation: I'm one of the maintainers.

