Part of #1020
The problem
Individual pieces (sandbox, Python tool, requirements, Bash tool) can each look healthy in isolation and still fail to compose into something a user would want to use. We need example tasks that exercise the full stack, a benchmark that produces credible numbers on repair success, and a release-ready narrative.
Acceptable implementation
- Example tasks runnable as tests (per Mellea convention), at minimum:
  - Generate and save a matplotlib plot.
  - Load a CSV, compute a summary, save a chart.
  - Inspect a repo and run a targeted test command.
  - Modify a file and show a git diff.
- A benchmark suite that verifies sandbox isolation and measures repair success rates against Granite 4.1. The numbers should both guide the requirements work (which failures are we not repairing?) and hold up in a release post.
- A release one-pager covering what's newly possible, what the UX looks like, and what's explicitly out of scope, signed off by the required approvers before launch.
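
To make "runnable as tests" concrete for the plot task, here is a minimal sketch of the artifact check such a test could end with. It deliberately does not call any Mellea API: `png_dimensions` and `check_plot_artifact` are hypothetical helper names, and the sketch assumes the task was asked to write a PNG file to a known path. It only confirms that the file exists and has a well-formed PNG header, not that the chart is correct.

```python
import struct
from pathlib import Path

# Fixed 8-byte signature that starts every PNG file.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def png_dimensions(path):
    """Return (width, height) read from the PNG header, or None if not a PNG."""
    data = Path(path).read_bytes()[:24]
    # Layout: 8-byte signature, 4-byte chunk length, b"IHDR", then width and height.
    if len(data) < 24 or data[:8] != PNG_MAGIC or data[12:16] != b"IHDR":
        return None
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def check_plot_artifact(path, min_width=1, min_height=1):
    """Hypothetical pass/fail check for the 'generate and save a plot' task."""
    p = Path(path)
    if not p.is_file():
        return False
    dims = png_dimensions(p)
    return dims is not None and dims[0] >= min_width and dims[1] >= min_height
```

A test for the plot task would run the task inside the sandbox and then assert `check_plot_artifact(output_path)`. The CSV and git-diff tasks would need analogous checks on their own artifacts.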
Design questions worth resolving
- Is the benchmark harness a new thing, or an extension of `m eval`? Lean on what exists unless there's a reason not to.
- "Repaired on retry", "worked first try", and "refused safely" are all wins; the benchmark framing should distinguish them.
- A running failure matrix exists in the requirements work; this track should consume it, not duplicate it.
Browser automation, computer use, and API-to-tool generation are out of scope.
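
One way to keep "worked first try", "repaired on retry", and "refused safely" separate in the benchmark output is to classify each run before aggregating. The sketch below is only an illustration under stated assumptions: `RunRecord`, `Outcome`, `classify`, and `summarize` are hypothetical names, not part of `m eval` or any existing harness, and `refused` is assumed to mean a refusal the harness has already judged appropriate.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    FIRST_TRY = "worked first try"
    REPAIRED = "repaired on retry"
    SAFE_REFUSAL = "refused safely"
    FAILED = "failed"


@dataclass
class RunRecord:
    # Hypothetical record shape; a real harness would likely carry more fields.
    task: str
    attempts: int    # total attempts made, including the first
    succeeded: bool  # final attempt met all requirements
    refused: bool    # run ended in a refusal judged to be appropriate


def classify(record):
    """Map one run to exactly one outcome category."""
    if record.refused:
        return Outcome.SAFE_REFUSAL
    if record.succeeded:
        return Outcome.FIRST_TRY if record.attempts == 1 else Outcome.REPAIRED
    return Outcome.FAILED


def summarize(records):
    """Count outcomes and compute the repair rate among runs that needed a retry."""
    counts = Counter(classify(r) for r in records)
    needed_repair = counts[Outcome.REPAIRED] + counts[Outcome.FAILED]
    repair_rate = counts[Outcome.REPAIRED] / needed_repair if needed_repair else None
    return {
        "counts": {o.value: counts[o] for o in Outcome},
        "repair_rate": repair_rate,
    }
```

Computing the repair rate only over runs that did not succeed on the first attempt keeps easy tasks from inflating it. Whether that is the right denominator for the release post is a framing decision this sketch does not settle.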