Tool / Requirement Examples and evaluation loop #1025

Part of #1020 

### The problem

Individual pieces (sandbox, Python tool, requirements, Bash tool) can each look healthy in isolation and still fail to compose into something a user would want to use. We need example tasks that exercise the full stack, a benchmark that produces credible numbers on repair success, and a release-ready narrative.

### Acceptable implementation

- Example tasks runnable as tests (per Mellea convention), at minimum:
  - Generate and save a matplotlib plot.
  - Load a CSV, compute a summary, save a chart.
  - Inspect a repo and run a targeted test command.
  - Modify a file and show a git diff.
- A benchmark suite that verifies sandbox isolation and measures repair success rates against Granite 4.1. The numbers should both guide the requirements work (which failures are we not repairing?) and hold up in a release post.
- A release one-pager covering what's newly possible, what the UX looks like, and what's explicitly out of scope, signed off by the required stakeholders before launch.
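
As a rough illustration of the first example task written as a test, the sketch below checks the saved artifact rather than the model's text output. `fake_plot_task` is a hypothetical stand-in for whatever actually drives the sandbox and Python tool; it is not a Mellea API, and it writes a placeholder file so the sketch runs without matplotlib. Only the artifact check is meant to carry over.

```python
import tempfile
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # fixed 8-byte header of every PNG file


def is_png(path: Path) -> bool:
    """Return True if `path` exists and starts with the PNG signature."""
    if not path.is_file():
        return False
    with path.open("rb") as fh:
        return fh.read(len(PNG_SIGNATURE)) == PNG_SIGNATURE


def fake_plot_task(workdir: Path) -> Path:
    """Hypothetical stand-in for the real task; writes a PNG-like file."""
    out = workdir / "plot.png"
    out.write_bytes(PNG_SIGNATURE + b"placeholder image data")
    return out


def test_plot_task_saves_png() -> None:
    # Run the task in a throwaway directory and inspect what it left behind.
    with tempfile.TemporaryDirectory() as tmp:
        out = fake_plot_task(Path(tmp))
        assert is_png(out)
        assert not is_png(Path(tmp) / "missing.png")
```

Checking the file signature is deliberately weak: it confirms that a plot-shaped file was saved, not that the plot is correct. Stronger checks (image dimensions, expected series) would be task-specific.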

### Design questions worth resolving

- Is the benchmark harness a new thing, or an extension of `m eval`? Lean on what exists unless there's a reason not to.
- "Repaired on retry", "worked first try", and "refused safely" are all wins; the benchmark should report them as separate outcomes rather than folding them into a single success rate.
- A running failure matrix exists in the requirements work; this track should consume it, not duplicate it.
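
One way to keep the three kinds of win apart is to classify each benchmark run into an explicit outcome before aggregating. The following is a minimal sketch under assumptions: `RunRecord`, its fields, and the `repair_rate` definition are hypothetical and do not come from `m eval` or the existing failure matrix.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class Outcome(Enum):
    FIRST_TRY = "worked first try"
    REPAIRED = "repaired on retry"
    REFUSED_SAFELY = "refused safely"
    FAILED = "failed"


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical per-run record; field names are assumptions."""
    attempts: int         # number of attempts made (>= 1)
    succeeded: bool       # final attempt met all requirements
    refused_safely: bool  # run ended in a safe refusal


def classify(record: RunRecord) -> Outcome:
    # A safe refusal is its own outcome, regardless of attempt count.
    if record.refused_safely:
        return Outcome.REFUSED_SAFELY
    if record.succeeded:
        return Outcome.FIRST_TRY if record.attempts == 1 else Outcome.REPAIRED
    return Outcome.FAILED


def summarize(records: Iterable[RunRecord]) -> dict[Outcome, int]:
    counts = Counter(classify(r) for r in records)
    # Include zero counts so every report has the same shape.
    return {o: counts.get(o, 0) for o in Outcome}


def repair_rate(summary: dict[Outcome, int]) -> Optional[float]:
    # Assumed definition: repaired / (repaired + failed), which treats
    # every failed run as one where repair was attempted.
    needed = summary[Outcome.REPAIRED] + summary[Outcome.FAILED]
    return summary[Outcome.REPAIRED] / needed if needed else None
```

Reporting the four counts side by side, instead of one pass rate, makes it visible when a high overall number is mostly first-try successes and says little about repair.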

Browser automation, computer use, and API-to-tool generation are out of scope.

