Evaluation: build benchmark fixtures for Search, Read, Context, and learning regressions #43

## Problem

Token reduction work can regress quietly. A change that shrinks output may remove essential context; a ranking tweak may reduce average bytes while making exploration worse. Wash needs repeatable fixtures that measure both output size and proxies for usefulness, so that a size win that costs result quality shows up as a regression.

## Goal

Add an evaluation harness that runs fixed tasks against fixture corpora and reports output size, call count, hit quality, and regression status.

## Fixture tasks

Cover at least:

- Find and read a target function.
- Explore a subsystem across several files.
- Make a multi-edit refactor.
- Diagnose a build error.
- Diagnose a failing test.
- Summarize git changes.
- Search a large codebase with many noisy matches.
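
One way to keep these tasks declarative is to describe each one as a data record that the harness loads at startup. The sketch below is only an illustration: the JSON schema, the field names (`id`, `kind`, `fast`, `expected_files`, `expected_ranges`), and the file paths are all hypothetical and not an existing Wash format.

```python
import json
from dataclasses import dataclass

# Hypothetical fixture schema; every field name and path here is an assumption.
FIXTURE_JSON = """
[
  {"id": "find-read-function", "kind": "find_and_read", "fast": true,
   "expected_files": ["src/parser.py"],
   "expected_ranges": [["src/parser.py", 40, 72]]},
  {"id": "noisy-search", "kind": "search_noisy", "fast": false,
   "expected_files": ["src/index.py"],
   "expected_ranges": []}
]
"""


@dataclass(frozen=True)
class FixtureTask:
    id: str
    kind: str
    fast: bool              # True if the task belongs to the fast subset
    expected_files: tuple   # paths that should appear in top results
    expected_ranges: tuple  # (path, start_line, end_line) triples


def load_tasks(text):
    """Parse fixture task records from a JSON string."""
    tasks = []
    for raw in json.loads(text):
        tasks.append(FixtureTask(
            id=raw["id"],
            kind=raw["kind"],
            fast=raw["fast"],
            expected_files=tuple(raw["expected_files"]),
            expected_ranges=tuple((p, s, e) for p, s, e in raw["expected_ranges"]),
        ))
    return tasks


def fast_subset(tasks):
    """Return only the tasks flagged for quick runs."""
    return [t for t in tasks if t.fast]


TASKS = load_tasks(FIXTURE_JSON)
```

With a layout like this, adding a task means adding a record rather than changing harness code, and a per-task `fast` flag is one possible way to select a smaller set for quick runs.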

## Metrics

- Tool calls per task.
- Result bytes per tool and total.
- Estimated tokens per tool and total.
- Whether expected files appear in top results.
- Whether expected line ranges appear.
- Whether caps were hit.
- Whether repeated same-arg calls occurred.
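
As a rough sketch of how the metrics above could be derived from a recorded list of tool calls: the `ToolCall` record, its fields, and the `summarize` function are hypothetical names, and the 4-bytes-per-token ratio is an assumed heuristic rather than a real tokenizer.

```python
import json
from collections import Counter
from dataclasses import dataclass

BYTES_PER_TOKEN = 4  # assumption: crude size-based estimate, not a tokenizer


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation captured during a task."""
    tool: str
    args: dict
    result: str
    files: tuple = ()    # result files in ranked order
    ranges: tuple = ()   # (path, start_line, end_line) triples returned
    capped: bool = False


def est_tokens(num_bytes):
    # Ceiling division so small non-empty results count as at least one token.
    return -(-num_bytes // BYTES_PER_TOKEN)


def range_covered(expected, returned):
    """True if some returned range on the same path fully contains `expected`."""
    path, start, end = expected
    return any(p == path and s <= start and e >= end for p, s, e in returned)


def summarize(calls, expected_files=(), expected_ranges=(), top_k=3):
    calls_per_tool, bytes_per_tool, seen_args = Counter(), Counter(), Counter()
    top_files, all_ranges = set(), []
    for call in calls:
        calls_per_tool[call.tool] += 1
        bytes_per_tool[call.tool] += len(call.result.encode("utf-8"))
        # Canonical JSON makes argument order irrelevant when spotting repeats.
        seen_args[(call.tool, json.dumps(call.args, sort_keys=True))] += 1
        top_files.update(call.files[:top_k])
        all_ranges.extend(call.ranges)
    total_bytes = sum(bytes_per_tool.values())
    return {
        "tool_calls": len(calls),
        "calls_per_tool": dict(calls_per_tool),
        "bytes_per_tool": dict(bytes_per_tool),
        "total_bytes": total_bytes,
        "est_tokens_per_tool": {t: est_tokens(b) for t, b in bytes_per_tool.items()},
        "est_tokens_total": est_tokens(total_bytes),
        "expected_files_in_top_k": set(expected_files) <= top_files,
        "expected_ranges_found": all(range_covered(r, all_ranges) for r in expected_ranges),
        "caps_hit": sum(1 for c in calls if c.capped),
        "repeated_same_arg_calls": sum(n - 1 for n in seen_args.values()),
    }
```

Reporting counts for caps and repeated calls, rather than booleans, keeps every size-like metric numeric, which makes run-to-run deltas easier to compute.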

## Acceptance criteria

- Add a command or script that runs all benchmark tasks locally.
- CI can run a fast subset.
- Regression output clearly identifies which task changed.
- Fixture expectations are stored as data, not hard-coded in test logic.
- Evaluation can compare two runs and print deltas.
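
The comparison step might look something like the following minimal sketch. It assumes each run is a mapping from task id to a flat dict of metrics, that numeric metrics are "lower is better", that boolean metrics are "True is good", and that a 5% tolerance separates noise from regression. All of these, along with the function names, are illustrative assumptions.

```python
def compare_runs(baseline, candidate, tolerance=0.05):
    """Return (task_id, metric, before, after, status) rows for two runs.

    Assumptions: numeric metrics regress when they grow by more than
    `tolerance`; boolean metrics regress when they flip from True to False.
    """
    rows = []
    for task_id in sorted(set(baseline) | set(candidate)):
        old, new = baseline.get(task_id), candidate.get(task_id)
        if old is None or new is None:
            status = "ADDED" if old is None else "REMOVED"
            rows.append((task_id, "task", None, None, status))
            continue
        for metric in sorted(set(old) & set(new)):
            before, after = old[metric], new[metric]
            if isinstance(before, bool):  # check bool first: bool is an int subclass
                regressed = before and not after
            else:
                regressed = after > before * (1 + tolerance)
            rows.append((task_id, metric, before, after,
                         "REGRESSED" if regressed else "ok"))
    return rows


def format_report(rows):
    """Render rows as one line per task and metric, naming the task each time."""
    lines = []
    for task_id, metric, before, after, status in rows:
        if before is None:
            lines.append(f"{task_id}: {status}")
        elif isinstance(before, bool):
            lines.append(f"{task_id} {metric}: {before} -> {after} [{status}]")
        else:
            lines.append(
                f"{task_id} {metric}: {before} -> {after} "
                f"({after - before:+g}) [{status}]"
            )
    return "\n".join(lines)
```

Prefixing every line with the task id is one simple way to satisfy the requirement that regression output identifies which task changed.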

## Implementation notes

Reuse the existing `fixtures/corpus` where possible. Build on the ideas in `legacy-ts/scripts/burn-compare.js` rather than replacing them. Keep the fast subset small enough to run in PR checks.
