Problem
Token reduction work can regress quietly. A change that shrinks output may remove essential context; a ranking tweak may reduce average bytes while making exploration worse. Wash needs repeatable fixtures that measure both size and usefulness proxies.
Goal
Add an evaluation harness that runs fixed tasks against fixture corpora and reports output size, call count, hit quality, and regression status.
Fixture tasks
Cover at least:
- Find and read a target function.
- Explore a subsystem across several files.
- Make a multi-edit refactor.
- Diagnose a build error.
- Diagnose a failing test.
- Summarize git changes.
- Search a large codebase with many noisy matches.
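One way a task from the list above could be stored as data is sketched below. The JSON field names, the example prompt, and the `src/config.py` path are all hypothetical and only illustrate the shape; this is not an existing Wash schema.

```python
import json

# Hypothetical fixture record; every field name and value here is illustrative.
FIXTURE_JSON = """
{
  "id": "find-target-function",
  "corpus": "fixtures/corpus",
  "prompt": "Find and read the function parse_config",
  "expect": {
    "files": ["src/config.py"],
    "line_ranges": [{"file": "src/config.py", "start": 10, "end": 42}],
    "top_k": 3
  }
}
"""


def files_in_top_k(expected_files, ranked_results, k):
    """Return True if every expected file appears in the first k results."""
    top = set(ranked_results[:k])
    return all(f in top for f in expected_files)


def range_covered(expected, returned_ranges):
    """Return True if some returned (file, start, end) range fully covers the expected range."""
    return any(
        file == expected["file"]
        and start <= expected["start"]
        and end >= expected["end"]
        for file, start, end in returned_ranges
    )


fixture = json.loads(FIXTURE_JSON)
```

Because the expectations live in the record, adding a task means adding a data file; the two check functions stay unchanged.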
Metrics
- Tool calls per task.
- Result bytes per tool and total.
- Estimated tokens per tool and total.
- Whether expected files appear in top results.
- Whether expected line ranges appear.
- Whether caps were hit.
- Whether repeated same-arg calls occurred.
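The size and call metrics above could be derived from a recorded list of tool calls roughly as follows. This is a minimal sketch: the call record shape (`tool`, `args`, `result`, `capped`) is an assumption, and the token estimate uses a crude bytes-divided-by-four heuristic instead of a real tokenizer.

```python
import json
from collections import Counter, defaultdict

BYTES_PER_TOKEN = 4  # assumption: rough heuristic, not a real tokenizer


def summarize_calls(calls):
    """Aggregate metrics for one task.

    calls: list of dicts with 'tool' (str), 'args' (JSON-serializable),
    'result' (str), and an optional 'capped' (bool). This shape is hypothetical.
    """
    per_tool = defaultdict(lambda: {"calls": 0, "bytes": 0, "est_tokens": 0})
    seen = Counter()
    caps_hit = 0
    for call in calls:
        size = len(call["result"].encode("utf-8"))
        stats = per_tool[call["tool"]]
        stats["calls"] += 1
        stats["bytes"] += size
        stats["est_tokens"] += -(-size // BYTES_PER_TOKEN)  # ceiling division
        # Canonicalize args so that key order does not hide a repeated call.
        seen[(call["tool"], json.dumps(call["args"], sort_keys=True))] += 1
        caps_hit += int(bool(call.get("capped", False)))
    return {
        "per_tool": dict(per_tool),
        "total_calls": len(calls),
        "total_bytes": sum(s["bytes"] for s in per_tool.values()),
        "total_est_tokens": sum(s["est_tokens"] for s in per_tool.values()),
        "caps_hit": caps_hit,
        # Number of calls that duplicated an earlier (tool, args) pair.
        "repeated_calls": sum(n - 1 for n in seen.values() if n > 1),
    }
```

Serializing `args` with sorted keys means two calls count as repeats even when the caller emitted the same arguments in a different order.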
Acceptance criteria
- Add a command or script that runs all benchmark tasks locally.
- CI can run a fast subset.
- Regression output clearly identifies which task changed.
- Fixture expectations are stored as data, not hard-coded in test logic.
- Evaluation can compare two runs and print deltas.
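Comparing two runs and naming the task that changed could look like the sketch below. The per-task summary shape and the regression rule (a lost hit, or byte growth above a 5% tolerance) are assumptions chosen for illustration, not a decided policy.

```python
def compare_runs(baseline, candidate, tolerance=0.05):
    """Compare two runs task by task.

    Each run maps task id -> {'total_bytes': int, 'total_calls': int, 'hit': bool}.
    The shape and the regression rule are hypothetical.
    """
    rows = []
    for task in sorted(set(baseline) | set(candidate)):
        old, new = baseline.get(task), candidate.get(task)
        if old is None or new is None:
            status = "added" if old is None else "removed"
            rows.append({"task": task, "status": status})
            continue
        d_bytes = new["total_bytes"] - old["total_bytes"]
        d_calls = new["total_calls"] - old["total_calls"]
        lost_hit = old["hit"] and not new["hit"]
        grew = d_bytes > old["total_bytes"] * tolerance
        rows.append({
            "task": task,
            "status": "regressed" if (lost_hit or grew) else "ok",
            "d_bytes": d_bytes,
            "d_calls": d_calls,
            "lost_hit": lost_hit,
        })
    return rows


def format_report(rows):
    """Render one line per task so a regression names the task that changed."""
    lines = []
    for row in rows:
        if "d_bytes" not in row:
            lines.append(f"{row['task']}: {row['status']}")
            continue
        line = (f"{row['task']}: {row['status']} "
                f"bytes {row['d_bytes']:+d} calls {row['d_calls']:+d}")
        if row["lost_hit"]:
            line += " (expected hit lost)"
        lines.append(line)
    return lines
```

A CI wrapper could print these lines and exit non-zero when any row has status `regressed`. Note that a run which shrinks output but loses an expected hit is still flagged, which is the quiet regression this harness is meant to catch.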
Implementation notes
Reuse the existing fixtures/corpus where possible. Extend the ideas in legacy-ts/scripts/burn-compare.js rather than replacing them. Keep the fast subset small enough to run in PR checks.