Evaluation: build benchmark fixtures for Search, Read, Context, and learning regressions #43

@willwashburn

Problem

Token reduction work can regress quietly. A change that shrinks output may remove essential context; a ranking tweak may reduce average bytes while making exploration worse. Wash needs repeatable fixtures that measure both output size and proxies for usefulness.

Goal

Add an evaluation harness that runs fixed tasks against fixture corpora and reports output size, call count, hit quality, and regression status.

Fixture tasks

Cover at least:

  • Find and read a target function.
  • Explore a subsystem across several files.
  • Make a multi-edit refactor.
  • Diagnose a build error.
  • Diagnose a failing test.
  • Summarize git changes.
  • Search a large codebase with many noisy matches.
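
One way to keep the tasks above declarative is to describe each one as a data record. The following is a minimal sketch, not an existing Wash format: the field names (`id`, `corpus`, `prompt`, `expected_files`, `expected_line_ranges`, `fast`), the corpus names, and the file paths are all hypothetical.

```python
import json
from dataclasses import dataclass, field

# Hypothetical fixture schema. In practice this would live in a JSON file
# next to the corpus; it is inlined here so the sketch is self-contained.
FIXTURE_JSON = """
[
  {
    "id": "find-and-read-function",
    "corpus": "example-corpus",
    "prompt": "Find and read the target function.",
    "expected_files": ["src/parser.ts"],
    "expected_line_ranges": {"src/parser.ts": [[40, 72]]},
    "fast": true
  },
  {
    "id": "noisy-search",
    "corpus": "example-large-corpus",
    "prompt": "Search a large codebase with many noisy matches.",
    "expected_files": ["src/index.ts"],
    "fast": false
  }
]
"""


@dataclass
class FixtureTask:
    id: str
    corpus: str
    prompt: str
    expected_files: list
    expected_line_ranges: dict = field(default_factory=dict)
    fast: bool = False  # True if the task belongs to the quick PR subset


def load_tasks(text):
    """Parse fixture definitions from JSON text into FixtureTask records."""
    return [FixtureTask(**raw) for raw in json.loads(text)]


def fast_subset(tasks):
    """Return only the tasks flagged for the fast run."""
    return [task for task in tasks if task.fast]


tasks = load_tasks(FIXTURE_JSON)
print([task.id for task in fast_subset(tasks)])
```

Marking the fast subset in the data itself means the PR run and the full local run can share one loader and differ only in a filter.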

Metrics

  • Tool calls per task.
  • Result bytes per tool and total.
  • Estimated tokens per tool and total.
  • Whether expected files appear in top results.
  • Whether expected line ranges appear.
  • Whether caps were hit.
  • Whether repeated same-arg calls occurred.
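
Most of these metrics can be derived from a recorded list of tool calls. The sketch below assumes a hypothetical record shape (`tool`, `args`, `result`, an optional ranked `files` list, an optional `capped` flag) and a rough 4-bytes-per-token estimate; neither is taken from Wash itself. The line-range check is left out for brevity.

```python
import json
from collections import Counter, defaultdict


def estimate_tokens(num_bytes):
    # Assumption: roughly 4 bytes per token. Swap in a real tokenizer if needed.
    return (num_bytes + 3) // 4


def summarize_calls(calls, expected_files, top_k=5):
    """Aggregate size and hit-quality metrics for one task run."""
    bytes_per_tool = defaultdict(int)
    seen = Counter()  # (tool, canonical args) -> number of calls
    found = set()
    caps_hit = 0
    for call in calls:
        bytes_per_tool[call["tool"]] += len(call["result"].encode("utf-8"))
        # Canonicalize args so key order does not hide a repeated call.
        seen[(call["tool"], json.dumps(call["args"], sort_keys=True))] += 1
        top = call.get("files", [])[:top_k]
        found.update(f for f in top if f in expected_files)
        caps_hit += bool(call.get("capped"))
    total = sum(bytes_per_tool.values())
    return {
        "tool_calls": len(calls),
        "bytes_per_tool": dict(bytes_per_tool),
        "total_bytes": total,
        "tokens_per_tool": {t: estimate_tokens(b) for t, b in bytes_per_tool.items()},
        "estimated_tokens": estimate_tokens(total),
        "expected_files_found": sorted(found),
        "expected_files_missing": sorted(set(expected_files) - found),
        "caps_hit": caps_hit,
        "repeated_calls": sum(n - 1 for n in seen.values() if n > 1),
    }


# Hypothetical recorded calls for a single task.
calls = [
    {"tool": "Search", "args": {"q": "parse"}, "result": "a" * 10,
     "files": ["a.ts", "b.ts"], "capped": True},
    {"tool": "Search", "args": {"q": "parse"}, "result": "a" * 10,
     "files": ["a.ts", "b.ts"]},
    {"tool": "Read", "args": {"path": "a.ts"}, "result": "b" * 6},
]
summary = summarize_calls(calls, expected_files=["a.ts", "c.ts"])
print(summary)
```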

Acceptance criteria

  • Add a command or script that runs all benchmark tasks locally.
  • CI can run a fast subset.
  • Regression output clearly identifies which task changed.
  • Fixture expectations are stored as data, not hard-coded in test logic.
  • Evaluation can compare two runs and print deltas.
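
The comparison step could look something like the sketch below. It assumes each run is a mapping from task id to a metrics dict; the metric names, the 5% tolerance, and the status labels are illustrative choices, not existing behavior. A task is flagged if a size metric grows beyond the tolerance or if an expected file that was found before is no longer found, so a change that shrinks output but loses hits still shows up.

```python
def compare_runs(baseline, candidate,
                 metrics=("tool_calls", "total_bytes", "estimated_tokens"),
                 tolerance=0.05):
    """Return (task_id, metric, delta, status) rows for two runs."""
    rows = []
    for task_id in sorted(set(baseline) | set(candidate)):
        old, new = baseline.get(task_id), candidate.get(task_id)
        if old is None or new is None:
            rows.append((task_id, "task", None, "ADDED" if old is None else "REMOVED"))
            continue
        for metric in metrics:
            delta = new[metric] - old[metric]
            # Growth beyond the tolerance counts as a regression.
            status = "REGRESSED" if delta > old[metric] * tolerance else "ok"
            rows.append((task_id, metric, delta, status))
        # Losing a previously found expected file is a regression at any size.
        lost = set(old.get("expected_files_found", [])) - set(new.get("expected_files_found", []))
        if lost:
            rows.append((task_id, "expected_files_found", -len(lost), "REGRESSED"))
    return rows


def format_report(rows):
    """Render one line per (task, metric) so the changed task is easy to spot."""
    lines = []
    for task_id, metric, delta, status in rows:
        shown = "n/a" if delta is None else f"{delta:+}"
        lines.append(f"{task_id:<28} {metric:<22} {shown:>8}  {status}")
    return "\n".join(lines)


# Hypothetical metrics from two runs of the same task.
baseline = {"find-fn": {"tool_calls": 3, "total_bytes": 1000,
                        "estimated_tokens": 250, "expected_files_found": ["a.ts"]}}
candidate = {"find-fn": {"tool_calls": 3, "total_bytes": 1200,
                         "estimated_tokens": 300, "expected_files_found": []}}
rows = compare_runs(baseline, candidate)
print(format_report(rows))
```

A CI job could exit non-zero whenever any row has status `REGRESSED`, while the local command prints the full table.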

Implementation notes

Reuse existing fixtures/corpus where possible. Extend the ideas in legacy-ts/scripts/burn-compare.js rather than replacing them. Keep the fast subset small enough to run in PR checks.
