Eval: add batch runner and aggregate reporting #52

@olaservo

Problem

There's no way to run all 7 eval tasks at once. Each must be run individually, and there's no aggregate pass/fail summary across tasks.

Currently available:

  • npm run eval:greeting (individual)
  • npm run eval:code-style (individual)
  • etc.

Missing:

  • npm run eval:all — run all tasks, produce a summary table
  • Aggregate result file combining all task outcomes

Suggestion

Add a batch runner that:

  1. Discovers all task configs from evals/tasks/*.json
  2. Runs each sequentially (or with configurable parallelism)
  3. Produces a summary table (task, mode, pass/fail, cost, duration)
  4. Saves an aggregate result JSON
  5. Exits with a non-zero exit code if any task fails

Could be a new evals/eval-all.ts or a --all flag on the existing harness.
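
A rough sketch of what the five steps above might look like, to make the proposal concrete. This is not the existing harness: the `TaskResult` shape, the `RunTask` callback, and all function names here are hypothetical, and the real single-task entry point in evals/eval.ts may expose something quite different. Parallelism is left out; tasks run sequentially.

```typescript
import { readdirSync, writeFileSync } from "node:fs";
import { basename, join } from "node:path";

// Hypothetical result shape; the real harness in evals/eval.ts may differ.
interface TaskResult {
  task: string;
  mode: string;
  passed: boolean;
  costUsd: number;
  durationMs: number;
}

// Hypothetical callback that runs a single task config and reports its outcome.
type RunTask = (configPath: string) => Promise<TaskResult>;

// Step 1: discover *.json task configs in a directory, sorted for a stable order.
function discoverTasks(dir: string): string[] {
  return readdirSync(dir)
    .filter((name) => name.endsWith(".json"))
    .sort()
    .map((name) => join(dir, name));
}

// Step 2: run tasks one at a time; a thrown error is recorded as a failed task.
async function runAll(configs: string[], runTask: RunTask): Promise<TaskResult[]> {
  const results: TaskResult[] = [];
  for (const config of configs) {
    const started = Date.now();
    try {
      results.push(await runTask(config));
    } catch {
      results.push({
        task: basename(config, ".json"),
        mode: "unknown",
        passed: false,
        costUsd: 0,
        durationMs: Date.now() - started,
      });
    }
  }
  return results;
}

// Step 3: plain-text summary table (task, mode, pass/fail, cost, duration).
function formatSummary(results: TaskResult[]): string {
  const header = ["task", "mode", "result", "cost", "duration"];
  const rows = results.map((r) => [
    r.task,
    r.mode,
    r.passed ? "PASS" : "FAIL",
    `$${r.costUsd.toFixed(4)}`,
    `${(r.durationMs / 1000).toFixed(1)}s`,
  ]);
  const widths = header.map((h, i) =>
    Math.max(h.length, ...rows.map((row) => row[i].length)),
  );
  return [header, ...rows]
    .map((row) => row.map((cell, i) => cell.padEnd(widths[i])).join("  ").trimEnd())
    .join("\n");
}

// Step 4: combine per-task outcomes into one aggregate object.
function aggregate(results: TaskResult[]) {
  const passed = results.filter((r) => r.passed).length;
  return {
    total: results.length,
    passed,
    failed: results.length - passed,
    totalCostUsd: results.reduce((sum, r) => sum + r.costUsd, 0),
    totalDurationMs: results.reduce((sum, r) => sum + r.durationMs, 0),
    results,
  };
}

// Steps 1-5 together: returns the exit code (1 if any task failed, else 0).
async function evalAll(tasksDir: string, outFile: string, runTask: RunTask): Promise<number> {
  const results = await runAll(discoverTasks(tasksDir), runTask);
  console.log(formatSummary(results));
  const summary = aggregate(results);
  writeFileSync(outFile, JSON.stringify(summary, null, 2));
  return summary.failed > 0 ? 1 : 0;
}
```

An entry point could then set `process.exitCode` from `evalAll("evals/tasks", outFile, runTask)`, where `outFile` and `runTask` are whatever the harness ends up providing. Passing the single-task runner in as a callback keeps the sketch independent of how evals/eval.ts is structured today, and it would work equally well behind a `--all` flag.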

Files

  • evals/eval.ts
  • package.json (new script needed)

Labels: enhancement (New feature or request)
