Problem
There's no way to run all 7 eval tasks at once. Each must be run individually, and there's no aggregate pass/fail summary across tasks.
Currently available:
- npm run eval:greeting (individual)
- npm run eval:code-style (individual)
- etc.
Missing:
- npm run eval:all (run all tasks, produce a summary table)
- Aggregate result file combining all task outcomes
Suggestion
Add a batch runner that:
- Discovers all task configs from evals/tasks/*.json
- Runs each sequentially (or with configurable parallelism)
- Produces a summary table (task, mode, pass/fail, cost, duration)
- Saves an aggregate result JSON
- Exits with non-zero code if any task fails
Could be a new evals/eval-all.ts or a --all flag on the existing harness.
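
To make the suggestion concrete, here is a minimal sketch of what such a batch runner might look like. It is not based on the current harness: the `TaskResult` shape, the `RunTask` hook, and the `evals/results/all.json` output path are all assumptions for illustration, and the real `evals/eval.ts` would need to expose something equivalent to `runTask`.

```typescript
import { mkdirSync, readdirSync, writeFileSync } from "node:fs";
import { basename, dirname, join } from "node:path";

// Hypothetical result shape; the real harness in evals/eval.ts may report different fields.
interface TaskResult {
  task: string;
  mode: string;
  passed: boolean;
  costUsd: number;
  durationMs: number;
}

// Hypothetical hook: runs one task config and resolves with its outcome.
type RunTask = (configPath: string) => Promise<TaskResult>;

// Find every *.json config in the tasks directory, sorted for stable ordering.
function discoverTasks(dir: string): string[] {
  return readdirSync(dir)
    .filter((name) => name.endsWith(".json"))
    .sort()
    .map((name) => join(dir, name));
}

// Run tasks with at most `parallelism` in flight (1 = sequential).
async function runAll(configs: string[], runTask: RunTask, parallelism = 1): Promise<TaskResult[]> {
  const results = new Array<TaskResult>(configs.length);
  let next = 0;
  const worker = async (): Promise<void> => {
    while (next < configs.length) {
      const i = next++;
      try {
        results[i] = await runTask(configs[i]);
      } catch {
        // A task that throws is recorded as a failure instead of aborting the batch.
        const task = basename(configs[i], ".json");
        results[i] = { task, mode: "unknown", passed: false, costUsd: 0, durationMs: 0 };
      }
    }
  };
  const workerCount = Math.max(1, Math.min(parallelism, configs.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Render the (task, mode, pass/fail, cost, duration) table as aligned plain text.
function formatSummary(results: TaskResult[]): string {
  const header = ["task", "mode", "result", "cost", "duration"];
  const rows = results.map((r) => [
    r.task,
    r.mode,
    r.passed ? "PASS" : "FAIL",
    `$${r.costUsd.toFixed(4)}`,
    `${(r.durationMs / 1000).toFixed(1)}s`,
  ]);
  const table = [header, ...rows];
  const widths = header.map((_, col) => Math.max(...table.map((row) => row[col].length)));
  return table
    .map((row) => row.map((cell, col) => cell.padEnd(widths[col])).join("  ").trimEnd())
    .join("\n");
}

// Combine all outcomes into one object suitable for JSON.stringify.
function buildAggregate(results: TaskResult[]) {
  const passed = results.filter((r) => r.passed).length;
  return {
    total: results.length,
    passed,
    failed: results.length - passed,
    totalCostUsd: results.reduce((sum, r) => sum + r.costUsd, 0),
    totalDurationMs: results.reduce((sum, r) => sum + r.durationMs, 0),
    results,
  };
}

// Non-zero when any task failed, so CI can gate on the batch.
function exitCodeFor(results: TaskResult[]): number {
  return results.every((r) => r.passed) ? 0 : 1;
}

// Entry point sketch; `outFile` is a placeholder path, not an existing convention.
async function main(runTask: RunTask, tasksDir = "evals/tasks", outFile = "evals/results/all.json") {
  const results = await runAll(discoverTasks(tasksDir), runTask);
  console.log(formatSummary(results));
  mkdirSync(dirname(outFile), { recursive: true });
  writeFileSync(outFile, JSON.stringify(buildAggregate(results), null, 2));
  process.exitCode = exitCodeFor(results);
}
```

Passing `runTask` in as a parameter keeps the batch logic independent of how a single task is executed, so the same code could back either a standalone evals/eval-all.ts or a --all flag. A task that throws is counted as a failure, which keeps the summary complete even when one task crashes.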
Files
- evals/eval.ts
- package.json (new script needed)