What does one benchmark (a test run across each model) look like?
A proposal:
Benchmarks have:
- A name
- One BAML function template, parameterized over the independent variables described below
- Several tags
What are the independent variables? They're variables whose impact on the benchmark we wish to measure:
- Which model (this is the most important independent variable)
- Whether or not we ask for CoT via the prompt
- Whether or not we use aggressive tricks to reduce token count, such as aliasing field names to shorter forms and omitting the quotation marks around object keys
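To make the size of the experiment concrete, here is a rough sketch of how the three independent variables could be crossed into a list of run configurations. The names `RunConfig`, `use_cot`, `compact_output`, and the model identifiers are placeholders invented for illustration, not part of the proposal.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RunConfig:
    """One point in the grid of independent variables (hypothetical names)."""
    model: str            # which model to run against
    use_cot: bool         # whether the prompt asks for chain-of-thought
    compact_output: bool  # whether token-saving tricks (aliases, unquoted keys) are on


def enumerate_configs(models):
    """Cross every model with both settings of each boolean variable."""
    return [
        RunConfig(model, use_cot, compact_output)
        for model, use_cot, compact_output in product(models, (False, True), (False, True))
    ]


# Placeholder model names; substitute whatever models are under test.
configs = enumerate_configs(["model-a", "model-b"])
```

Under these assumptions each benchmark runs once per configuration, so every added model contributes four more runs.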
What's a tag? A short free-form label that can be used for selecting subsets of benchmarks.
tag = "Agent" | "Multi-Tool" | "Productivity" | "Computer Use" | "Code Assistant" | ...
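As a minimal sketch of how this could fit together, a benchmark can be modeled as a record holding a name, a function template, and a set of tags, with subset selection as a simple filter. The `Benchmark` class, the `select_by_tag` helper, and the benchmark names below are hypothetical, and the template is left as an opaque string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    """A benchmark as proposed: a name, one function template, several tags."""
    name: str
    function_template: str  # opaque placeholder for the BAML function template
    tags: frozenset         # free-form labels such as "Agent" or "Multi-Tool"


def select_by_tag(benchmarks, tag):
    """Return the benchmarks carrying the given tag, preserving input order."""
    return [b for b in benchmarks if tag in b.tags]


# Hypothetical benchmarks, for illustration only.
benchmarks = [
    Benchmark("calendar-scheduling", "<template>", frozenset({"Agent", "Productivity"})),
    Benchmark("refactor-module", "<template>", frozenset({"Code Assistant"})),
    Benchmark("browser-checkout", "<template>", frozenset({"Agent", "Computer Use"})),
]

agent_benchmarks = select_by_tag(benchmarks, "Agent")
```

Keeping the tags as a set makes membership checks cheap and avoids duplicate labels on one benchmark.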