Define the benchmarks #2

@imalsogreg

What does one benchmark (a test run across each model) look like?

A proposal:

Benchmarks have:

  • A name
  • One BAML function template. The template is parameterized over some independent variables
  • Several tags
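
To make the proposed shape concrete, here is a minimal Python sketch of a benchmark record. The class name `Benchmark`, its field names, and the example values are all hypothetical, and the template is left as an opaque string because the proposal doesn't yet say how templates are stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    """One benchmark, as proposed: a name, one template, several tags."""
    name: str
    # Source text of the BAML function template (hypothetical representation).
    template: str
    # Free-form labels used to select subsets of benchmarks.
    tags: frozenset = frozenset()


# Hypothetical example; the template body is a placeholder, not real BAML.
example = Benchmark(
    name="summarize-inbox",
    template="<BAML function template goes here>",
    tags=frozenset({"Productivity"}),
)
```

Making the record frozen keeps a benchmark definition immutable, so the same definition can be reused safely across every model and setting.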

What are the independent variables? They're variables whose impact on the benchmark we wish to measure:

  • Which model (this is the most important independent variable)
  • Whether or not we ask for CoT via the prompt
  • Whether or not we use aggressive tricks to reduce token count, like aliasing field names to shorter forms and eliding object key quotation marks.
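
One possible reading of the list above is that a benchmark expands into one run per combination of independent variables. The sketch below assumes a full cross product and uses hypothetical names (`Variables`, `expand_runs`, `compact_output`) and placeholder model identifiers; the proposal doesn't say whether every combination should actually be run.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Variables:
    """One setting of the independent variables (hypothetical field names)."""
    model: str              # which model to call
    chain_of_thought: bool  # whether the prompt asks for CoT
    compact_output: bool    # whether token-reducing tricks are enabled


def expand_runs(models):
    """Return one Variables value per combination (assumes a full grid)."""
    return [
        Variables(model, cot, compact)
        for model, cot, compact in product(models, (False, True), (False, True))
    ]


# Placeholder model names, not real model identifiers.
runs = expand_runs(["model-a", "model-b"])
print(len(runs))  # 2 models x 2 CoT settings x 2 compaction settings = 8
```

Under this assumption, each added boolean variable doubles the number of runs per benchmark.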

What's a tag? A short free-form label (a word or two) that can be used for selecting subsets of benchmarks.

tag = "Agent" | "Multi-Tool" | "Productivity" | "Computer Use" | "Code Assistant" | ...
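
As a sketch of tag-based selection, the snippet below picks every benchmark that carries at least one of the requested tags. The function name `select_by_tags`, the benchmark names, and the "any tag matches" rule are assumptions for illustration; requiring all tags to match would be an equally reasonable choice.

```python
def select_by_tags(benchmarks, wanted):
    """Return names of benchmarks carrying at least one wanted tag.

    `benchmarks` maps a benchmark name to its set of tags (hypothetical layout).
    """
    wanted = set(wanted)
    return [name for name, tags in benchmarks.items() if wanted & set(tags)]


# Hypothetical benchmarks, tagged with labels from the list above.
benchmarks = {
    "book-flight": {"Agent", "Multi-Tool"},
    "summarize-inbox": {"Productivity"},
    "fix-failing-test": {"Code Assistant", "Agent"},
}

selected = select_by_tags(benchmarks, ["Agent"])
print(selected)  # ['book-flight', 'fix-failing-test']
```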
