support running X eval cases against Y traces #151

@peterj

Description

Today, for each trace, the runner calls _find_expected_invocations once. It picks either the first case whose first user message matches the trace's first user message or, if nothing matches, falls back to the first case. All metrics for that trace then run once, against that single selected gold. There is currently no loop that would let us score the same trace against eval case A, then against eval case B, and so on.

Example
Eval set has two cases: Case A (gold conversation starts with user: “Book a flight to NYC”) and Case B (starts with “Cancel my order”).

You submit one trace whose first user turn is “Book a flight to NYC”. The heuristic selects Case A, so tool_trajectory_avg_score, response_match_score, and the other metrics compare the trace to A’s gold only. Case B is never scored against this trace in that evaluation: you do not get “trace vs B” results unless the pipeline changes (a separate product, and likely runner changes).
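To make the selection behavior concrete, here is a simplified model of the heuristic as described above. It is not the runner's actual code: `EvalCase`, `pick_gold_case`, and the exact-string comparison are assumptions made for illustration, and the real _find_expected_invocations may match messages differently.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    # Hypothetical, minimal stand-in for an eval case.
    name: str
    first_user_message: str


def pick_gold_case(trace_first_user_message: str, cases: list[EvalCase]) -> EvalCase:
    """Return the first case whose first user message matches, else the first case.

    Assumption: matching is exact string equality.
    """
    for case in cases:
        if case.first_user_message == trace_first_user_message:
            return case
    # No match: fall back to the first case in the eval set.
    return cases[0]


cases = [
    EvalCase("A", "Book a flight to NYC"),
    EvalCase("B", "Cancel my order"),
]

matched = pick_gold_case("Book a flight to NYC", cases)
fallback = pick_gold_case("What is the weather?", cases)
print(matched.name, fallback.name)
```

In both calls exactly one case is returned, so every metric for the trace is computed against one gold only. That single-selection step is what prevents “trace vs B” results.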

Desired state
With an input of X eval cases and Y traces, we should run and score each case against each trace.
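A rough sketch of what the desired loop could look like, assuming cases and traces are plain dicts and each metric is a callable taking `(trace, case)`. The names `score_all_pairs` and `first_message_match` are hypothetical and not part of the existing runner. The toy metric only stands in for real ones such as tool_trajectory_avg_score.

```python
from itertools import product
from typing import Callable

# Assumed metric signature: (trace, case) -> score.
Metric = Callable[[dict, dict], float]


def score_all_pairs(
    cases: list[dict], traces: list[dict], metrics: dict[str, Metric]
) -> list[dict]:
    """Score every (case, trace) pair, producing X * Y result rows."""
    results = []
    for case, trace in product(cases, traces):
        results.append(
            {
                "case": case["name"],
                "trace": trace["id"],
                "scores": {name: fn(trace, case) for name, fn in metrics.items()},
            }
        )
    return results


def first_message_match(trace: dict, case: dict) -> float:
    # Toy metric for illustration only: 1.0 if the first user messages are equal.
    return 1.0 if trace["first_user_message"] == case["first_user_message"] else 0.0


cases = [
    {"name": "A", "first_user_message": "Book a flight to NYC"},
    {"name": "B", "first_user_message": "Cancel my order"},
]
traces = [
    {"id": "t1", "first_user_message": "Book a flight to NYC"},
    {"id": "t2", "first_user_message": "Cancel my order"},
]

results = score_all_pairs(cases, traces, {"first_message_match": first_message_match})
for row in results:
    print(row["case"], row["trace"], row["scores"])
```

With 2 cases and 2 traces this yields 4 rows, including the “t1 vs B” pairing that the current single-selection flow never produces. Keying each row by both case and trace keeps results unambiguous when the same trace is scored more than once.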
