support running X eval cases against Y traces

Today, the runner calls `_find_expected_invocations` once per trace. That function picks the first case whose first user message matches the trace's first user message or, if nothing matches, falls back to the first case. All metrics for that trace then run once, against that single selected gold. There is currently no loop that would let us score the same trace against eval case A, then against eval case B, and so on.

**Example**
Eval set has two cases: Case A (gold conversation starts with user: “Book a flight to NYC”) and Case B (starts with “Cancel my order”). 

You submit one trace whose first user turn is “Book a flight to NYC”. The heuristic selects Case A, so `tool_trajectory_avg_score`, `response_match_score`, and the other metrics compare the trace to A’s gold only. Case B is never scored against this trace in that evaluation: you do not get “trace vs B” results unless the pipeline changes (separate product work, likely with runner changes).
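
To make the current behavior concrete, here is a minimal sketch of the selection heuristic as described above. It is not the actual body of `_find_expected_invocations`; the `EvalCase` dataclass and the `pick_case` function are hypothetical names introduced only for illustration, and matching is assumed to be exact string equality on the first user message.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    # Hypothetical container: a case name plus the gold conversation's user turns.
    name: str
    user_messages: list[str]


def pick_case(cases: list[EvalCase], trace_user_messages: list[str]) -> EvalCase:
    """Hypothetical stand-in for the heuristic: first-message match, else first case."""
    first_turn = trace_user_messages[0]
    for case in cases:
        # Assumption: a "match" means the first user messages are identical strings.
        if case.user_messages and case.user_messages[0] == first_turn:
            return case
    # No case matched, so fall back to the first case in the eval set.
    return cases[0]


cases = [
    EvalCase("A", ["Book a flight to NYC"]),
    EvalCase("B", ["Cancel my order"]),
]

selected = pick_case(cases, ["Book a flight to NYC"])
print(selected.name)  # prints "A"; Case B is never considered for scoring
```

Because exactly one case is returned per trace, every metric downstream only ever sees one gold, which is the limitation this issue is about.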

**Desired state**
With an input of X eval cases and Y traces, we should run and score **each case against each trace**.
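
One possible shape for the desired loop, as a rough sketch only: iterate over the cross product of cases and traces and run every metric on each pair, giving X × Y result entries. The names `score_all_pairs`, `first_turn_match`, and the dict-based case/trace layout are assumptions made for this example and do not reflect the runner's real data model or metric signatures.

```python
from itertools import product
from typing import Callable

# Hypothetical metric signature: (eval_case, trace) -> score.
Metric = Callable[[dict, dict], float]


def score_all_pairs(
    cases: dict[str, dict],
    traces: dict[str, dict],
    metrics: dict[str, Metric],
) -> dict[tuple[str, str], dict[str, float]]:
    """Score every eval case against every trace (X * Y pairs)."""
    results: dict[tuple[str, str], dict[str, float]] = {}
    for (case_id, case), (trace_id, trace) in product(cases.items(), traces.items()):
        # Each pair gets its own set of metric scores, keyed by metric name.
        results[(case_id, trace_id)] = {
            name: metric(case, trace) for name, metric in metrics.items()
        }
    return results


def first_turn_match(case: dict, trace: dict) -> float:
    # Toy metric for illustration: 1.0 if the first user messages are identical.
    return 1.0 if case["user_messages"][0] == trace["user_messages"][0] else 0.0


cases = {
    "A": {"user_messages": ["Book a flight to NYC"]},
    "B": {"user_messages": ["Cancel my order"]},
}
traces = {"t1": {"user_messages": ["Book a flight to NYC"]}}

results = score_all_pairs(cases, traces, {"first_turn_match": first_turn_match})
for pair, scores in results.items():
    print(pair, scores)
```

Keying results by `(case_id, trace_id)` keeps "trace vs A" and "trace vs B" as separate entries, so a low score against a non-matching case is reported rather than silently skipped.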

support running X eval cases against Y traces #151
