Today, the runner calls _find_expected_invocations once per trace. That call picks the first case whose first user message matches the trace's first user message; if no case matches, it falls back to the first case. All metrics for that trace then run once, against that single selected gold. There is currently no loop that would score the same trace against eval case A, then against eval case B, and so on.
Example
Eval set has two cases: Case A (gold conversation starts with user: “Book a flight to NYC”) and Case B (starts with “Cancel my order”).
You submit one trace whose first user turn is “Book a flight to NYC”. The heuristic selects Case A, so tool_trajectory_avg_score, response_match_score, and the other metrics compare the trace to A’s gold only. Case B is not scored against this trace in that evaluation: you do not get “trace vs B” results unless you change the pipeline (separate product / likely runner changes).
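The selection behavior described above can be sketched roughly as follows. This is an illustration only, not the runner's actual code: EvalCase and pick_gold_case are hypothetical names, and the real _find_expected_invocations may take different arguments and compare messages differently.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class EvalCase:
    # Hypothetical stand-in for an eval case; not the runner's real type.
    name: str
    user_messages: List[str]


def pick_gold_case(trace_user_messages: Sequence[str],
                   cases: Sequence[EvalCase]) -> EvalCase:
    """Return the first case whose first user message matches the trace's.

    Falls back to the first case when nothing matches (assumed behavior,
    based on the description above).
    """
    if not cases:
        raise ValueError("eval set has no cases")
    first_turn = trace_user_messages[0] if trace_user_messages else None
    for case in cases:
        if case.user_messages and case.user_messages[0] == first_turn:
            return case
    return cases[0]  # fallback: no first-turn match


case_a = EvalCase("A", ["Book a flight to NYC"])
case_b = EvalCase("B", ["Cancel my order"])

selected = pick_gold_case(["Book a flight to NYC"], [case_a, case_b])
print(selected.name)
```

Only the one returned case is ever handed to the metrics, which is why no “trace vs B” result exists for this trace.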
Desired state
With an input of X eval cases and Y traces, we should run and score each case against each trace, producing X × Y scored (case, trace) pairs.
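A minimal sketch of what that cross-product loop could look like, under stated assumptions: score_all_pairs and first_turn_match are hypothetical names, conversations are reduced to lists of user messages, and the toy metric stands in for the real metrics (such as tool_trajectory_avg_score), whose actual signatures are not shown here.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Conversation = List[str]
# Assumed metric shape: (trace, gold) -> score. The real metrics may differ.
Metric = Callable[[Conversation, Conversation], float]


def first_turn_match(trace: Conversation, gold: Conversation) -> float:
    """Toy metric: 1.0 if both conversations start with the same turn."""
    return 1.0 if trace and gold and trace[0] == gold[0] else 0.0


def score_all_pairs(
    cases: Dict[str, Conversation],
    traces: Dict[str, Conversation],
    metrics: Dict[str, Metric],
) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Score every trace against every eval case (X cases * Y traces pairs)."""
    results: Dict[Tuple[str, str], Dict[str, float]] = {}
    for (case_id, gold), (trace_id, trace) in product(cases.items(), traces.items()):
        results[(case_id, trace_id)] = {
            name: metric(trace, gold) for name, metric in metrics.items()
        }
    return results


cases = {"A": ["Book a flight to NYC"], "B": ["Cancel my order"]}
traces = {"t1": ["Book a flight to NYC"]}

results = score_all_pairs(cases, traces, {"first_turn_match": first_turn_match})
for pair, scores in results.items():
    print(pair, scores)
```

Keying results by (case, trace) keeps the “trace vs B” scores alongside the “trace vs A” scores instead of discarding every case but the one the heuristic selects.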