What metrics do we collect for each `(model * query)`? One quick proposal: - Wall-clock time - Parse succeeded - Number of check failures (use checks as evals) - Output tokens
What metrics do we collect for each
(model * query)?One quick proposal: