- **SKIPPED** ⏭️: Evaluation skipped due to prior failure (when `skip_on_failure` is enabled)
### Performance Metrics (API Enabled Only)

**API Latency**: Response time per API call, with percentile statistics (p50, p95, p99). Cached responses (zero tokens) are excluded so they do not skew the statistics.

**Streaming Metrics**: Time-to-first-token, streaming duration, and tokens per second when using streaming endpoints.

**Token Usage**: Token consumption across the Judge LLM, embeddings, and API calls.

**Note:** Cached responses are detected by zero `api_input_tokens` and zero `api_output_tokens`. Their latency is set to 0, and they are left out of the latency statistics.
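The cached-response rule and the percentile statistics can be illustrated with a minimal sketch. It is not the framework's actual implementation: the `api_latency` key, the helper names, and the nearest-rank percentile method are assumptions made for illustration. Only `api_input_tokens` and `api_output_tokens` come from the guide.

```python
import math


def is_cached(record):
    # Per the note above: a cached response reports zero input and output tokens.
    return record["api_input_tokens"] == 0 and record["api_output_tokens"] == 0


def percentile(sorted_values, pct):
    # Nearest-rank percentile (an assumption; the real tool may interpolate).
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def latency_stats(records):
    # Drop cached responses first so their zero latency does not skew the stats.
    latencies = sorted(r["api_latency"] for r in records if not is_cached(r))
    if not latencies:
        return {}
    return {f"p{p}": percentile(latencies, p) for p in (50, 95, 99)}


# Hypothetical sample data; "api_latency" is an illustrative field name.
records = [
    {"api_latency": 1.2, "api_input_tokens": 100, "api_output_tokens": 40},
    {"api_latency": 0.0, "api_input_tokens": 0, "api_output_tokens": 0},  # cached
    {"api_latency": 0.8, "api_input_tokens": 80, "api_output_tokens": 25},
    {"api_latency": 2.5, "api_input_tokens": 300, "api_output_tokens": 120},
]

stats = latency_stats(records)
print(stats)
```

With the cached record filtered out, the percentiles are computed over the three real calls only. Including the zero-latency entry would pull the lower percentiles down.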
### Score Quality Levels
| Score | Quality | Recommendation |
@@ -1912,4 +1923,3 @@ This comprehensive guide has covered everything you need to know to effectively
*This guide is designed to make AI evaluation accessible to everyone. Whether you're a product manager making decisions, a QA engineer testing systems, or a developer integrating evaluation into workflows, you now have everything you need to ensure your AI applications meet quality standards.*