Commit 377a66b

[LEADS-348, LEADS-364] API Latency and Token Calculation (#230)
* LEADS-349-calculate-aggregated-score-from-key-metrics
* API latency calculation
1 parent f09e6d0 commit 377a66b

24 files changed

Lines changed: 1332 additions & 661 deletions

README.md

Lines changed: 1 addition & 0 deletions
@@ -14,6 +14,7 @@ A comprehensive framework for evaluating GenAI applications.
 - **API Integration**: Direct integration with external API for real-time data generation (if enabled)
 - **Setup/Cleanup Scripts**: Support for running setup and cleanup scripts before/after each conversation evaluation (applicable when API is enabled)
 - **Token Usage Tracking**: Track input/output tokens for both API calls and Judge LLM evaluations (per-judge tracking for panel mode)
+- **API Latency Tracking**: Measure and analyze API response times with percentile statistics (p50, p95, p99) for performance monitoring
 - **Streaming Performance Metrics**: Capture time-to-first-token (TTFT), streaming duration, and tokens/second when using streaming endpoint
 - **Statistical Analysis**: Statistics for every metric with score distribution analysis
 - **Rich Output**: CSV, JSON, TXT reports + visualization graphs (pass rates, distributions, heatmaps)

config/system.yaml

Lines changed: 1 addition & 0 deletions
@@ -274,6 +274,7 @@ storage:
 - "response"
 - "api_input_tokens"
 - "api_output_tokens"
+- "agent_latency"
 # Streaming performance metrics (only populated when using streaming endpoint)
 - "time_to_first_token" # Time to first token in seconds
 - "streaming_duration" # Total streaming duration in seconds

docs/EVALUATION_GUIDE.md

Lines changed: 11 additions & 1 deletion
@@ -1127,6 +1127,7 @@ Contains every metric evaluation with:
 - Detailed reasoning
 - Query and response text
 - Execution time
+- API latency

 **Use for:** Drilling into specific failures, detailed analysis

@@ -1180,6 +1181,16 @@ ragas:faithfulness:
 - **ERROR** ⚠️: Evaluation couldn't complete (missing data, API failure, etc.)
 - **SKIPPED** ⏭️: Evaluation skipped due to prior failure (when `skip_on_failure` is enabled)

+### Performance Metrics (API Enabled Only)
+
+**API Latency**: Response time per API call with percentile stats (p50, p95, p99). Cached responses (zero tokens) are excluded to avoid skewing statistics.
+
+**Streaming Metrics**: Time-to-first-token, streaming duration, and tokens/second when using streaming endpoints.
+
+**Token Usage**: Track consumption across Judge LLM, embeddings, and API calls.
+
+**Note:** Cached responses are detected by zero `api_input_tokens` and `api_output_tokens` — latency is set to 0 for these.
+
 ### Score Quality Levels

 | Score | Quality | Recommendation |
@@ -1912,4 +1923,3 @@ This comprehensive guide has covered everything you need to know to effectively
 *This guide is designed to make AI evaluation accessible to everyone. Whether you're a product manager making decisions, a QA engineer testing systems, or a developer integrating evaluation into workflows, you now have everything you need to ensure your AI applications meet quality standards.*

 **Happy Evaluating! 🚀**
-
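
The "Performance Metrics" text added to docs/EVALUATION_GUIDE.md describes percentile latency statistics that leave out cached responses. The commit does not show the body of `compute_agent_latency_stats`, so the following is only a minimal sketch of that rule. The helper names `percentile` and `latency_summary`, the dict-shaped turns, and the linear-interpolation percentile method are all assumptions made for illustration.

```python
import math
from typing import Optional


def percentile(sorted_values: list[float], pct: float) -> float:
    """Percentile by linear interpolation over already-sorted values (assumed method)."""
    if not sorted_values:
        raise ValueError("percentile() needs at least one value")
    rank = (len(sorted_values) - 1) * pct / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    if lower == upper:
        return sorted_values[lower]
    fraction = rank - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def latency_summary(turns: list[dict]) -> Optional[dict]:
    """Summarize agent_latency, skipping turns that look cached.

    A turn is treated as cached when both token counts are zero,
    following the note in the evaluation guide.
    """
    live = sorted(
        turn["agent_latency"]
        for turn in turns
        if turn["api_input_tokens"] > 0 or turn["api_output_tokens"] > 0
    )
    if not live:
        return None  # nothing but cached turns, so no statistics
    return {
        "count": len(live),
        "p50": percentile(live, 50),
        "p95": percentile(live, 95),
        "p99": percentile(live, 99),
    }
```

Dropping cached turns before sorting keeps their zero latencies from pulling the lower percentiles toward zero, which is the skew the guide warns about.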

src/lightspeed_evaluation/core/constants.py

Lines changed: 1 addition & 0 deletions
@@ -110,6 +110,7 @@
     # Streaming performance metrics
     "time_to_first_token",
     "streaming_duration",
+    "agent_latency",
     "tokens_per_second",
     "tool_calls",
     "contexts",

src/lightspeed_evaluation/core/models/__init__.py

Lines changed: 2 additions & 2 deletions
@@ -43,7 +43,7 @@
     ConversationStats,
     TagStats,
     StreamingStats,
-    ApiTokenUsage,
+    AgentTokenUsage,
     ConfidenceInterval,
     DetailedStats,
 )
@@ -84,7 +84,7 @@
     "ConversationStats",
     "TagStats",
     "StreamingStats",
-    "ApiTokenUsage",
+    "AgentTokenUsage",
     "ConfidenceInterval",
     "DetailedStats",
     # API models

src/lightspeed_evaluation/core/models/data.py

Lines changed: 10 additions & 0 deletions
@@ -84,6 +84,11 @@ class TurnData(StreamingMetricsMixin):
         default=0, ge=0, description="Output tokens used by API call"
     )

+    # API execution time tracking (per turn)
+    agent_latency: float = Field(
+        default=0, ge=0, description="API call latency for this turn in seconds"
+    )
+
     # Per-turn metrics support
     turn_metrics: Optional[list[str]] = Field(
         default=None,
@@ -526,6 +531,11 @@ class EvaluationResult(MetricResult, StreamingMetricsMixin):
     execution_time: float = Field(
         default=0, ge=0, description="Execution time in seconds"
     )
+    agent_latency: float = Field(
+        default=0,
+        ge=0,
+        description="API latency in seconds (per turn or average for conversation)",
+    )
     api_input_tokens: int = Field(default=0, ge=0, description="API input tokens used")
     api_output_tokens: int = Field(
         default=0, ge=0, description="API output tokens used"
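
The new `agent_latency` field stores per-turn API call time in seconds, but this commit view does not include the code that fills it. One plausible way to populate it is sketched below. `timed_call`, `record_turn`, and the dict-shaped response with `input_tokens` / `output_tokens` keys are hypothetical names used for illustration, not part of the project.

```python
import time
from typing import Any, Callable


def timed_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> tuple[Any, float]:
    """Run fn and return (result, elapsed seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def record_turn(query: str, call_api: Callable[[str], dict]) -> dict:
    """Build a turn record with agent_latency filled in.

    call_api is a hypothetical client function returning a dict with
    'text', 'input_tokens' and 'output_tokens' keys.
    """
    response, elapsed = timed_call(call_api, query)
    # Zero tokens on both sides is treated as a cached response,
    # in which case latency is recorded as 0 (per the guide's note).
    cached = response["input_tokens"] == 0 and response["output_tokens"] == 0
    return {
        "response": response["text"],
        "api_input_tokens": response["input_tokens"],
        "api_output_tokens": response["output_tokens"],
        "agent_latency": 0.0 if cached else elapsed,
    }
```

`time.perf_counter` is used rather than `time.time` because it is monotonic and intended for measuring short durations.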

src/lightspeed_evaluation/core/models/quality.py

Lines changed: 18 additions & 10 deletions
@@ -9,7 +9,13 @@

 from pydantic import BaseModel, Field

-from lightspeed_evaluation.core.models import MetricStats, ScoreStatistics
+from lightspeed_evaluation.core.models.statistics import (
+    MetricStats,
+    NumericStats,
+    ScoreStatistics,
+    AgentTokenStats,
+)
+

 logger = logging.getLogger(__name__)

@@ -44,17 +50,18 @@ class QualityReport(BaseModel):
         default_factory=list,
         description="Warnings about quality metrics configuration or usage",
     )
-    api_latency: float = Field(
-        default=0.0, description="[Placeholder] Average API response time in seconds"
+    agent_latency_stats: Optional[NumericStats] = Field(
+        default=None, description="Agent latency statistics"
     )
-    api_tokens: int = Field(
-        default=0,
-        description="[Placeholder] Total number of tokens consumed across all API calls",
+    agent_token_stats: Optional[AgentTokenStats] = Field(
+        default=None, description="Agent token usage statistics"
     )

     @staticmethod
     def create_report(
         by_metric: dict[str, MetricStats],
+        agent_latency_stats: Optional[NumericStats],
+        agent_token_stats: Optional[AgentTokenStats],
         quality_score_metrics: list[str],
     ) -> Optional["QualityReport"]:
         """Creates a quality report with aggregated quality score from selected metrics.
@@ -64,6 +71,8 @@ def create_report(

         Args:
             by_metric: Dictionary mapping metric identifiers to their computed statistics.
+            agent_latency_stats: Agent API latency statistics (p50, p95, p99).
+            agent_token_stats: Agent token usage statistics with percentiles.
             quality_score_metrics: Metric identifiers to include in quality score calculation.
                 All specified metrics must exist in by_metric.

@@ -148,14 +157,13 @@ def create_report(
             if stats is not None:
                 extra_metrics[metric_id] = stats

-        # Calculate aggregated quality score
-        aggregated_score = QualityReport._calculate_quality_score(quality_metrics)
-
         return QualityReport(
-            quality_score=aggregated_score,
+            quality_score=QualityReport._calculate_quality_score(quality_metrics),
             quality_metrics=quality_metrics,
             extra_metrics=extra_metrics,
             warnings=warnings,
+            agent_latency_stats=agent_latency_stats,
+            agent_token_stats=agent_token_stats,
         )

     @staticmethod

src/lightspeed_evaluation/core/models/statistics.py

Lines changed: 19 additions & 3 deletions
@@ -5,14 +5,16 @@


 class NumericStats(BaseModel):
-    """Numeric statistics for a set of values (e.g., TTFT, duration)."""
+    """Numeric statistics for a set of values (e.g., TTFT, duration, latency)."""

     count: int = Field(default=0, description="Number of values")
     mean: Optional[float] = Field(default=None, description="Mean value")
     median: Optional[float] = Field(default=None, description="Median value")
     std: Optional[float] = Field(default=None, description="Standard deviation")
     min_value: Optional[float] = Field(default=None, description="Minimum value")
     max_value: Optional[float] = Field(default=None, description="Maximum value")
+    p95: Optional[float] = Field(default=None, description="95th percentile")
+    p99: Optional[float] = Field(default=None, description="99th percentile")


 class ConfidenceInterval(BaseModel):
@@ -116,11 +118,25 @@ class StreamingStats(BaseModel):
     )


-class ApiTokenUsage(BaseModel):
-    """API token usage totals."""
+class AgentTokenStats(BaseModel):
+    """Agent token usage statistics with percentiles."""
+
+    input: Optional[NumericStats] = Field(
+        default=None, description="Input token statistics"
+    )
+    output: Optional[NumericStats] = Field(
+        default=None, description="Output token statistics"
+    )
+
+
+class AgentTokenUsage(BaseModel):
+    """Agent token usage totals and statistics."""

     total_api_input_tokens: int = Field(default=0, description="Total API input tokens")
     total_api_output_tokens: int = Field(
         default=0, description="Total API output tokens"
     )
     total_api_tokens: int = Field(default=0, description="Total API tokens")
+    statistics: Optional[AgentTokenStats] = Field(
+        default=None, description="Agent token usage statistics with percentiles"
+    )
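
To see how the nested models in statistics.py fit together, here is a rough sketch that fills an `AgentTokenUsage`-shaped object from per-turn token counts. The commit does not show `compute_agent_token_usage`, so this is not its implementation. Plain dataclasses stand in for the Pydantic models so the example needs only the standard library, and `describe`, `summarize_tokens`, and the `(input, output)` tuple input are assumptions. The p95/p99 values use `statistics.quantiles` with the `inclusive` method, which may differ from the interpolation the project uses.

```python
from dataclasses import dataclass
from statistics import mean, median, quantiles, stdev
from typing import Optional


@dataclass
class NumericStats:  # dataclass stand-in mirroring the fields in the diff
    count: int = 0
    mean: Optional[float] = None
    median: Optional[float] = None
    std: Optional[float] = None
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    p95: Optional[float] = None
    p99: Optional[float] = None


@dataclass
class AgentTokenStats:
    input: Optional[NumericStats] = None
    output: Optional[NumericStats] = None


@dataclass
class AgentTokenUsage:
    total_api_input_tokens: int = 0
    total_api_output_tokens: int = 0
    total_api_tokens: int = 0
    statistics: Optional[AgentTokenStats] = None


def describe(values: list[int]) -> Optional[NumericStats]:
    """Hypothetical helper: summarize a list of token counts."""
    if not values:
        return None
    if len(values) == 1:
        # quantiles() and stdev() need two or more points; std stays None.
        only = float(values[0])
        return NumericStats(count=1, mean=only, median=only, min_value=only,
                            max_value=only, p95=only, p99=only)
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
    return NumericStats(
        count=len(values), mean=mean(values), median=median(values),
        std=stdev(values), min_value=min(values), max_value=max(values),
        p95=cuts[94], p99=cuts[98],
    )


def summarize_tokens(turns: list[tuple[int, int]]) -> AgentTokenUsage:
    """Build totals plus per-direction statistics from (input, output) pairs."""
    inputs = [i for i, _ in turns]
    outputs = [o for _, o in turns]
    return AgentTokenUsage(
        total_api_input_tokens=sum(inputs),
        total_api_output_tokens=sum(outputs),
        total_api_tokens=sum(inputs) + sum(outputs),
        statistics=AgentTokenStats(input=describe(inputs), output=describe(outputs)),
    )
```

Keeping the totals as plain integers and the distribution in an optional nested object means existing consumers of the totals are unaffected when no statistics are available.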

src/lightspeed_evaluation/core/models/summary.py

Lines changed: 17 additions & 8 deletions
@@ -10,15 +10,17 @@
     EvaluationResult,
 )
 from lightspeed_evaluation.core.models.statistics import (
-    ApiTokenUsage,
+    AgentTokenUsage,
+    NumericStats,
     ConversationStats,
     MetricStats,
     OverallStats,
     StreamingStats,
     TagStats,
 )
 from lightspeed_evaluation.core.output.statistics import (
-    compute_api_token_usage,
+    compute_agent_token_usage,
+    compute_agent_latency_stats,
     compute_overall_stats,
     compute_streaming_stats,
     compute_tag_stats,
@@ -50,8 +52,11 @@ class EvaluationSummary(BaseModel):
     by_tag: dict[str, TagStats] = Field(
         default_factory=dict, description="Statistics per tag"
     )
-    api_tokens: Optional[ApiTokenUsage] = Field(
-        default=None, description="API token usage (when evaluation data provided)"
+    agent_token_usage: Optional[AgentTokenUsage] = Field(
+        default=None, description="Agent token usage with totals and statistics"
+    )
+    agent_latency_stats: Optional[NumericStats] = Field(
+        default=None, description="Agent latency statistics (when API enabled)"
     )
     streaming: Optional[StreamingStats] = Field(
         default=None, description="Streaming performance stats (when available)"
@@ -70,7 +75,8 @@ def from_results(

         Args:
             results: List of evaluation results to summarize.
-            evaluation_data: Optional evaluation data for API token and streaming stats.
+            evaluation_data: Optional evaluation data for API token, agent latency,
+                and streaming stats.
             compute_confidence_intervals: Whether to compute bootstrap confidence
                 intervals. Default False.

@@ -88,11 +94,13 @@ def from_results(
         by_tag = compute_tag_stats(results, compute_confidence_intervals)

         # Compute API token usage and streaming stats if evaluation data provided
-        api_tokens = None
         streaming = None
+        agent_token_usage = None
+        agent_latency_stats = None
         if evaluation_data:
-            api_tokens = compute_api_token_usage(evaluation_data)
             streaming = compute_streaming_stats(evaluation_data)
+            agent_token_usage = compute_agent_token_usage(evaluation_data)
+            agent_latency_stats = compute_agent_latency_stats(evaluation_data)

         return cls(
             timestamp=timestamp,
@@ -101,6 +109,7 @@ def from_results(
             by_metric=by_metric,
             by_conversation=by_conversation,
             by_tag=by_tag,
-            api_tokens=api_tokens,
+            agent_token_usage=agent_token_usage,
+            agent_latency_stats=agent_latency_stats,
             streaming=streaming,
         )
