Commit 4ab0efe

updated litellm cost calculator
1 parent dd4864f commit 4ab0efe

14 files changed

Lines changed: 486 additions & 148 deletions

docs/guides/index.md

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -8,3 +8,4 @@ Guides provide an in-depth exploration of MASEval's features and best practices.
88
| [Configuration Gathering](config-gathering.md) | Collect and export configuration for reproducibility |
99
| [Exception Handling](exception-handling.md) | Distinguish agent errors from infrastructure failures |
1010
| [Seeding](seeding.md) | Enable reproducible benchmark runs with deterministic seeds |
11+
| [Usage & Cost Tracking](usage-tracking.md) | Track token usage and compute cost across providers |

docs/guides/usage-tracking.md

Lines changed: 309 additions & 0 deletions
@@ -0,0 +1,309 @@
1+
# Usage & Cost Tracking
2+
3+
## Overview
4+
5+
MASEval provides first-class usage and cost tracking to monitor resource consumption during benchmark execution. This is useful for:
6+
7+
- **Cost control**: Track how much each benchmark run costs across providers
8+
- **Budgeting**: Compare cost across models, tasks, and components
9+
- **Billing**: Support custom credit systems (university clusters, internal APIs)
10+
- **Analysis**: Understand token usage patterns per task, agent, or model
11+
12+
!!! info "Usage vs Cost"
13+
14+
**Usage** = Token counts and arbitrary resource units (API calls, data points, etc.)
15+
16+
**Cost** = Monetary value computed from usage (USD, EUR, credits, etc.)
17+
18+
Usage is always tracked automatically for LLM calls. Cost requires either a provider that reports it (e.g., LiteLLM) or a pluggable cost calculator.
19+
20+
## Core Concepts
21+
22+
**`Usage`**: Generic usage record for any billable resource — cost, arbitrary units, and grouping metadata.
23+
24+
**`TokenUsage`**: LLM-specific extension of `Usage` with token fields (`input_tokens`, `output_tokens`, `cached_input_tokens`, etc.).
25+
26+
**`UsageTrackableMixin`**: Mixin that enables automatic usage collection for any component via `gather_usage()`.
27+
28+
**`CostCalculator`**: Protocol for pluggable cost computation from token counts.
29+
30+
## Automatic LLM Usage Tracking
31+
32+
All `ModelAdapter` subclasses track token usage automatically. No configuration needed — every `chat()` call records a `TokenUsage` entry internally.
33+
34+
```python
35+
from maseval.interface.inference import OpenAIModelAdapter
36+
37+
model = OpenAIModelAdapter(client=client, model_id="gpt-4")
38+
39+
# Make some calls
40+
model.chat([{"role": "user", "content": "Hello"}])
41+
model.chat([{"role": "user", "content": "How are you?"}])
42+
43+
# Inspect accumulated usage
44+
usage = model.gather_usage()
45+
print(usage.input_tokens) # e.g., 25
46+
print(usage.output_tokens) # e.g., 42
47+
print(usage.cost) # None (no cost calculator configured)
48+
```
49+
50+
### In Benchmarks
51+
52+
Usage is collected automatically alongside traces and configs after each task repetition. Each report includes a `"usage"` key:
53+
54+
```python
55+
results = benchmark.run()
56+
57+
for report in results:
58+
print(f"Task {report['task_id']}: {report['usage']}")
59+
```
60+
61+
Live running totals are available during execution:
62+
63+
```python
64+
benchmark.usage # -> Usage (grand total across all tasks)
65+
benchmark.usage_by_component # -> Dict[str, Usage] (per-component totals)
66+
```
67+
68+
## Cost Calculation
69+
70+
Most LLM APIs return token counts but not cost, so cost has to be computed on the client side. MASEval provides two built-in cost calculators and a protocol for custom ones.
71+
72+
### Cost Priority
73+
74+
When a `ModelAdapter` records usage after a `chat()` call, cost is resolved in this order:
75+
76+
1. **Provider-reported cost** — e.g., LiteLLM sets `response._hidden_params.response_cost` directly. This always wins.
77+
2. **CostCalculator** — if no provider cost, the adapter calls `calculator.calculate_cost(token_usage, model_id)`.
78+
3. **None** — if neither source provides cost, `Usage.cost` stays `None`.
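The resolution order above can be sketched as a small standalone function. This is a minimal sketch under stated assumptions, not MASEval's implementation: `resolve_cost`, `FlatRateCalculator`, and the reduced `TokenUsage` stand-in are hypothetical names, and only the `calculate_cost(usage, model_id)` method shape is taken from this guide.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TokenUsage:
    """Hypothetical stand-in for maseval's TokenUsage (token fields only)."""
    input_tokens: int = 0
    output_tokens: int = 0


class FlatRateCalculator:
    """Hypothetical calculator: one flat per-token rate per known model."""

    def __init__(self, rates: Dict[str, float]):
        self.rates = rates

    def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]:
        rate = self.rates.get(model_id)
        if rate is None:
            return None  # no pricing for this model
        return rate * (usage.input_tokens + usage.output_tokens)


def resolve_cost(provider_cost, calculator, usage, model_id):
    # 1. A provider-reported cost always wins.
    if provider_cost is not None:
        return provider_cost
    # 2. Otherwise fall back to the calculator, if one is configured.
    if calculator is not None:
        return calculator.calculate_cost(usage, model_id)
    # 3. Neither source is available: the cost stays unknown.
    return None
```

Note that a provider-reported cost of `0.0` still counts as known in this sketch, because the check is against `None` rather than truthiness.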
79+
80+
### StaticPricingCalculator
81+
82+
Zero-dependency calculator using user-supplied per-token rates. Lives in `maseval.core.usage`.
83+
84+
```python
85+
from maseval import StaticPricingCalculator
86+
87+
calculator = StaticPricingCalculator({
88+
"gpt-4": {"input": 0.00003, "output": 0.00006},
89+
"claude-sonnet-4-5": {"input": 0.000003, "output": 0.000015},
90+
})
91+
92+
model = OpenAIModelAdapter(
93+
client=client,
94+
model_id="gpt-4",
95+
cost_calculator=calculator,
96+
)
97+
98+
response = model.chat([{"role": "user", "content": "Hello"}])
99+
print(model.gather_usage().cost) # e.g., 0.00234
100+
```
101+
102+
Pricing is per token (not per 1K or 1M tokens). Cached input tokens are handled automatically; set a `"cached_input"` rate to price them separately from regular input tokens:
103+
104+
```python
105+
calculator = StaticPricingCalculator({
106+
"claude-sonnet-4-5": {
107+
"input": 0.000003,
108+
"output": 0.000015,
109+
"cached_input": 0.0000003, # 10x cheaper for cached tokens
110+
},
111+
})
112+
```
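Because rates are per token, prices quoted per million tokens need converting before they go into the pricing table. The following is a minimal sketch of that arithmetic: `per_million_to_per_token` and `simple_cost` are hypothetical helpers, the prices are the illustrative ones used in this guide, and cached tokens are left out because the exact way MASEval separates them from regular input tokens is not described here.

```python
def per_million_to_per_token(price_per_million: float) -> float:
    """Convert a price quoted per 1M tokens into a per-token rate."""
    return price_per_million / 1_000_000


def simple_cost(rates: dict, input_tokens: int, output_tokens: int) -> float:
    """Per-token cost without any cached-token handling."""
    return rates["input"] * input_tokens + rates["output"] * output_tokens


rates = {
    "input": per_million_to_per_token(3.0),    # 3.00 per 1M tokens
    "output": per_million_to_per_token(15.0),  # 15.00 per 1M tokens
}

# 1,000 input tokens and 500 output tokens at the rates above.
cost = simple_cost(rates, input_tokens=1_000, output_tokens=500)
print(f"{cost:.4f}")
```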
113+
114+
For custom unit systems (university credits, EUR, etc.), the "cost" unit is whatever your pricing represents:
115+
116+
```python
117+
calculator = StaticPricingCalculator({
118+
"llama-3-70b": {"input": 0.5, "output": 1.0}, # credits per token
119+
})
120+
```
121+
122+
### LiteLLMCostCalculator
123+
124+
Uses LiteLLM's bundled [model pricing database](https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json) for automatic cost calculation. Covers OpenAI, Anthropic, Google, Mistral, Cohere, and many more.
125+
126+
```python
127+
from maseval.interface.usage import LiteLLMCostCalculator
128+
129+
calculator = LiteLLMCostCalculator()
130+
131+
model = OpenAIModelAdapter(
132+
client=client,
133+
model_id="gpt-4",
134+
cost_calculator=calculator,
135+
)
136+
```
137+
138+
!!! tip "LiteLLMModelAdapter already reports cost"
139+
140+
If you're using the `LiteLLMModelAdapter`, it extracts provider-reported cost from `response._hidden_params.response_cost` automatically. You only need `LiteLLMCostCalculator` when you use other adapters (OpenAI, Anthropic, Google) and want automatic pricing lookup.
141+
142+
#### Custom Pricing Overrides
143+
144+
Override pricing for specific models while using LiteLLM's database for the rest:
145+
146+
```python
147+
calculator = LiteLLMCostCalculator(custom_pricing={
148+
"my-finetuned-gpt4": {
149+
"input_cost_per_token": 0.00006,
150+
"output_cost_per_token": 0.00012,
151+
},
152+
})
153+
```
154+
155+
#### Model ID Remapping
156+
157+
When your adapter's `model_id` doesn't match LiteLLM's naming convention (e.g., using Google's OpenAI-compatible endpoint), use `model_id_map` to remap:
158+
159+
```python
160+
calculator = LiteLLMCostCalculator(model_id_map={
161+
"gemini-2.0-flash": "gemini/gemini-2.0-flash",
162+
"my-custom-gpt4": "gpt-4",
163+
})
164+
```
165+
166+
The map is applied before both custom pricing and LiteLLM lookup.
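The lookup order can be sketched as follows. This is a hedged illustration only: `lookup_pricing` and `PRICING_DB` are hypothetical, the database here is a plain dict rather than LiteLLM's real pricing data, and the precedence of custom pricing over the database is inferred from the word "override" above.

```python
from typing import Dict, Optional

# Hypothetical stand-in for LiteLLM's pricing database (illustrative values).
PRICING_DB: Dict[str, Dict[str, float]] = {
    "gpt-4": {"input_cost_per_token": 0.00003, "output_cost_per_token": 0.00006},
}


def lookup_pricing(
    model_id: str,
    model_id_map: Dict[str, str],
    custom_pricing: Dict[str, Dict[str, float]],
) -> Optional[Dict[str, float]]:
    # Step 1: remap the adapter's model ID first.
    resolved = model_id_map.get(model_id, model_id)
    # Step 2: custom pricing is consulted under the remapped ID.
    if resolved in custom_pricing:
        return custom_pricing[resolved]
    # Step 3: fall back to the database; None means no pricing is known.
    return PRICING_DB.get(resolved)
```

One consequence of this ordering is that keys in `custom_pricing` refer to the remapped ID, not the adapter's original `model_id`.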
167+
168+
### Custom Cost Calculator
169+
170+
Implement the `CostCalculator` protocol for custom pricing logic:
171+
172+
```python
173+
from maseval import CostCalculator, TokenUsage
174+
from typing import Optional
175+
176+
class MyCostCalculator:
177+
def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]:
178+
rate = MY_PRICING_TABLE.get(model_id)
179+
if rate is None:
180+
return None
181+
return rate["input"] * usage.input_tokens + rate["output"] * usage.output_tokens
182+
```
183+
184+
The protocol requires a single method: `calculate_cost(usage, model_id) -> Optional[float]`. Return `None` if you don't have pricing for the given model.
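To illustrate what satisfying a protocol means in practice, here is a self-contained sketch using `typing.Protocol`. The `TokenUsage` and `CostCalculator` definitions below are simplified stand-ins written for this example, and marking the protocol `@runtime_checkable` is an assumption made so the `isinstance` check works; the library's own definitions may differ.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, runtime_checkable


@dataclass
class TokenUsage:
    """Hypothetical stand-in for maseval's TokenUsage."""
    input_tokens: int = 0
    output_tokens: int = 0


@runtime_checkable
class CostCalculator(Protocol):
    """Stand-in protocol with the single method this guide describes."""

    def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: ...


class TableCalculator:
    """Matches the protocol structurally; it does not inherit from it."""

    def __init__(self, table):
        self.table = table

    def calculate_cost(self, usage, model_id):
        rate = self.table.get(model_id)
        if rate is None:
            return None  # unknown model: report no cost rather than guessing
        return rate["input"] * usage.input_tokens + rate["output"] * usage.output_tokens


calc = TableCalculator({"llama-3-70b": {"input": 0.5, "output": 1.0}})
print(isinstance(calc, CostCalculator))
print(calc.calculate_cost(TokenUsage(input_tokens=10, output_tokens=4), "llama-3-70b"))
```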
185+
186+
### Sharing Calculators Across Adapters
187+
188+
A single calculator instance can be shared across multiple model adapters. The `model_id` is passed on each call, so the calculator can look up the right pricing:
189+
190+
```python
191+
calculator = StaticPricingCalculator({
192+
"gpt-4": {"input": 0.00003, "output": 0.00006},
193+
"claude-sonnet-4-5": {"input": 0.000003, "output": 0.000015},
194+
})
195+
196+
model_a = OpenAIModelAdapter(client=openai_client, model_id="gpt-4", cost_calculator=calculator)
197+
model_b = AnthropicModelAdapter(client=anthropic_client, model_id="claude-sonnet-4-5", cost_calculator=calculator)
198+
```
199+
200+
## Non-LLM Usage Tracking
201+
202+
Tools, environments, and other components can track usage by inheriting from `UsageTrackableMixin` and overriding `gather_usage()`:
203+
204+
```python
205+
from maseval import Usage, UsageTrackableMixin
206+
# Assumes Environment and bloomberg_client are imported or defined elsewhere.
207+
208+
class BloombergEnvironment(Environment, UsageTrackableMixin):
209+
def __init__(self, task_data):
210+
super().__init__(task_data)
211+
self._usage_records = []
212+
213+
def _call_bloomberg(self, query):
214+
result = bloomberg_client.query(query)
215+
self._usage_records.append(Usage(
216+
cost=result.billed_amount,
217+
units={"api_calls": 1, "data_points": result.count},
218+
provider="bloomberg",
219+
kind="service",
220+
))
221+
return result
222+
223+
def gather_usage(self) -> Usage:
224+
if not self._usage_records:
225+
return Usage()
226+
return sum(self._usage_records, Usage())
227+
```
228+
229+
Non-LLM components set cost directly in their `Usage` records — there is no calculator involvement. Each component knows its own billing model.
230+
231+
## Post-hoc Analysis with UsageReporter
232+
233+
`UsageReporter` provides sliced analysis across all benchmark reports:
234+
235+
```python
236+
from maseval import UsageReporter
237+
238+
reporter = UsageReporter.from_reports(benchmark.reports)
239+
240+
# Grand total
241+
total = reporter.total()
242+
print(f"Total cost: ${total.cost:.4f}")
243+
print(f"Total tokens: {total.input_tokens + total.output_tokens}")
244+
245+
# Per-task breakdown
246+
for task_id, usage in reporter.by_task().items():
247+
print(f" {task_id}: ${usage.cost:.4f}")
248+
249+
# Per-component breakdown
250+
for component, usage in reporter.by_component().items():
251+
print(f" {component}: ${usage.cost:.4f}")
252+
253+
# Full nested summary dict
254+
summary = reporter.summary()
255+
```
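Since `cost` is `None` whenever it is unknown, formatting it with `:.4f` directly would raise a `TypeError` in that case. A small guard avoids this; `format_cost` is a hypothetical helper for this sketch, not part of MASEval.

```python
from typing import Optional


def format_cost(cost: Optional[float], symbol: str = "$") -> str:
    """Render a cost for printing; None means the cost is unknown."""
    if cost is None:
        return "n/a"
    return f"{symbol}{cost:.4f}"


print(format_cost(0.00234))
print(format_cost(None))
```

The same helper can wrap `usage.cost` in the per-task and per-component loops above, so that one component without pricing does not abort the whole report.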
256+
257+
## Usage Data Model
258+
259+
### Usage
260+
261+
Generic record for any billable resource:
262+
263+
| Field | Type | Description |
264+
|-------|------|-------------|
265+
| `cost` | `Optional[float]` | Cost in USD (or custom unit). `None` = unknown. |
266+
| `units` | `Dict[str, int\|float]` | Arbitrary countable units (e.g., `{"api_calls": 3}`). |
267+
| `provider` | `Optional[str]` | Provider identifier (e.g., `"anthropic"`). |
268+
| `category` | `Optional[str]` | Registry category (e.g., `"models"`, `"tools"`). |
269+
| `component_name` | `Optional[str]` | Component name (e.g., `"main_model"`). |
270+
| `kind` | `Optional[str]` | Component kind (e.g., `"llm"`, `"service"`). |
271+
272+
`Usage` supports addition. Costs are summed when both are known and become `None` when either is unknown; units are summed per key; grouping fields are kept when they match and set to `None` when they differ.
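The addition rules can be made concrete with a reduced stand-in class. This is a sketch of the described semantics only, not the library's code: the `Usage` dataclass below is hypothetical and carries just a subset of the fields from the table.

```python
from dataclasses import dataclass, field
from functools import reduce
from typing import Dict, Optional


def _merge(a, b):
    """Keep a grouping field when both sides agree, else None."""
    return a if a == b else None


@dataclass
class Usage:
    """Hypothetical stand-in with a subset of the documented fields."""
    cost: Optional[float] = None
    units: Dict[str, float] = field(default_factory=dict)
    provider: Optional[str] = None
    kind: Optional[str] = None

    def __add__(self, other: "Usage") -> "Usage":
        # Cost is known only if both operands have a known cost.
        if self.cost is None or other.cost is None:
            cost = None
        else:
            cost = self.cost + other.cost
        # Units are summed key by key.
        units = dict(self.units)
        for key, value in other.units.items():
            units[key] = units.get(key, 0) + value
        return Usage(cost, units, _merge(self.provider, other.provider),
                     _merge(self.kind, other.kind))


records = [
    Usage(cost=0.25, units={"api_calls": 1}, provider="bloomberg", kind="service"),
    Usage(cost=0.5, units={"api_calls": 2, "data_points": 7}, provider="bloomberg", kind="service"),
]
total = reduce(lambda a, b: a + b, records)
print(total.cost, total.units, total.provider)
```

The sketch folds with `reduce` and no start value because, under these rules, adding an empty record with unknown cost would make the total unknown as well. How the library's own default `Usage()` behaves as a start value is not covered here.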
273+
274+
### TokenUsage
275+
276+
Extends `Usage` with LLM-specific token counts:
277+
278+
| Field | Type | Description |
279+
|-------|------|-------------|
280+
| `input_tokens` | `int` | Input/prompt tokens. |
281+
| `output_tokens` | `int` | Output/completion tokens. |
282+
| `total_tokens` | `int` | Total tokens. |
283+
| `cached_input_tokens` | `int` | Tokens served from cache. |
284+
| `reasoning_tokens` | `int` | Reasoning/thinking tokens. |
285+
| `audio_tokens` | `int` | Audio processing tokens. |
286+
287+
## Evaluator Usage
288+
289+
Evaluators that use LLM calls (LLM-as-judge) hold a `ModelAdapter`. Register the evaluator's model in the benchmark and its usage is collected automatically:
290+
291+
```python
292+
class MyBenchmark(Benchmark):
293+
def setup_evaluators(self, task, environment):
294+
judge_model = OpenAIModelAdapter(client=client, model_id="gpt-4")
295+
self.register("evaluator_models", "judge", judge_model)
296+
return [MyLLMEvaluator(judge_model)]
297+
```
298+
299+
The judge model's usage appears under `usage["evaluator_models"]["judge"]` in the report, separate from the agent's model usage.
300+
301+
## Tips
302+
303+
**For cost tracking**: Use `LiteLLMCostCalculator` for automatic pricing, or `StaticPricingCalculator` for custom rates.
304+
305+
**For custom hosts**: Use `model_id_map` in `LiteLLMCostCalculator` when your adapter's model ID doesn't match LiteLLM's naming.
306+
307+
**For failed tasks**: Usage is collected before error status is determined, so partial usage from failed tasks is still tracked.
308+
309+
**For live monitoring**: Access `benchmark.usage` during execution to check running totals.

docs/reference/usage.md

Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,31 @@
1+
# Usage & Cost Tracking
2+
3+
Usage and cost tracking provides data classes for recording resource consumption, a mixin for automatic collection, and pluggable cost calculators.
4+
5+
See the [Usage & Cost Tracking guide](../guides/usage-tracking.md) for usage patterns and examples.
6+
7+
## Core
8+
9+
[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/usage.py){ .md-source-file }
10+
11+
::: maseval.core.usage.Usage
12+
13+
::: maseval.core.usage.TokenUsage
14+
15+
::: maseval.core.usage.UsageTrackableMixin
16+
17+
::: maseval.core.usage.CostCalculator
18+
19+
::: maseval.core.usage.StaticPricingCalculator
20+
21+
## Reporting
22+
23+
[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/core/reporting.py){ .md-source-file }
24+
25+
::: maseval.core.reporting.UsageReporter
26+
27+
## Interface
28+
29+
[:material-github: View source](https://github.com/parameterlab/maseval/blob/main/maseval/interface/usage.py){ .md-source-file }
30+
31+
::: maseval.interface.usage.LiteLLMCostCalculator

maseval/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@
3838
from .core.history import MessageHistory, ToolInvocationHistory
3939
from .core.tracing import TraceableMixin
4040
from .core.usage import Usage, TokenUsage, UsageTrackableMixin
41-
from .core.cost import CostCalculator, StaticPricingCalculator
41+
from .core.usage import CostCalculator, StaticPricingCalculator
4242
from .core.reporting import UsageReporter
4343
from .core.registry import ComponentRegistry
4444
from .core.context import TaskContext
