| 1 | +# Usage & Cost Tracking |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +MASEval provides first-class usage and cost tracking to monitor resource consumption during benchmark execution. This is useful for: |
| 6 | + |
| 7 | +- **Cost control**: Track how much each benchmark run costs across providers |
| 8 | +- **Budgeting**: Compare cost across models, tasks, and components |
| 9 | +- **Billing**: Support custom credit systems (university clusters, internal APIs) |
| 10 | +- **Analysis**: Understand token usage patterns per task, agent, or model |
| 11 | + |
| 12 | +!!! info "Usage vs Cost" |
| 13 | + |
| 14 | + **Usage** = Token counts and arbitrary resource units (API calls, data points, etc.) |
| 15 | + |
| 16 | + **Cost** = Monetary value computed from usage (USD, EUR, credits, etc.) |
| 17 | + |
| 18 | + Usage is always tracked automatically for LLM calls. Cost requires either a provider that reports it (e.g., LiteLLM) or a pluggable cost calculator. |
| 19 | + |
| 20 | +## Core Concepts |
| 21 | + |
| 22 | +**`Usage`**: Generic usage record for any billable resource — cost, arbitrary units, and grouping metadata. |
| 23 | + |
| 24 | +**`TokenUsage`**: LLM-specific extension of `Usage` with token fields (`input_tokens`, `output_tokens`, `cached_input_tokens`, etc.). |
| 25 | + |
| 26 | +**`UsageTrackableMixin`**: Mixin that enables automatic usage collection for any component via `gather_usage()`. |
| 27 | + |
| 28 | +**`CostCalculator`**: Protocol for pluggable cost computation from token counts. |
| 29 | + |
| 30 | +## Automatic LLM Usage Tracking |
| 31 | + |
| 32 | +All `ModelAdapter` subclasses track token usage automatically. No configuration needed — every `chat()` call records a `TokenUsage` entry internally. |
| 33 | + |
| 34 | +```python |
| 35 | +from maseval.interface.inference import OpenAIModelAdapter |
| 36 | +# `client` is assumed to be an already-configured OpenAI client |
| 37 | +model = OpenAIModelAdapter(client=client, model_id="gpt-4") |
| 38 | + |
| 39 | +# Make some calls |
| 40 | +model.chat([{"role": "user", "content": "Hello"}]) |
| 41 | +model.chat([{"role": "user", "content": "How are you?"}]) |
| 42 | + |
| 43 | +# Inspect accumulated usage |
| 44 | +usage = model.gather_usage() |
| 45 | +print(usage.input_tokens) # e.g., 25 |
| 46 | +print(usage.output_tokens) # e.g., 42 |
| 47 | +print(usage.cost) # None (no cost calculator configured) |
| 48 | +``` |
| 49 | + |
| 50 | +### In Benchmarks |
| 51 | + |
| 52 | +Usage is collected automatically alongside traces and configs after each task repetition. Each report includes a `"usage"` key: |
| 53 | + |
| 54 | +```python |
| 55 | +results = benchmark.run() |
| 56 | + |
| 57 | +for report in results: |
| 58 | + print(f"Task {report['task_id']}: {report['usage']}") |
| 59 | +``` |
| 60 | + |
| 61 | +Live running totals are available during execution: |
| 62 | + |
| 63 | +```python |
| 64 | +benchmark.usage # -> Usage (grand total across all tasks) |
| 65 | +benchmark.usage_by_component # -> Dict[str, Usage] (per-component totals) |
| 66 | +``` |
| 67 | + |
| 68 | +## Cost Calculation |
| 69 | + |
| 70 | +Most LLM APIs return token counts but not cost. Cost is a client-side concern. MASEval provides two built-in cost calculators and a protocol for custom ones. |
| 71 | + |
| 72 | +### Cost Priority |
| 73 | + |
| 74 | +When a `ModelAdapter` records usage after a `chat()` call, cost is resolved in this order: |
| 75 | + |
| 76 | +1. **Provider-reported cost** — e.g., LiteLLM sets `response._hidden_params.response_cost` directly. This always wins. |
| 77 | +2. **CostCalculator** — if no provider cost, the adapter calls `calculator.calculate_cost(token_usage, model_id)`. |
| 78 | +3. **None** — if neither source provides cost, `Usage.cost` stays `None`. |
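
The priority order above can be sketched in a few lines. This is an illustration of the documented behavior, not MASEval's actual adapter code: `resolve_cost`, `CostCalculatorLike`, and `FlatRate` are hypothetical names, and usage is modeled as a plain dict rather than a `TokenUsage`.

```python
from typing import Optional, Protocol


class CostCalculatorLike(Protocol):
    """Hypothetical stand-in for the CostCalculator protocol."""

    def calculate_cost(self, usage: dict, model_id: str) -> Optional[float]: ...


def resolve_cost(
    provider_cost: Optional[float],
    calculator: Optional[CostCalculatorLike],
    usage: dict,
    model_id: str,
) -> Optional[float]:
    """Resolve cost following the documented priority order."""
    # 1. A provider-reported cost always wins.
    if provider_cost is not None:
        return provider_cost
    # 2. Otherwise ask the calculator, if one is configured.
    if calculator is not None:
        return calculator.calculate_cost(usage, model_id)
    # 3. Neither source is available, so the cost stays unknown.
    return None


class FlatRate:
    """Toy calculator (assumption): charges 0.001 per input token."""

    def calculate_cost(self, usage: dict, model_id: str) -> Optional[float]:
        return 0.001 * usage["input_tokens"]
```

With this sketch, `resolve_cost(0.5, FlatRate(), usage, "m")` returns the provider value unchanged, and the calculator is consulted only when the provider cost is `None`.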
| 79 | + |
| 80 | +### StaticPricingCalculator |
| 81 | + |
| 82 | +Zero-dependency calculator using user-supplied per-token rates. Lives in `maseval.core.usage`. |
| 83 | + |
| 84 | +```python |
| 85 | +from maseval import StaticPricingCalculator |
| 86 | +from maseval.interface.inference import OpenAIModelAdapter |
| 87 | +calculator = StaticPricingCalculator({ |
| 88 | + "gpt-4": {"input": 0.00003, "output": 0.00006}, |
| 89 | + "claude-sonnet-4-5": {"input": 0.000003, "output": 0.000015}, |
| 90 | +}) |
| 91 | + |
| 92 | +model = OpenAIModelAdapter( |
| 93 | + client=client, |
| 94 | + model_id="gpt-4", |
| 95 | + cost_calculator=calculator, |
| 96 | +) |
| 97 | + |
| 98 | +response = model.chat([{"role": "user", "content": "Hello"}]) |
| 99 | +print(model.gather_usage().cost) # e.g., 0.00234 |
| 100 | +``` |
| 101 | + |
| 102 | +Pricing is per token (not per 1K or 1M). Cached input tokens are handled automatically — set a `"cached_input"` rate to differentiate: |
| 103 | + |
| 104 | +```python |
| 105 | +calculator = StaticPricingCalculator({ |
| 106 | + "claude-sonnet-4-5": { |
| 107 | + "input": 0.000003, |
| 108 | + "output": 0.000015, |
| 109 | + "cached_input": 0.0000003, # 10x cheaper for cached tokens |
| 110 | + }, |
| 111 | +}) |
| 112 | +``` |
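
Because rates are per token while providers usually quote prices per 1M tokens, a small conversion helper can reduce the risk of misplaced zeros. The sketch below is one possible approach: `per_token` is a hypothetical helper, not part of MASEval, and the prices are the illustrative values used above.

```python
def per_token(rate_per_million: float) -> float:
    """Convert a price quoted per 1M tokens into a per-token rate."""
    return rate_per_million / 1_000_000


# Illustrative prices per 1M tokens (assumed values, matching the example above).
pricing = {
    "claude-sonnet-4-5": {
        "input": per_token(3.00),
        "output": per_token(15.00),
        "cached_input": per_token(0.30),
    },
}
```

The resulting `pricing` dict has the same shape as the one passed to `StaticPricingCalculator` above.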
| 113 | + |
| 114 | +For custom unit systems (university credits, EUR, etc.), the "cost" unit is whatever your pricing represents: |
| 115 | + |
| 116 | +```python |
| 117 | +calculator = StaticPricingCalculator({ |
| 118 | + "llama-3-70b": {"input": 0.5, "output": 1.0}, # credits per token |
| 119 | +}) |
| 120 | +``` |
| 121 | + |
| 122 | +### LiteLLMCostCalculator |
| 123 | + |
| 124 | +Uses LiteLLM's bundled [model pricing database](https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json) for automatic cost calculation. Covers OpenAI, Anthropic, Google, Mistral, Cohere, and many more. |
| 125 | + |
| 126 | +```python |
| 127 | +from maseval.interface.usage import LiteLLMCostCalculator |
| 128 | +from maseval.interface.inference import OpenAIModelAdapter |
| 129 | +calculator = LiteLLMCostCalculator() |
| 130 | + |
| 131 | +model = OpenAIModelAdapter( |
| 132 | + client=client, |
| 133 | + model_id="gpt-4", |
| 134 | + cost_calculator=calculator, |
| 135 | +) |
| 136 | +``` |
| 137 | + |
| 138 | +!!! tip "LiteLLMModelAdapter already reports cost" |
| 139 | + |
| 140 | + If you're using the `LiteLLMModelAdapter`, it extracts provider-reported cost from `response._hidden_params.response_cost` automatically. You only need `LiteLLMCostCalculator` when using other adapters (OpenAI, Anthropic, Google) and want automatic pricing lookup. |
| 141 | + |
| 142 | +#### Custom Pricing Overrides |
| 143 | + |
| 144 | +Override pricing for specific models while using LiteLLM's database for the rest: |
| 145 | + |
| 146 | +```python |
| 147 | +calculator = LiteLLMCostCalculator(custom_pricing={ |
| 148 | + "my-finetuned-gpt4": { |
| 149 | + "input_cost_per_token": 0.00006, |
| 150 | + "output_cost_per_token": 0.00012, |
| 151 | + }, |
| 152 | +}) |
| 153 | +``` |
| 154 | + |
| 155 | +#### Model ID Remapping |
| 156 | + |
| 157 | +When your adapter's `model_id` doesn't match LiteLLM's naming convention (e.g., using Google's OpenAI-compatible endpoint), use `model_id_map` to remap: |
| 158 | + |
| 159 | +```python |
| 160 | +calculator = LiteLLMCostCalculator(model_id_map={ |
| 161 | + "gemini-2.0-flash": "gemini/gemini-2.0-flash", |
| 162 | + "my-custom-gpt4": "gpt-4", |
| 163 | +}) |
| 164 | +``` |
| 165 | + |
| 166 | +The map is applied before both custom pricing and LiteLLM lookup. |
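
The lookup order can be pictured roughly as follows. This is a sketch under assumptions, not the real `LiteLLMCostCalculator` internals: `lookup_pricing` is a hypothetical function and `database` stands in for LiteLLM's pricing table as a plain dict.

```python
from typing import Dict, Optional

Pricing = Dict[str, float]


def lookup_pricing(
    model_id: str,
    model_id_map: Dict[str, str],
    custom_pricing: Dict[str, Pricing],
    database: Dict[str, Pricing],
) -> Optional[Pricing]:
    """Resolve pricing: remap the ID first, then custom pricing, then the database."""
    # Step 1: remap the adapter's model ID (unmapped IDs pass through unchanged).
    resolved = model_id_map.get(model_id, model_id)
    # Step 2: custom overrides take precedence over the bundled database.
    if resolved in custom_pricing:
        return custom_pricing[resolved]
    # Step 3: fall back to the database; None means no pricing is known.
    return database.get(resolved)
```

Under this reading, keys in `custom_pricing` would need to match the remapped ID rather than the adapter's original `model_id`.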
| 167 | + |
| 168 | +### Custom Cost Calculator |
| 169 | + |
| 170 | +Implement the `CostCalculator` protocol for custom pricing logic: |
| 171 | + |
| 172 | +```python |
| 173 | +from maseval import CostCalculator, TokenUsage |
| 174 | +from typing import Optional |
| 175 | +MY_PRICING_TABLE = {"my-model": {"input": 1e-6, "output": 2e-6}}  # hypothetical per-token rates |
| 176 | +class MyCostCalculator: |
| 177 | + def calculate_cost(self, usage: TokenUsage, model_id: str) -> Optional[float]: |
| 178 | + rate = MY_PRICING_TABLE.get(model_id) |
| 179 | + if rate is None: |
| 180 | + return None |
| 181 | + return rate["input"] * usage.input_tokens + rate["output"] * usage.output_tokens |
| 182 | +``` |
| 183 | + |
| 184 | +The protocol requires a single method: `calculate_cost(usage, model_id) -> Optional[float]`. Return `None` if you don't have pricing for the given model. |
| 185 | + |
| 186 | +### Sharing Calculators Across Adapters |
| 187 | + |
| 188 | +A single calculator instance can be shared across multiple model adapters. The `model_id` is passed on each call, so the calculator can look up the right pricing: |
| 189 | + |
| 190 | +```python |
| 191 | +calculator = StaticPricingCalculator({ |
| 192 | + "gpt-4": {"input": 0.00003, "output": 0.00006}, |
| 193 | + "claude-sonnet-4-5": {"input": 0.000003, "output": 0.000015}, |
| 194 | +}) |
| 195 | +# Each adapter needs a client for its own provider (both assumed to be configured already) |
| 196 | +model_a = OpenAIModelAdapter(client=openai_client, model_id="gpt-4", cost_calculator=calculator) |
| 197 | +model_b = AnthropicModelAdapter(client=anthropic_client, model_id="claude-sonnet-4-5", cost_calculator=calculator) |
| 198 | +``` |
| 199 | + |
| 200 | +## Non-LLM Usage Tracking |
| 201 | + |
| 202 | +Tools, environments, and other components can track usage by inheriting `UsageTrackableMixin` and overriding `gather_usage()`: |
| 203 | + |
| 204 | +```python |
| 205 | +from maseval import Usage, UsageTrackableMixin |
| 206 | +# Assumes `Environment` (MASEval's environment base class) and a `bloomberg_client` are in scope |
| 207 | + |
| 208 | +class BloombergEnvironment(Environment, UsageTrackableMixin): |
| 209 | + def __init__(self, task_data): |
| 210 | + super().__init__(task_data) |
| 211 | + self._usage_records = [] |
| 212 | + |
| 213 | + def _call_bloomberg(self, query): |
| 214 | + result = bloomberg_client.query(query) |
| 215 | + self._usage_records.append(Usage( |
| 216 | + cost=result.billed_amount, |
| 217 | + units={"api_calls": 1, "data_points": result.count}, |
| 218 | + provider="bloomberg", |
| 219 | + kind="service", |
| 220 | + )) |
| 221 | + return result |
| 222 | + |
| 223 | + def gather_usage(self) -> Usage: |
| 224 | + if not self._usage_records: |
| 225 | + return Usage() |
| 226 | + return sum(self._usage_records, Usage()) |
| 227 | +``` |
| 228 | + |
| 229 | +Non-LLM components set cost directly in their `Usage` records — there is no calculator involvement. Each component knows its own billing model. |
| 230 | + |
| 231 | +## Post-hoc Analysis with UsageReporter |
| 232 | + |
| 233 | +`UsageReporter` provides sliced analysis across all benchmark reports: |
| 234 | + |
| 235 | +```python |
| 236 | +from maseval import UsageReporter |
| 237 | + |
| 238 | +reporter = UsageReporter.from_reports(benchmark.reports) |
| 239 | + |
| 240 | +# Grand total (cost is None if any record's cost is unknown; the :.4f formats below assume it is set) |
| 241 | +total = reporter.total() |
| 242 | +print(f"Total cost: ${total.cost:.4f}") |
| 243 | +print(f"Total tokens: {total.input_tokens + total.output_tokens}") |
| 244 | + |
| 245 | +# Per-task breakdown |
| 246 | +for task_id, usage in reporter.by_task().items(): |
| 247 | + print(f" {task_id}: ${usage.cost:.4f}") |
| 248 | + |
| 249 | +# Per-component breakdown |
| 250 | +for component, usage in reporter.by_component().items(): |
| 251 | + print(f" {component}: ${usage.cost:.4f}") |
| 252 | + |
| 253 | +# Full nested summary dict |
| 254 | +summary = reporter.summary() |
| 255 | +``` |
| 256 | + |
| 257 | +## Usage Data Model |
| 258 | + |
| 259 | +### Usage |
| 260 | + |
| 261 | +Generic record for any billable resource: |
| 262 | + |
| 263 | +| Field | Type | Description | |
| 264 | +|-------|------|-------------| |
| 265 | +| `cost` | `Optional[float]` | Cost in USD (or custom unit). `None` = unknown. | |
| 266 | +| `units` | `Dict[str, int\|float]` | Arbitrary countable units (e.g., `{"api_calls": 3}`). | |
| 267 | +| `provider` | `Optional[str]` | Provider identifier (e.g., `"anthropic"`). | |
| 268 | +| `category` | `Optional[str]` | Registry category (e.g., `"models"`, `"tools"`). | |
| 269 | +| `component_name` | `Optional[str]` | Component name (e.g., `"main_model"`). | |
| 270 | +| `kind` | `Optional[str]` | Component kind (e.g., `"llm"`, `"service"`). | |
| 271 | + |
| 272 | +`Usage` supports addition. Costs are summed when both are known and become `None` when either is unknown. Units are summed key by key. Grouping fields are kept when both operands agree and set to `None` when they differ. |
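
The addition rules can be illustrated with a reduced stand-in class. `UsageSketch` is hypothetical and carries only three of the fields above. It mirrors the described semantics rather than the real implementation, and it does not model how an empty `Usage()` behaves as the start value of `sum()`.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class UsageSketch:
    """Hypothetical, reduced stand-in for Usage."""

    cost: Optional[float] = None
    units: Dict[str, float] = field(default_factory=dict)
    provider: Optional[str] = None

    def __add__(self, other: "UsageSketch") -> "UsageSketch":
        # Cost: sum only when both sides are known, otherwise unknown.
        if self.cost is None or other.cost is None:
            cost = None
        else:
            cost = self.cost + other.cost
        # Units: sum per key, treating a missing key as zero.
        units = dict(self.units)
        for key, value in other.units.items():
            units[key] = units.get(key, 0) + value
        # Grouping field: keep on match, clear on mismatch.
        provider = self.provider if self.provider == other.provider else None
        return UsageSketch(cost=cost, units=units, provider=provider)


a = UsageSketch(cost=1.0, units={"api_calls": 1}, provider="bloomberg")
b = UsageSketch(cost=0.5, units={"api_calls": 2, "data_points": 10}, provider="bloomberg")
combined = a + b  # cost 1.5, api_calls 3, data_points 10, provider kept
```

Adding a record with an unknown cost or a different provider to `a` would yield `cost=None` and `provider=None` respectively.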
| 273 | + |
| 274 | +### TokenUsage |
| 275 | + |
| 276 | +Extends `Usage` with LLM-specific token counts: |
| 277 | + |
| 278 | +| Field | Type | Description | |
| 279 | +|-------|------|-------------| |
| 280 | +| `input_tokens` | `int` | Input/prompt tokens. | |
| 281 | +| `output_tokens` | `int` | Output/completion tokens. | |
| 282 | +| `total_tokens` | `int` | Total tokens. | |
| 283 | +| `cached_input_tokens` | `int` | Tokens served from cache. | |
| 284 | +| `reasoning_tokens` | `int` | Reasoning/thinking tokens. | |
| 285 | +| `audio_tokens` | `int` | Audio processing tokens. | |
| 286 | + |
| 287 | +## Evaluator Usage |
| 288 | + |
| 289 | +Evaluators that use LLM calls (LLM-as-judge) hold a `ModelAdapter`. Register the evaluator's model in the benchmark and its usage is collected automatically: |
| 290 | + |
| 291 | +```python |
| 292 | +class MyBenchmark(Benchmark): |
| 293 | + def setup_evaluators(self, task, environment): |
| 294 | + judge_model = OpenAIModelAdapter(client=client, model_id="gpt-4") |
| 295 | + self.register("evaluator_models", "judge", judge_model) |
| 296 | + return [MyLLMEvaluator(judge_model)] |
| 297 | +``` |
| 298 | + |
| 299 | +The judge model's usage appears under `usage["evaluator_models"]["judge"]` in the report, separate from the agent's model usage. |
| 300 | + |
| 301 | +## Tips |
| 302 | + |
| 303 | +**For cost tracking**: Use `LiteLLMCostCalculator` for automatic pricing, or `StaticPricingCalculator` for custom rates. |
| 304 | + |
| 305 | +**For custom hosts**: Use `model_id_map` in `LiteLLMCostCalculator` when your adapter's model ID doesn't match LiteLLM's naming. |
| 306 | + |
| 307 | +**For failed tasks**: Usage is collected before error status is determined, so partial usage from failed tasks is still tracked. |
| 308 | + |
| 309 | +**For live monitoring**: Access `benchmark.usage` during execution to check running totals. |