Commit 8dc11ce

API Usage Tracking (#45)
- Add Usage, TokenUsage data classes and UsageTrackableMixin for recording resource consumption
- Track token usage automatically in ModelAdapter after each chat() call
- Add CostCalculator protocol with StaticPricingCalculator and LiteLLMCostCalculator implementations
- Support cost_calculator and model_id on AgentAdapter with auto-detection for smolagents, CAMEL, and LlamaIndex
- Expose live totals via benchmark.usage and benchmark.usage_by_component
- Add UsageReporter for post-hoc analysis by task, component, or model
- Add usage tracking guide and API reference docs
- Update 5-A-Day benchmark example with usage tracking
1 parent acffbbb commit 8dc11ce

36 files changed

Lines changed: 4732 additions & 173 deletions

AGENTS.md

Lines changed: 12 additions & 14 deletions
@@ -265,12 +265,11 @@ mkdocs serve
 
 1. Create a feature branch (never commit to `main`)
 2. Make changes following code style guidelines
-3. Run formatters and linters: `ruff format . && ruff check . --fix`
-4. Run tests: `pytest -v`
-5. Update documentation if needed
-6. Open PR against `main` branch
-7. Request review from `cemde`
-8. Ensure all CI checks pass
+3. Run `just all` before committing. This formats, lints, typechecks, and tests in one step. See the `justfile` for all available recipes.
+4. Update documentation if needed
+5. Open PR against `main` branch
+6. Request review from `cemde`
+7. Ensure all CI checks pass
 
 **CI Pipeline:** GitHub Actions runs formatting checks, linting, and test suite across Python versions and OS. All checks must pass before merge.
 
@@ -301,25 +300,24 @@ Example workflow:
 ## Common Tasks Quick Reference
 
 ```bash
-# Fresh environment setup
-uv sync --all-extras --all-groups
+# Fresh environment setup / Update after pulling changes
+just install # uv sync --all-extras --all-groups
 
-# Before committing
-uv run ruff format . && uv run ruff check . --fix && uv run pytest -v && uv run ty check
+# Before committing (format, lint, typecheck, test)
+just all
 
 # Run example
 uv run python examples/amazon_collab.py
 
-# Update after pulling changes
-uv sync --all-extras --all-groups
-
 # Add optional dependency
 uv add --optional <extra-name> <package-name>
 
 # Check specific test file
 uv run pytest tests/test_core/test_agent.py -v
 ```
 
+For more comments see `justfile`.
+
 ## Security and Confidentiality
 
 **IMPORTANT:** This project contains confidential research material.
@@ -540,4 +538,4 @@ class Evaluator:
 ...
 ```
 
-**Rule:** Only copy defaults that exist in the source. If the original doesn't provide a default, neither should you. Always document the source file and line number.
+**Rule:** Only copy defaults that exist in the source. If the original doesn't provide a default, neither should you. Always document the source file and line number.

CHANGELOG.md

Lines changed: 10 additions & 2 deletions
@@ -11,6 +11,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 **Core**
 
+- Usage and cost tracking via `Usage` and `TokenUsage` data classes. `ModelAdapter` tracks token usage automatically after each `chat()` call. Components that implement `UsageTrackableMixin` are collected via `gather_usage()`. Live totals available during benchmark runs via `benchmark.usage` (grand total) and `benchmark.usage_by_component` (per-component breakdowns). Post-hoc analysis via `UsageReporter.from_reports(benchmark.reports)` with breakdowns by task, component, or model. (PR: #45)
+- Pluggable cost calculation via `CostCalculator` protocol. `StaticPricingCalculator` computes cost from user-supplied per-token rates. `LiteLLMCostCalculator` in `maseval.interface.usage` for automatic pricing via LiteLLM's model database (supports `custom_pricing` overrides and `model_id_map`; requires `litellm`). Pass a `cost_calculator` to `ModelAdapter` or `AgentAdapter` to compute `Usage.cost`. Provider-reported cost always takes precedence. (PR: #45)
+- `AgentAdapter` now accepts `cost_calculator` and `model_id` parameters. For smolagents, CAMEL, and LlamaIndex, both are auto-detected from the framework's agent object (`LiteLLMCostCalculator` if litellm is installed). LangGraph requires explicit `model_id` since graphs can contain multiple models. Explicit parameters always override auto-detection. (PR: #45)
+
 - `Task.freeze()` and `Task.unfreeze()` methods to make task data read-only during benchmark runs, preventing accidental mutation of `environment_data`, `user_data`, `evaluation_data`, and `metadata` (including nested dicts). Attribute reassignment is also blocked while frozen. Check state with `Task.is_frozen`. (PR: #42)
 - `TaskFrozenError` exception in `maseval.core.exceptions`, raised when attempting to modify a frozen task. (PR: #42)
 
@@ -39,10 +43,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 **Examples**
 
+- Added usage tracking to the 5-A-Day benchmark: `five_a_day_benchmark.ipynb` (section 2.7) and `five_a_day_benchmark.py` (post-run usage summary with per-component and per-task breakdowns). (PR: #45)
+
 - MMLU benchmark example at `examples/mmlu_benchmark/` for evaluating HuggingFace models on MMLU with optional DISCO prediction (`--disco_model_path`, `--disco_transform_path`). Supports local data, HuggingFace dataset repos, and DISCO weights from .pkl/.npz or HF repos. (PR: #34)
 - Added a dedicated runnable CONVERSE default benchmark example at `examples/converse_benchmark/default_converse_benchmark.py` for quick start with `DefaultAgentConverseBenchmark`. (PR: #28)
 - Gaia2 benchmark example with Google GenAI and OpenAI model support (PR: #26)
 
+**Documentation**
+
+- Usage & Cost Tracking guide (`docs/guides/usage-tracking.md`) and API reference (`docs/reference/usage.md`). (PR: #45)
+
 **Core**
 
 - Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24)
@@ -108,8 +118,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - `LangGraphUser` → `LangGraphLLMUser`
 - `LlamaIndexUser` → `LlamaIndexLLMUser`
 
-**Documentation**
-
 - All benchmarks except MACS are now labeled as **Beta** in docs, BENCHMARKS.md, and benchmark index, with a warning that results have not yet been validated against original implementations. (PR: #39)
 
 **Testing**

docs/guides/index.md

Lines changed: 1 addition & 0 deletions
@@ -8,3 +8,4 @@ Guides provide an in-depth exploration of MASEval's features and best practices.
 | [Configuration Gathering](config-gathering.md) | Collect and export configuration for reproducibility |
 | [Exception Handling](exception-handling.md) | Distinguish agent errors from infrastructure failures |
 | [Seeding](seeding.md) | Enable reproducible benchmark runs with deterministic seeds |
+| [Usage & Cost Tracking](usage-tracking.md) | Track token usage and compute cost across providers |
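
The CHANGELOG entries in this commit describe two ideas that are easier to follow with a small example: cost computed from user-supplied per-token rates, with a provider-reported cost taking precedence, and totals broken down per component. The sketch below illustrates only those ideas. It does not use or reproduce the MASEval API: `TokenCounts`, `static_cost`, `resolve_cost`, the component names, and the rates are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TokenCounts:
    """Hypothetical token record; not the MASEval `TokenUsage` class."""

    input_tokens: int = 0
    output_tokens: int = 0

    def __add__(self, other: "TokenCounts") -> "TokenCounts":
        # Adding two records sums each counter, so records can be aggregated.
        return TokenCounts(
            self.input_tokens + other.input_tokens,
            self.output_tokens + other.output_tokens,
        )


def static_cost(tokens: TokenCounts, input_rate: float, output_rate: float) -> float:
    """Compute cost from user-supplied per-token rates (assumed pricing model)."""
    return tokens.input_tokens * input_rate + tokens.output_tokens * output_rate


def resolve_cost(
    tokens: TokenCounts,
    provider_cost: Optional[float],
    input_rate: float,
    output_rate: float,
) -> float:
    """Prefer a provider-reported cost; otherwise fall back to static pricing."""
    if provider_cost is not None:
        return provider_cost
    return static_cost(tokens, input_rate, output_rate)


# Made-up per-call records, grouped by a hypothetical component name.
calls = {
    "planner": [TokenCounts(1000, 200), TokenCounts(500, 100)],
    "critic": [TokenCounts(300, 50)],
}

# Per-component breakdown, then a grand total across components.
by_component = {name: sum(items, TokenCounts()) for name, items in calls.items()}
total = sum(by_component.values(), TokenCounts())

if __name__ == "__main__":
    # Illustrative rates only; real prices depend on the model and provider.
    print(by_component)
    print(total, resolve_cost(total, None, 2e-6, 8e-6))
```

Keeping the precedence rule in one function (`resolve_cost` here) means a static price table never overrides a figure the provider already reported, which matches the behaviour the CHANGELOG describes. The actual class and method names are documented in the guide and API reference added by this commit.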
