Commit d4ce113

amabito authored
feat(evaluators): add built-in budget evaluator for per-agent cost tracking (#144)
## Summary
- New built-in evaluator "budget" that tracks cumulative token/cost usage per agent, channel, user. Configurable time windows (daily/weekly/monthly).
- Addresses #130.

## Scope
- User-facing/API changes:
  - "budget" evaluator registered alongside regex/list/json/sql
  - BudgetStore protocol + InMemoryBudgetStore (dict + threading.Lock)
  - BudgetEvaluatorConfig: limits list, optional pricing table, path configs
- Internal changes:
  - evaluators/builtin/src/agent_control_evaluators/budget/ -- 4 files, ~650 LOC
  - evaluators/builtin/tests/budget/test_budget.py -- 63 tests
- Out of scope:
  - No Postgres store, no DB tables, no new dependencies

## Risk and Rollout
- Risk level: low -- new evaluator, no changes to existing code. 230 existing tests untouched.
- Rollback plan: revert PR

## Testing
- [x] Added or updated automated tests (63 tests incl. thread safety, NaN/Inf, scope injection, double-count)
- [x] Ran pytest tests/ -- 293 passed
- [x] Manually verified behavior

## Checklist
- [x] Linked issue/spec -- #130
- [ ] Updated docs/examples -- follow-up: config example in docs/evaluators/
- [x] Included follow-up tasks -- Postgres BudgetStore (separate package)

---------

Co-authored-by: amabito <amabito@local>
1 parent 03f402e commit d4ce113

11 files changed

Lines changed: 3133 additions & 0 deletions

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@

# agent-control-evaluator-budget

Budget evaluator for agent-control that tracks cumulative LLM token and cost usage per scope and time window.

## Install

```bash
pip install agent-control-evaluator-budget
```

## Quickstart

```python
from agent_control_evaluator_budget.budget import (
    BudgetEvaluatorConfig,
    BudgetLimitRule,
    ModelPricing,
)

config = BudgetEvaluatorConfig(
    budget_id="support-daily",
    limits=[
        BudgetLimitRule(
            scope={"agent": "support"},
            group_by="user_id",
            window_seconds=86_400,
            limit=500,
            limit_unit="usd_cents",
        ),
        BudgetLimitRule(
            scope={"agent": "support"},
            group_by="user_id",
            window_seconds=86_400,
            limit=50_000,
            limit_unit="tokens",
        ),
    ],
    pricing={
        "gpt-4.1-mini": ModelPricing(input_per_1k=0.04, output_per_1k=0.16),
    },
    model_path="model",
    metadata_paths={
        "agent": "metadata.agent",
        "user_id": "metadata.user_id",
    },
    unknown_model_behavior="block",
)
```

The evaluator reads token usage from standard fields such as `usage.input_tokens` and `usage.output_tokens`. Configure `token_path` only when your event shape uses a custom location.

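To make the dot-notation paths concrete, here is a minimal sketch of how a path such as `usage.input_tokens` or `metadata.user_id` could be resolved against a nested event. The helper `resolve_path` and the event shape are assumptions made for illustration; they are not taken from the evaluator's source.

```python
from typing import Any


def resolve_path(data: dict[str, Any], path: str) -> Any:
    """Walk a dot-notation path through nested dicts; return None if a key is missing.

    Hypothetical helper, not part of the package.
    """
    current: Any = data
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Assumed event shape, chosen to match the paths used in the quickstart config.
event = {
    "model": "gpt-4.1-mini",
    "usage": {"input_tokens": 1200, "output_tokens": 300},
    "metadata": {"agent": "support", "user_id": "u-42"},
}

input_tokens = resolve_path(event, "usage.input_tokens")
output_tokens = resolve_path(event, "usage.output_tokens")
user_id = resolve_path(event, "metadata.user_id")
missing = resolve_path(event, "usage.cached_tokens")  # key absent, so None
```

Returning `None` for a missing key is a choice made for this sketch; the real evaluator may handle absent fields differently.
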
## Scope and group_by

Each `BudgetLimitRule` has a static `scope` and an optional `group_by` field.

`scope` filters which events a rule applies to. A rule with `scope={"agent": "support"}` only applies when extracted metadata contains `agent="support"`. An empty scope is global.

`group_by` creates independent buckets per extracted metadata value. The common per-user pattern is:

```python
BudgetLimitRule(
    scope={"agent": "support"},
    group_by="user_id",
    window_seconds=86_400,
    limit=500,
    limit_unit="usd_cents",
)
```

With `metadata_paths={"user_id": "metadata.user_id"}`, each user gets a separate daily budget inside the support scope.

## Budget pools

`budget_id` identifies the accumulated budget pool.

Evaluators with the same `budget_id` share accumulated spend and token totals across all evaluator instances. Each evaluator still evaluates using its own configured rules -- the shared state is the bucket (the rolling sum), not the rule set. Evaluators with different `budget_id` values are fully isolated.

Use stable names such as `support-daily`, `billing-global`, or `tenant-acme-monthly`. Avoid generating a new `budget_id` per request unless each request should have an isolated budget.

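One way to picture pool sharing is a process-wide registry keyed by `budget_id`. The sketch below is an assumption about the general shape, not the package's implementation; `get_pool` and the bucket name `user:u-42` are invented for illustration.

```python
import threading
from collections import defaultdict

# Hypothetical process-wide registry: one running-total table per budget_id.
_REGISTRY: dict[str, defaultdict[str, float]] = {}
_REGISTRY_LOCK = threading.Lock()


def get_pool(budget_id: str) -> defaultdict[str, float]:
    """Return the pool for budget_id, creating it on first use."""
    with _REGISTRY_LOCK:
        if budget_id not in _REGISTRY:
            _REGISTRY[budget_id] = defaultdict(float)
        return _REGISTRY[budget_id]


# Two evaluators configured with the same budget_id see the same totals.
evaluator_a_pool = get_pool("support-daily")
evaluator_b_pool = get_pool("support-daily")
billing_pool = get_pool("billing-global")

evaluator_a_pool["user:u-42"] += 120.0
evaluator_b_pool["user:u-42"] += 30.0
shared_total = evaluator_a_pool["user:u-42"]  # both writes landed in one bucket
```

The `billing-global` pool never sees the `user:u-42` bucket, which is the isolation described above.
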
## Pricing

`ModelPricing` stores cost rates in cents per 1K tokens:

```python
ModelPricing(input_per_1k=0.04, output_per_1k=0.16)
```

`input_per_1k` is applied to input tokens. `output_per_1k` is applied to output tokens.

Pricing is required when any rule uses `limit_unit="usd_cents"`. Token-only rules can omit pricing. If an event uses a model that is not in the pricing table and a cost rule exists, `unknown_model_behavior="block"` fails closed. Use `"warn"` to log a warning and treat the cost as 0.

## Dual Ceiling Pattern

Use two evaluators when cost and token ceilings need independent control records or different `budget_id` pools:

```python
cost_config = BudgetEvaluatorConfig(
    budget_id="support-cost-daily",
    limits=[
        BudgetLimitRule(
            scope={"agent": "support"},
            group_by="user_id",
            window_seconds=86_400,
            limit=500,
            limit_unit="usd_cents",
        )
    ],
    pricing={
        "gpt-4.1-mini": ModelPricing(input_per_1k=0.04, output_per_1k=0.16),
    },
    model_path="model",
    metadata_paths={"agent": "metadata.agent", "user_id": "metadata.user_id"},
)

token_config = BudgetEvaluatorConfig(
    budget_id="support-token-daily",
    limits=[
        BudgetLimitRule(
            scope={"agent": "support"},
            group_by="user_id",
            window_seconds=86_400,
            limit=50_000,
            limit_unit="tokens",
        )
    ],
    metadata_paths={"agent": "metadata.agent", "user_id": "metadata.user_id"},
)
```

This pattern lets cost and token budgets reset, alert, and roll out independently. A single evaluator can also contain both rules when one shared pool and one control result are sufficient.

## Limitations

`InMemoryBudgetStore` is single-process only. State is lost on restart and is not shared across workers or pods.

Use a distributed store for production deployments that run multiple processes, multiple workers, or multiple pods.
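
The single-process limitation follows from where the state lives. The commit message describes the in-memory store as a dict guarded by a `threading.Lock`; the sketch below is a hypothetical stand-in with that shape (`TinyBudgetStore` and `record_and_check` are invented names, and treating "total greater than limit" as a breach is an assumption).

```python
import threading


class TinyBudgetStore:
    """Hypothetical stand-in for an in-memory store: a dict guarded by a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._totals: dict[str, int] = {}

    def record_and_check(self, bucket: str, amount: int, limit: int) -> bool:
        """Add amount to the bucket and report whether the new total exceeds limit."""
        with self._lock:  # read-modify-write must be atomic across threads
            total = self._totals.get(bucket, 0) + amount
            self._totals[bucket] = total
            return total > limit

    def total(self, bucket: str) -> int:
        with self._lock:
            return self._totals.get(bucket, 0)


store = TinyBudgetStore()


def worker() -> None:
    for _ in range(1000):
        store.record_and_check("user:u-42", 1, limit=50_000)


threads = [threading.Thread(target=worker) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

final_total = store.total("user:u-42")  # 8 threads x 1000 increments
```

The lock keeps totals exact across threads in one process, but the dict is ordinary process memory: a second worker process or pod would start with its own empty dict, and a restart discards everything.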
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
[project]
name = "agent-control-evaluator-budget"
version = "0.1.0"
description = "Budget evaluator for agent-control -- cumulative LLM cost and token tracking"
readme = "README.md"
requires-python = ">=3.12"
license = { text = "Apache-2.0" }
authors = [{ name = "Agent Control Team" }]
dependencies = [
    "agent-control-evaluators>=3.0.0",
    "agent-control-models>=3.0.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0.0",
    "pytest-asyncio>=0.23.0",
    "ruff>=0.1.0",
    "mypy>=1.8.0",
]

[project.entry-points."agent_control.evaluators"]
budget = "agent_control_evaluator_budget.budget:BudgetEvaluator"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["src/agent_control_evaluator_budget"]

[tool.ruff]
line-length = 100
target-version = "py312"

[tool.ruff.lint]
select = ["E", "F", "I"]

[tool.uv.sources]
agent-control-evaluators = { path = "../../builtin", editable = true }
agent-control-models = { path = "../../../models", editable = true }

[dependency-groups]
dev = [
    "pytest>=9.0.2",
    "pytest-asyncio>=1.3.0",
]

evaluators/contrib/budget/src/agent_control_evaluator_budget/__init__.py

Whitespace-only changes.
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
"""Budget evaluator for per-agent LLM cost and token tracking."""

from agent_control_evaluator_budget.budget.config import (
    BudgetEvaluatorConfig,
    BudgetLimitRule,
    ModelPricing,
)
from agent_control_evaluator_budget.budget.evaluator import BudgetEvaluator
from agent_control_evaluator_budget.budget.memory_store import InMemoryBudgetStore
from agent_control_evaluator_budget.budget.store import BudgetSnapshot, BudgetStore

# Note: clear_budget_stores is a testing utility and is intentionally not
# re-exported here. Import it directly from the evaluator submodule in tests:
#   from agent_control_evaluator_budget.budget.evaluator import clear_budget_stores

__all__ = [
    "BudgetEvaluator",
    "BudgetEvaluatorConfig",
    "BudgetLimitRule",
    "BudgetSnapshot",
    "BudgetStore",
    "InMemoryBudgetStore",
    "ModelPricing",
]
Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
"""Configuration for the budget evaluator."""

from __future__ import annotations

from typing import Literal

from agent_control_evaluators._base import EvaluatorConfig
from pydantic import Field, field_validator, model_validator

# ---------------------------------------------------------------------------
# Window convenience constants (seconds)
# ---------------------------------------------------------------------------

WINDOW_HOURLY = 3600
WINDOW_DAILY = 86400
WINDOW_WEEKLY = 604800
WINDOW_MONTHLY = 2592000  # 30 days


class ModelPricing(EvaluatorConfig):
    """Per-model token pricing in cents per 1K tokens."""

    input_per_1k: float = 0.0
    output_per_1k: float = 0.0


class BudgetLimitRule(EvaluatorConfig):
    """A single budget limit rule.

    Each rule defines a ceiling for a combination of scope dimensions
    and time window. Multiple rules can apply to the same step -- the
    evaluator checks all of them and triggers on the first breach.

    Attributes:
        scope: Static scope dimensions that must match for this rule
            to apply. Empty dict = global rule.
            Examples:
                {"agent": "summarizer"} -- per-agent limit
                {"agent": "summarizer", "channel": "slack"} -- agent+channel limit
        group_by: If set, the limit is applied independently for each
            unique value of this dimension. e.g. group_by="user_id" means
            each user gets their own budget. None = shared/global limit.
        window_seconds: Time window for accumulation in seconds.
            None = cumulative (no reset). See WINDOW_* constants.
        limit: Maximum usage in the window. Interpreted by limit_unit.
        limit_unit: Unit for limit. usd_cents checks spend; tokens checks
            input + output tokens.
    """

    scope: dict[str, str] = Field(default_factory=dict)
    group_by: str | None = None
    window_seconds: int | None = None
    limit: int
    limit_unit: Literal["usd_cents", "tokens"] = "usd_cents"

    @field_validator("limit")
    @classmethod
    def validate_limit(cls, v: int) -> int:
        if v <= 0:
            raise ValueError("limit must be a positive integer")
        return v

    @field_validator("window_seconds")
    @classmethod
    def validate_window_seconds(cls, v: int | None) -> int | None:
        if v is not None and v <= 0:
            raise ValueError("window_seconds must be positive")
        return v


class BudgetEvaluatorConfig(EvaluatorConfig):
72+
"""Configuration for the budget evaluator.
73+
74+
Attributes:
75+
limits: List of budget limit rules. Each is checked independently.
76+
budget_id: Unique budget pool identifier. Same budget_id shares
77+
accumulated spend. Different budget_id is fully isolated.
78+
unknown_model_behavior: What to do when a model is not found in the
79+
pricing table and a cost-based rule exists. block=fail closed,
80+
warn=log warning and treat cost as 0.
81+
pricing: Optional model pricing table. Maps model name to ModelPricing.
82+
Used to derive cost in USD from token counts and model name.
83+
token_path: Dot-notation path to extract token usage from step
84+
data (e.g. "usage.total_tokens"). If None, looks for standard
85+
fields (input_tokens, output_tokens, total_tokens, usage).
86+
model_path: Dot-notation path to extract model name (for pricing lookup).
87+
metadata_paths: Mapping of metadata field name to dot-notation path
88+
in step data. Used to extract scope dimensions (channel, user_id, etc).
89+
"""
90+
91+
limits: list[BudgetLimitRule] = Field(min_length=1)
92+
budget_id: str = Field(
93+
default="default",
94+
description=(
95+
"Unique budget pool identifier. Same budget_id shares accumulated spend. "
96+
"Different budget_id is fully isolated."
97+
),
98+
)
99+
unknown_model_behavior: Literal["block", "warn"] = Field(
100+
default="block",
101+
description=(
102+
"What to do when a model is not found in the pricing table and a cost-based "
103+
"rule exists. block=fail closed, warn=log warning and treat cost as 0."
104+
),
105+
)
106+
pricing: dict[str, ModelPricing] | None = None
107+
token_path: str | None = None
108+
model_path: str | None = None
109+
metadata_paths: dict[str, str] = Field(default_factory=dict)
110+
111+
@model_validator(mode="after")
112+
def require_pricing_for_cost_rules(self) -> "BudgetEvaluatorConfig":
113+
has_cost_rule = any(rule.limit_unit == "usd_cents" for rule in self.limits)
114+
if has_cost_rule and self.pricing is None:
115+
raise ValueError('pricing is required when any rule uses limit_unit="usd_cents"')
116+
if has_cost_rule and not (self.model_path or "").strip():
117+
raise ValueError('model_path is required when any rule uses limit_unit="usd_cents"')
118+
return self
