Commit 1e703eb

feat: support Anthropic-compatible endpoints for benchmark LLMs
Add Anthropic as a first-class provider for answer and judge models, validate provider-specific environment variables per role, and update the README to match the omb CLI and runtime-local output workflow.
1 parent 84c3f96 commit 1e703eb

6 files changed

Lines changed: 229 additions & 27 deletions

README.md

Lines changed: 39 additions & 20 deletions
@@ -1,69 +1,88 @@
-# AMB — Agent Memory Benchmark
+# OMB — Open Memory Benchmark
 
-We built AMB because we wanted to be honest about how Hindsight performs — and because no existing benchmark gave us the full picture. AMB is fully open: datasets, prompts, scoring logic, and results.
+We built OMB because we wanted to be honest about how Hindsight performs — and because no existing benchmark gave us the full picture. OMB is fully open: datasets, prompts, scoring logic, and results.
 
 Live leaderboard: **[agentmemorybenchmark.ai](https://agentmemorybenchmark.ai)**
 
 ## The problem with existing benchmarks
 
 LoComo and LongMemEval are solid datasets, but they were designed for an era of 32k context windows. State-of-the-art models now have million-token context windows — on most instances, a naive "dump everything into context" approach scores competitively, not because it's a good memory architecture, but because retrieval has become the easy part. The benchmarks can no longer tell them apart.
 
-Both datasets were also built around chatbot use cases. Agents today don't just answer questions about conversation history — they research, plan, execute multi-step tasks, and build knowledge across many interactions. AMB adds datasets that focus on agentic tasks: memory across tool calls, knowledge built from document research, preferences applied to multi-step decisions.
+Both datasets were also built around chatbot use cases. Agents today don't just answer questions about conversation history — they research, plan, execute multi-step tasks, and build knowledge across many interactions. OMB adds datasets that focus on agentic tasks: memory across tool calls, knowledge built from document research, preferences applied to multi-step decisions.
 
-## What AMB measures
+## What OMB measures
 
-A memory system that scores 90% accuracy but costs $10 per user per day is not better than one that scores 82% and costs $0.10. AMB starts from accuracy because it's the hardest to fake, and tracks speed and token cost alongside it.
+A memory system that scores 90% accuracy but costs $10 per user per day is not better than one that scores 82% and costs $0.10. OMB starts from accuracy because it's the hardest to fake, and tracks speed and token cost alongside it.
 
-The only credible benchmark result is one you can reproduce yourself. AMB publishes everything: the evaluation harness, judge prompts, answer generation prompts, and the exact models used. Small changes to any of these can swing accuracy scores by double digits — we publish all of them.
+The only credible benchmark result is one you can reproduce yourself. OMB publishes everything: the evaluation harness, judge prompts, answer generation prompts, and the exact models used. Small changes to any of these can swing accuracy scores by double digits — we publish all of them.
 
 ## How it works
 
 1. **Ingest** — documents from a dataset are loaded into a memory provider
 2. **Retrieve** — for each query the memory provider retrieves relevant context
-3. **Generate** — a Gemini model produces an answer from the retrieved context
-4. **Judge** — a second Gemini call scores the answer against gold answers
+3. **Generate** — an LLM produces an answer from the retrieved context
+4. **Judge** — a second LLM call scores the answer against gold answers
 
 Retrieval time is tracked separately from generation; ingestion time is also recorded.
 
 ## Setup
 
 ```bash
-# Copy and fill in your API key
-cp .env.example .env # or just create .env with:
-# GEMINI_API_KEY=...
+# Example: Anthropic-compatible endpoint for answer/judge LLMs
+export ANTHROPIC_BASE_URL=https://your-endpoint.example.com
+export ANTHROPIC_API_KEY=your-api-key
+export OMB_ANSWER_LLM=anthropic
+export OMB_JUDGE_LLM=anthropic
+export OMB_ANSWER_MODEL=your-model-name
+export OMB_JUDGE_MODEL=your-model-name
 ```
 
+Only set the provider-specific variables for the providers you actually use:
+
+- `anthropic`: `ANTHROPIC_API_KEY` and optional `ANTHROPIC_BASE_URL`
+- `gemini`: `GEMINI_API_KEY` or `GOOGLE_API_KEY`
+- `groq`: `GROQ_API_KEY`
+- `openai`: `OPENAI_API_KEY`
+
 ## Usage
 
 ```bash
 # List available datasets, memory providers, and modes
-uv run amb providers
+uv run omb providers
 
 # List domains for a dataset
-uv run amb domains --dataset personamem
+uv run omb domains --dataset personamem
 
 # Run a benchmark
-uv run amb run --dataset personamem --domain 32k --memory bm25
+uv run omb run --dataset personamem --domain 32k --memory bm25
 
 # Limit scale for a quick test
-uv run amb run --dataset personamem --domain 32k --memory bm25 --query-limit 20
+uv run omb run --dataset personamem --domain 32k --memory bm25 --query-limit 20
 
 # Oracle mode: ingest only gold documents (tests generation quality in isolation)
-uv run amb run --dataset personamem --domain 32k --memory bm25 --oracle
+uv run omb run --dataset personamem --domain 32k --memory bm25 --oracle
 
 # Dataset statistics
-uv run amb dataset-stats --dataset personamem
+uv run omb dataset-stats --dataset personamem
 
 # Browse results in the browser
-uv run amb view
+uv run omb view
 ```
 
 ## Results
 
-Results are saved to `outputs/{dataset}/{memory}/{mode}/{domain}.json` and can be explored with `uv run amb view`.
+By default, results are saved to `outputs/{dataset}/{memory}/{mode}/{domain}.json`.
+If you pass `--output-dir`, results are written under that directory instead.
+This is how runtime-local wrappers can keep outputs under their own `results/` folders while still using the same benchmark CLI.
+
+Explore results with `uv run omb view`.
 
 ## Requirements
 
 - Python ≥ 3.11
-- `GEMINI_API_KEY` in `.env` or environment
+- API keys for the providers you actually use:
+  - `ANTHROPIC_API_KEY` for `anthropic`
+  - `GEMINI_API_KEY` for `gemini`
+  - `GROQ_API_KEY` for `groq`
+  - `OPENAI_API_KEY` for `openai`
 - For MemBench: set `MEMBENCH_DATA_PATH` to your local data directory
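
The four steps listed under "How it works" in the README can be pictured with a small sketch. Everything here is hypothetical: `KeywordMemory`, `run_query`, `first_hit` and `contains_gold` are illustrative names, not the benchmark's API, and the answer and judge functions are plain stand-ins for the two LLM calls. The sketch only shows ingest, retrieve, generate and judge as separate stages, with retrieval timed apart from generation as the README describes.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class KeywordMemory:
    """Toy memory provider (hypothetical): ranks stored documents by word overlap."""

    docs: list[str] = field(default_factory=list)

    def ingest(self, documents: list[str]) -> None:
        self.docs.extend(documents)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        words = set(query.lower().split())
        ranked = sorted(self.docs, key=lambda d: -len(words & set(d.lower().split())))
        return ranked[:k]


def first_hit(query: str, context: list[str]) -> str:
    """Stand-in for the answer LLM: returns the top retrieved document."""
    return context[0] if context else ""


def contains_gold(predicted: str, gold: str) -> bool:
    """Stand-in for the judge LLM: checks that the gold answer appears in the prediction."""
    return gold.lower() in predicted.lower()


def run_query(
    memory: KeywordMemory,
    query: str,
    gold: str,
    answer: Callable[[str, list[str]], str],
    judge: Callable[[str, str], bool],
) -> dict:
    # Step 2: retrieve, timed on its own.
    start = time.perf_counter()
    context = memory.retrieve(query)
    retrieve_s = time.perf_counter() - start

    # Step 3: generate an answer from the retrieved context.
    start = time.perf_counter()
    predicted = answer(query, context)
    generate_s = time.perf_counter() - start

    # Step 4: judge the answer against the gold answer.
    return {
        "answer": predicted,
        "correct": judge(predicted, gold),
        "retrieve_s": retrieve_s,
        "generate_s": generate_s,
    }


# Step 1: ingest.
memory = KeywordMemory()
memory.ingest(["Alice prefers tea over coffee", "Bob lives in Paris"])
result = run_query(memory, "Where does Bob live", "Paris", answer=first_hit, judge=contains_gold)
print(result["answer"], result["correct"])
```

Keeping the two timers separate is what lets a slow retriever and a slow answer model be told apart in the results.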

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@ version = "0.1.0"
 description = "Open Memory Benchmark"
 requires-python = ">=3.11"
 dependencies = [
+    "anthropic>=0.84.0",
     "datasets>=2.0",
     "typer>=0.12",
     "rich>=13",

src/memory_bench/cli.py

Lines changed: 41 additions & 7 deletions
@@ -22,12 +22,46 @@
 console = Console()
 
 
-def _resolve_gemini_key() -> None:
-    key = os.environ.get("GEMINI_API_KEY") or os.environ.get("GOOGLE_API_KEY")
-    if not key:
-        typer.echo("Error: GEMINI_API_KEY environment variable is not set.", err=True)
-        raise typer.Exit(1)
-    os.environ["GOOGLE_API_KEY"] = key
+def _ensure_provider_env(provider: str, role: str) -> None:
+    if provider == "anthropic":
+        if not os.environ.get("ANTHROPIC_API_KEY"):
+            typer.echo(f"Error: {role} LLM provider '{provider}' requires ANTHROPIC_API_KEY.", err=True)
+            raise typer.Exit(1)
+        return
+
+    if provider == "gemini":
+        key = os.environ.get("GEMINI_API_KEY") or os.environ.get("GOOGLE_API_KEY")
+        if not key:
+            typer.echo(f"Error: {role} LLM provider '{provider}' requires GEMINI_API_KEY.", err=True)
+            raise typer.Exit(1)
+        os.environ["GOOGLE_API_KEY"] = key
+        return
+
+    if provider == "groq":
+        if not os.environ.get("GROQ_API_KEY"):
+            typer.echo(f"Error: {role} LLM provider '{provider}' requires GROQ_API_KEY.", err=True)
+            raise typer.Exit(1)
+        return
+
+    if provider == "openai":
+        if not os.environ.get("OPENAI_API_KEY"):
+            typer.echo(f"Error: {role} LLM provider '{provider}' requires OPENAI_API_KEY.", err=True)
+            raise typer.Exit(1)
+        return
+
+
+def _validate_run_env(memory: str) -> None:
+    answer_provider = os.environ.get("OMB_ANSWER_LLM", "groq")
+    judge_provider = os.environ.get("OMB_JUDGE_LLM", "gemini")
+    _ensure_provider_env(answer_provider, "Answer")
+    _ensure_provider_env(judge_provider, "Judge")
+
+    if memory == "hindsight":
+        key = os.environ.get("GEMINI_API_KEY") or os.environ.get("GOOGLE_API_KEY")
+        if not key:
+            typer.echo("Error: memory provider 'hindsight' requires GEMINI_API_KEY for embedded extraction.", err=True)
+            raise typer.Exit(1)
+        os.environ["GOOGLE_API_KEY"] = key
 
 
 @app.command()
@@ -53,7 +87,7 @@ def run(
     description: str = typer.Option(None, "--description", "-d", help="Optional description for this run (stored in the result JSON)"),
 ) -> None:
     """Run an evaluation on a single split (optionally filtered to a category)."""
-    _resolve_gemini_key()
+    _validate_run_env(memory)
 
     ds = get_dataset(dataset)
 
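
The four branches in `_ensure_provider_env` repeat one pattern, so the same per-role check can be sketched as a lookup table. This is only an illustration under stated assumptions: `REQUIRED_ENV`, `check_provider_env` and `check_run_env` are hypothetical names, the environment is passed in as a mapping so the logic can be exercised without touching `os.environ`, and the sketch does not copy the Gemini key into `GOOGLE_API_KEY` the way the diff does. As written in the diff, a provider name outside the four known ones passes through without an error; the sketch reports it instead, which is one possible choice rather than the commit's behavior.

```python
from collections.abc import Mapping
from typing import Optional

# Hypothetical table: provider name -> env vars, any one of which satisfies it.
REQUIRED_ENV: dict[str, tuple[str, ...]] = {
    "anthropic": ("ANTHROPIC_API_KEY",),
    "gemini": ("GEMINI_API_KEY", "GOOGLE_API_KEY"),
    "groq": ("GROQ_API_KEY",),
    "openai": ("OPENAI_API_KEY",),
}


def check_provider_env(provider: str, role: str, env: Mapping[str, str]) -> Optional[str]:
    """Return an error message, or None when a usable key is present."""
    if provider not in REQUIRED_ENV:
        # Assumption: unknown names are reported here; the diff lets them through.
        return f"{role} LLM provider '{provider}' is not recognized."
    names = REQUIRED_ENV[provider]
    if any(env.get(name) for name in names):
        return None
    return f"{role} LLM provider '{provider}' requires {names[0]}."


def check_run_env(env: Mapping[str, str]) -> list[str]:
    """Validate the answer and judge roles, using the defaults shown in the diff."""
    answer = env.get("OMB_ANSWER_LLM", "groq")
    judge = env.get("OMB_JUDGE_LLM", "gemini")
    errors = [
        check_provider_env(answer, "Answer", env),
        check_provider_env(judge, "Judge", env),
    ]
    return [e for e in errors if e is not None]


for message in check_run_env({"OMB_ANSWER_LLM": "anthropic", "OMB_JUDGE_LLM": "gemini"}):
    print("Error:", message)
```

Collecting the messages instead of exiting on the first one would let a user see every missing key in a single run.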

src/memory_bench/llm/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -1,11 +1,13 @@
 import os
 
+from .anthropic import AnthropicLLM
 from .base import LLM, Schema
 from .gemini import GeminiLLM
 from .groq import GroqLLM
 from .openai import OpenAILLM
 
 REGISTRY: dict[str, type[LLM]] = {
+    "anthropic": AnthropicLLM,
     "gemini": GeminiLLM,
     "groq": GroqLLM,
     "openai": OpenAILLM,
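
The hunk above shows only the registry, not the code that consults it. A common way to use such a mapping is sketched below, with stand-in classes so the example runs on its own. `StubLLM`, its subclasses and `build_llm` are hypothetical, and reading `OMB_<ROLE>_LLM` and `OMB_<ROLE>_MODEL` this way is an assumption based on the variable names in the README, not something this hunk confirms.

```python
from collections.abc import Mapping
from typing import Optional


class StubLLM:
    """Stand-in for the real LLM base class (hypothetical)."""

    provider = "stub"

    def __init__(self, model: Optional[str] = None):
        self.model = model or "default-model"

    @property
    def model_id(self) -> str:
        return f"{self.provider}:{self.model}"


class StubAnthropic(StubLLM):
    provider = "anthropic"


class StubGroq(StubLLM):
    provider = "groq"


# Same shape as the REGISTRY in the diff: provider name -> class.
REGISTRY: dict[str, type[StubLLM]] = {"anthropic": StubAnthropic, "groq": StubGroq}


def build_llm(role: str, env: Mapping[str, str], default_provider: str) -> StubLLM:
    """Pick a class by provider name and hand it the role's model name, if any."""
    provider = env.get(f"OMB_{role}_LLM", default_provider)
    try:
        cls = REGISTRY[provider]
    except KeyError:
        raise ValueError(f"Unknown LLM provider '{provider}'; choose from {sorted(REGISTRY)}") from None
    return cls(env.get(f"OMB_{role}_MODEL"))


llm = build_llm("ANSWER", {"OMB_ANSWER_LLM": "anthropic", "OMB_ANSWER_MODEL": "m1"}, "groq")
print(llm.model_id)
```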

src/memory_bench/llm/anthropic.py

Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
+import json
+import os
+import re
+import time
+
+from .base import LLM, Schema
+
+_MAX_RETRIES = 6
+_RETRY_BASE_DELAY = 5
+
+
+def _parse_json_payload(text: str) -> dict:
+    text = text.strip()
+
+    try:
+        return json.loads(text)
+    except json.JSONDecodeError:
+        pass
+
+    fenced = re.search(r"```(?:json)?\s*(\{.*\})\s*```", text, flags=re.DOTALL | re.IGNORECASE)
+    if fenced:
+        return json.loads(fenced.group(1))
+
+    start = text.find("{")
+    end = text.rfind("}")
+    if start != -1 and end != -1 and end > start:
+        return json.loads(text[start : end + 1])
+
+    raise json.JSONDecodeError("Could not find JSON object in model response", text, 0)
+
+
+def _coerce_text_payload(text: str, schema: Schema) -> dict | None:
+    text = text.strip()
+    if not text:
+        return None
+
+    result: dict = {}
+    for field in schema.required:
+        spec = schema.properties.get(field, {})
+        field_type = spec.get("type", "string")
+        lowered = text.lower()
+
+        if field_type == "string":
+            result[field] = text
+            continue
+
+        if field_type == "boolean":
+            if re.search(r"\btrue\b", lowered):
+                result[field] = True
+                continue
+            if re.search(r"\bfalse\b", lowered):
+                result[field] = False
+                continue
+            if re.search(r"\b(correct|yes)\b", lowered) and not re.search(r"\b(incorrect|wrong|no)\b", lowered):
+                result[field] = True
+                continue
+            if re.search(r"\b(incorrect|wrong|no)\b", lowered):
+                result[field] = False
+                continue
+            return None
+
+        return None
+
+    return result
+
+
+class AnthropicLLM(LLM):
+    def __init__(self, model: str | None = None):
+        from anthropic import Anthropic
+
+        api_key = os.environ.get("ANTHROPIC_API_KEY")
+        if not api_key:
+            raise RuntimeError("Anthropic provider requires ANTHROPIC_API_KEY")
+
+        base_url = os.environ.get("ANTHROPIC_BASE_URL")
+        self._client = Anthropic(
+            api_key=api_key,
+            base_url=base_url or None,
+            max_retries=0,
+        )
+        self._model = (
+            model
+            or os.environ.get("ANTHROPIC_MODEL")
+            or "claude-sonnet-4-5"
+        )
+
+    @property
+    def model_id(self) -> str:
+        return f"anthropic:{self._model}"
+
+    def generate(self, prompt: str, schema: Schema) -> dict:
+        from anthropic import APIConnectionError, APIStatusError, RateLimitError
+
+        schema_json = {
+            "type": "object",
+            "properties": schema.properties,
+            "required": schema.required,
+            "additionalProperties": False,
+        }
+        system_prompt = (
+            "Return only a valid JSON object matching this schema. "
+            "Do not wrap JSON in markdown fences.\n\n"
+            f"{json.dumps(schema_json, ensure_ascii=False)}"
+        )
+
+        delay = _RETRY_BASE_DELAY
+        last_exc = None
+
+        for attempt in range(_MAX_RETRIES):
+            try:
+                response = self._client.messages.create(
+                    model=self._model,
+                    max_tokens=4096,
+                    temperature=0.0,
+                    system=system_prompt,
+                    messages=[{"role": "user", "content": prompt}],
+                )
+                text = "".join(block.text for block in response.content if getattr(block, "type", None) == "text")
+                try:
+                    return _parse_json_payload(text)
+                except json.JSONDecodeError:
+                    coerced = _coerce_text_payload(text, schema)
+                    if coerced is not None:
+                        return coerced
+                    raise
+            except (RateLimitError, APIConnectionError) as e:
+                last_exc = e
+            except APIStatusError as e:
+                last_exc = e
+                if e.status_code not in (429, 500, 502, 503, 504):
+                    raise
+            except Exception as e:
+                last_exc = e
+                msg = str(e)
+                if "429" not in msg and "rate" not in msg.lower():
+                    raise
+
+            if attempt < _MAX_RETRIES - 1:
+                time.sleep(delay)
+                delay *= 2
+
+        raise RuntimeError(f"Anthropic request failed after {_MAX_RETRIES} retries: {last_exc}")
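
To see what the three fallbacks in `_parse_json_payload` accept, here is a standalone copy of that logic run against a few representative model replies. The function body follows the diff; the only change is that the fence marker is built from `FENCE` rather than written literally, which is a convenience of this sketch and not part of the commit. The sample replies are made up for illustration.

```python
import json
import re

FENCE = "`" * 3  # three backticks, assembled here only to keep the example readable
FENCED_JSON = re.compile(FENCE + r"(?:json)?\s*(\{.*\})\s*" + FENCE, flags=re.DOTALL | re.IGNORECASE)


def parse_json_payload(text: str) -> dict:
    text = text.strip()

    # 1. The whole reply is already valid JSON.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # 2. The reply wraps a JSON object in a markdown fence.
    fenced = FENCED_JSON.search(text)
    if fenced:
        return json.loads(fenced.group(1))

    # 3. The reply has prose around a JSON object: take the outermost braces.
    start = text.find("{")
    end = text.rfind("}")
    if start != -1 and end != -1 and end > start:
        return json.loads(text[start : end + 1])

    raise json.JSONDecodeError("Could not find JSON object in model response", text, 0)


plain = parse_json_payload('{"correct": true}')
wrapped = parse_json_payload(FENCE + 'json\n{"correct": false}\n' + FENCE)
chatty = parse_json_payload('Sure! Here it is: {"answer": "Paris"} Hope that helps.')
print(plain, wrapped, chatty)
```

One detail worth knowing when reading the retry loop: the first branch accepts any valid JSON, so a bare `true` comes back as a `bool` rather than a `dict`.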

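The waiting time implied by `_MAX_RETRIES = 6` and `_RETRY_BASE_DELAY = 5` can be worked out with a few lines that mirror the loop in `generate`. `backoff_schedule` is a hypothetical helper written for this note; it performs no requests and no sleeping, it only lists the delays the loop would sleep for if every attempt failed with a retryable error.

```python
def backoff_schedule(max_retries: int = 6, base_delay: int = 5) -> list[int]:
    """List the sleeps between attempts: the delay doubles, and the last attempt does not sleep."""
    delays = []
    delay = base_delay
    for attempt in range(max_retries):
        if attempt < max_retries - 1:
            delays.append(delay)
            delay *= 2
    return delays


schedule = backoff_schedule()
print(schedule, sum(schedule))  # [5, 10, 20, 40, 80] 155
```

So a query that fails on all six attempts spends a little over two and a half minutes sleeping before the `RuntimeError` is raised, on top of the time taken by the requests themselves.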
uv.lock

Lines changed: 4 additions & 0 deletions
Some generated files are not rendered by default.
