|
1 | | -# AMB — Agent Memory Benchmark |
| 1 | +# OMB — Open Memory Benchmark |
2 | 2 |
|
3 | | -We built AMB because we wanted to be honest about how Hindsight performs — and because no existing benchmark gave us the full picture. AMB is fully open: datasets, prompts, scoring logic, and results. |
| 3 | +We built OMB because we wanted to be honest about how Hindsight performs — and because no existing benchmark gave us the full picture. OMB is fully open: datasets, prompts, scoring logic, and results. |
4 | 4 |
|
5 | 5 | Live leaderboard: **[agentmemorybenchmark.ai](https://agentmemorybenchmark.ai)** |
6 | 6 |
|
7 | 7 | ## The problem with existing benchmarks |
8 | 8 |
|
9 | 9 | LoComo and LongMemEval are solid datasets, but they were designed for an era of 32k context windows. State-of-the-art models now have million-token context windows, so on most instances a naive "dump everything into context" approach scores competitively. That is not because it is a good memory architecture, but because retrieval has become the easy part. The benchmarks can no longer tell a real memory system apart from a context dump. |
10 | 10 |
|
11 | | -Both datasets were also built around chatbot use cases. Agents today don't just answer questions about conversation history — they research, plan, execute multi-step tasks, and build knowledge across many interactions. AMB adds datasets that focus on agentic tasks: memory across tool calls, knowledge built from document research, preferences applied to multi-step decisions. |
| 11 | +Both datasets were also built around chatbot use cases. Agents today don't just answer questions about conversation history — they research, plan, execute multi-step tasks, and build knowledge across many interactions. OMB adds datasets that focus on agentic tasks: memory across tool calls, knowledge built from document research, preferences applied to multi-step decisions. |
12 | 12 |
|
13 | | -## What AMB measures |
| 13 | +## What OMB measures |
14 | 14 |
|
15 | | -A memory system that scores 90% accuracy but costs $10 per user per day is not better than one that scores 82% and costs $0.10. AMB starts from accuracy because it's the hardest to fake, and tracks speed and token cost alongside it. |
| 15 | +A memory system that scores 90% accuracy but costs $10 per user per day is not better than one that scores 82% and costs $0.10. OMB starts from accuracy because it's the hardest to fake, and tracks speed and token cost alongside it. |
16 | 16 |
|
17 | | -The only credible benchmark result is one you can reproduce yourself. AMB publishes everything: the evaluation harness, judge prompts, answer generation prompts, and the exact models used. Small changes to any of these can swing accuracy scores by double digits — we publish all of them. |
| 17 | +The only credible benchmark result is one you can reproduce yourself. OMB publishes everything: the evaluation harness, judge prompts, answer generation prompts, and the exact models used. Small changes to any of these can swing accuracy scores by double digits — we publish all of them. |
18 | 18 |
|
19 | 19 | ## How it works |
20 | 20 |
|
21 | 21 | 1. **Ingest** — documents from a dataset are loaded into a memory provider |
22 | 22 | 2. **Retrieve** — for each query the memory provider retrieves relevant context |
23 | | -3. **Generate** — a Gemini model produces an answer from the retrieved context |
24 | | -4. **Judge** — a second Gemini call scores the answer against gold answers |
| 23 | +3. **Generate** — an LLM produces an answer from the retrieved context |
| 24 | +4. **Judge** — a second LLM call scores the answer against gold answers |
25 | 25 |
|
26 | 26 | Retrieval time is tracked separately from generation; ingestion time is also recorded. |
27 | 27 |
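The four stages above, together with the separate timers, can be sketched roughly as follows. This is a minimal illustration and not the OMB harness: `KeywordMemory`, `generate`, `judge`, and `run_benchmark` are hypothetical names, and the two LLM calls are replaced by trivial stand-ins so the example runs offline.

```python
import time
from dataclasses import dataclass, field


@dataclass
class KeywordMemory:
    """Toy memory provider (hypothetical): ranks documents by word overlap."""

    docs: list[str] = field(default_factory=list)

    def ingest(self, documents: list[str]) -> None:
        self.docs.extend(documents)

    def retrieve(self, query: str, k: int = 1) -> list[str]:
        terms = set(query.lower().split())
        ranked = sorted(
            self.docs,
            key=lambda doc: len(terms & set(doc.lower().split())),
            reverse=True,
        )
        return ranked[:k]


def generate(query: str, context: list[str]) -> str:
    """Stand-in for the answer LLM: echoes the top retrieved document."""
    return context[0] if context else ""


def judge(answer: str, gold_answers: list[str]) -> bool:
    """Stand-in for the judge LLM: substring match against any gold answer."""
    return any(gold.lower() in answer.lower() for gold in gold_answers)


def run_benchmark(memory, documents, queries):
    """Run ingest -> retrieve -> generate -> judge, timing each stage separately."""
    start = time.perf_counter()
    memory.ingest(documents)
    ingest_s = time.perf_counter() - start

    retrieve_s = generate_s = 0.0
    correct = 0
    for question, gold_answers in queries:
        start = time.perf_counter()
        context = memory.retrieve(question)
        retrieve_s += time.perf_counter() - start

        start = time.perf_counter()
        answer = generate(question, context)
        generate_s += time.perf_counter() - start

        correct += judge(answer, gold_answers)

    return {
        "accuracy": correct / len(queries),
        "ingest_s": ingest_s,
        "retrieve_s": retrieve_s,
        "generate_s": generate_s,
    }
```

Keeping the retrieval timer apart from the generation timer means a slow answer model does not get charged to the memory provider, which is the quantity under test.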
|
28 | 28 | ## Setup |
29 | 29 |
|
30 | 30 | ```bash |
31 | | -# Copy and fill in your API key |
32 | | -cp .env.example .env # or just create .env with: |
33 | | -# GEMINI_API_KEY=... |
| 31 | +# Example: Anthropic-compatible endpoint for answer/judge LLMs |
| 32 | +export ANTHROPIC_BASE_URL=https://your-endpoint.example.com |
| 33 | +export ANTHROPIC_API_KEY=your-api-key |
| 34 | +export OMB_ANSWER_LLM=anthropic |
| 35 | +export OMB_JUDGE_LLM=anthropic |
| 36 | +export OMB_ANSWER_MODEL=your-model-name |
| 37 | +export OMB_JUDGE_MODEL=your-model-name |
34 | 38 | ``` |
35 | 39 |
|
| 40 | +Only set the provider-specific variables for the providers you actually use: |
| 41 | + |
| 42 | +- `anthropic`: `ANTHROPIC_API_KEY` and optional `ANTHROPIC_BASE_URL` |
| 43 | +- `gemini`: `GEMINI_API_KEY` or `GOOGLE_API_KEY` |
| 44 | +- `groq`: `GROQ_API_KEY` |
| 45 | +- `openai`: `OPENAI_API_KEY` |
| 46 | + |
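A quick pre-flight check for the variables listed above might look like the sketch below. It is not part of OMB: `PROVIDER_KEYS` and `missing_keys` are hypothetical names, and the mapping only restates the list above. What OMB does when `OMB_ANSWER_LLM` or `OMB_JUDGE_LLM` is unset is not covered here, so unset roles are simply skipped.

```python
import os

# Provider name -> accepted API-key variables (any one of them is enough).
# This mirrors the list in the README; it is not read from OMB itself.
PROVIDER_KEYS = {
    "anthropic": ("ANTHROPIC_API_KEY",),
    "gemini": ("GEMINI_API_KEY", "GOOGLE_API_KEY"),
    "groq": ("GROQ_API_KEY",),
    "openai": ("OPENAI_API_KEY",),
}


def missing_keys(env=None):
    """Return one message per configured LLM role whose API key is not set."""
    env = os.environ if env is None else env
    problems = []
    for role_var in ("OMB_ANSWER_LLM", "OMB_JUDGE_LLM"):
        provider = env.get(role_var)
        if not provider:
            continue  # role not configured; defaults are outside this sketch
        accepted = PROVIDER_KEYS.get(provider)
        if accepted is None:
            problems.append(f"{role_var}={provider}: unknown provider")
        elif not any(env.get(name) for name in accepted):
            problems.append(f"{role_var}={provider}: set {' or '.join(accepted)}")
    return problems
```

Calling `missing_keys()` with no argument inspects the current process environment; passing a plain dict makes the check easy to test.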
36 | 47 | ## Usage |
37 | 48 |
|
38 | 49 | ```bash |
39 | 50 | # List available datasets, memory providers, and modes |
40 | | -uv run amb providers |
| 51 | +uv run omb providers |
41 | 52 |
|
42 | 53 | # List domains for a dataset |
43 | | -uv run amb domains --dataset personamem |
| 54 | +uv run omb domains --dataset personamem |
44 | 55 |
|
45 | 56 | # Run a benchmark |
46 | | -uv run amb run --dataset personamem --domain 32k --memory bm25 |
| 57 | +uv run omb run --dataset personamem --domain 32k --memory bm25 |
47 | 58 |
|
48 | 59 | # Limit scale for a quick test |
49 | | -uv run amb run --dataset personamem --domain 32k --memory bm25 --query-limit 20 |
| 60 | +uv run omb run --dataset personamem --domain 32k --memory bm25 --query-limit 20 |
50 | 61 |
|
51 | 62 | # Oracle mode: ingest only gold documents (tests generation quality in isolation) |
52 | | -uv run amb run --dataset personamem --domain 32k --memory bm25 --oracle |
| 63 | +uv run omb run --dataset personamem --domain 32k --memory bm25 --oracle |
53 | 64 |
|
54 | 65 | # Dataset statistics |
55 | | -uv run amb dataset-stats --dataset personamem |
| 66 | +uv run omb dataset-stats --dataset personamem |
56 | 67 |
|
57 | 68 | # Browse results in the browser |
58 | | -uv run amb view |
| 69 | +uv run omb view |
59 | 70 | ``` |
60 | 71 |
|
61 | 72 | ## Results |
62 | 73 |
|
63 | | -Results are saved to `outputs/{dataset}/{memory}/{mode}/{domain}.json` and can be explored with `uv run amb view`. |
| 74 | +By default, results are saved to `outputs/{dataset}/{memory}/{mode}/{domain}.json`. |
| 75 | +If you pass `--output-dir`, results are written under that directory instead. |
| 76 | +This is how runtime-local wrappers can keep outputs under their own `results/` folders while still using the same benchmark CLI. |
| 77 | + |
| 78 | +Explore results with `uv run omb view`. |
64 | 79 |
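For scripts that post-process results, the default path template can be reproduced as in the sketch below. `result_path` and `list_results` are hypothetical helpers, not OMB functions. The sketch assumes that the layout under `--output-dir` matches the default `{dataset}/{memory}/{mode}/{domain}.json` layout, which the text above does not confirm, and the mode value `"rag"` in the usage line is only a placeholder.

```python
from pathlib import Path


def result_path(dataset, memory, mode, domain, output_dir="outputs"):
    """Build the result file path from the documented template.

    Assumption: a custom output_dir keeps the same sub-layout as the default.
    """
    return Path(output_dir) / dataset / memory / mode / f"{domain}.json"


def list_results(output_dir="outputs"):
    """Return every result file found four levels below output_dir, sorted."""
    return sorted(Path(output_dir).glob("*/*/*/*.json"))
```

For example, `result_path("personamem", "bm25", "rag", "32k")` yields `outputs/personamem/bm25/rag/32k.json` on POSIX systems.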
|
65 | 80 | ## Requirements |
66 | 81 |
|
67 | 82 | - Python ≥ 3.11 |
68 | | -- `GEMINI_API_KEY` in `.env` or environment |
| 83 | +- API keys for the providers you actually use: |
| 84 | +  - `ANTHROPIC_API_KEY` for `anthropic` |
| 85 | +  - `GEMINI_API_KEY` or `GOOGLE_API_KEY` for `gemini` |
| 86 | +  - `GROQ_API_KEY` for `groq` |
| 87 | +  - `OPENAI_API_KEY` for `openai` |
69 | 88 | - For MemBench: set `MEMBENCH_DATA_PATH` to your local data directory |