Commit 62f7c77

blog: add benchmarks post — 'We Benchmark in the Open'

1 file changed: content/9.blog/4.benchmarks.md (+137, -0)
---
title: "We Benchmark in the Open"
description: "How Basic Memory performs on academic retrieval benchmarks — and why we publish everything, including the parts that aren't flattering."
---
Most AI memory products don't publish benchmarks. The ones that do tend to cherry-pick metrics, use proprietary evaluation setups, or — in at least one case — [get caught inflating their numbers](https://github.com/getzep/zep-papers/issues/5).

We decided to do it differently. We built a standalone, reproducible benchmark suite, ran it against an academic dataset, and published everything: the code, the results, the methodology, and the gaps.

Here's what we found.

---
## The Dataset: LoCoMo

[LoCoMo](https://snap-research.github.io/locomo/) is an academic benchmark from Snap Research designed to test long-conversation memory systems. It's 10 multi-session conversations with 1,982 questions across five categories:

- **Single-hop** — straightforward fact recall ("Where does Alice work?")
- **Multi-hop** — connecting facts across conversations ("Who works at the same company as the person who likes hiking?")
- **Temporal** — time-sensitive reasoning ("What was Bob's job before he switched in March?")
- **Open-domain** — broad knowledge retrieval
- **Adversarial** — questions designed to trip up memory systems

It's the same benchmark used by Mem0, Zep, MemMachine, and others. Common ground for comparison.
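To make the categories concrete, a question record can be pictured roughly like this. This is a hypothetical shape for illustration only — the field names, answer, and evidence pointers below are made up, and the actual LoCoMo files may be structured differently:

```python
from collections import Counter

# Hypothetical question record — illustrative only; the real LoCoMo
# schema may use different field names.
question = {
    "conversation_id": "conv-01",
    "category": "multi-hop",
    "question": "Who works at the same company as the person who likes hiking?",
    "answer": "Carol",  # made-up answer
    "evidence": ["session-2/turn-14", "session-5/turn-3"],
}

# Scoring a run means bucketing all 1,982 questions by category:
questions = [question]
per_category = Counter(q["category"] for q in questions)
print(per_category["multi-hop"])  # 1
```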
## Our Results

Basic Memory v0.18.5, running entirely locally on SQLite with local embeddings. No cloud APIs. No external services.

| Metric | Score |
|--------|-------|
| **Recall@5** | **76.4%** |
| **Recall@10** | **85.5%** |
| **MRR** | **0.658** |
| **Mean Latency** | **1,063ms** |
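For reference: Recall@K scores a query 1 if any gold document appears in the top K results, and MRR averages the reciprocal rank of the first hit. A minimal sketch of both metrics (our illustration here, not the benchmark suite's actual code):

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """1.0 if any gold document appears in the top-k results, else 0.0."""
    return 1.0 if set(ranked_ids[:k]) & set(gold_ids) else 0.0

def reciprocal_rank(ranked_ids, gold_ids):
    """1/rank of the first gold document retrieved, else 0.0."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0

# One query where the gold document comes back ranked second:
ranked = ["d9", "d3", "d7", "d1", "d4"]
gold = ["d3"]
print(recall_at_k(ranked, gold, 5))   # 1.0
print(reciprocal_rank(ranked, gold))  # 0.5
```

The published numbers are these per-query scores averaged over all 1,982 questions.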
By category (Recall@5):

| Category | Score |
|----------|-------|
| Open-domain | 86.6% |
| Multi-hop | 84.1% |
| Adversarial | 67.0% |
| Temporal | 59.1% |
| Single-hop | 57.7% |

We're strong on open-domain and multi-hop retrieval. The knowledge graph structure helps here — connecting facts across conversations is literally what a graph does. We're weaker on single-hop and temporal queries, and we know why.
## What's Working
**Multi-hop retrieval (84.1%)** is our standout. Questions that require connecting information across multiple conversations play to our architecture's strength. When you write a note about Alice's job and another about Alice's hiking trip, Basic Memory's relation graph links them. A query about "Alice's hobbies and career" traverses that graph.
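The idea can be sketched as a tiny relation graph. None of this is Basic Memory's actual internals — the node names and traversal are a toy illustration of why a graph helps multi-hop queries:

```python
from collections import defaultdict

graph = defaultdict(set)

def relate(a, b):
    # Relations are treated as bidirectional for traversal purposes.
    graph[a].add(b)
    graph[b].add(a)

relate("alice-job-note", "alice")      # note about Alice's job
relate("alice-hiking-note", "alice")   # note about her hiking trip

def neighborhood(start, depth=1):
    """Collect everything reachable within `depth` hops of a node."""
    seen, frontier = {start}, {start}
    for _ in range(depth):
        frontier = {n for node in frontier for n in graph[node]} - seen
        seen |= frontier
    return seen

# A query that resolves to the 'alice' entity pulls in both notes with a
# single hop — the cross-conversation connection is already materialized.
print(sorted(neighborhood("alice")))  # ['alice', 'alice-hiking-note', 'alice-job-note']
```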
**Open-domain (86.6%)** is strong because hybrid search — combining keyword matching with semantic similarity — handles broad queries well. The keyword side catches exact terms; the vector side catches related concepts.
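One common way to combine the two sides is reciprocal rank fusion: each ranked list contributes a score that decays with rank, so documents surfaced by both lists rise to the top. A sketch under that assumption — Basic Memory's actual fusion method may differ:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: merge several ranked lists into one."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["d1", "d2", "d3"]  # exact-term matches
vector_hits = ["d2", "d4", "d1"]   # semantically similar chunks

# d2 and d1 appear in both lists, so they outrank single-list hits.
print(rrf([keyword_hits, vector_hits]))  # ['d2', 'd1', 'd4', 'd3']
```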
**Fully local execution** matters more than it sounds. Every query runs against a local SQLite database with local embeddings. No network calls, no API keys, no per-query costs. The 1,063ms mean latency is end-to-end on consumer hardware.
## What's Not Working (Yet)
**Single-hop (57.7%)** is our biggest gap. This is basic fact recall — the kind of thing that should be easy. The issue is architectural: we store full conversations and rely on chunk matching, while competitors like Mem0 extract atomic facts before storing. Their approach creates precise, granular memory units. Ours preserves full context but makes pinpoint retrieval harder.
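The trade-off shows up even in a toy example. All the data below is made up, and the word-overlap scoring is a crude stand-in for real retrieval — it's meant to show why a chunk's match signal gets diluted while an atomic fact's does not:

```python
# Chunk-based storage keeps whole conversation passages (our approach);
# fact-based storage (Mem0-style) extracts atomic statements up front.
chunk = ("Alice mentioned she started at Acme in June, then talked about "
         "her weekend hike and her sister's wedding plans.")
facts = [
    "Alice works at Acme.",
    "Alice started at Acme in June.",
    "Alice went hiking last weekend.",
]

query = "Where does Alice work?"

def words(s):
    return {w.strip(".,?'") for w in s.lower().split()}

def density(text, q):
    """Fraction of the text's words that match the query — a crude
    stand-in for retrieval precision."""
    t = words(text)
    return len(t & words(q)) / len(t)

# The matching fact is nearly all signal; the chunk's match is diluted
# by unrelated content, which hurts pinpoint (single-hop) retrieval.
print(round(density(chunk, query), 3))                  # ≈ 0.056
print(max(round(density(f, query), 3) for f in facts))  # 0.25
```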
**Temporal (59.1%)** suffers because we don't yet distinguish between "when a note was created" and "when the event it describes happened." If you write today about a meeting that happened last Tuesday, Basic Memory timestamps today's date. Temporal queries need the event date. This is a solvable problem — we just haven't solved it yet.
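Fixing it means carrying two timestamps per note and filtering temporal queries on the right one. A hypothetical shape for that fix — this is not a shipped Basic Memory feature, and the field names are our own:

```python
from datetime import date

# Hypothetical dual-timestamp note: written today about last Tuesday's meeting.
note = {
    "text": "Met with the design team about the Q3 roadmap.",
    "created_at": date(2025, 6, 20),  # when the note was written
    "event_date": date(2025, 6, 17),  # when the meeting actually happened
}

def matches_timeframe(note, start, end):
    """Temporal queries should filter on the event date, falling back to
    the creation date only when no event date was extracted."""
    when = note.get("event_date") or note["created_at"]
    return start <= when <= end

# "What happened the week of June 16?" — hits via event_date,
# even though the note itself was created later.
print(matches_timeframe(note, date(2025, 6, 16), date(2025, 6, 19)))  # True
```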
## The Comparison Problem

Here's where we need to be honest about something the industry mostly isn't.

There are two ways to evaluate memory systems:

1. **Retrieval metrics** — Did the system find the right document? (Recall@K, MRR)
2. **LLM-as-Judge** — Given what was retrieved, did an LLM produce the correct answer? (Binary score from GPT-4o)

We currently measure (1). Most published competitor numbers use (2). **These are not directly comparable.**

A system with perfect retrieval but bad prompting scores high on (1) and low on (2). A system with mediocre retrieval but excellent answer generation could score higher on (2) than a system with better retrieval. They're measuring different things.

Published LLM-as-Judge scores from competitors:
| System | Overall Score |
|--------|--------------|
| MemMachine | 84.9% |
| Mem0 (graph) | 68.5% |
| Mem0 | 66.9% |
| Zep (corrected) | 58.4% |
| LangMem | 58.1% |
| OpenAI Memory | 52.9% |

We can't put our number in that table yet because we haven't run the LLM-as-Judge step. We're adding it. When we do, the results go in the same public repo alongside everything else.

We could have estimated where we'd land, or presented our retrieval metrics alongside their answer metrics and hoped nobody noticed the difference. We chose not to.
## The Zep Incident

Speaking of benchmark honesty: Zep originally published an 84% LoCoMo score. Mem0's CTO found that Zep had included adversarial category answers in the numerator while excluding adversarial questions from the denominator — inflating their number significantly. The corrected score is 58.4%.
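The inflation mechanism is simple arithmetic. With purely hypothetical counts (not Zep's actual numbers), keeping adversarial answers in the numerator while dropping adversarial questions from the denominator looks like this:

```python
# Hypothetical (correct, total) counts per category — illustrative only.
results = {
    "single-hop":  (70, 100),
    "multi-hop":   (60, 100),
    "temporal":    (50, 100),
    "open-domain": (80, 100),
    "adversarial": (90, 100),
}

correct = sum(c for c, _ in results.values())  # 350
total = sum(t for _, t in results.values())    # 500
honest = correct / total                       # 0.70

# The flawed version: adversarial answers stay in the numerator,
# but adversarial questions vanish from the denominator.
inflated = correct / (total - results["adversarial"][1])  # 350 / 400

print(f"honest: {honest:.1%}, inflated: {inflated:.1%}")
# honest: 70.0%, inflated: 87.5%
```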
This is why we publish the code. Every query. Every result. The exact commands to reproduce the run. If our methodology is wrong, you can find it and tell us.
## What We're Improving

Based on these results, we're working on:

- **Better observation extraction** — more atomic facts per conversation, closing the single-hop gap
- **Temporal indexing** — separating document dates from event dates
- **LLM-as-Judge evaluation** — so we can compare apples-to-apples with published numbers
- **Cloud benchmarks with OpenAI embeddings** — local embeddings are good; cloud embeddings should be better

Each improvement gets benchmarked before and after, on the same dataset, with the same methodology. No "trust us, it's better now." Numbers or it didn't happen.
## Reproduce It Yourself

The entire benchmark suite is open source:

```bash
git clone https://github.com/basicmachines-co/basic-memory-benchmarks
cd basic-memory-benchmarks
uv sync --group dev
uv run bm-bench datasets fetch --dataset locomo
uv run bm-bench convert locomo
uv run bm-bench run retrieval --providers bm-local
```
Every run produces a manifest with provenance metadata: git SHA, BM version, dataset checksums, provider configurations. Full reproducibility, not "we ran it once on a good day."
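A manifest along those lines might look like this. The field names and layout here are our illustration, not the suite's actual schema — check the repo for the real format:

```python
import hashlib
import json

# Illustrative manifest shape — field names are assumptions.
manifest = {
    "git_sha": "62f7c77",
    "bm_version": "0.18.5",
    "dataset": {
        "name": "locomo",
        # Checksum of the fetched dataset file, so a re-run can verify
        # it is scoring against identical inputs.
        "sha256": hashlib.sha256(b"dataset bytes go here").hexdigest(),
    },
    "provider": {"name": "bm-local", "embeddings": "local"},
}

print(json.dumps(manifest, indent=2))
```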
[Benchmark repo →](https://github.com/basicmachines-co/basic-memory-benchmarks)

---
## Why This Matters

Benchmarks are how you know if a product actually works, or if it just has good marketing. The AI memory space is full of big claims and few receipts.

We'd rather show you a 76.4% that you can verify than an 85% that you can't. And when we improve that number — and we will — you'll be able to see exactly what changed and why.

That's the same philosophy behind the product itself. Plain text you can read. Results you can reproduce. No black boxes.

---

*Basic Memory is local-first AI knowledge infrastructure. Benchmarked, open source, plain text. [Get started →](https://basicmemory.com)*