Commit 62f7c77

blog: add benchmarks post — 'We Benchmark in the Open'

1 file changed: content/9.blog/4.benchmarks.md (+137, -0)
---
title: "We Benchmark in the Open"
description: "How Basic Memory performs on academic retrieval benchmarks — and why we publish everything, including the parts that aren't flattering."
---
Most AI memory products don't publish benchmarks. The ones that do tend to cherry-pick metrics, use proprietary evaluation setups, or — in at least one case — [get caught inflating their numbers](https://github.com/getzep/zep-papers/issues/5).

We decided to do it differently. We built a standalone, reproducible benchmark suite, ran it against an academic dataset, and published everything: the code, the results, the methodology, and the gaps.

Here's what we found.

---
## The Dataset: LoCoMo

[LoCoMo](https://snap-research.github.io/locomo/) is an academic benchmark from Snap Research designed to test long-conversation memory systems. It's 10 multi-session conversations with 1,982 questions across five categories:

- **Single-hop** — straightforward fact recall ("Where does Alice work?")
- **Multi-hop** — connecting facts across conversations ("Who works at the same company as the person who likes hiking?")
- **Temporal** — time-sensitive reasoning ("What was Bob's job before he switched in March?")
- **Open-domain** — broad knowledge retrieval
- **Adversarial** — questions designed to trip up memory systems

It's the same benchmark used by Mem0, Zep, MemMachine, and others. Common ground for comparison.
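To make the categories concrete, a question record can be pictured roughly like this. This is a hypothetical shape for illustration only — the field names, answer, and evidence pointers below are made up, and the actual LoCoMo files may be structured differently:

```python
from collections import Counter

# Hypothetical question record — illustrative only; the real LoCoMo
# schema may use different field names.
question = {
    "conversation_id": "conv-01",
    "category": "multi-hop",
    "question": "Who works at the same company as the person who likes hiking?",
    "answer": "Carol",  # made-up answer
    "evidence": ["session-2/turn-14", "session-5/turn-3"],
}

# Scoring a run means bucketing all 1,982 questions by category:
questions = [question]
per_category = Counter(q["category"] for q in questions)
print(per_category["multi-hop"])  # 1
```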
## Our Results

Basic Memory v0.18.5, running entirely locally on SQLite with local embeddings. No cloud APIs. No external services.

| Metric | Score |
|--------|-------|
| **Recall@5** | **76.4%** |
| **Recall@10** | **85.5%** |
| **MRR** | **0.658** |
| **Mean Latency** | **1,063ms** |
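For reference: Recall@K scores a query 1 if any gold document appears in the top K results, and MRR averages the reciprocal rank of the first hit. A minimal sketch of both metrics (our illustration here, not the benchmark suite's actual code):

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """1.0 if any gold document appears in the top-k results, else 0.0."""
    return 1.0 if set(ranked_ids[:k]) & set(gold_ids) else 0.0

def reciprocal_rank(ranked_ids, gold_ids):
    """1/rank of the first gold document retrieved, else 0.0."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0

# One query where the gold document comes back ranked second:
ranked = ["d9", "d3", "d7", "d1", "d4"]
gold = ["d3"]
print(recall_at_k(ranked, gold, 5))   # 1.0
print(reciprocal_rank(ranked, gold))  # 0.5
```

The published numbers are these per-query scores averaged over all 1,982 questions.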
By category (Recall@5):

| Category | Score |
|----------|-------|
| Open-domain | 86.6% |
| Multi-hop | 84.1% |
| Adversarial | 67.0% |
| Temporal | 59.1% |
| Single-hop | 57.7% |

We're strong on open-domain and multi-hop retrieval. The knowledge graph structure helps here — connecting facts across conversations is literally what a graph does. We're weaker on single-hop and temporal queries, and we know why.
## What's Working
**Multi-hop retrieval (84.1%)** is our standout. Questions that require connecting information across multiple conversations play to our architecture's strength. When you write a note about Alice's job and another about Alice's hiking trip, Basic Memory's relation graph links them. A query about "Alice's hobbies and career" traverses that graph.
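The idea can be sketched as a tiny relation graph. None of this is Basic Memory's actual internals — the node names and traversal are a toy illustration of why a graph helps multi-hop queries:

```python
from collections import defaultdict

graph = defaultdict(set)

def relate(a, b):
    # Relations are treated as bidirectional for traversal purposes.
    graph[a].add(b)
    graph[b].add(a)

relate("alice-job-note", "alice")      # note about Alice's job
relate("alice-hiking-note", "alice")   # note about her hiking trip

def neighborhood(start, depth=1):
    """Collect everything reachable within `depth` hops of a node."""
    seen, frontier = {start}, {start}
    for _ in range(depth):
        frontier = {n for node in frontier for n in graph[node]} - seen
        seen |= frontier
    return seen

# A query that resolves to the 'alice' entity pulls in both notes with a
# single hop — the cross-conversation connection is already materialized.
print(sorted(neighborhood("alice")))  # ['alice', 'alice-hiking-note', 'alice-job-note']
```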
**Open-domain (86.6%)** is strong because hybrid search — combining keyword matching with semantic similarity — handles broad queries well. The keyword side catches exact terms; the vector side catches related concepts.
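One common way to combine the two sides is reciprocal rank fusion: each ranked list contributes a score that decays with rank, so documents surfaced by both lists rise to the top. A sketch under that assumption — Basic Memory's actual fusion method may differ:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: merge several ranked lists into one."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["d1", "d2", "d3"]  # exact-term matches
vector_hits = ["d2", "d4", "d1"]   # semantically similar chunks

# d2 and d1 appear in both lists, so they outrank single-list hits.
print(rrf([keyword_hits, vector_hits]))  # ['d2', 'd1', 'd4', 'd3']
```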
**Fully local execution** matters more than it sounds. Every query runs against a local SQLite database with local embeddings. No network calls, no API keys, no per-query costs. The 1,063ms mean latency is end-to-end on consumer hardware.
## What's Not Working (Yet)
**Single-hop (57.7%)** is our biggest gap. This is basic fact recall — the kind of thing that should be easy. The issue is architectural: we store full conversations and rely on chunk matching, while competitors like Mem0 extract atomic facts before storing. Their approach creates precise, granular memory units. Ours preserves full context but makes pinpoint retrieval harder.
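The trade-off shows up even in a toy example. All the data below is made up, and the word-overlap scoring is a crude stand-in for real retrieval — it's meant to show why a chunk's match signal gets diluted while an atomic fact's does not:

```python
# Chunk-based storage keeps whole conversation passages (our approach);
# fact-based storage (Mem0-style) extracts atomic statements up front.
chunk = ("Alice mentioned she started at Acme in June, then talked about "
         "her weekend hike and her sister's wedding plans.")
facts = [
    "Alice works at Acme.",
    "Alice started at Acme in June.",
    "Alice went hiking last weekend.",
]

query = "Where does Alice work?"

def words(s):
    return {w.strip(".,?'") for w in s.lower().split()}

def density(text, q):
    """Fraction of the text's words that match the query — a crude
    stand-in for retrieval precision."""
    t = words(text)
    return len(t & words(q)) / len(t)

# The matching fact is nearly all signal; the chunk's match is diluted
# by unrelated content, which hurts pinpoint (single-hop) retrieval.
print(round(density(chunk, query), 3))                  # ≈ 0.056
print(max(round(density(f, query), 3) for f in facts))  # 0.25
```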
**Temporal (59.1%)** suffers because we don't yet distinguish between "when a note was created" and "when the event it describes happened." If you write today about a meeting that happened last Tuesday, Basic Memory timestamps today's date. Temporal queries need the event date. This is a solvable problem — we just haven't solved it yet.
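Fixing it means carrying two timestamps per note and filtering temporal queries on the right one. A hypothetical shape for that fix — this is not a shipped Basic Memory feature, and the field names are our own:

```python
from datetime import date

# Hypothetical dual-timestamp note: written today about last Tuesday's meeting.
note = {
    "text": "Met with the design team about the Q3 roadmap.",
    "created_at": date(2025, 6, 20),  # when the note was written
    "event_date": date(2025, 6, 17),  # when the meeting actually happened
}

def matches_timeframe(note, start, end):
    """Temporal queries should filter on the event date, falling back to
    the creation date only when no event date was extracted."""
    when = note.get("event_date") or note["created_at"]
    return start <= when <= end

# "What happened the week of June 16?" — hits via event_date,
# even though the note itself was created later.
print(matches_timeframe(note, date(2025, 6, 16), date(2025, 6, 19)))  # True
```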
## The Comparison Problem

Here's where we need to be honest about something the industry mostly isn't.

There are two ways to evaluate memory systems:

1. **Retrieval metrics** — Did the system find the right document? (Recall@K, MRR)
2. **LLM-as-Judge** — Given what was retrieved, did an LLM produce the correct answer? (Binary score from GPT-4o)

We currently measure (1). Most published competitor numbers use (2). **These are not directly comparable.**

A system with perfect retrieval but bad prompting scores high on (1) and low on (2). A system with mediocre retrieval but excellent answer generation could score higher on (2) than a system with better retrieval. They're measuring different things.

Published LLM-as-Judge scores from competitors:
| System | Overall Score |
|--------|--------------|
| MemMachine | 84.9% |
| Mem0 (graph) | 68.5% |
| Mem0 | 66.9% |
| Zep (corrected) | 58.4% |
| LangMem | 58.1% |
| OpenAI Memory | 52.9% |

We can't put our number in that table yet because we haven't run the LLM-as-Judge step. We're adding it. When we do, the results go in the same public repo alongside everything else.

We could have estimated where we'd land, or presented our retrieval metrics alongside their answer metrics and hoped nobody noticed the difference. We chose not to.
## The Zep Incident

Speaking of benchmark honesty: Zep originally published an 84% LoCoMo score. Mem0's CTO found that Zep had included adversarial category answers in the numerator while excluding adversarial questions from the denominator — inflating their number significantly. The corrected score is 58.4%.
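The inflation mechanism is simple arithmetic. With purely hypothetical counts (not Zep's actual numbers), keeping adversarial answers in the numerator while dropping adversarial questions from the denominator looks like this:

```python
# Hypothetical (correct, total) counts per category — illustrative only.
results = {
    "single-hop":  (70, 100),
    "multi-hop":   (60, 100),
    "temporal":    (50, 100),
    "open-domain": (80, 100),
    "adversarial": (90, 100),
}

correct = sum(c for c, _ in results.values())  # 350
total = sum(t for _, t in results.values())    # 500
honest = correct / total                       # 0.70

# The flawed version: adversarial answers stay in the numerator,
# but adversarial questions vanish from the denominator.
inflated = correct / (total - results["adversarial"][1])  # 350 / 400

print(f"honest: {honest:.1%}, inflated: {inflated:.1%}")
# honest: 70.0%, inflated: 87.5%
```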
This is why we publish the code. Every query. Every result. The exact commands to reproduce the run. If our methodology is wrong, you can find it and tell us.
## What We're Improving

Based on these results, we're working on:

- **Better observation extraction** — more atomic facts per conversation, closing the single-hop gap
- **Temporal indexing** — separating document dates from event dates
- **LLM-as-Judge evaluation** — so we can compare apples-to-apples with published numbers
- **Cloud benchmarks with OpenAI embeddings** — local embeddings are good; cloud embeddings should be better

Each improvement gets benchmarked before and after, on the same dataset, with the same methodology. No "trust us, it's better now." Numbers or it didn't happen.
## Reproduce It Yourself

The entire benchmark suite is open source:

```bash
git clone https://github.com/basicmachines-co/basic-memory-benchmarks
cd basic-memory-benchmarks
uv sync --group dev
uv run bm-bench datasets fetch --dataset locomo
uv run bm-bench convert locomo
uv run bm-bench run retrieval --providers bm-local
```
Every run produces a manifest with provenance metadata: git SHA, BM version, dataset checksums, provider configurations. Full reproducibility, not "we ran it once on a good day."
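A manifest along those lines might look like this. The field names and layout here are our illustration, not the suite's actual schema — check the repo for the real format:

```python
import hashlib
import json

# Illustrative manifest shape — field names are assumptions.
manifest = {
    "git_sha": "62f7c77",
    "bm_version": "0.18.5",
    "dataset": {
        "name": "locomo",
        # Checksum of the fetched dataset file, so a re-run can verify
        # it is scoring against identical inputs.
        "sha256": hashlib.sha256(b"dataset bytes go here").hexdigest(),
    },
    "provider": {"name": "bm-local", "embeddings": "local"},
}

print(json.dumps(manifest, indent=2))
```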
[Benchmark repo →](https://github.com/basicmachines-co/basic-memory-benchmarks)

---
## Why This Matters

Benchmarks are how you know if a product actually works, or if it just has good marketing. The AI memory space is full of big claims and few receipts.

We'd rather show you a 76.4% that you can verify than an 85% that you can't. And when we improve that number — and we will — you'll be able to see exactly what changed and why.

That's the same philosophy behind the product itself. Plain text you can read. Results you can reproduce. No black boxes.

---

*Basic Memory is local-first AI knowledge infrastructure. Benchmarked, open source, plain text. [Get started →](https://basicmemory.com)*