
Commit ef7dd2b (parent ef3f932)

benchmark: add competitive comparison and standalone suite spec

- LoCoMo competitive landscape: MemMachine 84.9%, Mem0 66.9%, Zep 58.4%
- BM Local baseline: 76.4% R@5 (retrieval; needs LLM-as-Judge for comparison)
- Spec for standalone basic-memory-bench repo (Python, provider abstraction)

2 files changed: +394 -0 lines
benchmark/COMPETITIVE.md

Lines changed: 130 additions & 0 deletions
# LoCoMo Benchmark — Competitive Comparison

## Important Context

There are two different evaluation approaches in use across the industry:

1. **Retrieval metrics** (what we currently measure): Did the system find the right document? Recall@K, MRR, Precision@K.
2. **LLM-as-Judge** (what Mem0, Zep, and MemMachine measure): Given the retrieved context, did the LLM produce the correct answer? Binary 0/1, scored by GPT-4o.

These are NOT directly comparable. A system with perfect retrieval but bad prompting would score high on (1) and low on (2); a system with mediocre retrieval but excellent prompting could score higher on (2) than on (1).

**Our next step should be adding LLM-as-Judge evaluation so we can compare apples to apples.**

## Published LoCoMo Results (LLM-as-Judge Score)

### Overall Scores

Bold marks the best score in each column:

| System | Overall | single_hop | multi_hop | temporal | open_domain | Notes |
|--------|---------|------------|-----------|----------|-------------|-------|
| **MemMachine** | **84.9%** | **93.3%** | **80.5%** | **72.6%** | 64.6% | Best overall. MacBook Pro M3, uses OpenAI API |
| **Mem0ᵍ** (graph) | 68.5% | 65.7% | 47.2% | 58.1% | **75.7%** | Best open-domain; graph edges help temporal too |
| **Mem0** | 66.9% | 67.1% | 51.1% | 55.5% | 72.9% | Best accuracy/speed/cost balance |
| **Zep** (corrected) | 58.4% | | | | | Originally claimed 84%; Mem0 caught the calculation error |
| **LangMem** | 58.1% | 62.2% | 47.9% | 23.4% | 71.1% | OSS; ~60s latency (unusable in practice) |
| **OpenAI Memory** | 52.9% | 63.8% | 42.9% | 21.7% | 62.3% | Fastest, but shallow recall |

Sources:
- Mem0: arxiv.org/pdf/2504.19413, mem0.ai/blog
- MemMachine: memmachine.ai/blog (Sep 2025)
- Zep correction: github.com/getzep/zep-papers/issues/5 (Mem0 found Zep inflated scores by including the adversarial category incorrectly)

### Category Mapping Note

LoCoMo categories are numbered 1-5. Different vendors map them differently:
- Categories 1-4 are scored. Category 5 (adversarial) is excluded from official scoring.
- MemMachine swapped category IDs relative to Mem0 (their cat 1 = multi_hop, cat 4 = single_hop).
- We used the original Snap Research mapping in our benchmarks.

## Our Results (Retrieval Metrics — NOT directly comparable)

### Basic Memory Local (v0.18.5) — 1,982 queries, all 10 conversations

| Metric | Value |
|--------|-------|
| Recall@5 | 76.4% |
| Recall@10 | 85.5% |
| MRR | 0.658 |
| Content Hit Rate | 25.4% |
| Mean Latency | 1,063ms |

### By Category (Retrieval — Recall@5)

| Category | N | BM Local R@5 |
|----------|---|--------------|
| open_domain | 841 | 86.6% |
| multi_hop | 321 | 84.1% |
| adversarial | 446 | 67.0% |
| temporal | 92 | 59.1% |
| single_hop | 282 | 57.7% |

## Gap Analysis

### Where we're strong (relative to competitors)

- **Multi-hop: 84.1% retrieval** — Our graph structure helps here. Mem0 scores 51.1% on multi-hop answer quality, suggesting their multi-hop retrieval may be weaker than ours.
- **Open-domain: 86.6% retrieval** — Strong baseline. All competitors score 62-76% on answer quality.
- **Local-first, no API costs** — Every competitor except LangMem requires cloud APIs. We run on SQLite.
- **Transparent** — All our data is plain text. You can see exactly what the system retrieved and why.

### Where we need to improve

- **Single-hop: 57.7% retrieval** — MemMachine gets 93.3% answer quality on single-hop. This is our biggest gap. They likely have better chunk-level fact extraction.
- **Temporal: 59.1% retrieval** — Mem0ᵍ gets 58.1% answer quality (similar!), but Supermemory and MemMachine do better with explicit temporal metadata. We need date-aware scoring.
- **Content Hit Rate: 25.4%** — We find the right notes but don't always surface the exact answer text. Better chunk extraction needed.
- **No LLM-as-Judge yet** — We can't directly compare to published numbers without this step.

### Architectural observations

1. **Mem0's selective extraction is key to their single-hop performance** — they extract "important sentences" before storing, creating atomic memory units. We store full conversations and rely on chunk matching. This is a fundamental tradeoff: their approach loses context, ours preserves it.

2. **MemMachine's multi-search approach** — they allow the agent to perform multiple memory searches per question. We do a single search. Multi-round retrieval could help.

3. **Supermemory's dual timestamps** — `documentDate` (when stored) vs `eventDate` (when it happened). We only have document dates. Adding event-date extraction could close the temporal gap.

4. **Zep's benchmark scandal** — They claimed 84% by including adversarial-category answers in the numerator while excluding adversarial questions from the denominator. Mem0's CTO publicly called this out. Lesson: benchmark integrity matters. We should be scrupulously honest.

5. **Everyone uses OpenAI cloud embeddings** (and GPT-4o for judging) — we use local sentence-transformers. Cloud BM with OpenAI embeddings should close the quality gap significantly.

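The inflation in observation 4 is plain arithmetic. Using the category counts from our table above, with hypothetical correct-answer counts (the 800/400 figures below are illustrative only, not Zep's actual numbers):

```python
# LoCoMo category counts (from the table above); category 5 (adversarial)
# is excluded from official scoring.
scored_questions = 282 + 321 + 92 + 841   # categories 1-4 -> 1,536
adversarial_questions = 446

# Hypothetical correct-answer counts, purely illustrative:
correct_scored = 800
correct_adversarial = 400  # adversarial "refuse to answer" items are easy points

honest = correct_scored / scored_questions
inflated = (correct_scored + correct_adversarial) / scored_questions
print(f"honest={honest:.1%}  inflated={inflated:.1%}")  # honest=52.1%  inflated=78.1%
```

Counting easy adversarial wins in the numerator while keeping only categories 1-4 in the denominator inflates the score by 26 points in this illustration.
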
## Supermemory's LongMemEval Results

Supermemory focuses on LongMemEval (ICLR 2025) instead of LoCoMo:

| Category | Supermemory |
|----------|-------------|
| single-session-user | ~65% |
| single-session-assistant | ~55% |
| single-session-preference | ~45% |
| multi-session | 71.4% |
| knowledge-update | ~60% |
| temporal-reasoning | 76.7% |

Key architectural differences:
- Chunk-based ingestion with contextual memory generation (resolves ambiguous references)
- Relational versioning: `updates`, `extends`, `derives` links between memories
- Dual timestamps: `documentDate` + `eventDate`
- Hybrid search on atomic memories, then inject the source chunk for detail

## Recommendations for Basic Memory

### Short-term (improve current numbers)

1. **Fix RRF scoring (#577)** — Hybrid search flattening is destroying ranking quality.
2. **Better observation extraction in the converter** — More atomic facts per session.
3. **Use matched_chunk in scoring** — Already improved content hit rate from 14% to 85% on the synthetic corpus.

### Medium-term (close competitive gaps)

4. **Add LLM-as-Judge evaluation** — Required for direct comparison.
5. **Cloud benchmark with OpenAI embeddings** — Should significantly improve vector quality.
6. **Multi-round retrieval** — Allow follow-up searches per query (like MemMachine).
7. **Event date extraction** — Separate "when stored" from "when it happened".

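Recommendation 7 needs only a light extraction pass at ingest time. A minimal sketch using a regex for explicit dates (a real implementation would want a proper date parser; the pattern below only catches `Month Day, Year` forms and is purely illustrative):

```python
import re
from datetime import datetime

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE_RE = re.compile(rf"\b({MONTHS}) (\d{{1,2}}), (\d{{4}})\b")

def extract_event_dates(text: str) -> list[str]:
    """Find explicit 'Month Day, Year' mentions and return them as ISO dates."""
    dates = []
    for month, day, year in DATE_RE.findall(text):
        dt = datetime.strptime(f"{month} {day} {year}", "%B %d %Y")
        dates.append(dt.date().isoformat())
    return dates

note = "We adopted Max on May 20, 2023, right after moving."
print(extract_event_dates(note))  # → ['2023-05-20']
```

Dates extracted this way would populate an `eventDate` field alongside the existing document date, enabling date-aware scoring.
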
### Long-term (differentiation)

8. **Transparent benchmarking** — Publish everything, fully reproducible, no games. "We benchmark in the open."
9. **User-editable memory as an advantage** — Our memories are plain text files. Users can correct, augment, and reorganize them. No competitor offers this.
10. **Schema-validated memories** — Picoschema ensures consistency. No competitor has this.

## What to publish

For the README, I'd suggest something like:

> **Retrieval Quality on LoCoMo (academic benchmark)**
> - 76.4% Recall@5 across 1,982 questions
> - 85.5% Recall@10
> - ~1 second mean latency (1,063ms)
> - Runs entirely local on SQLite — no cloud API required
> - [Reproduce these results →](link-to-benchmark-repo)

We should NOT claim direct comparison with Mem0/MemMachine until we add LLM-as-Judge. But we CAN say we benchmark on the same datasets they use, which is more than most tools do.


benchmark/SPEC.md

Lines changed: 264 additions & 0 deletions
# SPEC: Basic Memory Benchmark Suite

## Summary

A standalone benchmark suite for evaluating retrieval quality across Basic Memory deployments (local and cloud) and competitors. Uses academic datasets (LoCoMo, LongMemEval) with standardized metrics. Designed to be publicly shareable, runnable by anyone, and integrated into CI.

## Motivation

1. **Internal quality tracking** — run benchmarks before/after every BM release to catch regressions and measure improvements
2. **Cloud vs Local comparison** — validate that BM Cloud's better embeddings (e.g. OpenAI text-embedding-3) produce measurably better retrieval
3. **Public credibility** — publish reproducible numbers on academic benchmarks that anyone can verify
4. **Marketing content** — "we benchmark in the open" blog post, README stats, comparison tables
5. **Competitive positioning** — compare against Mem0, Supermemory, and Zep on the same datasets they use

## Architecture

```
basic-memory-bench/
├── README.md                # How to install, run, and interpret results
├── datasets/
│   ├── locomo/
│   │   ├── download.sh      # Fetches locomo10.json from snap-research/locomo
│   │   └── README.md        # Dataset description, citation, license
│   └── longmemeval/
│       ├── download.sh      # Fetches from HuggingFace
│       └── README.md
├── converters/
│   ├── locomo_to_bm.py      # LoCoMo JSON → BM markdown notes
│   ├── longmemeval_to_bm.py # LongMemEval → BM markdown notes
│   └── base.py              # Shared conversion utilities
├── harness/
│   ├── run.py               # Main benchmark runner
│   ├── scoring.py           # Recall@K, MRR, Precision@K, content hit rate
│   ├── judge.py             # LLM-as-Judge evaluation (for answer quality)
│   └── report.py            # Generate markdown/JSON reports
├── providers/
│   ├── bm_local.py          # Basic Memory local (via MCP stdio)
│   ├── bm_cloud.py          # Basic Memory Cloud (via API)
│   ├── mem0.py              # Mem0 API (optional, needs API key)
│   └── base.py              # Provider interface
├── results/                 # Saved benchmark runs (gitignored except baselines)
│   └── baselines/
│       └── bm-local-locomo-v0.18.5.json  # Published baseline results
├── pyproject.toml           # Python package (uv/pip installable)
└── justfile                 # Common commands
```

## Key Design Decisions

### Python, not TypeScript

The current harness is TypeScript (in the plugin repo) because it was built there first. The standalone suite should be Python because:
- BM is Python — same ecosystem, same contributors
- The BM importer framework (`basic_memory.importers`) is Python
- Academic researchers use Python
- Conversion scripts can use BM's `EntityMarkdown` types directly
- `uv run` makes it trivially installable

### Use BM's importer framework for conversion

Instead of raw string concatenation, converters should produce proper `EntityMarkdown` objects and write them via `MarkdownProcessor`. This ensures:
- Canonical frontmatter format
- Proper permalink generation
- Identical output to what a real BM user would have
- Consistency with the ChatGPT/Claude importers

### Provider abstraction

Each provider implements a simple interface:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SearchResult:
    """A single ranked hit. (Field names here are illustrative.)"""
    doc_id: str
    score: float
    content: str = ""


class BenchmarkProvider(ABC):
    @abstractmethod
    async def ingest(self, corpus_path: Path, project: str) -> None:
        """Index a corpus of markdown files."""

    @abstractmethod
    async def search(self, query: str, limit: int = 10) -> list[SearchResult]:
        """Search and return ranked results."""

    @abstractmethod
    async def cleanup(self, project: str) -> None:
        """Remove indexed data."""
```

BM Local drives `bm mcp` over stdio (like the current harness).
BM Cloud uses the cloud API directly.
Mem0/Supermemory use their respective APIs (optional, needs keys).

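To show the shape of a provider end to end, here is a toy in-memory implementation and a harness-style call sequence. The `InMemoryProvider` and its substring scoring are illustrative only; real providers implement the `BenchmarkProvider` ABC above against BM local/cloud or vendor APIs:

```python
import asyncio

class InMemoryProvider:
    """Toy provider: exact-substring search over an in-memory corpus.
    Illustrative only; real providers talk to BM local/cloud or vendor APIs."""

    def __init__(self) -> None:
        self.docs: dict[str, str] = {}

    async def ingest(self, docs: dict[str, str]) -> None:
        self.docs.update(docs)

    async def search(self, query: str, limit: int = 10) -> list[tuple[str, float]]:
        hits = [(doc_id, 1.0) for doc_id, text in self.docs.items()
                if query.lower() in text.lower()]
        return hits[:limit]

async def demo() -> list[tuple[str, float]]:
    provider = InMemoryProvider()
    await provider.ingest({"note-1": "Caroline adopted a dog named Max in May",
                           "note-2": "Melanie moved to Berlin"})
    return await provider.search("adopted a dog")

print(asyncio.run(demo()))  # → [('note-1', 1.0)]
```

The harness only ever sees `ingest`/`search`/`cleanup`, so swapping backends is a one-line change in the runner config.
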
### Two evaluation modes

**Retrieval evaluation** (what we have now):
- Did we find the right note in the top K results?
- Metrics: Recall@5, Recall@10, Precision@5, MRR, Content Hit Rate
- Fast, deterministic, no LLM cost

**Answer evaluation** (needed for Mem0 comparison):
- Given the retrieved context, does the LLM produce the correct answer?
- Uses LLM-as-Judge (configurable: GPT-4o, Claude, Gemini)
- Metrics: accuracy, factual correctness, hallucination rate
- Slower, costs money, but directly comparable to Mem0's published numbers

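For `judge.py`, the core can be a single prompt plus a binary parse, with the model call injected as a callable so GPT-4o/Claude/Gemini backends are interchangeable. A sketch (prompt wording and function names are assumptions, not a fixed spec):

```python
from typing import Callable

JUDGE_PROMPT = """You are grading a memory QA system.
Question: {question}
Gold answer: {gold}
System answer: {answer}
Reply with exactly one word: CORRECT or INCORRECT."""

def judge_answer(question: str, gold: str, answer: str,
                 call_llm: Callable[[str], str]) -> int:
    """Return 1 if the judge labels the answer correct, else 0 (binary, Mem0-style)."""
    prompt = JUDGE_PROMPT.format(question=question, gold=gold, answer=answer)
    verdict = call_llm(prompt).strip().upper()
    return 1 if verdict.startswith("CORRECT") else 0

# Stubbed judge for testing; a real run would call GPT-4o here.
fake_llm = lambda prompt: "CORRECT" if "Berlin" in prompt else "INCORRECT"
print(judge_answer("Where did Bob move?", "Berlin", "He moved to Berlin", fake_llm))  # → 1
```

Keeping the judge behind a plain callable also makes the mode cheap to unit-test without any API key.
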
### Corpus generation is reproducible

```bash
# Download dataset
just download-locomo

# Convert to BM format (deterministic, no randomness)
just convert-locomo

# Index into a BM project
just index-locomo

# Run benchmark
just bench-locomo

# Or all at once
just full-locomo
```

Anyone cloning the repo gets identical results (modulo embedding model differences).

## Datasets

### LoCoMo (primary)
- **Source:** snap-research/locomo (ACL 2024)
- **Size:** 10 conversations, ~300 turns each, 1,986 QA pairs
- **Categories:** single-hop (282), multi-hop (321), temporal (92), open-domain (841), adversarial (446)
- **Why:** Most-cited memory benchmark. Mem0 publishes numbers on it. Direct comparison possible.
- **License:** Research use

### LongMemEval (secondary)
- **Source:** xiaowu0162/LongMemEval (ICLR 2025)
- **Size:** Longer conversations, more complex memory tasks
- **Categories:** knowledge update, knowledge retention, temporal reasoning, multi-session
- **Why:** Supermemory uses it. More challenging than LoCoMo. Tests different capabilities.
- **License:** Research use

### Synthetic (included, for fast iteration)
- **Source:** Our hand-crafted corpus (already in the plugin repo)
- **Size:** 11 files, 38 queries, 9 categories
- **Why:** Fast to run (<30s), good for CI smoke tests, covers BM-specific patterns (task recall, wiki-link traversal)

## Conversion Strategy

LoCoMo conversations → BM notes that look like real agent memory:

1. **Session notes** — one markdown file per conversation session, dated, with frontmatter
2. **Observations** — extracted per-speaker observations become tagged `[speaker] fact` entries
3. **People notes** — one note per speaker, with relations
4. **MEMORY.md** — accumulated summary of key facts (like a real agent's working memory)
5. **Relations** — wiki-links between sessions, people, and topics

This mirrors how a real BM-powered agent would accumulate knowledge over time.

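A minimal sketch of what items 1 and 2 produce for a single session. The frontmatter keys, speaker names, and layout are illustrative; the real converter should emit `EntityMarkdown` objects via `MarkdownProcessor` as the design decisions above require:

```python
from datetime import date

def session_note(session_id: str, day: date,
                 speaker_facts: list[tuple[str, str]]) -> str:
    """Render one conversation session as a BM-style markdown note."""
    lines = [
        "---",
        f"title: Session {session_id}",
        f"created: {day.isoformat()}",  # document date; event-date extraction is future work
        "type: conversation",
        "---",
        "",
        "## Observations",
    ]
    # Tagged "[speaker] fact" entries, one per extracted observation
    lines += [f"- [{speaker}] {fact}" for speaker, fact in speaker_facts]
    return "\n".join(lines)

note = session_note("D1-S1", date(2023, 5, 20),
                    [("Caroline", "adopted a dog named Max"),
                     ("Melanie", "is training for a marathon")])
print(note)
```

The tagged-observation format matters: per the baseline findings, FTS matches these `[speaker] fact` lines better than vector search does.
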
## Metrics

| Metric | Description | Use |
|--------|-------------|-----|
| Recall@K | Fraction of relevant docs in top K | Primary retrieval quality |
| MRR | Reciprocal rank of first relevant result | Ranking quality |
| Precision@K | Fraction of top K that are relevant | Result quality |
| Content Hit Rate | Expected answer text found in results | Chunk quality |
| Mean Latency | Average query time | Performance |
| P95 Latency | 95th-percentile query time | Tail performance |
| LLM-Judge Score | Answer correctness rated by an LLM | Answer quality (comparable to Mem0) |

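The first four metrics reduce to a few lines over ranked document IDs. A sketch of what `scoring.py`'s core might look like (function names are illustrative):

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant docs that appear in the top K results."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(ranked: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result (0 if none retrieved)."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def content_hit(ranked_chunks: list[str], expected: str, k: int = 5) -> bool:
    """Did the expected answer text surface verbatim in a top-K chunk?"""
    return any(expected.lower() in chunk.lower() for chunk in ranked_chunks[:k])

ranked = ["note-3", "note-1", "note-7"]
print(recall_at_k(ranked, {"note-1"}, k=5), mrr(ranked, {"note-1"}))  # → 1.0 0.5
```

Because these are deterministic, they can run on every commit; only the LLM-Judge score needs a model call.
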
## Current Baseline (BM Local, v0.18.5)

From our full 10-conversation LoCoMo run (1,982 queries):

| Metric | Value |
|--------|-------|
| Recall@5 | 76.4% |
| Recall@10 | 85.5% |
| MRR | 0.658 |
| Content Hit Rate | 25.4% |
| Mean Latency | 1,063ms |

By category:

| Category | N | R@5 |
|----------|---|-----|
| open_domain | 841 | 86.6% |
| multi_hop | 321 | 84.1% |
| adversarial | 446 | 67.0% |
| temporal | 92 | 59.1% |
| single_hop | 282 | 57.7% |

### Known improvement opportunities

1. **RRF scoring is broken** — hybrid search flattens all scores to ~0.016, destroying ranking (issue #577)
2. **Single-hop weakness** — specific fact lookups need better chunk-level matching
3. **Temporal weakness** — date-aware scoring or temporal indexing needed
4. **FTS finds observations that vector search misses** — tagged observations like `[speaker] fact` are matched better by FTS

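Item 1 is easy to reproduce: with the conventional k=60, plain Reciprocal Rank Fusion assigns every top-ranked document roughly 1/(60+rank) ≈ 0.016 per ranker, discarding the underlying score magnitudes entirely. A sketch (the fusion formula is standard RRF; the example inputs are illustrative):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Reciprocal Rank Fusion: score(d) = sum over rankers of 1 / (k + rank(d))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores

fts_ranking    = ["a", "b", "c"]   # suppose FTS was very confident about "a"
vector_ranking = ["b", "a", "c"]   # suppose vector search strongly preferred "b"
fused = rrf([fts_ranking, vector_ranking])
# "a" and "b" tie exactly at 1/61 + 1/62 ≈ 0.0325; "c" sits at 2/63 ≈ 0.0317.
# All confidence information from the underlying ranker scores is gone.
```

Any fix (score-weighted fusion, smaller k, or normalized raw scores) has to reintroduce magnitude information that plain RRF deliberately throws away.
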
## Cloud Comparison Plan

BM Cloud should outperform local because:
- Better embedding models (OpenAI text-embedding-3 vs local sentence-transformers)
- PostgreSQL + pgvector vs SQLite + sqlite-vec
- Server-grade hardware vs a laptop

Expected improvements:
- Higher vector similarity scores → better ranking
- Better semantic matching → improved single-hop and temporal
- Lower latency (dedicated infra)

To test: run the same benchmark with the `bm_cloud.py` provider pointing at the cloud API. Same corpus, same queries, different backend.

## CI Integration

```yaml
# .github/workflows/benchmark.yml
name: Benchmark
on:
  push:
    branches: [main]
  workflow_dispatch:

jobs:
  bench:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v4
      - run: uv sync
      - run: just download-locomo
      - run: just convert-locomo
      - run: just index-locomo
      - run: just bench-locomo --output results/ci-latest.json
      # Fail if Recall@5 drops more than 2 points from the baseline
      - run: just compare-baseline results/ci-latest.json results/baselines/latest.json
```

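The `compare-baseline` step can be a small script over the two result files. A sketch (the `recall_at_5` JSON key and the 2-point threshold semantics are assumptions about the result-file format):

```python
import json
from pathlib import Path

def compare_baseline(current: dict, baseline: dict, max_drop: float = 2.0) -> bool:
    """True if Recall@5 has not regressed more than max_drop points vs the baseline."""
    drop = baseline["recall_at_5"] - current["recall_at_5"]
    return drop <= max_drop

def main(current_path: str, baseline_path: str) -> int:
    """CI entry point: exit non-zero on regression so the workflow step fails."""
    current = json.loads(Path(current_path).read_text())
    baseline = json.loads(Path(baseline_path).read_text())
    return 0 if compare_baseline(current, baseline) else 1

print(compare_baseline({"recall_at_5": 75.0}, {"recall_at_5": 76.4}))  # → True
print(compare_baseline({"recall_at_5": 73.0}, {"recall_at_5": 76.4}))  # → False
```

Returning a non-zero exit code is all GitHub Actions needs to mark the run red.
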
## Blog Post Angle

"We Benchmark in the Open"

- Here are our numbers. Here's how to reproduce them.
- We use academic datasets, not synthetic benchmarks we designed to win.
- Clone the repo, run `just full-locomo`, get the same results.
- We publish baselines with every release so you can track improvement over time.
- This is what "build things worth keeping" looks like.

## Implementation Plan

### Phase 1: Repo setup + LoCoMo
- Create the `basic-memory-bench` repo
- Port the LoCoMo converter from TypeScript to Python (using the BM importer framework)
- Port the harness from TypeScript to Python
- Publish baseline results
- README with full instructions

### Phase 2: Cloud provider + LongMemEval
- Add the BM Cloud provider
- Run the cloud vs local comparison
- Add the LongMemEval dataset + converter
- Publish comparison results

### Phase 3: LLM-Judge + competitors
- Add the answer-evaluation mode
- Compare directly to Mem0's published LoCoMo numbers
- Optional: add Mem0/Supermemory providers for head-to-head
- Blog post with results

### Phase 4: CI + public dashboard
- GitHub Actions workflow for automated benchmarking
- Results dashboard (could be a BM Cloud MDX dashboard note!)
- Community contributions: custom datasets, new providers
