
Commit ef7dd2b (parent ef3f932)

benchmark: add competitive comparison and standalone suite spec

- LoCoMo competitive landscape: MemMachine 84.9%, Mem0 66.9%, Zep 58.4%
- BM Local baseline: 76.4% R@5 (retrieval; needs LLM-as-Judge for comparison)
- Spec for standalone basic-memory-bench repo (Python, provider abstraction)

2 files changed: +394 -0 lines
benchmark/COMPETITIVE.md

Lines changed: 130 additions & 0 deletions
# LoCoMo Benchmark — Competitive Comparison

## Important Context

There are two different evaluation approaches in use across the industry:

1. **Retrieval metrics** (what we currently measure): Did the system find the right document? Recall@K, MRR, Precision@K.
2. **LLM-as-Judge** (what Mem0, Zep, and MemMachine measure): Given the retrieved context, did the LLM produce the correct answer? Binary 0/1, scored by GPT-4o.

These are NOT directly comparable. A system with perfect retrieval but bad prompting would score high on (1) and low on (2); a system with mediocre retrieval but excellent prompting could score higher on (2) than on (1).

**Our next step should be adding LLM-as-Judge evaluation so we can compare apples to apples.**

## Published LoCoMo Results (LLM-as-Judge Score)

### Overall Scores

Bold marks the best score in each column:

| System | Overall | single_hop | multi_hop | temporal | open_domain | Notes |
|--------|---------|------------|-----------|----------|-------------|-------|
| **MemMachine** | **84.9%** | **93.3%** | **80.5%** | **72.6%** | 64.6% | Best overall. MacBook Pro M3, uses OpenAI API |
| **Mem0ᵍ** (graph) | 68.5% | 65.7% | 47.2% | 58.1% | **75.7%** | Best open-domain; graph edges help temporal too |
| **Mem0** | 66.9% | 67.1% | 51.1% | 55.5% | 72.9% | Best accuracy/speed/cost balance |
| **Zep** (corrected) | 58.4% | | | | | Originally claimed 84%; Mem0 caught the calculation error |
| **LangMem** | 58.1% | 62.2% | 47.9% | 23.4% | 71.1% | OSS; ~60s latency (unusable in practice) |
| **OpenAI Memory** | 52.9% | 63.8% | 42.9% | 21.7% | 62.3% | Fastest, but shallow recall |

Sources:
- Mem0: arxiv.org/pdf/2504.19413, mem0.ai/blog
- MemMachine: memmachine.ai/blog (Sep 2025)
- Zep correction: github.com/getzep/zep-papers/issues/5 (Mem0 found Zep inflated scores by including the adversarial category incorrectly)

### Category Mapping Note

LoCoMo categories are numbered 1-5. Different vendors map them differently:
- Categories 1-4 are scored. Category 5 (adversarial) is excluded from official scoring.
- MemMachine swapped category IDs relative to Mem0 (their cat 1 = multi_hop, cat 4 = single_hop).
- We used the original Snap Research mapping in our benchmarks.

## Our Results (Retrieval Metrics — NOT directly comparable)

### Basic Memory Local (v0.18.5) — 1,982 queries, all 10 conversations

| Metric | Value |
|--------|-------|
| Recall@5 | 76.4% |
| Recall@10 | 85.5% |
| MRR | 0.658 |
| Content Hit Rate | 25.4% |
| Mean Latency | 1,063ms |

### By Category (Retrieval — Recall@5)

| Category | N | BM Local R@5 |
|----------|---|--------------|
| open_domain | 841 | 86.6% |
| multi_hop | 321 | 84.1% |
| adversarial | 446 | 67.0% |
| temporal | 92 | 59.1% |
| single_hop | 282 | 57.7% |

## Gap Analysis

### Where we're strong (relative to competitors)

- **Multi-hop: 84.1% retrieval** — Our graph structure helps here. Mem0 scores 51.1% on multi-hop answer quality, suggesting their multi-hop retrieval may be weaker than ours.
- **Open-domain: 86.6% retrieval** — Strong baseline. All competitors score 62-76% on answer quality.
- **Local-first, no API costs** — Every competitor except LangMem requires cloud APIs. We run on SQLite.
- **Transparent** — All our data is plain text. You can see exactly what the system retrieved and why.

### Where we need to improve

- **Single-hop: 57.7% retrieval** — MemMachine gets 93.3% answer quality on single-hop. This is our biggest gap. They likely have better chunk-level fact extraction.
- **Temporal: 59.1% retrieval** — Mem0ᵍ gets 58.1% answer quality (similar!), but Supermemory and MemMachine do better with explicit temporal metadata. We need date-aware scoring.
- **Content Hit Rate: 25.4%** — We find the right notes but don't always surface the exact answer text. Better chunk extraction needed.
- **No LLM-as-Judge yet** — We can't directly compare to published numbers without this step.

### Architectural observations

1. **Mem0's selective extraction is key to their single-hop performance** — they extract "important sentences" before storing, creating atomic memory units. We store full conversations and rely on chunk matching. This is a fundamental tradeoff: their approach loses context, ours preserves it.

2. **MemMachine's multi-search approach** — they allow the agent to perform multiple memory searches per question. We do a single search. Multi-round retrieval could help.

3. **Supermemory's dual timestamps** — `documentDate` (when stored) vs `eventDate` (when it happened). We only have document dates. Adding event-date extraction could close the temporal gap.

4. **Zep's benchmark scandal** — They claimed 84% by including adversarial-category answers in the numerator while excluding adversarial questions from the denominator. Mem0's CTO publicly called this out. Lesson: benchmark integrity matters. We should be scrupulously honest.

5. **Everyone uses OpenAI cloud embeddings** (and GPT-4o for judging) — we use local sentence-transformers. Cloud BM with OpenAI embeddings should close the quality gap significantly.

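The inflation in observation 4 is plain arithmetic. Using the category counts from our table above, with hypothetical correct-answer counts (the 800/400 figures below are illustrative only, not Zep's actual numbers):

```python
# LoCoMo category counts (from the table above); category 5 (adversarial)
# is excluded from official scoring.
scored_questions = 282 + 321 + 92 + 841   # categories 1-4 -> 1,536
adversarial_questions = 446

# Hypothetical correct-answer counts, purely illustrative:
correct_scored = 800
correct_adversarial = 400  # adversarial "refuse to answer" items are easy points

honest = correct_scored / scored_questions
inflated = (correct_scored + correct_adversarial) / scored_questions
print(f"honest={honest:.1%}  inflated={inflated:.1%}")  # honest=52.1%  inflated=78.1%
```

Counting easy adversarial wins in the numerator while keeping only categories 1-4 in the denominator inflates the score by 26 points in this illustration.
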
## Supermemory's LongMemEval Results

Supermemory focuses on LongMemEval (ICLR 2025) instead of LoCoMo:

| Category | Supermemory |
|----------|-------------|
| single-session-user | ~65% |
| single-session-assistant | ~55% |
| single-session-preference | ~45% |
| multi-session | 71.4% |
| knowledge-update | ~60% |
| temporal-reasoning | 76.7% |

Key architectural differences:
- Chunk-based ingestion with contextual memory generation (resolves ambiguous references)
- Relational versioning: `updates`, `extends`, `derives` links between memories
- Dual timestamps: `documentDate` + `eventDate`
- Hybrid search on atomic memories, then inject the source chunk for detail

## Recommendations for Basic Memory

### Short-term (improve current numbers)

1. **Fix RRF scoring (#577)** — Hybrid search flattening is destroying ranking quality.
2. **Better observation extraction in the converter** — More atomic facts per session.
3. **Use matched_chunk in scoring** — Already improved content hit rate from 14% to 85% on the synthetic corpus.

### Medium-term (close competitive gaps)

4. **Add LLM-as-Judge evaluation** — Required for direct comparison.
5. **Cloud benchmark with OpenAI embeddings** — Should significantly improve vector quality.
6. **Multi-round retrieval** — Allow follow-up searches per query (like MemMachine).
7. **Event date extraction** — Separate "when stored" from "when it happened".

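Recommendation 7 needs only a light extraction pass at ingest time. A minimal sketch using a regex for explicit dates (a real implementation would want a proper date parser; the pattern below only catches `Month Day, Year` forms and is purely illustrative):

```python
import re
from datetime import datetime

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE_RE = re.compile(rf"\b({MONTHS}) (\d{{1,2}}), (\d{{4}})\b")

def extract_event_dates(text: str) -> list[str]:
    """Find explicit 'Month Day, Year' mentions and return them as ISO dates."""
    dates = []
    for month, day, year in DATE_RE.findall(text):
        dt = datetime.strptime(f"{month} {day} {year}", "%B %d %Y")
        dates.append(dt.date().isoformat())
    return dates

note = "We adopted Max on May 20, 2023, right after moving."
print(extract_event_dates(note))  # → ['2023-05-20']
```

Dates extracted this way would populate an `eventDate` field alongside the existing document date, enabling date-aware scoring.
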
### Long-term (differentiation)

8. **Transparent benchmarking** — Publish everything, fully reproducible, no games. "We benchmark in the open."
9. **User-editable memory as an advantage** — Our memories are plain text files. Users can correct, augment, and reorganize them. No competitor offers this.
10. **Schema-validated memories** — Picoschema ensures consistency. No competitor has this.

## What to publish

For the README, I'd suggest something like:

> **Retrieval Quality on LoCoMo (academic benchmark)**
> - 76.4% Recall@5 across 1,982 questions
> - 85.5% Recall@10
> - ~1 second mean latency (1,063ms)
> - Runs entirely local on SQLite — no cloud API required
> - [Reproduce these results →](link-to-benchmark-repo)

We should NOT claim direct comparison with Mem0/MemMachine until we add LLM-as-Judge. But we CAN say we benchmark on the same datasets they use, which is more than most tools do.


benchmark/SPEC.md

Lines changed: 264 additions & 0 deletions
# SPEC: Basic Memory Benchmark Suite

## Summary

A standalone benchmark suite for evaluating retrieval quality across Basic Memory deployments (local and cloud) and competitors. Uses academic datasets (LoCoMo, LongMemEval) with standardized metrics. Designed to be publicly shareable, runnable by anyone, and integrated into CI.

## Motivation

1. **Internal quality tracking** — run benchmarks before/after every BM release to catch regressions and measure improvements
2. **Cloud vs Local comparison** — validate that BM Cloud's better embeddings (e.g. OpenAI text-embedding-3) produce measurably better retrieval
3. **Public credibility** — publish reproducible numbers on academic benchmarks that anyone can verify
4. **Marketing content** — "we benchmark in the open" blog post, README stats, comparison tables
5. **Competitive positioning** — compare against Mem0, Supermemory, and Zep on the same datasets they use

## Architecture

```
basic-memory-bench/
├── README.md                # How to install, run, and interpret results
├── datasets/
│   ├── locomo/
│   │   ├── download.sh      # Fetches locomo10.json from snap-research/locomo
│   │   └── README.md        # Dataset description, citation, license
│   └── longmemeval/
│       ├── download.sh      # Fetches from HuggingFace
│       └── README.md
├── converters/
│   ├── locomo_to_bm.py      # LoCoMo JSON → BM markdown notes
│   ├── longmemeval_to_bm.py # LongMemEval → BM markdown notes
│   └── base.py              # Shared conversion utilities
├── harness/
│   ├── run.py               # Main benchmark runner
│   ├── scoring.py           # Recall@K, MRR, Precision@K, content hit rate
│   ├── judge.py             # LLM-as-Judge evaluation (for answer quality)
│   └── report.py            # Generate markdown/JSON reports
├── providers/
│   ├── bm_local.py          # Basic Memory local (via MCP stdio)
│   ├── bm_cloud.py          # Basic Memory Cloud (via API)
│   ├── mem0.py              # Mem0 API (optional, needs API key)
│   └── base.py              # Provider interface
├── results/                 # Saved benchmark runs (gitignored except baselines)
│   └── baselines/
│       └── bm-local-locomo-v0.18.5.json  # Published baseline results
├── pyproject.toml           # Python package (uv/pip installable)
└── justfile                 # Common commands
```

## Key Design Decisions

### Python, not TypeScript

The current harness is TypeScript (in the plugin repo) because it was built there first. The standalone suite should be Python because:
- BM is Python — same ecosystem, same contributors
- The BM importer framework (`basic_memory.importers`) is Python
- Academic researchers use Python
- Conversion scripts can use BM's `EntityMarkdown` types directly
- `uv run` makes it trivially installable

### Use BM's importer framework for conversion

Instead of raw string concatenation, converters should produce proper `EntityMarkdown` objects and write them via `MarkdownProcessor`. This ensures:
- Canonical frontmatter format
- Proper permalink generation
- Identical output to what a real BM user would have
- Consistency with the ChatGPT/Claude importers

### Provider abstraction

Each provider implements a simple interface:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SearchResult:
    """A single ranked hit. (Field names here are illustrative.)"""
    doc_id: str
    score: float
    content: str = ""


class BenchmarkProvider(ABC):
    @abstractmethod
    async def ingest(self, corpus_path: Path, project: str) -> None:
        """Index a corpus of markdown files."""

    @abstractmethod
    async def search(self, query: str, limit: int = 10) -> list[SearchResult]:
        """Search and return ranked results."""

    @abstractmethod
    async def cleanup(self, project: str) -> None:
        """Remove indexed data."""
```

BM Local drives `bm mcp` over stdio (like the current harness).
BM Cloud uses the cloud API directly.
Mem0/Supermemory use their respective APIs (optional, needs keys).

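To show the shape of a provider end to end, here is a toy in-memory implementation and a harness-style call sequence. The `InMemoryProvider` and its substring scoring are illustrative only; real providers implement the `BenchmarkProvider` ABC above against BM local/cloud or vendor APIs:

```python
import asyncio

class InMemoryProvider:
    """Toy provider: exact-substring search over an in-memory corpus.
    Illustrative only; real providers talk to BM local/cloud or vendor APIs."""

    def __init__(self) -> None:
        self.docs: dict[str, str] = {}

    async def ingest(self, docs: dict[str, str]) -> None:
        self.docs.update(docs)

    async def search(self, query: str, limit: int = 10) -> list[tuple[str, float]]:
        hits = [(doc_id, 1.0) for doc_id, text in self.docs.items()
                if query.lower() in text.lower()]
        return hits[:limit]

async def demo() -> list[tuple[str, float]]:
    provider = InMemoryProvider()
    await provider.ingest({"note-1": "Caroline adopted a dog named Max in May",
                           "note-2": "Melanie moved to Berlin"})
    return await provider.search("adopted a dog")

print(asyncio.run(demo()))  # → [('note-1', 1.0)]
```

The harness only ever sees `ingest`/`search`/`cleanup`, so swapping backends is a one-line change in the runner config.
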
### Two evaluation modes

**Retrieval evaluation** (what we have now):
- Did we find the right note in the top K results?
- Metrics: Recall@5, Recall@10, Precision@5, MRR, Content Hit Rate
- Fast, deterministic, no LLM cost

**Answer evaluation** (needed for Mem0 comparison):
- Given the retrieved context, does the LLM produce the correct answer?
- Uses LLM-as-Judge (configurable: GPT-4o, Claude, Gemini)
- Metrics: accuracy, factual correctness, hallucination rate
- Slower, costs money, but directly comparable to Mem0's published numbers

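For `judge.py`, the core can be a single prompt plus a binary parse, with the model call injected as a callable so GPT-4o/Claude/Gemini backends are interchangeable. A sketch (prompt wording and function names are assumptions, not a fixed spec):

```python
from typing import Callable

JUDGE_PROMPT = """You are grading a memory QA system.
Question: {question}
Gold answer: {gold}
System answer: {answer}
Reply with exactly one word: CORRECT or INCORRECT."""

def judge_answer(question: str, gold: str, answer: str,
                 call_llm: Callable[[str], str]) -> int:
    """Return 1 if the judge labels the answer correct, else 0 (binary, Mem0-style)."""
    prompt = JUDGE_PROMPT.format(question=question, gold=gold, answer=answer)
    verdict = call_llm(prompt).strip().upper()
    return 1 if verdict.startswith("CORRECT") else 0

# Stubbed judge for testing; a real run would call GPT-4o here.
fake_llm = lambda prompt: "CORRECT" if "Berlin" in prompt else "INCORRECT"
print(judge_answer("Where did Bob move?", "Berlin", "He moved to Berlin", fake_llm))  # → 1
```

Keeping the judge behind a plain callable also makes the mode cheap to unit-test without any API key.
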
### Corpus generation is reproducible

```bash
# Download dataset
just download-locomo

# Convert to BM format (deterministic, no randomness)
just convert-locomo

# Index into a BM project
just index-locomo

# Run benchmark
just bench-locomo

# Or all at once
just full-locomo
```

Anyone cloning the repo gets identical results (modulo embedding model differences).

## Datasets

### LoCoMo (primary)
- **Source:** snap-research/locomo (ACL 2024)
- **Size:** 10 conversations, ~300 turns each, 1,986 QA pairs
- **Categories:** single-hop (282), multi-hop (321), temporal (92), open-domain (841), adversarial (446)
- **Why:** Most-cited memory benchmark. Mem0 publishes numbers on it. Direct comparison possible.
- **License:** Research use

### LongMemEval (secondary)
- **Source:** xiaowu0162/LongMemEval (ICLR 2025)
- **Size:** Longer conversations, more complex memory tasks
- **Categories:** knowledge update, knowledge retention, temporal reasoning, multi-session
- **Why:** Supermemory uses it. More challenging than LoCoMo. Tests different capabilities.
- **License:** Research use

### Synthetic (included, for fast iteration)
- **Source:** Our hand-crafted corpus (already in the plugin repo)
- **Size:** 11 files, 38 queries, 9 categories
- **Why:** Fast to run (<30s), good for CI smoke tests, covers BM-specific patterns (task recall, wiki-link traversal)

## Conversion Strategy

LoCoMo conversations → BM notes that look like real agent memory:

1. **Session notes** — one markdown file per conversation session, dated, with frontmatter
2. **Observations** — extracted per-speaker observations become tagged `[speaker] fact` entries
3. **People notes** — one note per speaker, with relations
4. **MEMORY.md** — accumulated summary of key facts (like a real agent's working memory)
5. **Relations** — wiki-links between sessions, people, and topics

This mirrors how a real BM-powered agent would accumulate knowledge over time.

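A minimal sketch of what items 1 and 2 produce for a single session. The frontmatter keys, speaker names, and layout are illustrative; the real converter should emit `EntityMarkdown` objects via `MarkdownProcessor` as the design decisions above require:

```python
from datetime import date

def session_note(session_id: str, day: date,
                 speaker_facts: list[tuple[str, str]]) -> str:
    """Render one conversation session as a BM-style markdown note."""
    lines = [
        "---",
        f"title: Session {session_id}",
        f"created: {day.isoformat()}",  # document date; event-date extraction is future work
        "type: conversation",
        "---",
        "",
        "## Observations",
    ]
    # Tagged "[speaker] fact" entries, one per extracted observation
    lines += [f"- [{speaker}] {fact}" for speaker, fact in speaker_facts]
    return "\n".join(lines)

note = session_note("D1-S1", date(2023, 5, 20),
                    [("Caroline", "adopted a dog named Max"),
                     ("Melanie", "is training for a marathon")])
print(note)
```

The tagged-observation format matters: per the baseline findings, FTS matches these `[speaker] fact` lines better than vector search does.
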
## Metrics

| Metric | Description | Use |
|--------|-------------|-----|
| Recall@K | Fraction of relevant docs in top K | Primary retrieval quality |
| MRR | Reciprocal rank of first relevant result | Ranking quality |
| Precision@K | Fraction of top K that are relevant | Result quality |
| Content Hit Rate | Expected answer text found in results | Chunk quality |
| Mean Latency | Average query time | Performance |
| P95 Latency | 95th-percentile query time | Tail performance |
| LLM-Judge Score | Answer correctness rated by an LLM | Answer quality (comparable to Mem0) |

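The first four metrics reduce to a few lines over ranked document IDs. A sketch of what `scoring.py`'s core might look like (function names are illustrative):

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant docs that appear in the top K results."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(ranked: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result (0 if none retrieved)."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def content_hit(ranked_chunks: list[str], expected: str, k: int = 5) -> bool:
    """Did the expected answer text surface verbatim in a top-K chunk?"""
    return any(expected.lower() in chunk.lower() for chunk in ranked_chunks[:k])

ranked = ["note-3", "note-1", "note-7"]
print(recall_at_k(ranked, {"note-1"}, k=5), mrr(ranked, {"note-1"}))  # → 1.0 0.5
```

Because these are deterministic, they can run on every commit; only the LLM-Judge score needs a model call.
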
## Current Baseline (BM Local, v0.18.5)

From our full 10-conversation LoCoMo run (1,982 queries):

| Metric | Value |
|--------|-------|
| Recall@5 | 76.4% |
| Recall@10 | 85.5% |
| MRR | 0.658 |
| Content Hit Rate | 25.4% |
| Mean Latency | 1,063ms |

By category:

| Category | N | R@5 |
|----------|---|-----|
| open_domain | 841 | 86.6% |
| multi_hop | 321 | 84.1% |
| adversarial | 446 | 67.0% |
| temporal | 92 | 59.1% |
| single_hop | 282 | 57.7% |

### Known improvement opportunities

1. **RRF scoring is broken** — hybrid search flattens all scores to ~0.016, destroying ranking (issue #577)
2. **Single-hop weakness** — specific fact lookups need better chunk-level matching
3. **Temporal weakness** — date-aware scoring or temporal indexing needed
4. **FTS finds observations that vector search misses** — tagged observations like `[speaker] fact` are matched better by FTS

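Item 1 is easy to reproduce: with the conventional k=60, plain Reciprocal Rank Fusion assigns every top-ranked document roughly 1/(60+rank) ≈ 0.016 per ranker, discarding the underlying score magnitudes entirely. A sketch (the fusion formula is standard RRF; the example inputs are illustrative):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Reciprocal Rank Fusion: score(d) = sum over rankers of 1 / (k + rank(d))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores

fts_ranking    = ["a", "b", "c"]   # suppose FTS was very confident about "a"
vector_ranking = ["b", "a", "c"]   # suppose vector search strongly preferred "b"
fused = rrf([fts_ranking, vector_ranking])
# "a" and "b" tie exactly at 1/61 + 1/62 ≈ 0.0325; "c" sits at 2/63 ≈ 0.0317.
# All confidence information from the underlying ranker scores is gone.
```

Any fix (score-weighted fusion, smaller k, or normalized raw scores) has to reintroduce magnitude information that plain RRF deliberately throws away.
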
## Cloud Comparison Plan

BM Cloud should outperform local because:
- Better embedding models (OpenAI text-embedding-3 vs local sentence-transformers)
- PostgreSQL + pgvector vs SQLite + sqlite-vec
- Server-grade hardware vs a laptop

Expected improvements:
- Higher vector similarity scores → better ranking
- Better semantic matching → improved single-hop and temporal
- Lower latency (dedicated infra)

To test: run the same benchmark with the `bm_cloud.py` provider pointing at the cloud API. Same corpus, same queries, different backend.

## CI Integration

```yaml
# .github/workflows/benchmark.yml
name: Benchmark
on:
  push:
    branches: [main]
  workflow_dispatch:

jobs:
  bench:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v4
      - run: uv sync
      - run: just download-locomo
      - run: just convert-locomo
      - run: just index-locomo
      - run: just bench-locomo --output results/ci-latest.json
      # Fail if Recall@5 drops more than 2 points from the baseline
      - run: just compare-baseline results/ci-latest.json results/baselines/latest.json
```

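The `compare-baseline` step can be a small script over the two result files. A sketch (the `recall_at_5` JSON key and the 2-point threshold semantics are assumptions about the result-file format):

```python
import json
from pathlib import Path

def compare_baseline(current: dict, baseline: dict, max_drop: float = 2.0) -> bool:
    """True if Recall@5 has not regressed more than max_drop points vs the baseline."""
    drop = baseline["recall_at_5"] - current["recall_at_5"]
    return drop <= max_drop

def main(current_path: str, baseline_path: str) -> int:
    """CI entry point: exit non-zero on regression so the workflow step fails."""
    current = json.loads(Path(current_path).read_text())
    baseline = json.loads(Path(baseline_path).read_text())
    return 0 if compare_baseline(current, baseline) else 1

print(compare_baseline({"recall_at_5": 75.0}, {"recall_at_5": 76.4}))  # → True
print(compare_baseline({"recall_at_5": 73.0}, {"recall_at_5": 76.4}))  # → False
```

Returning a non-zero exit code is all GitHub Actions needs to mark the run red.
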
## Blog Post Angle

"We Benchmark in the Open"

- Here are our numbers. Here's how to reproduce them.
- We use academic datasets, not synthetic benchmarks we designed to win.
- Clone the repo, run `just full-locomo`, get the same results.
- We publish baselines with every release so you can track improvement over time.
- This is what "build things worth keeping" looks like.

## Implementation Plan

### Phase 1: Repo setup + LoCoMo
- Create the `basic-memory-bench` repo
- Port the LoCoMo converter from TypeScript to Python (using the BM importer framework)
- Port the harness from TypeScript to Python
- Publish baseline results
- README with full instructions

### Phase 2: Cloud provider + LongMemEval
- Add the BM Cloud provider
- Run the cloud vs local comparison
- Add the LongMemEval dataset + converter
- Publish comparison results

### Phase 3: LLM-Judge + competitors
- Add the answer-evaluation mode
- Compare directly to Mem0's published LoCoMo numbers
- Optional: add Mem0/Supermemory providers for head-to-head
- Blog post with results

### Phase 4: CI + public dashboard
- GitHub Actions workflow for automated benchmarking
- Results dashboard (could be a BM Cloud MDX dashboard note!)
- Community contributions: custom datasets, new providers
