| 1 | +# SPEC: Basic Memory Benchmark Suite |
| 2 | + |
| 3 | +## Summary |
| 4 | + |
| 5 | +A standalone benchmark suite for evaluating retrieval quality across Basic Memory (BM) deployments (local and cloud) and competing memory systems. It uses academic datasets (LoCoMo, LongMemEval) with standardized metrics, and is designed to be publicly shareable, runnable by anyone, and integrated into CI. |
| 6 | + |
| 7 | +## Motivation |
| 8 | + |
| 9 | +1. **Internal quality tracking**: run benchmarks before and after every BM release to catch regressions and measure improvements |
| 10 | +2. **Cloud vs Local comparison**: validate that BM Cloud's hosted embedding models (e.g., OpenAI's) produce measurably better retrieval than the local ones |
| 11 | +3. **Public credibility**: publish reproducible numbers on academic benchmarks that anyone can verify |
| 12 | +4. **Marketing content**: "we benchmark in the open" blog post, README stats, comparison tables |
| 13 | +5. **Competitive positioning**: compare against Mem0, Supermemory, and Zep on the same datasets they use |
| 14 | + |
| 15 | +## Architecture |
| 16 | + |
| 17 | +``` |
| 18 | +basic-memory-bench/ |
| 19 | +├── README.md  # How to install, run, and interpret results |
| 20 | +├── datasets/ |
| 21 | +│   ├── locomo/ |
| 22 | +│   │   ├── download.sh  # Fetches locomo10.json from snap-research/locomo |
| 23 | +│   │   └── README.md  # Dataset description, citation, license |
| 24 | +│   └── longmemeval/ |
| 25 | +│       ├── download.sh  # Fetches from HuggingFace |
| 26 | +│       └── README.md |
| 27 | +├── converters/ |
| 28 | +│   ├── locomo_to_bm.py  # LoCoMo JSON → BM markdown notes |
| 29 | +│   ├── longmemeval_to_bm.py  # LongMemEval → BM markdown notes |
| 30 | +│   └── base.py  # Shared conversion utilities |
| 31 | +├── harness/ |
| 32 | +│   ├── run.py  # Main benchmark runner |
| 33 | +│   ├── scoring.py  # Recall@K, MRR, Precision@K, content hit rate |
| 34 | +│   ├── judge.py  # LLM-as-Judge evaluation (for answer quality) |
| 35 | +│   └── report.py  # Generate markdown/JSON reports |
| 36 | +├── providers/ |
| 37 | +│   ├── bm_local.py  # Basic Memory local (via MCP stdio) |
| 38 | +│   ├── bm_cloud.py  # Basic Memory Cloud (via API) |
| 39 | +│   ├── mem0.py  # Mem0 API (optional, needs API key) |
| 40 | +│   └── base.py  # Provider interface |
| 41 | +├── results/  # Saved benchmark runs (gitignored except baselines) |
| 42 | +│   └── baselines/ |
| 43 | +│       └── bm-local-locomo-v0.18.5.json  # Published baseline results |
| 44 | +├── pyproject.toml  # Python package (uv/pip installable) |
| 45 | +└── justfile  # Common commands |
| 46 | +``` |
| 47 | + |
| 48 | +## Key Design Decisions |
| 49 | + |
| 50 | +### Python, not TypeScript |
| 51 | +The current harness is TypeScript (in the plugin repo) because it was built there first. The standalone suite should be Python because: |
| 52 | +- BM is Python — same ecosystem, same contributors |
| 53 | +- The BM importer framework (`basic_memory.importers`) is Python |
| 54 | +- Academic researchers use Python |
| 55 | +- Conversion scripts can use BM's `EntityMarkdown` types directly |
| 56 | +- `uv run` makes it trivially installable |
| 57 | + |
| 58 | +### Use BM's importer framework for conversion |
| 59 | +Instead of raw string concatenation, converters should produce proper `EntityMarkdown` objects and write via `MarkdownProcessor`. This ensures: |
| 60 | +- Canonical frontmatter format |
| 61 | +- Proper permalink generation |
| 62 | +- Identical output to what a real BM user would have |
| 63 | +- Consistency with ChatGPT/Claude importers |
| 64 | + |
| 65 | +### Provider abstraction |
| 66 | +Each provider implements a simple interface: |
| 67 | + |
| 68 | +```python |
| 69 | +class BenchmarkProvider(ABC): |
| 70 | +    @abstractmethod |
| 71 | +    async def ingest(self, corpus_path: Path, project: str) -> None: |
| 72 | +        """Index a corpus of markdown files.""" |
| 73 | + |
| 74 | +    @abstractmethod |
| 75 | +    async def search(self, query: str, limit: int = 10) -> list[SearchResult]: |
| 76 | +        """Search and return ranked results.""" |
| 77 | + |
| 78 | +    @abstractmethod |
| 79 | +    async def cleanup(self, project: str) -> None: |
| 80 | +        """Remove indexed data.""" |
| 81 | +``` |
| 82 | + |
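
A minimal, self-contained sketch of the provider interface with its imports filled in. The `SearchResult` fields and the `InMemoryProvider` class are hypothetical and exist only to illustrate how a provider could plug into the harness; the spec does not define either of them, and the toy term-count scoring is not how Basic Memory ranks results.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SearchResult:
    # Hypothetical shape: the spec does not define the result type.
    path: str
    score: float
    snippet: str = ""


class BenchmarkProvider(ABC):
    @abstractmethod
    async def ingest(self, corpus_path: Path, project: str) -> None:
        """Index a corpus of markdown files."""

    @abstractmethod
    async def search(self, query: str, limit: int = 10) -> list[SearchResult]:
        """Search and return ranked results."""

    @abstractmethod
    async def cleanup(self, project: str) -> None:
        """Remove indexed data."""


class InMemoryProvider(BenchmarkProvider):
    """Toy provider: scores each note by how many query terms it contains."""

    def __init__(self) -> None:
        self.docs: dict[str, str] = {}

    async def ingest(self, corpus_path: Path, project: str) -> None:
        # Key each note by its path relative to the corpus root.
        for md in sorted(corpus_path.rglob("*.md")):
            key = md.relative_to(corpus_path).as_posix()
            self.docs[key] = md.read_text(encoding="utf-8")

    async def search(self, query: str, limit: int = 10) -> list[SearchResult]:
        terms = query.lower().split()
        hits = []
        for path, text in self.docs.items():
            lowered = text.lower()
            score = sum(1 for term in terms if term in lowered)
            if score:
                hits.append(SearchResult(path=path, score=float(score), snippet=text[:80]))
        # Highest score first; ties broken by path so runs are deterministic.
        return sorted(hits, key=lambda r: (-r.score, r.path))[:limit]

    async def cleanup(self, project: str) -> None:
        self.docs.clear()
```

A harness would call `asyncio.run(provider.ingest(corpus, "bench"))` once, then `search` per query, then `cleanup`. Keeping the interface this small means a new backend only needs these three methods.
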
| 83 | +- BM Local uses `bm mcp` over stdio (like the current harness). |
| 84 | +- BM Cloud uses the cloud API directly. |
| 85 | +- Mem0 and Supermemory use their respective APIs (optional, API keys required). |
| 86 | + |
| 87 | +### Two evaluation modes |
| 88 | + |
| 89 | +**Retrieval evaluation** (what we have now): |
| 90 | +- Did we find the right note in top K results? |
| 91 | +- Metrics: Recall@5, Recall@10, Precision@5, MRR, Content Hit Rate |
| 92 | +- Fast, deterministic, no LLM cost |
| 93 | + |
| 94 | +**Answer evaluation** (needed for Mem0 comparison): |
| 95 | +- Given retrieved context, does the LLM produce the correct answer? |
| 96 | +- Uses LLM-as-Judge (configurable: GPT-4o, Claude, Gemini) |
| 97 | +- Metrics: accuracy, factual correctness, hallucination rate |
| 98 | +- Slower, costs money, but directly comparable to Mem0's published numbers |
| 99 | + |
| 100 | +### Corpus generation is reproducible |
| 101 | +```bash |
| 102 | +# Download dataset |
| 103 | +just download-locomo |
| 104 | + |
| 105 | +# Convert to BM format (deterministic, no randomness) |
| 106 | +just convert-locomo |
| 107 | + |
| 108 | +# Index into a BM project |
| 109 | +just index-locomo |
| 110 | + |
| 111 | +# Run benchmark |
| 112 | +just bench-locomo |
| 113 | + |
| 114 | +# Or all at once |
| 115 | +just full-locomo |
| 116 | +``` |
| 117 | + |
| 118 | +Anyone cloning the repo gets identical results (modulo embedding model differences). |
| 119 | + |
| 120 | +## Datasets |
| 121 | + |
| 122 | +### LoCoMo (primary) |
| 123 | +- **Source:** snap-research/locomo (ACL 2024) |
| 124 | +- **Size:** 10 conversations, ~300 turns each, 1,986 QA pairs |
| 125 | +- **Categories:** single-hop (282), multi-hop (321), temporal (92), open-domain (841), adversarial (446) |
| 126 | +- **Why:** Most cited memory benchmark. Mem0 publishes numbers on it. Direct comparison possible. |
| 127 | +- **License:** Research use |
| 128 | + |
| 129 | +### LongMemEval (secondary) |
| 130 | +- **Source:** xiaowu0162/LongMemEval (ICLR 2025) |
| 131 | +- **Size:** Longer conversations, more complex memory tasks |
| 132 | +- **Categories:** knowledge update, knowledge retention, temporal reasoning, multi-session |
| 133 | +- **Why:** Supermemory uses it. More challenging than LoCoMo. Tests different capabilities. |
| 134 | +- **License:** Research use |
| 135 | + |
| 136 | +### Synthetic (included, for fast iteration) |
| 137 | +- **Source:** Our hand-crafted corpus (already in plugin repo) |
| 138 | +- **Size:** 11 files, 38 queries, 9 categories |
| 139 | +- **Why:** Fast to run (<30s), good for CI smoke tests, covers BM-specific patterns (task recall, wiki-link traversal) |
| 140 | + |
| 141 | +## Conversion Strategy |
| 142 | + |
| 143 | +LoCoMo conversations → BM notes that look like real agent memory: |
| 144 | + |
| 145 | +1. **Session notes** — one markdown file per conversation session, dated, with frontmatter |
| 146 | +2. **Observations** — extracted per-speaker observations become tagged `[speaker] fact` entries |
| 147 | +3. **People notes** — one note per speaker with relations |
| 148 | +4. **MEMORY.md** — accumulated summary of key facts (like a real agent's working memory) |
| 149 | +5. **Relations** — wiki-links between sessions, people, and topics |
| 150 | + |
| 151 | +This mirrors how a real BM-powered agent would accumulate knowledge over time. |
| 152 | + |
| 153 | +## Metrics |
| 154 | + |
| 155 | +| Metric | Description | Use | |
| 156 | +|--------|-------------|-----| |
| 157 | +| Recall@K | Fraction of relevant docs that appear in the top K results | Primary retrieval quality | |
| 158 | +| MRR | Mean over queries of the reciprocal rank of the first relevant result | Ranking quality | |
| 159 | +| Precision@K | Fraction of top K that are relevant | Result quality | |
| 160 | +| Content Hit Rate | Expected answer text found in results | Chunk quality | |
| 161 | +| Mean Latency | Average query time | Performance | |
| 162 | +| P95 Latency | 95th percentile query time | Tail performance | |
| 163 | +| LLM-Judge Score | Answer correctness rated by LLM | Answer quality (comparable to Mem0) | |
| 164 | + |
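
The retrieval metrics in the table can be sketched as follows. This uses the standard textbook definitions and assumes each ranked list contains unique document IDs; the function names are illustrative and are not taken from the existing harness.

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was returned."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (ranked results, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)
```

For a query whose relevant notes are `{"n1", "n2"}` and whose results are `["n3", "n1", "n7", "n2"]`, Recall@3 is 1/2, Precision@3 is 1/3, and the reciprocal rank is 1/2.
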
| 165 | +## Current Baseline (BM Local, v0.18.5) |
| 166 | + |
| 167 | +From our full 10-conversation LoCoMo run (1,982 of the dataset's 1,986 QA pairs): |
| 168 | + |
| 169 | +| Metric | Value | |
| 170 | +|--------|-------| |
| 171 | +| Recall@5 | 76.4% | |
| 172 | +| Recall@10 | 85.5% | |
| 173 | +| MRR | 0.658 | |
| 174 | +| Content Hit Rate | 25.4% | |
| 175 | +| Mean Latency | 1,063ms | |
| 176 | + |
| 177 | +By category: |
| 178 | +| Category | N | R@5 | |
| 179 | +|----------|---|-----| |
| 180 | +| open_domain | 841 | 86.6% | |
| 181 | +| multi_hop | 321 | 84.1% | |
| 182 | +| adversarial | 446 | 67.0% | |
| 183 | +| temporal | 92 | 59.1% | |
| 184 | +| single_hop | 282 | 57.7% | |
| 185 | + |
| 186 | +### Known improvement opportunities |
| 187 | +1. **RRF scoring is broken** — hybrid search flattens all scores to ~0.016, destroying ranking (issue #577) |
| 188 | +2. **Single-hop weakness** — specific fact lookups need better chunk-level matching |
| 189 | +3. **Temporal weakness** — date-aware scoring or temporal indexing needed |
| 190 | +4. **FTS finds observations that vector misses** — tagged observations like `[speaker] fact` are better matched by FTS |
| 191 | + |
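
The ~0.016 figure in item 1 is consistent with standard Reciprocal Rank Fusion (RRF), where each result list contributes `1 / (k + rank)` and `k` is commonly 60, so a top-ranked hit from a single list scores 1/61 ≈ 0.0164. Whether BM's hybrid search uses exactly this formula and constant is an assumption here, not something this spec confirms. The sketch below only illustrates how narrow the RRF score range is.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Standard RRF: sum 1 / (k + rank) over every list a doc appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return scores


# Hypothetical result lists from full-text search and vector search.
fts_results = ["a", "b", "c"]
vector_results = ["a", "d"]

fused = rrf_fuse([fts_results, vector_results])
for doc, score in sorted(fused.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{doc}: {score:.4f}")
```

With `k = 60`, every single-list score falls between 0 and 1/61, so documents at ranks 1 and 10 differ by only about 0.002. Any downstream logic that thresholds or compares raw scores across queries sees almost no signal, even though the relative order is still defined.
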
| 192 | +## Cloud Comparison Plan |
| 193 | + |
| 194 | +BM Cloud should outperform local because: |
| 195 | +- Better embedding models (hosted OpenAI embeddings vs local sentence-transformers) |
| 196 | +- PostgreSQL + pgvector vs SQLite + sqlite-vec |
| 197 | +- Server-grade hardware vs laptop |
| 198 | + |
| 199 | +Expected improvements: |
| 200 | +- Higher vector similarity scores → better ranking |
| 201 | +- Better semantic matching → improved single-hop and temporal |
| 202 | +- Lower latency (dedicated infra) |
| 203 | + |
| 204 | +To test: run same benchmark with `bm_cloud.py` provider pointing at cloud API. Same corpus, same queries, different backend. |
| 205 | + |
| 206 | +## CI Integration |
| 207 | + |
| 208 | +```yaml |
| 209 | +# .github/workflows/benchmark.yml |
| 210 | +name: Benchmark |
| 211 | +on: |
| 212 | +  push: |
| 213 | +    branches: [main] |
| 214 | +  workflow_dispatch: |
| 215 | + |
| 216 | +jobs: |
| 217 | +  bench: |
| 218 | +    runs-on: ubuntu-latest |
| 219 | +    steps: |
| 220 | +      - uses: actions/checkout@v4 |
| 221 | +      - uses: astral-sh/setup-uv@v4 |
| 222 | +      - run: uv sync |
| 223 | +      - run: just download-locomo |
| 224 | +      - run: just convert-locomo |
| 225 | +      - run: just index-locomo |
| 226 | +      - run: just bench-locomo --output results/ci-latest.json |
| 227 | +      # Fail if recall@5 drops more than 2% from baseline |
| 228 | +      - run: just compare-baseline results/ci-latest.json results/baselines/latest.json |
| 229 | +``` |
| 230 | + |
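
One way the `compare-baseline` step could be implemented is sketched below. The report layout (a top-level `metrics` object holding `recall_at_5` as a fraction) and the function names are assumptions, since the spec does not define the JSON schema. The sketch also reads the "2%" threshold as two percentage points, which is one possible interpretation.

```python
import json
from pathlib import Path

# Assumed threshold: 2 percentage points of Recall@5, expressed as a fraction.
MAX_RECALL_DROP = 0.02


def recall_drop(current: dict, baseline: dict, metric: str = "recall_at_5") -> float:
    """Positive when the current run is worse than the baseline."""
    return baseline["metrics"][metric] - current["metrics"][metric]


def compare_baseline(current_path: Path, baseline_path: Path) -> int:
    """Return a process exit code: 1 on regression, 0 otherwise."""
    current = json.loads(Path(current_path).read_text(encoding="utf-8"))
    baseline = json.loads(Path(baseline_path).read_text(encoding="utf-8"))
    drop = recall_drop(current, baseline)
    print(f"recall@5 change vs baseline: {-drop:+.3f}")
    return 1 if drop > MAX_RECALL_DROP else 0
```

A `just compare-baseline` recipe could wrap this in a small script that passes the two paths from the command line and exits with the returned code, so the CI job fails on a regression.
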
| 231 | +## Blog Post Angle |
| 232 | + |
| 233 | +"We Benchmark in the Open" |
| 234 | +- Here are our numbers. Here's how to reproduce them. |
| 235 | +- We use academic datasets, not synthetic benchmarks we designed to win. |
| 236 | +- Clone the repo, run `just full-locomo`, get the same results. |
| 237 | +- We publish baselines with every release so you can track improvement over time. |
| 238 | +- This is what "build things worth keeping" looks like. |
| 239 | + |
| 240 | +## Implementation Plan |
| 241 | + |
| 242 | +### Phase 1: Repo setup + LoCoMo |
| 243 | +- Create `basic-memory-bench` repo |
| 244 | +- Port LoCoMo converter from TypeScript to Python (using BM importer framework) |
| 245 | +- Port harness from TypeScript to Python |
| 246 | +- Publish baseline results |
| 247 | +- README with full instructions |
| 248 | + |
| 249 | +### Phase 2: Cloud provider + LongMemEval |
| 250 | +- Add BM Cloud provider |
| 251 | +- Run cloud vs local comparison |
| 252 | +- Add LongMemEval dataset + converter |
| 253 | +- Publish comparison results |
| 254 | + |
| 255 | +### Phase 3: LLM-Judge + competitors |
| 256 | +- Add answer evaluation mode |
| 257 | +- Compare directly to Mem0's published LoCoMo numbers |
| 258 | +- Optional: add Mem0/Supermemory providers for head-to-head |
| 259 | +- Blog post with results |
| 260 | + |
| 261 | +### Phase 4: CI + public dashboard |
| 262 | +- GitHub Actions workflow for automated benchmarking |
| 263 | +- Results dashboard (could be a BM Cloud MDX dashboard note!) |
| 264 | +- Community contributions: custom datasets, new providers |