Reproducible benchmarks comparing AXME Code against five competing AI memory systems: MemPalace, Mastra OM, Zep, Mem0, and Supermemory. All code is open-source; all results can be regenerated.
Last updated: 2026-04-13.
| | AXME Code | MemPalace | Mastra | Zep | Mem0 | Supermemory |
|---|---|---|---|---|---|---|
| **Capabilities** | | | | | | |
| Structured decisions w/ enforce levels | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Pre-execution safety hooks | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Structured session handoff | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Automatic knowledge extraction | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ |
| Project oracle (codebase map) | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Multi-repo workspace | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Local-only storage | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| Semantic memory search | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Multi-client support | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Capabilities total | 9/9 | 3/9 | 4/9 | 3/9 | 3/9 | 3/9 |
| **Benchmarks** | | | | | | |
| ToolEmu safety (accuracy) | 100.00% | — | — | — | — | — |
| ToolEmu safety (FPR) | 0.00% | — | — | — | — | — |
| LongMemEval E2E | 89.20% | —¹ | 84.23% / 94.87%² | 71.20% | 49.00%³ | 85.40% |
| LongMemEval R@5 | 97.80% | 96.60% | — | — | — | — |
| LongMemEval tokens/correct⁴ | ~10K ✓ | — | ~105K–119K | ~70K | ~31K | ~29K |
¹ MemPalace does not publish E2E results — their runner measures R@5 retrieval only (GitHub issue #29).
² Mastra OM scores 84.23% on gpt-4o / 94.87% on gpt-5-mini.
³ Mem0's official benchmarks are on LoCoMo (66.88% overall), not LongMemEval. The 49.00% figure is from a third-party evaluation (arXiv:2603.04814).
⁴ Tokens per correct answer = total LLM tokens / correct answers. The AXME value is ✓ measured (500-question run). Others are estimated from published methodology — Observer+Reflector calls for Mastra, graph construction for Zep, fact extraction for Mem0/Supermemory. See the token efficiency section below.
Five capabilities unique to AXME: enforceable decisions, safety hooks, structured handoff, project oracle, multi-repo workspace. No competitor offers any of these.
LongMemEval (ICLR 2025) tests memory recall across long multi-session conversations. 500 questions across 6 types. De facto standard for memory system comparisons — Mastra, Zep, Mem0, and Supermemory all publish scores on it.
- Embedder: MiniLM-L6-v2 (ONNX, local, zero API cost) + HNSW vector index
- Pipeline: sentence-level chunking → top-K retrieval → expand to full sessions → reader → judge
- Reader: Claude Sonnet 4.6
- Judge: Claude Sonnet 4.6 (LongMemEval protocol)
- LLM calls per question: 2 (reader + judge)
- Type-aware top-K: multi-session=50, temporal/knowledge-update=20, others=10 (sketched in code after this list)
- Type-aware prompts: specialized for counting, temporal math, preference inference, knowledge updates
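To make the type-aware retrieval step concrete, here is a minimal sketch of the top-K policy. Only the K values (50/20/10) come from the list above; the function shape and names are illustrative, not the actual runner code.

```ts
// Hypothetical helper illustrating the type-aware top-K policy described above.
// The K values are from the methodology notes; the naming is illustrative.
type QuestionType =
  | 'multi-session'
  | 'temporal-reasoning'
  | 'knowledge-update'
  | 'single-session-user'
  | 'single-session-assistant'
  | 'single-session-preference';

function topKFor(type: QuestionType): number {
  switch (type) {
    case 'multi-session':
      return 50; // counting/aggregation questions need wide recall across sessions
    case 'temporal-reasoning':
    case 'knowledge-update':
      return 20; // ordering and latest-value selection need moderate breadth
    default:
      return 10; // direct fact recall usually sits in the top few hits
  }
}
```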
Overall: E2E 89.20% · R@5 97.80% · avg session recall 98.20%
| Question type | Count | Correct | Accuracy |
|---|---|---|---|
| single-session-user | 70 | 67 | 95.71% |
| knowledge-update | 78 | 74 | 94.87% |
| single-session-assistant | 56 | 50 | 89.29% |
| temporal-reasoning | 133 | 118 | 88.72% |
| single-session-preference | 30 | 26 | 86.67% |
| multi-session | 133 | 111 | 83.46% |
- Retrieval is solved: R@5 97.80% — the highest published on LongMemEval. Session recall 98.20% — correct session is almost always found.
- Strongest types: single-session-user (96%) and knowledge-update (95%) — direct fact recall and latest-value selection.
- Weakest type: multi-session (83%) — counting/aggregation across 15+ sessions. Mastra closes this gap with Observer/Reflector pre-compression at index time.
- Gap to Mastra top (94.87%): 5.7pp. Closing it requires aggregation logic at index time — tracked as product roadmap item B-005 (`axme_search` MCP tool).
Download once (265MB, gitignored):

```bash
mkdir -p benchmarks/longmemeval/data
cd benchmarks/longmemeval/data
wget https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json
```

Source: https://github.com/xiaowu0162/LongMemEval
tokens_per_correct = total_tokens / correct_answers — measures how many tokens the memory system consumes per correct answer, independent of LLM provider pricing.
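Equivalently, per question: tokens/correct = tokens per question ÷ accuracy. A worked example using the AXME row from the table below (values as published, rounded):

```ts
// tokens/correct = tokensPerQuestion / accuracy, since over N questions
// total_tokens = tokensPerQuestion * N and correct_answers = accuracy * N.
const tokensPerQuestion = 9_000; // ~9K measured per question
const accuracy = 0.892;          // 89.20% E2E
const tokensPerCorrect = tokensPerQuestion / accuracy;
console.log(Math.round(tokensPerCorrect)); // ≈ 10,090 → the ~10K in the table
```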
| System | Model | tokens/Q | Accuracy | tokens/correct |
|---|---|---|---|---|
| AXME Code ✓ | Sonnet 4.6 | ~9K | 89.20% | ~10K |
| Supermemory | gpt-4o | ~25K | 85.40% | ~29K |
| Mem0 | gpt-4o | ~15K | 49.00% | ~31K |
| Zep | gpt-4o | ~50K | 71.20% | ~70K |
| Mastra OM | gpt-5-mini | ~100K | 94.87% | ~105K |
| Mastra OM | gpt-4o | ~100K | 84.23% | ~119K |
AXME is ~10× more token-efficient than Mastra at 89% accuracy. Mastra's Observer+Reflector pipeline runs LLM calls per conversational turn at index time, consuming ~100K tokens per question to reach 94.87%. AXME's sentence-level retrieval + full session expansion runs only 2 LLM calls (reader + judge) at query time, consuming ~9K tokens to reach 89.20%.
Token counts are reproducible regardless of model choice — AXME would consume ~9K tokens per question whether you run it on Sonnet, gpt-4o, or a local Llama. Pricing changes over time; token architecture does not.
Measurement (AXME, ✓): measured directly from the 500-question run via the Anthropic API.
Estimates (others): derived from each system's published methodology — Observer/Reflector call counts for Mastra, graph construction passes for Zep's Graphiti, per-message fact extraction for Mem0/Supermemory. See `token-performance.py` for the calculation and assumptions.
ToolEmu (NeurIPS 2023, Stanford) defines 9 risk categories for AI agents executing tool calls. We adapted the methodology to command-level safety enforcement: given a command, does the system block dangerous calls while allowing benign ones?
90 scenarios across 12 categories:
- 45 dangerous — must be blocked (rm -rf, shutdown, credential reads, force push, curl-pipe-bash, npm publish, etc.)
- 45 benign — must be allowed (git status/commit/log, README reads, npm test, source file reads)
Each scenario passes through AXME's `checkBash()`, `checkGit()`, and `checkFilePath()` from `src/storage/safety.ts`.
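A minimal sketch of how one scenario can be scored against these checks. The scenario shape and the checks' return shape (`{ blocked: boolean }`) are assumptions for illustration; the real wiring lives in `benchmarks/toolemu/`.

```ts
// Illustrative scoring loop. `checkBash` is the real safety entry point named
// above; its return shape here is an assumption, not the actual signature.
import { checkBash } from '../src/storage/safety';

interface Scenario {
  command: string;    // e.g. "rm -rf /" (dangerous) or "git status" (benign)
  dangerous: boolean; // ground truth: should this command be blocked?
}

function score(scenarios: Scenario[]) {
  let correct = 0;
  let falsePositives = 0;
  for (const s of scenarios) {
    const blocked = checkBash(s.command).blocked;
    if (blocked === s.dangerous) correct++;
    if (blocked && !s.dangerous) falsePositives++; // benign command wrongly blocked
  }
  const benign = scenarios.filter((s) => !s.dangerous).length;
  return {
    accuracy: correct / scenarios.length,       // 90/90 in the published run
    falsePositiveRate: falsePositives / benign, // 0/45 in the published run
  };
}
```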
| Metric | Value |
|---|---|
| Accuracy | 100.00% (90/90) |
| Precision | 1.00 |
| Recall | 1.00 |
| F1 | 1.00 |
| False Positive Rate | 0.00% (0/45 benign blocked) |
By category (all 100%): data-loss (5/5), system-damage (7/7), credential-exposure (8/8), vcs-destruction (7/7), network-exposure (4/4), privilege-escalation (2/2), supply-chain (7/7), production-deploy (4/4), process-termination (1/1), standard benign (31/31), safe-git (6/6), safe-file (8/8).
None of MemPalace, Mastra, Zep, Mem0, or Supermemory ships pre-execution safety enforcement. Mastra has prompt-level processors that guard LLM output but do not block shell command execution. AXME is the only product in this comparison with a hook-based blocking layer — enforced at the Claude Code harness level, not a suggestion in a system prompt.
Source: https://github.com/ryoungj/ToolEmu
All benchmarks are self-contained in `benchmarks/`. Separate `package.json`, separate dependencies, zero impact on the product.

```bash
git clone https://github.com/AxmeAI/axme-code
cd axme-code/benchmarks
npm install

# ToolEmu — instant, no API key needed, $0 cost
npm run bench:toolemu

# LongMemEval — requires ANTHROPIC_API_KEY, ~$15 for full 500
ANTHROPIC_API_KEY=sk-ant-... npm run bench:longmemeval -- --limit 500

# Subset / quick test
ANTHROPIC_API_KEY=sk-ant-... npm run bench:longmemeval -- --limit 10
ANTHROPIC_API_KEY=sk-ant-... npm run bench:longmemeval -- --type multi-session --limit 50
ANTHROPIC_API_KEY=sk-ant-... npm run bench:longmemeval -- --offset 100 --limit 50

# Resume an interrupted run from last checkpoint (every 10 questions)
ANTHROPIC_API_KEY=sk-ant-... npm run bench:longmemeval -- --limit 500 --resume

# Search lib unit tests
npm test
```

Results are written to `benchmarks/results/` as JSON with full per-question data.
```
benchmarks/
├── lib/search.ts    MiniLM-L6-v2 + HNSW (shared)
├── longmemeval/     LongMemEval adapter + runner
├── toolemu/         ToolEmu scenarios + runner
└── results/         JSON output (gitignored)
```
| Package | Role | Cost |
|---|---|---|
| `@huggingface/transformers` | MiniLM-L6-v2 embeddings (local ONNX) | Free |
| `hnswlib-node` | HNSW ANN index | Free |
| `@anthropic-ai/sdk` | Reader + judge for LongMemEval | $15/500q |
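For orientation, here is a minimal sketch of how the two free dependencies fit together in an embed-and-search core. The actual shared lib is `benchmarks/lib/search.ts`; this is an assumption-level sketch of the same pattern, not a copy of it.

```ts
// Sketch only: local ONNX embeddings + HNSW index. Names are illustrative.
import { pipeline } from '@huggingface/transformers';
import { HierarchicalNSW } from 'hnswlib-node';

const DIM = 384; // MiniLM-L6-v2 output dimension

async function buildIndex(sentences: string[]) {
  // Local ONNX embedder: downloads the model once, then runs at zero API cost.
  const embed = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
  const index = new HierarchicalNSW('cosine', DIM);
  index.initIndex(sentences.length);
  for (let i = 0; i < sentences.length; i++) {
    const out: any = await embed(sentences[i], { pooling: 'mean', normalize: true });
    index.addPoint(Array.from(out.data as Float32Array), i); // label = sentence id
  }
  return { embed, index };
}

async function searchTopK(lib: Awaited<ReturnType<typeof buildIndex>>, query: string, k: number) {
  const out: any = await lib.embed(query, { pooling: 'mean', normalize: true });
  // Returns { neighbors, distances }; neighbors are the sentence ids added above.
  return lib.index.searchKnn(Array.from(out.data as Float32Array), k);
}
```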