| pretty_name | FalseMemBench |
|---|---|
| license | mit |
| task_categories | |
| language | |
| tags | |
| size_categories | |

FalseMemBench is an adversarial benchmark for evaluating memory retrieval systems under heavy distractor pressure.
The goal is to measure whether a system can retrieve the right memory when many nearby but wrong memories are present.
The benchmark is designed for memory systems used by LLM agents.
It emphasizes:
- entity confusion
- environment confusion
- time/version confusion
- stale facts vs current facts
- speaker confusion
- near-duplicate paraphrases
The public release is intentionally small:
- `data/cases.jsonl`: canonical benchmark cases
- `schema/case.schema.json`: benchmark case schema
- `docs/`: benchmark design notes
- `scripts/validate.py`: schema validator for the JSONL dataset
- `scripts/run_benchmark.py`: simple keyword baseline
- `scripts/run_tagmem_benchmark.py`: runs the benchmark against a real `tagmem` binary
- `scripts/run_mempalace_benchmark.py`: runs the benchmark against MemPalace raw-style retrieval
- `scripts/run_bm25_benchmark.py`: lexical BM25 baseline
- `scripts/run_dense_benchmark.py`: dense retrieval baseline
- `requirements.txt`: optional Python dependencies for the BM25 and dense baseline scripts

data/cases.jsonl is the only canonical benchmark file.
There are no public snapshot versions in this repository. Version history is tracked through git.
Validate the canonical dataset:

```bash
python3 scripts/validate.py
```

Run the simple keyword baseline:

```bash
python3 scripts/run_benchmark.py
```

Run the tagmem benchmark:

```bash
python3 scripts/run_tagmem_benchmark.py --tagmem-bin tagmem
```

Run the MemPalace-style benchmark:

```bash
python3 scripts/run_mempalace_benchmark.py
```

Optional BM25 and dense baselines use dependencies from `requirements.txt`.

Each case contains:
- a `query`
- a set of `entries`
- one or more `relevant_ids`
- a single `adversary_type`
- optional metadata for analysis

```json
{
  "id": "env-001",
  "query": "What database does staging use?",
  "adversary_type": "environment_swap",
  "entries": [
    {
      "id": "e1",
      "text": "The staging environment uses db-staging.internal.",
      "tags": ["staging", "database", "infra"],
      "depth": 1
    },
    {
      "id": "e2",
      "text": "The production environment uses db-prod.internal.",
      "tags": ["production", "database", "infra"],
      "depth": 1
    }
  ],
  "relevant_ids": ["e1"]
}
```

Adversary types:

- `entity_swap`
- `environment_swap`
- `time_swap`
- `state_update`
- `speaker_swap`
- `near_duplicate_paraphrase`

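To make the case format concrete, here is a minimal sketch of how a retriever might be scored on the single case shown above. It is not the repository's `scripts/run_benchmark.py`: the token-overlap scoring, the hit@k check, and the helper names `tokenize`, `rank_entries`, and `hit_at_k` are all assumptions introduced for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens (hypothetical helper)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def rank_entries(query, entries):
    """Rank entry ids by token overlap with the query, highest overlap first."""
    query_tokens = tokenize(query)
    scored = []
    for entry in entries:
        # Searching text plus tags is an illustrative choice, not a benchmark rule.
        searchable = entry["text"] + " " + " ".join(entry.get("tags", []))
        overlap = len(query_tokens & tokenize(searchable))
        scored.append((overlap, entry["id"]))
    # Descending overlap; ties are broken by id so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [entry_id for _, entry_id in scored]

def hit_at_k(ranked_ids, relevant_ids, k=1):
    """Return True if any relevant id appears in the top k ranked ids."""
    return any(entry_id in relevant_ids for entry_id in ranked_ids[:k])

# The example case from this README.
case = {
    "id": "env-001",
    "query": "What database does staging use?",
    "adversary_type": "environment_swap",
    "entries": [
        {"id": "e1", "text": "The staging environment uses db-staging.internal.",
         "tags": ["staging", "database", "infra"], "depth": 1},
        {"id": "e2", "text": "The production environment uses db-prod.internal.",
         "tags": ["production", "database", "infra"], "depth": 1},
    ],
    "relevant_ids": ["e1"],
}

ranked = rank_entries(case["query"], case["entries"])
print(ranked, hit_at_k(ranked, case["relevant_ids"]))
```

In this case the distractor `e2` shares the word "database" with the query, so it still receives a nonzero score. A ranker passes only because `e1` also matches "staging".
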
Current dataset size: 573 cases
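A quick way to check the dataset size and its breakdown is to count cases per `adversary_type`. The sketch below assumes each line of the JSONL file is one case object with an `adversary_type` field, as in the example case; the function name `count_by_adversary_type` is hypothetical, and the two sample lines are stand-ins, not real dataset rows.

```python
import json
from collections import Counter

def count_by_adversary_type(lines):
    """Count cases per adversary_type from an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        case = json.loads(line)
        counts[case["adversary_type"]] += 1
    return counts

# Stand-in lines for demonstration only; the second id is made up.
sample_lines = [
    '{"id": "env-001", "adversary_type": "environment_swap"}',
    '{"id": "demo-002", "adversary_type": "entity_swap"}',
    "",
]

counts = count_by_adversary_type(sample_lines)
print(sum(counts.values()), dict(counts))
```

To run it on the real data, pass an open file handle instead of `sample_lines`, for example `count_by_adversary_type(open("data/cases.jsonl", encoding="utf-8"))`.
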
The benchmark is intended to be:
- model-agnostic
- storage-agnostic
- metadata-friendly
- easy to publish to GitHub and Hugging Face
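
One way to read "model-agnostic" and "storage-agnostic" is that an evaluation loop needs only a function that maps a query and a list of entries to a ranked list of entry ids. The sketch below illustrates that idea under stated assumptions: `evaluate` and `stored_order_ranker` are hypothetical names, and hit@k is an assumed metric, not necessarily the one the benchmark scripts report.

```python
from collections import defaultdict

def evaluate(cases, ranker, k=1):
    """Return hit@k per adversary_type for any ranker callable.

    `ranker(query, entries)` must return entry ids ordered best first.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for case in cases:
        ranked = ranker(case["query"], case["entries"])
        relevant = set(case["relevant_ids"])
        totals[case["adversary_type"]] += 1
        # A case counts as a hit if any relevant id is in the top k.
        if relevant.intersection(ranked[:k]):
            hits[case["adversary_type"]] += 1
    return {kind: hits[kind] / totals[kind] for kind in totals}

def stored_order_ranker(query, entries):
    """Trivial stand-in ranker: ignores the query and keeps stored order."""
    return [entry["id"] for entry in entries]

# The example case from this README, wrapped in a one-element list.
cases = [{
    "id": "env-001",
    "query": "What database does staging use?",
    "adversary_type": "environment_swap",
    "entries": [
        {"id": "e1", "text": "The staging environment uses db-staging.internal."},
        {"id": "e2", "text": "The production environment uses db-prod.internal."},
    ],
    "relevant_ids": ["e1"],
}]

scores = evaluate(cases, stored_order_ranker)
print(scores)
```

Any retrieval system, whether keyword, BM25, dense, or an external memory store, could be plugged in by wrapping it in a function with the same signature as `stored_order_ranker`.
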