codysnider/FalseMemBench
---
pretty_name: FalseMemBench
license: mit
task_categories:
  - text-retrieval
language:
  - en
tags:
  - retrieval
  - memory
  - llm-agents
  - adversarial
size_categories:
  - n<1K
---

FalseMemBench

FalseMemBench is an adversarial benchmark for evaluating memory retrieval systems under heavy distractor pressure.

The goal is to measure whether a system can retrieve the correct memory when many similar but incorrect memories are present.

Focus

The benchmark is designed for memory systems used by LLM agents.

It emphasizes:

  • entity confusion
  • environment confusion
  • time/version confusion
  • stale facts vs current facts
  • speaker confusion
  • near-duplicate paraphrases

Public Surface

The public release is intentionally small:

  • data/cases.jsonl: canonical benchmark dataset
  • schema/case.schema.json: case schema
  • scripts/validate.py: dataset validator
  • scripts/run_tagmem_benchmark.py: benchmark runner for tagmem
  • scripts/run_mempalace_benchmark.py: benchmark runner for MemPalace-style retrieval
  • scripts/run_benchmark.py: simple keyword baseline
  • scripts/run_bm25_benchmark.py: BM25 baseline
  • scripts/run_dense_benchmark.py: dense retrieval baseline
  • docs/: supporting benchmark notes

Layout

  • schema/case.schema.json: benchmark case schema
  • data/cases.jsonl: canonical benchmark cases
  • docs/: benchmark design notes
  • scripts/validate.py: schema validator for the JSONL dataset
  • scripts/run_benchmark.py: simple keyword baseline
  • scripts/run_tagmem_benchmark.py: run the benchmark against a real tagmem binary
  • scripts/run_mempalace_benchmark.py: run the benchmark against MemPalace raw-style retrieval
  • scripts/run_bm25_benchmark.py: lexical BM25 baseline
  • scripts/run_dense_benchmark.py: dense retrieval baseline
  • requirements.txt: optional Python dependencies for BM25 and dense baseline scripts

Canonical Dataset

data/cases.jsonl is the only canonical benchmark file.

There are no public snapshot versions in this repository. Version history is tracked through git.

Running

Validate the canonical dataset:

python3 scripts/validate.py

Run the simple keyword baseline:

python3 scripts/run_benchmark.py

Run the tagmem benchmark:

python3 scripts/run_tagmem_benchmark.py --tagmem-bin tagmem

Run the MemPalace-style benchmark:

python3 scripts/run_mempalace_benchmark.py

The optional BM25 and dense baselines require the Python dependencies listed in requirements.txt.
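As a rough sketch of what a keyword baseline does (the actual scripts/run_benchmark.py may tokenize and score differently; the inline case dict and top-1 scoring used here are illustrative assumptions):

```python
import json

def keyword_score(query: str, text: str) -> int:
    # Count query tokens that also appear in the entry text.
    q_tokens = set(query.lower().split())
    t_tokens = set(text.lower().split())
    return len(q_tokens & t_tokens)

def run_case(case: dict) -> bool:
    # Rank entries by keyword overlap; the case passes if the
    # top-ranked entry is one of the relevant ids.
    best = max(case["entries"],
               key=lambda e: keyword_score(case["query"], e["text"]))
    return best["id"] in case["relevant_ids"]

case = {
    "id": "env-001",
    "query": "What database does staging use?",
    "entries": [
        {"id": "e1", "text": "The staging environment uses db-staging.internal."},
        {"id": "e2", "text": "The production environment uses db-prod.internal."},
    ],
    "relevant_ids": ["e1"],
}
print(run_case(case))  # True
```

A real run would stream cases from data/cases.jsonl (one JSON object per line) and report overall accuracy.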

Case format

Each case contains:

  • a query
  • a set of entries
  • one or more relevant_ids
  • a single adversary_type
  • optional metadata for analysis

Example

{
  "id": "env-001",
  "query": "What database does staging use?",
  "adversary_type": "environment_swap",
  "entries": [
    {
      "id": "e1",
      "text": "The staging environment uses db-staging.internal.",
      "tags": ["staging", "database", "infra"],
      "depth": 1
    },
    {
      "id": "e2",
      "text": "The production environment uses db-prod.internal.",
      "tags": ["production", "database", "infra"],
      "depth": 1
    }
  ],
  "relevant_ids": ["e1"]
}
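A minimal structural check on a single cases.jsonl line might look like the following. This is a sketch only: the real scripts/validate.py validates against schema/case.schema.json, and the field checks below are assumptions derived from the case format described above.

```python
import json

REQUIRED = ("id", "query", "adversary_type", "entries", "relevant_ids")

def check_case(case: dict) -> list[str]:
    # Return a list of problems; an empty list means the case looks valid.
    problems = [f"missing field: {f}" for f in REQUIRED if f not in case]
    if not problems:
        entry_ids = {e["id"] for e in case["entries"]}
        # Every relevant id must refer to an entry within the same case.
        problems += [f"unknown relevant id: {r}"
                     for r in case["relevant_ids"] if r not in entry_ids]
    return problems

line = ('{"id": "env-001", "query": "q", "adversary_type": "environment_swap", '
        '"entries": [{"id": "e1", "text": "t"}], "relevant_ids": ["e1"]}')
print(check_case(json.loads(line)))  # []
```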

Current adversary types

  • entity_swap
  • environment_swap
  • time_swap
  • state_update
  • speaker_swap
  • near_duplicate_paraphrase
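Because every case carries a single adversary_type, results can be broken down per type for analysis. The sketch below assumes results are available as (adversary_type, passed) pairs; that convention is chosen here for illustration and is not taken from the repository scripts.

```python
from collections import defaultdict

def accuracy_by_type(results):
    # results: iterable of (adversary_type, passed) pairs.
    totals, hits = defaultdict(int), defaultdict(int)
    for adv_type, passed in results:
        totals[adv_type] += 1
        hits[adv_type] += int(passed)
    # Per-type accuracy: fraction of cases of each type that passed.
    return {t: hits[t] / totals[t] for t in totals}

results = [("entity_swap", True), ("entity_swap", False), ("time_swap", True)]
print(accuracy_by_type(results))  # {'entity_swap': 0.5, 'time_swap': 1.0}
```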

Current dataset size:

  • 573 cases

Intended Use

The benchmark is intended to be:

  • model-agnostic
  • storage-agnostic
  • metadata-friendly
  • easy to publish to GitHub and Hugging Face

About

Adversarial benchmark for memory retrieval systems, with noisy distractors, precision traps, and source-confusion cases for LLM agent memory evaluation.
