codysnider/FalseMemBench
---
pretty_name: FalseMemBench
license: mit
task_categories:
  - text-retrieval
language:
  - en
tags:
  - retrieval
  - memory
  - llm-agents
  - adversarial
size_categories:
  - n<1K
---

FalseMemBench

FalseMemBench is an adversarial benchmark for evaluating memory retrieval systems under heavy distractor pressure.

The goal is to measure whether a system can retrieve the correct memory when many similar but incorrect memories are present.

Focus

The benchmark is designed for memory systems used by LLM agents.

It emphasizes:

  • entity confusion
  • environment confusion
  • time/version confusion
  • stale facts vs current facts
  • speaker confusion
  • near-duplicate paraphrases

Public Surface

The public release is intentionally small:

  • data/cases.jsonl: canonical benchmark dataset
  • schema/case.schema.json: case schema
  • scripts/validate.py: dataset validator
  • scripts/run_tagmem_benchmark.py: benchmark runner for tagmem
  • scripts/run_mempalace_benchmark.py: benchmark runner for MemPalace-style retrieval
  • scripts/run_benchmark.py: simple keyword baseline
  • scripts/run_bm25_benchmark.py: BM25 baseline
  • scripts/run_dense_benchmark.py: dense retrieval baseline
  • docs/: supporting benchmark notes

Layout

  • schema/case.schema.json: benchmark case schema
  • data/cases.jsonl: canonical benchmark cases
  • docs/: benchmark design notes
  • scripts/validate.py: schema validator for the JSONL dataset
  • scripts/run_benchmark.py: simple keyword baseline
  • scripts/run_tagmem_benchmark.py: run the benchmark against a real tagmem binary
  • scripts/run_mempalace_benchmark.py: run the benchmark against MemPalace raw-style retrieval
  • scripts/run_bm25_benchmark.py: lexical BM25 baseline
  • scripts/run_dense_benchmark.py: dense retrieval baseline
  • requirements.txt: optional Python dependencies for BM25 and dense baseline scripts

Canonical Dataset

data/cases.jsonl is the only canonical benchmark file.

There are no public snapshot versions in this repository. Version history is tracked through git.

Running

Validate the canonical dataset:

python3 scripts/validate.py

Run the simple keyword baseline:

python3 scripts/run_benchmark.py

Run the tagmem benchmark:

python3 scripts/run_tagmem_benchmark.py --tagmem-bin tagmem

Run the MemPalace-style benchmark:

python3 scripts/run_mempalace_benchmark.py

The optional BM25 and dense baselines require the Python dependencies listed in requirements.txt.
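As a rough sketch of what a keyword baseline does (the actual scripts/run_benchmark.py may tokenize and score differently; the inline case dict and top-1 scoring used here are illustrative assumptions):

```python
import json

def keyword_score(query: str, text: str) -> int:
    # Count query tokens that also appear in the entry text.
    q_tokens = set(query.lower().split())
    t_tokens = set(text.lower().split())
    return len(q_tokens & t_tokens)

def run_case(case: dict) -> bool:
    # Rank entries by keyword overlap; the case passes if the
    # top-ranked entry is one of the relevant ids.
    best = max(case["entries"],
               key=lambda e: keyword_score(case["query"], e["text"]))
    return best["id"] in case["relevant_ids"]

case = {
    "id": "env-001",
    "query": "What database does staging use?",
    "entries": [
        {"id": "e1", "text": "The staging environment uses db-staging.internal."},
        {"id": "e2", "text": "The production environment uses db-prod.internal."},
    ],
    "relevant_ids": ["e1"],
}
print(run_case(case))  # True
```

A real run would stream cases from data/cases.jsonl (one JSON object per line) and report overall accuracy.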

Case format

Each case contains:

  • a query
  • a set of entries
  • one or more relevant_ids
  • a single adversary_type
  • optional metadata for analysis

Example

{
  "id": "env-001",
  "query": "What database does staging use?",
  "adversary_type": "environment_swap",
  "entries": [
    {
      "id": "e1",
      "text": "The staging environment uses db-staging.internal.",
      "tags": ["staging", "database", "infra"],
      "depth": 1
    },
    {
      "id": "e2",
      "text": "The production environment uses db-prod.internal.",
      "tags": ["production", "database", "infra"],
      "depth": 1
    }
  ],
  "relevant_ids": ["e1"]
}
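A minimal structural check on a single cases.jsonl line might look like the following. This is a sketch only: the real scripts/validate.py validates against schema/case.schema.json, and the field checks below are assumptions derived from the case format described above.

```python
import json

REQUIRED = ("id", "query", "adversary_type", "entries", "relevant_ids")

def check_case(case: dict) -> list[str]:
    # Return a list of problems; an empty list means the case looks valid.
    problems = [f"missing field: {f}" for f in REQUIRED if f not in case]
    if not problems:
        entry_ids = {e["id"] for e in case["entries"]}
        # Every relevant id must refer to an entry within the same case.
        problems += [f"unknown relevant id: {r}"
                     for r in case["relevant_ids"] if r not in entry_ids]
    return problems

line = ('{"id": "env-001", "query": "q", "adversary_type": "environment_swap", '
        '"entries": [{"id": "e1", "text": "t"}], "relevant_ids": ["e1"]}')
print(check_case(json.loads(line)))  # []
```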

Current adversary types

  • entity_swap
  • environment_swap
  • time_swap
  • state_update
  • speaker_swap
  • near_duplicate_paraphrase
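Because every case carries a single adversary_type, results can be broken down per type for analysis. The sketch below assumes results are available as (adversary_type, passed) pairs; that convention is chosen here for illustration and is not taken from the repository scripts.

```python
from collections import defaultdict

def accuracy_by_type(results):
    # results: iterable of (adversary_type, passed) pairs.
    totals, hits = defaultdict(int), defaultdict(int)
    for adv_type, passed in results:
        totals[adv_type] += 1
        hits[adv_type] += int(passed)
    # Per-type accuracy: fraction of cases of each type that passed.
    return {t: hits[t] / totals[t] for t in totals}

results = [("entity_swap", True), ("entity_swap", False), ("time_swap", True)]
print(accuracy_by_type(results))  # {'entity_swap': 0.5, 'time_swap': 1.0}
```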

Current dataset size:

  • 573 cases

Intended Use

The benchmark is intended to be:

  • model-agnostic
  • storage-agnostic
  • metadata-friendly
  • easy to publish to GitHub and Hugging Face

About

Adversarial benchmark for memory retrieval systems, with noisy distractors, precision traps, and source-confusion cases for LLM agent memory evaluation.
