Commit 26bbfd2

yangxinxin-7 and claude authored
benchmark: add LoCoMo evaluation for Supermemory (#1401)
* benchmark: add LoCoMo evaluation scripts for supermemory
* benchmark(locomo): improve supermemory ingest and eval robustness
  - ingest.py: parallelize session upload/poll with ThreadPoolExecutor, add sample-level concurrency, parse LoCoMo dates to ISO 8601, simplify session content format
  - supermemory/eval.py: force explicit supermemory_search in prompt to work around first-turn autoRecall skip, pass question_time to gateway
  - mem0/eval.py: increase gateway startup sleep from 3s to 5s
* fix(benchmark): remove dead code and fix potential IndexError in delete_container.py
  - Remove unused variable `prefix_sanitized`
  - Guard `k.split(":")[1]` access with length check to avoid IndexError on malformed ingest record keys

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 7388e4d commit 26bbfd2

6 files changed

Lines changed: 1690 additions & 1 deletion

.gitignore

Lines changed: 2 additions & 0 deletions
```diff
@@ -195,3 +195,5 @@ examples/data/
 openviking/_version.py
 specs/
 .trae/
+.codex/
+.ttadk/
```

benchmark/locomo/mem0/eval.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -96,7 +96,7 @@ def _restart_openclaw_gateway(base_url: str, sample_id: str, startup_timeout: in
         raise RuntimeError(f"Failed to start openclaw gateway: {e}")

     # Wait for process to fully start before checking health
-    time.sleep(3)
+    time.sleep(5)

     # Wait until gateway is ready
     health_url = f"{base_url.rstrip('/')}/health"
```
Lines changed: 168 additions & 0 deletions
# LoCoMo Benchmark — Supermemory Evaluation

Evaluate [Supermemory](https://supermemory.ai) on the [LoCoMo](https://github.com/snap-stanford/locomo) benchmark using OpenClaw as the agent (same approach as the mem0 eval).

## Overview

Two-phase pipeline:

1. **Ingest** — Import LoCoMo conversations into Supermemory (one `containerTag` per sample)
2. **Eval** — Send QA questions to the OpenClaw agent (which recalls from Supermemory internally), then judge answers with an LLM

Before each sample, `eval.py` automatically:

1. Updates `~/.openclaw/openclaw.json` to set `openclaw-supermemory.config.containerTag = sanitize(sample_id)`
2. Switches `plugins.slots.memory` to `"openclaw-supermemory"`
3. Restarts the OpenClaw gateway to pick up the new config

> **Tag sanitization**: `conv-26` → `conv_26` (matches openclaw-supermemory's internal `sanitizeTag` logic). Both `ingest.py` and `eval.py` apply the same transformation automatically.
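The per-sample rewrite of `~/.openclaw/openclaw.json` (steps 1–2 above) can be sketched as follows. This is a minimal sketch, not the actual `eval.py` code: the nested key layout is taken from the steps above, and `point_memory_at_sample` is a hypothetical helper name.

```python
import json
from pathlib import Path

CONFIG_PATH = Path.home() / ".openclaw" / "openclaw.json"


def point_memory_at_sample(container_tag: str, config_path: Path = CONFIG_PATH) -> None:
    """Point the supermemory plugin at one sample's container.

    Assumes the key layout described above ('openclaw-supermemory.config.containerTag'
    and 'plugins.slots.memory'); adjust if your openclaw.json differs.
    """
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Set the containerTag the plugin will recall from
    config.setdefault("openclaw-supermemory", {}).setdefault("config", {})["containerTag"] = container_tag
    # Route the memory slot to the supermemory plugin
    config.setdefault("plugins", {}).setdefault("slots", {})["memory"] = "openclaw-supermemory"
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
```

After this, the gateway still has to be restarted (step 3) for the change to take effect.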
## Prerequisites

- [OpenClaw](https://openclaw.ai) installed and configured
- `openclaw-supermemory` plugin installed (`~/.openclaw/extensions/openclaw-supermemory`)
- `~/.openclaw/openclaw.json` with `openclaw-supermemory.config.apiKey` set
- API keys in `~/.openviking_benchmark_env`:

  ```env
  SUPERMEMORY_API_KEY=sm-...
  ARK_API_KEY=...  # Volcengine ARK, used for the judge LLM
  ```

- Python dependencies:

  ```bash
  uv sync --frozen --extra dev
  pip install supermemory openai python-dotenv
  ```
## Data

LoCoMo 10-sample dataset at `benchmark/locomo/data/locomo10.json`:

- 10 samples (conversations between two people)
- 1986 QA pairs across 5 categories:
  - 1: single-hop
  - 2: multi-hop
  - 3: temporal
  - 4: world-knowledge
  - 5: adversarial (skipped by default)
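As a sanity check, the category breakdown can be recomputed straight from the JSON. This is a sketch under assumptions: the `qa` and `category` key names are guesses about the LoCoMo file layout (only `sample_id` is confirmed by the scripts below), so adjust to your copy of the data.

```python
import json
from collections import Counter


def category_counts(path: str = "benchmark/locomo/data/locomo10.json",
                    skip_adversarial: bool = True) -> Counter:
    """Count QA pairs per numeric category, optionally skipping category 5.

    Assumed item layout: each sample carries a "qa" list whose entries have a
    numeric "category" field. These key names are assumptions, not confirmed.
    """
    with open(path, encoding="utf-8") as f:
        samples = json.load(f)
    counts: Counter = Counter()
    for sample in samples:
        for qa in sample.get("qa", []):
            cat = qa.get("category")
            if skip_adversarial and cat == 5:
                continue  # adversarial questions are skipped by default
            counts[cat] += 1
    return counts
```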
## Step 1 — Ingest

Import conversations into Supermemory. Each sample is stored under `containerTag = sample_id` (e.g. `conv-26`).

Sessions are formatted as date-prefixed JSON strings, matching the memorybench supermemory provider convention. Indexing is polled until both document and memory reach `done` status.

```bash
# Ingest all 10 samples
python ingest.py

# Ingest a single sample
python ingest.py --sample conv-26

# Ingest specific sessions only
python ingest.py --sample conv-26 --sessions 1-4

# Force re-ingest (ignore existing records)
python ingest.py --sample conv-26 --force-ingest

# Clear all ingest records and start fresh
python ingest.py --clear-ingest-record
```

Key options:

| Option | Description |
|--------|-------------|
| `--sample` | Sample ID (e.g. `conv-26`) or index (0-based). Default: all |
| `--sessions` | Session range, e.g. `1-4` or `3`. Default: all |
| `--limit` | Max samples to process |
| `--force-ingest` | Re-ingest even if already recorded |
| `--clear-ingest-record` | Clear `.ingest_record.json` before running |
| `--no-wait-indexing` | Skip indexing poll (faster, no status check) |

Ingest records are saved to `result/.ingest_record.json` to avoid duplicate ingestion.
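The duplicate-ingestion guard can be sketched like this. The `supermemory:{sample_id}:{session_key}` key convention comes from `delete_container.py`; the stored value and these helper names are assumptions for illustration, not the actual `ingest.py` code.

```python
import json
from pathlib import Path


def load_record(path: Path) -> dict:
    """Load the ingest record, tolerating a missing or corrupt file."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def already_ingested(record: dict, sample_id: str, session_key: str) -> bool:
    """Keys follow the 'supermemory:{sample_id}:{session_key}' convention."""
    return f"supermemory:{sample_id}:{session_key}" in record


def mark_ingested(record: dict, sample_id: str, session_key: str, path: Path) -> None:
    """Record a completed session upload and persist the record to disk."""
    record[f"supermemory:{sample_id}:{session_key}"] = True
    path.write_text(json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8")
```

`--force-ingest` corresponds to skipping the `already_ingested` check; `--clear-ingest-record` to deleting the file before the run.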
## Step 2 — Eval

Send QA questions to the OpenClaw agent and optionally judge the answers.

```bash
# Run QA + judge for all samples (6 concurrent threads)
python eval.py --threads 6 --judge

# Single sample
python eval.py --sample conv-26 --threads 6 --judge

# First 12 questions only
python eval.py --sample conv-26 --count 12 --threads 6 --judge

# Judge-only (grade existing responses in CSV)
python eval.py --judge-only
```

Key options:

| Option | Description |
|--------|-------------|
| `--sample` | Sample ID or index. Default: all |
| `--count` | Max QA items to process |
| `--threads` | Concurrent threads per sample (default: 10) |
| `--judge` | Auto-judge each response after answering |
| `--judge-only` | Skip QA, only grade ungraded rows in the existing CSV |
| `--openclaw-url` | OpenClaw gateway URL (default: `http://127.0.0.1:18789`) |
| `--openclaw-token` | Auth token (or `OPENCLAW_GATEWAY_TOKEN` env var) |
| `--judge-base-url` | Judge API base URL (default: Volcengine ARK) |
| `--judge-model` | Judge model (default: `doubao-seed-2-0-pro-260215`) |
| `--output` | Output CSV path (default: `result/qa_results.csv`) |

Results are written to `result/qa_results.csv`. Failed (`[ERROR]`) rows are automatically removed at the start of each run and retried.
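The thread fan-out that `--threads` controls can be sketched with `concurrent.futures`. This is a sketch, not the actual `eval.py` implementation: `ask_agent` is a placeholder for whatever callable sends one question to the OpenClaw gateway and returns the response text.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def answer_all(questions, ask_agent, threads=10):
    """Answer QA items concurrently, preserving input order in the result list.

    Failures are recorded as '[ERROR] ...' strings, matching the convention
    that lets a later run drop and retry failed rows.
    """
    results = [None] * len(questions)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = {pool.submit(ask_agent, q): i for i, q in enumerate(questions)}
        for fut in as_completed(futures):
            i = futures[fut]
            try:
                results[i] = fut.result()
            except Exception as e:
                results[i] = f"[ERROR] {e}"
    return results
```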
## Output

`result/qa_results.csv` columns:

| Column | Description |
|--------|-------------|
| `sample_id` | Conversation sample ID |
| `question_id` | Unique question ID (e.g. `conv-26_qa0`) |
| `question` / `answer` | Question and gold answer |
| `category` / `category_name` | Question category |
| `response` | Agent response |
| `input_tokens` / `output_tokens` / `total_tokens` | LLM token usage |
| `time_cost` | End-to-end latency (seconds) |
| `result` | `CORRECT` or `WRONG` |
| `reasoning` | Judge's reasoning |
## Summary Output

After eval completes:

```
=== Token & Latency Summary ===
Total input tokens : 123456
Avg time per query : 18.3s

=== Accuracy Summary ===
Overall: 512/1540 = 33.25%
By category:
  multi-hop       : 120/321 = 37.38%
  single-hop      : 98/282  = 34.75%
  temporal        : 28/96   = 29.17%
  world-knowledge : 266/841 = 31.63%
```
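The accuracy summary can also be recomputed offline from the CSV, using only the `category_name` and `result` columns documented above:

```python
import csv
from collections import Counter


def accuracy_by_category(csv_path):
    """Return {category_name: (correct, total, percent)} from qa_results.csv."""
    correct, total = Counter(), Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            cat = row["category_name"]
            total[cat] += 1
            if row["result"] == "CORRECT":
                correct[cat] += 1
    return {cat: (correct[cat], total[cat], 100.0 * correct[cat] / total[cat])
            for cat in total}
```

This is handy for re-slicing results after a `--judge-only` pass without rerunning the full eval.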
## Delete Supermemory Data

```bash
# Delete a specific sample's documents
python delete_container.py conv-26

# Delete all samples from the dataset
python delete_container.py --from-data

# Delete first N samples
python delete_container.py --from-data --limit 3
```

> **Note:** `delete_container.py` uses `documents.list(containerTags=[tag])` + `documents.deleteBulk(ids=[...])` in batches of 100, and also clears the corresponding ingest records from `result/.ingest_record.json`.
delete_container.py

Lines changed: 188 additions & 0 deletions

```python
"""
Delete all Supermemory documents for one or more containerTags (sample_ids).

Usage:
    # Delete a single container
    python delete_container.py conv-26

    # Delete multiple containers
    python delete_container.py conv-26 conv-31 conv-45

    # Delete first N samples from locomo10.json
    python delete_container.py --from-data --limit 2

    # Delete all samples from locomo10.json
    python delete_container.py --from-data
"""

import argparse
import json
import os
import re
import sys
from pathlib import Path

from dotenv import load_dotenv

load_dotenv(Path.home() / ".openviking_benchmark_env")

try:
    from supermemory import Supermemory
except ImportError:
    print("Error: supermemory package not installed. Run: pip install supermemory", file=sys.stderr)
    sys.exit(1)

SCRIPT_DIR = Path(__file__).parent.resolve()
DEFAULT_DATA_PATH = str(SCRIPT_DIR / ".." / "data" / "locomo10.json")
DEFAULT_RECORD_PATH = str(SCRIPT_DIR / "result" / ".ingest_record.json")


def sanitize_tag(raw: str) -> str:
    """Sanitize a tag string to match the openclaw-supermemory convention.

    e.g. 'conv-26' -> 'conv_26'
    """
    tag = re.sub(r"[^a-zA-Z0-9_]", "_", raw)
    tag = re.sub(r"_+", "_", tag)
    tag = tag.strip("_")
    return tag


def wipe_container(client: Supermemory, container_tag: str) -> int:
    """
    Delete all documents in a containerTag using documents.list + deleteBulk.
    Returns the number of documents deleted.
    """
    all_ids: list[str] = []
    page = 1

    while True:
        response = client.documents.list(
            container_tags=[container_tag],
            limit=100,
            page=page,
        )

        memories = getattr(response, "memories", None)
        if memories is None and isinstance(response, dict):
            memories = response.get("memories", [])

        if not memories:
            break

        for doc in memories:
            doc_id = getattr(doc, "id", None) or (doc.get("id") if isinstance(doc, dict) else None)
            if doc_id:
                all_ids.append(doc_id)

        # Check pagination
        pagination = getattr(response, "pagination", None) or (response.get("pagination") if isinstance(response, dict) else None)
        total_pages = None
        if pagination:
            total_pages = getattr(pagination, "totalPages", None) or (pagination.get("totalPages") if isinstance(pagination, dict) else None)

        if total_pages is None or page >= total_pages:
            break
        page += 1

    if not all_ids:
        return 0

    # Delete in batches of 100
    deleted = 0
    for i in range(0, len(all_ids), 100):
        batch = all_ids[i : i + 100]
        client.documents.delete_bulk(ids=batch)
        deleted += len(batch)

    return deleted


def clear_ingest_records(container_tag: str, record_path: str) -> int:
    """Remove ingest records for the given container_tag. Returns the count removed."""
    try:
        with open(record_path, "r", encoding="utf-8") as f:
            record = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return 0

    # Records are keyed as "supermemory:{sample_id}:{session_key}".
    # Match by sanitized sample_id to handle keys like "conv-26" vs "conv_26".
    keys_to_remove = [k for k in record if len(k.split(":")) >= 2 and sanitize_tag(k.split(":")[1]) == container_tag]

    for k in keys_to_remove:
        del record[k]

    with open(record_path, "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2, ensure_ascii=False)

    return len(keys_to_remove)


def delete_container(client: Supermemory, sample_id: str, record_path: str) -> bool:
    container_tag = sanitize_tag(sample_id)
    print(f"  [containerTag={container_tag}] listing documents...", file=sys.stderr)

    try:
        deleted = wipe_container(client, container_tag)
        if deleted == 0:
            print("  [WARN] No documents found (may already be deleted)", file=sys.stderr)
        else:
            print(f"  [OK] Deleted {deleted} documents", file=sys.stderr)
    except Exception as e:
        print(f"  [ERROR] Failed to delete documents: {e}", file=sys.stderr)
        return False

    removed = clear_ingest_records(container_tag, record_path)
    if removed:
        print(f"  Cleared {removed} ingest record(s)", file=sys.stderr)

    return True


def main() -> None:
    parser = argparse.ArgumentParser(description="Delete all Supermemory documents for given sample(s)")
    parser.add_argument("samples", nargs="*", help="sample_id(s) to delete (e.g. conv-26 conv-31)")
    parser.add_argument("--api-key", default=None, help="Supermemory API key (or SUPERMEMORY_API_KEY env var)")
    parser.add_argument("--from-data", action="store_true", help="load sample_ids from locomo10.json")
    parser.add_argument("--input", default=DEFAULT_DATA_PATH, help="path to locomo10.json")
    parser.add_argument("--limit", type=int, default=None, help="max samples to delete (with --from-data)")
    parser.add_argument(
        "--record",
        default=DEFAULT_RECORD_PATH,
        help=f"Path to ingest progress record (default: {DEFAULT_RECORD_PATH})",
    )
    args = parser.parse_args()

    api_key = args.api_key or os.environ.get("SUPERMEMORY_API_KEY", "")
    if not api_key:
        print("Error: Supermemory API key required (--api-key or SUPERMEMORY_API_KEY env var)", file=sys.stderr)
        sys.exit(1)

    sample_ids: list[str] = list(args.samples)

    if args.from_data:
        with open(args.input, "r", encoding="utf-8") as f:
            data = json.load(f)
        if args.limit:
            data = data[: args.limit]
        sample_ids += [s["sample_id"] for s in data]

    if not sample_ids:
        print("Error: no sample_ids specified. Pass sample_ids or use --from-data", file=sys.stderr)
        sys.exit(1)

    sample_ids = list(dict.fromkeys(sample_ids))  # deduplicate, preserve order
    print(f"Deleting documents for {len(sample_ids)} sample(s)...", file=sys.stderr)

    client = Supermemory(api_key=api_key)
    ok = 0
    for sid in sample_ids:
        print(f"\n=== {sid} ===", file=sys.stderr)
        if delete_container(client, sid, args.record):
            ok += 1

    print(f"\nDone: {ok}/{len(sample_ids)} succeeded", file=sys.stderr)


if __name__ == "__main__":
    main()
```
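The tag sanitization and ingest-record key filtering above can be exercised standalone. The snippet below restates that logic for illustration, without the Supermemory client dependency:

```python
import re


def sanitize_tag(raw: str) -> str:
    """Same transformation as sanitize_tag in delete_container.py above."""
    tag = re.sub(r"[^a-zA-Z0-9_]", "_", raw)
    tag = re.sub(r"_+", "_", tag)
    return tag.strip("_")


def matches_container(record_key: str, container_tag: str) -> bool:
    """Mirror of the key filter in clear_ingest_records.

    Keys look like 'supermemory:{sample_id}:{session_key}'; malformed keys
    (fewer than two ':'-separated parts) are safely ignored.
    """
    parts = record_key.split(":")
    return len(parts) >= 2 and sanitize_tag(parts[1]) == container_tag
```

Note how the length guard (the IndexError fix from the commit message) makes malformed record keys a no-op instead of a crash.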
