Commit 26bbfd2

yangxinxin-7 and claude authored
benchmark: add LoCoMo evaluation for Supermemory (#1401)
* benchmark: add LoCoMo evaluation scripts for supermemory
* benchmark(locomo): improve supermemory ingest and eval robustness
  - ingest.py: parallelize session upload/poll with ThreadPoolExecutor, add sample-level concurrency, parse LoCoMo dates to ISO 8601, simplify session content format
  - supermemory/eval.py: force explicit supermemory_search in prompt to work around first-turn autoRecall skip, pass question_time to gateway
  - mem0/eval.py: increase gateway startup sleep from 3s to 5s
* fix(benchmark): remove dead code and fix potential IndexError in delete_container.py
  - Remove unused variable `prefix_sanitized`
  - Guard `k.split(":")[1]` access with length check to avoid IndexError on malformed ingest record keys

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 7388e4d commit 26bbfd2

6 files changed

Lines changed: 1690 additions & 1 deletion

.gitignore

Lines changed: 2 additions & 0 deletions
```diff
@@ -195,3 +195,5 @@ examples/data/
 openviking/_version.py
 specs/
 .trae/
+.codex/
+.ttadk/
```

benchmark/locomo/mem0/eval.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -96,7 +96,7 @@ def _restart_openclaw_gateway(base_url: str, sample_id: str, startup_timeout: in
         raise RuntimeError(f"Failed to start openclaw gateway: {e}")

     # Wait for process to fully start before checking health
-    time.sleep(3)
+    time.sleep(5)

     # Wait until gateway is ready
     health_url = f"{base_url.rstrip('/')}/health"
```
Lines changed: 168 additions & 0 deletions
# LoCoMo Benchmark — Supermemory Evaluation

Evaluate [Supermemory](https://supermemory.ai) on the [LoCoMo](https://github.com/snap-stanford/locomo) benchmark using OpenClaw as the agent (same approach as the mem0 eval).

## Overview

Two-phase pipeline:

1. **Ingest** — Import LoCoMo conversations into Supermemory (one `containerTag` per sample)
2. **Eval** — Send QA questions to the OpenClaw agent (which recalls from Supermemory internally), then judge answers with an LLM

Before each sample, `eval.py` automatically:

1. Updates `~/.openclaw/openclaw.json` to set `openclaw-supermemory.config.containerTag = sanitize(sample_id)`
2. Switches `plugins.slots.memory` to `"openclaw-supermemory"`
3. Restarts the OpenClaw gateway to pick up the new config

> **Tag sanitization**: `conv-26` → `conv_26` (matches openclaw-supermemory's internal `sanitizeTag` logic). Both `ingest.py` and `eval.py` apply the same transformation automatically.
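The per-sample rewrite of `~/.openclaw/openclaw.json` (steps 1–2 above) can be sketched as follows. This is a minimal sketch, not the actual `eval.py` code: the nested key layout is taken from the steps above, and `point_memory_at_sample` is a hypothetical helper name.

```python
import json
from pathlib import Path

CONFIG_PATH = Path.home() / ".openclaw" / "openclaw.json"


def point_memory_at_sample(container_tag: str, config_path: Path = CONFIG_PATH) -> None:
    """Point the supermemory plugin at one sample's container.

    Assumes the key layout described above ('openclaw-supermemory.config.containerTag'
    and 'plugins.slots.memory'); adjust if your openclaw.json differs.
    """
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Set the containerTag the plugin will recall from
    config.setdefault("openclaw-supermemory", {}).setdefault("config", {})["containerTag"] = container_tag
    # Route the memory slot to the supermemory plugin
    config.setdefault("plugins", {}).setdefault("slots", {})["memory"] = "openclaw-supermemory"
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
```

After this, the gateway still has to be restarted (step 3) for the change to take effect.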
## Prerequisites

- [OpenClaw](https://openclaw.ai) installed and configured
- `openclaw-supermemory` plugin installed (`~/.openclaw/extensions/openclaw-supermemory`)
- `~/.openclaw/openclaw.json` with `openclaw-supermemory.config.apiKey` set
- API keys in `~/.openviking_benchmark_env`:

  ```env
  SUPERMEMORY_API_KEY=sm-...
  ARK_API_KEY=...  # Volcengine ARK, used for the judge LLM
  ```

- Python dependencies:

  ```bash
  uv sync --frozen --extra dev
  pip install supermemory openai python-dotenv
  ```
## Data

LoCoMo 10-sample dataset at `benchmark/locomo/data/locomo10.json`:

- 10 samples (conversations between two people)
- 1986 QA pairs across 5 categories:
  - 1: single-hop
  - 2: multi-hop
  - 3: temporal
  - 4: world-knowledge
  - 5: adversarial (skipped by default)
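As a sanity check, the category breakdown can be recomputed straight from the JSON. This is a sketch under assumptions: the `qa` and `category` key names are guesses about the LoCoMo file layout (only `sample_id` is confirmed by the scripts below), so adjust to your copy of the data.

```python
import json
from collections import Counter


def category_counts(path: str = "benchmark/locomo/data/locomo10.json",
                    skip_adversarial: bool = True) -> Counter:
    """Count QA pairs per numeric category, optionally skipping category 5.

    Assumed item layout: each sample carries a "qa" list whose entries have a
    numeric "category" field. These key names are assumptions, not confirmed.
    """
    with open(path, encoding="utf-8") as f:
        samples = json.load(f)
    counts: Counter = Counter()
    for sample in samples:
        for qa in sample.get("qa", []):
            cat = qa.get("category")
            if skip_adversarial and cat == 5:
                continue  # adversarial questions are skipped by default
            counts[cat] += 1
    return counts
```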
## Step 1 — Ingest

Import conversations into Supermemory. Each sample is stored under `containerTag = sample_id` (e.g. `conv-26`).

Sessions are formatted as date-prefixed JSON strings, matching the memorybench supermemory provider convention. Indexing is polled until both document and memory reach `done` status.

```bash
# Ingest all 10 samples
python ingest.py

# Ingest a single sample
python ingest.py --sample conv-26

# Ingest specific sessions only
python ingest.py --sample conv-26 --sessions 1-4

# Force re-ingest (ignore existing records)
python ingest.py --sample conv-26 --force-ingest

# Clear all ingest records and start fresh
python ingest.py --clear-ingest-record
```

Key options:

| Option | Description |
|--------|-------------|
| `--sample` | Sample ID (e.g. `conv-26`) or index (0-based). Default: all |
| `--sessions` | Session range, e.g. `1-4` or `3`. Default: all |
| `--limit` | Max samples to process |
| `--force-ingest` | Re-ingest even if already recorded |
| `--clear-ingest-record` | Clear `.ingest_record.json` before running |
| `--no-wait-indexing` | Skip indexing poll (faster, no status check) |

Ingest records are saved to `result/.ingest_record.json` to avoid duplicate ingestion.
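The duplicate-ingestion guard can be sketched like this. The `supermemory:{sample_id}:{session_key}` key convention comes from `delete_container.py`; the stored value and these helper names are assumptions for illustration, not the actual `ingest.py` code.

```python
import json
from pathlib import Path


def load_record(path: Path) -> dict:
    """Load the ingest record, tolerating a missing or corrupt file."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def already_ingested(record: dict, sample_id: str, session_key: str) -> bool:
    """Keys follow the 'supermemory:{sample_id}:{session_key}' convention."""
    return f"supermemory:{sample_id}:{session_key}" in record


def mark_ingested(record: dict, sample_id: str, session_key: str, path: Path) -> None:
    """Record a completed session upload and persist the record to disk."""
    record[f"supermemory:{sample_id}:{session_key}"] = True
    path.write_text(json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8")
```

`--force-ingest` corresponds to skipping the `already_ingested` check; `--clear-ingest-record` to deleting the file before the run.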
## Step 2 — Eval

Send QA questions to the OpenClaw agent and optionally judge the answers.

```bash
# Run QA + judge for all samples (6 concurrent threads)
python eval.py --threads 6 --judge

# Single sample
python eval.py --sample conv-26 --threads 6 --judge

# First 12 questions only
python eval.py --sample conv-26 --count 12 --threads 6 --judge

# Judge-only (grade existing responses in CSV)
python eval.py --judge-only
```

Key options:

| Option | Description |
|--------|-------------|
| `--sample` | Sample ID or index. Default: all |
| `--count` | Max QA items to process |
| `--threads` | Concurrent threads per sample (default: 10) |
| `--judge` | Auto-judge each response after answering |
| `--judge-only` | Skip QA, only grade ungraded rows in the existing CSV |
| `--openclaw-url` | OpenClaw gateway URL (default: `http://127.0.0.1:18789`) |
| `--openclaw-token` | Auth token (or `OPENCLAW_GATEWAY_TOKEN` env var) |
| `--judge-base-url` | Judge API base URL (default: Volcengine ARK) |
| `--judge-model` | Judge model (default: `doubao-seed-2-0-pro-260215`) |
| `--output` | Output CSV path (default: `result/qa_results.csv`) |

Results are written to `result/qa_results.csv`. Failed (`[ERROR]`) rows are automatically removed at the start of each run and retried.
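The thread fan-out that `--threads` controls can be sketched with `concurrent.futures`. This is a sketch, not the actual `eval.py` implementation: `ask_agent` is a placeholder for whatever callable sends one question to the OpenClaw gateway and returns the response text.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def answer_all(questions, ask_agent, threads=10):
    """Answer QA items concurrently, preserving input order in the result list.

    Failures are recorded as '[ERROR] ...' strings, matching the convention
    that lets a later run drop and retry failed rows.
    """
    results = [None] * len(questions)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = {pool.submit(ask_agent, q): i for i, q in enumerate(questions)}
        for fut in as_completed(futures):
            i = futures[fut]
            try:
                results[i] = fut.result()
            except Exception as e:
                results[i] = f"[ERROR] {e}"
    return results
```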
## Output

`result/qa_results.csv` columns:

| Column | Description |
|--------|-------------|
| `sample_id` | Conversation sample ID |
| `question_id` | Unique question ID (e.g. `conv-26_qa0`) |
| `question` / `answer` | Question and gold answer |
| `category` / `category_name` | Question category |
| `response` | Agent response |
| `input_tokens` / `output_tokens` / `total_tokens` | LLM token usage |
| `time_cost` | End-to-end latency (seconds) |
| `result` | `CORRECT` or `WRONG` |
| `reasoning` | Judge's reasoning |
## Summary Output

After eval completes:

```
=== Token & Latency Summary ===
Total input tokens : 123456
Avg time per query : 18.3s

=== Accuracy Summary ===
Overall: 512/1540 = 33.25%
By category:
  multi-hop       : 120/321 = 37.38%
  single-hop      : 98/282  = 34.75%
  temporal        : 28/96   = 29.17%
  world-knowledge : 266/841 = 31.63%
```
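The accuracy summary can also be recomputed offline from the CSV, using only the `category_name` and `result` columns documented above:

```python
import csv
from collections import Counter


def accuracy_by_category(csv_path):
    """Return {category_name: (correct, total, percent)} from qa_results.csv."""
    correct, total = Counter(), Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            cat = row["category_name"]
            total[cat] += 1
            if row["result"] == "CORRECT":
                correct[cat] += 1
    return {cat: (correct[cat], total[cat], 100.0 * correct[cat] / total[cat])
            for cat in total}
```

This is handy for re-slicing results after a `--judge-only` pass without rerunning the full eval.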
## Delete Supermemory Data

```bash
# Delete a specific sample's documents
python delete_container.py conv-26

# Delete all samples from the dataset
python delete_container.py --from-data

# Delete first N samples
python delete_container.py --from-data --limit 3
```

> **Note:** `delete_container.py` uses `documents.list(containerTags=[tag])` + `documents.deleteBulk(ids=[...])` in batches of 100, and also clears the corresponding ingest records from `result/.ingest_record.json`.
delete_container.py

Lines changed: 188 additions & 0 deletions

```python
"""
Delete all Supermemory documents for one or more containerTags (sample_ids).

Usage:
    # Delete a single container
    python delete_container.py conv-26

    # Delete multiple containers
    python delete_container.py conv-26 conv-31 conv-45

    # Delete first N samples from locomo10.json
    python delete_container.py --from-data --limit 2

    # Delete all samples from locomo10.json
    python delete_container.py --from-data
"""

import argparse
import json
import os
import re
import sys
from pathlib import Path

from dotenv import load_dotenv

load_dotenv(Path.home() / ".openviking_benchmark_env")

try:
    from supermemory import Supermemory
except ImportError:
    print("Error: supermemory package not installed. Run: pip install supermemory", file=sys.stderr)
    sys.exit(1)

SCRIPT_DIR = Path(__file__).parent.resolve()
DEFAULT_DATA_PATH = str(SCRIPT_DIR / ".." / "data" / "locomo10.json")
DEFAULT_RECORD_PATH = str(SCRIPT_DIR / "result" / ".ingest_record.json")


def sanitize_tag(raw: str) -> str:
    """Sanitize a tag string to match the openclaw-supermemory convention.

    e.g. 'conv-26' -> 'conv_26'
    """
    tag = re.sub(r"[^a-zA-Z0-9_]", "_", raw)
    tag = re.sub(r"_+", "_", tag)
    tag = tag.strip("_")
    return tag


def wipe_container(client: Supermemory, container_tag: str) -> int:
    """
    Delete all documents in a containerTag using documents.list + deleteBulk.
    Returns the number of documents deleted.
    """
    all_ids: list[str] = []
    page = 1

    while True:
        response = client.documents.list(
            container_tags=[container_tag],
            limit=100,
            page=page,
        )

        memories = getattr(response, "memories", None)
        if memories is None and isinstance(response, dict):
            memories = response.get("memories", [])

        if not memories:
            break

        for doc in memories:
            doc_id = getattr(doc, "id", None) or (doc.get("id") if isinstance(doc, dict) else None)
            if doc_id:
                all_ids.append(doc_id)

        # Check pagination
        pagination = getattr(response, "pagination", None) or (response.get("pagination") if isinstance(response, dict) else None)
        total_pages = None
        if pagination:
            total_pages = getattr(pagination, "totalPages", None) or (pagination.get("totalPages") if isinstance(pagination, dict) else None)

        if total_pages is None or page >= total_pages:
            break
        page += 1

    if not all_ids:
        return 0

    # Delete in batches of 100
    deleted = 0
    for i in range(0, len(all_ids), 100):
        batch = all_ids[i : i + 100]
        client.documents.delete_bulk(ids=batch)
        deleted += len(batch)

    return deleted


def clear_ingest_records(container_tag: str, record_path: str) -> int:
    """Remove ingest records for the given container_tag. Returns the count removed."""
    try:
        with open(record_path, "r", encoding="utf-8") as f:
            record = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return 0

    # Records are keyed as "supermemory:{sample_id}:{session_key}".
    # Match by sanitized sample_id to handle keys like "conv-26" vs "conv_26".
    keys_to_remove = [k for k in record if len(k.split(":")) >= 2 and sanitize_tag(k.split(":")[1]) == container_tag]

    for k in keys_to_remove:
        del record[k]

    with open(record_path, "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2, ensure_ascii=False)

    return len(keys_to_remove)


def delete_container(client: Supermemory, sample_id: str, record_path: str) -> bool:
    container_tag = sanitize_tag(sample_id)
    print(f"  [containerTag={container_tag}] listing documents...", file=sys.stderr)

    try:
        deleted = wipe_container(client, container_tag)
        if deleted == 0:
            print("  [WARN] No documents found (may already be deleted)", file=sys.stderr)
        else:
            print(f"  [OK] Deleted {deleted} documents", file=sys.stderr)
    except Exception as e:
        print(f"  [ERROR] Failed to delete documents: {e}", file=sys.stderr)
        return False

    removed = clear_ingest_records(container_tag, record_path)
    if removed:
        print(f"  Cleared {removed} ingest record(s)", file=sys.stderr)

    return True


def main() -> None:
    parser = argparse.ArgumentParser(description="Delete all Supermemory documents for given sample(s)")
    parser.add_argument("samples", nargs="*", help="sample_id(s) to delete (e.g. conv-26 conv-31)")
    parser.add_argument("--api-key", default=None, help="Supermemory API key (or SUPERMEMORY_API_KEY env var)")
    parser.add_argument("--from-data", action="store_true", help="load sample_ids from locomo10.json")
    parser.add_argument("--input", default=DEFAULT_DATA_PATH, help="path to locomo10.json")
    parser.add_argument("--limit", type=int, default=None, help="max samples to delete (with --from-data)")
    parser.add_argument(
        "--record",
        default=DEFAULT_RECORD_PATH,
        help=f"Path to ingest progress record (default: {DEFAULT_RECORD_PATH})",
    )
    args = parser.parse_args()

    api_key = args.api_key or os.environ.get("SUPERMEMORY_API_KEY", "")
    if not api_key:
        print("Error: Supermemory API key required (--api-key or SUPERMEMORY_API_KEY env var)", file=sys.stderr)
        sys.exit(1)

    sample_ids: list[str] = list(args.samples)

    if args.from_data:
        with open(args.input, "r", encoding="utf-8") as f:
            data = json.load(f)
        if args.limit:
            data = data[: args.limit]
        sample_ids += [s["sample_id"] for s in data]

    if not sample_ids:
        print("Error: no sample_ids specified. Pass sample_ids or use --from-data", file=sys.stderr)
        sys.exit(1)

    sample_ids = list(dict.fromkeys(sample_ids))  # deduplicate, preserve order
    print(f"Deleting documents for {len(sample_ids)} sample(s)...", file=sys.stderr)

    client = Supermemory(api_key=api_key)
    ok = 0
    for sid in sample_ids:
        print(f"\n=== {sid} ===", file=sys.stderr)
        if delete_container(client, sid, args.record):
            ok += 1

    print(f"\nDone: {ok}/{len(sample_ids)} succeeded", file=sys.stderr)


if __name__ == "__main__":
    main()
```
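The tag sanitization and ingest-record key filtering above can be exercised standalone. The snippet below restates that logic for illustration, without the Supermemory client dependency:

```python
import re


def sanitize_tag(raw: str) -> str:
    """Same transformation as sanitize_tag in delete_container.py above."""
    tag = re.sub(r"[^a-zA-Z0-9_]", "_", raw)
    tag = re.sub(r"_+", "_", tag)
    return tag.strip("_")


def matches_container(record_key: str, container_tag: str) -> bool:
    """Mirror of the key filter in clear_ingest_records.

    Keys look like 'supermemory:{sample_id}:{session_key}'; malformed keys
    (fewer than two ':'-separated parts) are safely ignored.
    """
    parts = record_key.split(":")
    return len(parts) >= 2 and sanitize_tag(parts[1]) == container_tag
```

Note how the length guard (the IndexError fix from the commit message) makes malformed record keys a no-op instead of a crash.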
