feat: support Anthropic-compatible endpoints for benchmark LLMs

Sanderhoff-alt · Sanderhoff-alt · commit 1e703eb101aa · 2026-05-08T15:29:24.000+08:00
Add Anthropic as a first-class provider for answer and judge models,
validate provider-specific environment variables per role, and update
the README to match the omb CLI and runtime-local output workflow.
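
The per-role validation described above can be sketched as a small
table-driven check. This is an illustration only, not the code in this
commit (the diff uses an if-chain in `_ensure_provider_env`). The names
`REQUIRED_ENV` and `missing_env` are hypothetical. The role defaults
(`groq` for answers, `gemini` for judging) are taken from cli.py. Unlike
the diff, this sketch also reports unknown provider names, which is an
assumption about desirable behavior.

```python
import os

# Hypothetical table-driven variant of the per-role check in this commit.
# Each provider maps to the environment variables that can satisfy it;
# any one of the listed names is enough.
REQUIRED_ENV = {
    "anthropic": ("ANTHROPIC_API_KEY",),
    "gemini": ("GEMINI_API_KEY", "GOOGLE_API_KEY"),
    "groq": ("GROQ_API_KEY",),
    "openai": ("OPENAI_API_KEY",),
}


def missing_env(env: dict[str, str]) -> list[str]:
    """Return one error string per role whose provider key is absent."""
    roles = {
        "Answer": env.get("OMB_ANSWER_LLM", "groq"),  # default from cli.py
        "Judge": env.get("OMB_JUDGE_LLM", "gemini"),  # default from cli.py
    }
    errors = []
    for role, provider in roles.items():
        names = REQUIRED_ENV.get(provider)
        if names is None:
            # Assumption: the diff silently accepts unknown providers.
            errors.append(f"{role}: unknown provider '{provider}'")
        elif not any(env.get(name) for name in names):
            errors.append(f"{role}: '{provider}' requires {' or '.join(names)}")
    return errors


if __name__ == "__main__":
    # Print any problems found in the current process environment.
    for line in missing_env(dict(os.environ)):
        print(line)
```

Passing the environment in as a plain dict keeps the check free of side
effects, so it can be exercised without touching `os.environ`.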
diff --git a/README.md b/README.md
@@ -1,69 +1,88 @@
-# AMB — Agent Memory Benchmark
+# OMB — Open Memory Benchmark
 
-We built AMB because we wanted to be honest about how Hindsight performs — and because no existing benchmark gave us the full picture. AMB is fully open: datasets, prompts, scoring logic, and results.
+We built OMB because we wanted to be honest about how Hindsight performs — and because no existing benchmark gave us the full picture. OMB is fully open: datasets, prompts, scoring logic, and results.
 
 Live leaderboard: **[agentmemorybenchmark.ai](https://agentmemorybenchmark.ai)**
 
 ## The problem with existing benchmarks
 
 LoComo and LongMemEval are solid datasets, but they were designed for an era of 32k context windows. State-of-the-art models now have million-token context windows — on most instances, a naive "dump everything into context" approach scores competitively, not because it's a good memory architecture, but because retrieval has become the easy part. The benchmarks can no longer tell them apart.
 
-Both datasets were also built around chatbot use cases. Agents today don't just answer questions about conversation history — they research, plan, execute multi-step tasks, and build knowledge across many interactions. AMB adds datasets that focus on agentic tasks: memory across tool calls, knowledge built from document research, preferences applied to multi-step decisions.
+Both datasets were also built around chatbot use cases. Agents today don't just answer questions about conversation history — they research, plan, execute multi-step tasks, and build knowledge across many interactions. OMB adds datasets that focus on agentic tasks: memory across tool calls, knowledge built from document research, preferences applied to multi-step decisions.
 
-## What AMB measures
+## What OMB measures
 
-A memory system that scores 90% accuracy but costs $10 per user per day is not better than one that scores 82% and costs $0.10. AMB starts from accuracy because it's the hardest to fake, and tracks speed and token cost alongside it.
+A memory system that scores 90% accuracy but costs $10 per user per day is not better than one that scores 82% and costs $0.10. OMB starts from accuracy because it's the hardest to fake, and tracks speed and token cost alongside it.
 
-The only credible benchmark result is one you can reproduce yourself. AMB publishes everything: the evaluation harness, judge prompts, answer generation prompts, and the exact models used. Small changes to any of these can swing accuracy scores by double digits — we publish all of them.
+The only credible benchmark result is one you can reproduce yourself. OMB publishes everything: the evaluation harness, judge prompts, answer generation prompts, and the exact models used. Small changes to any of these can swing accuracy scores by double digits — we publish all of them.
 
 ## How it works
 
 1. **Ingest** — documents from a dataset are loaded into a memory provider
 2. **Retrieve** — for each query the memory provider retrieves relevant context
-3. **Generate** — a Gemini model produces an answer from the retrieved context
-4. **Judge** — a second Gemini call scores the answer against gold answers
+3. **Generate** — an LLM produces an answer from the retrieved context
+4. **Judge** — a second LLM call scores the answer against gold answers
 
 Retrieval time is tracked separately from generation; ingestion time is also recorded.
 
 ## Setup
 
 ```bash
-# Copy and fill in your API key
-cp .env.example .env   # or just create .env with:
-# GEMINI_API_KEY=...
+# Example: Anthropic-compatible endpoint for answer/judge LLMs
+export ANTHROPIC_BASE_URL=https://your-endpoint.example.com
+export ANTHROPIC_API_KEY=your-api-key
+export OMB_ANSWER_LLM=anthropic
+export OMB_JUDGE_LLM=anthropic
+export OMB_ANSWER_MODEL=your-model-name
+export OMB_JUDGE_MODEL=your-model-name
 ```
 
+Only set the provider-specific variables for the providers you actually use:
+
+- `anthropic`: `ANTHROPIC_API_KEY`, plus `ANTHROPIC_BASE_URL` if you use a non-default endpoint
+- `gemini`: `GEMINI_API_KEY` or `GOOGLE_API_KEY`
+- `groq`: `GROQ_API_KEY`
+- `openai`: `OPENAI_API_KEY`
+
 ## Usage
 
 ```bash
 # List available datasets, memory providers, and modes
-uv run amb providers
+uv run omb providers
 
 # List domains for a dataset
-uv run amb domains --dataset personamem
+uv run omb domains --dataset personamem
 
 # Run a benchmark
-uv run amb run --dataset personamem --domain 32k --memory bm25
+uv run omb run --dataset personamem --domain 32k --memory bm25
 
 # Limit scale for a quick test
-uv run amb run --dataset personamem --domain 32k --memory bm25 --query-limit 20
+uv run omb run --dataset personamem --domain 32k --memory bm25 --query-limit 20
 
 # Oracle mode: ingest only gold documents (tests generation quality in isolation)
-uv run amb run --dataset personamem --domain 32k --memory bm25 --oracle
+uv run omb run --dataset personamem --domain 32k --memory bm25 --oracle
 
 # Dataset statistics
-uv run amb dataset-stats --dataset personamem
+uv run omb dataset-stats --dataset personamem
 
 # Browse results in the browser
-uv run amb view
+uv run omb view
 ```
 
 ## Results
 
-Results are saved to `outputs/{dataset}/{memory}/{mode}/{domain}.json` and can be explored with `uv run amb view`.
+By default, results are saved to `outputs/{dataset}/{memory}/{mode}/{domain}.json`.
+If you pass `--output-dir`, results are written under that directory instead.
+This lets runtime-local wrappers keep outputs under their own `results/` folders while still using the same benchmark CLI.
+
+Explore results with `uv run omb view`.
 
 ## Requirements
 
 - Python ≥ 3.11
-- `GEMINI_API_KEY` in `.env` or environment
+- API keys for the providers you actually use:
+  - `ANTHROPIC_API_KEY` for `anthropic`
+  - `GEMINI_API_KEY` or `GOOGLE_API_KEY` for `gemini`
+  - `GROQ_API_KEY` for `groq`
+  - `OPENAI_API_KEY` for `openai`
 - For MemBench: set `MEMBENCH_DATA_PATH` to your local data directory
diff --git a/pyproject.toml b/pyproject.toml
@@ -4,6 +4,7 @@ version = "0.1.0"
 description = "Open Memory Benchmark"
 requires-python = ">=3.11"
 dependencies = [
+    "anthropic>=0.84.0",
     "datasets>=2.0",
     "typer>=0.12",
     "rich>=13",
diff --git a/src/memory_bench/cli.py b/src/memory_bench/cli.py
@@ -22,12 +22,46 @@
 console = Console()
 
 
-def _resolve_gemini_key() -> None:
-    key = os.environ.get("GEMINI_API_KEY") or os.environ.get("GOOGLE_API_KEY")
-    if not key:
-        typer.echo("Error: GEMINI_API_KEY environment variable is not set.", err=True)
-        raise typer.Exit(1)
-    os.environ["GOOGLE_API_KEY"] = key
+def _ensure_provider_env(provider: str, role: str) -> None:
+    if provider == "anthropic":
+        if not os.environ.get("ANTHROPIC_API_KEY"):
+            typer.echo(f"Error: {role} LLM provider '{provider}' requires ANTHROPIC_API_KEY.", err=True)
+            raise typer.Exit(1)
+        return
+
+    if provider == "gemini":
+        key = os.environ.get("GEMINI_API_KEY") or os.environ.get("GOOGLE_API_KEY")
+        if not key:
+            typer.echo(f"Error: {role} LLM provider '{provider}' requires GEMINI_API_KEY or GOOGLE_API_KEY.", err=True)
+            raise typer.Exit(1)
+        os.environ["GOOGLE_API_KEY"] = key
+        return
+
+    if provider == "groq":
+        if not os.environ.get("GROQ_API_KEY"):
+            typer.echo(f"Error: {role} LLM provider '{provider}' requires GROQ_API_KEY.", err=True)
+            raise typer.Exit(1)
+        return
+
+    if provider == "openai":
+        if not os.environ.get("OPENAI_API_KEY"):
+            typer.echo(f"Error: {role} LLM provider '{provider}' requires OPENAI_API_KEY.", err=True)
+            raise typer.Exit(1)
+        return
+
+
+def _validate_run_env(memory: str) -> None:
+    answer_provider = os.environ.get("OMB_ANSWER_LLM", "groq")
+    judge_provider = os.environ.get("OMB_JUDGE_LLM", "gemini")
+    _ensure_provider_env(answer_provider, "Answer")
+    _ensure_provider_env(judge_provider, "Judge")
+
+    if memory == "hindsight":
+        key = os.environ.get("GEMINI_API_KEY") or os.environ.get("GOOGLE_API_KEY")
+        if not key:
+            typer.echo("Error: memory provider 'hindsight' requires GEMINI_API_KEY or GOOGLE_API_KEY for embedded extraction.", err=True)
+            raise typer.Exit(1)
+        os.environ["GOOGLE_API_KEY"] = key
 
 
 @app.command()
@@ -53,7 +87,7 @@ def run(
     description: str = typer.Option(None, "--description", "-d", help="Optional description for this run (stored in the result JSON)"),
 ) -> None:
     """Run an evaluation on a single split (optionally filtered to a category)."""
-    _resolve_gemini_key()
+    _validate_run_env(memory)
 
     ds = get_dataset(dataset)
 
diff --git a/src/memory_bench/llm/__init__.py b/src/memory_bench/llm/__init__.py
@@ -1,11 +1,13 @@
 import os
 
+from .anthropic import AnthropicLLM
 from .base import LLM, Schema
 from .gemini import GeminiLLM
 from .groq import GroqLLM
 from .openai import OpenAILLM
 
 REGISTRY: dict[str, type[LLM]] = {
+    "anthropic": AnthropicLLM,
     "gemini": GeminiLLM,
     "groq": GroqLLM,
     "openai": OpenAILLM,
diff --git a/src/memory_bench/llm/anthropic.py b/src/memory_bench/llm/anthropic.py
@@ -0,0 +1,142 @@
+import json
+import os
+import re
+import time
+
+from .base import LLM, Schema
+
+_MAX_RETRIES = 6
+_RETRY_BASE_DELAY = 5
+
+
+def _parse_json_payload(text: str) -> dict:
+    text = text.strip()
+
+    try:
+        return json.loads(text)
+    except json.JSONDecodeError:
+        pass
+
+    fenced = re.search(r"```(?:json)?\s*(\{.*\})\s*```", text, flags=re.DOTALL | re.IGNORECASE)
+    if fenced:
+        return json.loads(fenced.group(1))
+
+    start = text.find("{")
+    end = text.rfind("}")
+    if start != -1 and end != -1 and end > start:
+        return json.loads(text[start : end + 1])
+
+    raise json.JSONDecodeError("Could not find JSON object in model response", text, 0)
+
+
+def _coerce_text_payload(text: str, schema: Schema) -> dict | None:
+    text = text.strip()
+    if not text:
+        return None
+
+    result: dict = {}
+    lowered = text.lower()
+    for field in schema.required:
+        spec = schema.properties.get(field, {})
+        field_type = spec.get("type", "string")
+
+        if field_type == "string":
+            result[field] = text
+            continue
+
+        if field_type == "boolean":
+            if re.search(r"\btrue\b", lowered):
+                result[field] = True
+                continue
+            if re.search(r"\bfalse\b", lowered):
+                result[field] = False
+                continue
+            if re.search(r"\b(correct|yes)\b", lowered) and not re.search(r"\b(incorrect|wrong|no)\b", lowered):
+                result[field] = True
+                continue
+            if re.search(r"\b(incorrect|wrong|no)\b", lowered):
+                result[field] = False
+                continue
+            return None
+
+        return None
+
+    return result
+
+
+class AnthropicLLM(LLM):
+    def __init__(self, model: str | None = None):
+        from anthropic import Anthropic
+
+        api_key = os.environ.get("ANTHROPIC_API_KEY")
+        if not api_key:
+            raise RuntimeError("Anthropic provider requires ANTHROPIC_API_KEY")
+
+        base_url = os.environ.get("ANTHROPIC_BASE_URL")
+        self._client = Anthropic(
+            api_key=api_key,
+            base_url=base_url or None,
+            max_retries=0,
+        )
+        self._model = (
+            model
+            or os.environ.get("ANTHROPIC_MODEL")
+            or "claude-sonnet-4-5"
+        )
+
+    @property
+    def model_id(self) -> str:
+        return f"anthropic:{self._model}"
+
+    def generate(self, prompt: str, schema: Schema) -> dict:
+        from anthropic import APIConnectionError, APIStatusError, RateLimitError
+
+        schema_json = {
+            "type": "object",
+            "properties": schema.properties,
+            "required": schema.required,
+            "additionalProperties": False,
+        }
+        system_prompt = (
+            "Return only a valid JSON object matching this schema. "
+            "Do not wrap JSON in markdown fences.\n\n"
+            f"{json.dumps(schema_json, ensure_ascii=False)}"
+        )
+
+        delay = _RETRY_BASE_DELAY
+        last_exc = None
+
+        for attempt in range(_MAX_RETRIES):
+            try:
+                response = self._client.messages.create(
+                    model=self._model,
+                    max_tokens=4096,
+                    temperature=0.0,
+                    system=system_prompt,
+                    messages=[{"role": "user", "content": prompt}],
+                )
+                text = "".join(block.text for block in response.content if getattr(block, "type", None) == "text")
+                try:
+                    return _parse_json_payload(text)
+                except json.JSONDecodeError:
+                    coerced = _coerce_text_payload(text, schema)
+                    if coerced is not None:
+                        return coerced
+                    raise
+            except (RateLimitError, APIConnectionError) as e:
+                last_exc = e
+            except APIStatusError as e:
+                last_exc = e
+                if e.status_code not in (429, 500, 502, 503, 504):
+                    raise
+            except Exception as e:
+                last_exc = e
+                msg = str(e)
+                if "429" not in msg and "rate" not in msg.lower():
+                    raise
+
+            if attempt < _MAX_RETRIES - 1:
+                time.sleep(delay)
+                delay *= 2
+
+        raise RuntimeError(f"Anthropic request failed after {_MAX_RETRIES} attempts: {last_exc}")
diff --git a/uv.lock b/uv.lock