
Commit fd78cc4

anandgupta42 and claude committed
feat: add Spider 2.0-DBT benchmark harness
Add benchmark runner, evaluator, and reporting scripts for the Spider 2.0-DBT benchmark suite. Auto-downloads the Spider2 repo and DuckDB databases on first run if not available.

- `run_benchmark.py`: Parallel task runner with auto-retry, resume support
- `evaluate_results.py`: Official `duckdb_match` evaluation against gold DBs
- `setup_spider2.py`: One-time setup (sparse clone + Google Drive downloads)
- `report.py`: Leaderboard comparison and domain breakdown reports
- `config.py`: Centralized paths, timeouts, and leaderboard data
- `prompt_template.py`: Task prompt builder for the agent
- `.gitignore`: Excludes dataset dirs (`spider2_repo/`, `workspace/`, `results/`)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 5108d02 commit fd78cc4

11 files changed

Lines changed: 1969 additions & 0 deletions

experiments/spider2_dbt/.gitignore

Lines changed: 13 additions & 0 deletions
```
# Spider2 cloned repo (large, re-cloneable)
spider2_repo/

# Per-task workspace copies
workspace/

# Results and reports (generated artifacts)
results/
reports/

# Python
__pycache__/
*.pyc
```

experiments/spider2_dbt/README.md

Lines changed: 103 additions & 0 deletions
# Spider 2.0-DBT Benchmark Evaluation

Evaluate **altimate-code** against the [Spider 2.0-DBT](https://spider2-dbt.github.io/) benchmark — 68 real-world dbt + DuckDB data engineering tasks.

## Quick Start

```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Setup (clone Spider2 repo, download databases)
python setup_spider2.py

# 3. Run benchmark (all 68 tasks)
python run_benchmark.py

# 4. Evaluate against gold standard
python evaluate_results.py

# 5. Generate interactive HTML report
python report.py
```

## Smoke Test (5 tasks)

```bash
python run_benchmark.py --tasks 5
python evaluate_results.py
python report.py
```
## CLI Options

### `run_benchmark.py`

| Flag | Default | Description |
|------|---------|-------------|
| `--tasks N` | all | First N tasks |
| `--tasks id1 id2` | all | Specific task IDs |
| `--timeout` | 600 | Seconds per task |
| `--model` | `anthropic/claude-opus-4-6` | Model to use |
| `--agent` | default | Agent to use |
| `--no-resume` | off | Force re-run all tasks |
| `--dry-run` | off | Print tasks without running |

### `evaluate_results.py`

| Flag | Default | Description |
|------|---------|-------------|
| `--results` | latest | Path to benchmark results JSON |

### `report.py`

| Flag | Default | Description |
|------|---------|-------------|
| `--evaluation` | latest | Path to evaluation JSON |
| `--output` | auto | Output HTML file path |
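The flag surface above could be declared with `argparse` roughly as follows. This is a sketch, not the actual argument handling in `run_benchmark.py`; flag names and defaults are taken from the table, everything else is an assumption:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of the run_benchmark.py CLI surface described in the table above."""
    p = argparse.ArgumentParser(description="Run Spider 2.0-DBT benchmark tasks")
    # nargs="*" lets --tasks take either a count ("5") or explicit task IDs.
    p.add_argument("--tasks", nargs="*", default=None,
                   help="First N tasks (a number) or specific task IDs")
    p.add_argument("--timeout", type=int, default=600, help="Seconds per task")
    p.add_argument("--model", default="anthropic/claude-opus-4-6",
                   help="Model to use")
    p.add_argument("--agent", default=None, help="Agent to use")
    p.add_argument("--no-resume", action="store_true",
                   help="Force re-run all tasks")
    p.add_argument("--dry-run", action="store_true",
                   help="Print tasks without running")
    return p
```

With `nargs="*"`, `--tasks 5` and `--tasks shopify001 tpch002` both parse into a list of strings; the runner can then decide whether the single value is a count or an ID.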
## Directory Structure

```
experiments/spider2_dbt/
├── config.py             # Paths, leaderboard data, defaults
├── setup_spider2.py      # One-time: clone Spider2, download data
├── prompt_template.py    # Prompt engineering for each task
├── run_benchmark.py      # Runner: invoke altimate-code per task
├── evaluate_results.py   # Bridge to Spider2's official eval_utils
├── report.py             # Generate interactive single-file HTML report
├── requirements.txt      # Python deps
├── results/              # Timestamped JSON results
│   └── incremental/      # Per-task results for resumability
├── reports/              # Generated HTML reports
├── workspace/            # Per-task dbt project copies (gitignored)
└── spider2_repo/         # Cloned Spider2 repository (gitignored)
```
## Resumability

The benchmark runner saves per-task results to `results/incremental/`. If interrupted, re-running `python run_benchmark.py` will skip completed tasks. Use `--no-resume` to force a full re-run.
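The skip logic can be sketched in a few lines, assuming one JSON file per completed task ID in `results/incremental/` (the actual runner's file layout may differ; `pending_tasks` and `record_result` are hypothetical names):

```python
import json
from pathlib import Path


def record_result(task_id: str, result: dict, incremental_dir) -> None:
    """Persist one task's result so an interrupted run can skip it later."""
    Path(incremental_dir, f"{task_id}.json").write_text(json.dumps(result))


def pending_tasks(task_ids, incremental_dir, no_resume: bool = False) -> list:
    """Return the task IDs that still need to run.

    A <task_id>.json marker in incremental_dir means the task is done;
    no_resume=True ignores the markers and re-runs everything.
    """
    if no_resume:
        return list(task_ids)
    done = {p.stem for p in Path(incremental_dir).glob("*.json")}
    return [t for t in task_ids if t not in done]
```

Because each result is written as soon as its task finishes, a crash or Ctrl-C loses at most the tasks that were in flight.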
## Report Features

The HTML report is a single self-contained file (no external dependencies):

- **Summary cards**: Pass rate, total time, model, rank
- **Leaderboard chart**: SVG bar chart with all Spider2 entries + altimate-code highlighted
- **Category breakdown**: Tasks grouped by domain with pass/fail counts
- **Per-task table**: Sortable, filterable, with expandable agent logs
- **Timing histogram**: Distribution of execution times
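Keeping the report self-contained comes down to inlining all data and scripts at generation time. A minimal sketch of the idea (hypothetical structure, not the real `report.py`):

```python
import json


def render_report(evaluation: dict) -> str:
    """Embed the evaluation JSON directly in one HTML file: no external assets."""
    payload = json.dumps(evaluation)
    return f"""<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>Spider 2.0-DBT Report</title></head>
<body>
<h1>Spider 2.0-DBT Report</h1>
<script>
// All data is inlined so the file works offline, opened straight from disk.
const EVALUATION = {payload};
document.body.insertAdjacentHTML("beforeend",
  `<p>Pass rate: ${{EVALUATION.pass_rate}}%</p>`);
</script>
</body></html>"""
```

A file produced this way can be emailed or archived as-is; there are no CDN links to rot and no server required to view it.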
## Leaderboard Context

Current Spider 2.0-DBT leaderboard (as of 2025):

| Agent | Pass Rate |
|-------|-----------|
| Databao Agent | 44.11% |
| MLE-Bench Agent | 38.24% |
| Claude 3.5 Sonnet (CoT) | 36.76% |
| GPT-4o (CoT) | 33.82% |
| CodeS Agent | 32.35% |
| OpenHands Agent | 30.88% |
| SWE-Agent | 27.94% |
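The report's "rank" summary card can be derived from this table by counting entries with a strictly higher pass rate. A sketch, using an illustrative pass rate rather than a measured result:

```python
# Spider 2.0-DBT leaderboard pass rates from the table above.
LEADERBOARD = [
    ("Databao Agent", 44.11),
    ("MLE-Bench Agent", 38.24),
    ("Claude 3.5 Sonnet (CoT)", 36.76),
    ("GPT-4o (CoT)", 33.82),
    ("CodeS Agent", 32.35),
    ("OpenHands Agent", 30.88),
    ("SWE-Agent", 27.94),
]


def rank_of(pass_rate: float) -> int:
    """1-based rank a new agent would take at the given pass rate."""
    return 1 + sum(1 for _, rate in LEADERBOARD if rate > pass_rate)


# A hypothetical 35.0% pass rate would slot in at rank 4,
# between GPT-4o (CoT) and Claude 3.5 Sonnet (CoT).
print(rank_of(35.0))  # 4
```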
Lines changed: 2 additions & 0 deletions

```bash
#!/bin/bash
exec bun run --cwd /Users/anandgupta/codebase/altimate-code/packages/opencode --conditions=browser src/index.ts "$@"
```

experiments/spider2_dbt/config.py

Lines changed: 74 additions & 0 deletions
```python
"""Configuration constants for Spider 2.0-DBT benchmark evaluation."""

from __future__ import annotations

import os
import re
from pathlib import Path

# ── Paths ──────────────────────────────────────────────────────────────────────

BASE_DIR = Path(__file__).resolve().parent
SPIDER2_REPO_DIR = BASE_DIR / "spider2_repo"
SPIDER2_DBT_DIR = SPIDER2_REPO_DIR / "spider2-dbt"
TASK_JSONL = SPIDER2_DBT_DIR / "examples" / "spider2-dbt.jsonl"
EXAMPLES_DIR = SPIDER2_DBT_DIR / "examples"
GOLD_EVAL_JSONL = SPIDER2_DBT_DIR / "evaluation_suite" / "gold" / "spider2_eval.jsonl"
EVAL_UTILS_DIR = SPIDER2_DBT_DIR / "evaluation_suite"
WORKSPACE_DIR = BASE_DIR / "workspace"
RESULTS_DIR = BASE_DIR / "results"
INCREMENTAL_DIR = RESULTS_DIR / "incremental"
REPORTS_DIR = BASE_DIR / "reports"

# ── Spider2 Repository ─────────────────────────────────────────────────────────

SPIDER2_REPO_URL = "https://github.com/xlang-ai/Spider2.git"
# Pin to a known-good commit for reproducibility
SPIDER2_COMMIT = "main"

# Google Drive file IDs for DuckDB database zips (from Spider2 README)
# Format: (gdrive_id, expected_filename)
DUCKDB_ZIP_DOWNLOADS = [
    ("1N3f7BSWC4foj-V-1C9n8M2XmgV7FOcqL", "DBT_start_db.zip"),
    ("1s0USV_iQLo4oe05QqAMnhGGp5jeejCzp", "dbt_gold.zip"),
]

# ── Execution ──────────────────────────────────────────────────────────────────

ALTIMATE_CODE_BIN = os.environ.get("ALTIMATE_CODE_BIN", "altimate-code")
DEFAULT_TIMEOUT = 600  # seconds per task (slowest legit tasks take ~593s)
MAX_RETRIES = 2  # auto-retry only for fast exits (API/init failures)
FAST_EXIT_THRESHOLD_S = 10  # tasks completing under this are likely failures
DEFAULT_PARALLEL = 2  # concurrent tasks (4 caused too much resource contention)
DEFAULT_MODEL = "anthropic/claude-sonnet-4-6"
DEFAULT_AGENT = "coder"

# ── Leaderboard Data (Spider 2.0-DBT, as of 2025) ─────────────────────────────
# Source: https://spider2-dbt.github.io/
# Format: (agent_name, pass_rate)

LEADERBOARD: list[tuple[str, float]] = [
    ("Databao Agent", 44.11),
    ("MLE-Bench Agent", 38.24),
    ("Claude 3.5 Sonnet (CoT)", 36.76),
    ("GPT-4o (CoT)", 33.82),
    ("CodeS Agent", 32.35),
    ("OpenHands Agent", 30.88),
    ("SWE-Agent", 27.94),
    ("Gemini 1.5 Pro (CoT)", 26.47),
    ("Llama 3.1 405B (CoT)", 22.06),
    ("GPT-4o mini (CoT)", 19.12),
    ("Claude 3 Haiku (CoT)", 16.18),
]

# ── Task Categories (domain grouping for report) ──────────────────────────────


def get_task_domain(instance_id: str) -> str:
    """Extract the domain from an instance_id by stripping the 3-digit task suffix.

    e.g. 'shopify002' -> 'shopify', 'f1003' -> 'f1', 'tpch001' -> 'tpch'
    """
    # \d{3}$ rather than \d+$, so a domain that itself ends in a digit
    # (the '1' in 'f1003') keeps that digit.
    return re.sub(r"\d{3}$", "", instance_id)
```
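`get_task_domain` is what gives the report its per-domain category breakdown: group the per-task results by domain and tally pass/fail. A standalone example (the instance IDs are illustrative, not the real task list):

```python
import re
from collections import Counter


def get_task_domain(instance_id: str) -> str:
    # Standalone copy of config.get_task_domain: drop the 3-digit task suffix
    # so domains ending in a digit (e.g. 'f1' in 'f1003') survive intact.
    return re.sub(r"\d{3}$", "", instance_id)


# Illustrative instance IDs, grouped the way report.py's breakdown would.
ids = ["shopify001", "shopify002", "tpch001", "airbnb003"]
by_domain = Counter(get_task_domain(i) for i in ids)
print(by_domain)  # Counter({'shopify': 2, 'tpch': 1, 'airbnb': 1})
```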
