
Commit 73fbd45

anandgupta42 and claude committed
feat: add Spider2-DBT benchmark evaluation pipeline
- 68-task benchmark for evaluating the agent on dbt + DuckDB workflows
- Resumable runner with parallel execution (`--parallel N`)
- Official Spider2 evaluation bridge (`eval_utils`)
- Interactive single-file HTML report with leaderboard chart
- One-time setup script for Spider2 repo + DuckDB databases

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent f2ccdf9 commit 73fbd45

10 files changed

Lines changed: 1856 additions & 0 deletions

File tree

experiments/spider2_dbt/.gitignore

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
# Spider2 cloned repo (large, re-cloneable)
spider2_repo/

# Per-task workspace copies
workspace/

# Results and reports (generated artifacts)
results/
reports/

# Python
__pycache__/
*.pyc

experiments/spider2_dbt/README.md

Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
# Spider 2.0-DBT Benchmark Evaluation

Evaluate **altimate-code** against the [Spider 2.0-DBT](https://spider2-dbt.github.io/) benchmark — 68 real-world dbt + DuckDB data engineering tasks.

## Quick Start

```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Setup (clone Spider2 repo, download databases)
python setup_spider2.py

# 3. Run benchmark (all 68 tasks)
python run_benchmark.py

# 4. Evaluate against gold standard
python evaluate_results.py

# 5. Generate interactive HTML report
python report.py
```
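
`setup_spider2.py` clones the Spider2 repository and fetches the DuckDB database zips listed in `config.py`. A minimal sketch of that one-time setup, assuming the `gdown` package for the Google Drive downloads (the committed script may do this differently):

```python
# Hedged sketch of the one-time setup, not the committed setup_spider2.py.
# gdown is an assumed helper for Google Drive; zips land in the CWD here.
import subprocess

import gdown

from config import (
    DUCKDB_ZIP_DOWNLOADS,
    SPIDER2_COMMIT,
    SPIDER2_REPO_DIR,
    SPIDER2_REPO_URL,
)

# Clone once, then check out the configured commit/branch
if not SPIDER2_REPO_DIR.exists():
    subprocess.run(
        ["git", "clone", SPIDER2_REPO_URL, str(SPIDER2_REPO_DIR)], check=True
    )
subprocess.run(
    ["git", "-C", str(SPIDER2_REPO_DIR), "checkout", SPIDER2_COMMIT], check=True
)

# Fetch each database zip by its Google Drive file ID
for gdrive_id, filename in DUCKDB_ZIP_DOWNLOADS:
    gdown.download(id=gdrive_id, output=filename, quiet=False)
```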

## Smoke Test (5 tasks)

```bash
python run_benchmark.py --tasks 5
python evaluate_results.py
python report.py
```

## CLI Options

### `run_benchmark.py`

| Flag | Default | Description |
|------|---------|-------------|
| `--tasks N` | all | Run only the first N tasks |
| `--tasks id1 id2` | all | Run specific task IDs |
| `--parallel N` | 4 | Concurrent tasks |
| `--timeout` | 600 | Seconds per task |
| `--model` | `anthropic/claude-opus-4-6` | Model to use |
| `--agent` | `coder` | Agent to use |
| `--no-resume` | off | Force re-run of all tasks |
| `--dry-run` | off | Print tasks without running |
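
The `--parallel` and `--timeout` flags imply a worker-pool runner. A minimal sketch, assuming `concurrent.futures` and a zero exit code as the success signal (the argument list passed to `altimate-code` below is illustrative, not the real CLI):

```python
# Hedged sketch of parallel task execution; "--task" is an invented
# placeholder flag, since the real altimate-code CLI is not shown here.
import subprocess
from concurrent.futures import ThreadPoolExecutor

from config import ALTIMATE_CODE_BIN, DEFAULT_PARALLEL, DEFAULT_TIMEOUT


def run_task(instance_id: str) -> bool:
    """Run one task; timeouts and nonzero exit codes count as failures."""
    try:
        proc = subprocess.run(
            [ALTIMATE_CODE_BIN, "--task", instance_id],
            timeout=DEFAULT_TIMEOUT,
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0


task_ids = ["shopify002", "f1003"]  # illustrative IDs
with ThreadPoolExecutor(max_workers=DEFAULT_PARALLEL) as pool:
    results = dict(zip(task_ids, pool.map(run_task, task_ids)))
```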

### `evaluate_results.py`

| Flag | Default | Description |
|------|---------|-------------|
| `--results` | latest | Path to benchmark results JSON |

### `report.py`

| Flag | Default | Description |
|------|---------|-------------|
| `--evaluation` | latest | Path to evaluation JSON |
| `--output` | auto | Output HTML file path |

## Directory Structure

```
experiments/spider2_dbt/
├── config.py             # Paths, leaderboard data, defaults
├── setup_spider2.py      # One-time: clone Spider2, download data
├── prompt_template.py    # Prompt engineering for each task
├── run_benchmark.py      # Runner: invoke altimate-code per task
├── evaluate_results.py   # Bridge to Spider2's official eval_utils
├── report.py             # Generate interactive single-file HTML report
├── requirements.txt      # Python deps
├── results/              # Timestamped JSON results
│   └── incremental/      # Per-task results for resumability
├── reports/              # Generated HTML reports
├── workspace/            # Per-task dbt project copies (gitignored)
└── spider2_repo/         # Cloned Spider2 repository (gitignored)
```

## Resumability

The benchmark runner saves per-task results to `results/incremental/`. If interrupted, re-running `python run_benchmark.py` will skip completed tasks. Use `--no-resume` to force a full re-run.
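
A minimal sketch of the skip logic (the per-task filename scheme is an assumption; the runner's actual naming is not shown in this commit):

```python
# Hedged sketch of resumability; "{instance_id}.json" is an assumed
# filename scheme for the incremental result files.
from config import INCREMENTAL_DIR


def is_completed(instance_id: str) -> bool:
    """A task counts as done once its incremental result file exists."""
    return (INCREMENTAL_DIR / f"{instance_id}.json").exists()


def pending_tasks(task_ids: list[str], no_resume: bool = False) -> list[str]:
    """With --no-resume everything re-runs; otherwise finished tasks are skipped."""
    if no_resume:
        return list(task_ids)
    return [t for t in task_ids if not is_completed(t)]
```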

## Report Features

The HTML report is a single self-contained file (no external dependencies):

- **Summary cards**: Pass rate, total time, model, rank
- **Leaderboard chart**: SVG bar chart with all Spider2 entries + altimate-code highlighted
- **Category breakdown**: Tasks grouped by domain with pass/fail counts
- **Per-task table**: Sortable, filterable, with expandable agent logs
- **Timing histogram**: Distribution of execution times

## Leaderboard Context

Current Spider 2.0-DBT leaderboard (as of 2025):

| Agent | Pass Rate |
|-------|-----------|
| Databao Agent | 44.11% |
| MLE-Bench Agent | 38.24% |
| Claude 3.5 Sonnet (CoT) | 36.76% |
| GPT-4o (CoT) | 33.82% |
| CodeS Agent | 32.35% |
| OpenHands Agent | 30.88% |
| SWE-Agent | 27.94% |
experiments/spider2_dbt/config.py

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
"""Configuration constants for Spider 2.0-DBT benchmark evaluation."""

from __future__ import annotations

import os
import re
from pathlib import Path

# ── Paths ──────────────────────────────────────────────────────────────────────

BASE_DIR = Path(__file__).resolve().parent
SPIDER2_REPO_DIR = BASE_DIR / "spider2_repo"
SPIDER2_DBT_DIR = SPIDER2_REPO_DIR / "spider2-dbt"
TASK_JSONL = SPIDER2_DBT_DIR / "examples" / "spider2-dbt.jsonl"
EXAMPLES_DIR = SPIDER2_DBT_DIR / "examples"
GOLD_EVAL_JSONL = SPIDER2_DBT_DIR / "evaluation_suite" / "gold" / "spider2_eval.jsonl"
EVAL_UTILS_DIR = SPIDER2_DBT_DIR / "evaluation_suite"
WORKSPACE_DIR = BASE_DIR / "workspace"
RESULTS_DIR = BASE_DIR / "results"
INCREMENTAL_DIR = RESULTS_DIR / "incremental"
REPORTS_DIR = BASE_DIR / "reports"

# ── Spider2 Repository ─────────────────────────────────────────────────────────

SPIDER2_REPO_URL = "https://github.com/xlang-ai/Spider2.git"
# Pin to a known-good commit for reproducibility; "main" tracks the branch
# tip, so replace it with a specific SHA to actually pin.
SPIDER2_COMMIT = "main"

# Google Drive file IDs for DuckDB database zips (from Spider2 README)
# Format: (gdrive_id, expected_filename)
DUCKDB_ZIP_DOWNLOADS = [
    ("1N3f7BSWC4foj-V-1C9n8M2XmgV7FOcqL", "DBT_start_db.zip"),
    ("1s0USV_iQLo4oe05QqAMnhGGp5jeejCzp", "dbt_gold.zip"),
]

# ── Execution ──────────────────────────────────────────────────────────────────

ALTIMATE_CODE_BIN = os.environ.get("ALTIMATE_CODE_BIN", "altimate-code")
DEFAULT_TIMEOUT = 600  # seconds per task
DEFAULT_PARALLEL = 4  # concurrent tasks
DEFAULT_MODEL = "anthropic/claude-opus-4-6"
DEFAULT_AGENT = "coder"

# ── Leaderboard Data (Spider 2.0-DBT, as of 2025) ─────────────────────────────
# Source: https://spider2-dbt.github.io/
# Format: (agent_name, pass_rate)

LEADERBOARD: list[tuple[str, float]] = [
    ("Databao Agent", 44.11),
    ("MLE-Bench Agent", 38.24),
    ("Claude 3.5 Sonnet (CoT)", 36.76),
    ("GPT-4o (CoT)", 33.82),
    ("CodeS Agent", 32.35),
    ("OpenHands Agent", 30.88),
    ("SWE-Agent", 27.94),
    ("Gemini 1.5 Pro (CoT)", 26.47),
    ("Llama 3.1 405B (CoT)", 22.06),
    ("GPT-4o mini (CoT)", 19.12),
    ("Claude 3 Haiku (CoT)", 16.18),
]

# ── Task Categories (domain grouping for report) ──────────────────────────────


def get_task_domain(instance_id: str) -> str:
    """Extract domain from instance_id by stripping the 3-digit task suffix.

    e.g. 'shopify002' -> 'shopify', 'f1003' -> 'f1', 'tpch001' -> 'tpch'
    """
    # Strip exactly three trailing digits: a bare \d+$ would greedily turn
    # 'f1003' into 'f' instead of 'f1'.
    return re.sub(r"\d{3}$", "", instance_id)
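
For the report's category breakdown, tasks can be grouped by the domain this helper extracts. A small illustrative usage (IDs modeled on the docstring examples):

```python
# Group illustrative instance IDs by domain for a category breakdown.
from collections import Counter

from config import get_task_domain

ids = ["shopify002", "f1003", "tpch001", "tpch002"]
print(Counter(get_task_domain(i) for i in ids))
# Counter({'tpch': 2, 'shopify': 1, 'f1': 1})
```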
