Benchmark repository for SWE-bench coding evaluations using AgentV. It contains SWE-bench examples imported into the AgentV eval format, with Docker workspace configurations for isolated execution.
This repository stores SWE-bench-style coding evaluation benchmarks in the AgentV eval format. Each eval defines:
- A Docker workspace based on the official SWE-bench evaluation images
- A problem statement from a real GitHub issue
- Assertions that verify the fix by running the project's test suite
SWE-bench tasks test an agent's ability to resolve real-world GitHub issues by producing correct code patches. Evals here are imported from the SWE-bench dataset and converted to AgentV YAML format.
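To make the three parts above concrete, here is a hypothetical sketch of what one imported eval file might contain. The field names (`id`, `workspace`, `prompt`, `assertions`) and the image-name pattern are illustrative assumptions, not the actual AgentV schema:

```yaml
# Hypothetical sketch — field names are illustrative, not the actual AgentV schema.
id: django-15180
workspace:
  # Official SWE-bench evaluation image for this instance (name pattern assumed).
  docker_image: swebench/sweb.eval.x86_64.django__django-15180
prompt: |
  <problem statement text taken from the original GitHub issue>
assertions:
  # Verify the fix by running the project's own test suite inside the workspace.
  - run: <project test command>
    expect: pass
```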
```
.agentv/                  # AgentV project configuration
  config.yaml             # Studio threshold and settings
  targets.yaml            # Model provider targets
evals/
  swebench-verified/      # SWE-bench Verified examples (curated, human-validated)
  swebench-lite/          # SWE-bench Lite examples (smaller subset)
scripts/
  import-swebench.py      # Import script to pull from HuggingFace datasets
```
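The core of an import script like `import-swebench.py` is converting each SWE-bench record into an eval definition. Below is a minimal sketch of that conversion step. The SWE-bench field names (`instance_id`, `problem_statement`, `FAIL_TO_PASS`) come from the published HuggingFace dataset; the AgentV-side keys and the Docker image name pattern are assumptions for illustration:

```python
# Sketch of the record -> eval conversion an import script might perform.
# SWE-bench field names are from the published dataset; the output keys
# are illustrative assumptions, not the actual AgentV schema.
import json

def to_eval(record: dict) -> dict:
    """Convert one SWE-bench record into an AgentV-style eval dict (schema assumed)."""
    return {
        "id": record["instance_id"],
        "workspace": {
            # Official per-instance SWE-bench eval images (name pattern assumed).
            "docker_image": f"swebench/sweb.eval.x86_64.{record['instance_id']}",
        },
        "prompt": record["problem_statement"],
        # FAIL_TO_PASS is a JSON-encoded list of tests the fix must make pass.
        "assert_tests": json.loads(record["FAIL_TO_PASS"]),
    }

sample = {
    "instance_id": "django__django-15180",
    "problem_statement": "example issue text",
    "FAIL_TO_PASS": '["tests/urlpatterns/test_resolvers.py::ResolverTests"]',
}
print(to_eval(sample)["id"])
```

In the real script, the records would come from `datasets.load_dataset(...)` and each resulting dict would be serialized to a `*.EVAL.yaml` file under `evals/`.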
Prerequisites:
- AgentV installed
- Docker available (SWE-bench images will be pulled automatically)
Run a single eval:
```sh
agentv eval evals/swebench-verified/django-15180.EVAL.yaml
```

Run all SWE-bench Verified evals:

```sh
agentv eval evals/swebench-verified/
```

Run with a specific target:

```sh
agentv eval evals/swebench-verified/ --target claude-opus
```

Results are stored in `.agentv/results/` (git-ignored). Use `agentv studio` to view and compare results across targets:

```sh
agentv studio
```

The default pass threshold is set to 0.8 in `.agentv/config.yaml`.
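For reference, the threshold setting might look something like the fragment below; the key names are assumptions, so check the generated `.agentv/config.yaml` for the actual schema:

```yaml
# Hypothetical sketch of .agentv/config.yaml — key names are assumptions.
studio:
  pass_threshold: 0.8
```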