Benchmark repository for SWE-bench coding evaluations using AgentV. It contains SWE-bench examples imported into the AgentV eval format, with Docker workspace configurations for isolated execution.
This repository stores SWE-bench-style coding evaluation benchmarks in the AgentV eval format. Each eval defines:
- A Docker workspace based on the official SWE-bench evaluation images
- A problem statement from a real GitHub issue
- Assertions that verify the fix by running the project's test suite
SWE-bench tasks test an agent's ability to resolve real-world GitHub issues by producing correct code patches. Evals here are imported from the SWE-bench dataset and converted to AgentV YAML format.
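To make the three parts above concrete, here is a hypothetical sketch of what one imported eval file might contain. The field names (`id`, `workspace`, `prompt`, `assertions`) and the image-name pattern are illustrative assumptions, not the actual AgentV schema:

```yaml
# Hypothetical sketch — field names are illustrative, not the actual AgentV schema.
id: django-15180
workspace:
  # Official SWE-bench evaluation image for this instance (name pattern assumed).
  docker_image: swebench/sweb.eval.x86_64.django__django-15180
prompt: |
  <problem statement text taken from the original GitHub issue>
assertions:
  # Verify the fix by running the project's own test suite inside the workspace.
  - run: <project test command>
    expect: pass
```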
```
.agentv/                  # AgentV project configuration
  config.yaml             # Studio threshold and settings
  targets.yaml            # Model provider targets
evals/
  swebench-verified/      # SWE-bench Verified examples (curated, human-validated)
  swebench-lite/          # SWE-bench Lite examples (smaller subset)
scripts/
  import-swebench.py      # Import script to pull from HuggingFace datasets
```
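The core of an import script like `import-swebench.py` is converting each SWE-bench record into an eval definition. Below is a minimal sketch of that conversion step. The SWE-bench field names (`instance_id`, `problem_statement`, `FAIL_TO_PASS`) come from the published HuggingFace dataset; the AgentV-side keys and the Docker image name pattern are assumptions for illustration:

```python
# Sketch of the record -> eval conversion an import script might perform.
# SWE-bench field names are from the published dataset; the output keys
# are illustrative assumptions, not the actual AgentV schema.
import json

def to_eval(record: dict) -> dict:
    """Convert one SWE-bench record into an AgentV-style eval dict (schema assumed)."""
    return {
        "id": record["instance_id"],
        "workspace": {
            # Official per-instance SWE-bench eval images (name pattern assumed).
            "docker_image": f"swebench/sweb.eval.x86_64.{record['instance_id']}",
        },
        "prompt": record["problem_statement"],
        # FAIL_TO_PASS is a JSON-encoded list of tests the fix must make pass.
        "assert_tests": json.loads(record["FAIL_TO_PASS"]),
    }

sample = {
    "instance_id": "django__django-15180",
    "problem_statement": "example issue text",
    "FAIL_TO_PASS": '["tests/urlpatterns/test_resolvers.py::ResolverTests"]',
}
print(to_eval(sample)["id"])
```

In the real script, the records would come from `datasets.load_dataset(...)` and each resulting dict would be serialized to a `*.EVAL.yaml` file under `evals/`.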
Prerequisites:
- AgentV installed
- Docker available (SWE-bench images will be pulled automatically)
Run a single eval:
```sh
agentv eval evals/swebench-verified/django-15180.EVAL.yaml
```

Run all SWE-bench Verified evals:

```sh
agentv eval evals/swebench-verified/
```

Run with a specific target:

```sh
agentv eval evals/swebench-verified/ --target claude-opus
```

Results are stored in `.agentv/results/` (git-ignored). Use `agentv studio` to view and compare results across targets:

```sh
agentv studio
```

The default pass threshold is set to 0.8 in `.agentv/config.yaml`.
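For reference, the threshold setting might look something like the fragment below; the key names are assumptions, so check the generated `.agentv/config.yaml` for the actual schema:

```yaml
# Hypothetical sketch of .agentv/config.yaml — key names are assumptions.
studio:
  pass_threshold: 0.8
```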