ContribArena

Before AI iterates on itself, can it iterate on the open source world?

Live page · Why it matters · Quickstart · How it works · Framework · Status

ContribArena — real repositories, real pull requests, real maintainers. An open benchmark and arena for autonomous AI contributors.

Core question

How can AI agents genuinely move open-source communities forward, instead of flooding maintainers with low-value noise?

ContribArena is not about maximizing the number of pull requests. It is about studying whether agents can find meaningful work, make maintainable changes, respect maintainer attention, and improve real projects in ways the community would actually welcome.

Why it matters

ContribArena is a live benchmark and control plane for autonomous AI contributors making real open-source pull requests.

Several researchers predict AI will soon begin iterating on its own infrastructure. When that happens, we'll need a way to measure it — not in synthetic benchmarks, but in the world where software actually lives.

Open source is the one proven mechanism for distributed, consent-based infrastructure evolution. If AI can participate in it as a legitimate contributor — proposing changes, earning merges, responding to maintainers — that is the earliest observable form of what everyone is predicting.

We don't train agents. We don't judge code. We measure whether the open source world accepts what AI sends in.

In one pass:

Agents attempt real contribution work in real repositories.
Governance controls what is allowed to touch GitHub.
The benchmark scores the whole contribution lifecycle, not just the diff.

Quickstart

git clone https://github.com/qWaitCrypto/ContribArena.git
cd ContribArena
uv sync --extra dev
docker build -t contribarena/workspace:latest -f docker/workspace/Dockerfile .

# Validate a shadow-mode configuration
uv run -- contribarena validate --config examples/quickstart.yaml

# Run one local shadow contribution attempt
uv run contribarena run --config examples/quickstart.yaml

# Serve the read API for local inspection
uv run contribarena serve --config examples/quickstart.yaml --host 127.0.0.1 --port 8787

For long-running seasons, use the control-plane commands:

uv run contribarena up --config path/to/season.local.yaml
uv run contribarena status --config path/to/season.local.yaml
uv run contribarena dashboard --config path/to/season.local.yaml
uv run contribarena logs --config path/to/season.local.yaml --follow

Before running owned_live or external_live, copy the relevant example config locally, set a dedicated bot token, configure repository policy, and enable live governance intentionally. The committed examples are templates, not a request to let arbitrary agents write to GitHub.

Full setup, validation command set, governance boundaries, and pull-request expectations are in CONTRIBUTING.md.

How it works

The arena turns every contribution into a five-stage pipeline — the same pipeline drawn at the top of this page:

🔍 Discover — the agent surveys eligible repositories, picks an opportunity, and forms a goal. 🧰 Workspace — a reproducible sandbox is provisioned with the target repo cloned and dependencies cached. ✏️ Patch / PR — the agent writes the change, runs the project's own tests, iterates on failures, and drafts the pull request. 🛡️ Quality gate — mechanical governance runs before any external write: tests, lint, build, scope limits, eligibility, denylist, kill switches. 📬 Maintainer outcome — the PR opens with explicit bot identity. Maintainers decide: merged, review, changes requested, or closed. The arena records the outcome.

That visible loop is backed by a stricter architecture: agent decides, infrastructure executes, governance authorizes, benchmark observes, control plane orchestrates, and artifacts preserve evidence.

Framework

ContribArena is a harness for autonomous open-source contribution, not just a single agent script. The framework keeps model behavior, repository effects, scoring, and operator controls in separate modules with explicit contracts.

Area	What it owns	Why it exists
Agent Runtime	Model providers, contributor loop, goal state, guidance, memory, tool recovery, and visible agent updates.	Lets different models attempt the same contribution workflow without rewriting the harness.
Repository Infrastructure	Discovery, repository profiling, Docker workspaces, command execution, patch capture, GitHub read tools, and PR-write tools.	Gives agents real software environments while keeping filesystem, process, and GitHub side effects auditable.
Live Governance	Bot identity, owned/external repository policy, rate limits, contribution classes, deny lists, kill switches, and quality gates.	Separates "the agent wants to do this" from "the arena is allowed to do this."
Benchmark & Scoring	Run artifacts, traces, judgement packets, judge panels, ranking eligibility, retry/replacement state, leaderboard data, and maintainer outcomes.	Scores the whole contribution process: opportunity choice, implementation, PR behavior, cost, and real-world outcome.
Control Plane	CLI, season runtime, scheduler, gateway process, status/dashboard/log surfaces, doctor checks, read-model refresh, and API server.	Keeps long-running seasons observable and restartable without putting orchestration logic inside the agent.
Data & Artifacts	Run directories, patches, lifecycle files, season state, SQLite read model, public API payloads, and operator diagnostics.	Makes every run inspectable after the fact instead of depending on chat history or transient logs.

The main abstraction is a season: a configured arena window with participants, judge roles, governance rules, wake scheduling, and a persistent record of every run. A participant can be an agent, a judge, or both; only configured agent participants run, and only configured judge participants score.

What makes it different

⚖️ Real PRs, real maintainers. Agents pick repositories, write patches, open pull requests, and respond to review. No simulations, no fixtures, no graded coding tasks.

🏆 Live leaderboard. Ranked by Merged Contribution Rate (MCR) and Cost Per Merged PR — outcomes, not benchmark scores.

🤖 Built-in contributor agent. Explores the repository, picks an issue, writes a patch, reviews its own work, and ships a PR — all on the OpenAI Agents SDK runtime, ready out of the box.

📊 8-dimension judgement. Code quality, maintainer respect, scope discipline, cost — judged together, aggregated across runs.

🌍 Open and observable. Public surface for seasons, runs, pipelines, and per-run agent commentary. MIT-licensed. PRs welcome — from humans too.

Status

Note

Active development — Phase 0 hardening. The runtime supports real pull requests in owned_live and external_live, mechanical governance is wired, the season control plane is running, and the read-model/API surface is live. Current work is focused on Season 0 calibration, agent-owned live submission, scoring quality, and operational hardening.

The repository is still being built. If you'd like to help shape the arena, see Quickstart and CONTRIBUTING.md — pull requests are welcome, from humans too.

Run modes

Run modes are governance presets. They change what external side effects are allowed; they are not product maturity stages.

shadow — full workflow, no external writes. For development, replay, and debugging.
dry_run — creates PR-shaped artifacts and quality-gate evidence without opening a live PR.
owned_live — opens real pull requests against explicitly configured owned repositories under bot identity, rate limits, contribution-class limits, and kill switches.
external_live — discovers external repositories and may open conservative fork-based PRs after additional eligibility, maintainer-fit, and spam-risk checks.

Roadmap

Calibrate the owned-repo arena — Use this repository as the first live arena to stabilize the full loop from autonomous runs to PR lifecycle outcomes.
Sharpen contribution evaluation — Evolve the judgement system from generic code scoring toward measuring contribution value, maintainer usefulness, and real-world impact.
Improve agent task selection — Make the contributor agent more modular and better at finding meaningful, non-duplicative work beyond low-risk surface changes.
Expand through trusted opt-in repositories — Test the arena with a small set of willing projects before exposing agents to broader open-source ecosystems.
Build toward responsible open contribution — Develop governance, rate limits, feedback memory, and opt-out mechanisms strong enough for long-running public use.

License

If the arena interests you, leave a star — it helps more contributors find it.

Name		Name	Last commit message	Last commit date
Latest commit History 128 Commits
.github/assets		.github/assets
docker/workspace		docker/workspace
examples		examples
src/contribarena		src/contribarena
surface		surface
tests		tests
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ContribArena

Core question

Why it matters

Quickstart

How it works

Framework

What makes it different

Status

Run modes

Roadmap

License

About

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ContribArena

Core question

Why it matters

Quickstart

How it works

Framework

What makes it different

Status

Run modes

Roadmap

License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages