Custom ProgramBench Runtime

This directory is a custom runtime layer around ProgramBench.

Its purpose is:

run a task with your own agent
produce a submission.tar.gz in the format expected by ProgramBench
evaluate that submission with the installed programbench CLI

Prerequisites

This README assumes:

Docker is available locally
programbench is already installed locally
you run programbench inside your own conda environment
programbench was installed with uv

Run the commands below inside that environment, for example:

conda activate swe
cd custom-programbench-runtime

Confirm that the CLI is available:

programbench --help

Layout

Core files:

agent-image/Dockerfile
agent-image/agent_runner.py
run_agent.py
runtime_utils.py
scripts/common.sh
scripts/01_prepare_env.sh
scripts/02_run_problem.sh
scripts/03_eval.sh
scripts/debug_problem.sh

User settings and task prompts are separated:

user settings: my-cc-config/settings.json
task prompt: instances/chirlu__sox.42b3557/prompt.md

Outputs are written to:

runs/my-agent/<instance_id>/submission.tar.gz
runs/my-agent/<instance_id>/agent-output/

How To Run

The example below uses chirlu__sox.42b3557.

Prepare settings.json

mkdir -p my-cc-config
cp config/settings.example.json my-cc-config/settings.json

Then edit:

my-cc-config/settings.json

Prepare the environment

cd custom-programbench-runtime
bash scripts/01_prepare_env.sh chirlu__sox.42b3557

This does three things:

pulls the upstream cleanroom image: programbench/chirlu_1776_sox.42b3557:task_cleanroom
pulls the upstream eval image: programbench/chirlu_1776_sox.42b3557:task
builds the local agent image: local-programbench/chirlu_1776_sox.42b3557:agent
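
The three image references above appear to follow one naming pattern. The sketch below derives them from an instance id; the `__` to `_1776_` substitution is inferred from the sox example only, and `image_names` is a hypothetical helper, not a function from this repository.

```python
def image_names(instance_id: str) -> dict[str, str]:
    """Map an instance id to the three image references used in this README."""
    # Assumption: the repository part replaces "__" with "_1776_",
    # as in chirlu__sox.42b3557 -> chirlu_1776_sox.42b3557.
    repo = instance_id.replace("__", "_1776_")
    return {
        "cleanroom": f"programbench/{repo}:task_cleanroom",    # pulled from upstream
        "eval": f"programbench/{repo}:task",                   # pulled from upstream
        "agent": f"local-programbench/{repo}:agent",           # built locally
    }


if __name__ == "__main__":
    for role, ref in image_names("chirlu__sox.42b3557").items():
        print(f"{role}: {ref}")
```

Comparing this output with `docker images` after running 01_prepare_env.sh is a quick way to confirm that all three images are present.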

Run the problem

cd custom-programbench-runtime
bash scripts/02_run_problem.sh chirlu__sox.42b3557

This will:

start local-programbench/chirlu_1776_sox.42b3557:agent
mount settings.json into the container
mount the task prompt into the container
run Claude Code inside the container
collect the contents of ./out
package them into the ProgramBench submission layout

Outputs:

runs/my-agent/chirlu__sox.42b3557/submission.tar.gz
runs/my-agent/chirlu__sox.42b3557/agent-output/
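
As a rough illustration of the container start and mounts described above, the sketch below builds a `docker run` command line. The container-side paths (/settings.json, /prompt.md, /workspace/out) are taken from the debugging command in this README; the exact flags and mounts used by 02_run_problem.sh are an assumption, and `docker_run_argv` is a hypothetical helper.

```python
from pathlib import Path


def docker_run_argv(instance_id: str, settings: Path, prompt: Path, out_dir: Path) -> list[str]:
    """Build (but do not execute) a docker run command for the agent image."""
    # Assumption: image name follows the pattern seen in the sox example.
    repo = instance_id.replace("__", "_1776_")
    image = f"local-programbench/{repo}:agent"
    return [
        "docker", "run", "--rm",
        # Read-only bind mounts for the user settings and the task prompt.
        "-v", f"{settings.resolve()}:/settings.json:ro",
        "-v", f"{prompt.resolve()}:/prompt.md:ro",
        # Writable bind mount so the agent's ./out is visible on the host.
        "-v", f"{out_dir.resolve()}:/workspace/out",
        "-w", "/workspace",
        image,
        "python3", "/agent/agent_runner.py",
        "--instance-id", instance_id,
        "--out-dir", "/workspace/out",
        "--settings-file", "/settings.json",
        "--prompt-file", "/prompt.md",
    ]


if __name__ == "__main__":
    argv = docker_run_argv(
        "chirlu__sox.42b3557",
        Path("my-cc-config/settings.json"),
        Path("instances/chirlu__sox.42b3557/prompt.md"),
        Path("runs/my-agent/chirlu__sox.42b3557/agent-output"),
    )
    print(" ".join(argv))
```

Printing the command instead of running it makes it easy to compare against what the script actually does.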

Evaluate

cd custom-programbench-runtime
bash scripts/03_eval.sh chirlu__sox.42b3557

This calls the installed programbench eval command directly.

The script always sets:

PROGRAMBENCH_DOCKER_ORG=programbench

If local blobs are available, it also sets:

PROGRAMBENCH_BLOB_DIR=<blob root>

If local blobs are not available, it leaves PROGRAMBENCH_BLOB_DIR unset, so ProgramBench falls back to its default cache / download behavior.

Evaluation output is written to:

runs/my-agent/chirlu__sox.42b3557/chirlu__sox.42b3557.eval.json
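
The environment handling described above can be sketched as follows. How 03_eval.sh decides that local blobs are "available" is not stated here, so the directory check is an assumption, and `eval_env` is a hypothetical helper.

```python
from pathlib import Path
from typing import Mapping, Optional


def eval_env(base: Mapping[str, str], blob_root: Optional[Path]) -> dict[str, str]:
    """Return the environment to use when invoking programbench eval."""
    env = dict(base)
    # Always set, as described in this README.
    env["PROGRAMBENCH_DOCKER_ORG"] = "programbench"
    # Assumption: "available" means the blob root exists as a directory.
    if blob_root is not None and blob_root.is_dir():
        env["PROGRAMBENCH_BLOB_DIR"] = str(blob_root)
    # Otherwise the variable is not set here, so ProgramBench can fall back
    # to its default cache / download behavior.
    return env


if __name__ == "__main__":
    print(eval_env({}, None))
```

Passing the result as `env=` to `subprocess.run` would keep the settings scoped to the eval call rather than the whole shell session.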

Blob Directory Layout

If you want to provide local blobs, the directory must look like this:

<blob_root>/
  chirlu__sox.42b3557/
    tests/
      86f3178a4b28.tar.gz
      269b33cf14ef.tar.gz

Important:

PROGRAMBENCH_BLOB_DIR must point to <blob_root>
it must not point directly to chirlu__sox.42b3557
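
A small check like the one below can catch the common mistake of pointing at the instance directory instead of `<blob_root>`. It is a sketch based only on the layout shown above; `check_blob_root` is a hypothetical helper and not part of programbench.

```python
import tempfile
from pathlib import Path


def check_blob_root(blob_root: Path, instance_id: str) -> list[Path]:
    """Return the test archives for instance_id, or raise if the layout is off."""
    tests_dir = blob_root / instance_id / "tests"
    if not tests_dir.is_dir():
        hint = ""
        # Detect a path that points one level too deep (at the instance directory).
        if blob_root.name == instance_id and (blob_root / "tests").is_dir():
            hint = f" (did you mean {blob_root.parent}?)"
        raise FileNotFoundError(f"expected {tests_dir}{hint}")
    archives = sorted(tests_dir.glob("*.tar.gz"))
    if not archives:
        raise FileNotFoundError(f"no *.tar.gz archives in {tests_dir}")
    return archives


if __name__ == "__main__":
    # Build a throwaway layout to demonstrate the check.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        tests = root / "chirlu__sox.42b3557" / "tests"
        tests.mkdir(parents=True)
        (tests / "86f3178a4b28.tar.gz").touch()
        print([p.name for p in check_blob_root(root, "chirlu__sox.42b3557")])
```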

Debugging The Agent

To open an interactive shell inside the agent container:

cd custom-programbench-runtime
bash scripts/debug_problem.sh chirlu__sox.42b3557

The container starts in /workspace.

You can then run:

python3 /agent/agent_runner.py \
  --instance-id chirlu__sox.42b3557 \
  --out-dir /workspace/out \
  --settings-file /settings.json \
  --prompt-file /prompt.md

How It Works

There are three layers:

Upstream task_cleanroom image: This is the cleanroom environment for the task. We build our own :agent image on top of it and install Claude Code there.
Local :agent image: Claude Code runs here. It reads the task prompt and writes its output to ./out.
Upstream task image: This is the official ProgramBench evaluation image. programbench eval uses it to unpack submission.tar.gz, run compile.sh, and execute tests.

The flow is:

task_cleanroom
  -> build local :agent
  -> Claude Code writes ./out
  -> host packages ./out as submission.tar.gz
  -> programbench eval
  -> upstream :task image runs the official tests

This keeps the boundary clean:

you can fully customize the agent runtime
the submission format and evaluation semantics stay aligned with ProgramBench
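
The host-side packaging step in the flow above could look roughly like this. The archive layout is an assumption: the sketch places the contents of the output directory at the archive root, whereas the layout ProgramBench actually expects is defined by ProgramBench and is not spelled out in this README. `package_out` is a hypothetical helper.

```python
import tarfile
import tempfile
from pathlib import Path


def package_out(out_dir: Path, archive: Path) -> list[str]:
    """Pack the contents of out_dir into a gzip-compressed tar archive."""
    archive.parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "w:gz") as tar:
        for child in sorted(out_dir.iterdir()):
            # Assumption: entries sit at the archive root, not under "out/".
            tar.add(child, arcname=child.name)
    # Re-open the archive and list its members as a basic sanity check.
    with tarfile.open(archive, "r:gz") as tar:
        return tar.getnames()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "out"
        out.mkdir()
        (out / "compile.sh").write_text("#!/bin/sh\n")
        print(package_out(out, Path(tmp) / "runs" / "submission.tar.gz"))
```

Keeping the archive outside the output directory avoids accidentally packing the archive into itself.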

Current sox Images

programbench/chirlu_1776_sox.42b3557:task_cleanroom
programbench/chirlu_1776_sox.42b3557:task
local-programbench/chirlu_1776_sox.42b3557:agent
