This directory is a custom runtime layer around ProgramBench.
Its purpose is:
- run a task with your own agent
- produce a submission.tar.gz in the format expected by ProgramBench
- evaluate that submission with the installed programbench CLI
Prerequisites
This README assumes:
- Docker is available locally
- programbench is already installed locally
- you normally run it inside your own conda environment
- programbench was installed with uv
Run the commands below inside that environment, for example:
conda activate swe
cd custom-programbench-runtime
Confirm that the CLI is available:
programbench --help
Layout
Core files:
- agent-image/Dockerfile
- agent-image/agent_runner.py
- run_agent.py
- runtime_utils.py
- scripts/common.sh
- scripts/01_prepare_env.sh
- scripts/02_run_problem.sh
- scripts/03_eval.sh
- scripts/debug_problem.sh
User settings and task prompts are separated:
- user settings: my-cc-config/settings.json
- task prompt: instances/chirlu__sox.42b3557/prompt.md
Outputs are written to:
- runs/my-agent/<instance_id>/submission.tar.gz
- runs/my-agent/<instance_id>/agent-output/
How To Run
The example below uses chirlu__sox.42b3557.
- Prepare settings.json
mkdir -p my-cc-config
cp config/settings.example.json my-cc-config/settings.json
Then edit:
my-cc-config/settings.json
- Prepare the environment
cd custom-programbench-runtime
bash scripts/01_prepare_env.sh chirlu__sox.42b3557
This does three things:
  - pulls the upstream cleanroom image: programbench/chirlu_1776_sox.42b3557:task_cleanroom
  - pulls the upstream eval image: programbench/chirlu_1776_sox.42b3557:task
  - builds the local agent image: local-programbench/chirlu_1776_sox.42b3557:agent
- Run the problem
cd custom-programbench-runtime
bash scripts/02_run_problem.sh chirlu__sox.42b3557
This will:
  - start local-programbench/chirlu_1776_sox.42b3557:agent
  - mount settings.json into the container
  - mount the task prompt into the container
  - run Claude Code inside the container
  - collect the contents of ./out
  - package them into the ProgramBench submission layout
Outputs:
runs/my-agent/chirlu__sox.42b3557/submission.tar.gz
runs/my-agent/chirlu__sox.42b3557/agent-output/
- Evaluate
cd custom-programbench-runtime
bash scripts/03_eval.sh chirlu__sox.42b3557
This directly calls the installed programbench eval.
The script always sets:
PROGRAMBENCH_DOCKER_ORG=programbench
If local blobs are available, it also sets:
PROGRAMBENCH_BLOB_DIR=<blob root>
If local blobs are not available, it leaves PROGRAMBENCH_BLOB_DIR unset, so
ProgramBench falls back to its default cache / download behavior.
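The environment rules above can be sketched in Python. This is a minimal illustration, not the contents of scripts/03_eval.sh: the function name build_eval_env and the test for "blobs are available" (the directory exists) are assumptions made for this example.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def build_eval_env(base_env: Mapping[str, str],
                   blob_root: Optional[Path]) -> dict[str, str]:
    """Return a copy of base_env with the ProgramBench variables applied.

    Hypothetical helper: it mirrors the rules described in this README,
    not the actual shell script.
    """
    env = dict(base_env)
    # Always set, per the README.
    env["PROGRAMBENCH_DOCKER_ORG"] = "programbench"
    # Assumption: "local blobs are available" means the directory exists.
    if blob_root is not None and blob_root.is_dir():
        env["PROGRAMBENCH_BLOB_DIR"] = str(blob_root)
    # Otherwise nothing is added, so ProgramBench uses its default behavior.
    return env


if __name__ == "__main__":
    env = build_eval_env(os.environ, None)
    print(env["PROGRAMBENCH_DOCKER_ORG"])
```

The returned mapping could be passed as the env argument of subprocess.run when invoking the CLI.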
Evaluation output is written to:
runs/my-agent/chirlu__sox.42b3557/chirlu__sox.42b3557.eval.json
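For scripting around these outputs, the paths named in this README can be derived from the instance id. The helper name run_paths and its runs_root parameter are hypothetical; only the path patterns come from the text above.

```python
from pathlib import Path


def run_paths(instance_id: str,
              runs_root: Path = Path("runs/my-agent")) -> dict[str, Path]:
    """Hypothetical helper: output locations for one instance."""
    run_dir = runs_root / instance_id
    return {
        "submission": run_dir / "submission.tar.gz",
        "agent_output": run_dir / "agent-output",
        # The eval file name repeats the instance id, as shown above.
        "eval_json": run_dir / f"{instance_id}.eval.json",
    }


if __name__ == "__main__":
    for name, path in run_paths("chirlu__sox.42b3557").items():
        print(f"{name}: {path}")
```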
Blob Directory Layout
If you want to provide local blobs, the directory must look like this:
<blob_root>/
  chirlu__sox.42b3557/
    tests/
      86f3178a4b28.tar.gz
      269b33cf14ef.tar.gz
Important:
- PROGRAMBENCH_BLOB_DIR must point to <blob_root>
- it must not point directly to chirlu__sox.42b3557
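A quick check before running the evaluation can catch the most common mistake, pointing at the instance directory instead of its parent. The sketch below assumes the layout is <blob_root>/<instance_id>/tests/*.tar.gz as described above. check_blob_root is a hypothetical helper, not part of ProgramBench.

```python
from pathlib import Path


def check_blob_root(blob_root: Path, instance_id: str) -> list[str]:
    """Return a list of problems found; an empty list means the layout looks usable."""
    problems: list[str] = []
    if blob_root.name == instance_id:
        problems.append(
            "blob root points at the instance directory; use its parent instead"
        )
    tests_dir = blob_root / instance_id / "tests"
    if not tests_dir.is_dir():
        problems.append(f"missing directory: {tests_dir}")
    elif not any(tests_dir.glob("*.tar.gz")):
        problems.append(f"no .tar.gz blobs found in {tests_dir}")
    return problems


if __name__ == "__main__":
    for problem in check_blob_root(Path("blobs"), "chirlu__sox.42b3557"):
        print("warning:", problem)
```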
Debugging The Agent
To open an interactive shell inside the agent container:
cd custom-programbench-runtime
bash scripts/debug_problem.sh chirlu__sox.42b3557
The container starts in /workspace.
You can then run:
python3 /agent/agent_runner.py \
  --instance-id chirlu__sox.42b3557 \
  --out-dir /workspace/out \
  --settings-file /settings.json \
  --prompt-file /prompt.md
How It Works
There are three layers:
- Upstream task_cleanroom image
  This is the cleanroom environment for the task. We build our own :agent image on top of it and install Claude Code there.
- Local :agent image
  Claude Code runs here. It reads the task prompt and writes its output to ./out.
- Upstream task image
  This is the official ProgramBench evaluation image. programbench eval uses it to unpack submission.tar.gz, run compile.sh, and execute tests.
The flow is:
task_cleanroom
-> build local :agent
-> Claude Code writes ./out
-> host packages ./out as submission.tar.gz
-> programbench eval
-> upstream :task image runs the official tests
This keeps the boundary clean:
- you can fully customize the agent runtime
- the submission format and evaluation semantics still stay aligned with ProgramBench
Current sox Images
- programbench/chirlu_1776_sox.42b3557:task_cleanroom
- programbench/chirlu_1776_sox.42b3557:task
- local-programbench/chirlu_1776_sox.42b3557:agent
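The three image names appear to follow a pattern derived from the instance id. The sketch below assumes that "__" in the instance id becomes "_1776_" in the image repository name. That rule is inferred only from the sox example in this README and may not hold for every instance. image_names is a hypothetical helper.

```python
def image_names(instance_id: str) -> dict[str, str]:
    """Hypothetical helper: derive the three image names for an instance."""
    # Assumption: "__" maps to "_1776_", as in chirlu__sox -> chirlu_1776_sox.
    repo = instance_id.replace("__", "_1776_")
    return {
        "cleanroom": f"programbench/{repo}:task_cleanroom",
        "eval": f"programbench/{repo}:task",
        "agent": f"local-programbench/{repo}:agent",
    }


if __name__ == "__main__":
    for role, name in image_names("chirlu__sox.42b3557").items():
        print(f"{role}: {name}")
```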