tingly-dev/programbench-runtime

Custom ProgramBench Runtime

This directory is a custom runtime layer around ProgramBench.

Its purpose is:

  • run a task with your own agent
  • produce a submission.tar.gz in the format expected by ProgramBench
  • evaluate that submission with the installed programbench CLI

Prerequisites

This README assumes:

  • Docker is available locally
  • programbench is already installed locally
  • you normally run it inside your own conda environment
  • programbench was installed with uv

Run the commands below inside that environment, for example:

conda activate swe
cd custom-programbench-runtime

Confirm that the CLI is available:

programbench --help
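
The same check can be scripted. The sketch below is only an illustration and is not part of this repository: `check_prerequisites` is a hypothetical helper that looks up `docker` and `programbench` on the current `PATH`, which is a reasonable first check before running any of the scripts.

```python
import shutil


def check_prerequisites(tools=("docker", "programbench")):
    """Map each tool name to its resolved path, or None if it is not on PATH.

    Hypothetical helper for illustration; the repository's scripts may
    perform their own checks.
    """
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in check_prerequisites().items():
        print(f"{tool}: {path or 'NOT FOUND'}")
```

A `NOT FOUND` entry for `programbench` usually means the wrong environment is active.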

Layout

Core files:

  • agent-image/Dockerfile
  • agent-image/agent_runner.py
  • run_agent.py
  • runtime_utils.py
  • scripts/common.sh
  • scripts/01_prepare_env.sh
  • scripts/02_run_problem.sh
  • scripts/03_eval.sh
  • scripts/debug_problem.sh

User settings and task prompts live in separate files:

  • user settings: my-cc-config/settings.json
  • task prompt: instances/chirlu__sox.42b3557/prompt.md

Outputs are written to:

  • runs/my-agent/<instance_id>/submission.tar.gz
  • runs/my-agent/<instance_id>/agent-output/
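
As a quick reference, the output locations can be derived from the instance ID. This is a minimal sketch: `run_paths` and its parameters are hypothetical names, and the defaults simply mirror the paths shown in this README (including the evaluation report path described in the Evaluate step).

```python
from pathlib import Path


def run_paths(instance_id, agent_name="my-agent", runs_root="runs"):
    """Return the output locations this README describes for one instance.

    Hypothetical helper; the defaults mirror the paths listed above.
    """
    base = Path(runs_root) / agent_name / instance_id
    return {
        "submission": base / "submission.tar.gz",
        "agent_output": base / "agent-output",
        # Written by the evaluation step.
        "eval_report": base / f"{instance_id}.eval.json",
    }


if __name__ == "__main__":
    for name, path in run_paths("chirlu__sox.42b3557").items():
        print(f"{name}: {path}")
```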

How To Run

The example below uses chirlu__sox.42b3557.

  1. Prepare settings.json
mkdir -p my-cc-config
cp config/settings.example.json my-cc-config/settings.json

Then edit:

my-cc-config/settings.json

  2. Prepare the environment
cd custom-programbench-runtime
bash scripts/01_prepare_env.sh chirlu__sox.42b3557

This does three things:

  • pulls the upstream cleanroom image: programbench/chirlu_1776_sox.42b3557:task_cleanroom
  • pulls the upstream eval image: programbench/chirlu_1776_sox.42b3557:task
  • builds the local agent image: local-programbench/chirlu_1776_sox.42b3557:agent

  3. Run the problem
cd custom-programbench-runtime
bash scripts/02_run_problem.sh chirlu__sox.42b3557

This will:

  • start local-programbench/chirlu_1776_sox.42b3557:agent
  • mount settings.json into the container
  • mount the task prompt into the container
  • run Claude Code inside the container
  • collect the contents of ./out
  • package them into the ProgramBench submission layout

Outputs:

runs/my-agent/chirlu__sox.42b3557/submission.tar.gz
runs/my-agent/chirlu__sox.42b3557/agent-output/
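
To illustrate the packaging step, here is a minimal sketch using Python's standard `tarfile` module. It assumes the archive simply holds the contents of `./out` at its top level; the actual layout produced by this repository's scripts may differ, and `package_out_dir` is a hypothetical name, not a function from `runtime_utils.py`.

```python
import tarfile
from pathlib import Path


def package_out_dir(out_dir, archive_path):
    """Pack the contents of out_dir into a gzip-compressed tarball.

    Assumption: entries are stored at the archive's top level (no leading
    "out/" directory). The real submission layout may differ.
    """
    out_dir = Path(out_dir)
    archive_path = Path(archive_path)
    archive_path.parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "w:gz") as tar:
        # Sort entries so the archive order is deterministic.
        for entry in sorted(out_dir.iterdir()):
            tar.add(entry, arcname=entry.name)
    return archive_path
```

Listing the resulting archive with `tar -tzf` is a quick way to compare it against what `scripts/02_run_problem.sh` actually produces.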

  4. Evaluate
cd custom-programbench-runtime
bash scripts/03_eval.sh chirlu__sox.42b3557

This calls the installed programbench eval command directly.

The script always sets:

  • PROGRAMBENCH_DOCKER_ORG=programbench

If local blobs are available, it also sets:

  • PROGRAMBENCH_BLOB_DIR=<blob root>

If local blobs are not available, it leaves PROGRAMBENCH_BLOB_DIR unset, so ProgramBench falls back to its default cache / download behavior.
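
The environment logic described above can be sketched as follows. This is an illustration, not the contents of `scripts/03_eval.sh`: `build_eval_env` is a hypothetical helper, and "local blobs are available" is approximated here as "the given directory exists".

```python
import os
from pathlib import Path


def build_eval_env(blob_root=None, base_env=None):
    """Build the environment for an evaluation run.

    Hypothetical helper. Assumption: blobs count as "available" when
    blob_root is an existing directory.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Always set, as described above.
    env["PROGRAMBENCH_DOCKER_ORG"] = "programbench"
    if blob_root is not None and Path(blob_root).is_dir():
        env["PROGRAMBENCH_BLOB_DIR"] = str(blob_root)
    # Otherwise the variable is not set here, so ProgramBench falls back
    # to its default cache / download behavior.
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when invoking the CLI from Python.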

Evaluation output is written to:

runs/my-agent/chirlu__sox.42b3557/chirlu__sox.42b3557.eval.json

Blob Directory Layout

If you want to provide local blobs, the directory must look like this:

<blob_root>/
  chirlu__sox.42b3557/
    tests/
      86f3178a4b28.tar.gz
      269b33cf14ef.tar.gz

Important:

  • PROGRAMBENCH_BLOB_DIR must point to <blob_root>
  • it must not point directly to chirlu__sox.42b3557
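
A small check can catch the common mistake of pointing at the instance directory instead of the blob root. The sketch below is illustrative only; `find_test_blobs` is a hypothetical helper that looks for `<blob_root>/<instance_id>/tests/*.tar.gz`, matching the layout shown above.

```python
from pathlib import Path


def find_test_blobs(blob_root, instance_id):
    """Return the sorted names of test archives found for one instance.

    Hypothetical helper. An empty list suggests blob_root is wrong, for
    example because it points at the instance directory itself.
    """
    tests_dir = Path(blob_root) / instance_id / "tests"
    if not tests_dir.is_dir():
        return []
    return sorted(p.name for p in tests_dir.glob("*.tar.gz"))
```

If this returns an empty list for a directory you believe is correct, check whether the path already ends in the instance ID.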

Debugging The Agent

To open an interactive shell inside the agent container:

cd custom-programbench-runtime
bash scripts/debug_problem.sh chirlu__sox.42b3557

The container starts in /workspace.

From there, you can run the agent by hand:

python3 /agent/agent_runner.py \
  --instance-id chirlu__sox.42b3557 \
  --out-dir /workspace/out \
  --settings-file /settings.json \
  --prompt-file /prompt.md

How It Works

There are three layers:

  1. Upstream task_cleanroom image
    This is the cleanroom environment for the task. We build our own :agent image on top of it and install Claude Code there.

  2. Local :agent image
    Claude Code runs here. It reads the task prompt and writes its output to ./out.

  3. Upstream task image
    This is the official ProgramBench evaluation image. programbench eval uses it to unpack submission.tar.gz, run compile.sh, and execute tests.

The flow is:

task_cleanroom
  -> build local :agent
  -> Claude Code writes ./out
  -> host packages ./out as submission.tar.gz
  -> programbench eval
  -> upstream :task image runs the official tests

This keeps the boundary clean:

  • you can fully customize the agent runtime
  • the submission format and evaluation semantics stay aligned with ProgramBench

Current sox Images

  • programbench/chirlu_1776_sox.42b3557:task_cleanroom
  • programbench/chirlu_1776_sox.42b3557:task
  • local-programbench/chirlu_1776_sox.42b3557:agent
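
The image names above appear to follow from the instance ID by replacing the `__` separator with `_1776_`. The sketch below encodes that pattern. The rule is inferred from this single sox example rather than taken from ProgramBench documentation, so treat it as an assumption; `image_names` is a hypothetical helper.

```python
def image_names(instance_id, separator="_1776_"):
    """Derive the three image names used for one instance.

    Assumption: the repository part of the image name is the instance ID
    with its first "__" replaced by "_1776_", as seen in the sox example.
    """
    repo = instance_id.replace("__", separator, 1)
    return {
        "cleanroom": f"programbench/{repo}:task_cleanroom",
        "eval": f"programbench/{repo}:task",
        "agent": f"local-programbench/{repo}:agent",
    }


if __name__ == "__main__":
    for role, image in image_names("chirlu__sox.42b3557").items():
        print(f"{role}: {image}")
```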
