Adapter scaffold for running AgentPlane-backed Codex execution against BitGN benchmarks.
AgentPlane is not submitted as a model. It is used as a control-plane profile around an executor:
- benchmark runtime: BitGN PCM or ECOM
- executor: Codex CLI
- control layer: policy, step loop, proof bundle, score-detail capture
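The layering above can be pictured as a bounded loop in which every command the executor requests passes a policy check before it reaches the benchmark runtime, and every step is recorded. The following is a minimal sketch of that idea, not the adapter's actual code: `run_trial`, `StepRecord`, `Proof`, the callback names, and the status strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical names throughout; nothing here is taken from the adapter source.


@dataclass
class StepRecord:
    command: dict      # JSON tool command requested by the executor
    allowed: bool      # result of the policy check
    observation: str   # what the runtime (or the policy) returned


@dataclass
class Proof:
    steps: list[StepRecord] = field(default_factory=list)
    final_status: str = "incomplete"


def run_trial(
    next_command: Callable[[str], Optional[dict]],
    run_in_runtime: Callable[[dict], str],
    is_allowed: Callable[[dict], bool],
    max_steps: int = 20,
) -> Proof:
    """Bounded step loop: ask the executor, check policy, run, record."""
    proof = Proof()
    observation = ""
    for _ in range(max_steps):
        command = next_command(observation)
        if command is None:  # executor signals that it is finished
            proof.final_status = "done"
            return proof
        allowed = is_allowed(command)
        # A refused command never reaches the runtime, but it is still recorded.
        observation = run_in_runtime(command) if allowed else "refused by policy"
        proof.steps.append(StepRecord(command, allowed, observation))
    proof.final_status = "step_limit"
    return proof
```

Passing the executor, runtime, and policy in as plain callables keeps the loop testable without Codex or a BitGN connection.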
Experimental. Current proven coverage is deliberately narrow:
- bitgn/sandbox t01: pass, score 1.00.
- bitgn/pac1-dev t01: pass, score 1.00.
- bitgn/ecom1-dev t01: pass, score 1.00.
All non-t01 PAC1 and ECOM1 tasks are not passing in current evidence and must
be treated as failing/unsupported until a live run proves otherwise. This repo
is not leaderboard-ready.
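The "failing until proven otherwise" rule can be expressed as a lookup that defaults to unsupported. This is only an illustrative sketch: the `PROVEN` table and `coverage_status` are hypothetical names, and the entries simply restate the three results listed above.

```python
# Hypothetical structure; the entries mirror the proven coverage stated above.
PROVEN = {
    ("bitgn/sandbox", "t01"): 1.00,
    ("bitgn/pac1-dev", "t01"): 1.00,
    ("bitgn/ecom1-dev", "t01"): 1.00,
}


def coverage_status(benchmark_id: str, task_id: str) -> str:
    """Return 'pass' only for tasks with recorded evidence; default to 'unsupported'."""
    return "pass" if (benchmark_id, task_id) in PROVEN else "unsupported"
```

Under this sketch, any task without a recorded live run reports as unsupported, which is the conservative reading of the paragraph above.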
BitGN evaluates observable agent behavior: runtime tool calls, files, task state, side effects, outcome codes, compliance, and security posture. That is the same surface where AgentPlane can add value: bounded policy, traceability, explicit outcomes, and failure evidence.
The near-term goal is not "AgentPlane beats everyone". The useful public claim is narrower:
AgentPlane can wrap a strong executor, preserve BitGN benchmark validity, and produce auditable evidence for why trials passed or failed.
make sync
Install BitGN SDK dependencies from the same Buf registry used by the upstream samples:
make sync-bitgn
The SDK currently tracks Python 3.14 in the sample agents, so the Make targets create a Python 3.14 uv environment.
Codex can use ChatGPT subscription auth:
codex login
codex login status
That path is useful for local smoke runs because the adapter invokes codex exec. For reproducible public runs, API-key auth is still cleaner because it is
easier to document and recreate in CI or on another machine.
BitGN official runs still need:
export BITGN_API_KEY="..."
cp .env.example .env.local
$EDITOR .env.local
make oauth
make sandbox
scripts/bitgn_smoke.sh loads .env and then .env.local; keep secrets in
one of those ignored files, not in committed config.
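To illustrate the load order, here is a rough Python equivalent of layered env-file loading. It assumes that values in .env.local override those in .env, which is how sourcing the files in that order would behave in a shell; the real script may differ. `parse_env_file` and `load_layered_env` are hypothetical helpers, not part of the repo.

```python
from pathlib import Path


def parse_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=value lines; skip blanks, comments, and malformed lines."""
    values: dict[str, str] = {}
    if not path.exists():
        return values
    for raw in path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        # Strip one layer of surrounding quotes, as a shell would.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_layered_env(root: Path) -> dict[str, str]:
    """Load .env first, then .env.local; later values win (an assumption)."""
    merged = parse_env_file(root / ".env")
    merged.update(parse_env_file(root / ".env.local"))
    return merged
```

Missing files are treated as empty, so a checkout with only .env.local still loads.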
Sandbox is the first end-to-end check because it does not require a BitGN Platform key. PAC1 is the next check:
make pac1
For ECOM1, set:
BENCHMARK_ID=bitgn/ecom1-dev
BITGN_RUNTIME=ecom
Then run a single task:
make ecom
Each trial writes:
.agentplane-bitgn/<benchmark-id>/<runtime>/<task-id>/<trial-id>/
  AGENTS.md
  proof.json
The proof bundle captures:
- benchmark id and runtime
- model id
- task id and trial id
- each JSON tool command requested by Codex
- runtime observations, truncated for readability
- final status
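As a rough illustration of the fields listed above, the sketch below models a proof bundle and writes it to the per-trial directory. The actual schema of proof.json is not documented here, so the field names, the `ProofBundle` class, the helper functions, and the truncation limit are all assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path

MAX_OBSERVATION_CHARS = 2000  # assumed limit; the real truncation rule is not specified


@dataclass
class ProofBundle:
    # Hypothetical field names mirroring the list above.
    benchmark_id: str
    runtime: str
    model_id: str
    task_id: str
    trial_id: str
    commands: list[dict] = field(default_factory=list)
    observations: list[str] = field(default_factory=list)
    final_status: str = "unknown"

    def record(self, command: dict, observation: str) -> None:
        """Store a requested tool command and its truncated observation."""
        self.commands.append(command)
        self.observations.append(observation[:MAX_OBSERVATION_CHARS])


def trial_dir(root: Path, bundle: ProofBundle) -> Path:
    # A benchmark id such as "bitgn/sandbox" contains a slash, so this sketch
    # ends up with a nested directory; the adapter may sanitize ids differently.
    return (root / ".agentplane-bitgn" / bundle.benchmark_id
            / bundle.runtime / bundle.task_id / bundle.trial_id)


def write_proof(root: Path, bundle: ProofBundle) -> Path:
    """Serialize the bundle as proof.json inside the trial directory."""
    out = trial_dir(root, bundle)
    out.mkdir(parents=True, exist_ok=True)
    path = out / "proof.json"
    path.write_text(json.dumps(asdict(bundle), indent=2))
    return path
```

Keeping one JSON file per trial means a failed run can be inspected without re-running the benchmark.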
Related documentation:
- Runbook
- Test strategy
- Coverage matrix
- Leaderboard plan
- Evidence report template
- Cost notes
- OAuth notes
- Smoke results
PAC1 live already has multiple 104/104 runs. A naive scaffold is unlikely to stand out there. The best AgentPlane path is:
- Use PAC1 DEV to harden outcome selection, grounding refs, structured writes, and injection refusal.
- Mine score_detail into regression cases.
- Move to ECOM1, where policy books, payment state, SQL, fraud controls, and audit trails are closer to AgentPlane's control-plane strengths.
- Publish a proof-backed run rather than only a score screenshot.
Do not:
- fetch benchmark solutions from the internet;
- inspect hidden graders or oracle solutions;
- alter BitGN scoring, task sets, or runtime contracts;
- inject task-specific hints into the adapter policy;
- claim leaderboard readiness without a reproducible run id and proof bundle.