Mandatory OpenSandbox runs: containers replace srt, environments become images by pk8189 · Pull Request #22 · portofcontext/agent-voyager-project

pk8189 · 2026-06-04T22:23:44Z

What

Every `avp eval` / `avp run` / `avp env run` now executes the agent inside an OpenSandbox container. srt and all host-machine provisioning are deleted; nothing the agent sees comes from the host. The `--sandbox` flag is gone: a reachable Docker daemon is the one prerequisite (otherwise the CLI exits with code 2 and a crisp diagnostic).

`osb.py`: the CLI manages its own OpenSandbox control plane. Config is generated once under `~/.avp/opensandbox/` (port 18763, minted API key, bind mounts confined to `~/.avp`, bridge networking); the server is spawned detached and reused when healthy. `avp sandbox status|stop` manage it.
Environments are image-first — `{image, packages:{apt,pip}, paths, files, setup, net, resources}`. `packages` + the agent's container recipe compile to a cached derived image (`avp-env:`, `images.py`); `paths`/`files` seed a per-run host workspace bind-mounted at `/avp/workspace`; egress is default-deny + model-provider domains + `net`. Old `runtimes`/`expose` specs get a teaching error.
Agent container recipes — manifests gain an optional `container` block (`{install, command, env}`); built-in pinned recipes for goose (Linux release binary, arch-aware) and claude-code (release wheels + node + claude CLI). Only provider-prefixed env vars (`ANTHROPIC_`, `GOOSE_`, …) cross into the sandbox.
`run_agent` — fresh sandbox per cell, run contract executed inside, trajectory tailed live on the host through the bind mount, `kill()` in finally.
Example: `avp-cli/examples/coding/` — 4 coding katas, claude-code vs goose head-to-head; the hard kata (`hamming-10000`) genuinely separates them.
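
To make the image-first spec shape concrete, here is a minimal sketch of how such a spec could be checked. The key names come from the description above; the validation rules, the error wording, and the function name `validate_env_spec` are assumptions for illustration, not avp's actual code.

```python
# Hypothetical sketch: key names are taken from the PR description, the
# checks and messages are assumptions and not avp's implementation.
ALLOWED_KEYS = {"image", "packages", "paths", "files", "setup", "net", "resources"}
REMOVED_KEYS = {"runtimes", "expose"}


def validate_env_spec(spec: dict) -> dict:
    """Return the spec unchanged if it looks image-first, else raise ValueError."""
    removed = sorted(REMOVED_KEYS & spec.keys())
    if removed:
        # "Teaching error": say what changed and what to write instead.
        raise ValueError(
            f"env spec uses removed key(s) {removed}: environments are now "
            "image-first; declare an 'image' plus 'packages' instead"
        )
    unknown = sorted(spec.keys() - ALLOWED_KEYS)
    if unknown:
        raise ValueError(f"unknown env spec key(s): {unknown}")
    if "image" not in spec:
        # Assumption: a base image is required in the image-first model.
        raise ValueError("env spec needs an 'image'")
    extra = sorted(spec.get("packages", {}).keys() - {"apt", "pip"})
    if extra:
        raise ValueError(f"unknown package manager(s): {extra}")
    return spec


# Example values are illustrative only.
example = {
    "image": "python:3.12-slim",
    "packages": {"apt": ["git"], "pip": ["pytest"]},
    "net": ["pypi.org"],
}
validate_env_spec(example)
```

A spec that still carries `runtimes` or `expose` raises a `ValueError` whose message points at the new shape.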

Verification

`make check` green; `grep -ri srt avp-cli` clean (the conformance harness's own host-side srt wrapper is untouched — follow-up).
New `make test-docker`: 4 real-sandbox seam tests — trajectory streams through the bind mount while the run is live, stderr-tail error reporting, egress deny actually blocks, image build + cache. ~8.5s warm.
Paid in-container verification (Haiku): goose 2 turns/$0.031 and claude-code 3 turns/$0.027 both completed a real file-write task; edits persisted to the host workspace. The srt-era claude-code overlay bug (discarded edits) does not reproduce in-container (needs `IS_SANDBOX=1`, carried in its recipe).
Cold run (server boot + image build) ~4.6s; warm ~2.9s. Docker-off fails fast with exit 2.
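
The Docker-off fast-fail can be pictured as a small preflight. This is a sketch under assumptions: it probes the daemon with `docker info`, and the function names and diagnostic text are hypothetical; only the exit code 2 comes from this PR.

```python
import shutil
import subprocess
import sys


def docker_reachable(timeout: float = 5.0) -> bool:
    """True if a docker binary exists and the daemon answers `docker info`."""
    if shutil.which("docker") is None:
        return False
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def preflight(probe=docker_reachable) -> int:
    """Return the process exit code: 0 when Docker answers, 2 otherwise."""
    if probe():
        return 0
    # Hypothetical wording; the real diagnostic may differ.
    print("error: Docker daemon not reachable; start Docker and retry", file=sys.stderr)
    return 2
```

Taking the probe as a parameter keeps the exit-code logic testable without a daemon.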

Notes

`opensandbox==0.1.9` / `opensandbox-server==0.1.14` pinned exactly (the project warns about SDK/server version skew); bump together.
Known gap: `make onboarding-smoke PAID=1` now needs Docker-in-Docker for its eval step (flagged in the script header; free path unaffected).
Podman: deliberately deferred until upstream lands (feat(service): add podman runtime support opensandbox-group/OpenSandbox#626), then make the engine configurable.
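
Because the SDK and server pins have to move together, a check along these lines could catch skew early. It is a sketch only: the pins are the ones listed above, while `installed_versions` and `pin_mismatches` are hypothetical helpers that are not part of this PR.

```python
from importlib import metadata

# Pins as stated in the notes above.
PINS = {"opensandbox": "0.1.9", "opensandbox-server": "0.1.14"}


def installed_versions(names) -> dict:
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def pin_mismatches(installed: dict, pins: dict = PINS) -> dict:
    """Return {name: (wanted, found)} for every distribution off its pin."""
    return {
        name: (wanted, installed.get(name))
        for name, wanted in pins.items()
        if installed.get(name) != wanted
    }
```

`pin_mismatches(installed_versions(PINS))` returns an empty dict when both packages match their pins.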

🤖 Generated with Claude Code

avp now owns a local opensandbox-server: config TOML generated under ~/.avp/opensandbox (minted api key, bind mounts restricted to ~/.avp, bridge networking for egress policy), spawned detached on demand, healthy instances reused across invocations. Default-deny egress policy seeded with the model-provider domains. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
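
The reuse-when-healthy behaviour described above amounts to "probe, spawn once if needed, wait for health". A minimal sketch, assuming an HTTP health probe; the helper names, the retry numbers, and any health URL passed in are hypothetical and not taken from `osb.py`.

```python
import subprocess
import time
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 1.0) -> bool:
    """True if a GET on url returns HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        return False


def spawn_detached(cmd: list) -> None:
    """Start cmd in its own session so it outlives the CLI process (POSIX)."""
    subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,
    )


def ensure_server(health_url, spawn, probe=is_healthy, attempts=20, delay=0.25) -> str:
    """Reuse a healthy server; otherwise call spawn() once and wait for health."""
    if probe(health_url):
        return "reused"
    spawn()
    for _ in range(attempts):
        if probe(health_url):
            return "spawned"
        time.sleep(delay)
    raise RuntimeError(f"server at {health_url} did not become healthy")
```

A caller would pass a `spawn` callable that wraps `spawn_detached` with the real server command, which this sketch deliberately leaves unspecified.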

srt is gone. Every agent run now executes in an OpenSandbox container: run_agent creates a sandbox from the env's derived image (built and cached by images.py from image + packages + the agent's container recipe), bind-mounts the seeded workspace and a per-run io dir, runs the same run contract inside, and tails the trajectory on the host. Environments are image-first: {image, packages{apt,pip}, paths, files, setup, net, resources}; runtimes/expose and all host (uv) provisioning are deleted. Agent manifests gain an optional container block ({install, command}); in-tree agents get pinned built-in recipes (goose: linux release binary; claude-code: release wheels + claude CLI). The --sandbox flag is removed: Docker reachable is the one prerequisite (crisp diagnostic otherwise), and avp sandbox status/stop manage the CLI-owned server. Provider credentials forward by prefix allowlist; the rest of the sandbox env is fully declared. Seam tests keep the run contract + tail loop real via a host-side FakeSandbox that substitutes bind-mount paths. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
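
Caching a derived image needs a key that changes whenever the base image, the packages, or the agent recipe change. The sketch below shows one way to derive such a key; the hashing scheme, the 16-character truncation, and the choice to sort package lists are all assumptions, and this is not the tag format `images.py` uses.

```python
import hashlib
import json


def derived_image_key(image: str, packages: dict, recipe: dict) -> str:
    """Stable hex digest over the inputs that determine the derived image."""
    payload = {
        "image": image,
        # Assumption: package order is irrelevant, so sort to avoid cache busts.
        "packages": {manager: sorted(names) for manager, names in packages.items()},
        # Recipe steps keep their order, since install order can matter.
        "recipe": recipe,
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

Two specs that differ only in package ordering map to the same key, while any change to the image or recipe yields a new one.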

Verifies against a live Docker daemon: managed server bootstrap, the run contract inside a stock-image sandbox, trajectory streaming through the bind mount while the run is live, stderr-tail error reporting, default-deny egress enforcement, and derived-image build + cache. Fixes the SDK log shape (OutputMessage.text) the mocks had wrong. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
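
Streaming a trajectory through a bind mount while the run is live boils down to polling a host-side file for new bytes. The following is a sketch of one polling step, assuming a newline-delimited trajectory file; `read_new_lines` is a hypothetical helper, not the tail loop in `run_agent`.

```python
from pathlib import Path


def read_new_lines(path: Path, offset: int):
    """Return (complete lines appended since offset, new offset)."""
    if not path.exists():
        return [], offset
    with path.open("rb") as handle:
        handle.seek(offset)
        chunk = handle.read()
    # Hold back a trailing partial line until its newline arrives.
    end = chunk.rfind(b"\n") + 1
    lines = chunk[:end].decode("utf-8").split("\n")[:-1]
    return lines, offset + end
```

Calling it in a loop with the returned offset yields each line exactly once, even when the writer is caught mid-line.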

README (root + avp-cli) and CLAUDE.md rewritten for the OpenSandbox model: Docker as the one prerequisite, image-first env spec, container recipes, avp sandbox status/stop, make test-docker. Example envs updated to the new shape; last srt mentions in avp-cli removed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Container recipes (and the manifest container block) gain an env field for agent-required sandbox vars. The claude CLI refuses bypassPermissions as root unless IS_SANDBOX=1 — which is accurate here. Paid in-container verification: goose (2 turns, $0.031) and claude-code (3 turns, $0.027) both completed a real file-write task on Haiku and the edits persisted to the host workspace through the bind mount. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Sandboxed runs are mandatory, so the root README Quickstart now names the Docker daemon as a prerequisite with install one-liners, and notes first-run sandbox setup vs warm starts. Flag the onboarding-smoke PAID gap (eval inside the smoke container now needs DinD). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Three write-run-report katas, exact-match scored, head-to-head on both in-tree agents in the default sandbox world. Commission created via the CLI (recorded in the README); the eval JSON is the one hand-authored artifact. Smoke-verified on one item: both agents 100%, claude-code $0.026/run vs goose $0.047/run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

hamming-10000 (the 10,000th 5-smooth number, ~2.9e17) punishes brute-force scanning and sloppy dedup. Smoke: claude-code solves it (5 turns, right algorithm); goose computes the right number but breaks the answer-only output contract under load, so exact-match fails it — a real instruction-compliance divergence, which is the point. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
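
For context on why the kata punishes brute force: scanning integers up to ~2.9e17 is infeasible, while the classic three-pointer merge builds the sequence in order with dedup built in. The sketch below is that textbook algorithm, not either agent's submitted solution, and it assumes 1-based indexing with 1 as the first 5-smooth number.

```python
def hamming(n: int) -> int:
    """Return the n-th 5-smooth number (1-indexed), so hamming(1) == 1."""
    if n < 1:
        raise ValueError("n must be >= 1")
    seq = [1]
    i2 = i3 = i5 = 0  # indices of the next element to multiply by 2, 3, 5
    while len(seq) < n:
        n2, n3, n5 = seq[i2] * 2, seq[i3] * 3, seq[i5] * 5
        nxt = min(n2, n3, n5)
        seq.append(nxt)
        # Advance every pointer that produced nxt; this is the dedup step.
        if nxt == n2:
            i2 += 1
        if nxt == n3:
            i3 += 1
        if nxt == n5:
            i5 += 1
    return seq[-1]


print(hamming(10_000))
```

Each step does constant work, so the 10,000th term takes 10,000 iterations and Python's arbitrary-precision integers handle the magnitude without overflow.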

CLAUDE_CODE_OAUTH_TOKEN (from claude setup-token) lets the claude-code agent run on a Claude subscription instead of an API key; the prefix allowlist now passes it through. goose still needs a real API key (it calls the Anthropic API directly). Documented in the credentials note. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
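
Credential forwarding by allowlist can be sketched as a filter over the host environment. Assumptions: only `ANTHROPIC_` and `GOOSE_` are named in this PR (the full prefix list is elided), and treating `CLAUDE_CODE_OAUTH_TOKEN` as an exact-name entry is a guess at how it is admitted; the constant and function names are hypothetical.

```python
# Assumed allowlist. The PR names ANTHROPIC_ and GOOSE_ and elides the rest.
FORWARDED_PREFIXES = ("ANTHROPIC_", "GOOSE_")
# Assumption: the OAuth token is admitted by exact name rather than a prefix.
FORWARDED_EXACT = ("CLAUDE_CODE_OAUTH_TOKEN",)


def forwarded_env(host_env, prefixes=FORWARDED_PREFIXES, exact=FORWARDED_EXACT) -> dict:
    """Keep only the variables allowed to cross into the sandbox."""
    return {
        name: value
        for name, value in host_env.items()
        if name.startswith(prefixes) or name in exact
    }
```

Passing `os.environ` as `host_env` would drop everything else, such as `HOME` or cloud credentials, before the sandbox is created.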

…an't enforce

OpenSandbox's egress sidecar silently disables itself when it can't get its netfilter hooks (observed on GitHub Actions; Docker Desktop's VM enforces fine). The test now fails hard under AVP_REQUIRE_EGRESS_ENFORCEMENT=1 (make test-docker, the strict local gate) and skips with evidence (the policy snapshot in the skip reason) elsewhere, so CI keeps the 3 seam tests that do hold on its runners. README notes the host dependency. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
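
The strict-versus-lenient behaviour of the egress test reduces to a three-way decision. A minimal sketch, assuming the test has already observed whether a non-allowlisted request was blocked; `egress_outcome` and its messages are hypothetical, and only the `AVP_REQUIRE_EGRESS_ENFORCEMENT` variable comes from this commit.

```python
import os


def egress_outcome(blocked: bool, policy_snapshot: str, env=None):
    """Return (outcome, reason) where outcome is 'pass', 'fail', or 'skip'."""
    env = os.environ if env is None else env
    if blocked:
        return "pass", "egress to a non-allowlisted host was blocked"
    if env.get("AVP_REQUIRE_EGRESS_ENFORCEMENT") == "1":
        # Strict gate: an unenforced policy is a real failure.
        return "fail", "egress policy not enforced on this host"
    # Lenient path: skip, but carry the evidence in the reason.
    return "skip", f"egress not enforced on this host; policy snapshot: {policy_snapshot}"
```

In a pytest test the `'skip'` branch would map to `pytest.skip(reason)` and the `'fail'` branch to `pytest.fail(reason)`.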

pk8189 and others added 14 commits June 4, 2026 15:45

docs(cli): drop GOOSE_PROVIDER from the coding example — goose defaul…

5eea44c

…ts to anthropic Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs: invoke the CLI as bare avp everywhere

8afdd9e

All command docs now assume avp (and avp-conformance) are on PATH; the quickstarts gain the one-line venv activation that makes it true. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs: Docker first — Quickstart step 1 installs the daemon, step 2 th…

0429879

…e CLI Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

readme update

4477a89

pk8189 merged commit f5edf57 into main Jun 4, 2026
3 checks passed

pk8189 merged 14 commits into main from opensandbox-mandatory
