Mandatory OpenSandbox runs: containers replace srt, environments become images #22
Merged
avp now owns a local opensandbox-server: config TOML generated under ~/.avp/opensandbox (minted api key, bind mounts restricted to ~/.avp, bridge networking for egress policy), spawned detached on demand, healthy instances reused across invocations. Default-deny egress policy seeded with the model-provider domains. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
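A minimal sketch of what "default-deny, seeded with provider domains" means in practice. The class name, the matching rule (exact host or subdomain), and the seed domain are illustrative assumptions; the actual policy format used by opensandbox-server is not shown in this PR text.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EgressPolicy:
    """Hypothetical default-deny egress policy: only allowlisted hosts pass."""

    # Illustrative seed only; the real list of model-provider domains is assumed.
    allowed_domains: frozenset = field(
        default_factory=lambda: frozenset({"api.anthropic.com"})
    )

    def allows(self, host: str) -> bool:
        # Normalise case and a trailing root dot before comparing.
        host = host.lower().rstrip(".")
        # Exact match or a true subdomain of an allowed domain; all else is denied.
        return any(
            host == domain or host.endswith("." + domain)
            for domain in self.allowed_domains
        )
```

Matching on a leading dot keeps lookalike hosts such as `notapi.anthropic.com` or `api.anthropic.com.attacker.net` on the deny side.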
srt is gone. Every agent run now executes in an OpenSandbox container:
run_agent creates a sandbox from the env's derived image (built and
cached by images.py from image + packages + the agent's container
recipe), bind-mounts the seeded workspace and a per-run io dir, runs
the same run contract inside, and tails the trajectory on the host.
Environments are image-first: {image, packages{apt,pip}, paths, files,
setup, net, resources}; runtimes/expose and all host (uv) provisioning
are deleted. Agent manifests gain an optional container block
({install, command}); in-tree agents get pinned built-in recipes
(goose: linux release binary; claude-code: release wheels + claude CLI).
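One way a derived image can be cached is to key it on everything that feeds the build. The sketch below hashes the base image, the packages, and the agent's container recipe into a tag; the function name, the `avp-env:` tag prefix, and the dict shapes are assumptions, not the contents of images.py.

```python
import hashlib
import json


def derived_image_tag(image: str, packages: dict, recipe: dict) -> str:
    """Return a deterministic tag for (base image, packages, agent recipe)."""
    # Canonical JSON (sorted keys, fixed separators) so dict ordering
    # does not change the hash.
    payload = json.dumps(
        {"image": image, "packages": packages, "recipe": recipe},
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    # Hypothetical tag prefix; any change to an input yields a new tag,
    # so an existing tag can be reused without rebuilding.
    return f"avp-env:{digest}"
```

With a scheme like this, a cache hit is just "does an image with this tag already exist locally".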
The --sandbox flag is removed: Docker reachable is the one prerequisite
(crisp diagnostic otherwise), and avp sandbox status/stop manage the
CLI-owned server. Provider credentials forward by prefix allowlist;
the rest of the sandbox env is fully declared.
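The credential rule above can be sketched as a filter over the host environment. The prefixes and function names here are assumptions chosen for illustration; the allowlist actually shipped is not listed in this commit message.

```python
# Hypothetical prefixes; the real allowlist is not spelled out here.
DEFAULT_PREFIXES = ("ANTHROPIC_", "CLAUDE_CODE_")


def forward_credentials(host_env: dict, prefixes=DEFAULT_PREFIXES) -> dict:
    """Keep only host variables whose names start with an allowed prefix."""
    allowed = tuple(prefixes)
    return {k: v for k, v in host_env.items() if k.startswith(allowed)}


def build_sandbox_env(host_env: dict, declared: dict, prefixes=DEFAULT_PREFIXES) -> dict:
    """Sandbox env = explicitly declared vars + allowlisted credentials.

    Nothing else from the host leaks in; declared values win on conflict.
    """
    env = forward_credentials(host_env, prefixes)
    env.update(declared)
    return env
```

For example, `build_sandbox_env({"ANTHROPIC_API_KEY": "k", "HOME": "/Users/me"}, {"IS_SANDBOX": "1"})` keeps the API key and the declared variable and drops `HOME`.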
Seam tests keep the run contract + tail loop real via a host-side
FakeSandbox that substitutes bind-mount paths.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Verifies against a live Docker daemon: managed server bootstrap, the run contract inside a stock-image sandbox, trajectory streaming through the bind mount while the run is live, stderr-tail error reporting, default-deny egress enforcement, and derived-image build + cache. Fixes the SDK log shape (OutputMessage.text) the mocks had wrong. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
README (root + avp-cli) and CLAUDE.md rewritten for the OpenSandbox model: Docker as the one prerequisite, image-first env spec, container recipes, avp sandbox status/stop, make test-docker. Example envs updated to the new shape; last srt mentions in avp-cli removed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Container recipes (and the manifest container block) gain an env field for agent-required sandbox vars. The claude CLI refuses bypassPermissions as root unless IS_SANDBOX=1 — which is accurate here. Paid in-container verification: goose (2 turns, $0.031) and claude-code (3 turns, $0.027) both completed a real file-write task on Haiku and the edits persisted to the host workspace through the bind mount. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sandboxed runs are mandatory, so the root README Quickstart now names the Docker daemon as a prerequisite with install one-liners, and notes first-run sandbox setup vs warm starts. Flag the onboarding-smoke PAID gap (eval inside the smoke container now needs DinD). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Three write-run-report katas, exact-match scored, head-to-head on both in-tree agents in the default sandbox world. Commission created via the CLI (recorded in the README); the eval JSON is the one hand-authored artifact. Smoke-verified on one item: both agents 100%, claude-code $0.026/run vs goose $0.047/run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
hamming-10000 (the 10,000th 5-smooth number, ~2.9e17) punishes brute-force scanning and sloppy dedup. Smoke: claude-code solves it (5 turns, right algorithm); goose computes the right number but breaks the answer-only output contract under load, so exact-match fails it — a real instruction-compliance divergence, which is the point. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
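For reference, a sketch of one standard way to avoid brute-force scanning: the three-pointer merge, which generates 5-smooth numbers in order and handles duplicates by advancing every pointer that produced the minimum. This is not necessarily the solution either agent wrote.

```python
def hamming(n: int) -> int:
    """Return the n-th 5-smooth (Hamming) number, 1-indexed."""
    if n < 1:
        raise ValueError("n must be >= 1")
    seq = [1]
    i2 = i3 = i5 = 0  # indices of the next element to multiply by 2, 3, 5
    while len(seq) < n:
        c2, c3, c5 = 2 * seq[i2], 3 * seq[i3], 5 * seq[i5]
        nxt = min(c2, c3, c5)
        seq.append(nxt)
        # Advance every pointer that produced nxt; this is the dedup step
        # (e.g. 6 arrives as both 2*3 and 3*2 and must be emitted once).
        if nxt == c2:
            i2 += 1
        if nxt == c3:
            i3 += 1
        if nxt == c5:
            i5 += 1
    return seq[n - 1]
```

The loop does constant work per emitted number, so `hamming(10000)` needs only 10,000 iterations, and Python integers do not overflow at this magnitude.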
…ts to anthropic Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
All command docs now assume avp (and avp-conformance) are on PATH; the quickstarts gain the one-line venv activation that makes it true. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e CLI Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CLAUDE_CODE_OAUTH_TOKEN (from claude setup-token) lets the claude-code agent run on a Claude subscription instead of an API key; the prefix allowlist now passes it through. goose still needs a real API key (it calls the Anthropic API directly). Documented in the credentials note. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…an't enforce OpenSandbox's egress sidecar disables itself silently when it can't get its netfilter hooks (observed on GitHub Actions; Docker Desktop's VM enforces fine). The test now: hard failure under AVP_REQUIRE_EGRESS_ENFORCEMENT=1 (make test-docker, the strict local gate), skip-with-evidence (policy snapshot in the skip reason) elsewhere, so CI keeps the 3 seam tests that do hold on its runners. README notes the host dependency. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
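The strict-versus-lenient behaviour described above can be sketched as a single helper. The helper name and the snapshot format are hypothetical; only the `AVP_REQUIRE_EGRESS_ENFORCEMENT` variable comes from the commit message, and the sketch assumes the test runner treats `unittest.SkipTest` as a skip.

```python
import os
import unittest


def handle_unenforced_egress(policy_snapshot: str, environ=os.environ) -> None:
    """Call when a request the policy should have blocked got through.

    Strict gate: fail hard. Elsewhere: skip, carrying the policy snapshot
    as evidence that the policy was set and the host did not enforce it.
    """
    if environ.get("AVP_REQUIRE_EGRESS_ENFORCEMENT") == "1":
        raise AssertionError(
            f"egress policy not enforced on this host; policy: {policy_snapshot}"
        )
    raise unittest.SkipTest(
        f"host cannot enforce egress policy; policy: {policy_snapshot}"
    )
```

Putting the snapshot in the skip reason means a skipped run still records what was configured, which is what distinguishes "host cannot enforce" from "policy was never applied".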
What
Every `avp eval` / `avp run` / `avp env run` now executes the agent inside an OpenSandbox container. srt and all host-machine provisioning are deleted; nothing the agent sees comes from the host. The `--sandbox` flag is gone — Docker reachable is the one prerequisite (crisp exit-2 diagnostic otherwise).
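A rough sketch of the "Docker reachable or exit 2" check, assuming it is done by probing the daemon with `docker info`. The function names and the diagnostic wording are illustrative, not the CLI's actual implementation.

```python
import subprocess
import sys


def docker_reachable(cmd=("docker", "info"), timeout: float = 10.0) -> bool:
    """Return True if the probe command runs and exits 0."""
    try:
        result = subprocess.run(
            list(cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing or not executable, or the daemon hung.
        return False
    return result.returncode == 0


def require_docker(cmd=("docker", "info")) -> None:
    """Exit with status 2 and a short diagnostic if Docker is unreachable."""
    if not docker_reachable(cmd):
        # Hypothetical message text.
        print("error: Docker daemon not reachable; start Docker and retry",
              file=sys.stderr)
        raise SystemExit(2)
```

Calling `require_docker()` once at the top of each command keeps the failure mode identical across `avp eval`, `avp run`, and `avp env run`.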
/.avp/opensandbox/` (port 18763, minted API key, bind mounts confined to `/.avp`, bridge networking), server spawned detached and reused when healthy. `avp sandbox status|stop`.
Verification
Notes
🤖 Generated with Claude Code