Mandatory OpenSandbox runs: containers replace srt, environments become images #22

Merged: pk8189 merged 14 commits into main from opensandbox-mandatory on Jun 4, 2026

@pk8189 pk8189 commented Jun 4, 2026

What

Every `avp eval` / `avp run` / `avp env run` now executes the agent inside an OpenSandbox container. srt and all host-machine provisioning are deleted; nothing the agent sees comes from the host. The `--sandbox` flag is gone — Docker reachable is the one prerequisite (crisp exit-2 diagnostic otherwise).

  • `osb.py`: the CLI manages its own OpenSandbox control plane. The config is generated once under `~/.avp/opensandbox/` (port 18763, minted API key, bind mounts confined to `~/.avp`, bridge networking), and the server is spawned detached and reused when healthy. `avp sandbox status|stop` manage it.
  • Environments are image-first — `{image, packages:{apt,pip}, paths, files, setup, net, resources}`. `packages` + the agent's container recipe compile to a cached derived image (`avp-env:`, `images.py`); `paths`/`files` seed a per-run host workspace bind-mounted at `/avp/workspace`; egress is default-deny + model-provider domains + `net`. Old `runtimes`/`expose` specs get a teaching error.
  • Agent container recipes — manifests gain an optional `container` block (`{install, command, env}`); built-in pinned recipes for goose (Linux release binary, arch-aware) and claude-code (release wheels + node + claude CLI). Only provider-prefixed env vars (`ANTHROPIC_`, `GOOSE_`, …) cross into the sandbox.
  • `run_agent` — fresh sandbox per cell, run contract executed inside, trajectory tailed live on the host through the bind mount, `kill()` in finally.
  • Example: `avp-cli/examples/coding/` — 4 coding katas, claude-code vs goose head-to-head; the hard kata (`hamming-10000`) genuinely separates them.
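
To make the image-first environment shape concrete, here is a minimal Python sketch of a spec and a validator that rejects the retired keys. The field names come from the description above; the value types, the example values, the rule that `image` is required, and the `validate_env_spec` helper are all assumptions for illustration, not avp's actual schema or error text.

```python
# Hypothetical sketch: key names are from the PR text; value types, the
# "image is required" rule, and the error wording are assumptions.
ALLOWED_KEYS = {"image", "packages", "paths", "files", "setup", "net", "resources"}
RETIRED_KEYS = {"runtimes", "expose"}


def validate_env_spec(spec: dict) -> dict:
    """Return the spec unchanged if it uses only image-first keys."""
    retired = spec.keys() & RETIRED_KEYS
    if retired:
        # The "teaching error": say what changed and what to use instead.
        raise ValueError(
            f"{sorted(retired)} are no longer supported; environments are "
            "image-first now. Declare an 'image' and list dependencies "
            "under 'packages' (apt/pip)."
        )
    unknown = spec.keys() - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown environment keys: {sorted(unknown)}")
    if "image" not in spec:
        raise ValueError("an environment must name a base 'image'")
    extra = spec.get("packages", {}).keys() - {"apt", "pip"}
    if extra:
        raise ValueError(f"unknown package managers: {sorted(extra)}")
    return spec


example_spec = {
    "image": "python:3.12-slim",  # illustrative base image
    "packages": {"apt": ["git"], "pip": ["pytest"]},
    "paths": ["katas/"],          # assumed: host paths seeded into the workspace
    "net": ["pypi.org"],          # assumed: extra egress-allowed domains
}
validate_env_spec(example_spec)
```

Rejecting `runtimes`/`expose` with a message that names the replacement keys is one way to read "teaching error"; the real wording lives in the CLI.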

Verification

  • `make check` green; `grep -ri srt avp-cli` clean (the conformance harness's own host-side srt wrapper is untouched — follow-up).
  • New `make test-docker`: 4 real-sandbox seam tests — trajectory streams through the bind mount while the run is live, stderr-tail error reporting, egress deny actually blocks, image build + cache. ~8.5s warm.
  • Paid in-container verification (Haiku): goose 2 turns/$0.031 and claude-code 3 turns/$0.027 both completed a real file-write task; edits persisted to the host workspace. The srt-era claude-code overlay bug (discarded edits) does not reproduce in-container. Running claude-code as root with bypassPermissions needs `IS_SANDBOX=1`, which its recipe carries.
  • Cold run (server boot + image build) ~4.6s; warm ~2.9s. Docker-off fails fast with exit 2.
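
The cold/warm gap depends on the derived image being reused when its inputs have not changed. The PR does not say how `images.py` keys its cache, so the following is only a sketch of one common approach: hash the base image, packages, and agent recipe into a deterministic tag. The `derived_image_tag` name and the 12-character digest suffix are hypothetical.

```python
import hashlib
import json


def derived_image_tag(base_image: str, packages: dict, recipe: dict) -> str:
    """Deterministic tag: identical inputs give an identical tag.

    Assumption: the cache key is a content hash of the build inputs.
    The real scheme used by images.py is not shown in this PR.
    """
    canonical = json.dumps(
        {"image": base_image, "packages": packages, "recipe": recipe},
        sort_keys=True,          # key order must not change the hash
        separators=(",", ":"),
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return f"avp-env:{digest}"
```

With a key like this, a build step can check whether the tag already exists locally and skip the build when it does.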

Notes

  • `opensandbox==0.1.9` / `opensandbox-server==0.1.14` pinned exactly (the project warns about SDK/server version skew); bump together.
  • Known gap: `make onboarding-smoke PAID=1` now needs Docker-in-Docker for its eval step (flagged in the script header; free path unaffected).
  • Podman: deliberately deferred until upstream podman runtime support lands (opensandbox-group/OpenSandbox#626, "feat(service): add podman runtime support"); the plan is to make the engine configurable after that.

🤖 Generated with Claude Code

pk8189 and others added 14 commits June 4, 2026 15:45
avp now owns a local opensandbox-server: config TOML generated under
~/.avp/opensandbox (minted api key, bind mounts restricted to ~/.avp,
bridge networking for egress policy), spawned detached on demand,
healthy instances reused across invocations. Default-deny egress
policy seeded with the model-provider domains.
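
The "spawned detached on demand, healthy instances reused" behaviour can be sketched as below. This is a minimal illustration, not the code in `osb.py`: the TCP-connect health probe is an assumption (the PR does not say how health is checked), and `port_is_open`, `spawn_detached`, and `ensure_server` are hypothetical names. Only the port number comes from the PR description.

```python
import socket
import subprocess
import time
from typing import Callable, Sequence

OSB_PORT = 18763  # port named in the PR description


def port_is_open(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Assumed health probe: can we open a TCP connection to the server?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def spawn_detached(argv: Sequence[str]) -> None:
    """Start a process in its own session so it outlives this CLI call."""
    subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # POSIX: detach from our process group
    )


def ensure_server(
    is_healthy: Callable[[], bool],
    spawn: Callable[[], None],
    wait_s: float = 10.0,
    poll_s: float = 0.1,
) -> str:
    """Reuse a healthy server, otherwise spawn one and wait for it."""
    if is_healthy():
        return "reused"
    spawn()
    deadline = time.monotonic() + wait_s
    while time.monotonic() < deadline:
        if is_healthy():
            return "spawned"
        time.sleep(poll_s)
    raise TimeoutError("sandbox server did not become healthy in time")
```

A caller would pass `lambda: port_is_open(OSB_PORT)` as the probe and a `spawn_detached(...)` closure with whatever argv launches the server; that command is not shown in this PR, so it is left out here.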

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
srt is gone. Every agent run now executes in an OpenSandbox container:
run_agent creates a sandbox from the env's derived image (built and
cached by images.py from image + packages + the agent's container
recipe), bind-mounts the seeded workspace and a per-run io dir, runs
the same run contract inside, and tails the trajectory on the host.
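
Because the io dir is bind-mounted, the host can follow the trajectory file with an ordinary polling tail while the container is still writing. The sketch below shows that idea only; `tail_lines`, the `is_done` callback, and the polling interval are hypothetical, and the real loop in `run_agent` may differ.

```python
import os
import time
from typing import Callable, Iterator


def tail_lines(
    path: str,
    is_done: Callable[[], bool],
    poll_s: float = 0.05,
) -> Iterator[str]:
    """Yield complete lines appended to `path` until the run is done and drained."""
    # The writer may not have created the file yet.
    while not os.path.exists(path):
        if is_done():
            return
        time.sleep(poll_s)
    buf = ""
    with open(path, "r", encoding="utf-8") as f:
        while True:
            done = is_done()  # sample before reading so a final write is not missed
            chunk = f.read()
            if chunk:
                buf += chunk
                # Emit whole lines only; keep a partial tail for the next read.
                *lines, buf = buf.split("\n")
                for line in lines:
                    yield line
            elif done:
                if buf:
                    yield buf  # last line had no trailing newline
                return
            else:
                time.sleep(poll_s)
```

Sampling `is_done()` before each read means that anything written just before the run finished is still picked up on the final pass.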

Environments are image-first: {image, packages{apt,pip}, paths, files,
setup, net, resources}; runtimes/expose and all host (uv) provisioning
are deleted. Agent manifests gain an optional container block
({install, command}); in-tree agents get pinned built-in recipes
(goose: linux release binary; claude-code: release wheels + claude CLI).
The --sandbox flag is removed: Docker reachable is the one prerequisite
(crisp diagnostic otherwise), and avp sandbox status/stop manage the
CLI-owned server. Provider credentials forward by prefix allowlist;
the rest of the sandbox env is fully declared.
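
A prefix allowlist of this kind is small enough to sketch. Only `ANTHROPIC_` and `GOOSE_` are named in this PR; the full list is elided there, so the tuple below is illustrative, and `forwarded_env` is a hypothetical helper name.

```python
import os
from typing import Mapping

# Only these two prefixes are named in the PR; the real list is longer.
PROVIDER_PREFIXES = ("ANTHROPIC_", "GOOSE_")


def forwarded_env(
    host_env: Mapping[str, str],
    prefixes: tuple[str, ...] = PROVIDER_PREFIXES,
) -> dict[str, str]:
    """Keep only variables whose names start with an allowlisted prefix."""
    # str.startswith accepts a tuple and matches any of its members.
    return {k: v for k, v in host_env.items() if k.startswith(prefixes)}


# Typical call: filter the current process environment.
sandbox_credentials = forwarded_env(os.environ)
```

Everything not matched (`HOME`, `PATH`, unrelated cloud credentials) stays on the host, which is what keeps the rest of the sandbox environment fully declared.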

Seam tests keep the run contract + tail loop real via a host-side
FakeSandbox that substitutes bind-mount paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Verifies against a live Docker daemon: managed server bootstrap, the
run contract inside a stock-image sandbox, trajectory streaming through
the bind mount while the run is live, stderr-tail error reporting,
default-deny egress enforcement, and derived-image build + cache.
Fixes the SDK log shape (OutputMessage.text) the mocks had wrong.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
README (root + avp-cli) and CLAUDE.md rewritten for the OpenSandbox
model: Docker as the one prerequisite, image-first env spec, container
recipes, avp sandbox status/stop, make test-docker. Example envs
updated to the new shape; last srt mentions in avp-cli removed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Container recipes (and the manifest container block) gain an env field
for agent-required sandbox vars. The claude CLI refuses
bypassPermissions as root unless IS_SANDBOX=1 — which is accurate here.
Paid in-container verification: goose (2 turns, $0.031) and claude-code
(3 turns, $0.027) both completed a real file-write task on Haiku and
the edits persisted to the host workspace through the bind mount.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sandboxed runs are mandatory, so the root README Quickstart now names
the Docker daemon as a prerequisite with install one-liners, and notes
first-run sandbox setup vs warm starts. Flag the onboarding-smoke PAID
gap (eval inside the smoke container now needs DinD).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Three write-run-report katas, exact-match scored, head-to-head on both
in-tree agents in the default sandbox world. Commission created via the
CLI (recorded in the README); the eval JSON is the one hand-authored
artifact. Smoke-verified on one item: both agents 100%, claude-code
$0.026/run vs goose $0.047/run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
hamming-10000 (the 10,000th 5-smooth number, ~2.9e17) punishes
brute-force scanning and sloppy dedup. Smoke: claude-code solves it
(5 turns, right algorithm); goose computes the right number but breaks
the answer-only output contract under load, so exact-match fails it —
a real instruction-compliance divergence, which is the point.
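
For reference, the standard way to avoid both brute-force scanning and duplicate entries is the three-pointer merge shown below. This is the textbook algorithm, not either agent's actual submission, which the PR does not include.

```python
def hamming(n: int) -> int:
    """Return the n-th 5-smooth number (1-indexed) via a three-pointer merge."""
    if n < 1:
        raise ValueError("n must be >= 1")
    h = [1] * n
    i2 = i3 = i5 = 0
    for k in range(1, n):
        n2, n3, n5 = 2 * h[i2], 3 * h[i3], 5 * h[i5]
        nxt = min(n2, n3, n5)
        h[k] = nxt
        # Advance every pointer that produced nxt: this is the dedup step.
        # Using elif here would emit values such as 6 (2*3 and 3*2) twice.
        if nxt == n2:
            i2 += 1
        if nxt == n3:
            i3 += 1
        if nxt == n5:
            i5 += 1
    return h[n - 1]


print(hamming(10_000))
```

It runs in O(n) time with Python's arbitrary-precision integers, so the ~2.9e17 result needs no special handling.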

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ts to anthropic

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
All command docs now assume avp (and avp-conformance) are on PATH;
the quickstarts gain the one-line venv activation that makes it true.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e CLI

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CLAUDE_CODE_OAUTH_TOKEN (from claude setup-token) lets the claude-code
agent run on a Claude subscription instead of an API key; the prefix
allowlist now passes it through. goose still needs a real API key (it
calls the Anthropic API directly). Documented in the credentials note.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…an't enforce

OpenSandbox's egress sidecar disables itself silently when it can't get
its netfilter hooks (observed on GitHub Actions; Docker Desktop's VM
enforces fine). The test now: hard failure under
AVP_REQUIRE_EGRESS_ENFORCEMENT=1 (make test-docker, the strict local
gate), skip-with-evidence (policy snapshot in the skip reason)
elsewhere, so CI keeps the 3 seam tests that do hold on its runners.
README notes the host dependency.
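
The three-way outcome described above (pass, hard failure, or skip with evidence) can be written as a small decision function. This is a sketch of the logic only: `egress_test_outcome` and its parameters are hypothetical, and the actual test presumably calls the test framework's fail/skip hooks directly. The environment variable name is the one given in the commit message.

```python
import os
from typing import Mapping


def egress_test_outcome(
    blocked: bool,
    policy_snapshot: str,
    env: Mapping[str, str] = os.environ,
) -> tuple[str, str]:
    """Map the egress probe result to an (outcome, reason) pair."""
    if blocked:
        return ("pass", "denied host was unreachable")
    if env.get("AVP_REQUIRE_EGRESS_ENFORCEMENT") == "1":
        # Strict local gate: an unenforced policy is a real failure.
        return ("fail", "egress was not enforced on this host")
    # Not enforced and not required: skip, but keep the evidence.
    return ("skip", f"egress not enforced here; policy was: {policy_snapshot}")
```

In a pytest-based suite, "fail" and "skip" would map to `pytest.fail(reason)` and `pytest.skip(reason)` respectively.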

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@pk8189 pk8189 merged commit f5edf57 into main Jun 4, 2026
3 checks passed