
Commit 1ea5a63

Skobeltsyn and claude committed
docs(#1726): refine sandbox roadmap + add multimodal I/O
Sandboxing refinements (informed by Claude Code's actual impl):
- Scope ProcessSandbox to subprocess-shaped tools only (matches Claude Code's Bash-only scope; in-process lambdas covered by grants { } + frozen agents already).
- Name Seatbelt explicitly for macOS, bwrap primary for Linux, firejail as fallback. WSL1 unsupported by design.
- Add network-proxy sub-policy with TLS-inspection caveat (hostname-only gating; domain-fronting risk if allowlist too broad). Reference anthropic-experimental/sandbox-runtime as the canonical Linux/bwrap reference.
- Document permission/sandbox path merge — both layers apply.

Multimodal additions:
- Phase 2: image input (Anthropic / OpenAI / Ollama / Gemini) and audio input (Gemini, GPT-4o-audio) as LlmContent sealed blocks. Binary-compat path: add contentBlocks sibling field first, deprecate the String content later.
- Phase 3: generative outputs — ImageModelClient (DALL-E, Imagen, Stability) and TTSModelClient (OpenAI TTS, ElevenLabs, Google). Streaming via LlmChunk.ImageDelta / AudioDelta for partial-preview and low-latency playback.
- README Limitations gains an honest "text-only I/O today" entry pointing at the phased plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent bde0d91 commit 1ea5a63

2 files changed

Lines changed: 15 additions & 4 deletions

README.md

Lines changed: 3 additions & 2 deletions
@@ -164,6 +164,7 @@ What the framework does **not** enforce — your responsibility:
 - **No A2A protocol yet** — agent-to-agent over network (Phase 2 / 3).
 - **Inline-tool-call fallback model variance** — small Ollama models (e.g. `gemma3:4b`) reliably emit single tool calls via the inline format but may produce thin final-turn text after multi-step tool sequences. For multi-step reasoning, a tool-native model (`gpt-oss:20b-cloud` and similar) is the better fit.
 - **No tool sandboxing** — tool executors run in-process with full JVM privileges. `grants { }` controls *which* tools an agent can call, not *what they can do* once invoked. Sandboxed execution (`ProcessSandbox` / `WasmSandbox` / `DockerSandbox` opt-in backends) is on the Phase 3 roadmap.
+- **Text-only I/O today** — `LlmMessage.content: String` carries text. Image input (vision-capable adapters: Anthropic, OpenAI, Ollama, Gemini) and audio input land in Phase 2 alongside an `LlmContent` sealed-block evolution of the message model. Image generation (`ImageModelClient`: DALL-E, Imagen, Stability) and text-to-speech (`TTSModelClient`: OpenAI TTS, ElevenLabs, Google) are Phase 3.

 For planned features beyond these limitations, see [docs/roadmap.md](docs/roadmap.md).

@@ -221,9 +222,9 @@ Testing details — task names, integration test setup, mutation testing, how to

 **Phase 1 — Core DSL** *(in progress)*: typed agents, skills, knowledge, composition operators (`then`, `/`, `*`, `forum`, `.loop`, `.branch`), MCP client + server, agent memory, `loadResource(path)` for prompts from classpath, agentic loop with full budget controls (`maxTurns` / `maxToolCalls` / `maxDuration` / `perToolTimeout` / `maxTokens` / `maxConsecutiveSameTool`), observability hooks (`onSkillChosen`, `onToolUse`, `onKnowledgeUsed`, `onError`, `onBudgetThreshold`, `Agent.observe { }`).

-**Phase 2 — Runtime + Distribution** *(Q2 2026)*: remaining provider (Google), `Flow<...>` streaming on every adapter, KSP compile-time `@Generable`, native CLI / jlink, `Tool<IN, OUT>` hierarchy, `grants {}` permissions, session model, Flow-based observability, `agent.json` serialization, Gradle plugin. *(Anthropic and OpenAI adapters already landed in #1644 and #1656.)*
+**Phase 2 — Runtime + Distribution** *(Q2 2026)*: remaining provider (Google), `Flow<...>` streaming on every adapter, KSP compile-time `@Generable`, native CLI / jlink, `Tool<IN, OUT>` hierarchy, `grants {}` permissions, session model, Flow-based observability, **multimodal input** (image + audio content blocks; vision-capable adapters for Anthropic/OpenAI/Ollama/Gemini), `agent.json` serialization, Gradle plugin. *(Anthropic and OpenAI adapters already landed in #1644 and #1656.)*

-**Phase 3 — Production** *(Q3 2026)*: Layer 2 Structure DSL, all 37 compile-time validations, AgentUnit, A2A protocol, file-based knowledge with RAG, OpenTelemetry, **sandboxed tool execution** (`SandboxedExecutor` with `ProcessSandbox` / `WasmSandbox` / `DockerSandbox` backends — opt-in per tool, default executor stays in-process).
+**Phase 3 — Production** *(Q3 2026)*: Layer 2 Structure DSL, all 37 compile-time validations, AgentUnit, A2A protocol, file-based knowledge with RAG, OpenTelemetry, **sandboxed tool execution** (`SandboxedExecutor` with `ProcessSandbox` (Seatbelt / bwrap), `WasmSandbox` (Chicory), `DockerSandbox` backends — opt-in per tool, subprocess-shaped tools only, default executor stays in-process), **generative outputs** (`ImageModelClient` for DALL-E / Imagen / Stability, `TTSModelClient` for OpenAI / ElevenLabs / Google).

 **Phase 4 — Ecosystem** *(Q4 2026)*: knowledge packs, NL → DSL generation, Skillify, visual editor, knowledge marketplace.
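Reviewer sketch (not part of the commit): the commit message names Seatbelt for macOS, bwrap as the Linux primary, and firejail as the fallback, with a loud warning when nothing is available. A minimal way to express that selection order might look like the following. `SandboxBackend`, `pickBackend`, and `isOnPath` are hypothetical names introduced here for illustration, and WSL1 detection is deliberately left out.

```kotlin
import java.io.File

// Hypothetical enum; the commit does not define these names.
enum class SandboxBackend { SEATBELT, BWRAP, FIREJAIL, UNSANDBOXED }

// Chooses a backend in the order the commit message describes.
// `onPath` is injected so the decision can be exercised without touching the host.
fun pickBackend(osName: String, onPath: (String) -> Boolean): SandboxBackend {
    val os = osName.lowercase()
    return when {
        os.contains("mac") -> SandboxBackend.SEATBELT                 // assumed present on macOS
        os.contains("linux") && onPath("bwrap") -> SandboxBackend.BWRAP
        os.contains("linux") && onPath("firejail") -> SandboxBackend.FIREJAIL
        else -> SandboxBackend.UNSANDBOXED                            // plain ProcessBuilder path
    }
}

// Looks for an executable with the given name in each PATH directory.
fun isOnPath(binary: String): Boolean =
    (System.getenv("PATH") ?: "")
        .split(File.pathSeparator)
        .any { dir -> File(dir, binary).canExecute() }

fun main() {
    val backend = pickBackend(System.getProperty("os.name"), ::isOnPath)
    if (backend == SandboxBackend.UNSANDBOXED) {
        System.err.println("WARNING: no native sandboxing tool found; falling back to plain ProcessBuilder")
    }
    println("Selected backend: $backend")
}
```

Passing the PATH probe in as a function keeps the ordering testable on any machine; a real implementation would also need to confirm that the chosen binary actually works (for example, that user namespaces are enabled for bwrap).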

docs/roadmap.md

Lines changed: 12 additions & 2 deletions
@@ -48,6 +48,10 @@
 - [x] Agent memory — `MemoryBank`, `memory_read`/`memory_write`/`memory_search` auto-injected tools
 - [ ] `.spawn {}` — independent sub-agent lifecycle, `AgentHandle<OUT>`, parent-managed join
 - [ ] `Flow<PipelineEvent>` for reactive UIs + Pipeline-level events (`StageStarted`, `PipelineCompleted`, etc) — depends on streaming, sub-agents, sessions
+- [ ] **Multimodal input** — vision and audio content blocks on LLM messages.
+  - **Image input:** vision-capable adapters accept image bytes + media type as a content block alongside text. Targets: Anthropic (`image` content blocks), OpenAI (`image_url` / base64 in content), Ollama (`llava` / `bakllava` via `images` field), Google Gemini.
+  - **Audio input:** true audio input (Gemini, GPT-4o-audio) — `LlmContent.Audio` block. Optional STT-only helper `audio.transcribe(file)` for the Whisper-style use case.
+  - **Architectural change:** `LlmMessage.content: String` needs to evolve into a `List<LlmContent>` sealed type (Text / Image / Audio blocks). Binary-compat risk: add a sibling `contentBlocks: List<LlmContent>?` field first with the existing String form auto-coerced into a single Text block; deprecate the String form once the API surface settles. Typed boundaries are unaffected — `Agent<Image, String>` (image classifier) and `Agent<AudioClip, String>` (transcriber) become coherent agent shapes.
 - [ ] Serialization — `agent.json`, A2A AgentCard
 - [ ] JAR bundles and folder-based assembly
 - [ ] Gradle plugin
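Reviewer sketch (not part of the commit): the "Architectural change" bullet above describes adding a nullable `contentBlocks` sibling next to the existing `content: String`, with the String form coerced into a single Text block. One possible shape is shown below. The `role` field, the `mediaType` parameters, and the `effectiveBlocks` property are assumptions made for illustration; only `LlmMessage`, `LlmContent`, `content`, `contentBlocks`, and the Text / Image / Audio block names come from the roadmap text.

```kotlin
// Sketch only: field layouts are assumed, not taken from the framework.
sealed interface LlmContent {
    data class Text(val text: String) : LlmContent
    class Image(val bytes: ByteArray, val mediaType: String) : LlmContent
    class Audio(val bytes: ByteArray, val mediaType: String) : LlmContent
}

data class LlmMessage(
    val role: String,                              // assumed field
    val content: String = "",                      // existing String form, kept for compatibility
    val contentBlocks: List<LlmContent>? = null,   // new sibling field from the roadmap
) {
    // Hypothetical helper: explicit blocks win; otherwise the String
    // form is coerced into a single Text block.
    val effectiveBlocks: List<LlmContent>
        get() = contentBlocks ?: listOf(LlmContent.Text(content))
}

fun main() {
    val legacy = LlmMessage(role = "user", content = "Describe this diagram")
    val multimodal = LlmMessage(
        role = "user",
        contentBlocks = listOf(
            LlmContent.Text("Describe this diagram"),
            LlmContent.Image(byteArrayOf(1, 2, 3), "image/png"),
        ),
    )
    println(legacy.effectiveBlocks.size)      // 1
    println(multimodal.effectiveBlocks.size)  // 2
}
```

Callers that only set `content` keep compiling and adapters can read one uniform list, which is the binary-compatibility path the bullet argues for before the String form is deprecated.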
@@ -60,8 +64,14 @@
 - [ ] File-based knowledge: `skill.md`, `reference`, `examples`, `checklist` + RAG pipeline
 - [ ] Production observability: OpenTelemetry traces
 - [ ] Team DSL — swarm coordination (if isolated execution available)
-- [ ] **Sandboxed tool execution** — `SandboxedExecutor` interface with three backends, opt-in per tool (`tool(..., sandbox = ...)`) or per skill (`sandbox { }` block). Default executor stays in-process for backward compatibility.
-  - `ProcessSandbox` — subprocess executor with env / cwd / timeout / network constraints. Backends: `sandbox-exec` on macOS (built into the OS), `bwrap` or `firejail` on Linux. Falls back to plain `ProcessBuilder` with a loud warning on platforms with no native sandboxing tool. **Most pragmatic** — every dev box has at least one path.
+- [ ] **Generative outputs (image + audio)** — sibling client interfaces to `ModelClient` for non-chat model families.
+  - `ImageModelClient.generate(prompt, options): ImageBytes` — text → image. Adapters: OpenAI DALL-E 3, Google Imagen, Stability. Optional streaming via `generateStream(...): Flow<LlmChunk.ImageDelta>` for partial-preview UX.
+  - `TTSModelClient.synthesize(text, voice, options): AudioBytes` — text → speech. Adapters: OpenAI TTS, ElevenLabs, Google Cloud TTS. Streaming via `LlmChunk.AudioDelta(pcmChunk)` for low-latency playback (relevant for IDE voice agents, chat UIs).
+  - These keep the typed-boundary identity: `Agent<String, ImageBytes>` and `Agent<TextRequest, AudioBytes>` are first-class. Composition operators (`then`, `wrap`) work unchanged across modalities.
+- [ ] **Sandboxed tool execution** — `SandboxedExecutor` interface with three backends, opt-in per tool (`tool(..., sandbox = ...)`) or per skill (`sandbox { }` block). Default executor stays in-process for backward compatibility. **Scope (lesson from Claude Code's implementation):** sandbox only applies to subprocess-shaped tools — tools whose executor shells out via `ProcessBuilder` or invokes external binaries. In-process Kotlin lambdas don't get OS-level isolation because `grants { }` + frozen agents already bound them; bolting on a sandbox is overkill that just makes the framework feel heavier.
+  - `ProcessSandbox` — subprocess executor with env / cwd / timeout / network constraints. Backends: **Seatbelt** on macOS (the framework behind `sandbox-exec`; built into the OS), `bwrap` (bubblewrap) on Linux as the primary, `firejail` as the fallback. On WSL2 same as Linux; WSL1 unsupported (no namespace support). Plain `ProcessBuilder` with a loud warning on platforms with no native sandboxing tool. **Most pragmatic** — every dev box has at least one path. Cribs profile shape + socat-proxy plumbing from [`anthropic-experimental/sandbox-runtime`](https://github.com/anthropic-experimental/sandbox-runtime) (Anthropic's open-source Linux bwrap reference).
+  - **Network sub-policy:** outbound blocked by default; allowlist via `sandbox.network.allowedDomains`. A proxy server (running outside the sandbox) intercepts DNS + connections and gates by hostname. **TLS caveat:** the default proxy doesn't terminate TLS — it allows/denies by hostname only. Allowing broad domains (`github.com`, `googleapis.com`) leaves room for domain-fronting; consumers needing real traffic inspection plug in their own MITM proxy. Document this explicitly so it's not a surprise.
+  - **Permission/sandbox interaction:** sandbox path config and `grants { }` path config *merge* — both layers apply (matches Claude Code semantics). Sandbox cannot accidentally widen what `grants` denies. A tool with both must satisfy both.
   - `WasmSandbox` — JAR-embedded WASM runtime via Chicory (pure-Java; no host setup). Tools compiled to WASM; filesystem and network capabilities granted explicitly at registration. **Most truly embedded** — works anywhere a JVM runs.
   - `DockerSandbox` — opt-in extras module (`agents-kt-docker-sandbox`) via `docker-java`. Talks to whatever Docker daemon the host already runs. **Not embeddable** — library ships in the JAR, daemon does not. For teams that already operate Docker.
   - Why this axis matters: today `grants { tools(writeFile, compile) }` controls *which* tools an agent can call; sandboxing controls *what those tools can do* once invoked. Pairs with frozen agents + typed args to give a security model that's strictly stronger than "trust the executor lambda."
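Reviewer sketch (not part of the commit): two of the sandbox bullets above are easy to misread, so here is a minimal illustration of each. The first function shows hostname-only gating against an allowlist, treating subdomains of an allowed domain as allowed (an assumption; the roadmap does not say how matching works). The second shows the "both layers apply" rule as an intersection of `grants` roots and sandbox roots. `NetworkPolicy` and `pathAllowed` are hypothetical names, not framework API.

```kotlin
import java.nio.file.Paths

// Hypothetical type: hostname-only gate, no TLS inspection.
class NetworkPolicy(allowedDomains: Set<String>) {
    private val allowed = allowedDomains.map { it.lowercase().trimEnd('.') }.toSet()

    // Allows an exact match or a subdomain of an allowlisted domain.
    // Everything else is denied, matching "outbound blocked by default".
    fun allows(host: String): Boolean {
        val h = host.lowercase().trimEnd('.')
        return allowed.any { h == it || h.endsWith(".$it") }
    }
}

// A path is usable only if it sits under a grants root AND under a sandbox root,
// so the sandbox config can never widen what grants denies.
fun pathAllowed(path: String, grantRoots: List<String>, sandboxRoots: List<String>): Boolean {
    val p = Paths.get(path).normalize()
    fun within(roots: List<String>) = roots.any { p.startsWith(Paths.get(it).normalize()) }
    return within(grantRoots) && within(sandboxRoots)
}

fun main() {
    val policy = NetworkPolicy(setOf("github.com"))
    println(policy.allows("api.github.com"))   // true
    println(policy.allows("evilgithub.com"))   // false
    println(pathAllowed("/work/src/Main.kt", listOf("/work"), listOf("/work/src")))    // true
    println(pathAllowed("/work/build/out.jar", listOf("/work"), listOf("/work/src")))  // false
}
```

The hostname check also makes the TLS caveat concrete: once `github.com` is allowed, every host under it passes, and the gate never sees what is sent inside the encrypted connection.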

0 commit comments
