docs(#1726): refine sandbox roadmap + add multimodal I/O

Skobeltsyn · claude · Skobeltsyn · commit 1ea5a6302bca · 2026-05-15T18:31:24.000+03:00
Sandboxing refinements (informed by Claude Code's actual impl):
- Scope ProcessSandbox to subprocess-shaped tools only (matches
  Claude Code's Bash-only scope; in-process lambdas covered by
  grants { } + frozen agents already).
- Name Seatbelt explicitly for macOS, bwrap primary for Linux,
  firejail as fallback. WSL1 unsupported by design.
- Add network-proxy sub-policy with TLS-inspection caveat
  (hostname-only gating; domain-fronting risk if allowlist too
  broad). Reference anthropic-experimental/sandbox-runtime as
  the canonical Linux/bwrap reference.
- Document permission/sandbox path merge — both layers apply.

Multimodal additions:
- Phase 2: image input (Anthropic / OpenAI / Ollama / Gemini)
  and audio input (Gemini, GPT-4o-audio) as LlmContent sealed
  blocks. Binary-compat path: add contentBlocks sibling field
  first, deprecate the String content later.
- Phase 3: generative outputs — ImageModelClient (DALL-E,
  Imagen, Stability) and TTSModelClient (OpenAI TTS,
  ElevenLabs, Google). Streaming via LlmChunk.ImageDelta /
  AudioDelta for partial-preview and low-latency playback.
- README Limitations gains an honest "text-only I/O today"
  entry pointing at the phased plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/README.md b/README.md
@@ -164,6 +164,7 @@ What the framework does **not** enforce — your responsibility:
 - **No A2A protocol yet** — agent-to-agent over network (Phase 2 / 3).
 - **Inline-tool-call fallback model variance** — small Ollama models (e.g. `gemma3:4b`) reliably emit single tool calls via the inline format but may produce thin final-turn text after multi-step tool sequences. For multi-step reasoning, a tool-native model (`gpt-oss:20b-cloud` and similar) is the better fit.
 - **No tool sandboxing** — tool executors run in-process with full JVM privileges. `grants { }` controls *which* tools an agent can call, not *what they can do* once invoked. Sandboxed execution (`ProcessSandbox` / `WasmSandbox` / `DockerSandbox` opt-in backends) is on the Phase 3 roadmap.
+- **Text-only I/O today** — `LlmMessage.content: String` carries text. Image input (vision-capable adapters: Anthropic, OpenAI, Ollama, Gemini) and audio input land in Phase 2 alongside an `LlmContent` sealed-block evolution of the message model. Image generation (`ImageModelClient`: DALL-E, Imagen, Stability) and text-to-speech (`TTSModelClient`: OpenAI TTS, ElevenLabs, Google) are Phase 3.
 
 For planned features beyond these limitations, see [docs/roadmap.md](docs/roadmap.md).
 
@@ -221,9 +222,9 @@ Testing details — task names, integration test setup, mutation testing, how to
 
 **Phase 1 — Core DSL** *(in progress)*: typed agents, skills, knowledge, composition operators (`then`, `/`, `*`, `forum`, `.loop`, `.branch`), MCP client + server, agent memory, `loadResource(path)` for prompts from classpath, agentic loop with full budget controls (`maxTurns` / `maxToolCalls` / `maxDuration` / `perToolTimeout` / `maxTokens` / `maxConsecutiveSameTool`), observability hooks (`onSkillChosen`, `onToolUse`, `onKnowledgeUsed`, `onError`, `onBudgetThreshold`, `Agent.observe { }`).
 
-**Phase 2 — Runtime + Distribution** *(Q2 2026)*: remaining provider (Google), `Flow<...>` streaming on every adapter, KSP compile-time `@Generable`, native CLI / jlink, `Tool<IN, OUT>` hierarchy, `grants {}` permissions, session model, Flow-based observability, `agent.json` serialization, Gradle plugin. *(Anthropic and OpenAI adapters already landed in #1644 and #1656.)*
+**Phase 2 — Runtime + Distribution** *(Q2 2026)*: remaining provider (Google), `Flow<...>` streaming on every adapter, KSP compile-time `@Generable`, native CLI / jlink, `Tool<IN, OUT>` hierarchy, `grants {}` permissions, session model, Flow-based observability, **multimodal input** (image + audio content blocks; vision-capable adapters for Anthropic/OpenAI/Ollama/Gemini), `agent.json` serialization, Gradle plugin. *(Anthropic and OpenAI adapters already landed in #1644 and #1656.)*
 
-**Phase 3 — Production** *(Q3 2026)*: Layer 2 Structure DSL, all 37 compile-time validations, AgentUnit, A2A protocol, file-based knowledge with RAG, OpenTelemetry, **sandboxed tool execution** (`SandboxedExecutor` with `ProcessSandbox` / `WasmSandbox` / `DockerSandbox` backends — opt-in per tool, default executor stays in-process).
+**Phase 3 — Production** *(Q3 2026)*: Layer 2 Structure DSL, all 37 compile-time validations, AgentUnit, A2A protocol, file-based knowledge with RAG, OpenTelemetry, **sandboxed tool execution** (`SandboxedExecutor` with `ProcessSandbox` (Seatbelt / bwrap), `WasmSandbox` (Chicory), `DockerSandbox` backends — opt-in per tool, subprocess-shaped tools only, default executor stays in-process), **generative outputs** (`ImageModelClient` for DALL-E / Imagen / Stability, `TTSModelClient` for OpenAI / ElevenLabs / Google).
 
 **Phase 4 — Ecosystem** *(Q4 2026)*: knowledge packs, NL → DSL generation, Skillify, visual editor, knowledge marketplace.
 
diff --git a/docs/roadmap.md b/docs/roadmap.md
@@ -48,6 +48,10 @@
 - [x] Agent memory — `MemoryBank`, `memory_read`/`memory_write`/`memory_search` auto-injected tools
 - [ ] `.spawn {}` — independent sub-agent lifecycle, `AgentHandle<OUT>`, parent-managed join
 - [ ] `Flow<PipelineEvent>` for reactive UIs + Pipeline-level events (`StageStarted`, `PipelineCompleted`, etc) — depends on streaming, sub-agents, sessions
+- [ ] **Multimodal input** — vision and audio content blocks on LLM messages.
+  - **Image input:** vision-capable adapters accept image bytes + media type as a content block alongside text. Targets: Anthropic (`image` content blocks), OpenAI (`image_url` / base64 in content), Ollama (`llava` / `bakllava` via `images` field), Google Gemini.
+  - **Audio input:** native audio input (Gemini, GPT-4o-audio) as an `LlmContent.Audio` block. Optional STT-only helper `audio.transcribe(file)` for the Whisper-style use case.
+  - **Architectural change:** `LlmMessage.content: String` needs to evolve into a `List<LlmContent>`, where `LlmContent` is a sealed type (Text / Image / Audio blocks). Binary-compat path: add a sibling `contentBlocks: List<LlmContent>?` field first, with the existing String form auto-coerced into a single Text block; deprecate the String form once the API surface settles. Typed boundaries are unaffected: `Agent<Image, String>` (image classifier) and `Agent<AudioClip, String>` (transcriber) become coherent agent shapes.
 - [ ] Serialization — `agent.json`, A2A AgentCard
 - [ ] JAR bundles and folder-based assembly
 - [ ] Gradle plugin
@@ -60,8 +64,14 @@
 - [ ] File-based knowledge: `skill.md`, `reference`, `examples`, `checklist` + RAG pipeline
 - [ ] Production observability: OpenTelemetry traces
 - [ ] Team DSL — swarm coordination (if isolated execution available)
-- [ ] **Sandboxed tool execution** — `SandboxedExecutor` interface with three backends, opt-in per tool (`tool(..., sandbox = ...)`) or per skill (`sandbox { }` block). Default executor stays in-process for backward compatibility.
-  - `ProcessSandbox` — subprocess executor with env / cwd / timeout / network constraints. Backends: `sandbox-exec` on macOS (built into the OS), `bwrap` or `firejail` on Linux. Falls back to plain `ProcessBuilder` with a loud warning on platforms with no native sandboxing tool. **Most pragmatic** — every dev box has at least one path.
+- [ ] **Generative outputs (image + audio)** — sibling client interfaces to `ModelClient` for non-chat model families.
+  - `ImageModelClient.generate(prompt, options): ImageBytes` — text → image. Adapters: OpenAI DALL-E 3, Google Imagen, Stability. Optional streaming via `generateStream(...): Flow<LlmChunk.ImageDelta>` for partial-preview UX.
+  - `TTSModelClient.synthesize(text, voice, options): AudioBytes` — text → speech. Adapters: OpenAI TTS, ElevenLabs, Google Cloud TTS. Streaming via `LlmChunk.AudioDelta(pcmChunk)` for low-latency playback (relevant for IDE voice agents, chat UIs).
+  - These keep the typed-boundary identity: `Agent<String, ImageBytes>` and `Agent<TextRequest, AudioBytes>` are first-class. Composition operators (`then`, `wrap`) work unchanged across modalities.
+- [ ] **Sandboxed tool execution** — `SandboxedExecutor` interface with three backends, opt-in per tool (`tool(..., sandbox = ...)`) or per skill (`sandbox { }` block). Default executor stays in-process for backward compatibility. **Scope (lesson from Claude Code's implementation):** the sandbox applies only to subprocess-shaped tools, meaning tools whose executor shells out via `ProcessBuilder` or invokes external binaries. In-process Kotlin lambdas don't get OS-level isolation because `grants { }` + frozen agents already bound them; a sandbox on top would be overkill and would only make the framework feel heavier.
+  - `ProcessSandbox` — subprocess executor with env / cwd / timeout / network constraints. Backends: **Seatbelt** on macOS (the framework behind `sandbox-exec`; built into the OS), `bwrap` (bubblewrap) on Linux as the primary, `firejail` as the fallback. WSL2 behaves the same as Linux; WSL1 is unsupported (no namespace support). Falls back to plain `ProcessBuilder` with a loud warning on platforms with no native sandboxing tool. **Most pragmatic** — every dev box has at least one path. Borrows the profile shape and socat-proxy plumbing from [`anthropic-experimental/sandbox-runtime`](https://github.com/anthropic-experimental/sandbox-runtime) (Anthropic's open-source Linux bwrap reference).
+  - **Network sub-policy:** outbound blocked by default; allowlist via `sandbox.network.allowedDomains`. A proxy server (running outside the sandbox) intercepts DNS + connections and gates by hostname. **TLS caveat:** the default proxy doesn't terminate TLS — it allows/denies by hostname only. Allowing broad domains (`github.com`, `googleapis.com`) leaves room for domain-fronting; consumers needing real traffic inspection plug in their own MITM proxy. Document this explicitly so it's not a surprise.
+  - **Permission/sandbox interaction:** sandbox path config and `grants { }` path config *merge*: both layers apply (matches Claude Code semantics), so a path is usable only if both allow it. The sandbox cannot accidentally widen what `grants` denies; a tool configured with both must satisfy both.
   - `WasmSandbox` — JAR-embedded WASM runtime via Chicory (pure-Java; no host setup). Tools compiled to WASM; filesystem and network capabilities granted explicitly at registration. **Most truly embedded** — works anywhere a JVM runs.
   - `DockerSandbox` — opt-in extras module (`agents-kt-docker-sandbox`) via `docker-java`. Talks to whatever Docker daemon the host already runs. **Not embeddable** — library ships in the JAR, daemon does not. For teams that already operate Docker.
   - Why this axis matters: today `grants { tools(writeFile, compile) }` controls *which* tools an agent can call; sandboxing controls *what those tools can do* once invoked. Pairs with frozen agents + typed args to give a security model that's strictly stronger than "trust the executor lambda."