Skip to content

Commit da438c7

Browse files
Skobeltsynclaude
andcommitted
docs(#1744): docs/streaming.md — consumer guide for the session API
In-repo reference for the v0.5.0 streaming work. Walks through: - agent.session(input) usage with the canonical when-event block - the AgentEvent hierarchy with field-by-field fire conditions - provider streaming status table with live-measured chunk counts (Ollama 19/84ms, Anthropic 2/27ms, OpenAI 19/202ms — taken from the integration tests) - TokenUsage in events (cumulative per-skill; null for implementedBy) - Cancellation contract — what works today and what's deferred to step 4 (HTTP sendAsync migration) - Test coverage map — 22 test methods across 12 files, what each pins - Composition note — leaf-agent sessions only until Pipeline/Branch/ wrap/Swarm flow events through README's Limitations entry rewritten to point at the new doc and reflect the actually-shipped state. docs/testing.md gains a small "Testing streaming agents" subsection linking back. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 5e9b420 commit da438c7

3 files changed

Lines changed: 143 additions & 1 deletion

File tree

README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -160,7 +160,7 @@ What the framework does **not** enforce — your responsibility:
160160
- **No incoming auth on `McpServer`** — outgoing client supports Bearer; the server does not validate credentials. Suitable for trusted-network deployments only.
161161
- **No Origin header validation on MCP HTTP** — deferred until the MCP-server hardening pass.
162162
- **No per-adapter native streaming yet**`LlmChunk` sealed type + `ModelClient.chatStream(messages): Flow<LlmChunk>` default impl landed in v0.4.6 (#1722), so `chatStream` is callable on every `ModelClient`. The default wraps `chat()` and emits one `TextDelta` + `End` (or `ToolCallStarted` / `ArgumentsDelta` / `Finished` / `End` for tool turns), so non-streaming consumers see ordered chunks but no real-time partial output. Native streaming overrides (Anthropic SSE, OpenAI SSE, Ollama `stream: true`) are next on the Phase 2 list — see [docs/premortem-0.5.0-streaming.md](docs/premortem-0.5.0-streaming.md).
163-
- **Streaming session surface ships now; token-level streaming inside the agentic loop is in progress.** `agent.session(input): AgentSession<OUT>` exposes `events: Flow<AgentEvent<OUT>>` with `SkillStarted` / `SkillCompleted` / `Completed<OUT>` / `Failed` bracket events (#1736). The `Token` / `ToolCallStarted` / `ToolCallArgumentsDelta` / `ToolCallFinished` event types are defined but not yet emitted — `executeAgentic` still calls `chat()` synchronously. Step 3 rewires the agentic loop onto a `FlowCollector<AgentEvent>` so mid-loop events fire as the LLM streams. Cancellation is partial today (Flow collection cancels promptly; `implementedBy` skill bodies have no suspend points and may run to completion in the background — step 3 closes that gap via the native streaming HTTP path).
163+
- **Streaming session surface ships, all three adapters stream natively.** `agent.session(input): AgentSession<OUT>` exposes `events: Flow<AgentEvent<OUT>>` — bracket events (`SkillStarted` / `SkillCompleted` / `Completed<OUT>` / `Failed`) plus mid-loop `Token` / `ToolCallStarted` / `ToolCallArgumentsDelta` / `ToolCallFinished` events as the agentic loop runs. Ollama (NDJSON), Anthropic (SSE), and OpenAI (SSE) all stream at the wire level; live integration tests measure 19 / 2 / 19 chunks per response respectively. `SkillCompleted.tokensUsed` and `Completed.tokensUsed` carry cumulative `TokenUsage` across all turns. See [docs/streaming.md](docs/streaming.md) for the full API + the [v0.5.0 streaming premortem](docs/premortem-0.5.0-streaming.md) for design rationale. Cancellation is partial today (Flow collection cancels promptly; synchronous skill bodies and blocking HTTP reads aren't coroutine-cancellable mid-call — step 4 migrates adapters to `sendAsync`). Composition operators (`Pipeline` / `Branch` / `wrap` / `Swarm`) don't yet flow events through; leaf-agent sessions only.
164164
- **No native binary** — JVM-only (≥ JDK 21). GraalVM and `jlink` bundles are Phase 2 priorities.
165165
- **No A2A protocol yet** — agent-to-agent over network (Phase 2 / 3).
166166
- **Inline-tool-call fallback model variance** — small Ollama models (e.g. `gemma3:4b`) reliably emit single tool calls via the inline format but may produce thin final-turn text after multi-step tool sequences. For multi-step reasoning, a tool-native model (`gpt-oss:20b-cloud` and similar) is the better fit.

docs/streaming.md

Lines changed: 138 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,138 @@
1+
# Streaming agents
2+
3+
How to consume agent execution as a typed event stream. Pairs with the [v0.5.0 streaming premortem](premortem-0.5.0-streaming.md) for the design rationale.
4+
5+
## Quick start
6+
7+
```kotlin
8+
import agents_engine.runtime.events.AgentEvent
9+
import agents_engine.runtime.events.session
10+
11+
val session = myAgent.session(input)
12+
13+
session.events.collect { event ->
14+
when (event) {
15+
is AgentEvent.SkillStarted -> log("${event.skillName}")
16+
is AgentEvent.Token -> render(event.text) // mid-loop
17+
is AgentEvent.ToolCallStarted -> log("tool: ${event.toolName} (${event.callId})")
18+
is AgentEvent.ToolCallArgumentsDelta -> previewArgs(event.callId, event.deltaJson)
19+
is AgentEvent.ToolCallFinished -> if (event.isError) err(event) else showResult(event)
20+
is AgentEvent.SkillCompleted -> log("${event.skillName} (${event.tokensUsed?.total ?: '?'} tokens)")
21+
is AgentEvent.Completed -> done(event.output, event.tokensUsed)
22+
is AgentEvent.Failed -> err(event.cause)
23+
}
24+
}
25+
26+
// OR: skip the events, just wait for the typed output.
27+
val output: OUT = myAgent.session(input).await()
28+
```
29+
30+
Each `agent.session(input)` call starts a fresh invocation. `events` is a cold `Flow<AgentEvent<OUT>>` — collecting it twice would run the agent twice. Use `events.shareIn(...)` if you need multiple collectors.
31+
32+
## The AgentEvent hierarchy
33+
34+
All subtypes carry an `agentId: String` field — the name of the agent that produced the event. (Composition operators don't yet flow events through; see the [composition note](#composition) below.) Only `Completed` is parameterized on the agent's `OUT`; everything else is `AgentEvent<Nothing>` so events flow through any `AgentSession<OUT>`.
35+
36+
| Event | Fires when | Carries |
37+
|---|---|---|
38+
| `SkillStarted` | Before the resolved skill executes | `skillName` |
39+
| `Token` | LLM streams a content chunk | `skillName`, `text` |
40+
| `ToolCallStarted` | Streaming adapter sees a new tool call | `skillName`, `callId`, `toolName` |
41+
| `ToolCallArgumentsDelta` | Each fragment of streamed tool-call args | `callId`, `deltaJson` |
42+
| `ToolCallFinished` | After the agentic loop runs the executor | `callId`, `toolName`, `arguments`, `result`, `isError` |
43+
| `SkillCompleted` | Skill body has returned | `skillName`, `tokensUsed` (cumulative across all LLM turns of this skill; null for `implementedBy`) |
44+
| `Completed<OUT>` | Terminal success — emitted exactly once | `output`, `tokensUsed` |
45+
| `Failed` | Terminal failure — emitted exactly once before the exception propagates | `cause` |
46+
47+
**`implementedBy` skills:** only `SkillStarted``SkillCompleted``Completed`. No `Token` or `ToolCall*` (no LLM round-trip). `tokensUsed` is always null.
48+
49+
**Agentic skills (LLM-driven):** the full set fires. `Token` events arrive incrementally as the model streams (proof in `AgentSessionIncrementalArrivalTest`).
50+
51+
**`Completed` and `Failed` are mutually exclusive.** A session emits exactly one of them as its terminal event.
52+
53+
## Provider streaming status
54+
55+
All three first-party adapters override `ModelClient.chatStream` with native wire-level streaming. Numbers below are from the live integration tests under `./gradlew integrationTest` against real APIs.
56+
57+
| Provider | Protocol | File | Live measurement (count 1–10 prompt) |
58+
|---|---|---|---|
59+
| Ollama | NDJSON | `OllamaClient.chatStream` | 19 chunks / 84ms gap (gpt-oss:120b-cloud) |
60+
| Anthropic | SSE with named events + indexed content blocks | `ClaudeClient.chatStream` | 2 chunks / 27ms gap (claude-haiku-4-5) |
61+
| OpenAI | SSE with `[DONE]` terminator | `OpenAiClient.chatStream` | 19 chunks / 202ms gap (gpt-4o-mini) |
62+
63+
Custom `ModelClient` implementations don't need to override `chatStream` — the default impl wraps `chat()` and emits one bundled chunk sequence. That's fine for non-streaming providers; it just won't show incremental arrival.
64+
65+
### Anthropic-specific: interleaved content blocks
66+
67+
Anthropic's SSE can interleave chunks across content blocks (text vs tool_use) — both have an `index` and chunks for different indices arrive mixed. `ClaudeClient.chatStream` tracks blocks in a `Map<Int, BlockState>` and routes each delta to the right block's id/builder. This is what `ToolCall.callId` was designed for; the test `ClaudeClientChatStreamTest > interleaved text and tool_use blocks emit correctly keyed by callId` pins it.
68+
69+
### OpenAI-specific: usage opt-in
70+
71+
Token usage on streamed responses requires `stream_options.include_usage: true` in the request. `OpenAiClient.buildRequestJson(stream = true)` sets it automatically; OpenAI then sends a final usage-only delta before `[DONE]`.
72+
73+
## TokenUsage in events
74+
75+
`SkillCompleted.tokensUsed` and `Completed.tokensUsed` carry a cumulative `TokenUsage` summed across every LLM turn of the skill — `promptTokens` and `completionTokens` summed independently. For a single-turn run, this equals that turn's usage; for a multi-turn loop, it's the total billed for the skill.
76+
77+
```kotlin
78+
val session = agent.session(input)
79+
val output = session.await()
80+
session.events.toList().filterIsInstance<AgentEvent.Completed<*>>().single().tokensUsed
81+
// → TokenUsage(promptTokens=147, completionTokens=63), or null if the provider didn't report
82+
```
83+
84+
`implementedBy` skills: always null (no LLM).
85+
86+
## Cancellation
87+
88+
Today's contract: **cancelling the coroutine collecting `events` terminates the Flow promptly.** The session's outer scope is a `SupervisorJob` with `Dispatchers.Unconfined`; cancellation propagates through the channel-backed Flow.
89+
90+
**Step-3 gap (deferred to step 4):** the agent invocation itself may run to completion in the background when cancelled, because:
91+
92+
- `implementedBy` lambdas are `(IN) -> OUT` — pure synchronous code with no suspension points. Coroutine cancellation can only fire at suspension points, so `Thread.sleep` or a tight loop won't be interrupted.
93+
- Native streaming adapters use `HttpClient.send(BodyHandlers.ofInputStream())` which blocks inside `BufferedReader.readLine()`. Same issue — coroutine cancel doesn't interrupt the blocking read.
94+
95+
`AgentSessionCancellationTest` documents the current contract: collector cancellation returns under 500ms even while a 2-second `Thread.sleep` is still running in the skill body. Step 4 migrates the adapters to `sendAsync` so HTTP cancellation propagates properly; that's also when synchronous-skill cancellation will be addressed (likely via explicit `session.cancel()`).
96+
97+
## Test coverage map
98+
99+
For contributors navigating the streaming test surface:
100+
101+
### Session API
102+
103+
| File | Pins |
104+
|---|---|
105+
| `AgentSessionBasicEventsTest` | implementedBy happy path — three ordered bracket events |
106+
| `AgentSessionIntegrationTest` | failure path (identity-preserved cause), concurrent sessions, agentic-stub bracketing with Token, tool-call event sequence with shared callId, tokensUsed single-turn, tokensUsed cumulative across two turns |
107+
| `AgentSessionLiveTest` | live π to 20 decimals against Ollama — `full20=true` end-to-end |
108+
| `AgentSessionCancellationTest` | collector cancel returns under 500ms even with a 2-second sleeping skill |
109+
| `AgentSessionIncrementalArrivalTest` | timing proof — first Token ≥100ms before Completed under a delayed-chunk stub |
110+
| `ModelClientChatStreamDefaultTest` | default `chatStream` wrap of non-streaming `chat()` — Text and ToolCalls cases |
111+
112+
### Adapter streaming (provider-level chunk parsing)
113+
114+
| File | Pins |
115+
|---|---|
116+
| `OllamaClientChatStreamTest` | NDJSON: TextDelta sequence + End with usage; tool-call triple; empty-content skip |
117+
| `OllamaClientChatStreamLiveTest` | live Ollama — multiple chunks with measurable timing gap |
118+
| `ClaudeClientChatStreamTest` | SSE text-only; tool_use with `input_json_delta` accumulation; interleaved text + tool_use blocks correctly keyed by callId |
119+
| `ClaudeClientChatStreamLiveTest` | live Anthropic — multiple chunks with usage |
120+
| `OpenAiClientChatStreamTest` | SSE text-only with usage-only final delta; tool-call with `call_*` id reused across deltas |
121+
| `OpenAiClientChatStreamLiveTest` | live OpenAI — multiple chunks with usage |
122+
123+
22 test methods across 12 files. The non-live tests run under `./gradlew test`; the live ones run under `./gradlew integrationTest` (tagged `live-llm`).
124+
125+
## Composition
126+
127+
`Pipeline` / `Branch` / `wrap` / `Swarm` do **not** yet flow events through to a parent session. Only leaf agents (the agent you directly call `.session(input)` on) expose streaming. A composed pipeline still works end-to-end with `pipeline(input)` returning the typed output, but `pipeline.session(input)` is not yet defined.
128+
129+
This is the next v0.5.0 milestone after step 3. The shape per the premortem: `agentId` on every event already namespaces by source agent — a Pipeline session would flow events from each inner agent with their respective `agentId`s, plus its own bracket events.
130+
131+
## Known gaps (post-step-3, pre-v0.5.0-release)
132+
133+
- **Composition flow-through** (above).
134+
- **HTTP cancellation** mid-read — blocking InputStream isn't coroutine-cancellable.
135+
- **Synchronous skill body cancellation**`implementedBy` lambdas can't be interrupted.
136+
- **Provider-specific limits** — Ollama bundles tool-call args in one final chunk (no progressive `input_json_delta`); only Anthropic streams tool args progressively today.
137+
138+
See [`docs/roadmap.md`](roadmap.md) Phase 2 *Secondary* for the planned closure of each.

docs/testing.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -120,6 +120,10 @@ Two canonical patterns to crib from:
120120
- **Synchronous unit test** — see `src/test/kotlin/agents_engine/model/ModelClientChatStreamDefaultTest.kt`. Inline stub via `ModelClient { _ -> ... }`, asserts a Flow output.
121121
- **Whole-loop test with a fake provider** — see `src/test/kotlin/agents_engine/model/AgenticLoopTest.kt`. Multi-turn stub that returns different responses per call to exercise tool-call → result → final-text sequences.
122122

123+
### Testing streaming agents
124+
125+
Sessions (`agent.session(input)`) and the adapter-level `chatStream` overrides have their own test pattern — inline NDJSON or SSE payloads for non-live tests, optional live-LLM coverage for end-to-end. The full taxonomy of streaming tests with what each pins is in [docs/streaming.md → Test coverage map](streaming.md#test-coverage-map).
126+
123127
### Reflection-fallback paths
124128

125129
If you change anything in `ReflectionFallback` or any wrapped `kotlin.reflect.full.*` callsite, **also** add or update assertions in `agents-kt-no-reflect-test/src/test/kotlin/smoke/`. The main suite has `kotlin-reflect` on its `testImplementation` — it cannot catch a regression where the reflect-absent branch breaks. The smoke subproject is the only place that can.

0 commit comments

Comments
 (0)