Commit 81e4188

add(docs): docc article and extensions for checkpoint and resume
1 parent 557915f commit 81e4188

16 files changed

Lines changed: 399 additions & 14 deletions

Sources/AgentRunKit/Documentation.docc/AgentRunKit.md

Lines changed: 19 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -38,7 +38,7 @@ if let content = result.content {
3838
}
3939
```
4040

41-
If a run ends because `maxIterations` or `tokenBudget` is reached before the model calls `finish`, ``Agent/run(userMessage:history:context:tokenBudget:requestContext:approvalHandler:)`` still returns an ``AgentResult`` with a structural ``FinishReason`` and `content == nil`.
41+
If a run ends because `maxIterations` or `tokenBudget` is reached before the model calls `finish`, ``Agent/run(userMessage:history:context:tokenBudget:requestContext:approvalHandler:)-(String,_,_,_,_,_)`` still returns an ``AgentResult`` with a structural ``FinishReason`` and `content == nil`.
4242

4343
For a complete walkthrough, see <doc:GettingStarted>.
4444

@@ -54,9 +54,27 @@ For a complete walkthrough, see <doc:GettingStarted>.
5454

5555
- <doc:StreamingAndSwiftUI>
5656
- ``StreamEvent``
57+
- ``EventOrigin``
5758
- ``AgentStream``
59+
- ``StreamEventBuffer``
60+
- ``BufferReplayError``
5861
- ``ToolCallInfo``
5962

63+
### Checkpoint and Resume
64+
65+
- <doc:CheckpointAndResume>
66+
- ``AgentCheckpoint``
67+
- ``AgentCheckpointer``
68+
- ``InMemoryCheckpointer``
69+
- ``FileCheckpointer``
70+
- ``MCPToolBinding``
71+
- ``ContextBudgetCheckpointState``
72+
- ``AgentCheckpointError``
73+
- ``CheckpointID``
74+
- ``SessionID``
75+
- ``RunID``
76+
- ``EventID``
77+
6078
### Tool Approval
6179

6280
- <doc:ToolApproval>

Sources/AgentRunKit/Documentation.docc/Articles/AgentAndChat.md

Lines changed: 3 additions & 0 deletions
@@ -23,6 +23,8 @@ if let content = result.content {
2323

2424
``Agent`` also exposes `stream()`, which returns an `AsyncThrowingStream<StreamEvent, Error>` for real-time token delivery and tool progress. See <doc:StreamingAndSwiftUI>.
2525

26+
For long-running sessions, `stream()` accepts `sessionID:` and `checkpointer:` parameters that persist iteration state to a backend implementing ``AgentCheckpointer``. ``Agent/resume(from:checkpointer:context:tokenBudget:requestContext:approvalHandler:)`` reconstructs a stopped run from any saved ``CheckpointID``. See <doc:CheckpointAndResume>.
27+
2628
Key behaviors:
2729
- Injects a `finish` tool automatically. The model must call it to end the loop.
2830
- Alternate termination for on-device clients: when the LLM client cannot surface tool calls in its response (e.g., `FoundationModelsClient`), the loop terminates on the first iteration that produces content without tool calls. The user-visible contract is unchanged.
@@ -135,3 +137,4 @@ let (_, history2) = try await chat.send("Tell me more.", history: history)
135137
- <doc:DefiningTools>
136138
- <doc:StreamingAndSwiftUI>
137139
- <doc:ContextManagement>
140+
- <doc:CheckpointAndResume>
Lines changed: 179 additions & 0 deletions
@@ -0,0 +1,179 @@
1+
# Checkpoint and Resume
2+
3+
Persist iteration state mid-run, then resume from any saved checkpoint into a fresh streaming continuation.
4+
5+
## Overview
6+
7+
A checkpointer captures the agent's full loop state at the end of every iteration: messages, accumulated token usage, per-iteration usage, context budget phase, session and run identity, the local-rewrite flag, the session approval allowlist, and any participating MCP tool bindings. ``Agent/resume(from:checkpointer:context:tokenBudget:requestContext:approvalHandler:)`` loads a saved checkpoint, replays its history into the consuming stream as one synthetic event, then continues live from the next iteration.
8+
9+
This unblocks long-running sessions that need to survive process restarts, UI re-renders, or planned suspension. Checkpoints are written automatically by ``Agent/stream(userMessage:history:context:tokenBudget:requestContext:approvalHandler:sessionID:checkpointer:)-(String,_,_,_,_,_,_,_)`` when both a `sessionID` and a `checkpointer` are passed.
10+
11+
## What a Checkpoint Captures
12+
13+
``AgentCheckpoint`` is a `Codable` snapshot. It is written at the end of each iteration after tools execute and before the next request is built.
14+
15+
| Field | Description |
16+
|---|---|
17+
| `messages` | Full conversation including system prompt, user, assistant, and tool messages |
18+
| `iteration` | One-based iteration number that produced this snapshot |
19+
| `tokenUsage` | Cumulative input/output usage across all iterations to date |
20+
| `iterationUsage` | Token usage for this iteration alone, when the provider reported it |
21+
| `contextBudgetState` | ``ContextBudgetCheckpointState`` capturing config, window size, last budget snapshot, and the soft-advisory armed flag |
22+
| `historyWasRewrittenLocally` | Whether the agent rewrote history (compaction, pruning) before this iteration |
23+
| `sessionAllowlist` | Tool names the user accepted with `.approveAlways` during this session |
24+
| `sessionID` | Logical session that owns the run |
25+
| `runID` | Run that produced this checkpoint |
26+
| `checkpointID` | Stable identity for the snapshot |
27+
| `timestamp` | UTC time the snapshot was taken |
28+
| `mcpToolBindings` | ``MCPToolBinding`` set: which MCP tools participated in this checkpoint's history |
29+
30+
## Backends
31+
32+
``AgentCheckpointer`` is a three-method protocol. Two backends ship with the framework.
33+
34+
| Backend | Use When |
35+
|---|---|
36+
| ``InMemoryCheckpointer`` | The session is bounded by a single process lifetime: previews, tests, transient UI |
37+
| ``FileCheckpointer`` | The session must survive process restart: production apps, server workers, recovery flows |
38+
39+
``FileCheckpointer`` stores one JSON file per checkpoint under `<directory>/checkpoints/<uuid>.json`. ``FileCheckpointer/list(session:)`` skips files it cannot read or decode, so unrelated debris in the directory does not break enumeration. ``FileCheckpointer/load(_:)``, by contrast, throws ``AgentCheckpointError/fileSystem(_:)`` if the requested file is corrupt.
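
The skip-on-failure enumeration can be pictured with the rough sketch below. It does not use the framework's types or its on-disk schema: `SketchCheckpoint` and `listReadableCheckpoints(in:session:)` are hypothetical stand-ins, and sorting by iteration is an assumption made only for the illustration.

```swift
import Foundation

// Hypothetical stand-in for a persisted checkpoint; not the AgentRunKit schema.
struct SketchCheckpoint: Codable, Equatable {
    let sessionID: String
    let iteration: Int
}

// Returns every checkpoint for `session` that can be read and decoded.
// Files that fail to read or decode are skipped rather than reported.
func listReadableCheckpoints(in directory: URL, session: String) -> [SketchCheckpoint] {
    let urls = (try? FileManager.default.contentsOfDirectory(
        at: directory, includingPropertiesForKeys: nil
    )) ?? []
    let decoder = JSONDecoder()
    return urls
        .filter { $0.pathExtension == "json" }
        .compactMap { url -> SketchCheckpoint? in
            guard let data = try? Data(contentsOf: url) else { return nil }
            return try? decoder.decode(SketchCheckpoint.self, from: data)
        }
        .filter { $0.sessionID == session }
        .sorted { $0.iteration < $1.iteration }
}
```

A strict `load` would instead let the read or decode error propagate, which is the asymmetry the paragraph above describes.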
40+
41+
Custom backends conform to ``AgentCheckpointer`` directly; database-backed and remote-storage implementations are out of scope for the built-in backends.
42+
43+
## Enabling Checkpointing on a Stream
44+
45+
Pass `sessionID:` and `checkpointer:` to either `stream()` overload.
46+
47+
```swift
48+
let session = SessionID()
49+
let checkpointer = InMemoryCheckpointer()
50+
51+
let stream = agent.stream(
52+
userMessage: "Plan and execute the migration.",
53+
context: ctx,
54+
sessionID: session,
55+
checkpointer: checkpointer
56+
)
57+
for try await event in stream {
58+
handle(event)
59+
}
60+
61+
let savedIDs = try await checkpointer.list(session: session)
62+
```
63+
64+
If either argument is omitted, no checkpoint is written. The `stream()` overloads continue to default both to `nil`, so existing call sites are unaffected.
65+
66+
## Resuming a Run
67+
68+
``Agent/resume(from:checkpointer:context:tokenBudget:requestContext:approvalHandler:)`` loads the named checkpoint, replays its history as one synthetic ``StreamEvent/Kind/iterationCompleted(usage:iteration:history:)`` event tagged with ``EventOrigin/replayed(from:)``, then continues from `iteration + 1`.
69+
70+
```swift
71+
let stream = try await agent.resume(
72+
from: checkpointID,
73+
checkpointer: checkpointer,
74+
context: ctx
75+
)
76+
for try await event in stream {
77+
if case .replayed(let id) = event.origin {
78+
applySnapshot(id)
79+
continue
80+
}
81+
handle(event)
82+
}
83+
```
84+
85+
The resumed run gets a fresh ``RunID`` under the same ``SessionID``. Callers can distinguish replayed events from the live continuation by inspecting ``StreamEvent/origin``.
86+
87+
### Preflight Termination
88+
89+
If the saved checkpoint already exceeds the new `tokenBudget`, the stream replays and finishes with ``FinishReason/tokenBudgetExceeded(budget:used:)`` without making any LLM call. If `iteration >= maxIterations`, it replays and finishes with ``FinishReason/maxIterationsReached(limit:)``.
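
The two preflight checks can be read as a pure function over the checkpoint's counters. The sketch below is illustrative only: `SketchFinishReason` and `preflight(usedTokens:iteration:tokenBudget:maxIterations:)` are hypothetical names, and the order of the checks and the use of cumulative token usage as the compared value are assumptions.

```swift
// Hypothetical stand-in; the framework's FinishReason is a separate type.
enum SketchFinishReason: Equatable {
    case tokenBudgetExceeded(budget: Int, used: Int)
    case maxIterationsReached(limit: Int)
}

// Returns a terminal reason if the run must stop before any LLM call,
// or nil if the live continuation may proceed.
func preflight(
    usedTokens: Int,
    iteration: Int,
    tokenBudget: Int?,
    maxIterations: Int
) -> SketchFinishReason? {
    // Assumption: the budget is compared against cumulative usage
    // and is checked before the iteration limit.
    if let budget = tokenBudget, usedTokens > budget {
        return .tokenBudgetExceeded(budget: budget, used: usedTokens)
    }
    if iteration >= maxIterations {
        return .maxIterationsReached(limit: maxIterations)
    }
    return nil
}
```

In either terminal case the consumer would still see the replayed history first, since the stream replays before it finishes.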
90+
91+
### Cursor-State Providers
92+
93+
Providers with conversation cursor state (the OpenAI Responses API's `previous_response_id`) cannot reuse a stale cursor after resume because the resumed run is a different run. The first live request after resume forces full history (`.forceFullRequest`) so cursor-state providers reconstruct the conversation from messages rather than from a vanished cursor.
94+
95+
### MCP Binding Validation
96+
97+
Before replay begins, ``Agent/resume(from:checkpointer:context:tokenBudget:requestContext:approvalHandler:)`` checks that every ``MCPToolBinding`` recorded in the checkpoint has a live counterpart on the resuming agent. If any are missing, resume throws ``AgentCheckpointError/mcpBindingMismatch(_:)`` with the missing bindings before any event is yielded. This catches deployment skew where the agent that resumes is configured against fewer or different MCP servers than the agent that saved.
98+
99+
See <doc:MCPIntegration> for how MCP tools are discovered.
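
Conceptually, the validation is a set difference between recorded and live bindings. The following sketch uses hypothetical stand-ins (`SketchBinding`, `SketchBindingMismatch`, `validateBindings(recorded:live:)`) rather than the framework's types, and assumes bindings are matched on server name and tool name.

```swift
// Hypothetical stand-in for MCPToolBinding, keyed by server and tool name.
struct SketchBinding: Hashable {
    let serverName: String
    let toolName: String
}

// Hypothetical error carrying the bindings that could not be matched.
struct SketchBindingMismatch: Error, Equatable {
    let missing: Set<SketchBinding>
}

// Throws if any recorded binding has no live counterpart.
// Extra live bindings are allowed: only missing ones block resume.
func validateBindings(recorded: Set<SketchBinding>, live: Set<SketchBinding>) throws {
    let missing = recorded.subtracting(live)
    guard missing.isEmpty else {
        throw SketchBindingMismatch(missing: missing)
    }
}
```

Running the check before any event is yielded means a caller never observes a partially replayed stream that then fails on a missing tool.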
100+
101+
## AgentStream Resume
102+
103+
``AgentStream/resume(from:checkpointer:context:tokenBudget:requestContext:approvalHandler:)`` is the SwiftUI-side entry point. It cancels any in-flight prior task before any await runs, loads the checkpoint exactly once, then synchronously preloads observable state before yielding control back to the caller.
104+
105+
```swift
106+
@State private var stream = AgentStream(agent: agent, bufferCapacity: 256)
107+
108+
try await stream.resume(
109+
from: checkpointID,
110+
checkpointer: checkpointer,
111+
context: ctx
112+
)
113+
```
114+
115+
When `resume` returns, these properties are already populated from the checkpoint:
116+
117+
| Property | Source |
118+
|---|---|
119+
| ``AgentStream/sessionID`` | `target.sessionID` |
120+
| ``AgentStream/history`` | `target.messages` |
121+
| ``AgentStream/tokenUsage`` | `target.tokenUsage` |
122+
| ``AgentStream/currentCheckpoint`` | `target.checkpointID` |
123+
124+
The live continuation runs in a background task; ``AgentStream/iterationsReplayed`` increments once the synthetic replay event is observed, then the live iteration cycle proceeds normally. ``AgentStream/iterationsReplayed`` only counts replayed iterations, so callers can distinguish a fresh send from a resume.
125+
126+
See <doc:StreamingAndSwiftUI> for the full SwiftUI contract.
127+
128+
## Cancellation Safety
129+
130+
``AgentStream/resume(from:checkpointer:context:tokenBudget:requestContext:approvalHandler:)`` calls ``AgentStream/cancel()`` and resets observable state before any await. A prior in-flight task cannot continue mutating observers while the new checkpoint loads. The same generation-token discipline that protects ``AgentStream/send(_:history:context:tokenBudget:requestContext:approvalHandler:sessionID:checkpointer:)-(String,_,_,_,_,_,_,_)`` against late-arriving stale events applies to resume.
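
As a rough illustration of generation-token discipline in general (not the framework's actual implementation), the sketch below stamps each run with a counter value and drops events from older runs. `SketchObservableState` and its members are hypothetical, and the sketch assumes single-threaded access.

```swift
// Hypothetical illustration of a generation token; not the framework's code.
final class SketchObservableState {
    private(set) var generation = 0
    private(set) var receivedEvents: [String] = []

    // Called at the start of every send or resume: invalidates prior work
    // and resets observable state before anything asynchronous happens.
    func beginNewRun() -> Int {
        generation += 1
        receivedEvents.removeAll()
        return generation
    }

    // Events carry the token of the run that produced them;
    // anything from an older generation is dropped.
    @discardableResult
    func apply(_ event: String, from token: Int) -> Bool {
        guard token == generation else { return false }
        receivedEvents.append(event)
        return true
    }
}
```

Because the token is bumped synchronously, an event that arrives late from a cancelled task fails the comparison and never reaches observers.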
131+
132+
## Cross-Process Resume
133+
134+
``FileCheckpointer`` is safe to use from a fresh process. The directory layout is stable; reopening the same directory and calling ``FileCheckpointer/list(session:)`` returns checkpoints written by an earlier process.
135+
136+
```swift
137+
// Process A
138+
let writer = FileCheckpointer(directory: stateDirectory)
139+
for try await _ in agent.stream(
140+
userMessage: "Long task...",
141+
context: ctx, sessionID: session, checkpointer: writer
142+
) {}
143+
144+
// Process B (later)
145+
let reader = FileCheckpointer(directory: stateDirectory)
146+
let ids = try await reader.list(session: session)
147+
guard let last = ids.last else { return }
148+
let stream = try await agent.resume(
149+
from: last, checkpointer: reader, context: ctx
150+
)
151+
```
152+
153+
The file backend is single-writer oriented. Multi-process coordination over the same directory is the caller's responsibility; for concurrent writers, use a database-backed custom ``AgentCheckpointer``.
154+
155+
## Errors
156+
157+
``AgentCheckpointError`` covers the three failure modes that resume can surface:
158+
159+
| Case | Meaning |
160+
|---|---|
161+
| ``AgentCheckpointError/notFound(_:)`` | The named ``CheckpointID`` is not present in the backend |
162+
| ``AgentCheckpointError/fileSystem(_:)`` | A file backend operation failed (reading, writing, or decoding the requested ID) |
163+
| ``AgentCheckpointError/mcpBindingMismatch(_:)`` | Resume cannot continue because one or more recorded MCP bindings have no live counterpart |
164+
165+
## See Also
166+
167+
- <doc:StreamingAndSwiftUI>
168+
- <doc:MCPIntegration>
169+
- ``AgentCheckpoint``
170+
- ``AgentCheckpointer``
171+
- ``InMemoryCheckpointer``
172+
- ``FileCheckpointer``
173+
- ``MCPToolBinding``
174+
- ``AgentCheckpointError``
175+
- ``ContextBudgetCheckpointState``
176+
- ``EventOrigin``
177+
- ``CheckpointID``
178+
- ``SessionID``
179+
- ``RunID``

Sources/AgentRunKit/Documentation.docc/Articles/MCPIntegration.md

Lines changed: 6 additions & 0 deletions
@@ -54,6 +54,10 @@ Pass multiple ``MCPServerConfiguration`` values to a single session. All servers
5454

5555
``MCPTool`` adapts each discovered MCP tool to the ``AnyTool`` protocol. Once inside the `withTools` closure, MCP tools are indistinguishable from native ``Tool`` instances. The agent calls them through the same interface, and their results follow the same ``ToolResult`` type.
5656

57+
## Checkpoint Binding Validation
58+
59+
When a checkpointed run includes MCP tool calls, the agent loop records each participating tool as an ``MCPToolBinding`` in ``AgentCheckpoint/mcpToolBindings``. On resume, ``Agent/resume(from:checkpointer:context:tokenBudget:requestContext:approvalHandler:)`` validates that every recorded binding has a live counterpart with the same `serverName` and `toolName`. Missing bindings throw ``AgentCheckpointError/mcpBindingMismatch(_:)`` before any event is yielded, catching deployment skew where the resuming agent is configured against a different MCP server set. See <doc:CheckpointAndResume>.
60+
5761
## Error Handling
5862

5963
``MCPError`` covers all failure modes:
@@ -112,10 +116,12 @@ For session-based usage with custom transports, use the internal initializer tha
112116

113117
- <doc:DefiningTools>
114118
- <doc:AgentAndChat>
119+
- <doc:CheckpointAndResume>
115120
- ``MCPClient``
116121
- ``MCPSession``
117122
- ``MCPTool``
118123
- ``MCPToolInfo``
124+
- ``MCPToolBinding``
119125
- ``MCPServerConfiguration``
120126
- ``StdioMCPTransport``
121127
- ``MCPTransport``

Sources/AgentRunKit/Documentation.docc/Articles/StreamingAndSwiftUI.md

Lines changed: 16 additions & 4 deletions
@@ -49,12 +49,13 @@ Every event includes:
4949
|---|---|
5050
| ``StreamEvent/id`` | Stable event identity for transcript rendering and correlation |
5151
| ``StreamEvent/timestamp`` | Emission time in UTC |
52-
| ``StreamEvent/sessionID`` | Optional session identity |
53-
| ``StreamEvent/runID`` | Optional run identity |
52+
| ``StreamEvent/sessionID`` | Session identity, populated when a stream is started with `sessionID:` |
53+
| ``StreamEvent/runID`` | Run identity, freshly assigned on each `stream()` or `resume(...)` |
5454
| ``StreamEvent/parentEventID`` | Optional parent correlation identity |
55+
| ``StreamEvent/origin`` | ``EventOrigin/live`` or ``EventOrigin/replayed(from:)`` (set on resume) |
5556
| ``StreamEvent/kind`` | The semantic payload |
5657

57-
Today, direct `Agent` and `Chat` streams leave `sessionID`, `runID`, and `parentEventID` unset. A future session layer will populate those fields consistently.
58+
Pass `sessionID:` to ``Agent/stream(userMessage:history:context:tokenBudget:requestContext:approvalHandler:sessionID:checkpointer:)-(String,_,_,_,_,_,_,_)`` to thread an explicit session through events; otherwise a fresh ``SessionID`` is minted per stream. ``Chat`` continues to leave identity envelope fields unset.
5859

5960
## StreamEvent Kinds
6061

@@ -135,12 +136,20 @@ This canonical codec uses the framework's fixed JSON settings for event transcri
135136
| `toolCalls` | [``ToolCallInfo``] | Top-level and nested tool calls with live state (`.running`, `.awaitingApproval`, `.completed`, `.failed`) |
136137
| `iterationUsages` | [``TokenUsage``] | Per-iteration usage, one entry per `.iterationCompleted` |
137138
| `contextBudget` | ``ContextBudget``? | Latest budget snapshot from `.budgetUpdated` |
139+
| `sessionID` | ``SessionID``? | Session identity threaded through emitted events |
140+
| `currentCheckpoint` | ``CheckpointID``? | Last replayed or live checkpoint observed; preloaded on resume |
141+
| `iterationsReplayed` | `Int` | Count of replayed `.iterationCompleted` events; only incremented on `.replayed` origin |
138142

139143
**Methods:**
140144

141-
- `send(_:history:context:tokenBudget:requestContext:approvalHandler:)` cancels any active stream, resets state, and starts a new one.
145+
- `send(_:history:context:tokenBudget:requestContext:approvalHandler:sessionID:checkpointer:)` cancels any active stream, resets state, and starts a new one. Pass `sessionID:` and `checkpointer:` to persist iteration state.
146+
- `resume(from:checkpointer:context:tokenBudget:requestContext:approvalHandler:)` synchronously preloads observable state from the loaded checkpoint, then starts the live continuation. See <doc:CheckpointAndResume>.
142147
- `cancel()` cancels the active stream without resetting state. It is a local cancellation API and does not guarantee a terminal `.finished` event.
143148

149+
### Late-Binding Replay
150+
151+
Construct ``AgentStream`` with a `bufferCapacity:` to capture every emitted event in a ``StreamEventBuffer``. Late observers reattach via ``AgentStream/replay(from:)``, which streams every buffered event from the given monotonic cursor; if buffering is disabled, it fails with ``BufferReplayError`` instead. The buffer is isolated per send: a new `send` or `resume` clears it so that cursors stay comparable within one logical run.
152+
144153
When sub-agents emit nested tool events, `toolCalls` flattens them into the same collection and prefixes names using `parent > child`.
145154

146155
## SwiftUI Example
@@ -196,7 +205,10 @@ for (index, usage) in stream.iterationUsages.enumerated() {
196205

197206
- <doc:AgentAndChat>
198207
- <doc:SubAgents>
208+
- <doc:CheckpointAndResume>
199209
- ``StreamEvent``
210+
- ``EventOrigin``
200211
- ``AgentStream``
212+
- ``StreamEventBuffer``
201213
- ``ToolCallInfo``
202214
- ``TokenUsage``

Sources/AgentRunKit/Documentation.docc/Extensions/Agent.md

Lines changed: 7 additions & 2 deletions
@@ -13,5 +13,10 @@
1313

1414
### Streaming
1515

16-
- ``stream(userMessage:history:context:tokenBudget:requestContext:approvalHandler:)-(String,_,_,_,_,_)``
17-
- ``stream(userMessage:history:context:tokenBudget:requestContext:approvalHandler:)-(ChatMessage,_,_,_,_,_)``
16+
- ``stream(userMessage:history:context:tokenBudget:requestContext:approvalHandler:sessionID:checkpointer:)-(String,_,_,_,_,_,_,_)``
17+
- ``stream(userMessage:history:context:tokenBudget:requestContext:approvalHandler:sessionID:checkpointer:)-(ChatMessage,_,_,_,_,_,_,_)``
18+
19+
### Resuming
20+
21+
- ``resume(from:checkpointer:context:tokenBudget:requestContext:approvalHandler:)``
22+
- <doc:CheckpointAndResume>
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
1+
# ``AgentRunKit/AgentCheckpoint``
2+
3+
Snapshot of agent loop state captured at the end of an iteration.
4+
5+
The snapshot is what ``Agent/resume(from:checkpointer:context:tokenBudget:requestContext:approvalHandler:)`` reads to reconstruct a run. See <doc:CheckpointAndResume> for the full lifecycle.
6+
7+
## Topics
8+
9+
### Identity
10+
11+
- ``checkpointID``
12+
- ``sessionID``
13+
- ``runID``
14+
- ``timestamp``
15+
16+
### Loop State
17+
18+
- ``messages``
19+
- ``iteration``
20+
- ``tokenUsage``
21+
- ``iterationUsage``
22+
23+
### Resume Inputs
24+
25+
- ``contextBudgetState``
26+
- ``historyWasRewrittenLocally``
27+
- ``sessionAllowlist``
28+
- ``mcpToolBindings``
29+
30+
### Initialization
31+
32+
- ``init(messages:iteration:tokenUsage:iterationUsage:contextBudgetState:historyWasRewrittenLocally:sessionAllowlist:sessionID:runID:checkpointID:timestamp:mcpToolBindings:)``
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
1+
# ``AgentRunKit/AgentCheckpointError``
2+
3+
Errors thrown by ``AgentCheckpointer`` backends and ``Agent/resume(from:checkpointer:context:tokenBudget:requestContext:approvalHandler:)``.
4+
5+
See <doc:CheckpointAndResume> for the resume contract that surfaces these.
6+
7+
## Topics
8+
9+
### Cases
10+
11+
- ``notFound(_:)``
12+
- ``fileSystem(_:)``
13+
- ``mcpBindingMismatch(_:)``
14+
15+
### LocalizedError
16+
17+
- ``errorDescription``
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
1+
# ``AgentRunKit/AgentCheckpointer``
2+
3+
Persistence backend for ``AgentCheckpoint`` snapshots.
4+
5+
Conform to this protocol to write a custom backend (database, remote storage). The two built-in conformances are ``InMemoryCheckpointer`` and ``FileCheckpointer``. See <doc:CheckpointAndResume>.
6+
7+
## Topics
8+
9+
### Operations
10+
11+
- ``save(_:)``
12+
- ``load(_:)``
13+
- ``list(session:)``
14+
15+
### Built-In Backends
16+
17+
- ``InMemoryCheckpointer``
18+
- ``FileCheckpointer``
19+
20+
### Errors
21+
22+
- ``AgentCheckpointError``
