| 1 | +# Checkpoint and Resume |
| 2 | + |
| 3 | +Persist iteration state mid-run, then resume from any saved checkpoint into a fresh streaming continuation. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +A checkpointer captures the agent's full loop state at the end of every iteration: messages, accumulated token usage, per-iteration usage, context budget phase, session and run identity, the local-rewrite flag, the session approval allowlist, and any participating MCP tool bindings. ``Agent/resume(from:checkpointer:context:tokenBudget:requestContext:approvalHandler:)`` loads a saved checkpoint, replays its history into the consuming stream as one synthetic event, then continues live from the next iteration. |
| 8 | + |
| 9 | +This unblocks long-running sessions that need to survive process restarts, UI re-renders, or planned suspension. Checkpoints are written automatically by ``Agent/stream(userMessage:history:context:tokenBudget:requestContext:approvalHandler:sessionID:checkpointer:)-(String,_,_,_,_,_,_,_)`` when both a `sessionID` and a `checkpointer` are passed. |
| 10 | + |
| 11 | +## What a Checkpoint Captures |
| 12 | + |
| 13 | +``AgentCheckpoint`` is a `Codable` snapshot. It is written at the end of each iteration after tools execute and before the next request is built. |
| 14 | + |
| 15 | +| Field | Description | |
| 16 | +|---|---| |
| 17 | +| `messages` | Full conversation including system prompt, user, assistant, and tool messages | |
| 18 | +| `iteration` | One-based iteration number that produced this snapshot | |
| 19 | +| `tokenUsage` | Cumulative input/output usage across all iterations to date | |
| 20 | +| `iterationUsage` | Token usage for this iteration alone, when the provider reported it | |
| 21 | +| `contextBudgetState` | ``ContextBudgetCheckpointState`` capturing config, window size, last budget snapshot, and the soft-advisory armed flag | |
| 22 | +| `historyWasRewrittenLocally` | Whether the agent rewrote history (compaction, pruning) before this iteration | |
| 23 | +| `sessionAllowlist` | Tool names the user accepted with `.approveAlways` during this session | |
| 24 | +| `sessionID` | Logical session that owns the run | |
| 25 | +| `runID` | Run that produced this checkpoint | |
| 26 | +| `checkpointID` | Stable identity for the snapshot | |
| 27 | +| `timestamp` | UTC time the snapshot was taken | |
| 28 | +| `mcpToolBindings` | ``MCPToolBinding`` set: which MCP tools participated in this checkpoint's history | |
| 29 | + |
| 30 | +## Backends |
| 31 | + |
| 32 | +``AgentCheckpointer`` is a three-method protocol. Two backends ship with the framework. |
| 33 | + |
| 34 | +| Backend | Use When | |
| 35 | +|---|---| |
| 36 | +| ``InMemoryCheckpointer`` | The session is bounded by a single process lifetime: previews, tests, transient UI | |
| 37 | +| ``FileCheckpointer`` | The session must survive process restart: production apps, server workers, recovery flows | |
| 38 | + |
| 39 | +``FileCheckpointer`` stores one JSON file per checkpoint under `<directory>/checkpoints/<uuid>.json`. ``FileCheckpointer/list(session:)`` skips files it cannot read or decode, so unrelated debris in the directory does not break enumeration. ``FileCheckpointer/load(_:)``, by contrast, throws ``AgentCheckpointError/fileSystem(_:)`` if the requested file is corrupt. |
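The skip-on-failure enumeration described above can be sketched roughly as follows. This is a minimal illustration, not the framework's implementation: `SketchRecord`, `listCheckpoints(in:session:)`, and the ordering by iteration are all assumptions made for the example, and only Foundation calls are used.

```swift
import Foundation

// Hypothetical on-disk record; the real AgentCheckpoint has more fields.
struct SketchRecord: Codable {
    let sessionID: UUID
    let checkpointID: UUID
    let iteration: Int
}

/// Lists checkpoint IDs for one session under `<directory>/checkpoints`.
/// Files that cannot be read or decoded are skipped rather than treated as errors.
func listCheckpoints(in directory: URL, session: UUID) throws -> [UUID] {
    let folder = directory.appendingPathComponent("checkpoints", isDirectory: true)
    let files = try FileManager.default.contentsOfDirectory(
        at: folder, includingPropertiesForKeys: nil
    )
    let decoder = JSONDecoder()
    return files
        .filter { $0.pathExtension == "json" }
        .compactMap { url -> SketchRecord? in
            // `try?` turns unreadable or undecodable files into nil, which compactMap drops.
            guard let data = try? Data(contentsOf: url) else { return nil }
            return try? decoder.decode(SketchRecord.self, from: data)
        }
        .filter { $0.sessionID == session }
        .sorted { $0.iteration < $1.iteration }  // assumed ordering: oldest first
        .map(\.checkpointID)
}
```

Loading a single requested file would take the opposite stance and rethrow the decode failure, since the caller asked for that specific checkpoint.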
| 40 | + |
| 41 | +Custom backends conform to ``AgentCheckpointer`` directly; database-backed and remote-storage implementations are out of scope for the built-in backends. |
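As a rough sketch of what a custom backend might look like, the example below defines a stand-in three-method protocol and an actor-backed store. The article does not list the real method signatures, so `SketchCheckpointer`, its `save`/`load`/`list` requirements, and every `Sketch…` type are hypothetical; a real backend would conform to ``AgentCheckpointer`` with the framework's own types.

```swift
import Foundation

// Hypothetical stand-ins: the framework's real types are not shown in this article.
struct SketchCheckpointID: Hashable, Codable { let raw: UUID }
struct SketchSessionID: Hashable, Codable { let raw: UUID }

struct SketchCheckpoint: Codable {
    let checkpointID: SketchCheckpointID
    let sessionID: SketchSessionID
    let iteration: Int
    let timestamp: Date
}

enum SketchCheckpointError: Error, Equatable {
    case notFound(SketchCheckpointID)
}

// Assumed shape of a three-method checkpointer protocol.
protocol SketchCheckpointer: Sendable {
    func save(_ checkpoint: SketchCheckpoint) async throws
    func load(_ id: SketchCheckpointID) async throws -> SketchCheckpoint
    func list(session: SketchSessionID) async throws -> [SketchCheckpointID]
}

// An actor serializes access, so concurrent saves cannot corrupt the dictionary.
actor DictionaryCheckpointer: SketchCheckpointer {
    private var storage: [SketchCheckpointID: SketchCheckpoint] = [:]

    func save(_ checkpoint: SketchCheckpoint) {
        storage[checkpoint.checkpointID] = checkpoint
    }

    func load(_ id: SketchCheckpointID) throws -> SketchCheckpoint {
        guard let checkpoint = storage[id] else {
            throw SketchCheckpointError.notFound(id)
        }
        return checkpoint
    }

    // Returns the IDs for one session, oldest iteration first (an assumed ordering).
    func list(session: SketchSessionID) -> [SketchCheckpointID] {
        storage.values
            .filter { $0.sessionID == session }
            .sorted { $0.iteration < $1.iteration }
            .map(\.checkpointID)
    }
}
```

A database-backed variant would keep the same three entry points and replace the dictionary with queries, which is where multi-writer coordination could live.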
| 42 | + |
| 43 | +## Enabling Checkpointing on a Stream |
| 44 | + |
| 45 | +Pass `sessionID:` and `checkpointer:` to either entry point. |
| 46 | + |
| 47 | +```swift |
| 48 | +let session = SessionID() |
| 49 | +let checkpointer = InMemoryCheckpointer() |
| 50 | + |
| 51 | +let stream = agent.stream( |
| 52 | + userMessage: "Plan and execute the migration.", |
| 53 | + context: ctx, |
| 54 | + sessionID: session, |
| 55 | + checkpointer: checkpointer |
| 56 | +) |
| 57 | +for try await event in stream { |
| 58 | + handle(event) |
| 59 | +} |
| 60 | + |
| 61 | +let savedIDs = try await checkpointer.list(session: session) |
| 62 | +``` |
| 63 | + |
| 64 | +If either argument is omitted, no checkpoint is written. The `stream()` overloads continue to default both to `nil`, so existing call sites are unaffected. |
| 65 | + |
| 66 | +## Resuming a Run |
| 67 | + |
| 68 | +``Agent/resume(from:checkpointer:context:tokenBudget:requestContext:approvalHandler:)`` loads the named checkpoint, replays its history as one synthetic ``StreamEvent/Kind/iterationCompleted(usage:iteration:history:)`` event tagged with ``EventOrigin/replayed(from:)``, then continues from `iteration + 1`. |
| 69 | + |
| 70 | +```swift |
| 71 | +let stream = try await agent.resume( |
| 72 | + from: checkpointID, |
| 73 | + checkpointer: checkpointer, |
| 74 | + context: ctx |
| 75 | +) |
| 76 | +for try await event in stream { |
| 77 | + if case .replayed(let id) = event.origin { |
| 78 | + applySnapshot(id) |
| 79 | + continue |
| 80 | + } |
| 81 | + handle(event) |
| 82 | +} |
| 83 | +``` |
| 84 | + |
| 85 | +The resumed run gets a fresh ``RunID`` under the same ``SessionID``. Callers can distinguish replayed events from the live continuation by inspecting ``StreamEvent/origin``. |
| 86 | + |
| 87 | +### Preflight Termination |
| 88 | + |
| 89 | +If the saved checkpoint already exceeds the new `tokenBudget`, the stream replays and finishes with ``FinishReason/tokenBudgetExceeded(budget:used:)`` without making any LLM call. If `iteration >= maxIterations`, it replays and finishes with ``FinishReason/maxIterationsReached(limit:)``. |
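The two preflight checks can be modeled as a small pure function. This is only a sketch under stated assumptions: `SketchFinishReason` and `preflightFinish` are hypothetical names, the order of the checks follows the order of the paragraph above, and how the framework totals "used" tokens is not specified here.

```swift
// Hypothetical stand-in for the framework's finish reasons.
enum SketchFinishReason: Equatable {
    case tokenBudgetExceeded(budget: Int, used: Int)
    case maxIterationsReached(limit: Int)
}

/// Returns a finish reason when the resumed run should stop right after replay,
/// or nil when the live loop may continue from `iteration + 1`.
func preflightFinish(
    iteration: Int,
    tokensUsed: Int,
    tokenBudget: Int?,
    maxIterations: Int
) -> SketchFinishReason? {
    // A nil budget means no budget was supplied for the resumed run.
    if let budget = tokenBudget, tokensUsed > budget {
        return .tokenBudgetExceeded(budget: budget, used: tokensUsed)
    }
    if iteration >= maxIterations {
        return .maxIterationsReached(limit: maxIterations)
    }
    return nil
}
```

In both terminating cases no LLM call is made; the consumer sees the replayed history followed directly by the finish reason.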
| 90 | + |
| 91 | +### Cursor-State Providers |
| 92 | + |
| 93 | +Providers that keep conversation cursor state (for example, the OpenAI Responses API's `previous_response_id`) cannot reuse a stale cursor after resume, because the resumed run is a different run. The first live request after resume therefore forces full history (`.forceFullRequest`), so these providers reconstruct the conversation from messages rather than from a vanished cursor. |
| 94 | + |
| 95 | +### MCP Binding Validation |
| 96 | + |
| 97 | +Before replay begins, ``Agent/resume(from:checkpointer:context:tokenBudget:requestContext:approvalHandler:)`` checks that every ``MCPToolBinding`` recorded in the checkpoint has a live counterpart on the resuming agent. If any are missing, resume throws ``AgentCheckpointError/mcpBindingMismatch(_:)`` with the missing bindings before any event is yielded. This catches deployment skew where the agent that resumes is configured against fewer or different MCP servers than the agent that saved. |
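The validation amounts to a set difference between recorded and live bindings. The sketch below assumes a hypothetical `SketchBinding` with `server` and `tool` fields and a hypothetical error payload of `Set<SketchBinding>`; the real ``MCPToolBinding`` fields and the associated value of ``AgentCheckpointError/mcpBindingMismatch(_:)`` are not listed in this article.

```swift
// Hypothetical stand-in for MCPToolBinding.
struct SketchBinding: Hashable {
    let server: String
    let tool: String
}

enum SketchResumeError: Error, Equatable {
    case mcpBindingMismatch(Set<SketchBinding>)
}

/// Throws when the checkpoint recorded a binding the resuming agent lacks.
/// Extra live bindings are allowed; only missing ones block resume.
func validateBindings(
    recorded: Set<SketchBinding>,
    live: Set<SketchBinding>
) throws {
    let missing = recorded.subtracting(live)
    guard missing.isEmpty else {
        throw SketchResumeError.mcpBindingMismatch(missing)
    }
}
```

Running this check before any event is yielded means a mismatched deployment fails fast instead of replaying history it cannot continue.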
| 98 | + |
| 99 | +See <doc:MCPIntegration> for how MCP tools are discovered. |
| 100 | + |
| 101 | +## AgentStream Resume |
| 102 | + |
| 103 | +``AgentStream/resume(from:checkpointer:context:tokenBudget:requestContext:approvalHandler:)`` is the SwiftUI-side entry point. It cancels any in-flight prior task before any await runs, loads the checkpoint exactly once, then synchronously preloads observable state before yielding control back to the caller. |
| 104 | + |
| 105 | +```swift |
| 106 | +@State private var stream = AgentStream(agent: agent, bufferCapacity: 256) |
| 107 | + |
| 108 | +try await stream.resume( |
| 109 | + from: checkpointID, |
| 110 | + checkpointer: checkpointer, |
| 111 | + context: ctx |
| 112 | +) |
| 113 | +``` |
| 114 | + |
| 115 | +When `resume` returns, these properties are already populated from the checkpoint: |
| 116 | + |
| 117 | +| Property | Source | |
| 118 | +|---|---| |
| 119 | +| ``AgentStream/sessionID`` | `target.sessionID` | |
| 120 | +| ``AgentStream/history`` | `target.messages` | |
| 121 | +| ``AgentStream/tokenUsage`` | `target.tokenUsage` | |
| 122 | +| ``AgentStream/currentCheckpoint`` | `target.checkpointID` | |
| 123 | + |
| 124 | +The live continuation runs in a background task. ``AgentStream/iterationsReplayed`` increments once the synthetic replay event is observed, then the live iteration cycle proceeds normally. Because ``AgentStream/iterationsReplayed`` counts only replayed iterations, callers can use it to distinguish a fresh send from a resume. |
| 125 | + |
| 126 | +See <doc:StreamingAndSwiftUI> for the full SwiftUI contract. |
| 127 | + |
| 128 | +## Cancellation Safety |
| 129 | + |
| 130 | +``AgentStream/resume(from:checkpointer:context:tokenBudget:requestContext:approvalHandler:)`` calls ``AgentStream/cancel()`` and resets observable state before any await. A prior in-flight task cannot continue mutating observers while the new checkpoint loads. The same generation-token discipline that protects ``AgentStream/send(_:history:context:tokenBudget:requestContext:approvalHandler:sessionID:checkpointer:)-(String,_,_,_,_,_,_,_)`` against late-arriving stale events applies to resume. |
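One way to picture the generation-token idea is the toy model below. It is not the framework's ``AgentStream``: `GenerationGate`, `begin()`, and `apply(_:token:)` are hypothetical names, and the model ignores concurrency to keep the mechanism visible.

```swift
// Hypothetical model of a generation token; not the framework's AgentStream.
final class GenerationGate {
    private(set) var generation = 0
    private(set) var applied: [String] = []

    /// Starts a new run, resets observable state, and returns the run's token.
    /// Tokens handed out earlier become stale.
    func begin() -> Int {
        generation += 1
        applied.removeAll()
        return generation
    }

    /// Applies an event only if it carries the current run's token.
    @discardableResult
    func apply(_ event: String, token: Int) -> Bool {
        guard token == generation else { return false }
        applied.append(event)
        return true
    }
}
```

In this model, an event that arrives late from a cancelled run still carries the old token, so it is dropped instead of mutating state that now belongs to the resumed run.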
| 131 | + |
| 132 | +## Cross-Process Resume |
| 133 | + |
| 134 | +``FileCheckpointer`` is safe to use from a fresh process. The directory layout is stable; reopening the same directory and calling ``FileCheckpointer/list(session:)`` returns checkpoints written by an earlier process. |
| 135 | + |
| 136 | +```swift |
| 137 | +// Process A |
| 138 | +let writer = FileCheckpointer(directory: stateDirectory) |
| 139 | +for try await _ in agent.stream( |
| 140 | + userMessage: "Long task...", |
| 141 | + context: ctx, sessionID: session, checkpointer: writer |
| 142 | +) {} |
| 143 | + |
| 144 | +// Process B (later) |
| 145 | +let reader = FileCheckpointer(directory: stateDirectory) |
| 146 | +let ids = try await reader.list(session: session) |
| 147 | +guard let last = ids.last else { return } |
| 148 | +let stream = try await agent.resume( |
| 149 | + from: last, checkpointer: reader, context: ctx |
| 150 | +) |
| 151 | +``` |
| 152 | + |
| 153 | +The file backend is designed for a single writer. Coordinating multiple processes over the same directory is the caller's responsibility; for concurrent writers, use a database-backed custom ``AgentCheckpointer``. |
| 154 | + |
| 155 | +## Errors |
| 156 | + |
| 157 | +``AgentCheckpointError`` covers the three failure modes that resume can surface: |
| 158 | + |
| 159 | +| Case | Meaning | |
| 160 | +|---|---| |
| 161 | +| ``AgentCheckpointError/notFound(_:)`` | The named ``CheckpointID`` is not present in the backend | |
| 162 | +| ``AgentCheckpointError/fileSystem(_:)`` | A file backend operation failed (read, write, decode for the requested ID) | |
| 163 | +| ``AgentCheckpointError/mcpBindingMismatch(_:)`` | Resume cannot continue because one or more recorded MCP bindings have no live counterpart | |
| 164 | + |
| 165 | +## See Also |
| 166 | + |
| 167 | +- <doc:StreamingAndSwiftUI> |
| 168 | +- <doc:MCPIntegration> |
| 169 | +- ``AgentCheckpoint`` |
| 170 | +- ``AgentCheckpointer`` |
| 171 | +- ``InMemoryCheckpointer`` |
| 172 | +- ``FileCheckpointer`` |
| 173 | +- ``MCPToolBinding`` |
| 174 | +- ``AgentCheckpointError`` |
| 175 | +- ``ContextBudgetCheckpointState`` |
| 176 | +- ``EventOrigin`` |
| 177 | +- ``CheckpointID`` |
| 178 | +- ``SessionID`` |
| 179 | +- ``RunID`` |