Create s06.md

ivan-magda · ivan-magda · commit 66bdafa93003 · 2026-03-18T19:58:34.000+03:00
diff --git a/docs/s06.md b/docs/s06.md
@@ -0,0 +1,306 @@
+# Context compaction: infinite conversations
+
+Our agent has come a long way. It runs commands, reads and writes files, tracks its own work, delegates to subagents, and loads skills on demand — seven tools, one loop. But every one of those capabilities adds to the same growing resource: the messages array. A single `read_file` on a 1,000-line source file costs roughly 4,000 tokens. Load a skill body, and that's another 2,000. After reading 30 files and running 20 bash commands across a long session, the context pushes past 100,000 tokens. At that point, the agent either hits the API's context window limit and errors out, or — more subtly — the model's response quality degrades as the relevant information gets buried in a sea of stale tool results.
+
+This is the threshold that separates a demo from a useful tool. Everything we've built so far assumes the context has room. Once it doesn't, the agent has a hard ceiling on how much work it can do in a single session. That's where context compaction comes in: a three-layer compression strategy that progressively shrinks the messages array — quietly trimming old results, automatically summarizing when a threshold is crossed, and letting the model request compression explicitly. With these three layers working together, the agent can run indefinitely.
+
+In this guide, let's build `ContextCompactor` — the type that implements all three layers — and wire it into the agent loop. This is the beginning of Act III in our series: the agent now needs to manage its own memory.
+
+_The complete source code for this stage is available at the [`06-context-compaction`](https://github.com/ivan-magda/swift-claude-code/tree/06-context-compaction/Sources) tag on GitHub. Code blocks below show key excerpts._
+
+---
+
+### Three layers, three strategies
+
+The compression strategy works in layers, each more aggressive than the last. Layer 1 — **micro-compact** — runs silently before every API call. It scans the messages array for old tool results (anything beyond the three most recent) and replaces their content with a short placeholder like `"[Previous: used read_file]"`. The model still sees that a tool was called and what kind it was, but the actual output — the 500-line file, the verbose bash output — is gone. This is the quiet housekeeping layer: no API call required, no information loss that the model would typically need, and it runs every single turn.
+
+Layer 2 — **auto-compact** — triggers when the estimated token count crosses a threshold (50,000 by default). This is the dramatic one: the agent saves the entire conversation transcript to disk as a JSONL file, then asks the LLM itself to summarize the conversation. The summary replaces the entire messages array — every prior turn collapses into two messages: a user message containing the compressed summary and an assistant acknowledgment. The conversation continues from there with a clean slate and full context of what happened.
+
+Layer 3 — the **compact tool** — is the same summarization as layer 2, but triggered deliberately. The model calls `compact` when it decides compression would help, optionally specifying a `focus` parameter to guide what the summary should preserve. It's the difference between automatic garbage collection and an explicit `free()` — sometimes the model knows best when to compress.
+
+---
+
+### The ContextCompactor type
+
+Let's start with the type that owns all three layers. `ContextCompactor` holds two configuration values — the path where transcripts are saved and the token threshold that triggers auto-compaction — and exposes methods for each layer:
+
+```swift
+// Sources/Core/ContextCompactor.swift
+public struct ContextCompactor: Sendable {
+  public static let keepRecent = 3
+  public static let minContentLength = 100
+
+  public let transcriptDirectory: String
+  public let tokenThreshold: Int
+
+  public init(
+    transcriptDirectory: String,
+    tokenThreshold: Int = Limits.defaultTokenThreshold
+  ) {
+    self.transcriptDirectory = transcriptDirectory
+    self.tokenThreshold = tokenThreshold
+  }
+}
+```
+
+The `keepRecent` and `minContentLength` constants control micro-compact's behavior: keep the three most recent tool results untouched, and only replace results longer than 100 characters. Anything shorter isn't worth compacting.
+
+---
+
+### Micro-compact: the quiet layer
+
+The `microCompact` method scans the messages array for every `.toolResult` content block, identifies which ones are old enough to compress, and replaces their content with a placeholder. One thing to keep in mind here is that `Message.content` is a `let` property — we can't mutate a content block in place. Instead, we reconstruct entire `Message` values with new content arrays:
+
+```swift
+public func microCompact(messages: inout [Message]) {
+  let toolResultLocations = findToolResultLocations(in: messages)
+  guard toolResultLocations.count > Self.keepRecent else {
+    return
+  }
+
+  let toolNameMap = buildToolNameMap(from: messages)
+  let oldResults = toolResultLocations.dropLast(Self.keepRecent)
+  var modifiedContents: [Int: [ContentBlock]] = [:]
+
+  for (msgIdx, contentIdx) in oldResults {
+    guard
+      case .toolResult(let toolUseId, let content, let isError) = messages[msgIdx].content[contentIdx],
+      content.count > Self.minContentLength
+    else {
+      continue
+    }
+
+    let toolName = toolNameMap[toolUseId] ?? "unknown"
+    let replacement = ContentBlock.toolResult(
+      toolUseId: toolUseId,
+      content: "[Previous: used \(toolName)]",
+      isError: isError
+    )
+
+    if modifiedContents[msgIdx] == nil {
+      modifiedContents[msgIdx] = messages[msgIdx].content
+    }
+    modifiedContents[msgIdx]![contentIdx] = replacement
+  }
+
+  for (msgIdx, newContent) in modifiedContents {
+    messages[msgIdx] = Message(role: messages[msgIdx].role, content: newContent)
+  }
+}
+```
+
+The method is intentionally synchronous — it's pure data transformation with no reason to await anything. Two private helpers do the scanning: `findToolResultLocations` collects every `toolResult` position in the array, and `buildToolNameMap` walks assistant messages to map each `toolUseId` back to its tool name — bridging a gap in the API's data model where `toolResult` blocks carry an ID but no name.
+
+---
+
+### Auto-compact: threshold-triggered summarization
+
+Layer 2 needs to answer a question before it can act: how many tokens are we using? The API doesn't tell us the context size mid-conversation, so we estimate:
+
+```swift
+public func estimateTokens(from messages: [Message]) -> Int {
+  let data = (try? JSONEncoder().encode(messages)) ?? Data()
+  return data.count / 4
+}
+```
+
+The divide-by-four heuristic is rough, but it's close enough for a threshold check — and JSON encoding closely matches the actual API payload size, which is what we care about.
+
+When the estimate crosses the threshold, `autoCompact` takes over. It saves the full transcript to disk first — nothing is truly lost — then asks the LLM to summarize:
+
+```swift
+public func autoCompact(
+  messages: [Message],
+  using apiClient: APIClientProtocol,
+  model: String,
+  focus: String?
+) async -> [Message] {
+  do {
+    let path = try saveTranscript(messages)
+
+    let encoder = JSONEncoder()
+    let data = (try? encoder.encode(messages)) ?? Data()
+
+    var transcript = String(data: data, encoding: .utf8) ?? "[]"
+    if transcript.count > Self.maxSummaryInputLength {
+      transcript = String(transcript.prefix(Self.maxSummaryInputLength)) + "\n[truncated]"
+    }
+
+    var prompt = ""
+    if let focus, !focus.isEmpty {
+      prompt += "Focus on: \(focus). "
+    }
+    prompt += """
+      Summarize this conversation for continuity. Include: \
+      1) What was accomplished, 2) Current state, 3) Key decisions made. \
+      Be concise but preserve critical details.
+
+      \(transcript)
+      """
+
+    let request = APIRequest(
+      model: model,
+      maxTokens: 2000,
+      messages: [.user(prompt)]
+    )
+    let response = try await apiClient.createMessage(request: request)
+    let summary = response.content.textContent
+
+    return [
+      .user("[Conversation compressed. Transcript: \(path)]\n\n\(summary)"),
+      .assistant("Understood. I have the context from the summary. Continuing.")
+    ]
+  } catch {
+    print("[warning] Auto-compact failed: \(error). Keeping original messages.")
+    return messages
+  }
+}
+```
+
+The `do/catch` wrapping the entire method body is a deliberate safety net — compaction failure should never crash the agent loop. If the API call fails or the transcript can't be written, the method prints a warning and returns the original messages unchanged. The agent continues with a full context rather than no context.
+
+The `saveTranscript` method writes each message as a single JSON line to a `.transcripts/` directory. One early version used a bare Unix timestamp for the filename, which created collisions when two compactions happened in the same second. The fix appends a UUID prefix:
+
+```swift
+let timestamp = Int(Date().timeIntervalSince1970)
+let unique = UUID().uuidString.prefix(8)
+let path = "\(transcriptDirectory)/transcript_\(timestamp)_\(unique).jsonl"
+```
+
+---
+
+### The compact tool and two-phase dispatch
+
+Layer 3 gives the model direct control over compression. The `compact` tool definition includes an optional `focus` parameter that lets the model specify what the summary should preserve:
+
+```swift
+ToolDefinition(
+  name: "compact",
+  description: "Compress conversation history to free context space. Use when working on long tasks.",
+  inputSchema: .object([
+    "type": "object",
+    "properties": .object([
+      "focus": .object([
+        "type": "string",
+        "description": "What to preserve in the summary (e.g., 'file paths edited', 'current task progress')"
+      ])
+    ]),
+    "required": .array([])
+  ])
+)
+```
+
+The handler, though, is surprising — it doesn't actually compact anything:
+
+```swift
+private func executeCompact(_ input: JSONValue) async -> Result<String, ToolError> {
+  .success("Compressing...")
+}
+```
+
+This is the two-phase dispatch pattern. The `compact` tool can't perform the actual compaction because tool handlers return `Result<String, ToolError>` — they don't have access to the messages array. The real work needs to happen in the loop, where `messages` is a local `var`. So the handler returns a marker string, and `processToolUses` captures the focus parameter as a signal:
+
+```swift
+struct ToolProcessingResult {
+  let results: [ContentBlock]
+  let didUseTodo: Bool
+  let compactFocus: String?
+}
+```
+
+The `compactFocus` field is `nil` when compact wasn't called, and holds the focus value (or an empty string for no focus) when it was. This replaces the growing tuple that `processToolUses` previously returned — a named struct with a clear `nil`-vs-present semantic is easier to reason about than a third tuple element.
+
+Inside `processToolUses`, the compact detection is a simple check alongside the existing `didUseTodo` tracking:
+
+```swift
+if name == "compact" {
+  compactFocus = input["focus"]?.stringValue ?? ""
+}
+```
+
+---
+
+### Wiring into the agent loop
+
+With all three layers built, let's connect them. The `applyCompaction` helper runs layers 1 and 2 in sequence:
+
+```swift
+private func applyCompaction(_ messages: [Message]) async -> [Message] {
+  var compacted = messages
+  contextCompactor.microCompact(messages: &compacted)
+
+  if contextCompactor.estimateTokens(from: compacted) > contextCompactor.tokenThreshold {
+    print("[auto_compact triggered]")
+    return await contextCompactor.autoCompact(
+      messages: compacted, using: apiClient, model: model, focus: nil
+    )
+  }
+
+  return compacted
+}
+```
+
+Micro-compact runs first (every turn), then the threshold check determines whether auto-compact fires. The method takes messages by value and returns a new array — the same pure-value pattern we've used since extracting `agentLoop` for subagents.
+
+In the loop itself, `applyCompaction` runs before each API call, and manual compaction runs after tool results are appended:
+
+```swift
+while true {
+  try Task.checkCancellation()
+
+  iteration += 1
+  if iteration > config.maxIterations {
+    return (lastAssistantText + "\n(\(config.label) reached iteration limit)", messages)
+  }
+
+  messages = await applyCompaction(messages)
+
+  let request = APIRequest(
+    model: model, maxTokens: Limits.defaultMaxTokens,
+    system: systemPrompt, messages: messages, tools: config.tools
+  )
+
+  let response = try await apiClient.createMessage(request: request)
+  messages.append(Message(role: .assistant, content: response.content))
+  // ... print, check stop reason, process tools ...
+
+  messages.append(Message(role: .user, content: toolResults))
+
+  if let compactFocus = toolProcessing.compactFocus {
+    print("[manual compact]")
+    messages = await contextCompactor.autoCompact(
+      messages: messages, using: apiClient, model: model, focus: compactFocus
+    )
+  }
+}
+```
+
+The placement matters. Micro-compact and auto-compact run _before_ the API call, so the request always goes out with a trimmed context. Manual compact runs _after_ tool results are appended, so the summary includes the compact tool call itself — the model's explicit decision to compress is preserved in the transcript.
+
+The `compact` tool is excluded from `LoopConfig.subagent` alongside `agent` and `todo` — a subagent shouldn't be able to compress the parent's history. But micro-compact and auto-compact _do_ run in subagent loops, since subagents share the same `agentLoop` code path. A subagent making heavy `read_file` calls across its 30-iteration limit can benefit from the quiet cleanup.
+
+With that in place, we now have an agent that manages its own memory. Three layers of compression, one new type, and two injection points in the loop — before the API call and after tool processing.
+
+---
+
+### Taking it for a spin
+
+Let's build and run:
+
+```bash
+swift build && swift run claude
+```
+
+Try: `Read every Swift file in the Sources/ directory one by one.` Watch the terminal — after the first few files, earlier tool results in the context will start appearing as `"[Previous: used read_file]"` in subsequent API requests. That's micro-compact doing its work silently.
+
+For a more dramatic demonstration, keep reading files or ask the agent to explore a large codebase. When the estimated token count crosses 50,000, auto-compact triggers: the agent saves a full transcript to `.transcripts/`, asks the LLM for a summary, and continues with a fresh two-message context. Check the `.transcripts/` directory afterward — the full conversation history is preserved as JSONL.
+
+To see layer 3 in action, try: `Use the compact tool to compress this conversation, focusing on what files we've read.` The model calls `compact` with a focus parameter, the loop triggers summarization, and the conversation continues with a targeted summary.
+
+---
+
+### What we've built and where it breaks
+
+We now have an agent that can work indefinitely. Micro-compact quietly trims old tool results every turn. Auto-compact summarizes the full conversation when the context gets large. The `compact` tool gives the model deliberate control. Transcripts on disk mean nothing is truly lost — just moved out of active context.
+
+The limitation is that compression is lossy. When auto-compact fires, the model loses access to the exact content of files it read, the precise error messages it encountered, the specific commands it ran. The summary preserves the _gist_ — what was accomplished, the current state, key decisions — but not the details. For a long-running task with dozens of steps, the model might forget exactly which files it edited or which approach it tried and abandoned. The loop is still the invariant; tools are still the variable. But now one of those tools can reshape the loop's own working memory — the first time in our series that the agent isn't just acting on the world, but acting on itself. In the next guide, we'll address the lossy-compression problem directly: a file-based task system that gives the agent durable state that survives compaction. Thanks for reading!