feat: add plan mode for task execution — allows users to generate and review plans before execution, includes caching improvements for Anthropic prompts

VladoIvankovic · VladoIvankovic · commit 6f1c15f8092e · 2026-05-19T11:28:32.000+02:00
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -11,6 +11,66 @@ For releases before v1.3.35, see [GitHub Releases](https://github.com/VladoIvank
 > as the social-share summary (IFTTT → X/Bluesky), capped at 220 chars.
 > If omitted, the feed falls back to the first paragraph.
 
+## [2.0.2] — 2026-05-19
+
+> Two big quality-of-life additions: Anthropic prompt caching is on by default (60–90% cheaper on cache-eligible input), and `/plan` lets you preview an agent's full plan before any file gets touched. Run `/go` to execute, or `/plan <revised task>` to refine.
+
+### Added — Anthropic prompt caching, automatic
+
+- **Two cache breakpoints per request**: the system prompt (and embedded
+  skills catalog / project intelligence) and the tools array. Cache hits
+  bill at 0.1× the input rate; cache writes at 1.25×. Net win after the
+  second same-shape request, which is every iteration in an agent loop.
+  Below 1024 input tokens Anthropic silently skips caching — no error
+  path. Applies to the agent chat path, the agent fallback path, and
+  the chat() path used by `/agent` and inline replies. Also propagates
+  through OpenRouter → Anthropic routes (caching headers honoured
+  upstream).
+- **`TokenUsage.cacheCreationTokens` + `cacheReadTokens`** fields
+  surfaced on every record. `getCacheStats()` aggregates per-session
+  cache hits, misses, and estimated USD savings vs running without
+  caching. `/cost` (and `/stats`) renders a new "Prompt caching"
+  section when at least one cached call landed.
+
+### Added — Plan mode (`/plan` + `/go`)
+
+- **`/plan <task>`** — generates a numbered plan for the task (no tool
+  calls, no file changes), surfaces it as a Markdown message so you can
+  review what the agent would do, which files it would touch, what
+  commands it would run, and the risk level it self-assesses. Holds
+  the (task, plan) pair as the *pending* plan, scoped to the current
+  process. Re-running `/plan <revised task>` replaces the pending plan
+  with a new one (you pay one extra LLM call but get readable revision
+  history in the chat).
+- **`/go`** — executes the pending plan: hands the task + approved plan
+  as a single prompt to the regular agent loop, so all MCP tools,
+  lifecycle hooks, verification, permissions, and skill bundles apply
+  unchanged. Includes an explicit anti-improvisation clause in the
+  injected prompt — if any step turns out to be wrong mid-execution
+  the agent must stop and report rather than silently rewriting the
+  plan.
+- Available in **both the TUI and ACP clients** (Zed, VS Code). ACP
+  `/plan` streams the plan back via `session/update`; ACP `/go` runs
+  the agent inline and streams iterations through onChunk.
+- Surfaced in `/help` ("Agent Mode" section) and `/` autocomplete.
+
+### Fixed
+
+- **Anthropic streaming usage extraction missed cache fields.** Both
+  the agent stream handler (`utils/agentStream.ts`) and the chat
+  stream handler (`api/index.ts`) now pick up
+  `cache_creation_input_tokens` and `cache_read_input_tokens` from the
+  `message_start` event, so cached requests no longer undercount
+  prompt tokens or display $0 savings.
+
+### Notes
+
+- OpenAI-format providers (OpenAI direct, Z.AI, DeepSeek, MiniMax,
+  Ollama) don't expose explicit cache markers — those providers
+  generally apply automatic prefix caching server-side. No code change
+  on our end needed; cost reports stay accurate via standard
+  `prompt_tokens` accounting.
+
 ## [2.0.1] — 2026-05-18
 
 > Patch: `/mcp` now works in the CLI TUI (was only wired into the ACP path
diff --git a/src/acp/commands.ts b/src/acp/commands.ts
@@ -680,6 +680,63 @@ Anything else the agent should know — edge cases, gotchas, things to double-ch
 
     // ─── Export ────────────────────────────────────────────────────────────────
 
+    // ─── Plan mode (2.0.2) ────────────────────────────────────────────────────
+
+    case 'plan': {
+      // Identical contract to TUI /plan: generate a pre-execution plan,
+      // surface it, hold as pending so /go can execute it without re-planning.
+      if (!args.length) {
+        const { getPendingPlan } = await import('../utils/planMode.js');
+        const cur = getPendingPlan();
+        return {
+          handled: true,
+          response: cur
+            ? `**Pending plan for:** _${cur.task}_\n\n${cur.plan}\n\n---\nRun \`/go\` to execute, or \`/plan <revised task>\` to revise.`
+            : 'Usage: `/plan <task>` — generates a plan you can review, then `/go` to execute.',
+        };
+      }
+      const task = args.join(' ');
+      onChunk(`_Generating plan for: ${task.slice(0, 80)}${task.length > 80 ? '…' : ''}_\n\n`);
+      try {
+        const { generatePlan } = await import('../utils/planMode.js');
+        const plan = await generatePlan(task);
+        return {
+          handled: true,
+          response: `${plan}\n\n---\nRun \`/go\` to execute this plan, or \`/plan <revised task>\` to refine it.`,
+          streaming: true,
+        };
+      } catch (err) {
+        return { handled: true, response: `Plan generation failed: ${(err as Error).message}`, streaming: true };
+      }
+    }
+
+    case 'go': {
+      const { getPendingPlan, composeExecutionPrompt, clearPendingPlan } = await import('../utils/planMode.js');
+      const cur = getPendingPlan();
+      if (!cur) {
+        return { handled: true, response: 'No pending plan. Run `/plan <task>` first.' };
+      }
+      const prompt = composeExecutionPrompt(cur);
+      clearPendingPlan();
+      onChunk(`_Executing approved plan…_\n\n`);
+      try {
+        const { buildProjectContext } = await import('./session.js');
+        const ctx = buildProjectContext(session.workspaceRoot);
+        const agentResult = await runAgent(prompt, ctx, {
+          abortSignal,
+          onIteration: (_i: number, msg: string) => { onChunk(msg + '\n'); },
+          onThinking: (text: string) => { onChunk(text); },
+        });
+        return {
+          handled: true,
+          response: agentResult.finalResponse || '_(plan executed; no final summary)_',
+          streaming: true,
+        };
+      } catch (err) {
+        return { handled: true, response: `Plan execution failed: ${(err as Error).message}`, streaming: true };
+      }
+    }
+
     case 'export': {
       if (!session.history.length) return { handled: true, response: 'No messages to export.' };
       const format = (args[0] || 'md').toLowerCase();
diff --git a/src/acp/server.ts b/src/acp/server.ts
@@ -74,6 +74,9 @@ const AVAILABLE_COMMANDS = [
   { name: 'mcp',       description: 'Manage MCP servers, marketplace, resources, prompts', input: { hint: '[browse | install <id> | add | remove | reload | resources | read <uri> | prompts | prompt <server> <name>]' } },
   { name: 'openrouter', description: 'OpenRouter routing preferences (prefer/ignore/fallbacks/privacy/clear)', input: { hint: '[show | prefer <p,...> | ignore <p,...> | fallbacks on|off | privacy strict|allow | clear]' } },
   { name: 'export',    description: 'Export conversation', input: { hint: 'json | md | txt' } },
+  // Plan mode (2.0.2)
+  { name: 'plan',      description: 'Generate a numbered plan for a task — review before /go executes', input: { hint: '<task>' } },
+  { name: 'go',        description: 'Execute the pending plan from /plan' },
   // Project intelligence
   { name: 'scan',      description: 'Scan project structure and generate summary' },
   { name: 'review',    description: 'Run code review on project or specific files', input: { hint: '[file…]' } },
diff --git a/src/api/index.ts b/src/api/index.ts
@@ -669,6 +669,13 @@ async function chatAnthropic(
   }
 
   try {
+    // Anthropic prompt caching: wrap system as an array with a
+    // `cache_control` marker so the static system prompt (typically large
+    // and stable across a session) is cached. Below 1024 input tokens
+    // Anthropic silently skips caching — no error.
+    const cachedSystem = useNativeSystem
+      ? { system: [{ type: 'text' as const, text: systemPrompt, cache_control: { type: 'ephemeral' as const } }] }
+      : {};
     const response = await fetch(`${baseUrl}/v1/messages`, {
       method: 'POST',
       headers,
@@ -678,7 +685,7 @@ async function chatAnthropic(
         max_tokens: maxTokens,
         temperature,
         stream,
-        ...(useNativeSystem ? { system: systemPrompt } : {}),
+        ...cachedSystem,
       }),
       signal: controller.signal,
     });
@@ -727,6 +734,8 @@ async function handleAnthropicStream(
   let buffer = '';
   let inputTokens = 0;
   let outputTokens = 0;
+  let cacheCreationTokens = 0;
+  let cacheReadTokens = 0;
   let streamModel = '';
 
   while (true) {
@@ -740,7 +749,7 @@ async function handleAnthropicStream(
     for (const line of lines) {
       if (line.startsWith('data: ')) {
         const data = line.slice(6);
-        
+
         try {
           const parsed = JSON.parse(data);
           if (parsed.type === 'content_block_delta') {
@@ -750,9 +759,13 @@ async function handleAnthropicStream(
               onChunk(text);
             }
           }
-          // message_start contains input_tokens
+          // message_start contains input_tokens (and cache create/read
+          // when prompt caching is in play).
           if (parsed.type === 'message_start' && parsed.message?.usage) {
-            inputTokens = parsed.message.usage.input_tokens || 0;
+            const u = parsed.message.usage;
+            inputTokens = u.input_tokens || 0;
+            cacheCreationTokens = u.cache_creation_input_tokens || 0;
+            cacheReadTokens = u.cache_read_input_tokens || 0;
             streamModel = parsed.message.model || '';
           }
           // message_delta contains output_tokens
@@ -767,9 +780,16 @@ async function handleAnthropicStream(
   }
 
   // Record token usage
-  if (inputTokens > 0 || outputTokens > 0) {
+  if (inputTokens > 0 || outputTokens > 0 || cacheReadTokens > 0 || cacheCreationTokens > 0) {
+    const totalPrompt = inputTokens + cacheCreationTokens + cacheReadTokens;
     recordTokenUsage(
-      { promptTokens: inputTokens, completionTokens: outputTokens, totalTokens: inputTokens + outputTokens },
+      {
+        promptTokens: totalPrompt,
+        completionTokens: outputTokens,
+        totalTokens: totalPrompt + outputTokens,
+        cacheCreationTokens: cacheCreationTokens || undefined,
+        cacheReadTokens: cacheReadTokens || undefined,
+      },
       streamModel || 'unknown',
       config.get('provider')
     );
diff --git a/src/renderer/App.ts b/src/renderer/App.ts
@@ -105,6 +105,8 @@ const COMMAND_DESCRIPTIONS: Record<string, string> = {
   'hooks': 'List installed lifecycle hooks (.codeep/hooks/<event>.sh)',
   'mcp': 'Manage MCP servers (browse, install, add, remove, resources, prompts)',
   'openrouter': 'Tune OpenRouter routing (preferred / ignore providers, fallbacks, privacy)',
+  'plan': 'Generate a numbered plan for a task — review before /go executes it',
+  'go': 'Execute the pending plan from /plan',
 };
 
 import { helpCategories, keyboardShortcuts } from './components/Help';
@@ -297,6 +299,8 @@ export class App {
     // Keep in lockstep with COMMAND_DESCRIPTIONS below and helpCategories.
     'compact', 'commands', 'checkpoint', 'checkpoints', 'rewind',
     'hooks', 'mcp', 'openrouter',
+    // 2.0.2 — plan mode.
+    'plan', 'go',
     'c', 't', 'd', 'r', 'f', 'e', 'o', 'b', 'p',
   ];
   
diff --git a/src/renderer/commands.ts b/src/renderer/commands.ts
@@ -248,6 +248,59 @@ export async function handleCommand(
       break;
     }
 
+    case 'plan': {
+      // Plan mode: ask the model for a plan, surface it, hold as pending.
+      // The user runs /go to execute or /plan <revised> to revise. See
+      // src/utils/planMode.ts for the rationale + system prompt.
+      if (!args.length) {
+        const { getPendingPlan } = await import('../utils/planMode');
+        const cur = getPendingPlan();
+        if (cur) {
+          ctx.app.addMessage({
+            role: 'system',
+            content: `**Pending plan for:** _${cur.task}_\n\n${cur.plan}\n\n---\nRun \`/go\` to execute, or \`/plan <revised task>\` to revise.`,
+          });
+        } else {
+          ctx.app.notify('Usage: /plan <task> — generates a plan you can review, then /go to execute.');
+        }
+        return;
+      }
+      if (ctx.isAgentRunning()) { ctx.app.notify('Agent already running. Use /stop first.'); return; }
+      const task = args.join(' ');
+      ctx.app.addMessage({ role: 'user', content: `/plan ${task}` });
+      ctx.app.notify('Generating plan…');
+      try {
+        const { generatePlan } = await import('../utils/planMode');
+        const plan = await generatePlan(task);
+        ctx.app.addMessage({
+          role: 'assistant',
+          content: `${plan}\n\n---\nRun \`/go\` to execute this plan, or \`/plan <revised task>\` to refine it.`,
+        });
+      } catch (err) {
+        ctx.app.notify(`Plan generation failed: ${(err as Error).message}`);
+      }
+      break;
+    }
+
+    case 'go': {
+      // Execute the pending plan from /plan. The agent loop sees the
+      // task + plan as a single prompt, so MCP tools, hooks, permissions,
+      // and verification all apply unchanged.
+      const { getPendingPlan, composeExecutionPrompt, clearPendingPlan } = await import('../utils/planMode');
+      const cur = getPendingPlan();
+      if (!cur) {
+        ctx.app.notify('No pending plan. Run `/plan <task>` first.');
+        return;
+      }
+      if (ctx.isAgentRunning()) { ctx.app.notify('Agent already running. Use /stop first.'); return; }
+      const prompt = composeExecutionPrompt(cur);
+      clearPendingPlan();
+      ctx.app.notify(`Executing plan for: ${cur.task.slice(0, 80)}${cur.task.length > 80 ? '…' : ''}`);
+      const { runAgentTask } = await import('./agentExecution');
+      runAgentTask(prompt, false, ctx, () => null, () => {});
+      break;
+    }
+
     case 'stop': {
       if (ctx.isAgentRunning() && ctx.abortController) {
         ctx.abortController.abort();
diff --git a/src/renderer/components/Help.ts b/src/renderer/components/Help.ts
@@ -56,6 +56,8 @@ export const helpCategories: HelpCategory[] = [
     items: [
       { key: '/agent <task>', description: 'Run agent with task' },
       { key: '/agent-dry <task>', description: 'Dry run (no changes)' },
+      { key: '/plan <task>', description: 'Generate a plan first — review before /go executes' },
+      { key: '/go', description: 'Execute the pending plan from /plan' },
       { key: '/stop', description: 'Stop running agent' },
       { key: '/undo', description: 'Undo last agent action' },
       { key: '/undo-all', description: 'Undo all agent actions' },
diff --git a/src/utils/agentChat.ts b/src/utils/agentChat.ts
@@ -338,9 +338,25 @@ export async function agentChat(
       };
     } else {
       endpoint = `${baseUrl}/v1/messages`;
+      // Anthropic prompt caching. Two cache breakpoints:
+      //   1. `system` (largest stable block — system prompt + skills catalog)
+      //   2. last tool in `tools` (Anthropic caches everything up to and
+      //      including the marker, so this caches the entire tools array)
+      // Cache hits cost 0.1× input. Misses ("cache creation") cost 1.25×.
+      // Net win after the 2nd same-shape request. Below 1024 input tokens
+      // Anthropic silently skips caching — no error path to handle.
+      const anthropicTools = getAnthropicTools(additionalTools);
+      const cachedTools = anthropicTools.length > 0
+        ? [
+            ...anthropicTools.slice(0, -1),
+            { ...anthropicTools[anthropicTools.length - 1], cache_control: { type: 'ephemeral' as const } },
+          ]
+        : anthropicTools;
       body = {
-        model, system: systemPrompt, messages,
-        tools: getAnthropicTools(additionalTools), stream: useStreaming,
+        model,
+        system: [{ type: 'text', text: systemPrompt, cache_control: { type: 'ephemeral' as const } }],
+        messages,
+        tools: cachedTools, stream: useStreaming,
         ...tempParam, max_tokens: getEffectiveMaxTokens(providerId, Math.max(config.get('maxTokens'), 16384)),
       };
     }
@@ -477,10 +493,15 @@ export async function agentChatFallback(
       };
     } else {
       endpoint = `${baseUrl}/v1/messages`;
+      // Fallback path injects system+tools as the first user message
+      // (no native tool API). Cache that block — it's large and stable.
       body = {
         model,
         messages: [
-          { role: 'user', content: fallbackPrompt },
+          {
+            role: 'user',
+            content: [{ type: 'text', text: fallbackPrompt, cache_control: { type: 'ephemeral' as const } }],
+          },
           { role: 'assistant', content: 'Understood. I will use the tools as specified.' },
           ...messages,
         ],
diff --git a/src/utils/agentStream.ts b/src/utils/agentStream.ts
@@ -210,12 +210,29 @@ export async function handleAnthropicAgentStream(
       try {
         const parsed = JSON.parse(data);
 
-        // message_start has input tokens; message_delta has output tokens — merge both
+        // message_start has input tokens (incl. cache create/read fields if
+        // prompt caching is in use); message_delta has output tokens —
+        // merge both so extractAnthropicUsage sees the full picture.
         if (parsed.type === 'message_start' && parsed.message?.usage) {
-          usageData = { usage: { input_tokens: parsed.message.usage.input_tokens || 0, output_tokens: 0 } };
+          const u = parsed.message.usage;
+          usageData = {
+            usage: {
+              input_tokens: u.input_tokens || 0,
+              output_tokens: 0,
+              cache_creation_input_tokens: u.cache_creation_input_tokens || 0,
+              cache_read_input_tokens: u.cache_read_input_tokens || 0,
+            },
+          };
         } else if (parsed.type === 'message_delta' && parsed.usage) {
-          const inputTokens: number = (usageData as any)?.usage?.input_tokens || 0;
-          usageData = { usage: { input_tokens: inputTokens, output_tokens: parsed.usage.output_tokens || 0 } };
+          const prev: Record<string, number> = (usageData as { usage?: Record<string, number> } | null)?.usage ?? {};
+          usageData = {
+            usage: {
+              input_tokens: prev.input_tokens || 0,
+              output_tokens: parsed.usage.output_tokens || 0,
+              cache_creation_input_tokens: prev.cache_creation_input_tokens || 0,
+              cache_read_input_tokens: prev.cache_read_input_tokens || 0,
+            },
+          };
         }
 
         if (parsed.type === 'content_block_start') {
diff --git a/src/utils/planMode.test.ts b/src/utils/planMode.test.ts
diff --git a/src/utils/planMode.ts b/src/utils/planMode.ts
diff --git a/src/utils/tokenTracker.test.ts b/src/utils/tokenTracker.test.ts
diff --git a/src/utils/tokenTracker.ts b/src/utils/tokenTracker.ts