Replies: 4 comments, 3 replies

- Gap 1 — I don't think this is actually a problem (1 reply)
- Gap 3 — No 1-hour TTL option, done! (0 replies)
- Gap 2 — The cache argument doesn't hold up (1 reply)
- Gap 4 — Not a problem in ECA's current design (1 reply)
## How Anthropic prompt caching works

Before comparing implementations, these mechanics apply to both:

- **Exact prefix matching, KV-cache based.** Each block's hash encodes all preceding blocks. A cache hit at block N guarantees blocks 0 through N−1 are byte-for-byte identical. Even one changed token is a miss.
- **Processing order is a strict hierarchy: tools → system → messages.** Changing tools busts the system and message caches. Changing the system prompt busts the message cache. Each layer only caches if the layers above it are stable.
- **Up to 4 explicit `cache_control` breakpoints per request.** Each is independently invalidatable. Breakpoints themselves are free — you only pay for writes (1.25× base input price for 5-min TTL, 2× for 1-hour TTL) and reads (0.1× base). A single cache read already breaks even on a 5-min write.
- **The critical 20-block lookback window.** When a request has a `cache_control` breakpoint, Anthropic automatically walks backward up to ~20 content blocks looking for any prior write it can hit. Content more than 20 blocks before your breakpoint is invisible to cache lookup unless you add an additional explicit breakpoint before it.
- **Cache TTL resets on every hit.** The 5-min clock resets each time the cache is read. A 1-hour TTL (`"ttl": "1h"`) is available at higher write cost.
- **Tool-call pairs each consume 2 blocks.** Each `tool_call` (assistant turn) + `tool_call_output` (user turn) pair counts as 2 content blocks toward the 20-block window. 10 tool calls = 20 blocks consumed, pushing earlier history out of the automatic lookback range.
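The layer/breakpoint layout above can be sketched as a Messages API payload. This is a minimal illustration, not code from either project: the model name, tool definitions, and prompt contents are placeholders.

```python
# Sketch: where the four cache_control breakpoints sit across the
# tools -> system -> messages hierarchy. All contents are placeholders.
request = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    # Layer 1: tools. A breakpoint on the LAST tool caches the whole array.
    "tools": [
        {"name": "read_file", "description": "Read a file",
         "input_schema": {"type": "object"}},
        {"name": "run_shell", "description": "Run a command",
         "input_schema": {"type": "object"},
         "cache_control": {"type": "ephemeral"}},               # breakpoint 1
    ],
    # Layer 2: system prompt, split into a stable and a per-session block.
    "system": [
        {"type": "text", "text": "<static instructions>",
         "cache_control": {"type": "ephemeral", "ttl": "1h"}},  # breakpoint 2
        {"type": "text", "text": "<per-session context>",
         "cache_control": {"type": "ephemeral"}},               # breakpoint 3
    ],
    # Layer 3: messages. Breakpoint on the last block of the last user turn.
    "messages": [
        {"role": "user", "content": [
            {"type": "text", "text": "Fix the failing test",
             "cache_control": {"type": "ephemeral"}},           # breakpoint 4
        ]},
    ],
}

# At most 4 explicit breakpoints are allowed per request.
n_breakpoints = (
    sum("cache_control" in t for t in request["tools"])
    + sum("cache_control" in b for b in request["system"])
    + sum("cache_control" in b for m in request["messages"] for b in m["content"])
)
assert n_breakpoints <= 4
```

Note the 1-hour TTL only on the static system block: it is the layer most likely to survive long pauses, while the rolling message breakpoint stays on the cheaper 5-min write.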
## Caching strategy comparison

| Layer | pi | ECA |
| --- | --- | --- |
| System prompt | `cache_control` on the whole block | `static` (rarely changes) + `dynamic` (per-session context); `cache_control` on each |
| Tools | `cache_control` on last tool definition, caching the full tools array | `cache_control` on last tool definition via `add-cache-to-last-tool`; same approach |
| Message history (normal turn, no compaction) | `cache_control` on the last content block of the last user message | `cache_control` on the last content block of the last user message via `add-cache-to-last-message` |
| Long conversations (>20 message blocks) | — | single breakpoint on the last message; earlier blocks fall outside the lookback window (Gap 1) |
| Compaction | `[compaction_summary_user_msg] + [kept_msgs: firstKeptEntryId … compaction] + [post-compact turns]`; kept messages start at `firstKeptEntryId` and overlap with the prior session's cached prefix; `isSplitTurn` detection prevents the cut from landing inside a `user → assistant → tool_result` turn | `[compact_marker (invisible)] + [summary_user_msg] + [post-compact turns]`; `compact_marker` is stripped from the LLM's view |
| Cache TTL | 5-min (short) by default; `PI_CACHE_RETENTION=long` → 1-hour TTL on `api.anthropic.com` | 5-min only (Gap 3) |
| Resume / fork from before a compaction point | `navigateTree` moves the leaf pointer; `buildSessionContext` walks the new path with no compaction entry → LLM receives clean raw history to that point; Anthropic may have this prefix cached from the original session | `/fork` copies full raw `:messages`; `messages-after-last-compact-marker` still applies → if a compact marker exists in the fork, the LLM still sees only summary + post-compact messages |

## Summary of ECA gaps
### Gap 1 — Long conversations exceed the 20-block lookback (quick win)

An agentic session with 10+ tool calls generates 20+ blocks from tool pairs alone. With a single breakpoint at the last user message, earlier turns fall outside the lookback window and are never cached.

Example: a 30-block conversation with the breakpoint on block 30. Next turn: the breakpoint moves to block 31, lookback reaches back to block 11 — blocks 1–10 are invisible and never cached.
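The arithmetic in this example can be checked with a small helper. The exactly-20-blocks lookback model here is an illustrative assumption, not Anthropic's implementation:

```python
def cacheable_blocks(breakpoints: list[int], lookback: int = 20) -> set[int]:
    """Blocks reachable by cache lookup (1-indexed): each breakpoint
    covers itself plus up to `lookback` blocks immediately before it."""
    covered: set[int] = set()
    for bp in breakpoints:
        covered.update(range(max(1, bp - lookback), bp + 1))
    return covered

# 31-block conversation, single breakpoint on block 31:
single = cacheable_blocks([31])
assert min(single) == 11          # blocks 1-10 never enter cache lookup

# A second breakpoint ~20 blocks earlier closes the hole:
double = cacheable_blocks([11, 31])
assert double == set(range(1, 32))  # full coverage
```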
Fix: Place a second `cache_control` breakpoint ~20 blocks before the final one. This extends cache coverage across the full conversation with minimal code change (one additional marker placement in `add-cache-to-last-message` or its caller).

### Gap 2 — Compaction breaks the message-layer cache entirely
After `/compact`, the message prefix sent to the LLM starts with a freshly generated summary that Anthropic has never seen. This is a guaranteed cache miss. ECA then builds a new cached prefix from scratch over subsequent turns.

Pi avoids this by keeping a trailing window of real messages (from `firstKeptEntryId`) after the summary. These messages overlap with the prior session's cached prefix, so Anthropic can hit the kept portion on the first post-compact turn.

Fix: Preserve the last N tokens (e.g. 8–16k) of real messages after the summary in `compact-side-effect!`. Update `messages-after-last-compact-marker` to return `[summary] [kept_msgs...] [new_turns...]`. The kept window is stable across several subsequent turns and accumulates cache hits naturally. The tombstone marker introduced in commit `69011cf` already keeps full display history available, so this is purely a change to what gets passed to the LLM.
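A minimal sketch of the kept-window selection. The message dicts, `type` field, and token counter here are hypothetical, not ECA's actual data model; the forward walk off tool-output boundaries anticipates the Gap 4 constraint below:

```python
# Hypothetical message shape: {"role": ..., "type": ..., ...} where
# type == "tool_call_output" marks the user-side half of a tool pair.

def select_kept_window(messages: list[dict], budget_tokens: int,
                       count_tokens) -> list[dict]:
    """Pick a trailing window of real messages (~budget_tokens) to keep
    after the compaction summary, starting on a clean boundary."""
    # Walk backward, accumulating tokens until the budget is spent.
    start = len(messages)
    used = 0
    while start > 0 and used + count_tokens(messages[start - 1]) <= budget_tokens:
        used += count_tokens(messages[start - 1])
        start -= 1
    # Never start inside a tool-call pair: walk forward until the first
    # kept message is a user message that is not a tool_call_output.
    while start < len(messages) and not (
        messages[start]["role"] == "user"
        and messages[start].get("type") != "tool_call_output"
    ):
        start += 1
    return messages[start:]

# The prefix sent to the LLM then becomes:
#   [summary_user_msg] + kept window + [new turns...]
```

Because the kept window is copied verbatim from the pre-compact history, its blocks are byte-identical to the prior session's prefix, which is what makes the post-compact cache hit possible.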
### Gap 3 — No 1-hour TTL option
Every coding session involves pauses: reading output, thinking, context-switching. With a 5-min
TTL, any pause longer than 5 minutes expires the cache and the next turn pays full re-write cost.
Fix: Add a config option (e.g. `anthropic.cacheRetention: "1h"`) that switches the `cache_control` TTL to `{"type": "ephemeral", "ttl": "1h"}`. The optimal strategy would apply the 1-hour TTL to the stable layers (tools, static system) — which are reused across many turns and sessions — and keep 5-min on the rolling message layer, which changes every turn anyway.
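A sketch of that per-layer choice, assuming the proposed (not yet existing) `anthropic.cacheRetention` option and hypothetical layer names:

```python
def cache_control_for(layer: str, retention: str = "5m") -> dict:
    """Pick a cache_control block per layer. The 1-hour TTL costs 2x to
    write (vs 1.25x for 5-min), so it only pays off on stable layers
    that are reused across many turns and sessions."""
    stable_layers = ("tools", "system-static")
    if retention == "1h" and layer in stable_layers:
        return {"type": "ephemeral", "ttl": "1h"}
    return {"type": "ephemeral"}  # default 5-min TTL

assert cache_control_for("tools", "1h") == {"type": "ephemeral", "ttl": "1h"}
assert cache_control_for("messages", "1h") == {"type": "ephemeral"}
```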
### Gap 4 — Compact boundary must not split a tool-call pair (prerequisite for Gap 2)

When implementing the kept window (Gap 2), the start of the kept window must land on a clean message boundary — a `user` message that is not a `tool_call_output`. Starting inside a `tool_call`/`tool_call_output` pair would produce an invalid Anthropic request (mismatched `tool_use_id`).

Fix: When selecting the first kept message in compact, walk forward until the boundary is a non-tool-output user message. This mirrors pi's `isSplitTurn` detection.

## Priority
- `cache_control` placement
- `compact-side-effect!` + `messages-after-last-compact-marker`
- `anthropic.clj`

## References
- `7754a05`: static/dynamic system prompt split
- `69011cf`: tombstone compact markers
- `packages/ai/src/providers/anthropic.ts`: `convertMessages`, `getCacheControl`
- `packages/coding-agent/src/core/session-manager.ts`: `buildSessionContext`, `firstKeptEntryId`
- `packages/coding-agent/src/core/compaction/compaction.ts`: `prepareCompaction`, `findCutPoint`