fix(opencode): make OpenRouter prompt cache 1h TTL opt-in via env#30190
fix(opencode): make OpenRouter prompt cache 1h TTL opt-in via env#30190rndmcnlly wants to merge 2 commits into
Conversation
…trol
Use the documented cache_control { type: "ephemeral", ttl: "1h" } on
message content for OpenRouter. The default 5-minute TTL causes constant
cache misses and rewrites during normal usage gaps (reviewing output,
editing files, thinking). Extending to 1 hour significantly reduces
redundant cache write costs on Anthropic models via OpenRouter.
Empirically verified: only the per-message cache_control.ttl approach
survives past 5 minutes; the undocumented top-level prompt_cache_ttl
parameter is silently ignored.
Closes anomalyco#16848
|
I'm currently using 1h cache via https://github.com/cortexkit/anthropic-auth plugin and I can confirm that it reduces my consumption by ~38% in the last 15 days. |
Replace the unconditional ttl: "1h" with an env-controlled resolver that defaults to the cheaper 5m TTL, matching Claude Code's prompt-caching semantics. 1h cache writes are billed at a higher rate (~2x base input vs ~1.25x for 5m), so it only pays off when a session resumes after >5m but within the hour; imposing it on every request taxes autonomous/walk-away runs that finish fast or idle past the hour. Knobs (Anthropic-scoped, since cache_control.ttl is an Anthropic extension): - OPENCODE_ANTHROPIC_PROMPT_CACHING_1H=1 opts into 1h - OPENCODE_ANTHROPIC_FORCE_PROMPT_CACHING_5M=1 forces 5m (overrides the above) Applied to both openrouter and openaiCompatible gateways. anthropic-native and bedrock paths are unchanged.
Revised: 1h TTL is now opt-in via env, defaulting to 5mPushed
Applied to both the Why opt-in, by analogy to Claude CodeClaude Code models this exact decision with three orthogonal env vars and, crucially, makes 1h opt-in rather than default:
Their stated reasoning (Claude Code env-vars, How Claude Code uses prompt caching):
The split is along the billing model: per-token paths default to 5m; only subscription paths (where 1h is plan-included and effectively free) auto-enable 1h. OpenRouter is a per-token path, so 5m-default-with-opt-in is the matching behavior. The cost math the previous version ignored1h cache writes are billed at a higher rate than 5m writes (Anthropic prompt caching, corroborated by Cadence and Prism):
So 1h only pays off when a session resumes after a >5m gap but within the hour (one 1h write at 2.0x beats two consecutive 5m writes at 2 x 1.25 = 2.5x). For autonomous/walk-away runs that finish fast, or idle past the hour, the unconditional 1h premium was pure waste. See AgentPatterns: Extended Prompt Cache TTL for the break-even derivation:
The original commit message's motivation (cache misses during review/edit/think gaps) is real and valid; this revision preserves it as an opt-in for exactly that workflow without taxing everyone else. How other harnesses handle it
TestsReverted the assertion that expected the hardcoded |
|
@rndmcnlly I would advise either by default or with a configuration to make 1h cache based on if the agent is a primary or subagent. For ephemeral subagents, it's better to use 5m instead of 1h. |
|
@ualtinok Thanks for both the field data and the design instinct here. The ~38% reduction over 15 days lands right next to the ~41% my cache simulation estimated in #16848 (comment) (#16848 (comment)), so two independent methods are converging on "this is a large pile of money." Subagents are a clean discriminator. Ephemeral subagents rarely resume activity after a 5m gap because there's no human sitting and thinking, so they'd mostly just eat the 2.0x write premium with no payback. I dug into the code and the signal is already present at request-prep time (agent.mode === "subagent", and equivalently a non-null parentSessionID), so it's implementable without heuristics. Where I'm still uncertain is the right long-term control surface for it: a static subagent-detection heuristic baked into defaults, the current env-var approach, a config knob, or some combination. That deserves its own discussion after we stop the bleeding rather than something to settle inline here. My proposal: land this PR as-is. It's the conservative move: opt-in, defaults to the cheaper 5m, and matches Claude Code's per-token semantics. It only benefits users paying close enough attention to spot the env var, which is exactly the population that can evaluate the tradeoff for their workflow. That stops a major source of waste for power users today without changing anyone's defaults. Then, once it's merged, I'll open a follow-up issue on extending these benefits to a less-technical audience. |
Issue for this PR
Closes #16848
(Re-submitting after #16850 was auto-closed by the stale bot.)
Type of change
What does this PR do?
OpenRouter's prompt cache defaults to a 5-minute TTL. Adds
ttl: "1h"to thecacheControlprovider option already set on OpenRouter messages, so caches survive normal pauses (review, edits, thinking) instead of getting rewritten every few minutes.The earlier attempt in #16850 used a top-level
prompt_cache_ttlparameter, which OpenRouter silently ignores. The per-messagecache_control.ttlis the only mechanism that actually extends the TTL.How did you verify your code works?
Ran a three-arm probe against
anthropic/claude-opus-4.8, with a 6-minute gap past the 5-minute default. Approach B (cache_control.ttl: "1h") was the only one that wrote to cache and got a hit on R2 (21,635 cached tokens). Approaches A (prompt_cache_ttl=3600) and C (baseline) wrote nothing.Script: https://gist.github.com/rndmcnlly/6680b63aeb26b5ea9ffbb56dab3030dc (uv + OPENROUTER_API_KEY, ~7 min runtime).
Checklist