Summary
GeminiContextCacheManager._create_new_cache_with_contents decides whether to call caches.create() using a token count that represents the full request (system_instruction + tools + all chat history + user turn), but the actual caches.create() call only sends the cacheable prefix (system_instruction + tools + contents[:cache_contents_count]).
This mismatch causes caches.create() to fail with 400 INVALID_ARGUMENT ("The cached content is of <N> tokens. The minimum token count to start caching is 4096.") on every cold-prefix request whose chat history pushes the full token count over 4096 while the prefix itself stays below it.
Worse, on failure the manager returns CacheMetadata(fingerprint=..., contents_count=..., cache_name=None) (no cache_name), so the next request with the same fingerprint hits the same failure path and tries to create again. The result is an infinite create-fail loop, zero cache hits, and a flood of "Failed to create cache: 400 INVALID_ARGUMENT" warnings.
Affected version
google-adk == 1.34.1
Environment
- Model: gemini-3.5-flash on Vertex AI
- ContextCacheConfig(cache_intervals=10, ttl_seconds=1800, min_tokens=4096)
- App-level config (App.context_cache_config)
- ~24h prod observation: hundreds of failures/min, cache_hit_pct ~= 0% across all routed models
Reproduction (code path walkthrough)
1. google/adk/flows/llm_flows/context_cache_processor.py:81
       llm_request.cacheable_contents_token_count = previous_token_count
   previous_token_count is the prior response's usage_metadata.prompt_token_count, i.e. the full prompt the model just saw (chat history + tools + system_instruction + the latest user turn).
2. google/adk/models/gemini_context_cache_manager.py:312-324 (gate inside _create_new_cache_with_contents):
       if cacheable_contents_token_count < ctx_cache_config.min_tokens:
           logger.info("Previous request too small to cache: ...")
           return None
       if cacheable_contents_token_count < _GEMINI_MIN_CACHE_TOKENS:
           logger.info("Token count below Gemini minimum: ...")
           return None
   This uses the full prompt count from step 1 as the gate.
3. google/adk/models/gemini_context_cache_manager.py:372-441 (_create_gemini_cache):
       cache_contents = llm_request.contents[:cache_contents_count]  # line 388
       ...
       cache_config = CreateCachedContentConfig(
           system_instruction=llm_request.config.system_instruction,
           tools=llm_request.config.tools,
           contents=cache_contents,
           ...
       )
       await self.genai_client.aio.caches.create(model=..., config=cache_config)  # line 423
   The actual caches.create() call only sends the prefix slice. The server measures that prefix, finds it < 4096, and rejects with 400.
4. On failure (gemini_context_cache_manager.py:~133), the manager returns:
       CacheMetadata(fingerprint=..., contents_count=cache_contents_count)
   with no cache_name. The next request matches the same fingerprint, sees no cache_name, and goes back through _create_new_cache_with_contents: same gate, same failure.
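The walkthrough above can be condensed into a small standalone simulation. This is only a sketch of the reported behaviour, not the ADK code: CacheMetadata is a simplified stand-in, and server_create_cache and handle_request are hypothetical helpers that mimic the gate and the caches.create() call. The token counts (6000 for the full prompt, 1820 for the prefix) are illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional

GEMINI_MIN_CACHE_TOKENS = 4096  # minimum quoted in the server error message


@dataclass
class CacheMetadata:
    """Simplified stand-in for the manager's metadata (hypothetical shape)."""
    fingerprint: str
    contents_count: int
    cache_name: Optional[str] = None


def server_create_cache(prefix_tokens: int) -> str:
    """Stand-in for caches.create(): the server measures only the prefix."""
    if prefix_tokens < GEMINI_MIN_CACHE_TOKENS:
        raise ValueError(
            f"400 INVALID_ARGUMENT: The cached content is of {prefix_tokens} tokens."
        )
    return "cachedContents/example"


def handle_request(
    full_prompt_tokens: int, prefix_tokens: int, attempts: List[int]
) -> CacheMetadata:
    """Mirrors the reported flow: gate on the full count, create with the prefix."""
    meta = CacheMetadata(fingerprint="abc", contents_count=3)
    if full_prompt_tokens < GEMINI_MIN_CACHE_TOKENS:
        return meta  # gate rejects, so no create attempt is made
    attempts.append(prefix_tokens)
    try:
        meta.cache_name = server_create_cache(prefix_tokens)
    except ValueError:
        pass  # failure is only logged; cache_name stays None
    return meta


# Five consecutive requests with the same fingerprint: the gate passes every
# time (6000 >= 4096), the server rejects every time (1820 < 4096).
attempts: List[int] = []
results = [handle_request(6000, 1820, attempts) for _ in range(5)]
print(len(attempts), [m.cache_name for m in results])
```

Every request triggers a create attempt and none of them ever yields a cache_name, which matches the zero hit rate described in the environment notes.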
Observed log spam (production)
WARNING Failed to create cache: 400 INVALID_ARGUMENT.
{'error': {'code': 400, 'message': 'The cached content is of 1820 tokens. The minimum token count to start caching is 4096.', 'status': 'INVALID_ARGUMENT'}}
Token counts in real failures: 1820, 1857, 1991, 1829 — all well below 4096 because the prefix (system_instruction + tools + few-content sub-agents) is small even when the full chat history pushes the gate's measurement above 4096.
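For anyone who wants to confirm the same pattern in their own logs, a short parser can extract the two numbers from the warning. This is a minimal sketch that assumes the message format is exactly the one quoted above; parse_cache_rejection is a hypothetical helper name.

```python
import re
from typing import Optional, Tuple

# Pattern for the server message quoted in the warning above.
_REJECTION = re.compile(
    r"The cached content is of (\d+) tokens\. "
    r"The minimum token count to start caching is (\d+)\."
)


def parse_cache_rejection(line: str) -> Optional[Tuple[int, int]]:
    """Return (prefix_tokens, minimum_tokens), or None if the line does not match."""
    match = _REJECTION.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = (
    "{'error': {'code': 400, 'message': 'The cached content is of 1820 tokens. "
    "The minimum token count to start caching is 4096.', "
    "'status': 'INVALID_ARGUMENT'}}"
)
print(parse_cache_rejection(sample))  # (1820, 4096)
```

If the first number is consistently far below the second while the previous response's prompt_token_count is above it, the gate and the create payload are being measured differently.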
Suggested fix
The gate must be measured on the same payload caches.create() will receive. Two options:
- Estimate prefix tokens at gate time. Compute the system_instruction + tools + contents[:cache_contents_count] token count locally and compare it against min_tokens / _GEMINI_MIN_CACHE_TOKENS. This avoids the extra count_tokens API round trip if you can use the on-device tokenizer; otherwise one client.models.count_tokens call is cheap relative to letting the request fail at the server.
- Cache the negative result by fingerprint. Even with the gate fix, transient API rejections still happen; refusing to retry the same fingerprint for ttl_seconds would prevent the infinite-loop log flood.
Either fix on its own breaks the loop; both together would be ideal.
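A rough sketch of how the two options could fit together is shown below. It is not a patch against the ADK source: NegativeCache and maybe_create_cache are hypothetical names, prefix_tokens is assumed to have been measured on the same payload caches.create() will receive (by a local tokenizer or a count_tokens call), and create stands in for the real API call.

```python
import time
from typing import Callable, Dict, Optional

GEMINI_MIN_CACHE_TOKENS = 4096  # minimum quoted in the server error message


class NegativeCache:
    """Remembers fingerprints whose cache creation recently failed."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._failed_at: Dict[str, float] = {}

    def mark_failed(self, fingerprint: str) -> None:
        self._failed_at[fingerprint] = self._clock()

    def should_skip(self, fingerprint: str) -> bool:
        failed = self._failed_at.get(fingerprint)
        if failed is None:
            return False
        if self._clock() - failed >= self._ttl:
            del self._failed_at[fingerprint]  # entry expired, allow a retry
            return False
        return True


def maybe_create_cache(
    fingerprint: str,
    prefix_tokens: int,
    min_tokens: int,
    create: Callable[[], str],
    negative: NegativeCache,
) -> Optional[str]:
    """Gate on the prefix token count and skip fingerprints that failed recently."""
    if negative.should_skip(fingerprint):
        return None
    if prefix_tokens < max(min_tokens, GEMINI_MIN_CACHE_TOKENS):
        return None  # the prefix alone is too small, so no API call is made
    try:
        return create()
    except Exception:  # a real implementation would catch the client's error type
        negative.mark_failed(fingerprint)
        return None


# Usage with a fake clock and a create() that always fails.
calls = []
now = [0.0]


def failing_create() -> str:
    calls.append(1)
    raise RuntimeError("400 INVALID_ARGUMENT")


neg = NegativeCache(ttl_seconds=1800, clock=lambda: now[0])
maybe_create_cache("abc", 1820, 4096, failing_create, neg)  # gated, no call
maybe_create_cache("abc", 5000, 4096, failing_create, neg)  # one failed call
maybe_create_cache("abc", 5000, 4096, failing_create, neg)  # skipped
print(len(calls))  # 1
```

The prefix gate removes the deterministic failures, and the negative cache bounds retries to one per fingerprint per ttl_seconds for whatever still fails at the server.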
Workaround for users
Set App.context_cache_config = None to disable caching entirely until a fix lands. There is no per-agent disable in 1.34.1 — App.context_cache_config is global (llm_agent.py:266), and the only safe workaround is full disable.