perf(llm): prime prompt cache and disable slot-fill reasoning#111
Open
jdsika wants to merge 1 commit into
Two measured speedups for the Copilot slot-filling path.

1. Prompt-cache priming. Every request runs a single LLM turn over a ~100k-token system prompt (the embedded SHACL). The backend caches that prefix, but warmup only created sessions and never ran inference, so the cache stayed cold until the first real query (measured +10-12s cold prefill), and the ~5-min cache TTL meant any idle period re-triggered it. Warmup now runs one throwaway slot-fill in the background (non-blocking, so readiness is not delayed) to write the cache, plus a 4-min keep-alive to beat the TTL. First and post-idle queries now land warm.

2. Disable reasoning for slot-filling. The task is deterministic (fill slots from SHACL, validated by the SHACL gate downstream), so chain-of-thought only adds latency and output tokens. Set reasoningEffort to "none" on reasoning-capable Copilot models (confirmed reason=0 live). The SDK's exported ReasoningEffort union omits "none"; the session.create wire schema documents "none" as disabling reasoning, so the adapter forwards it with a cast.

Measured: a simple first-after-warmup query dropped from ~21s (cold) to ~15.6s (warm, reason=0).

Tests: new agent-policy.test.ts pins the reasoningEffort mapping; a priming regression test in copilot-agent-boundary.test.ts asserts primeCacheOnce issues a round-trip and is best-effort.

Signed-off-by: Carlo van Driesten <carlo.van-driesten@bmw.de>
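The warmup behaviour described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's code: primeCacheOnce is named in the commit message, but its signature here, the SlotFill type, startCachePriming, and the "warmup" query string are all assumptions made for the example.

```typescript
// Hypothetical sketch: the signatures below are illustrative, not taken from the PR diff.
type SlotFill = (query: string) => Promise<unknown>;

// Re-prime every 4 minutes, ahead of the ~5-min cache TTL mentioned in the PR.
const KEEP_ALIVE_MS = 4 * 60 * 1000;

// Runs one throwaway slot-fill so the backend writes the system-prompt prefix
// to its cache. Best-effort: resolves to false on failure and never throws.
async function primeCacheOnce(slotFill: SlotFill): Promise<boolean> {
  try {
    await slotFill("warmup"); // assumed placeholder query
    return true;
  } catch {
    return false;
  }
}

// Kicks off priming without awaiting it, so readiness is not delayed, then
// schedules the keep-alive. Returns a function that stops the keep-alive.
function startCachePriming(
  slotFill: SlotFill,
  intervalMs: number = KEEP_ALIVE_MS,
): () => void {
  void primeCacheOnce(slotFill);
  const timer = setInterval(() => void primeCacheOnce(slotFill), intervalMs);
  return () => clearInterval(timer);
}
```

Because the first call is not awaited, a slow or failing prime cannot block startup; the cost is that a query arriving before the prime completes may still see a cold cache.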
Summary
Two measured speedups for the Copilot slot-filling path, from a deep latency investigation. Every search is a single LLM turn over a ~100k-token system prompt (the embedded SHACL), so the levers are cache warmth, reasoning overhead, and output generation.
Investigation (measured, per-turn token/timing)
Bottlenecks: (1) cold-cache prefill (~+10-12s, hit on first/idle query because warmup never ran inference), (2) output generation (~70 tok/s; largely irreducible: mappedTerms feeds the editable-filter refinement UI).
Changes
1. perf(llm): prime the prompt cache
Warmup now runs one throwaway slot-fill in the background (non-blocking; readiness is not delayed) so the backend writes the ~100k-token system-prompt prefix to its cache. A 4-min keep-alive re-primes ahead of the ~5-min cache TTL so idle deployments stay warm. Confirmed live: the first real query landed warm (cacheRead=103561) instead of cold.
2. perf(llm): disable reasoning for slot-filling
Slot-filling is deterministic (fill slots from SHACL, validated by the SHACL gate downstream), so chain-of-thought only adds latency and output tokens. reasoningEffort is set to 'none' on reasoning-capable Copilot models. Confirmed live: reason=0 (was 175-406 tokens/call). The SDK's exported ReasoningEffort union omits "none", but the session.create wire schema documents "none" as disabling reasoning, so the adapter forwards it with a documented cast.
Measured result: a simple first-after-warmup query dropped from ~21s (cold) to ~15.6s (warm, reason=0).
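A rough sketch of how the reasoningEffort mapping and the cast might fit together. Everything here is assumed for illustration: SdkReasoningEffort stands in for the SDK's exported union (its real members are not shown in this PR), and ModelInfo, reasoningEffortFor, SessionOptions, and buildSessionOptions are hypothetical names, not the adapter's actual API.

```typescript
// Stand-in for the SDK's exported union, which omits "none".
// The members listed here are an assumption for the example.
type SdkReasoningEffort = "low" | "medium" | "high";

// Policy-level value: "none" disables reasoning, null means "do not send the field".
type PolicyReasoningEffort = SdkReasoningEffort | "none" | null;

// Hypothetical model descriptor; the real policy may match on model ids instead.
interface ModelInfo {
  provider: string;
  supportsReasoning: boolean;
}

// Slot-filling policy: disable reasoning on reasoning-capable Copilot models.
function reasoningEffortFor(model: ModelInfo): PolicyReasoningEffort {
  return model.provider === "copilot" && model.supportsReasoning ? "none" : null;
}

// Hypothetical shape of the options passed when creating a session.
interface SessionOptions {
  model: string;
  reasoningEffort?: SdkReasoningEffort;
}

function buildSessionOptions(
  modelId: string,
  effort: PolicyReasoningEffort,
): SessionOptions {
  const options: SessionOptions = { model: modelId };
  if (effort !== null) {
    // Cast: the wire schema is described as accepting "none",
    // but the exported type does not include it.
    options.reasoningEffort = effort as SdkReasoningEffort;
  }
  return options;
}
```

Keeping the cast in one adapter function means that if the SDK later adds "none" to its union, only that line needs to change.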
Tests
agent-policy.test.ts: pins the reasoningEffort mapping (none for copilot 4.6/opus, null otherwise).
copilot-agent-boundary.test.ts: primeCacheOnce issues a real round-trip and is best-effort (never throws).
pnpm run validate green: typecheck + lint + format + all unit tests (llm 122).
Tradeoffs / notes
reasoning:'none' is a slight accuracy tradeoff on ambiguous cross-reference queries (the SHACL gate cushions it); it is a one-line revert.
Follow-ups surfaced by the investigation (not in this PR)
createSession (~6-14s) under rapid/burst load: instrumenting showed a complex query's 33.8s was ~19.5s LLM + ~14s session-acquire on an empty pool. Worth raising POOL_SIZE / decoupling priming from the pool.
claude-haiku-4.5) to attack the ~15-20s output-generation floor, with an accuracy A/B.
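The session-acquire versus LLM split quoted above comes from per-phase instrumentation. One way such timing could be collected is sketched below; timed, handleQuery, and the acquireSession and runSlotFill parameters are hypothetical names for this example, not the project's actual instrumentation.

```typescript
// Accumulated wall-clock milliseconds per phase name.
type PhaseTimings = Record<string, number>;

// Runs `work`, adding its elapsed time to timings[phase] even if it throws.
async function timed<T>(
  timings: PhaseTimings,
  phase: string,
  work: () => Promise<T>,
): Promise<T> {
  const start = Date.now();
  try {
    return await work();
  } finally {
    timings[phase] = (timings[phase] ?? 0) + (Date.now() - start);
  }
}

// Hypothetical query handler that separates pool wait time from LLM time.
async function handleQuery(
  acquireSession: () => Promise<string>,
  runSlotFill: (sessionId: string) => Promise<string>,
): Promise<{ result: string; timings: PhaseTimings }> {
  const timings: PhaseTimings = {};
  const sessionId = await timed(timings, "session-acquire", acquireSession);
  const result = await timed(timings, "llm", () => runSlotFill(sessionId));
  return { result, timings };
}
```

Reporting the two phases separately makes it visible when an empty pool, rather than the model, accounts for a slow request.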