perf(llm): prime prompt cache and disable slot-fill reasoning#111

Open
jdsika wants to merge 1 commit into main from perf/llm-cache-priming-and-disable-reasoning

@jdsika jdsika commented Jul 3, 2026

Contributor

Summary

Two measured speedups for the Copilot slot-filling path, from a deep latency investigation. Every search is a single LLM turn over a ~100k-token system prompt (the embedded SHACL), so the levers are cache warmth, reasoning overhead, and output generation.

Investigation (measured, per-turn token/timing)

| Case | Wall | LLM dur | cacheRead | cacheWrite | out | reason |
|---|---|---|---|---|---|---|
| Cold cache | 21.6s | 19.5s | 0 | 103960 | 1062 | 356 |
| Warm, small output | 10.5s | 9.9s | 103562 | 403 | 599 | 175 |
| Warm, large output | 33.3s | 32.7s | 103562 | 402 | 1893 | 406 |

Bottlenecks: (1) cold-cache prefill (roughly 10-12s extra, hit on the first query and on any query after an idle period, because warmup never ran inference); (2) output generation (~70 tok/s; largely irreducible, since mappedTerms feeds the editable-filter refinement UI).

Changes

1. perf(llm): prime the prompt cache

Warmup now runs one throwaway slot-fill in the background (non-blocking — readiness is not delayed) so the backend writes the ~100k-token system-prompt prefix to its cache. A 4-min keep-alive re-primes ahead of the ~5-min cache TTL so idle deployments stay warm. Confirmed live: the first real query landed warm (cacheRead=103561) instead of cold.
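The priming and keep-alive behaviour described above can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: only the name primeCacheOnce appears in the description, while PrimeFn, startCachePriming, and KEEP_ALIVE_MS are hypothetical names, and the priming call itself is passed in as a callback because the real session API is not shown here. It assumes a Node.js runtime for the unref'd timer.

```typescript
// Hypothetical callback type: one throwaway slot-fill round-trip against the backend.
type PrimeFn = () => Promise<void>;

// Assumed interval: 4 minutes, chosen to fire ahead of the ~5-minute cache TTL.
const KEEP_ALIVE_MS = 4 * 60 * 1000;

/** Runs one priming round-trip. Best-effort: errors are swallowed, never thrown. */
async function primeCacheOnce(prime: PrimeFn): Promise<boolean> {
  try {
    await prime();
    return true;
  } catch {
    return false; // a failed prime only means the next real query may land cold
  }
}

/**
 * Starts priming in the background and re-primes on an interval.
 * Returns a function that stops the keep-alive.
 */
function startCachePriming(
  prime: PrimeFn,
  intervalMs: number = KEEP_ALIVE_MS,
): () => void {
  // Fire and forget: the caller (e.g. warmup) is not blocked on the round-trip.
  void primeCacheOnce(prime);

  const timer = setInterval(() => {
    void primeCacheOnce(prime);
  }, intervalMs);

  // Node.js timers expose unref(), so the keep-alive does not hold the process open.
  // The guard keeps the sketch valid where timers are plain numbers.
  (timer as unknown as { unref?: () => void }).unref?.();

  return () => clearInterval(timer);
}
```

Because the initial call is not awaited, readiness does not depend on the priming round-trip, and the returned stop function lets tests or shutdown hooks cancel the keep-alive.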

2. perf(llm): disable reasoning for slot-filling

Slot-filling is deterministic (fill slots from SHACL, validated by the SHACL gate downstream), so chain-of-thought only adds latency + output tokens. reasoningEffort is set to 'none' on reasoning-capable Copilot models. Confirmed live: reason=0 (was 175-406 tokens/call). The SDK's exported ReasoningEffort union omits "none", but the session.create wire schema documents "none" as disabling reasoning, so the adapter forwards it with a documented cast.
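A minimal sketch of the mapping and the cast, under stated assumptions: SdkReasoningEffort below is a stand-in for the SDK's exported union (its real members are not shown in this PR), and isReasoningCapable, reasoningEffortFor, and toSdkReasoningEffort are hypothetical names. The model-name pattern is only a guess based on the "copilot 4.6/opus" wording in the test notes.

```typescript
// Stand-in for the SDK's exported union, which omits "none".
// The actual members are an assumption for illustration.
type SdkReasoningEffort = "low" | "medium" | "high";

// Policy-level type: what the adapter wants to express, including "none".
type PolicyReasoningEffort = SdkReasoningEffort | "none";

// Hypothetical capability check; the real policy may key on different model ids.
function isReasoningCapable(provider: string, model: string): boolean {
  return provider === "copilot" && /(4\.6|opus)/.test(model);
}

/** Returns "none" for reasoning-capable Copilot models, null otherwise. */
function reasoningEffortFor(
  provider: string,
  model: string,
): PolicyReasoningEffort | null {
  return isReasoningCapable(provider, model) ? "none" : null;
}

/**
 * Narrows the policy value to the SDK's type for the session.create call.
 * The cast is deliberate: the wire schema documents "none" as disabling
 * reasoning even though the exported type does not list it.
 */
function toSdkReasoningEffort(
  effort: PolicyReasoningEffort | null,
): SdkReasoningEffort | undefined {
  if (effort === null) return undefined; // leave the field unset
  return effort as SdkReasoningEffort;
}
```

Keeping the cast in one small function means that if the SDK later adds "none" to its union, the cast can be dropped in a single place.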

Measured result: a simple first-after-warmup query dropped from ~21s (cold) to ~15.6s (warm, reason=0).

Tests

  • New agent-policy.test.ts: pins the reasoningEffort mapping (none for copilot 4.6/opus, null otherwise).
  • New priming regression test in copilot-agent-boundary.test.ts: primeCacheOnce issues a real round-trip and is best-effort (never throws).
  • pnpm run validate green: typecheck + lint + format + all unit tests (llm 122).

Tradeoffs / notes

  • reasoningEffort: 'none' is a slight accuracy tradeoff on ambiguous cross-reference queries (the SHACL gate cushions it); it is a one-line revert.
  • The keep-alive adds a small recurring background call every 4 min (unref'd, best-effort, errors swallowed).

Follow-ups surfaced by the investigation (not in this PR)

  • Pool exhaustion → slow inline createSession (~6-14s) under rapid/burst load: instrumenting showed a complex query's 33.8s was ~19.5s LLM + ~14s session-acquire on an empty pool. Worth raising POOL_SIZE / decoupling priming from the pool.
  • Faster model (e.g. claude-haiku-4.5) to attack the ~15-20s output-generation floor, with an accuracy A/B.
  • Shrink the system prompt (strip SHACL boilerplate; openlabel+v2 alone are ~30% of the 298KB) to cut cold prefill and cache-write cost.

Two measured speedups for the Copilot slot-filling path.

1. Prompt-cache priming. Every request runs a single LLM turn over a ~100k-token
   system prompt (the embedded SHACL). The backend caches that prefix, but
   warmup only CREATED sessions — it never ran inference, so the cache stayed
   cold until the first real query (measured +10-12s cold prefill), and the
   ~5-min cache TTL meant any idle period re-triggered it. Warmup now runs one
   throwaway slot-fill in the background (non-blocking, so readiness is not
   delayed) to write the cache, plus a 4-min keep-alive to beat the TTL. First
   and post-idle queries now land warm.

2. Disable reasoning for slot-filling. The task is deterministic (fill slots
   from SHACL, validated by the SHACL gate downstream), so chain-of-thought only
   adds latency and output tokens. Set reasoningEffort to "none" on
   reasoning-capable Copilot models (confirmed reason=0 live). The SDK's exported
   ReasoningEffort union omits "none"; the session.create wire schema documents
   "none" as disabling reasoning, so the adapter forwards it with a cast.

Measured: a simple first-after-warmup query dropped from ~21s (cold) to ~15.6s
(warm, reason=0).

Tests: new agent-policy.test.ts pins the reasoningEffort mapping; a priming
regression test in copilot-agent-boundary.test.ts asserts primeCacheOnce issues
a round-trip and is best-effort.

Signed-off-by: Carlo van Driesten <carlo.van-driesten@bmw.de>