perf(llm): prime prompt cache and disable slot-fill reasoning by jdsika · Pull Request #111 · ASCS-eV/ontology-based-nl-search

jdsika · 2026-07-03T13:58:17Z

Summary

Two measured speedups for the Copilot slot-filling path, from a deep latency investigation. Every search is a single LLM turn over a ~100k-token system prompt (the embedded SHACL), so the levers are cache warmth, reasoning overhead, and output generation.

Investigation (measured, per-turn token/timing)

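| Case | Wall | LLM dur | cacheRead | cacheWrite | out | reason |
| --- | --- | --- | --- | --- | --- | --- |
| Cold cache | 21.6s | 19.5s | 0 | 103960 | 1062 | 356 |
| Warm, small output | 10.5s | 9.9s | 103562 | 403 | 599 | 175 |
| Warm, large output | 33.3s | 32.7s | 103562 | 402 | 1893 | 406 |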

Bottlenecks: (1) cold-cache prefill (adds ~10-12s; hit on the first query and after idle periods because warmup never ran inference), (2) output generation (~70 tok/s; largely irreducible, because the mappedTerms output feeds the editable-filter refinement UI).
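As a rough cross-check of the output-generation rate, the two warm rows of the table can be turned into tokens per second. This is a back-of-envelope sketch, not part of the PR: it treats the whole LLM duration as generation time (ignoring the warm prefill), and because the table does not say whether `out` already includes the `reason` tokens, it computes both readings. The `Turn` type and `tokensPerSecond` helper are names invented for this illustration.

```typescript
// Figures copied from the warm rows of the measurement table above.
interface Turn {
  label: string;
  llmSeconds: number; // "LLM dur" column
  out: number;        // "out" column (tokens)
  reason: number;     // "reason" column (tokens)
}

const warmTurns: Turn[] = [
  { label: "Warm, small output", llmSeconds: 9.9, out: 599, reason: 175 },
  { label: "Warm, large output", llmSeconds: 32.7, out: 1893, reason: 406 },
];

// Hypothetical helper: average generation rate over the whole LLM call.
function tokensPerSecond(tokens: number, seconds: number): number {
  return tokens / seconds;
}

for (const t of warmTurns) {
  // Reading A: `out` is the full generated token count.
  const outOnly = tokensPerSecond(t.out, t.llmSeconds);
  // Reading B (assumption): `reason` tokens are generated in addition to `out`.
  const withReason = tokensPerSecond(t.out + t.reason, t.llmSeconds);
  console.log(
    `${t.label}: ${outOnly.toFixed(1)} tok/s (out), ` +
      `${withReason.toFixed(1)} tok/s (out + reason)`
  );
}
```

Under either reading the rate comes out in the tens of tokens per second, so wall time on a warm cache scales with how much the model has to write.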

Changes

1. `perf(llm)`: prime the prompt cache

Warmup now runs one throwaway slot-fill in the background (non-blocking — readiness is not delayed) so the backend writes the ~100k-token system-prompt prefix to its cache. A 4-min keep-alive re-primes ahead of the ~5-min cache TTL so idle deployments stay warm. Confirmed live: the first real query landed warm (cacheRead=103561) instead of cold.
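The priming pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the adapter's actual code: `primeCacheOnce` is named in the PR, but its signature here, the `SlotFill` callback type, and `startCachePriming` are hypothetical.

```typescript
// Hypothetical callback: performs one throwaway slot-fill round-trip.
type SlotFill = () => Promise<unknown>;

// 4 minutes, so the re-prime lands ahead of the ~5-min cache TTL.
const KEEP_ALIVE_MS = 4 * 60 * 1000;

/** Runs one priming round-trip. Best-effort: resolves to false instead of throwing. */
async function primeCacheOnce(slotFill: SlotFill): Promise<boolean> {
  try {
    await slotFill();
    return true;
  } catch {
    // Errors are swallowed; a failed prime only means the next query may run cold.
    return false;
  }
}

/** Starts background priming plus a keep-alive. Returns a function that stops the keep-alive. */
function startCachePriming(
  slotFill: SlotFill,
  intervalMs: number = KEEP_ALIVE_MS
): () => void {
  // Not awaited, so service readiness is not delayed by the first prime.
  void primeCacheOnce(slotFill);

  const timer = setInterval(() => {
    void primeCacheOnce(slotFill);
  }, intervalMs);

  // In Node, unref() stops the timer from keeping the process alive.
  // The optional call makes this a no-op where timers are plain numbers.
  (timer as unknown as { unref?: () => void }).unref?.();

  return () => clearInterval(timer);
}
```

Returning a boolean rather than throwing keeps the caller free of error handling, which matches the best-effort behaviour the regression test is said to pin.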

2. `perf(llm)`: disable reasoning for slot-filling

Slot-filling is deterministic (fill slots from SHACL, validated by the SHACL gate downstream), so chain-of-thought only adds latency + output tokens. reasoningEffort is set to 'none' on reasoning-capable Copilot models. Confirmed live: reason=0 (was 175-406 tokens/call). The SDK's exported ReasoningEffort union omits "none", but the session.create wire schema documents "none" as disabling reasoning, so the adapter forwards it with a documented cast.
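A sketch of the mapping and the cast, under loud assumptions: the `SdkReasoningEffort` union below is a stand-in for the SDK's exported type (its real members are not shown in this PR), the function names are invented, and the model-matching rule is a placeholder for whatever the adapter actually uses to detect reasoning-capable Copilot models.

```typescript
// Stand-in for the SDK's exported union, which per this PR omits "none".
// The members listed here are an assumption for illustration.
type SdkReasoningEffort = "low" | "medium" | "high";

interface SessionOptions {
  model: string;
  reasoningEffort?: SdkReasoningEffort;
}

// Hypothetical policy: "none" for reasoning-capable Copilot models, null otherwise.
// The regex is a placeholder, not the adapter's real capability check.
function slotFillReasoningEffort(provider: string, model: string): "none" | null {
  const reasoningCapable = provider === "copilot" && /opus|4\.6/.test(model);
  return reasoningCapable ? "none" : null;
}

// Builds the options passed to session creation.
function buildSessionOptions(provider: string, model: string): SessionOptions {
  const effort = slotFillReasoningEffort(provider, model);
  if (effort === null) {
    // Leave the field unset so the backend default applies.
    return { model };
  }
  // The wire schema accepts "none" even though the exported type does not,
  // so the value is forwarded through an explicit, commented cast.
  return { model, reasoningEffort: effort as unknown as SdkReasoningEffort };
}
```

Keeping the cast in one place means that if the SDK later adds "none" to its union, the workaround can be removed with a one-line change.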

Measured result: a simple first-after-warmup query dropped from ~21s (cold) to ~15.6s (warm, reason=0).

Tests

- New agent-policy.test.ts: pins the reasoningEffort mapping (none for copilot 4.6/opus, null otherwise).
- New priming regression test in copilot-agent-boundary.test.ts: primeCacheOnce issues a real round-trip and is best-effort (never throws).
- pnpm run validate green: typecheck + lint + format + all unit tests (llm 122).

Tradeoffs / notes

- reasoning:'none' is a slight accuracy tradeoff on ambiguous cross-reference queries (the SHACL gate cushions it); it is a one-line revert.
- The keep-alive adds a small recurring background call every 4 min (unref'd, best-effort, errors swallowed).

Follow-ups surfaced by the investigation (not in this PR)

- Pool exhaustion → slow inline createSession (~6-14s) under rapid/burst load: instrumenting showed a complex query's 33.8s was ~19.5s LLM + ~14s session-acquire on an empty pool. Worth raising POOL_SIZE / decoupling priming from the pool.
- Faster model (e.g. claude-haiku-4.5) to attack the ~15-20s output-generation floor, with an accuracy A/B.
- Shrink the system prompt (strip SHACL boilerplate; openlabel+v2 alone are ~30% of the 298KB) to cut cold prefill and cache-write cost.

Two measured speedups for the Copilot slot-filling path.

1. Prompt-cache priming. Every request runs a single LLM turn over a ~100k-token system prompt (the embedded SHACL). The backend caches that prefix, but warmup only CREATED sessions; it never ran inference, so the cache stayed cold until the first real query (measured +10-12s cold prefill), and the ~5-min cache TTL meant any idle period re-triggered it. Warmup now runs one throwaway slot-fill in the background (non-blocking, so readiness is not delayed) to write the cache, plus a 4-min keep-alive to beat the TTL. First and post-idle queries now land warm.

2. Disable reasoning for slot-filling. The task is deterministic (fill slots from SHACL, validated by the SHACL gate downstream), so chain-of-thought only adds latency and output tokens. Set reasoningEffort to "none" on reasoning-capable Copilot models (confirmed reason=0 live). The SDK's exported ReasoningEffort union omits "none"; the session.create wire schema documents "none" as disabling reasoning, so the adapter forwards it with a cast.

Measured: a simple first-after-warmup query dropped from ~21s (cold) to ~15.6s (warm, reason=0).

Tests: new agent-policy.test.ts pins the reasoningEffort mapping; a priming regression test in copilot-agent-boundary.test.ts asserts primeCacheOnce issues a round-trip and is best-effort.

Signed-off-by: Carlo van Driesten <carlo.van-driesten@bmw.de>

jdsika wants to merge 1 commit into main from perf/llm-cache-priming-and-disable-reasoning
