feat: prompt prefix caching#279
Conversation
There was a problem hiding this comment.
Summary
This PR implements prompt-prefix cache tracking - a feature that measures how many leading tokens of a request body are shared with previously seen requests in the same (created_by, model) scope. The implementation adds:
- New
prompt_cache_blockstable storing content-addressed cumulative-prefix blocks - New
requests.cached_tokenscolumn for per-request cached token counts - Core logic in
src/prefix_cache.rsusing chained UUID v5 hashes to detect shared prefixes - Integration across all three ingestion paths (realtime inline, flex/batch via daemon)
- Per-user and per-model metrics/gauges with periodic summary logging
- TTL-based eviction with route-specific overrides
- Comprehensive integration tests
The feature is explicitly tracking-only - no billing discount is applied yet (discount_rate is stored but unused).
Verdict: Needs changes - there are several correctness and design issues that should be addressed before merging.
Research notes
- UUID v5: Uses SHA-1 based name-based UUIDs per RFC 9562. The chained hash approach (
acc = Uuid::new_v5(&acc, block.as_bytes())) correctly ensures that blocki's hash depends on all preceding blocks, so matching at blockiimplies the entire leading run matched. - Token counting approximation: The
chars / chars_per_tokenapproach is documented as a placeholder. OpenAI's actual prefix caching uses BPE tokenization - this approximation could diverge significantly for non-ASCII text or languages with different character-to-token ratios. - OWASP considerations: The
created_byfield is used as a cache scope key without validation. If this field can contain user-controlled input, ensure it's properly sanitized to prevent cache poisoning across tenant boundaries.
Suggested next steps
- Blocking: Fix the
Duration::ZEROedge case incutoff_for()- subtracting zero duration should not cause issues, but the comment about "lazy filter" needs verification. - Blocking: Add validation for empty
created_byinrecord_and_count()beyond just the enabled check. - Non-blocking: Consider documenting the token approximation limitations more prominently.
- Non-blocking: Evaluate whether
max_body_chars: 256KBdefault is appropriate for typical LLM payloads.
General findings
Correctness concerns
-
Chain hash property verification: The chained hash design is sound, but the test
shared_prefix_shares_leading_hashes_then_divergesonly asserts that some prefix matches exist, not the exact count. Consider adding a test that verifies the exact number of matching blocks for known inputs. -
TTL edge case: When
route_ttlscontainsDuration::ZERO, thecutoff_for()function computesnow - Duration::ZERO = now, meaning only blocks withlast_seen_at > nowwould match. Since newly inserted blocks havelast_seen_at = NOW(), this creates a race condition where the behavior depends on sub-millisecond timing. The testroute_ttl_controls_hits_and_purgeusesDuration::ZEROand expects immediate expiry, which appears to work due to the SELECT-before-INSERT ordering, but this should be documented. -
Cross-route matching semantics: The cache matches across routes (paths) but stores the path from the first request to create a block. This is intentional per the docs, but the purge logic deletes by path with different TTLs. If a block is created on route A (TTL=1hr) then accessed on route B (TTL=0), the row retains path=A, so it won't be purged by route B's zero-TTL policy. This seems correct but warrants a clarifying comment.
Metric naming consistency
The metric fusillade_user_prefix_cache_hit_ratio is emitted as a gauge but represents a ratio (0.0-1.0). Ensure downstream dashboards expect a normalized ratio rather than a percentage.
Test quality
The integration tests are well-structured and avoid sleep()-based timing by using Duration::ZERO for deterministic TTL expiry. However, the tests use tokio::time::sleep(Duration::from_millis(50)) in polling loops - while not ideal, these are bounded by a 10-second timeout and serve a legitimate purpose (waiting for daemon processing). Consider using a tokio::sync::watch channel or similar for more deterministic completion detection if test flakiness becomes an issue.
Database migration ordering
Two migrations are added:
20260601000000_add_cached_tokens_to_requests(adds column)20260601000001_add_prompt_cache_blocks(creates table)
The ordering is correct - the column must exist before any code tries to read/write it. However, note that deploying the migrations without the application code will result in queries failing (the column exists but no code populates it). This is acceptable for a tracking-only feature but should be noted in deployment docs.
Concurrency safety
The prefix_cache_user_stats DashMap uses Ordering::Relaxed for atomic operations. This is appropriate here because:
- Each user's counters are independent (no cross-user ordering requirements)
- The periodic flush reads each counter atomically via
swap(0, Relaxed) - There's no synchronization dependency between lookups/hits/cached_tokens counters
However, if future requirements add cross-counter consistency checks, consider upgrading to Ordering::SeqCst.
There was a problem hiding this comment.
Pull request overview
Adds a tracking-only “prompt prefix cache” mechanism to estimate and record how many prompt tokens were likely served from an upstream KV cache, surfacing this in request/batch read APIs and emitting metrics/log summaries.
Changes:
- Introduces
prefix_cachemodule + DB schema (prompt_cache_blocks,requests.cached_tokens) to compute and persist cached-prefix token counts. - Integrates cache accounting into realtime ingestion and daemon dispatch (flex/batch), plus periodic purge + per-user summary logging.
- Extends public read surfaces (
RequestDetail,BatchStatus) and adds integration tests covering scope/TTL/purge behavior.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/prefix_cache.rs | New integration tests for cache scoping, daemon-recorded hits, batch aggregates, and TTL/purge behavior. |
| src/request/query.rs | Extends RequestDetail to include cached_tokens. |
| src/prefix_cache.rs | Implements hashing, TTL logic, record+count, purge, and basic unit tests for the prefix-cache mechanism. |
| src/manager/postgres.rs | Persists cached_tokens, emits metrics, adds per-user accumulators and batch-status aggregates, and implements purge/summary hooks. |
| src/manager/mod.rs | Adds trait hooks for daemon-time cache accounting and cache purge/summary (default no-ops). |
| src/lib.rs | Exposes the new prefix_cache module and config/stats types publicly. |
| src/daemon/mod.rs | Calls prefix-cache accounting at dispatch time and hooks cache purge + periodic per-user summary into daemon tasks. |
| src/batch/mod.rs | Extends BatchStatus with cached-requests/tokens aggregates. |
| migrations/20260601000001_add_prompt_cache_blocks.up.sql | Adds prompt_cache_blocks table + purge index. |
| migrations/20260601000001_add_prompt_cache_blocks.down.sql | Drops prompt_cache_blocks and its index. |
| migrations/20260601000000_add_cached_tokens_to_requests.up.sql | Adds requests.cached_tokens column + comment. |
| migrations/20260601000000_add_cached_tokens_to_requests.down.sql | Drops requests.cached_tokens. |
| Cargo.toml | Enables UUID v5 feature for prefix hashing. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| let cached_tokens = (live_blocks as usize * cfg.block_tokens) | ||
| .min(approx_token_count(body, cfg.chars_per_token)); |
| -- Tracking-only for now: records how many leading "tokens" of this request's | ||
| -- body matched a previously-seen request in the same (created_by, model, path) | ||
| -- scope. No discount is applied yet. Populated at dispatch time (inline for | ||
| -- realtime, in the daemon process task for flex/batch), so it defaults to 0 | ||
| -- and is filled in as requests are processed. |
| COMMENT ON COLUMN requests.cached_tokens IS | ||
| 'Approximate number of leading prompt tokens served from the prefix cache for this request (shared with a prior request in the same user/model/route scope). Tracking-only; no billing discount applied.'; |
| //! per-batch read surface, and route-TTL eviction. No `sleep`-based timing — TTL | ||
| //! behaviour is exercised with a zero-lifetime route so expiry is deterministic. |
| // Evict expired prompt-prefix cache blocks on the same cadence. | ||
| match storage.purge_prefix_cache().await { | ||
| Ok(deleted) if deleted > 0 => { | ||
| counter!("fusillade_prefix_cache_rows_purged_total").increment(deleted); | ||
| tracing::debug!(deleted, "Purged expired prefix cache blocks"); | ||
| } | ||
| Ok(_) => {} | ||
| Err(e) => { | ||
| counter!("fusillade_purge_errors_total").increment(1); | ||
| tracing::error!(error = %e, "Failed to purge prefix cache"); | ||
| } | ||
| } |
No description provided.