feat: prompt prefix caching by interview-arch · Pull Request #279 · doublewordai/fusillade

interview-arch · 2026-06-01T16:44:47Z

No description provided.

doubleword-code

Summary

This PR implements prompt-prefix cache tracking - a feature that measures how many leading tokens of a request body are shared with previously seen requests in the same (created_by, model) scope. The implementation adds:

New prompt_cache_blocks table storing content-addressed cumulative-prefix blocks
New requests.cached_tokens column for per-request cached token counts
Core logic in src/prefix_cache.rs using chained UUID v5 hashes to detect shared prefixes
Integration across all three ingestion paths (realtime inline, flex/batch via daemon)
Per-user and per-model metrics/gauges with periodic summary logging
TTL-based eviction with route-specific overrides
Comprehensive integration tests

The feature is explicitly tracking-only - no billing discount is applied yet (discount_rate is stored but unused).

Verdict: Needs changes - there are several correctness and design issues that should be addressed before merging.

Research notes

UUID v5: Uses SHA-1 based name-based UUIDs per RFC 9562. The chained hash approach (acc = Uuid::new_v5(&acc, block.as_bytes())) correctly ensures that block i's hash depends on all preceding blocks, so matching at block i implies the entire leading run matched.
Token counting approximation: The chars / chars_per_token approach is documented as a placeholder. OpenAI's actual prefix caching uses BPE tokenization - this approximation could diverge significantly for non-ASCII text or languages with different character-to-token ratios.
OWASP considerations: The created_by field is used as a cache scope key without validation. If this field can contain user-controlled input, ensure it's properly sanitized to prevent cache poisoning across tenant boundaries.

Suggested next steps

Blocking: Fix the Duration::ZERO edge case in cutoff_for() - subtracting zero duration should not cause issues, but the comment about "lazy filter" needs verification.
Blocking: Add validation for empty created_by in record_and_count() beyond just the enabled check.
Non-blocking: Consider documenting the token approximation limitations more prominently.
Non-blocking: Evaluate whether max_body_chars: 256KB default is appropriate for typical LLM payloads.

General findings

Correctness concerns

Chain hash property verification: The chained hash design is sound, but the test shared_prefix_shares_leading_hashes_then_diverges only asserts that some prefix matches exist, not the exact count. Consider adding a test that verifies the exact number of matching blocks for known inputs.
TTL edge case: When route_ttls contains Duration::ZERO, the cutoff_for() function computes now - Duration::ZERO = now, meaning only blocks with last_seen_at > now would match. Since newly inserted blocks have last_seen_at = NOW(), this creates a race condition where the behavior depends on sub-millisecond timing. The test route_ttl_controls_hits_and_purge uses Duration::ZERO and expects immediate expiry, which appears to work due to the SELECT-before-INSERT ordering, but this should be documented.
Cross-route matching semantics: The cache matches across routes (paths) but stores the path from the first request to create a block. This is intentional per the docs, but the purge logic deletes by path with different TTLs. If a block is created on route A (TTL=1hr) then accessed on route B (TTL=0), the row retains path=A, so it won't be purged by route B's zero-TTL policy. This seems correct but warrants a clarifying comment.

Metric naming consistency

The metric fusillade_user_prefix_cache_hit_ratio is emitted as a gauge but represents a ratio (0.0-1.0). Ensure downstream dashboards expect a normalized ratio rather than a percentage.

Test quality

The integration tests are well-structured and avoid sleep()-based timing by using Duration::ZERO for deterministic TTL expiry. However, the tests use tokio::time::sleep(Duration::from_millis(50)) in polling loops - while not ideal, these are bounded by a 10-second timeout and serve a legitimate purpose (waiting for daemon processing). Consider using a tokio::sync::watch channel or similar for more deterministic completion detection if test flakiness becomes an issue.

Database migration ordering

Two migrations are added:

20260601000000_add_cached_tokens_to_requests (adds column)
20260601000001_add_prompt_cache_blocks (creates table)

The ordering is correct - the column must exist before any code tries to read/write it. However, note that deploying the migrations without the application code will result in queries failing (the column exists but no code populates it). This is acceptable for a tracking-only feature but should be noted in deployment docs.

Concurrency safety

The prefix_cache_user_stats DashMap uses Ordering::Relaxed for atomic operations. This is appropriate here because:

Each user's counters are independent (no cross-user ordering requirements)
The periodic flush reads each counter atomically via swap(0, Relaxed)
There's no synchronization dependency between lookups/hits/cached_tokens counters

However, if future requirements add cross-counter consistency checks, consider upgrading to Ordering::SeqCst.

Copilot

Pull request overview

Adds a tracking-only “prompt prefix cache” mechanism to estimate and record how many prompt tokens were likely served from an upstream KV cache, surfacing this in request/batch read APIs and emitting metrics/log summaries.

Changes:

Introduces prefix_cache module + DB schema (prompt_cache_blocks, requests.cached_tokens) to compute and persist cached-prefix token counts.
Integrates cache accounting into realtime ingestion and daemon dispatch (flex/batch), plus periodic purge + per-user summary logging.
Extends public read surfaces (RequestDetail, BatchStatus) and adds integration tests covering scope/TTL/purge behavior.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
tests/prefix_cache.rs	New integration tests for cache scoping, daemon-recorded hits, batch aggregates, and TTL/purge behavior.
src/request/query.rs	Extends `RequestDetail` to include `cached_tokens`.
src/prefix_cache.rs	Implements hashing, TTL logic, record+count, purge, and basic unit tests for the prefix-cache mechanism.
src/manager/postgres.rs	Persists `cached_tokens`, emits metrics, adds per-user accumulators and batch-status aggregates, and implements purge/summary hooks.
src/manager/mod.rs	Adds trait hooks for daemon-time cache accounting and cache purge/summary (default no-ops).
src/lib.rs	Exposes the new `prefix_cache` module and config/stats types publicly.
src/daemon/mod.rs	Calls prefix-cache accounting at dispatch time and hooks cache purge + periodic per-user summary into daemon tasks.
src/batch/mod.rs	Extends `BatchStatus` with cached-requests/tokens aggregates.
migrations/20260601000001_add_prompt_cache_blocks.up.sql	Adds `prompt_cache_blocks` table + purge index.
migrations/20260601000001_add_prompt_cache_blocks.down.sql	Drops `prompt_cache_blocks` and its index.
migrations/20260601000000_add_cached_tokens_to_requests.up.sql	Adds `requests.cached_tokens` column + comment.
migrations/20260601000000_add_cached_tokens_to_requests.down.sql	Drops `requests.cached_tokens`.
Cargo.toml	Enables UUID v5 feature for prefix hashing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    let cached_tokens = (live_blocks as usize * cfg.block_tokens)
+        .min(approx_token_count(body, cfg.chars_per_token));


+-- Tracking-only for now: records how many leading "tokens" of this request's
+-- body matched a previously-seen request in the same (created_by, model, path)
+-- scope. No discount is applied yet. Populated at dispatch time (inline for
+-- realtime, in the daemon process task for flex/batch), so it defaults to 0
+-- and is filled in as requests are processed.


+COMMENT ON COLUMN requests.cached_tokens IS
+    'Approximate number of leading prompt tokens served from the prefix cache for this request (shared with a prior request in the same user/model/route scope). Tracking-only; no billing discount applied.';


+//! per-batch read surface, and route-TTL eviction. No `sleep`-based timing — TTL
+//! behaviour is exercised with a zero-lifetime route so expiry is deterministic.


+                    // Evict expired prompt-prefix cache blocks on the same cadence.
+                    match storage.purge_prefix_cache().await {
+                        Ok(deleted) if deleted > 0 => {
+                            counter!("fusillade_prefix_cache_rows_purged_total").increment(deleted);
+                            tracing::debug!(deleted, "Purged expired prefix cache blocks");
+                        }
+                        Ok(_) => {}
+                        Err(e) => {
+                            counter!("fusillade_purge_errors_total").increment(1);
+                            tracing::error!(error = %e, "Failed to purge prefix cache");
+                        }
+                    }


progress

70f873a

doubleword-code Bot reviewed Jun 1, 2026

View reviewed changes

pjb157 marked this pull request as ready for review June 1, 2026 17:47

Copilot AI review requested due to automatic review settings June 1, 2026 17:47

Copilot started reviewing on behalf of pjb157 June 1, 2026 17:47 View session

Copilot AI reviewed Jun 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: prompt prefix caching#279

feat: prompt prefix caching#279
interview-arch wants to merge 1 commit into
mainfrom
feat/prompt-prefix-caching

interview-arch commented Jun 1, 2026

Uh oh!

doubleword-code Bot left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		let cached_tokens = (live_blocks as usize * cfg.block_tokens)
		.min(approx_token_count(body, cfg.chars_per_token));

		COMMENT ON COLUMN requests.cached_tokens IS
		'Approximate number of leading prompt tokens served from the prefix cache for this request (shared with a prior request in the same user/model/route scope). Tracking-only; no billing discount applied.';

		//! per-batch read surface, and route-TTL eviction. No `sleep`-based timing — TTL
		//! behaviour is exercised with a zero-lifetime route so expiry is deterministic.

Conversation

interview-arch commented Jun 1, 2026

Uh oh!

doubleword-code Bot left a comment

Choose a reason for hiding this comment

Summary

Research notes

Suggested next steps

General findings

Correctness concerns

Metric naming consistency

Test quality

Database migration ordering

Concurrency safety

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants