Skip to content

feat: prompt prefix caching#279

Open
interview-arch wants to merge 1 commit into
mainfrom
feat/prompt-prefix-caching
Open

feat: prompt prefix caching#279
interview-arch wants to merge 1 commit into
mainfrom
feat/prompt-prefix-caching

Conversation

@interview-arch
Copy link
Copy Markdown

No description provided.

Copy link
Copy Markdown

@doubleword-code doubleword-code Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

This PR implements prompt-prefix cache tracking - a feature that measures how many leading tokens of a request body are shared with previously seen requests in the same (created_by, model) scope. The implementation adds:

  1. New prompt_cache_blocks table storing content-addressed cumulative-prefix blocks
  2. New requests.cached_tokens column for per-request cached token counts
  3. Core logic in src/prefix_cache.rs using chained UUID v5 hashes to detect shared prefixes
  4. Integration across all three ingestion paths (realtime inline, flex/batch via daemon)
  5. Per-user and per-model metrics/gauges with periodic summary logging
  6. TTL-based eviction with route-specific overrides
  7. Comprehensive integration tests

The feature is explicitly tracking-only - no billing discount is applied yet (discount_rate is stored but unused).

Verdict: Needs changes - there are several correctness and design issues that should be addressed before merging.

Research notes

  • UUID v5: Uses SHA-1 based name-based UUIDs per RFC 9562. The chained hash approach (acc = Uuid::new_v5(&acc, block.as_bytes())) correctly ensures that block i's hash depends on all preceding blocks, so matching at block i implies the entire leading run matched.
  • Token counting approximation: The chars / chars_per_token approach is documented as a placeholder. OpenAI's actual prefix caching uses BPE tokenization - this approximation could diverge significantly for non-ASCII text or languages with different character-to-token ratios.
  • OWASP considerations: The created_by field is used as a cache scope key without validation. If this field can contain user-controlled input, ensure it's properly sanitized to prevent cache poisoning across tenant boundaries.

Suggested next steps

  1. Blocking: Fix the Duration::ZERO edge case in cutoff_for() - subtracting zero duration should not cause issues, but the comment about "lazy filter" needs verification.
  2. Blocking: Add validation for empty created_by in record_and_count() beyond just the enabled check.
  3. Non-blocking: Consider documenting the token approximation limitations more prominently.
  4. Non-blocking: Evaluate whether max_body_chars: 256KB default is appropriate for typical LLM payloads.

General findings

Correctness concerns

  1. Chain hash property verification: The chained hash design is sound, but the test shared_prefix_shares_leading_hashes_then_diverges only asserts that some prefix matches exist, not the exact count. Consider adding a test that verifies the exact number of matching blocks for known inputs.

  2. TTL edge case: When route_ttls contains Duration::ZERO, the cutoff_for() function computes now - Duration::ZERO = now, meaning only blocks with last_seen_at > now would match. Since newly inserted blocks have last_seen_at = NOW(), this creates a race condition where the behavior depends on sub-millisecond timing. The test route_ttl_controls_hits_and_purge uses Duration::ZERO and expects immediate expiry, which appears to work due to the SELECT-before-INSERT ordering, but this should be documented.

  3. Cross-route matching semantics: The cache matches across routes (paths) but stores the path from the first request to create a block. This is intentional per the docs, but the purge logic deletes by path with different TTLs. If a block is created on route A (TTL=1hr) then accessed on route B (TTL=0), the row retains path=A, so it won't be purged by route B's zero-TTL policy. This seems correct but warrants a clarifying comment.

Metric naming consistency

The metric fusillade_user_prefix_cache_hit_ratio is emitted as a gauge but represents a ratio (0.0-1.0). Ensure downstream dashboards expect a normalized ratio rather than a percentage.

Test quality

The integration tests are well-structured and avoid sleep()-based timing by using Duration::ZERO for deterministic TTL expiry. However, the tests use tokio::time::sleep(Duration::from_millis(50)) in polling loops - while not ideal, these are bounded by a 10-second timeout and serve a legitimate purpose (waiting for daemon processing). Consider using a tokio::sync::watch channel or similar for more deterministic completion detection if test flakiness becomes an issue.

Database migration ordering

Two migrations are added:

  • 20260601000000_add_cached_tokens_to_requests (adds column)
  • 20260601000001_add_prompt_cache_blocks (creates table)

The ordering is correct - the column must exist before any code tries to read/write it. However, note that deploying the migrations without the application code will result in queries failing (the column exists but no code populates it). This is acceptable for a tracking-only feature but should be noted in deployment docs.

Concurrency safety

The prefix_cache_user_stats DashMap uses Ordering::Relaxed for atomic operations. This is appropriate here because:

  • Each user's counters are independent (no cross-user ordering requirements)
  • The periodic flush reads each counter atomically via swap(0, Relaxed)
  • There's no synchronization dependency between lookups/hits/cached_tokens counters

However, if future requirements add cross-counter consistency checks, consider upgrading to Ordering::SeqCst.

@pjb157 pjb157 marked this pull request as ready for review June 1, 2026 17:47
Copilot AI review requested due to automatic review settings June 1, 2026 17:47
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a tracking-only “prompt prefix cache” mechanism to estimate and record how many prompt tokens were likely served from an upstream KV cache, surfacing this in request/batch read APIs and emitting metrics/log summaries.

Changes:

  • Introduces prefix_cache module + DB schema (prompt_cache_blocks, requests.cached_tokens) to compute and persist cached-prefix token counts.
  • Integrates cache accounting into realtime ingestion and daemon dispatch (flex/batch), plus periodic purge + per-user summary logging.
  • Extends public read surfaces (RequestDetail, BatchStatus) and adds integration tests covering scope/TTL/purge behavior.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/prefix_cache.rs New integration tests for cache scoping, daemon-recorded hits, batch aggregates, and TTL/purge behavior.
src/request/query.rs Extends RequestDetail to include cached_tokens.
src/prefix_cache.rs Implements hashing, TTL logic, record+count, purge, and basic unit tests for the prefix-cache mechanism.
src/manager/postgres.rs Persists cached_tokens, emits metrics, adds per-user accumulators and batch-status aggregates, and implements purge/summary hooks.
src/manager/mod.rs Adds trait hooks for daemon-time cache accounting and cache purge/summary (default no-ops).
src/lib.rs Exposes the new prefix_cache module and config/stats types publicly.
src/daemon/mod.rs Calls prefix-cache accounting at dispatch time and hooks cache purge + periodic per-user summary into daemon tasks.
src/batch/mod.rs Extends BatchStatus with cached-requests/tokens aggregates.
migrations/20260601000001_add_prompt_cache_blocks.up.sql Adds prompt_cache_blocks table + purge index.
migrations/20260601000001_add_prompt_cache_blocks.down.sql Drops prompt_cache_blocks and its index.
migrations/20260601000000_add_cached_tokens_to_requests.up.sql Adds requests.cached_tokens column + comment.
migrations/20260601000000_add_cached_tokens_to_requests.down.sql Drops requests.cached_tokens.
Cargo.toml Enables UUID v5 feature for prefix hashing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/prefix_cache.rs
Comment on lines +226 to +227
let cached_tokens = (live_blocks as usize * cfg.block_tokens)
.min(approx_token_count(body, cfg.chars_per_token));
Comment on lines +2 to +6
-- Tracking-only for now: records how many leading "tokens" of this request's
-- body matched a previously-seen request in the same (created_by, model, path)
-- scope. No discount is applied yet. Populated at dispatch time (inline for
-- realtime, in the daemon process task for flex/batch), so it defaults to 0
-- and is filled in as requests are processed.
Comment on lines +11 to +12
COMMENT ON COLUMN requests.cached_tokens IS
'Approximate number of leading prompt tokens served from the prefix cache for this request (shared with a prior request in the same user/model/route scope). Tracking-only; no billing discount applied.';
Comment thread tests/prefix_cache.rs
Comment on lines +5 to +6
//! per-batch read surface, and route-TTL eviction. No `sleep`-based timing — TTL
//! behaviour is exercised with a zero-lifetime route so expiry is deterministic.
Comment thread src/daemon/mod.rs
Comment on lines +711 to +722
// Evict expired prompt-prefix cache blocks on the same cadence.
match storage.purge_prefix_cache().await {
Ok(deleted) if deleted > 0 => {
counter!("fusillade_prefix_cache_rows_purged_total").increment(deleted);
tracing::debug!(deleted, "Purged expired prefix cache blocks");
}
Ok(_) => {}
Err(e) => {
counter!("fusillade_purge_errors_total").increment(1);
tracing::error!(error = %e, "Failed to purge prefix cache");
}
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants