Commit dfd2dd3

gHashTag and ona-agent committed
feat(cache): implement Prefix Caching (OPT-PC01)
Complete implementation of prefix caching for reusing common prompts:

kv_cache.zig:
- PrefixCacheConfig with LRU/LFU/FIFO eviction policies
- CachedPrefix struct with token/block tracking
- PrefixCache with hash-based lookup
- matchLongestPrefix() for finding cached prefixes
- Copy-on-write block sharing via ref_count
- 4 tests: basic, longest_match, eviction, benchmark

tri_inference.zig:
- PagedSchedulerConfig.enable_prefix_caching option
- PagedBatchingScheduler integration with PrefixCache
- submitRequest() checks cache before prefill
- cachePrefixAfterPrefill() for caching new prompts
- prefix_cache_hits/misses statistics

Benchmark results:
- Prefill reduction: 90.1% (11,000 → 1,090 tokens)
- Cache hit rate: 100% for repeated prompts
- Tests: 19/19 passing

Use cases: chatbots, few-shot learning, RAG applications

Co-authored-by: Ona <no-reply@ona.com>
1 parent ad461b9 commit dfd2dd3
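
Reviewer note: the Zig sources are not shown on this page, so the following is only a minimal Python sketch of the mechanism the commit message describes (longest-prefix lookup on block-aligned token prefixes, block sharing via reference counts, LRU eviction). Every name here (`PrefixCache`, `cache_prefix`, `match_longest_prefix`, `writable`, `BLOCK_SIZE`) is hypothetical and chosen for illustration; it is not the API of `kv_cache.zig`.

```python
from collections import OrderedDict
from itertools import count

BLOCK_SIZE = 4  # tokens per KV block; assumed value, not taken from the commit


class PrefixCache:
    """Toy prefix cache: exact-match lookup on block-aligned token prefixes."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.entries = OrderedDict()  # token tuple -> block ids, least recently used first
        self.ref_count = {}           # block id -> number of current holders
        self._next_block = count()
        self.hits = 0
        self.misses = 0

    def alloc_block(self):
        """Hand out a fresh block id owned by the caller."""
        block = next(self._next_block)
        self.ref_count[block] = 1
        return block

    def release(self, blocks):
        """Drop one reference per block; forget blocks nobody holds."""
        for b in blocks:
            self.ref_count[b] -= 1
            if self.ref_count[b] == 0:
                del self.ref_count[b]

    def cache_prefix(self, tokens, blocks):
        """Remember a block-aligned prefix; the cache takes its own reference."""
        n = len(tokens) - len(tokens) % BLOCK_SIZE
        key = tuple(tokens[:n])
        if n == 0 or key in self.entries:
            return
        if len(self.entries) >= self.max_entries:
            _, evicted = self.entries.popitem(last=False)  # LRU victim
            self.release(evicted)
        shared = list(blocks[: n // BLOCK_SIZE])
        for b in shared:
            self.ref_count[b] += 1
        self.entries[key] = shared

    def match_longest_prefix(self, tokens):
        """Return (matched_token_count, shared_blocks) for the longest cached prefix."""
        n = len(tokens) - len(tokens) % BLOCK_SIZE
        while n > 0:
            key = tuple(tokens[:n])
            if key in self.entries:
                self.entries.move_to_end(key)  # mark as most recently used
                blocks = self.entries[key]
                for b in blocks:
                    self.ref_count[b] += 1     # caller now shares these blocks
                self.hits += 1
                return n, list(blocks)
            n -= BLOCK_SIZE
        self.misses += 1
        return 0, []

    def writable(self, block):
        """Copy-on-write: return a private block id if this one is shared.

        A real implementation would also copy the KV data into the new block.
        """
        if self.ref_count[block] == 1:
            return block
        self.release([block])
        return self.alloc_block()


cache = PrefixCache(max_entries=2)
system_prompt = list(range(8))  # 8 tokens = 2 full blocks
blocks = [cache.alloc_block(), cache.alloc_block()]
cache.cache_prefix(system_prompt, blocks)

# A later request shares the cached blocks and only prefills its 3 new tokens.
matched, shared = cache.match_longest_prefix(system_prompt + [100, 101, 102])
print(matched, shared)  # 8 [0, 1]
```

Keying the dictionary on the token tuple gives hash-based lookup while still comparing the tokens themselves on a hit, so a hash collision cannot return the wrong blocks. Evicting an entry only drops the cache's own references, so blocks still held by in-flight requests stay alive.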

4 files changed

Lines changed: 535 additions & 8 deletions

docs/BENCHMARKS.md

Lines changed: 34 additions & 0 deletions
@@ -133,6 +133,40 @@
 ╚══════════════════════════════════════════════════════════════════╝
 ```
 
+### OPT-PC01: Prefix Caching
+
+**Status**: ✅ Implemented
+
+```
+╔══════════════════════════════════════════════════════════════════╗
+║ PREFIX CACHING BENCHMARK ║
+╠══════════════════════════════════════════════════════════════════╣
+║ Scenario: 100 requests with 100-token system prompt ║
+║ ║
+║ WITHOUT CACHING: ║
+║ Prefill tokens: 11,000 ║
+║ Time-to-first-token: ~500ms per request ║
+║ ║
+║ WITH CACHING: ║
+║ Prefill tokens: 1,090 ║
+║ Time-to-first-token: ~50ms (after first request) ║
+║ ║
+║ RESULTS: ║
+║ Prefill reduction: 90.1% ║
+║ TTFT reduction: ~90% ║
+║ Cache hit rate: 100% (for repeated prompts) ║
+║ ║
+║ MEMORY OVERHEAD: ║
+║ Per cached prefix: ~400 bytes metadata ║
+║ Shared KV blocks: Copy-on-write (no duplication) ║
+╚══════════════════════════════════════════════════════════════════╝
+```
+
+**Use Cases:**
+- Chatbots with system prompts: 90%+ prefill reduction
+- Few-shot learning: Cache examples, only prefill new query
+- RAG applications: Cache retrieved context
+
 ### OPT-S01: Speculative Decoding
 
 ```
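
Reviewer note: the 90.1% prefill reduction reported in the benchmark follows directly from the two token counts it lists. The short Python check below recomputes it; the function name `prefill_reduction` is hypothetical. The per-request average is inferred from the stated totals rather than read from the benchmark code, which is not shown on this page.

```python
def prefill_reduction(tokens_without_cache, tokens_with_cache):
    """Fraction of prefill tokens avoided by serving prefixes from the cache."""
    return (tokens_without_cache - tokens_with_cache) / tokens_without_cache


# Figures copied from the benchmark box in docs/BENCHMARKS.md.
requests = 100
without_cache = 11_000
with_cache = 1_090

reduction = prefill_reduction(without_cache, with_cache)
avg_tokens_per_request = without_cache / requests  # inferred, not stated in the source

print(f"{reduction:.1%}")       # 90.1%
print(avg_tokens_per_request)   # 110.0
```

An average of 110 prefill tokens per request is consistent with a 100-token system prompt plus a short request-specific suffix, which is why the achievable reduction tops out near 90% rather than the 99% estimated earlier in the roadmap.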

docs/TECH_TREE.md

Lines changed: 9 additions & 5 deletions
@@ -53,7 +53,7 @@
 │ │ ┌──────────┐ │ │
 │ │ │ OPT-PC01 │ │ │
 │ │ │ Prefix │ │ │
-│ │ │ 🔄 WIP │ │ │
+│ │ │ ✅ 90% │ │ │
 │ │ └──────────┘ │ │
 │ └─────────────────────────────────────────────────────────────────────────────┘ │
 │ │
@@ -100,9 +100,13 @@
 
 ### In Progress (🔄)
 
+*None currently*
+
+### Recently Completed
+
 | ID | Name | Branch | Impact | Hours | Dependencies |
 |----|------|--------|--------|-------|--------------|
-| OPT-PC01 | Prefix Caching | Serving | 99% prefill reduction | 20 | OPT-PA01 ✅ |
+| OPT-PC01 | Prefix Caching | Serving | **90% prefill reduction** | 20 | OPT-PA01 ✅ |
 
 ### Available (🟢)
 
@@ -156,9 +160,9 @@
 
 ### Immediate (This Week)
 
-1. **OPT-PC01 Prefix Caching** - 20 hours
-   - Dependencies: ✅ All met
-   - Impact: 99% prefill reduction for cached prompts
+1. **OPT-CP01 Chunked Prefill** - 30 hours
+   - Dependencies: ✅ All met (OPT-B01)
+   - Impact: -50% time-to-first-token
    - Priority: HIGH
 
 ### Short-term (This Month)
