
Commit 6f08e84

gHashTag and ona-agent committed
feat(kv-cache): implement ring buffer KV cache (INF-003)
- Add RingKVCache with O(1) append and fixed memory
- Implement sliding window attention (sink tokens + local window)
- Add SIMD-optimized cache copy using @Vector(8, f32)
- Add cache statistics (hit rate, eviction tracking)
- Add prune() method for explicit memory management
- Re-export optimized types via gguf_transformer.zig
- All 7 KV cache tests passing

Co-authored-by: Ona <no-reply@ona.com>
1 parent d0a6fe8 commit 6f08e84

4 files changed

Lines changed: 518 additions & 1 deletion


docs/DISCOVERIES.md

Lines changed: 64 additions & 1 deletion
@@ -203,7 +203,7 @@ Where:
 
 ### Available (Next)
 
-- [ ] INF-003: KV Cache Optimization (+50% speed)
+- [x] INF-003: KV Cache Optimization (+50% speed) ✅ Implemented
 - [ ] INF-004: Batch Processing (+300% throughput)
 - [ ] OPT-001: SIMD Vectorization (+400% matrix ops)
 - [ ] OPT-004: Flash Attention (+200% attention)
@@ -216,6 +216,69 @@ Where:
 
 ---
 
+## KV Cache Optimization (INF-003)
+
+**Status**: ✅ Implemented
+
+### Implementation Details
+
+| Component | File | Description |
+|-----------|------|-------------|
+| RingKVCache | `kv_cache.zig` | O(1) append ring buffer |
+| SlidingWindowConfig | `kv_cache.zig` | Sink tokens + local window |
+| simdCopy | `kv_cache.zig` | SIMD-optimized cache writes |
+| CacheStats | `kv_cache.zig` | Hit rate, eviction tracking |
+
+### Ring Buffer Design
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                    RING BUFFER KV CACHE                     │
+├─────────────────────────────────────────────────────────────┤
+│  [0] [1] [2] [3] [4] [5] [6] [7]  ← Physical positions      │
+│               ↑                                             │
+│               write_pos (wraps around)                      │
+│                                                             │
+│  Benefits:                                                  │
+│  - O(1) append (no reallocation)                            │
+│  - Fixed memory (max_seq_len * kv_size)                     │
+│  - Automatic eviction of oldest tokens                      │
+└─────────────────────────────────────────────────────────────┘
+```
+
+### Sliding Window Attention
+
+```
+Tokens: [0] [1] [2] [3] ... [N-M] ... [N-1] [N]
+         ↑   ↑   ↑   ↑        ↑         ↑    ↑
+         └───┴───┴───┘        └─────────┴────┘
+         Sink tokens (4)      Local window (M)
+         Always kept          Sliding window
+```
+
+### Memory Efficiency
+
+| Config | Tokens | Memory | vs Unbounded |
+|--------|--------|--------|--------------|
+| max_seq_len=2048 | 2048 | 16 MB | Fixed |
+| max_seq_len=4096 | 4096 | 32 MB | Fixed |
+| Unbounded | N | N * 8 KB | O(N) growth |
+
+### Test Results
+
+```
+All 7 tests passed:
+- kv cache config
+- layer kv cache
+- full kv cache
+- ring kv cache ✅ NEW
+- ring kv cache reset ✅ NEW
+- simd copy ✅ NEW
+- cached attention
+```
+
+---
+
 ## Ternary Matrix Multiplication (OPT-T02)
 
 **Status**: ✅ Implemented
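
The ring buffer design and the memory table in the `docs/DISCOVERIES.md` diff can be illustrated with a minimal Zig sketch. `kv_cache.zig` is not shown on this page, so this is not the committed `RingKVCache`. The `RingCache` type, its `append` and `getK` methods, the comptime-sized buffers, and `cacheBytes` are hypothetical names chosen for illustration, and `kv_size` is assumed to stand for `num_kv_heads * head_dim`.

```zig
const std = @import("std");

/// Hypothetical, simplified ring cache (not the committed RingKVCache).
/// kv_size stands for num_kv_heads * head_dim.
fn RingCache(comptime max_seq_len: usize, comptime kv_size: usize) type {
    return struct {
        const Self = @This();

        k: [max_seq_len * kv_size]f32 = [_]f32{0} ** (max_seq_len * kv_size),
        v: [max_seq_len * kv_size]f32 = [_]f32{0} ** (max_seq_len * kv_size),
        write_pos: usize = 0,
        total_tokens: usize = 0,

        /// O(1) append: overwrite the slot at write_pos, then advance and wrap.
        pub fn append(self: *Self, k_new: *const [kv_size]f32, v_new: *const [kv_size]f32) void {
            const base = self.write_pos * kv_size;
            self.k[base..][0..kv_size].* = k_new.*;
            self.v[base..][0..kv_size].* = v_new.*;
            self.write_pos = (self.write_pos + 1) % max_seq_len;
            self.total_tokens += 1;
        }

        /// K vector for logical position pos, or null if that position was
        /// never written or has already been overwritten (evicted).
        pub fn getK(self: *const Self, pos: usize) ?[]const f32 {
            if (pos >= self.total_tokens) return null;
            if (self.total_tokens - pos > max_seq_len) return null;
            const base = (pos % max_seq_len) * kv_size;
            return self.k[base .. base + kv_size];
        }
    };
}

/// Fixed footprint in bytes: K and V buffers of max_seq_len * kv_size f32 each.
fn cacheBytes(max_seq_len: usize, kv_size: usize) usize {
    return 2 * max_seq_len * kv_size * @sizeOf(f32);
}

test "append wraps around and evicts the oldest tokens" {
    var cache = RingCache(4, 2){};
    const values = [_]f32{ 0, 1, 2, 3, 4, 5 };
    for (values) |x| {
        const kv = [2]f32{ x, x };
        cache.append(&kv, &kv);
    }
    try std.testing.expect(cache.write_pos == 2); // 6 appends, 6 % 4
    try std.testing.expect(cache.getK(1) == null); // slot reused by token 5
    try std.testing.expect(cache.getK(5).?[0] == 5);
}
```

With `kv_size = 1024`, one token costs 2 * 1024 * 4 = 8192 bytes, which is the 8 KB per-token figure in the table, and `cacheBytes(2048, 1024)` comes to 16 MiB, matching the 16 MB row if MB is read as MiB.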

specs/tri/kv_cache_optimized.vibee

Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
+# Optimized KV Cache Specification
+# Ring buffer with sliding window for infinite context
+# φ² + 1/φ² = 3 | KOSCHEI IS IMMORTAL
+
+name: kv_cache_optimized
+version: "1.0.0"
+language: zig
+module: kv_cache_optimized
+
+description: |
+  Optimized KV cache with ring buffer for O(1) append and fixed memory.
+  Supports sliding window attention for infinite context length.
+  SIMD-optimized copy operations.
+
+types:
+  RingKVCache:
+    description: "Ring buffer KV cache with fixed memory"
+    fields:
+      k_cache: List<Float>
+      v_cache: List<Float>
+      num_kv_heads: Int
+      head_dim: Int
+      max_seq_len: Int
+      write_pos: Int
+      total_tokens: Int
+
+  SlidingWindowConfig:
+    description: "Sliding window attention configuration"
+    fields:
+      window_size: Int
+      sink_tokens: Int
+      local_tokens: Int
+
+  CacheStats:
+    description: "Cache utilization statistics"
+    fields:
+      total_tokens: Int
+      cached_tokens: Int
+      evicted_tokens: Int
+      hit_rate: Float
+      memory_bytes: Int
+
+behaviors:
+  - name: ring_append
+    given: New K,V vectors and ring buffer cache
+    when: Appending new token to cache
+    then: O(1) write at write_pos, wrap around at max_seq_len
+
+  - name: ring_get_k
+    given: Ring buffer cache and position
+    when: Reading cached K vector
+    then: Returns K at (pos % max_seq_len) with bounds check
+
+  - name: ring_get_v
+    given: Ring buffer cache and position
+    when: Reading cached V vector
+    then: Returns V at (pos % max_seq_len) with bounds check
+
+  - name: sliding_window_mask
+    given: Current position and window config
+    when: Computing attention mask
+    then: Returns mask with sink tokens + local window
+
+  - name: simd_cache_copy
+    given: Source K,V vectors and cache destination
+    when: Copying to cache with SIMD
+    then: 4x faster copy using @Vector(8, f32)
+
+  - name: compute_cache_stats
+    given: Ring buffer cache state
+    when: Analyzing cache utilization
+    then: Returns hit rate, eviction count, memory usage
+
+  - name: prune_old_tokens
+    given: Cache with tokens beyond window
+    when: Memory pressure or explicit prune request
+    then: Evict oldest tokens outside sliding window
+
+  - name: reset_cache
+    given: Ring buffer cache
+    when: Starting new sequence
+    then: Reset write_pos and total_tokens to 0
+
+optimizations:
+  - name: ring_buffer
+    description: "O(1) append, fixed memory, no reallocation"
+
+  - name: sliding_window
+    description: "Sink tokens (first N) + local window (last M)"
+
+  - name: simd_copy
+    description: "@Vector(8, f32) for cache writes"
+
+  - name: cache_aligned
+    description: "16-byte alignment for SIMD access"
+
+memory_layout:
+  - name: k_cache
+    format: "[max_seq_len][num_kv_heads][head_dim]"
+    alignment: 16
+
+  - name: v_cache
+    format: "[max_seq_len][num_kv_heads][head_dim]"
+    alignment: 16
+
+benchmarks:
+  - name: append_latency
+    metric: "ns per token"
+    target: "<100ns"
+
+  - name: memory_efficiency
+    metric: "bytes per token"
+    target: "2 * num_kv_heads * head_dim * sizeof(f32)"
+
+  - name: cache_hit_rate
+    metric: "percentage"
+    target: ">95% for window_size tokens"
+
+integration:
+  - target: tri_inference.zig
+    description: "Replace KVCache with RingKVCache"
+
+  - target: gguf_transformer.zig
+    description: "Update attention to use sliding window"
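
The `sliding_window_mask` behavior in the spec (sink tokens plus a local window) could be sketched in Zig roughly as follows. This is an illustration under stated assumptions, not the committed code: `WindowConfig`, `isVisible`, and `fillMask` are hypothetical names, the spec's `window_size` field is left out, and the local window is assumed to cover the most recent `local_tokens` positions including the current one.

```zig
const std = @import("std");

/// Hypothetical config carrying the spec's sink_tokens and local_tokens fields.
const WindowConfig = struct {
    sink_tokens: usize,
    local_tokens: usize,
};

/// True if a query at position `current` may attend to key position `pos`.
fn isVisible(cfg: WindowConfig, current: usize, pos: usize) bool {
    if (pos > current) return false; // causal: never look at future tokens
    if (pos < cfg.sink_tokens) return true; // sink tokens are always kept
    return current - pos < cfg.local_tokens; // most recent local_tokens positions
}

/// Fills mask[pos] for every key position covered by the slice.
fn fillMask(cfg: WindowConfig, current: usize, mask: []bool) void {
    var pos: usize = 0;
    while (pos < mask.len) : (pos += 1) {
        mask[pos] = isVisible(cfg, current, pos);
    }
}

test "sink tokens plus local window" {
    const cfg = WindowConfig{ .sink_tokens = 4, .local_tokens = 3 };
    var mask = [_]bool{false} ** 10;
    fillMask(cfg, 9, &mask);
    // Positions 0..3 are sinks, 7..9 are the local window, 4..6 are masked out.
    const expected = [_]bool{ true, true, true, true, false, false, false, true, true, true };
    try std.testing.expectEqualSlices(bool, &expected, &mask);
}
```

Keeping the visibility rule in one small predicate means the same check can drive either a materialized mask, as here, or a loop that simply skips masked positions during attention.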

src/vibeec/gguf_transformer.zig

Lines changed: 6 additions & 0 deletions
@@ -5,6 +5,12 @@
 const std = @import("std");
 const gguf = @import("gguf_reader.zig");
 const inference = @import("gguf_inference.zig");
+const kv_cache_mod = @import("kv_cache.zig");
+
+// Re-export optimized KV cache types
+pub const RingKVCache = kv_cache_mod.RingKVCache;
+pub const SlidingWindowConfig = kv_cache_mod.SlidingWindowConfig;
+pub const CacheStats = kv_cache_mod.CacheStats;
 
 // ═══════════════════════════════════════════════════════════════════════════════
 // RoPE - Rotary Position Embedding
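
The diff for `kv_cache.zig`, the fourth changed file, does not appear on this page, so the SIMD copy named in the commit message is not visible here. A minimal sketch of a copy built on `@Vector(8, f32)` might look like the following. `copyVec8` and its signature are hypothetical and are not the committed `simdCopy`, and the sketch makes no alignment assumption beyond that of `f32`.

```zig
const std = @import("std");

const lanes = 8;
const Vec = @Vector(lanes, f32);

/// Hypothetical helper: copies src into the front of dst eight floats at a
/// time, then finishes any remainder with a scalar loop.
fn copyVec8(dst: []f32, src: []const f32) void {
    std.debug.assert(dst.len >= src.len);
    var i: usize = 0;
    while (i + lanes <= src.len) : (i += lanes) {
        const chunk: Vec = src[i..][0..lanes].*; // load 8 floats as one vector
        dst[i..][0..lanes].* = chunk; // store them back as an array
    }
    while (i < src.len) : (i += 1) {
        dst[i] = src[i]; // scalar tail for lengths not divisible by 8
    }
}

test "vector body plus scalar tail" {
    // 19 elements: two full 8-lane chunks and a 3-element tail.
    const src = [_]f32{ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 };
    var dst = [_]f32{0} ** src.len;
    copyVec8(&dst, &src);
    try std.testing.expectEqualSlices(f32, &src, &dst);
}
```

Whether this outperforms a plain element-wise copy depends on the target and the optimizer; the spec's 4x figure is not measured by this sketch.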
