3. **Targeted regularization** by isolating QJL to the K-Cache only.
The result is a highly efficient unified KV Cache running at an average of **~3.6 bits/dim (~3.5x compression vs fp16)**, recovering the performance characteristics of V2 with the perplexity retention of V3.
## Implementation Status (March 2026)
### Hot-Window Eviction Design
The production implementation uses a **hot-window eviction** strategy rather than always-compress:
- **fp16 hot window (last 256 tokens):** always kept at full precision. Short prompts (<256 tokens) receive zero compression; full fp16 quality is preserved.
- **Cold history (older than 256 tokens):** compressed to 3-bit PolarQuant in `step=256` chunks once enough cold tokens accumulate.
- **Attention path:** SDPA sees `[decoded_prior_history | fp16_hot_window]`; the two regions are disjoint by construction, eliminating any duplication risk.
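The hot/cold split above can be sketched as follows. This is an illustrative sketch, not the actual implementation; `split_hot_cold`, `HOT_WINDOW`, and `STEP` are hypothetical names chosen here, with only the 256-token window and `step=256` chunking taken from the design.

```python
# Hypothetical sketch of the hot-window eviction split. Only the window
# size (256) and chunk step (256) come from the design notes above.
HOT_WINDOW = 256   # fp16 tokens always kept at full precision
STEP = 256         # cold tokens are compressed in chunks of this size

def split_hot_cold(total_tokens: int) -> tuple[int, int]:
    """Return (num_cold_tokens, num_hot_tokens) for a cache of this length.

    Only whole STEP-sized chunks beyond the hot window are evicted, so the
    fp16 region can temporarily hold up to HOT_WINDOW + STEP - 1 tokens.
    """
    if total_tokens <= HOT_WINDOW:
        return 0, total_tokens          # short prompt: zero compression
    evictable = total_tokens - HOT_WINDOW
    cold = (evictable // STEP) * STEP   # compress in step=256 chunks only
    return cold, total_tokens - cold
```

Note that a prompt of, say, 600 tokens evicts exactly one 256-token chunk and keeps the remaining 344 tokens in fp16 until another full chunk accumulates.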
This design was chosen over the reference's always-compress approach (`cache_v3.py`) for two reasons:
1. The reference uses an incremental `_key_centroids_cache` shadow buffer to amortize decode cost, which requires keeping a full fp16 dequantized copy in addition to the packed storage (more RAM in total). Our approach evicts the fp16 cold tokens and decodes them on demand.
2. Short-context tool-use calls (100–400 tokens) need no compression and should not pay the decode-latency penalty.
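The decode-on-demand attention path from point 1 can be sketched as below, with NumPy standing in for the real tensor library. All names here (`attention_keys`, the `dequantize` callback) are hypothetical; only the `[decoded_prior_history | fp16_hot_window]` layout and the disjointness guarantee come from the design.

```python
import numpy as np

def attention_keys(packed_cold, dequantize, hot_fp16):
    """Build the key tensor SDPA sees (hypothetical sketch).

    packed_cold: opaque packed 3-bit storage, or None if nothing is evicted.
    dequantize:  decode-on-demand function returning an fp16-like array.
    hot_fp16:    the last <=256 tokens, always kept at full precision.

    The two regions are disjoint by construction, so a plain concatenation
    can never duplicate a token.
    """
    if packed_cold is None:
        return hot_fp16  # short prompt: no decode cost at all
    return np.concatenate([dequantize(packed_cold), hot_fp16], axis=0)
```

Because no persistent fp16 shadow of the cold region exists, the decoded copy here is transient; it lives only for the duration of the attention call.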
### Telemetry
Compression stats are aggregated into the 10-second SSD Stream log via C atomics.
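A minimal Python stand-in for that aggregation pattern is sketched below: hot-path counters are bumped per compressed chunk, and a periodic flush snapshots and resets them for the log line. The class and field names are invented for illustration; the production path uses C atomics rather than a Python lock.

```python
import threading

class CompressionStats:
    """Illustrative stand-in for the C atomic counters (hypothetical names).

    The hot path bumps counters per evicted chunk; the 10-second log flush
    snapshots and resets them in one step, so no updates are lost.
    """
    def __init__(self):
        self._lock = threading.Lock()  # the C version uses atomics instead
        self.tokens_evicted = 0
        self.bytes_packed = 0

    def record_chunk(self, tokens: int, packed_bytes: int) -> None:
        with self._lock:
            self.tokens_evicted += tokens
            self.bytes_packed += packed_bytes

    def drain(self) -> dict:
        """Snapshot-and-reset, as the periodic log flush would do."""
        with self._lock:
            snap = {"tokens_evicted": self.tokens_evicted,
                    "bytes_packed": self.bytes_packed}
            self.tokens_evicted = 0
            self.bytes_packed = 0
        return snap
```

The snapshot-and-reset in `drain` is the key property: in the C version it would be a pair of `atomic_exchange` calls, keeping the log flush off the hot path.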