Skip to content

Commit 6ede853

Browse files
committed
docs(turbo-kv): add implementation status, hot-window design rationale, commit references
1 parent 3df7430 commit 6ede853

1 file changed

Lines changed: 33 additions & 0 deletions

File tree

docs/turboquant_hybrid_architecture.md

Lines changed: 33 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -41,3 +41,36 @@ Our hybrid approach guarantees:
4141
3. **Targeted regularization** by isolating QJL to the K-Cache only.
4242

4343
The result is a highly efficient unified KV Cache running at an average of **~3.6 bits/dim (~3.5x compression vs fp16)**, recovering the performance characteristics of V2 with the perplexity retention of V3.
44+
45+
## Implementation Status (March 2026)
46+
47+
### Hot-Window Eviction Design
48+
49+
The production implementation uses a **hot-window eviction** strategy rather than always-compress:
50+
51+
- **fp16 hot window (last 256 tokens):** Always kept at full precision. Short prompts (<256 tokens) receive zero compression — full fp16 quality preserved.
52+
- **Cold history (older than 256 tokens):** Compressed to 3-bit PolarQuant in `step=256` chunks when enough cold tokens accumulate.
53+
- **Attention path:** SDPA sees `[decoded_prior_history | fp16_hot_window]` — the two regions are disjoint by construction, eliminating any duplication risk.
54+
55+
This design was chosen over the reference's always-compress approach (`cache_v3.py`) for two reasons:
56+
1. The reference uses an incremental `_key_centroids_cache` shadow buffer to amortize decode cost — this requires keeping a full fp16 dequantized copy in addition to the packed storage (more RAM total). Our approach evicts the fp16 cold tokens and decodes on demand.
57+
2. Short context tool-use calls (100–400 tokens) need no compression and should not pay the decode latency penalty.
58+
59+
### Telemetry
60+
61+
Compression stats are aggregated into the 10-second SSD Stream log via C atomics:
62+
```
63+
[⚡️ SSD Stream] 8977 MB/s | 21698 chunks | avg 0.167 ms/chunk | 🗜 TurboKV 4.3x (5MB saved)
64+
```
65+
The `🗜 TurboKV` suffix only appears when compression was active in that 10s window.
66+
67+
### Commit References
68+
69+
| Repository | Branch | Commit | Description |
70+
|---|---|---|---|
71+
| `mlx-swift-lm` | `main` | `b7307a4` | Hot-window eviction design, AttentionUtils cleanup |
72+
| `mlx-swift-lm` | `main` | `2d885b8` | Fix context duplication in SDPA path |
73+
| `mlx-swift-lm` | `main` | `71678dd` | Fix origBytes telemetry calculation |
74+
| `mlx-swift-lm` | `main` | `c336189` | Remove double-counting record() call |
75+
| `mlx-swift` | `feature/api-parity-roadmap` | `3df7430` | 10s log: ratio + MB saved display |
76+
| `mlx-swift` | `feature/api-parity-roadmap` | `dc6af72` | C atomic aggregator + 10s log hook |

0 commit comments

Comments
 (0)