Honest memory reporting: K-only compression, V remains FP32
Critical fix: memory stats now correctly show K (compressed) + V (FP32)
separately. Previous "10.7x" was K-only — total K+V ratio is ~1x when
V is FP32. README updated with honest scope throughout:
- "10.7x key compression" (not "KV compression")
- Value quantization noted as planned
- Divergence at ~120 tokens documented
- Long context section shows K-only numbers clearly
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
> Keys only — values remain FP32. Greedy decode is byte-identical up to ~120 tokens; outputs diverge beyond that but remain coherent. Value quantization is planned.
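For intuition about what this scope means in code, here is a minimal sketch of 1-bit key quantization: a sign bit per element plus one scale per key vector, with values passing through untouched. This is an illustration under assumed details, not TurboQuant's actual codebook, and every name in it is hypothetical.

```python
import numpy as np

# Minimal sketch of 1-bit key quantization (sign bits + per-vector scale).
# Illustration only -- not TurboQuant's actual codebook. Values are never
# quantized here, matching the "keys only, values remain FP32" scope.

def quantize_keys_1bit(k: np.ndarray):
    """k: (num_tokens, head_dim) float32 key vectors."""
    scale = np.abs(k).mean(axis=-1, keepdims=True)  # one scale per key vector
    packed = np.packbits(k >= 0, axis=-1)           # 1 bit per element
    return packed, scale.astype(np.float16)

def dequantize_keys_1bit(packed, scale, head_dim):
    bits = np.unpackbits(packed, axis=-1)[..., :head_dim]
    return scale.astype(np.float32) * np.where(bits, 1.0, -1.0)

k = np.random.randn(8, 256).astype(np.float32)      # head_dim=256 is arbitrary
packed, scale = quantize_keys_1bit(k)
k_hat = dequantize_keys_1bit(packed, scale, head_dim=256)
# Storage: head_dim/8 bytes + a 2-byte scale per token, vs 2*head_dim for FP16.
```

With only a sign and one scale per vector, reconstruction is lossy by construction, which is consistent with greedy decode eventually drifting after enough tokens.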
+
+### Key Compression at Long Context
+
+Currently **keys are compressed, values remain FP32**. Value quantization is planned.

 ```
-Gemma 3 4B, 32K tokens — KV cache only:
-FP16 (llama.cpp):   4,352 MB
-Uniform 4-bit:      1,156 MB
-TurboQuant 3-bit:     952 MB
-TurboQuant 1-bit:     408 MB   ← 3.9 GB saved vs FP16
+Gemma 3 4B, 32K tokens — key vectors only:
+FP16 keys:              2,176 MB
+Uniform 4-bit keys:       578 MB   (3.8x)
+TurboQuant 3-bit keys:    476 MB   (4.6x)
+TurboQuant 1-bit keys:    204 MB   (10.7x)
 ```

+Full K+V savings require V compression — with FP16 values + 1-bit keys: **~1.8x total K+V reduction**. With future V quantization, this grows to **~5x+**.
+
 ### Speed vs llama.cpp

 ```
@@ -82,7 +88,7 @@ bash scripts/quickstart.sh "What is deep learning?"
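As a sanity check on the numbers in this diff, the key-cache figures follow from the standard cache-size formula. The architecture constants in the sketch below (34 layers, 4 KV heads, head dim 256) are my assumption, chosen because they reproduce the 2,176 MB FP16 row exactly; the quantized rows sit above the raw bit-width payload, presumably due to scale and other side-information overhead, which the 3.8x/4.6x/10.7x ratios already reflect.

```python
# Sanity check of the key-cache table and the K+V ratios above.
# The Gemma 3 4B architecture constants are assumptions: they reproduce
# the table's 2,176 MB FP16 row but are not read from the repo.
layers, kv_heads, head_dim, seq_len = 34, 4, 256, 32_768

elems = layers * kv_heads * head_dim * seq_len          # cached key elements
fp16_k_mb = elems * 2 / 2**20                           # 2 bytes per element
print(fp16_k_mb)                                        # 2176.0 -> matches table

for bits, table_mb in [(4, 578), (3, 476), (1, 204)]:
    payload_mb = elems * bits / 8 / 2**20               # raw payload, no metadata
    print(f"{bits}-bit: payload {payload_mb:.0f} MB, "
          f"table {table_mb} MB -> {fp16_k_mb / table_mb:.1f}x")

fp16_v_mb = fp16_k_mb                                   # values mirror keys in size
print(f"{(fp16_k_mb + fp16_v_mb) / (204 + fp16_v_mb):.1f}x")      # ~1.8x with FP16 V
print(f"{(fp16_k_mb + fp16_v_mb) / (204 + 2 * fp16_v_mb):.2f}x")  # ~0.96x with FP32 V
```

The last line also checks the commit message's point: against an FP16 K+V baseline, 1-bit keys plus FP32 values land at roughly 0.96x, i.e. the "~1x" total it reports.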