Honest memory reporting: K-only compression, V remains FP32
Critical fix: memory stats now correctly show K (compressed) + V (FP32)
separately. Previous "10.7x" was K-only — total K+V ratio is ~1x when
V is FP32. README updated with honest scope throughout:
- "10.7x key compression" (not "KV compression")
- Value quantization noted as planned
- Divergence at ~120 tokens documented
- Long context section shows K-only numbers clearly
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
> Keys only — values remain FP32. Greedy decode is byte-identical up to ~120 tokens; outputs diverge beyond that but remain coherent. Value quantization is planned.
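For intuition about what this scope means in code, here is a minimal sketch of 1-bit key quantization: a sign bit per element plus one scale per key vector, with values passing through untouched. This is an illustration under assumed details, not TurboQuant's actual codebook, and every name in it is hypothetical.

```python
import numpy as np

# Minimal sketch of 1-bit key quantization (sign bits + per-vector scale).
# Illustration only -- not TurboQuant's actual codebook. Values are never
# quantized here, matching the "keys only, values remain FP32" scope.

def quantize_keys_1bit(k: np.ndarray):
    """k: (num_tokens, head_dim) float32 key vectors."""
    scale = np.abs(k).mean(axis=-1, keepdims=True)  # one scale per key vector
    packed = np.packbits(k >= 0, axis=-1)           # 1 bit per element
    return packed, scale.astype(np.float16)

def dequantize_keys_1bit(packed, scale, head_dim):
    bits = np.unpackbits(packed, axis=-1)[..., :head_dim]
    return scale.astype(np.float32) * np.where(bits, 1.0, -1.0)

k = np.random.randn(8, 256).astype(np.float32)      # head_dim=256 is arbitrary
packed, scale = quantize_keys_1bit(k)
k_hat = dequantize_keys_1bit(packed, scale, head_dim=256)
# Storage: head_dim/8 bytes + a 2-byte scale per token, vs 2*head_dim for FP16.
```

With only a sign and one scale per vector, reconstruction is lossy by construction, which is consistent with greedy decode eventually drifting after enough tokens.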
+
+### Key Compression at Long Context
+
+Currently **keys are compressed, values remain FP32**. Value quantization is planned.

 ```
-Gemma 3 4B, 32K tokens — KV cache only:
-FP16 (llama.cpp):   4,352 MB
-Uniform 4-bit:      1,156 MB
-TurboQuant 3-bit:     952 MB
-TurboQuant 1-bit:     408 MB   ← 3.9 GB saved vs FP16
+Gemma 3 4B, 32K tokens — key vectors only:
+FP16 keys:              2,176 MB
+Uniform 4-bit keys:       578 MB   (3.8x)
+TurboQuant 3-bit keys:    476 MB   (4.6x)
+TurboQuant 1-bit keys:    204 MB   (10.7x)
 ```

+Full K+V savings require V compression — with FP16 values + 1-bit keys: **~1.8x total K+V reduction**. With future V quantization, this grows to **~5x+**.
+
 ### Speed vs llama.cpp

 ```
@@ -82,7 +88,7 @@ bash scripts/quickstart.sh "What is deep learning?"
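As a sanity check on the numbers in this diff, the key-cache figures follow from the standard cache-size formula. The architecture constants in the sketch below (34 layers, 4 KV heads, head dim 256) are my assumption, chosen because they reproduce the 2,176 MB FP16 row exactly; the quantized rows sit above the raw bit-width payload, presumably due to scale and other side-information overhead, which the 3.8x/4.6x/10.7x ratios already reflect.

```python
# Sanity check of the key-cache table and the K+V ratios above.
# The Gemma 3 4B architecture constants are assumptions: they reproduce
# the table's 2,176 MB FP16 row but are not read from the repo.
layers, kv_heads, head_dim, seq_len = 34, 4, 256, 32_768

elems = layers * kv_heads * head_dim * seq_len          # cached key elements
fp16_k_mb = elems * 2 / 2**20                           # 2 bytes per element
print(fp16_k_mb)                                        # 2176.0 -> matches table

for bits, table_mb in [(4, 578), (3, 476), (1, 204)]:
    payload_mb = elems * bits / 8 / 2**20               # raw payload, no metadata
    print(f"{bits}-bit: payload {payload_mb:.0f} MB, "
          f"table {table_mb} MB -> {fp16_k_mb / table_mb:.1f}x")

fp16_v_mb = fp16_k_mb                                   # values mirror keys in size
print(f"{(fp16_k_mb + fp16_v_mb) / (204 + fp16_v_mb):.1f}x")      # ~1.8x with FP16 V
print(f"{(fp16_k_mb + fp16_v_mb) / (204 + 2 * fp16_v_mb):.2f}x")  # ~0.96x with FP32 V
```

The last line also checks the commit message's point: against an FP16 K+V baseline, 1-bit keys plus FP32 values land at roughly 0.96x, i.e. the "~1x" total it reports.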