
Commit 10a73a4

unamedkr and claude committed
Honest memory reporting: K-only compression, V remains FP32
Critical fix: memory stats now correctly show K (compressed) + V (FP32) separately. Previous "10.7x" was K-only; the total K+V ratio is ~1x when V is FP32.

README updated with honest scope throughout:
- "10.7x key compression" (not "KV compression")
- Value quantization noted as planned
- Divergence at ~120 tokens documented
- Long context section shows K-only numbers clearly

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent b571fe0 commit 10a73a4

3 files changed

Lines changed: 37 additions & 22 deletions
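
As a quick sanity check of the "~1x total K+V ratio" claim in the commit message, the sketch below (not part of the commit) plugs a Gemma-3-4B-like shape into the per-token formulas from the tq_run.c hunk further down. The 34-layer / 4-KV-head / 256-dim shape and the 12.75 KB 1-bit key figure are assumptions chosen to match the README numbers, not values read from the repo.

```c
/* Back-of-the-envelope check of the "~1x" total K+V ratio when V stays FP32.
 * All shape numbers here are assumptions, not read from the model config. */
#include <stdio.h>

int main(void) {
    double L = 34, H = 4, D = 256;        /* layers, KV heads, head_dim (assumed) */
    double fp16_kv = 2 * L * H * D * 2;   /* FP16 K+V bytes per token             */
    double v_fp32  = L * H * D * 4;       /* FP32 V bytes per token               */
    double k_1bit  = 12.75 * 1024;        /* README's 1-bit key figure per token  */
    printf("FP16 K+V: %.1f KB   FP32 V alone: %.1f KB\n",
           fp16_kv / 1024, v_fp32 / 1024);
    printf("combined ratio with 1-bit K + FP32 V: %.2fx\n",
           fp16_kv / (k_1bit + v_fp32));  /* ~0.9x, i.e. "~1x" */
    return 0;
}
```

With values held in FP32, the value half alone already costs as much as the entire FP16 K+V cache, so compressing keys alone cannot move the combined ratio meaningfully past ~1x.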


README.md

Lines changed: 16 additions & 10 deletions
@@ -7,7 +7,7 @@
 [![Tests](https://img.shields.io/badge/tests-23%20suites-brightgreen)]()
 [![KV Quality](https://img.shields.io/badge/KV%20quality-30%2F30%20byte--identical-brightgreen)]()
 
-### 1-bit KV cache. 10.7x compression. Quality preserved.
+### 1-bit KV keys. 10.7x key compression. Quality preserved up to ~120 tokens.
 
 ```
 Gemma 3 4B, greedy decode, 10 prompts × 100 tokens:
@@ -45,23 +45,29 @@ The [TurboQuant paper](https://arxiv.org/abs/2504.19874) (Google Research, ICLR
 
 Gemma 3 4B, 100 tokens, greedy, 10 diverse prompts (math, knowledge, code, multilingual):
 
-| KV Type | Bits | Per-token KV | Compression | vs Uniform 4-bit |
-|---------|------|-------------|-------------|-------------------|
+| KV Type | Key bits | Key size/token | Key compression | Quality (100 tok) |
+|---------|----------|---------------|----------------|-------------------|
 | uniform_4b | 4 | 36.12 KB | 3.8x | baseline |
 | turbo_kv_4b | 4 | 38.25 KB | 3.6x | **byte-identical** |
 | turbo_kv_3b | 3 | 29.75 KB | 4.6x | **byte-identical** |
 | **turbo_kv_1b** | **1** | **12.75 KB** | **10.7x** | **byte-identical** |
 
-### Memory at Long Context
+> Keys only — values remain FP32. Greedy decode is byte-identical up to ~120 tokens; outputs diverge beyond that but remain coherent. Value quantization is planned.
+
+### Key Compression at Long Context
+
+Currently **keys are compressed, values remain FP32**. Value quantization is planned.
 
 ```
-Gemma 3 4B, 32K tokens — KV cache only:
-FP16 (llama.cpp):       4,352 MB
-Uniform 4-bit:          1,156 MB
-TurboQuant 3-bit:         952 MB
-TurboQuant 1-bit:         408 MB   ← 3.9 GB saved vs FP16
+Gemma 3 4B, 32K tokens — key vectors only:
+FP16 keys:              2,176 MB
+Uniform 4-bit keys:       578 MB  (3.8x)
+TurboQuant 3-bit keys:    476 MB  (4.6x)
+TurboQuant 1-bit keys:    204 MB  (10.7x)
 ```
 
+Full K+V savings require V compression — with FP16 values + 1-bit keys: **~1.8x total K+V reduction**. With future V quantization, this grows to **~5x+**.
+
 ### Speed vs llama.cpp
 
 ```
8288
### Choose Your KV Compression
8389

8490
```bash
85-
./build/tq_run model.tqm -p "Hello" -k turbo_kv_1b # 1-bit (10.7x compression)
91+
./build/tq_run model.tqm -p "Hello" -k turbo_kv_1b # 1-bit keys (10.7x key compression)
8692
./build/tq_run model.tqm -p "Hello" -k turbo_kv_3b # 3-bit (4.6x, recommended)
8793
./build/tq_run model.tqm -p "Hello" -k turbo_kv_4b # 4-bit TurboQuant
8894
./build/tq_run model.tqm -p "Hello" -k uniform_4b # 4-bit uniform (baseline)

docs/assets/hero.png

-308 KB

tools/tq_run.c

Lines changed: 21 additions & 12 deletions
@@ -237,17 +237,21 @@ int main(int argc, char** argv) {
      * 2 (K+V) * n_layers * n_kv_heads * head_dim * 2 bytes per token */
     size_t fp16_per_token = (size_t)2 * c->n_layers * c->n_kv_heads * c->head_dim * 2;
 
-    /* Compressed KV: both keys and values quantized to same type.
-     * blocks_per_head * type_size bytes per head per layer, times 2 for K+V */
+    /* Compressed KV: keys quantized, values remain FP32.
+     * K: blocks_per_head * type_size bytes per head per layer
+     * V: n_kv_heads * head_dim * 4 bytes (FP32) per layer */
     size_t block_size = tq_type_block_size(kv_type);
     size_t type_size_bytes = tq_type_type_size(kv_type);
-    if (block_size == 0) block_size = TQ_BK;
-    if (type_size_bytes == 0) type_size_bytes = sizeof(block_tq_uniform_4b);
+    if (block_size == 0) { block_size = TQ_BK; }
+    if (type_size_bytes == 0) { type_size_bytes = sizeof(block_tq_uniform_4b); }
     size_t blocks_per_head = ((size_t)c->head_dim + block_size - 1) / block_size;
 
-    /* Q4 K+V per token: 2 * n_layers * n_kv_heads * blocks_per_head * type_size */
-    size_t compressed_per_token = (size_t)2 * c->n_layers * c->n_kv_heads
-                                  * blocks_per_head * type_size_bytes;
+    /* K (compressed) + V (FP32) per token */
+    size_t k_per_token = (size_t)c->n_layers * c->n_kv_heads
+                         * blocks_per_head * type_size_bytes;
+    size_t v_per_token = (size_t)c->n_layers * c->n_kv_heads
+                         * c->head_dim * sizeof(float);
+    size_t compressed_per_token = k_per_token + v_per_token;
 
     /* If kv_type is fp32 (sentinel), both key and value are FP32 */
     if (kv_type >= TQ_TYPE_COUNT) {
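
For reference, a standalone sketch (not from the repo) of the per-token accounting this hunk introduces. The model shape and the block_size / type_size_bytes values below are placeholders; the real code takes them from tq_type_block_size() and tq_type_type_size().

```c
/* Standalone sketch of the K (quantized) + V (FP32) per-token math from the
 * hunk above.  Shapes and quantized-type sizes are illustrative placeholders. */
#include <stdio.h>
#include <stddef.h>

static size_t k_bytes_per_token(size_t n_layers, size_t n_kv_heads, size_t head_dim,
                                size_t block_size, size_t type_size_bytes) {
    /* keys: one quantized block per block_size dims, per KV head, per layer */
    size_t blocks_per_head = (head_dim + block_size - 1) / block_size;
    return n_layers * n_kv_heads * blocks_per_head * type_size_bytes;
}

static size_t v_bytes_per_token(size_t n_layers, size_t n_kv_heads, size_t head_dim) {
    /* values stay FP32: 4 bytes per dim, per KV head, per layer */
    return n_layers * n_kv_heads * head_dim * sizeof(float);
}

int main(void) {
    /* assumed Gemma-3-4B-like shape and an assumed 1-bit block layout */
    size_t n_layers = 34, n_kv_heads = 4, head_dim = 256;
    size_t block_size = 32, type_size_bytes = 8;   /* placeholder block layout */
    size_t k = k_bytes_per_token(n_layers, n_kv_heads, head_dim, block_size, type_size_bytes);
    size_t v = v_bytes_per_token(n_layers, n_kv_heads, head_dim);
    size_t fp16 = (size_t)2 * n_layers * n_kv_heads * head_dim * 2;  /* FP16 K+V */
    printf("per token: K %zu B + V %zu B = %zu B (FP16 K+V: %zu B)\n", k, v, k + v, fp16);
    return 0;
}
```

Summing k_per_token and v_per_token, instead of doubling a single quantized figure, is what makes the reported totals reflect the uncompressed FP32 value half.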
@@ -267,15 +271,20 @@ int main(int argc, char** argv) {
             c->n_layers, c->n_kv_heads, c->head_dim);
     fprintf(stderr, "KV type: %s\n",
             kv_type < TQ_TYPE_COUNT ? tq_type_name(kv_type) : "fp32");
-    fprintf(stderr, "Per-token KV (Q4): %.2f KB\n",
+    fprintf(stderr, "Per-token K (%s): %.2f KB\n",
+            kv_type < TQ_TYPE_COUNT ? tq_type_name(kv_type) : "fp32",
+            (double)k_per_token / 1024.0);
+    fprintf(stderr, "Per-token V (FP32): %.2f KB\n",
+            (double)v_per_token / 1024.0);
+    fprintf(stderr, "Per-token K+V total: %.2f KB\n",
             (double)compressed_per_token / 1024.0);
-    fprintf(stderr, "Per-token KV (FP16): %.2f KB\n",
+    fprintf(stderr, "Per-token K+V (FP16): %.2f KB\n",
             (double)fp16_per_token / 1024.0);
-    fprintf(stderr, "Total KV (Q4): %.2f MB\n",
+    fprintf(stderr, "Total K+V: %.2f MB\n",
             (double)total_compressed / (1024.0 * 1024.0));
-    fprintf(stderr, "Total KV (FP16): %.2f MB\n",
+    fprintf(stderr, "Total K+V (FP16): %.2f MB\n",
             (double)total_fp16 / (1024.0 * 1024.0));
-    fprintf(stderr, "Compression ratio: %.2fx\n", ratio);
+    fprintf(stderr, "Compression ratio: %.2fx (K+V combined)\n", ratio);
     fprintf(stderr, "Memory saved: %.2f MB\n",
             (double)(total_fp16 - total_compressed) / (1024.0 * 1024.0));
     fprintf(stderr, "=============================\n");
