Commit b571fe0

unamedkr and claude committed
Honest limits: document divergence at ~120 tokens, V remains FP32
Analysis found:
- 1-bit KV: byte-identical up to ~117 tokens (context ~132 = TQ_BK)
- 3-bit KV: byte-identical up to ~140 tokens
- Beyond divergence point: output differs but stays coherent English
- Root cause: only key vectors quantized, values remain FP32
- Divergence expected — quantized attention scores shift softmax

README updated with honest scope note. Benchmark script header clarified.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
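The root cause noted above (quantized keys perturb attention scores, which in turn shifts the softmax over cached positions) can be illustrated with a toy NumPy sketch. The sign-and-rescale quantizer below is a generic 1-bit stand-in for illustration only, not TurboQuant's actual scheme; values stay FP32, as in the commit.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
q = rng.standard_normal(d)          # one query vector
K = rng.standard_normal((8, d))     # 8 cached key vectors, FP32

# Hypothetical 1-bit key quantizer: keep the sign, rescale by mean |x| per vector.
scale = np.abs(K).mean(axis=1, keepdims=True)
K_q = np.sign(K) * scale

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

p_fp32 = softmax(K @ q / np.sqrt(d))    # attention weights with FP32 keys
p_1bit = softmax(K_q @ q / np.sqrt(d))  # attention weights with quantized keys

# The two distributions are close but not identical; over long greedy decodes
# these small shifts can accumulate until an argmax flips and outputs diverge.
print(np.abs(p_fp32 - p_1bit).max())
```

With short contexts the perturbation is usually too small to flip any greedy token choice, which is consistent with the byte-identical results up to ~117 tokens.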
1 parent 1afb716 commit b571fe0

2 files changed: 12 additions & 2 deletions

README.md

Lines changed: 6 additions & 1 deletion
@@ -7,7 +7,7 @@
 [![Tests](https://img.shields.io/badge/tests-23%20suites-brightgreen)]()
 [![KV Quality](https://img.shields.io/badge/KV%20quality-30%2F30%20byte--identical-brightgreen)]()
 
-### 1-bit KV cache. 10.7x compression. Zero quality loss.
+### 1-bit KV cache. 10.7x compression. Quality preserved.
 
 ```
 Gemma 3 4B, greedy decode, 10 prompts × 100 tokens:
@@ -19,6 +19,11 @@ Gemma 3 4B, greedy decode, 10 prompts × 100 tokens:
 bash bench/kv_quality_bench.sh gemma3-4b.tqm ← reproduce it yourself
 ```
 
+> **Scope:** Key vectors are quantized; value vectors remain FP32. Greedy decode is
+> byte-identical up to ~120 tokens on Gemma 4B. Beyond that, outputs diverge but
+> remain coherent and comparable quality. This is expected — quantized attention
+> scores produce slightly different softmax distributions over longer contexts.
+
 ---
 
 ## Why This Matters

bench/kv_quality_bench.sh

Lines changed: 6 additions & 1 deletion
@@ -1,7 +1,12 @@
 #!/bin/bash
 # KV Cache Quality Benchmark — Reproducible verification
 #
-# Proves that 1-bit KV produces byte-identical output to 4-bit uniform.
+# Tests TurboQuant KV cache quality at short (100 tokens) and longer (200+) contexts.
+# Short context: typically byte-identical to uniform baseline.
+# Longer context: outputs diverge but remain coherent (expected behavior).
+#
+# Note: Only key vectors are quantized; value vectors remain FP32.
+#
 # Run: bash bench/kv_quality_bench.sh <model.tqm>
 #
 # Requirements: built tq_run binary in build/
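The divergence points reported in the commit message (~117 tokens for 1-bit, ~140 for 3-bit) come from comparing a quantized decode against a baseline decode token by token. That analysis can be sketched as a first-mismatch scan; `first_divergence` below is a hypothetical helper for illustration, not part of the repo.

```python
def first_divergence(baseline, quantized):
    """Index of the first position where two token sequences differ, or None
    if they are identical (a length mismatch counts as a divergence)."""
    for i, (a, b) in enumerate(zip(baseline, quantized)):
        if a != b:
            return i
    if len(baseline) != len(quantized):
        return min(len(baseline), len(quantized))
    return None

# Toy usage: sequences agree for the first two tokens, then split.
print(first_divergence([1, 2, 3, 4], [1, 2, 9, 4]))  # → 2
```

Run over the two greedy decodes, a return value of `None` corresponds to the "byte-identical" case, and a numeric index is the divergence point the commit documents.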
