Commit b571fe0

unamedkr and claude committed
Honest limits: document divergence at ~120 tokens, V remains FP32
Analysis found:
- 1-bit KV: byte-identical up to ~117 tokens (context ~132 = TQ_BK)
- 3-bit KV: byte-identical up to ~140 tokens
- Beyond divergence point: output differs but stays coherent English
- Root cause: only key vectors quantized, values remain FP32
- Divergence expected — quantized attention scores shift softmax

README updated with honest scope note. Benchmark script header clarified.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
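The root cause noted above (quantized keys perturb attention scores, which in turn shifts the softmax over cached positions) can be illustrated with a toy NumPy sketch. The sign-and-rescale quantizer below is a generic 1-bit stand-in for illustration only, not TurboQuant's actual scheme; values stay FP32, as in the commit.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
q = rng.standard_normal(d)          # one query vector
K = rng.standard_normal((8, d))     # 8 cached key vectors, FP32

# Hypothetical 1-bit key quantizer: keep the sign, rescale by mean |x| per vector.
scale = np.abs(K).mean(axis=1, keepdims=True)
K_q = np.sign(K) * scale

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

p_fp32 = softmax(K @ q / np.sqrt(d))    # attention weights with FP32 keys
p_1bit = softmax(K_q @ q / np.sqrt(d))  # attention weights with quantized keys

# The two distributions are close but not identical; over long greedy decodes
# these small shifts can accumulate until an argmax flips and outputs diverge.
print(np.abs(p_fp32 - p_1bit).max())
```

With short contexts the perturbation is usually too small to flip any greedy token choice, which is consistent with the byte-identical results up to ~117 tokens.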
1 parent 1afb716 commit b571fe0

2 files changed: 12 additions & 2 deletions

README.md

Lines changed: 6 additions & 1 deletion
@@ -7,7 +7,7 @@
 [![Tests](https://img.shields.io/badge/tests-23%20suites-brightgreen)]()
 [![KV Quality](https://img.shields.io/badge/KV%20quality-30%2F30%20byte--identical-brightgreen)]()
 
-### 1-bit KV cache. 10.7x compression. Zero quality loss.
+### 1-bit KV cache. 10.7x compression. Quality preserved.
 
 ```
 Gemma 3 4B, greedy decode, 10 prompts × 100 tokens:
@@ -19,6 +19,11 @@ Gemma 3 4B, greedy decode, 10 prompts × 100 tokens:
 bash bench/kv_quality_bench.sh gemma3-4b.tqm ← reproduce it yourself
 ```
 
+> **Scope:** Key vectors are quantized; value vectors remain FP32. Greedy decode is
+> byte-identical up to ~120 tokens on Gemma 4B. Beyond that, outputs diverge but
+> remain coherent and comparable quality. This is expected — quantized attention
+> scores produce slightly different softmax distributions over longer contexts.
+
 ---
 
 ## Why This Matters

bench/kv_quality_bench.sh

Lines changed: 6 additions & 1 deletion
@@ -1,7 +1,12 @@
 #!/bin/bash
 # KV Cache Quality Benchmark — Reproducible verification
 #
-# Proves that 1-bit KV produces byte-identical output to 4-bit uniform.
+# Tests TurboQuant KV cache quality at short (100 tokens) and longer (200+) contexts.
+# Short context: typically byte-identical to uniform baseline.
+# Longer context: outputs diverge but remain coherent (expected behavior).
+#
+# Note: Only key vectors are quantized; value vectors remain FP32.
+#
 # Run: bash bench/kv_quality_bench.sh <model.tqm>
 #
 # Requirements: built tq_run binary in build/
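The divergence points reported in the commit message (~117 tokens for 1-bit, ~140 for 3-bit) come from comparing a quantized decode against a baseline decode token by token. That analysis can be sketched as a first-mismatch scan; `first_divergence` below is a hypothetical helper for illustration, not part of the repo.

```python
def first_divergence(baseline, quantized):
    """Index of the first position where two token sequences differ, or None
    if they are identical (a length mismatch counts as a divergence)."""
    for i, (a, b) in enumerate(zip(baseline, quantized)):
        if a != b:
            return i
    if len(baseline) != len(quantized):
        return min(len(baseline), len(quantized))
    return None

# Toy usage: sequences agree for the first two tokens, then split.
print(first_divergence([1, 2, 3, 4], [1, 2, 9, 4]))  # → 2
```

Run over the two greedy decodes, a return value of `None` corresponds to the "byte-identical" case, and a numeric index is the divergence point the commit documents.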
