# TurboQuant.cpp v0.2 — Every Claim Now Has a Number

We shipped V cache quantization and a full validation suite. Here's what changed.

## What v0.2 adds

**V quantization.** Keys were already 1-bit. Now values are Q4 or Q2.

```
Gemma 3 4B — total K+V per token:

  FP16 baseline:    136.00 KB
  1-bit K + Q4 V:    27.62 KB (4.9x compression)
  1-bit K + Q2 V:    19.12 KB (7.1x compression)
```

At 32K context, that's up to 3.7 GB saved vs FP16 (in Q2 mode). "Paris" still comes out as "Paris."
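
The ratios and the 32K-context savings are plain arithmetic on the per-token sizes in the table above; a quick sanity check:

```python
# Sanity-check the compression table: per-token K+V sizes in KB,
# taken directly from the measured numbers above.
fp16_kb = 136.00        # FP16 baseline, total K+V per token
q4_kb = 27.62           # 1-bit K + Q4 V
q2_kb = 19.12           # 1-bit K + Q2 V

print(f"Q4 compression: {fp16_kb / q4_kb:.1f}x")   # 4.9x
print(f"Q2 compression: {fp16_kb / q2_kb:.1f}x")   # 7.1x

# Savings at a 32K (32768-token) context, in GB.
ctx = 32768
saved_gb = (fp16_kb - q2_kb) * ctx / (1024 ** 2)
print(f"Saved at 32K, Q2 mode: {saved_gb:.2f} GB")  # ~3.65 GB
```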

**Validation.** We found a NEON bug, fixed it, then validated everything:

- 14 tests comparing every NEON path against scalar reference
- 5 tests proving Lloyd-Max codebook centroids match theory within 0.001
- 8 tests measuring attention score distribution preservation
- 29 edge-case tests (NaN, Inf, single token, zero dim, 10K keys)
- ASan + UBSan clean on all 26 test suites
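
As a flavor of what the codebook tests verify: Lloyd's algorithm on a unit Gaussian should converge to the textbook optimal 2-bit centroids, roughly ±0.4528 and ±1.5104. A minimal self-contained sketch of that theory check, not the library's code:

```python
import numpy as np

# Lloyd's algorithm for the optimal 4-level (2-bit) scalar quantizer of N(0,1),
# using a dense grid as a stand-in for the continuous Gaussian density.
x = np.linspace(-6, 6, 20001)
pdf = np.exp(-x**2 / 2)                        # unnormalized N(0,1) density

centroids = np.array([-1.5, -0.5, 0.5, 1.5])   # initial guess
for _ in range(200):
    # Assign each grid point to its nearest centroid...
    idx = np.abs(x[:, None] - centroids[None, :]).argmin(axis=1)
    # ...then move each centroid to the conditional mean of its cell.
    centroids = np.array([np.average(x[idx == k], weights=pdf[idx == k])
                          for k in range(4)])

print(np.round(centroids, 4))   # textbook optimum: +/-0.4528, +/-1.5104
```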

## The numbers that matter

| What | Measured | How to reproduce |
|------|----------|------------------|
| Attention cosine (1-bit) | 0.634 | `test_attention_distribution` |
| Theoretical limit (2/pi) | 0.637 | established in the JL literature |
| Random K cosine | 0.089 | `test_attention_distribution` |
| Codebook MSE vs optimal | < 1.18x | `test_codebook_theory` |
| RHT overhead | 147 ns/vec | `bench_kv_overhead` |
| 1-bit attention | 1.2 ns/key | `bench_kv_overhead` |

The 1-bit cosine of 0.634 matches 2/pi ≈ 0.637. This is not a deficiency — it's the information-theoretic maximum for sign-only quantization. Our implementation reaches the theoretical wall.
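
One simple model that produces exactly the 2/pi correlation: sign-quantize both sides of a Gaussian dot product. The Monte Carlo sketch below assumes i.i.d. Gaussian coordinates (the regime a random rotation such as the RHT is designed to induce); it is an illustration of the theory, not the library's actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 64

# Gaussian queries/keys model the coordinates after a random rotation.
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))

exact = np.einsum('nd,nd->n', q, k)                     # true attention logits
onebit = np.einsum('nd,nd->n', np.sign(q), np.sign(k))  # sign-only scores

corr = np.corrcoef(exact, onebit)[0, 1]
print(f"corr = {corr:.3f}, 2/pi = {2 / np.pi:.3f}")     # both ~0.637
```

The closed form is short: per coordinate, E[q·sign(q)] = sqrt(2/pi), so the covariance of the two score estimates is d·(2/pi) while each has variance d, giving correlation 2/pi regardless of dimension.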

## What we fixed

- **Q4 dequant NEON bug**: Nibble interleaving was wrong, causing 300x worse MSE. Found by testing, fixed with `vzip_u8`.
- **QJL sign bias**: `>= 0.0f` → `> 0.0f` across 11 call sites (CPU/CUDA/Metal).
- **Norm overflow**: Large vectors could overflow `sum += x*x`. Added max-abs rescaling.
- **Thread safety**: Mutex guards on global workspace realloc.
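
To see how a nibble-order bug inflates MSE without crashing anything: if the unpack order doesn't mirror the pack order, every adjacent pair of Q4 codes comes out transposed, so the values are all valid codes but in the wrong positions. The packing convention below (even index in the low nibble) is an assumption for illustration, not necessarily TurboQuant's actual layout:

```python
import numpy as np

rng = np.random.default_rng(1)
codes = rng.integers(0, 16, size=1024).astype(np.uint8)  # 4-bit codes

# Pack: even-index code in the low nibble, odd-index code in the high nibble.
packed = (codes[0::2] | (codes[1::2] << 4)).astype(np.uint8)

# Correct unpack mirrors the packing order.
good = np.empty_like(codes)
good[0::2] = packed & 0x0F
good[1::2] = packed >> 4

# Buggy unpack swaps the nibbles: adjacent pairs come out transposed.
bad = np.empty_like(codes)
bad[0::2] = packed >> 4
bad[1::2] = packed & 0x0F

print("correct MSE:", ((good.astype(int) - codes.astype(int)) ** 2).mean())
print("swapped MSE:", ((bad.astype(int) - codes.astype(int)) ** 2).mean())
```

For uniform random codes the swapped-pair MSE lands around 42 against an exact 0 for the correct unpack, which is why the bug was so visible once a scalar-reference test existed.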
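
The norm fix uses the classic max-abs rescaling trick: divide by the largest magnitude before squaring so the partial sums stay in range, then multiply the scale back at the end. A float32 sketch of the idea (illustrative, not the library's implementation):

```python
import numpy as np

def l2_norm_safe(x: np.ndarray) -> np.float32:
    """L2 norm with max-abs rescaling to avoid overflow in the squared sum."""
    m = np.abs(x).max()
    if m == 0:
        return np.float32(0.0)
    y = x / m                      # now every |y_i| <= 1, so y*y cannot overflow
    return m * np.sqrt((y * y).sum())

x = np.full(1024, 1e20, dtype=np.float32)   # (1e20)^2 = 1e40 > float32 max

naive = np.sqrt((x * x).sum())              # inf: the squares overflow
safe = l2_norm_safe(x)                      # 1e20 * sqrt(1024) = 3.2e21

print(naive, safe)
```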

## What's honest

- 7.1x is total K+V, not K-only. Previous "10.7x" was K-only — now clearly labeled.
- With V quantization (Q4/Q2), outputs diverge from baseline. They remain coherent and factually correct, but are not byte-identical.
- The 30/30 byte-identical result applies to K-only mode (V stays FP16).
- 1-bit attention cosine = 0.634, not 0.99. This is optimal for 1 bit. Want higher? Use 3-bit (0.918).

## Try it

```bash
git clone https://github.com/quantumaikr/TurboQuant.cpp && cd TurboQuant.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release -DTQ_BUILD_TESTS=ON
cmake --build build -j$(nproc)
ctest --test-dir build   # 26/26 should pass
./build/tq_run gemma3-4b.tqm -p "1+1=" -j 6 -n 5 -T 0.0 -k turbo_kv_1b -v q4 -M
```

---

[GitHub](https://github.com/quantumaikr/TurboQuant.cpp) | [Release Notes](../RELEASE_NOTES.md) | [Paper](https://arxiv.org/abs/2504.19874)