
Commit 33a98cd

unamedkr and claude committed
docs: release notes system + verification summary in README
- docs/RELEASE_NOTES.md: v0.1.0 (initial) and v0.2.0 (V quant + validation)
- Keep a Changelog format, semantic versioning
- Release process checklist for future releases
- Measured results table with test/script sources
- README: Verification Summary table (14 NEON, 8 attention, 5 codebook, 29 edge cases, 26 ASan, mutex, numerical stability)
- README.ko.md: synced with EN version

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 1937933 commit 33a98cd

3 files changed: 189 additions & 4 deletions

README.ko.md

Lines changed: 16 additions & 2 deletions
@@ -120,8 +120,22 @@ Value (weighted sum — requires MSE-optimal reconstruction):
 - **Independent K+V compression** — 1-bit keys (XOR+popcount) + Q4/Q2 values
 - **Faithful ICLR 2026 implementation** — RHT + Lloyd-Max + QJL residual
 - **Multi-architecture** — Qwen3.5 (DeltaNet) + Gemma 3 (sliding window + GeGLU)
-- **NEON vectorized** — matmul, attention, Hamming distance, FP16 conversion
-- **26 test suites** — KV roundtrip, attention accuracy, codebook, Q2 weights, NEON consistency, attention distribution
+- **NEON vectorized** — matmul, attention, RHT butterfly, Hamming distance, Q4 dequant, FP16 conversion
+- **26 test suites** — KV roundtrip, attention distribution, codebook theory, NEON/scalar consistency, edge cases, Q2 weights
+
+### Verification Summary
+
+| Category | Tests | What's Verified |
+|------|--------|----------|
+| NEON/scalar consistency | 14 | Every NEON path matches the scalar reference (Q4, Q2, RHT, RoPE, matmul, RMSNorm, Hamming) |
+| Attention distribution | 8 | Cosine similarity, Spearman rank, top-k overlap vs FP32 reference |
+| Codebook theory | 5 | Lloyd-Max centroids match literature values; MSE within 1.18x of the information-theoretic optimum |
+| Edge cases | 29 | n=1, dim=0, NaN, Inf, all-same, all-zero, n=10000 |
+| ASan + UBSan | 26 | Full suite passes under sanitizers, zero memory errors |
+| Thread safety | mutex | Global workspace realloc protected against concurrent access |
+| Numerical stability | 4 | Overflow-safe norm (max-abs rescaling), NaN/Inf input guards |
+
+Full details: [docs/RELEASE_NOTES.md](docs/RELEASE_NOTES.md)

 ---

README.md

Lines changed: 16 additions & 2 deletions
@@ -120,8 +120,22 @@ Multi-architecture: Qwen3.5 (DeltaNet hybrid) + Gemma 3 (sliding window). Gemma
 - **K+V independent compression** — 1-bit keys (XOR+popcount) + Q4/Q2 values
 - **Faithful ICLR 2026 implementation** — RHT + Lloyd-Max + QJL residual
 - **Multi-architecture** — Qwen3.5 (DeltaNet) + Gemma 3 (sliding window + GeGLU)
-- **NEON vectorized** — matmul, attention, Hamming distance, FP16 conversion
-- **26 test suites** — KV roundtrip, attention accuracy, codebook, Q2 weights, NEON consistency, attention distribution
+- **NEON vectorized** — matmul, attention, RHT butterfly, Hamming distance, Q4 dequant, FP16 conversion
+- **26 test suites** — KV roundtrip, attention distribution, codebook theory, NEON/scalar consistency, edge cases, Q2 weights
+
+### Verification Summary
+
+| Category | Tests | What's Verified |
+|----------|-------|-----------------|
+| NEON/scalar consistency | 14 | Every NEON path matches scalar reference (Q4, Q2, RHT, RoPE, matmul, RMSNorm, Hamming) |
+| Attention distribution | 8 | Cosine similarity, Spearman rank, top-k overlap vs FP32 reference |
+| Codebook theory | 5 | Lloyd-Max centroids match literature, MSE within 1.18x of info-theoretic optimal |
+| Edge cases | 29 | n=1, dim=0, NaN, Inf, all-same, all-zero, n=10000 |
+| ASan + UBSan | 26 | Full suite under sanitizers, zero memory errors |
+| Thread safety | mutex | Global workspace realloc protected against concurrent access |
+| Numerical stability | 4 | Overflow-safe norm (max-abs rescaling), NaN/Inf input guards |
+
+Full details: [docs/RELEASE_NOTES.md](docs/RELEASE_NOTES.md)

 ---

docs/RELEASE_NOTES.md

Lines changed: 157 additions & 0 deletions
@@ -0,0 +1,157 @@
# Release Notes

All notable changes to TurboQuant.cpp are documented here.
Format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
Versioning follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

---

## [v0.2.0] — 2026-04-01

### Highlights

**V cache quantization** and **expert-grade validation** — total K+V compression reaches 4.9x (Q4) to 7.1x (Q2), with every claim backed by measured data.

### Added

#### V Cache Quantization
- **Q4 value quantization** (`-v q4`): per-block scale + packed 4-bit nibbles. 3.8x V compression.
- **Q2 value quantization** (`-v q2`): 2-bit Lloyd-Max codebook. 7.6x V compression.
- **FP16 value auto-enable**: values are stored as FP16 whenever KV quantization is active (previously FP32).
- Combined 1-bit K + Q4 V: **27.62 KB/token, 4.9x total K+V** (down from 136 KB/token in FP16).
- Combined 1-bit K + Q2 V: **19.12 KB/token, 7.1x total K+V**.
- CLI flag `-v q4|q2|fp16` controls value quantization.
- Memory reporting (`-M`) shows K and V breakdown separately.
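
For orientation, a per-block quantizer of this shape can be sketched in a few lines of C. The block size of 32, the symmetric scale, and the function names below are assumptions for illustration, not the engine's actual Q4 kernel:

```c
#include <math.h>
#include <stdint.h>

// Illustrative Q4 block: one float scale + 32 values packed two-per-byte.
// Block size, symmetric scaling, and rounding policy are assumptions.
typedef struct { float scale; uint8_t packed[16]; } q4_block_t;

static void q4_quantize_block(const float *v, q4_block_t *b) {
    float amax = 0.0f;
    for (int i = 0; i < 32; i++) { float a = fabsf(v[i]); if (a > amax) amax = a; }
    b->scale = amax / 7.0f;                               // map [-amax, amax] onto roughly [-7, 7]
    float inv = (b->scale > 0.0f) ? 1.0f / b->scale : 0.0f;
    for (int i = 0; i < 16; i++) {
        int lo = (int)lrintf(v[2 * i]     * inv) + 8;     // bias to unsigned 0..15
        int hi = (int)lrintf(v[2 * i + 1] * inv) + 8;
        if (lo < 0) lo = 0; if (lo > 15) lo = 15;
        if (hi < 0) hi = 0; if (hi > 15) hi = 15;
        b->packed[i] = (uint8_t)(lo | (hi << 4));         // two nibbles per byte
    }
}

static void q4_dequantize_block(const q4_block_t *b, float *v) {
    for (int i = 0; i < 16; i++) {
        v[2 * i]     = ((int)(b->packed[i] & 0x0F) - 8) * b->scale;
        v[2 * i + 1] = ((int)(b->packed[i] >> 4)   - 8) * b->scale;
    }
}
```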

#### Validation Suite
- **NEON/scalar consistency** (`tests/test_neon_scalar.cpp`): 14 tests verify every NEON path against pure C reference — Q4 dequant, Q2 dequant, RHT butterfly, RoPE, matmul, RMSNorm, Hamming attention.
- **Attention distribution** (`tests/test_attention_distribution.cpp`): 8 tests measure cosine similarity (0.996/0.918/0.634), Spearman rank correlation, top-k overlap. Proves compression is non-trivial (random K = 0.089).
- **Codebook theory** (`tests/test_codebook_theory.cpp`): 5 tests verify Lloyd-Max centroids match N(0,1) literature values within 0.001, MSE within 1.18x of information-theoretic optimal.
- **Edge cases** (`tests/test_edge_cases.cpp`): 29 tests — n=1 (single token), dim=0, NaN input, Inf input, all-same values, all-zero, n=10000 large sequence.
- **Numerical stability**: 4 tests for overflow-safe norm computation and NaN/Inf input guards.
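
The consistency tests reduce to running a kernel through both paths and comparing outputs elementwise. A generic sketch of that pattern follows; the kernel signature and tolerance are assumptions, and the real tests are Google Test cases in `tests/test_neon_scalar.cpp`:

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

// Generic NEON-vs-scalar comparison harness; kernel signature and tolerance are assumed.
typedef void (*rmsnorm_fn)(float *out, const float *x, const float *w, int n);

static int check_consistency(rmsnorm_fn neon_impl, rmsnorm_fn scalar_ref, int n) {
    float *x = malloc(n * sizeof(float)), *w = malloc(n * sizeof(float));
    float *a = malloc(n * sizeof(float)), *b = malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = (float)rand() / RAND_MAX - 0.5f; w[i] = 1.0f; }
    neon_impl(a, x, w, n);       // vectorized path
    scalar_ref(b, x, w, n);      // pure C reference
    int ok = 1;
    for (int i = 0; i < n; i++) {
        float diff = fabsf(a[i] - b[i]);
        if (diff > 1e-5f * (1.0f + fabsf(b[i]))) {       // relative-ish tolerance (assumed)
            fprintf(stderr, "mismatch at %d: %g vs %g\n", i, a[i], b[i]);
            ok = 0;
        }
    }
    free(x); free(w); free(a); free(b);
    return ok;
}
```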

#### Benchmark Scripts
- `bench/ablation_test.sh`: Divergence analysis at 50-300 tokens across KV types.
- `bench/long_quality_test.sh`: Coherence at 200/500/1000 tokens.
- `bench/sampling_test.sh`: Temperature sampling (T=0.3, T=0.7) comparison.
- `bench/quant_time_bench.sh`: Quantization timing wrapper.
- `bench/bench_kv_overhead.cpp`: Microbenchmark — uniform 148 ns, 1b 659 ns, 3b 11066 ns per vector.
- `bench/attention_dist_test.sh`: Attention distribution analysis wrapper.
- `scripts/sanitize.sh`: ASan + UBSan build and full test run.
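
The ns-per-vector figures come from timing many calls and averaging, roughly as in the following sketch; `quantize_vec`, the clock source, and the iteration count are placeholders rather than the actual `bench_kv_overhead.cpp` code:

```c
#include <stdint.h>
#include <time.h>

// Rough ns-per-vector timing loop; quantize_vec and the iteration count are placeholders.
extern void quantize_vec(const float *in, uint8_t *out, int dim);

static double bench_ns_per_vec(const float *in, uint8_t *out, int dim, int iters) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++)
        quantize_vec(in, out, dim);              // kernel under test
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / iters;                           // average nanoseconds per vector
}
```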

### Fixed

- **Q4 dequant NEON nibble interleaving bug**: lo/hi nibbles were written contiguously instead of interleaved, causing MSE 0.525 (300x worse than correct). Fixed with a `vzip_u8` interleave.
- **QJL sign bias**: `proj >= 0.0f` → `proj > 0.0f` across 11 occurrences (CPU, CUDA, Metal). Eliminates asymmetric bias at the zero-projection boundary.
- **Norm overflow**: QJL norm computation now uses max-abs rescaling to prevent float overflow on large vectors.
- **NaN/Inf input guard**: Quantization functions zero-fill output block on NaN/Inf input instead of producing undefined output.
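
The root cause of the interleaving bug is easiest to see in a minimal sketch: extracting low and high nibbles produces the even- and odd-indexed values in separate vectors, which must be zipped back together before storing. The packing layout assumed below (even index in the low nibble) and the function are illustrative, not the engine's kernel:

```c
#include <arm_neon.h>
#include <stdint.h>

// Unpack 8 bytes = 16 nibbles, assuming value 2i sits in the low nibble and 2i+1 in the high one.
static void q4_unpack16(const uint8_t *packed, uint8_t *out) {
    uint8x8_t bytes = vld1_u8(packed);
    uint8x8_t lo = vand_u8(bytes, vdup_n_u8(0x0F));   // even-indexed values
    uint8x8_t hi = vshr_n_u8(bytes, 4);               // odd-indexed values
    // Buggy variant: vst1_u8(out, lo); vst1_u8(out + 8, hi);  -- scrambles element order.
    uint8x8x2_t z = vzip_u8(lo, hi);                  // interleave: out[2i]=lo[i], out[2i+1]=hi[i]
    vst1_u8(out, z.val[0]);
    vst1_u8(out + 8, z.val[1]);
}
```

With the interleave in place, dequantized values come back in their original element order, which is what restores the expected MSE.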

### Changed

- **Thread safety**: Global Q8 workspace (`g_q8_buf`) and sampler probability index (`g_probindex`) protected by mutex against concurrent realloc races.
- **RHT NEON vectorized**: Walsh-Hadamard butterfly uses `float32x4_t` for stages with len >= 4.
- **Q4 dequant NEON restored**: Properly vectorized with `vzip_u8` after bug fix (was scalar fallback).
- Test suite count: 23 → 26. Edge case count: 16 → 29.
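
The protected pattern is a lazily grown global scratch buffer. A condensed pthread sketch is below; the buffer names and growth policy are simplified placeholders, not the engine's code:

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

// Lazily grown global workspace; the mutex keeps two threads from racing on realloc.
static pthread_mutex_t g_ws_lock = PTHREAD_MUTEX_INITIALIZER;
static int8_t *g_q8_buf = NULL;
static size_t  g_q8_cap = 0;

static int8_t *get_q8_workspace(size_t bytes) {
    pthread_mutex_lock(&g_ws_lock);
    if (bytes > g_q8_cap) {
        int8_t *p = realloc(g_q8_buf, bytes);   // only ever called under the lock
        if (!p) { pthread_mutex_unlock(&g_ws_lock); return NULL; }
        g_q8_buf = p;
        g_q8_cap = bytes;
    }
    int8_t *buf = g_q8_buf;
    pthread_mutex_unlock(&g_ws_lock);
    return buf;   // assumed usage: callers do not hold this pointer across a later grow
}
```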

### Measured Results

| Metric | Value | Source |
|--------|-------|--------|
| Total K+V compression (1b K + Q4 V) | 4.9x | `tq_run -M` |
| Total K+V compression (1b K + Q2 V) | 7.1x | `tq_run -M` |
| 32K context savings (Q4 V) | 3.4 GB | calculated |
| Attention cosine (uniform_4b) | 0.996 | `test_attention_distribution` |
| Attention cosine (turbo_kv_3b) | 0.918 | `test_attention_distribution` |
| Attention cosine (turbo_kv_1b) | 0.634 (≈ 2/π) | `test_attention_distribution` |
| Random K cosine | 0.089 | `test_attention_distribution` |
| Lloyd-Max MSE vs theory | < 1.18x | `test_codebook_theory` |
| RHT overhead | 147 ns/vec | `bench_kv_overhead` |
| 1-bit attention | 1.2 ns/key | `bench_kv_overhead` |
| ASan + UBSan | 26/26 clean | `scripts/sanitize.sh` |
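
The attention-cosine rows compare the attention-weight vector computed from quantized keys against the FP32 reference; the metric itself is plain cosine similarity. A generic helper for reference, not the test's code:

```c
#include <math.h>

// Cosine similarity between a quantized-path attention distribution and the FP32 reference.
static float cosine_similarity(const float *a, const float *b, int n) {
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (int i = 0; i < n; i++) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    if (na == 0.0f || nb == 0.0f) return 0.0f;   // degenerate vectors (assumed convention)
    return dot / (sqrtf(na) * sqrtf(nb));
}
```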

---

## [v0.1.0] — 2026-03-31

### Highlights

**Initial release** — pure C inference engine with TurboQuant KV cache compression. 1-bit keys, 10.7x key compression, byte-identical greedy output at 100 tokens.

### Added

#### Core Engine
- Complete transformer inference engine in pure C11 (10,000+ lines).
- Multi-architecture support: Gemma 3 (sliding window, GeGLU, dual RoPE) + Qwen3.5 (DeltaNet hybrid).
- Multi-shard safetensors loading (Gemma 4B = 2 shards, 883 tensors).
- Dual tokenizer: GPT2 byte-level BPE + SentencePiece auto-detect.
- TQM binary format: pre-quantized mmap, instant loading.

#### KV Cache Quantization (11 types)
- **TurboQuant KV 1-bit**: Sign-only after RHT. XOR + popcount attention (NEON `vcntq_u8`).
- **TurboQuant KV 3-bit**: 2-bit Lloyd-Max codebook + 1-bit QJL residual.
- **TurboQuant KV 4-bit**: 3-bit codebook + 1-bit QJL.
- **Uniform 4-bit / 2-bit**: Standard min-max quantization.
- **PolarQuant**: Polar coordinate (theta + radius) quantization.
- **QJL**: Quantized Johnson-Lindenstrauss sign hash.
- **Mixed / TurboQuant base**: Combined polar + QJL.
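
The 1-bit key path scores a query's sign pattern against each stored key with XOR + popcount. A minimal NEON sketch of that kernel follows; the mapping from Hamming distance to a sign-agreement score at the end is one possible convention, not necessarily the engine's:

```c
#include <arm_neon.h>
#include <stdint.h>

// Hamming distance between two sign-bit-packed vectors of `nbytes` bytes (8 dims per byte).
static uint32_t hamming_bits(const uint8_t *a, const uint8_t *b, int nbytes) {
    uint32_t total = 0;
    int i = 0;
    for (; i + 16 <= nbytes; i += 16) {
        uint8x16_t x = veorq_u8(vld1q_u8(a + i), vld1q_u8(b + i));  // XOR marks differing sign bits
        total += vaddlvq_u8(vcntq_u8(x));                           // per-byte popcount, then sum
    }
    for (; i < nbytes; i++)
        total += (uint32_t)__builtin_popcount((unsigned)(a[i] ^ b[i]));
    return total;
}

// Sign-agreement score: matching bits minus mismatching bits (assumed scoring convention).
static int32_t sign_score(const uint8_t *q, const uint8_t *k, int nbytes) {
    int32_t dims = 8 * nbytes;
    return dims - 2 * (int32_t)hamming_bits(q, k, nbytes);
}
```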

#### Weight Quantization
- Q4 weight quantization (4-bit per-block).
- Q2 weight quantization (2-bit Lloyd-Max codebook, Q2xQ8 integer matmul).
- BF16 weight support.

#### Performance
- NEON vectorized: 2-row matmul batching, fused dot products, Hamming distance.
- Thread pool with configurable thread count.
- Apple Silicon optimized.
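
Two-row batching amortizes the activation load: each chunk of `x` is loaded once and feeds two fused multiply-accumulates. A hedged sketch with an assumed row layout and names, not the engine's kernel:

```c
#include <arm_neon.h>

// Two output rows per pass share one load of x (illustrative sketch).
static void matmul_2rows(const float *w0, const float *w1, const float *x, int n,
                         float *y0, float *y1) {
    float32x4_t acc0 = vdupq_n_f32(0.0f), acc1 = vdupq_n_f32(0.0f);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t xv = vld1q_f32(x + i);                 // loaded once, used twice
        acc0 = vfmaq_f32(acc0, vld1q_f32(w0 + i), xv);     // fused multiply-add, row 0
        acc1 = vfmaq_f32(acc1, vld1q_f32(w1 + i), xv);     // fused multiply-add, row 1
    }
    float s0 = vaddvq_f32(acc0), s1 = vaddvq_f32(acc1);
    for (; i < n; i++) { s0 += w0[i] * x[i]; s1 += w1[i] * x[i]; }
    *y0 = s0; *y1 = s1;
}
```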

#### Quality Verification
- 30/30 byte-identical greedy matches (K-only, 100 tokens, 10 diverse prompts).
- 23 test suites (Google Test).
- Qwen3.5: 0.999 cosine similarity vs PyTorch reference.
- Gemma 270M: per-layer exact match.

### Models Verified

| Model | Params | Speed (Q4, 6 threads) |
|-------|--------|-----------------------|
| Gemma 3 4B | 4B | 20.2 tok/s |
| Qwen3.5-0.8B | 752M | 80.1 tok/s |
| Gemma 3 270M | 270M | 176 tok/s |

---

## Release Process

### Version Scheme

```
v{MAJOR}.{MINOR}.{PATCH}

MAJOR: Breaking API changes
MINOR: New features, backward-compatible
PATCH: Bug fixes, performance improvements
```

### Checklist for New Releases

1. Update the version in `CMakeLists.txt` (`project(turboquant VERSION x.y.z)`)
2. Add a release section to this file (newest first)
3. Update the badge version in `README.md` and `README.ko.md`
4. Run full validation:
   ```bash
   cmake --build build -j$(nproc) && ctest --test-dir build
   bash scripts/sanitize.sh
   ./build/tq_run gemma3-4b.tqm -p "The capital of France is" -j 6 -n 20 -T 0.0 -k turbo_kv_1b -v q4
   ```
5. Tag: `git tag -a v0.x.0 -m "Release v0.x.0"`
6. Push: `git push origin v0.x.0`
7. Create a GitHub release with this section's content

### What Goes in Release Notes

- **Added**: New features, new tests, new benchmarks
- **Fixed**: Bug fixes (with root cause and impact)
- **Changed**: Behavior changes, performance improvements
- **Measured Results**: Table of key metrics with source (test name or script)
- **Breaking**: API changes that require user code modification
