# Release Notes

All notable changes to TurboQuant.cpp are documented here.
Format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
Versioning follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

---

## [v0.2.0] — 2026-04-01

### Highlights

**V cache quantization** and **expert-grade validation** — total K+V compression reaches 4.9x (Q4) to 7.1x (Q2), with every claim backed by measured data.

### Added

#### V Cache Quantization
- **Q4 value quantization** (`-v q4`): 4-bit per-block scale + packed nibbles. V compression 3.8x.
- **Q2 value quantization** (`-v q2`): 2-bit Lloyd-Max codebook. V compression 7.6x.
- **FP16 value auto-enable**: values are stored as FP16 when KV quantization is active (previously FP32).
- Combined 1-bit K + Q4 V: **27.62 KB/token, 4.9x total K+V** (vs. 136 KB/token at FP16).
- Combined 1-bit K + Q2 V: **19.12 KB/token, 7.1x total K+V**.
- CLI flag `-v q4|q2|fp16` controls value quantization.
- Memory reporting (`-M`) shows the K and V breakdown separately.

#### Validation Suite
- **NEON/scalar consistency** (`tests/test_neon_scalar.cpp`): 14 tests verify every NEON path against a pure C reference — Q4 dequant, Q2 dequant, RHT butterfly, RoPE, matmul, RMSNorm, Hamming attention.
- **Attention distribution** (`tests/test_attention_distribution.cpp`): 8 tests measure cosine similarity (0.996/0.918/0.634), Spearman rank correlation, and top-k overlap. Shows the compression is non-trivial (random K scores 0.089).
- **Codebook theory** (`tests/test_codebook_theory.cpp`): 5 tests verify that Lloyd-Max centroids match N(0,1) literature values within 0.001 and that MSE is within 1.18x of the information-theoretic optimum.
- **Edge cases** (`tests/test_edge_cases.cpp`): 29 tests — n=1 (single token), dim=0, NaN input, Inf input, all-same values, all-zero, n=10000 large sequence.
- **Numerical stability**: 4 tests for overflow-safe norm computation and NaN/Inf input guards.

#### Benchmark Scripts
- `bench/ablation_test.sh`: divergence analysis at 50-300 tokens across KV types.
- `bench/long_quality_test.sh`: coherence at 200/500/1000 tokens.
- `bench/sampling_test.sh`: temperature sampling comparison (T=0.3, T=0.7).
- `bench/quant_time_bench.sh`: quantization timing wrapper.
- `bench/bench_kv_overhead.cpp`: microbenchmark — uniform 148 ns, 1b 659 ns, 3b 11066 ns per vector.
- `bench/attention_dist_test.sh`: attention distribution analysis wrapper.
- `scripts/sanitize.sh`: ASan + UBSan build and full test run.

### Fixed

- **Q4 dequant NEON nibble interleaving bug**: lo/hi nibbles were written contiguously instead of interleaved, causing MSE 0.525 (300x worse than the correct path). Fixed with a `vzip_u8` interleave.
- **QJL sign bias**: `proj >= 0.0f` → `proj > 0.0f` across 11 occurrences (CPU, CUDA, Metal). Eliminates the asymmetric bias at the zero-projection boundary.
- **Norm overflow**: QJL norm computation now uses max-abs rescaling to prevent float overflow on large vectors.
- **NaN/Inf input guard**: quantization functions zero-fill the output block on NaN/Inf input instead of producing undefined output.

### Changed

- **Thread safety**: the global Q8 workspace (`g_q8_buf`) and sampler probability index (`g_probindex`) are now protected by a mutex against concurrent realloc races.
- **RHT NEON vectorized**: the Walsh-Hadamard butterfly uses `float32x4_t` for stages with len >= 4.
- **Q4 dequant NEON restored**: properly vectorized with `vzip_u8` after the bug fix (was a scalar fallback).
- Test suite count: 23 → 26. Edge case count: 16 → 29.

### Measured Results

| Metric | Value | Source |
|--------|-------|--------|
| Total K+V compression (1b K + Q4 V) | 4.9x | `tq_run -M` |
| Total K+V compression (1b K + Q2 V) | 7.1x | `tq_run -M` |
| 32K context savings (Q4 V) | 3.4 GB | calculated |
| Attention cosine (uniform_4b) | 0.996 | `test_attention_distribution` |
| Attention cosine (turbo_kv_3b) | 0.918 | `test_attention_distribution` |
| Attention cosine (turbo_kv_1b) | 0.634 (≈ 2/pi) | `test_attention_distribution` |
| Random K cosine | 0.089 | `test_attention_distribution` |
| Lloyd-Max MSE vs theory | < 1.18x | `test_codebook_theory` |
| RHT overhead | 147 ns/vec | `bench_kv_overhead` |
| 1-bit attention | 1.2 ns/key | `bench_kv_overhead` |
| ASan + UBSan | 26/26 clean | `scripts/sanitize.sh` |

---

## [v0.1.0] — 2026-03-31

### Highlights

**Initial release** — pure C inference engine with TurboQuant KV cache compression: 1-bit keys, 10.7x key compression, byte-identical greedy output at 100 tokens.

### Added

#### Core Engine
- Complete transformer inference engine in pure C11 (10,000+ lines).
- Multi-architecture support: Gemma 3 (sliding window, GeGLU, dual RoPE) + Qwen3.5 (DeltaNet hybrid).
- Multi-shard safetensors loading (Gemma 4B = 2 shards, 883 tensors).
- Dual tokenizer: GPT2 byte-level BPE + SentencePiece auto-detect.
- TQM binary format: pre-quantized, mmap-backed, instant loading.

#### KV Cache Quantization (11 types)
- **TurboQuant KV 1-bit**: sign-only after RHT; XOR + popcount attention (NEON `vcntq_u8`).
- **TurboQuant KV 3-bit**: 2-bit Lloyd-Max codebook + 1-bit QJL residual.
- **TurboQuant KV 4-bit**: 3-bit codebook + 1-bit QJL residual.
- **Uniform 4-bit / 2-bit**: standard min-max quantization.
- **PolarQuant**: polar-coordinate (theta + radius) quantization.
- **QJL**: quantized Johnson-Lindenstrauss sign hash.
- **Mixed / TurboQuant base**: combined polar + QJL.

#### Weight Quantization
- Q4 weight quantization (4-bit per-block).
- Q2 weight quantization (2-bit Lloyd-Max codebook, Q2xQ8 integer matmul).
- BF16 weight support.

#### Performance
- NEON vectorized: 2-row matmul batching, fused dot products, Hamming distance.
- Thread pool with configurable thread count.
- Apple Silicon optimized.

#### Quality Verification
- 30/30 byte-identical greedy matches (K-only, 100 tokens, 10 diverse prompts).
- 23 test suites (Google Test).
- Qwen3.5: 0.999 cosine similarity vs PyTorch reference.
- Gemma 270M: per-layer exact match.

### Models Verified

| Model | Params | Speed (Q4, 6T) |
|-------|--------|----------------|
| Gemma 3 4B | 4B | 20.2 tok/s |
| Qwen3.5-0.8B | 752M | 80.1 tok/s |
| Gemma 3 270M | 270M | 176 tok/s |

---

## Release Process

### Version Scheme

```
v{MAJOR}.{MINOR}.{PATCH}

MAJOR: Breaking API changes
MINOR: New features, backward-compatible
PATCH: Bug fixes, performance improvements
```

### Checklist for New Releases

1. Update the version in `CMakeLists.txt` (`project(turboquant VERSION x.y.z)`)
2. Add a release section to this file (newest first)
3. Update the badge version in `README.md` and `README.ko.md`
4. Run full validation:
   ```bash
   cmake --build build -j$(nproc) && ctest --test-dir build
   bash scripts/sanitize.sh
   ./build/tq_run gemma3-4b.tqm -p "The capital of France is" -j 6 -n 20 -T 0.0 -k turbo_kv_1b -v q4
   ```
5. Tag: `git tag -a v0.x.0 -m "Release v0.x.0"`
6. Push: `git push origin v0.x.0`
7. Create a GitHub release with this section's content

### What Goes in Release Notes

- **Added**: new features, new tests, new benchmarks
- **Fixed**: bug fixes (with root cause and impact)
- **Changed**: behavior changes, performance improvements
- **Measured Results**: table of key metrics with source (test name or script)
- **Breaking**: API changes that require user code modification