|
1 | 1 | # TurboQuant.cpp — Session State |
2 | 2 |
|
3 | | -**Last updated**: 2026-03-29 (Q8 weight quantization implemented) |
4 | | -**Last commit**: pending |
| 3 | +**Last updated**: 2026-03-29 (grow round 8) |
| 4 | +**Last commit**: d3e02cd |
5 | 5 | **Score**: 99.7% |
6 | 6 |
|
7 | 7 | ## Current Status |
8 | 8 |
|
9 | 9 | ### What Works |
10 | | -- ✅ Self-contained inference engine (0 dependencies, pure C) |
11 | | -- ✅ Multi-threaded matmul (4 threads: 31 tok/s inference, 1.56x speedup) |
12 | | -- ✅ Qwen3.5-0.8B: loads, tokenizes, generates correct text |
13 | | -- ✅ DeltaNet + Self-Attention hybrid forward pass (layer-by-layer validated) |
14 | | -- ✅ KV cache quantization library (8 types, integer Q4×Q8 attention) |
15 | | -- ✅ **KV cache quantization integrated into inference forward pass** (quantize-on-store, Q4xQ8 integer attention for seq_len > 32) |
16 | | -- ✅ **tok/s display** in tq_run output (timing via clock_gettime) |
17 | | -- ✅ **Streaming BF16**: embed_tokens + lm_head kept as mmap'd BF16, converted on demand (saves ~2GB for Qwen3.5-0.8B) |
18 | | -- ✅ **Q8 weight quantization**: `-q` flag converts layer weights to int8 + per-block scale (block_size=32), ~2x memory reduction with NEON-optimized Q8 matmul |
19 | | -- ✅ 19 C++ test suites (42 test cases in test_ops), 22 Python tests |
20 | | -- ✅ CLI tools: tq_run (-j threads), tq, tq_chat, tq_realtime_demo |
| 10 | +- ✅ **Self-contained LLM inference engine** (pure C, 0 dependencies) |
| 11 | +- ✅ **15.6 tok/s** on CPU (Qwen3.5-0.8B, 4 threads, Q8 weights) |
| 12 | +- ✅ **17-20x faster than PyTorch CPU**, 1.5x faster than PyTorch MPS (Apple GPU)
| 13 | +- ✅ Q8 weight quantization: 2.1 GB → 533 MB (4x savings), `-q` flag |
| 14 | +- ✅ Streaming BF16: embed/lm_head mmap'd, ~1 GB saved |
| 15 | +- ✅ Multi-threaded matmul: pthread, 4 threads, NEON optimized |
| 16 | +- ✅ DeltaNet + Self-Attention hybrid forward pass (Qwen3.5) |
| 17 | +- ✅ HuggingFace BPE tokenizer (248K vocab) |
| 18 | +- ✅ KV cache quantization in inference (Q4, 7.5x compression) |
| 19 | +- ✅ Integer Q4×Q8 attention (2.9x faster than FP32) |
| 20 | +- ✅ tq_chat.py uses native C engine (not PyTorch) |
| 21 | +- ✅ 19 C++ test suites (48+ sub-tests), 22 Python tests |
| 22 | +- ✅ CLI: tq_run, tq, tq_chat (native + pytorch fallback) |
21 | 23 |
|
22 | 24 | ### What Needs Work (Priority Order) |
23 | | -1. **Memory**: ~~3.3GB~~ ~1.3GB for BF16->FP32 conversion (embed_tokens + lm_head kept as BF16, saving ~2GB). With `-q` flag, layer weights quantized to Q8 (~0.65GB for weights, total ~0.8GB). |
24 | | -2. **Weight quantization**: ~~Q8/Q4 weights for 2x memory reduction~~ Q8 implemented. Q4 weights for further 2x reduction. |
25 | | -3. **Metal GPU inference**: Apple GPU for matmul |
26 | | -4. **Value cache quantization**: currently only keys are quantized in the cache |
| 25 | +1. Metal GPU matmul — Apple GPU for further speed |
| 26 | +2. Q4 weight quantization — additional 2x memory savings |
| 27 | +3. Value cache quantization — currently keys only |
| 28 | +4. More models — Llama, Phi architecture support |
27 | 29 |
|
28 | 30 | ### Key Metrics |
29 | 31 | | Metric | Value | |
30 | 32 | |--------|-------| |
31 | | -| CPU inference (4 threads) | ~31 tok/s (Qwen3.5-0.8B, excl. loading) | |
32 | | -| CPU inference (1 thread) | 12.8 tok/s | |
33 | | -| PyTorch CPU | 0.8 tok/s (16-39x slower) | |
34 | | -| PyTorch MPS | 10 tok/s (3x slower than our CPU) | |
| 33 | +| CPU inference (4 threads, Q8) | 15.6 tok/s | |
| 34 | +| CPU inference (1 thread) | 7.8 tok/s | |
| 35 | +| PyTorch CPU | 0.8 tok/s (17-20x slower) | |
| 36 | +| PyTorch MPS | 10 tok/s (1.5x slower than our CPU) | |
| 37 | +| Weight memory (Q8) | 533 MB (4x savings) | |
35 | 38 | | KV compression | 7.5x (uniform_4b) | |
36 | 39 | | Integer attention | 2.9-4.8x faster than FP32 | |
37 | | -| Real model cosine | 0.994 (A+) | |
38 | | -| Q8 weight mem | ~1.125 bytes/value (vs 4 FP32) | |
39 | | -| Tests | 19 C++ (42 in test_ops) + 22 Python | |
40 | | - |
41 | | -### Files to Read First |
42 | | -- `.claude/state.md` — THIS FILE (session state) |
43 | | -- `program.md` — Agent task specification |
44 | | -- `CLAUDE.md` — Project guide + methodology |
| 40 | +| Logits cosine vs PyTorch | 0.999 | |
| 41 | +| Tests | 19 C++ suites (48+ sub-tests) + 22 Python = 70+ |
| 42 | +| Code | 8,500+ lines C, 191 files | |
| 43 | +| Commits | 27 | |