|
# TurboQuant.cpp — Session State
|
-**Last updated**: 2026-03-29 (grow round 8)
-**Last commit**: d3e02cd
-**Score**: 99.7%
+**Last updated**: 2026-03-29 (v0.9 Q4 weights — 38 tok/s)
+**Last commit**: 4415bcb
|
-## Current Status
+## Speed Progression
+```
+PyTorch CPU:        0.8 tok/s
+v0.8 FP32:          5 tok/s   (6x PyTorch)
+v0.8 Q8+threads:   21 tok/s   (26x)
+v0.9 Q4+threads:   38 tok/s   (48x)   ← current
+llama.cpp Q4_K_M: ~50 tok/s           ← target
+```
|
-### What Works
-- ✅ **Self-contained LLM inference engine** (pure C, 0 dependencies)
-- ✅ **15.6 tok/s** on CPU (Qwen3.5-0.8B, 4 threads, Q8 weights)
-- ✅ **17x faster than PyTorch CPU**, 1.5x faster than PyTorch+GPU
-- ✅ Q4 weight quantization: 2.1 GB → ~280 MB (7x savings), `-q q4` flag (default)
-- ✅ Q8 weight quantization: 2.1 GB → 533 MB (4x savings), `-q q8` flag
-- ✅ Streaming BF16: embed/lm_head mmap'd, ~1 GB saved
-- ✅ Multi-threaded matmul: pthread, 4 threads, NEON optimized
-- ✅ DeltaNet + Self-Attention hybrid forward pass (Qwen3.5)
-- ✅ HuggingFace BPE tokenizer (248K vocab)
-- ✅ KV cache quantization in inference (Q4, 7.5x compression)
-- ✅ Integer Q4×Q8 attention (2.9x faster than FP32)
-- ✅ tq_chat.py uses native C engine (not PyTorch)
-- ✅ 19 C++ test suites (48+ sub-tests), 22 Python tests
-- ✅ CLI: tq_run, tq, tq_chat (native + pytorch fallback)
+## What Works
+- ✅ 38.2 tok/s CPU (Q4 weights, 4 threads, Qwen3.5-0.8B)
+- ✅ Q4 weights: 270 MB, Q8: 533 MB (vs 2.1 GB FP32; quantization sketch below)
+- ✅ Self-contained C inference engine, 0 dependencies
+- ✅ DeltaNet + Self-Attention hybrid forward pass
+- ✅ KV cache quantization (Q4, 7.5x compression)
+- ✅ Integer Q4×Q8 attention (dot-product sketch below)
+- ✅ 19 C++ + 22 Python tests
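
The Q4 weight format is the main memory lever above. As a rough illustration, here is a minimal sketch of llama.cpp-style block quantization, assuming 32-weight blocks with one FP32 scale each; `QK`, `q4_block_t`, and the symmetric rounding scheme are illustrative choices, not TurboQuant's actual layout:

```c
#include <math.h>
#include <stdint.h>

#define QK 32                          /* weights per quantization block */

typedef struct {
    float   scale;                     /* per-block dequantization scale */
    uint8_t q[QK / 2];                 /* two 4-bit weights packed per byte */
} q4_block_t;

/* Quantize QK float weights into one symmetric Q4 block ([-8, 7]). */
static void q4_quantize_block(const float *w, q4_block_t *out) {
    float amax = 0.0f;
    for (int i = 0; i < QK; i++) {
        float a = fabsf(w[i]);
        if (a > amax) amax = a;
    }
    out->scale = amax / 7.0f;                    /* max magnitude maps to 7 */
    float inv = out->scale > 0.0f ? 1.0f / out->scale : 0.0f;
    for (int i = 0; i < QK; i += 2) {
        int lo = (int)lroundf(w[i]     * inv);
        int hi = (int)lroundf(w[i + 1] * inv);
        lo = lo < -8 ? -8 : (lo > 7 ? 7 : lo);   /* clamp to the int4 range */
        hi = hi < -8 ? -8 : (hi > 7 ? 7 : hi);
        out->q[i / 2] = (uint8_t)((lo + 8) | ((hi + 8) << 4)); /* bias nibbles */
    }
}
```

In this sketch each block costs 16 bytes of packed nibbles plus a 4-byte scale, i.e. 5 bits per weight against 32 for FP32, and dequantization is `scale * (nibble - 8)`.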
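
The integer Q4×Q8 bullet refers to a dot product that never dequantizes inside the hot loop. A hedged sketch, reusing `q4_block_t`/`QK` from the block above; `q8_block_t` and `q4q8_dot` are assumed names, not the engine's real API:

```c
#include <stdint.h>

typedef struct {
    float  scale;                      /* per-block activation scale */
    int8_t q[QK];                      /* activations quantized to int8 */
} q8_block_t;

/* Integer dot product: nibbles × int8 accumulate in int32, with a single
 * float rescale per block instead of per-element dequantization. */
static float q4q8_dot(const q4_block_t *w, const q8_block_t *x, int nblocks) {
    float sum = 0.0f;
    for (int b = 0; b < nblocks; b++) {
        int32_t acc = 0;
        for (int i = 0; i < QK; i += 2) {
            uint8_t byte = w[b].q[i / 2];
            int lo = (int)(byte & 0x0F) - 8;     /* undo the +8 nibble bias */
            int hi = (int)(byte >> 4)   - 8;
            acc += lo * x[b].q[i] + hi * x[b].q[i + 1];
        }
        sum += (float)acc * w[b].scale * x[b].scale;
    }
    return sum;
}
```

Keeping the inner loop in integers is presumably where the reported speedup over the FP32 path comes from; NEON integer multiply-accumulate maps onto this loop directly.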
|
-### What Needs Work (Priority Order)
-1. Metal GPU matmul — Apple GPU for further speed
-2. Value cache quantization — currently keys only
-3. More models — Llama, Phi architecture support
-
-### Key Metrics
-| Metric | Value |
-|--------|-------|
-| CPU inference (4 threads, Q8) | 15.6 tok/s |
-| CPU inference (1 thread) | 7.8 tok/s |
-| PyTorch CPU | 0.8 tok/s (17-20x slower) |
-| PyTorch MPS | 10 tok/s (1.5x slower than our CPU) |
-| Weight memory (Q4) | ~280 MB (7x savings) |
-| Weight memory (Q8) | 533 MB (4x savings) |
-| KV compression | 7.5x (uniform_4b) |
-| Integer attention | 2.9-4.8x faster than FP32 |
-| Logits cosine vs PyTorch | 0.999 |
-| Tests | 19 C++ suites (48+ sub-tests) + 22 Python = 70+ total |
-| Code | 8,500+ lines C, 191 files |
-| Commits | 27 |
+## What Needs Work
+1. Close the llama.cpp gap: 38 → 50 tok/s (matmul tiling; sketch below)
+2. Q4 quality on short prompts
+3. Metal GPU inference
+4. More model architectures
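
On item 1: a minimal sketch of the kind of cache-blocked matmul tiling that usually closes this sort of gap. `TILE`, the FP32 operands, and the i-k-j loop order are assumptions to be tuned, not measured choices; the real kernel would layer this under the existing pthread row split and NEON inner loop:

```c
#include <stddef.h>

#define TILE 64   /* chosen so one tile of A and B fits in L1/L2 cache */

/* C[M][N] += A[M][K] * B[K][N], all row-major FP32. */
static void matmul_tiled(float *restrict C, const float *restrict A,
                         const float *restrict B,
                         size_t M, size_t N, size_t K) {
    for (size_t i0 = 0; i0 < M; i0 += TILE)
        for (size_t k0 = 0; k0 < K; k0 += TILE)
            for (size_t j0 = 0; j0 < N; j0 += TILE) {
                size_t imax = i0 + TILE < M ? i0 + TILE : M;
                size_t kmax = k0 + TILE < K ? k0 + TILE : K;
                size_t jmax = j0 + TILE < N ? j0 + TILE : N;
                /* i-k-j order: A[i][k] stays in a register while the
                 * innermost loop streams B and C rows sequentially. */
                for (size_t i = i0; i < imax; i++)
                    for (size_t k = k0; k < kmax; k++) {
                        float a = A[i * K + k];
                        for (size_t j = j0; j < jmax; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
            }
}
```

Thread-level row partitioning (each of the 4 pthreads taking a slab of `M`) composes with this tiling unchanged.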