
Commit f8f286d

unamedkr and claude committed
state: 38 tok/s (Q4), tracking llama.cpp gap
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 4415bcb · commit f8f286d

1 file changed: .claude/state.md (23 additions & 39 deletions)
````diff
@@ -1,44 +1,28 @@
 # TurboQuant.cpp — Session State
 
-**Last updated**: 2026-03-29 (grow round 8)
-**Last commit**: d3e02cd
-**Score**: 99.7%
+**Last updated**: 2026-03-29 (v0.9 Q4 weights — 38 tok/s)
+**Last commit**: 4415bcb
 
-## Current Status
+## Speed Progression
+```
+PyTorch CPU:       0.8 tok/s
+v0.8 FP32:          5 tok/s  (6x PyTorch)
+v0.8 Q8+threads:   21 tok/s  (26x)
+v0.9 Q4+threads:   38 tok/s  (48x) ← current
+llama.cpp Q4_K_M: ~50 tok/s  ← target
+```
 
-### What Works
-- **Self-contained LLM inference engine** (pure C, 0 dependencies)
-- **15.6 tok/s** on CPU (Qwen3.5-0.8B, 4 threads, Q8 weights)
-- **17x faster than PyTorch CPU**, 1.5x faster than PyTorch+GPU
-- ✅ Q4 weight quantization: 2.1 GB → ~280 MB (7x savings), `-q q4` flag (default)
-- ✅ Q8 weight quantization: 2.1 GB → 533 MB (4x savings), `-q q8` flag
-- ✅ Streaming BF16: embed/lm_head mmap'd, ~1 GB saved
-- ✅ Multi-threaded matmul: pthread, 4 threads, NEON optimized
-- ✅ DeltaNet + Self-Attention hybrid forward pass (Qwen3.5)
-- ✅ HuggingFace BPE tokenizer (248K vocab)
-- ✅ KV cache quantization in inference (Q4, 7.5x compression)
-- ✅ Integer Q4×Q8 attention (2.9x faster than FP32)
-- ✅ tq_chat.py uses native C engine (not PyTorch)
-- ✅ 19 C++ test suites (48+ sub-tests), 22 Python tests
-- ✅ CLI: tq_run, tq, tq_chat (native + pytorch fallback)
+## What Works
+- ✅ 38.2 tok/s CPU (Q4 weights, 4 threads, Qwen3.5-0.8B)
+- ✅ Q4 weights: 270 MB, Q8: 533 MB (vs 2.1 GB FP32)
+- ✅ Self-contained C inference engine, 0 dependencies
+- ✅ DeltaNet + Self-Attention hybrid forward pass
+- ✅ KV cache quantization (Q4, 7.5x compression)
+- ✅ Integer Q4×Q8 attention
+- ✅ 19 C++ + 22 Python tests
 
-### What Needs Work (Priority Order)
-1. Metal GPU matmul — Apple GPU for further speed
-2. Value cache quantization — currently keys only
-3. More models — Llama, Phi architecture support
-
-### Key Metrics
-| Metric | Value |
-|--------|-------|
-| CPU inference (4 threads, Q8) | 15.6 tok/s |
-| CPU inference (1 thread) | 7.8 tok/s |
-| PyTorch CPU | 0.8 tok/s (17-20x slower) |
-| PyTorch MPS | 10 tok/s (1.5x slower than our CPU) |
-| Weight memory (Q4) | ~280 MB (7x savings) |
-| Weight memory (Q8) | 533 MB (4x savings) |
-| KV compression | 7.5x (uniform_4b) |
-| Integer attention | 2.9-4.8x faster than FP32 |
-| Logits cosine vs PyTorch | 0.999 |
-| Tests | 19 C++ + 22 Python = 70+ |
-| Code | 8,500+ lines C, 191 files |
-| Commits | 27 |
+## What Needs Work
+1. Close llama.cpp gap: 38 → 50 tok/s (matmul tiling)
+2. Q4 quality on short prompts
+3. Metal GPU inference
+4. More model architectures
````
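The Q4 weight path ("Q4 weights: 270 MB" vs 2.1 GB FP32) comes from storing each weight as a 4-bit code plus a shared per-group scale. Below is a minimal sketch of blockwise absmax Q4 quantization in C; the group size of 32, the symmetric offset-8 encoding, and the struct layout are illustrative assumptions, not TurboQuant.cpp's actual format.

```c
/* Minimal sketch of blockwise 4-bit (Q4) weight quantization.
 * Assumptions (not taken from the repo): group size 32, symmetric
 * absmax scaling, two 4-bit codes packed per byte, one fp32 scale
 * per group. The real layout in TurboQuant.cpp may differ. */
#include <math.h>
#include <stddef.h>
#include <stdint.h>

#define QGROUP 32

typedef struct {
    float   scale;              /* dequant: w ≈ (code - 8) * scale */
    uint8_t packed[QGROUP / 2]; /* two 4-bit codes per byte        */
} q4_block_t;

/* Quantize n fp32 weights (n divisible by QGROUP) into Q4 blocks. */
static void q4_quantize(const float *w, q4_block_t *out, size_t n) {
    for (size_t g = 0; g < n / QGROUP; g++) {
        const float *src = w + g * QGROUP;
        float amax = 0.0f;
        for (int i = 0; i < QGROUP; i++)
            amax = fmaxf(amax, fabsf(src[i]));
        float scale = amax / 7.0f;          /* codes 0..15 map to -8..7 */
        out[g].scale = scale;
        for (int i = 0; i < QGROUP; i += 2) {
            int q0 = scale ? (int)lroundf(src[i]     / scale) + 8 : 8;
            int q1 = scale ? (int)lroundf(src[i + 1] / scale) + 8 : 8;
            if (q0 < 0) q0 = 0; if (q0 > 15) q0 = 15;   /* defensive clamp */
            if (q1 < 0) q1 = 0; if (q1 > 15) q1 = 15;
            out[g].packed[i / 2] = (uint8_t)(q0 | (q1 << 4));
        }
    }
}
```

Under these assumptions the cost is roughly 0.6 bytes per weight (half a byte of codes plus one float scale per 32 weights), which is the right order of magnitude for the multi-fold shrink the diff reports.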
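The "Integer Q4×Q8 attention" and "KV cache quantization (Q4)" lines go together: cached keys are held as 4-bit codes and attention scores are accumulated against 8-bit queries in integer arithmetic, rescaling once per group. A scalar sketch under the same assumed layout as above; the real kernel is presumably NEON-vectorized, and the actual group size and scale handling are not shown in the diff.

```c
/* Scalar sketch of a Q4 x Q8 dot product: 4-bit keys (offset-8 codes,
 * two per byte) against 8-bit signed queries, accumulated in int32 and
 * rescaled once per group. Layout is an illustrative assumption. */
#include <stddef.h>
#include <stdint.h>

#define QGROUP 32

static float dot_q4_q8(const uint8_t *k_packed, const float *k_scales,
                       const int8_t *q, const float *q_scales, size_t n) {
    float acc = 0.0f;
    for (size_t g = 0; g < n / QGROUP; g++) {
        int32_t isum = 0;
        const uint8_t *kp = k_packed + g * (QGROUP / 2);
        const int8_t  *qp = q + g * QGROUP;
        for (int i = 0; i < QGROUP / 2; i++) {
            int k0 = (kp[i] & 0x0F) - 8;   /* unpack low nibble  */
            int k1 = (kp[i] >> 4)   - 8;   /* unpack high nibble */
            isum += k0 * qp[2 * i] + k1 * qp[2 * i + 1];
        }
        acc += (float)isum * k_scales[g] * q_scales[g];
    }
    return acc;
}
```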
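For the multi-threaded matmul ("pthread, 4 threads, NEON optimized") and the remaining "matmul tiling" item, the basic structure is an output-row split across worker threads. A plain-C sketch with hypothetical names; the NEON inner loop and the cache tiling that the TODO refers to are omitted.

```c
/* Minimal pthread row-split matmul sketch: y = W x, with W (rows x cols)
 * divided across NTHREADS workers. Names and the plain-C inner loop are
 * illustrative, not the repo's actual kernel. */
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4

typedef struct {
    const float *W, *x;
    float *y;
    size_t cols, row_begin, row_end;
} matmul_job_t;

static void *matmul_worker(void *arg) {
    matmul_job_t *j = (matmul_job_t *)arg;
    for (size_t r = j->row_begin; r < j->row_end; r++) {
        float acc = 0.0f;
        const float *wrow = j->W + r * j->cols;
        for (size_t c = 0; c < j->cols; c++)
            acc += wrow[c] * j->x[c];
        j->y[r] = acc;
    }
    return NULL;
}

static void matmul_mt(const float *W, const float *x, float *y,
                      size_t rows, size_t cols) {
    pthread_t    tid[NTHREADS];
    matmul_job_t job[NTHREADS];
    size_t chunk = (rows + NTHREADS - 1) / NTHREADS;
    for (int t = 0; t < NTHREADS; t++) {
        size_t begin = t * chunk;
        size_t end   = begin + chunk < rows ? begin + chunk : rows;
        job[t] = (matmul_job_t){ W, x, y, cols, begin, end };
        pthread_create(&tid[t], NULL, matmul_worker, &job[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
}
```

Tiling would additionally block the `cols` loop so each thread reuses a slice of `W` and `x` from cache before moving on; that kind of blocking is the usual lever for closing the sort of gap noted against llama.cpp, though the specific plan here is only what the TODO line implies.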
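The removed "Streaming BF16: embed/lm_head mmap'd" entry describes mapping the large embedding and lm_head tensors read-only instead of materializing FP32 copies. A minimal sketch; the file layout, the function names, and the idea that conversion happens per row at lookup time are assumptions.

```c
/* Sketch of streaming BF16 weights via mmap: map the checkpoint file
 * read-only and convert BF16 values to FP32 on the fly, so embed and
 * lm_head never need a resident FP32 copy. File layout is hypothetical. */
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* BF16 is the top 16 bits of an IEEE-754 float. */
static inline float bf16_to_f32(uint16_t h) {
    union { uint32_t u; float f; } v = { .u = (uint32_t)h << 16 };
    return v.f;
}

/* Map a weight file and return a pointer to its BF16 payload.
 * Usage, e.g. embedding row i of width dim:
 *   for (size_t c = 0; c < dim; c++) out[c] = bf16_to_f32(w[i * dim + c]); */
static const uint16_t *map_bf16(const char *path, size_t *n_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       /* the mapping stays valid after close */
    if (p == MAP_FAILED) return NULL;
    *n_out = (size_t)st.st_size / sizeof(uint16_t);
    return (const uint16_t *)p;
}
```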
