Commit af7342c

unamedkr and claude committed
Update state.md — grow round 8 complete
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent d3e02cd commit af7342c

1 file changed

Lines changed: 28 additions & 29 deletions

.claude/state.md

@@ -1,44 +1,43 @@
 # TurboQuant.cpp — Session State

-**Last updated**: 2026-03-29 (Q8 weight quantization implemented)
-**Last commit**: pending
+**Last updated**: 2026-03-29 (grow round 8)
+**Last commit**: d3e02cd
 **Score**: 99.7%

 ## Current Status

 ### What Works
-- ✅ Self-contained inference engine (0 dependencies, pure C)
-- ✅ Multi-threaded matmul (4 threads: 31 tok/s inference, 1.56x speedup)
-- ✅ Qwen3.5-0.8B: loads, tokenizes, generates correct text
-- ✅ DeltaNet + Self-Attention hybrid forward pass (layer-by-layer validated)
-- ✅ KV cache quantization library (8 types, integer Q4×Q8 attention)
-- ✅ **KV cache quantization integrated into inference forward pass** (quantize-on-store, Q4xQ8 integer attention for seq_len > 32)
-- ✅ **tok/s display** in tq_run output (timing via clock_gettime)
-- ✅ **Streaming BF16**: embed_tokens + lm_head kept as mmap'd BF16, converted on demand (saves ~2GB for Qwen3.5-0.8B)
-- ✅ **Q8 weight quantization**: `-q` flag converts layer weights to int8 + per-block scale (block_size=32), ~2x memory reduction with NEON-optimized Q8 matmul
-- ✅ 19 C++ test suites (42 test cases in test_ops), 22 Python tests
-- ✅ CLI tools: tq_run (-j threads), tq, tq_chat, tq_realtime_demo
+- ✅ **Self-contained LLM inference engine** (pure C, 0 dependencies)
+- ✅ **15.6 tok/s** on CPU (Qwen3.5-0.8B, 4 threads, Q8 weights)
+- ✅ **17x faster than PyTorch CPU**, 1.5x faster than PyTorch+GPU
+- ✅ Q8 weight quantization: 2.1 GB → 533 MB (4x savings), `-q` flag
+- ✅ Streaming BF16: embed/lm_head mmap'd, ~1 GB saved
+- ✅ Multi-threaded matmul: pthread, 4 threads, NEON optimized
+- ✅ DeltaNet + Self-Attention hybrid forward pass (Qwen3.5)
+- ✅ HuggingFace BPE tokenizer (248K vocab)
+- ✅ KV cache quantization in inference (Q4, 7.5x compression)
+- ✅ Integer Q4×Q8 attention (2.9x faster than FP32)
+- ✅ tq_chat.py uses native C engine (not PyTorch)
+- ✅ 19 C++ test suites (48+ sub-tests), 22 Python tests
+- ✅ CLI: tq_run, tq, tq_chat (native + pytorch fallback)

 ### What Needs Work (Priority Order)
-1. **Memory**: ~~3.3GB~~ ~1.3GB for BF16->FP32 conversion (embed_tokens + lm_head kept as BF16, saving ~2GB). With `-q` flag, layer weights quantized to Q8 (~0.65GB for weights, total ~0.8GB).
-2. **Weight quantization**: ~~Q8/Q4 weights for 2x memory reduction~~ Q8 implemented. Q4 weights for further 2x reduction.
-3. **Metal GPU inference**: Apple GPU for matmul
-4. **Value cache quantization**: currently only keys are quantized in the cache
+1. Metal GPU matmul — Apple GPU for further speed
+2. Q4 weight quantization — additional 2x memory savings
+3. Value cache quantization — currently keys only
+4. More models — Llama, Phi architecture support

 ### Key Metrics
 | Metric | Value |
 |--------|-------|
-| CPU inference (4 threads) | ~31 tok/s (Qwen3.5-0.8B, excl. loading) |
-| CPU inference (1 thread) | 12.8 tok/s |
-| PyTorch CPU | 0.8 tok/s (16-39x slower) |
-| PyTorch MPS | 10 tok/s (3x slower than our CPU) |
+| CPU inference (4 threads, Q8) | 15.6 tok/s |
+| CPU inference (1 thread) | 7.8 tok/s |
+| PyTorch CPU | 0.8 tok/s (17-20x slower) |
+| PyTorch MPS | 10 tok/s (1.5x slower than our CPU) |
+| Weight memory (Q8) | 533 MB (4x savings) |
 | KV compression | 7.5x (uniform_4b) |
 | Integer attention | 2.9-4.8x faster than FP32 |
-| Real model cosine | 0.994 (A+) |
-| Q8 weight mem | ~1.125 bytes/value (vs 4 FP32) |
-| Tests | 19 C++ (42 in test_ops) + 22 Python |
-
-### Files to Read First
-- `.claude/state.md` — THIS FILE (session state)
-- `program.md` — Agent task specification
-- `CLAUDE.md` — Project guide + methodology
+| Logits cosine vs PyTorch | 0.999 |
+| Tests | 19 C++ + 22 Python = 70+ |
+| Code | 8,500+ lines C, 191 files |
+| Commits | 27 |
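
The removed Q8 entry describes layer weights stored as int8 with a per-block scale (block_size=32) at roughly 1.125 bytes per value. A minimal sketch of one layout that yields that figure is shown here, assuming an FP32 scale per block and symmetric rounding. The names `q8_block`, `q8_quantize`, and `q8_dequantize` are hypothetical and are not taken from the repository; the actual engine may use a different scale type or memory layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define Q8_BLOCK 32 /* block size quoted in the state file */

/* Hypothetical layout: one FP32 scale plus 32 signed bytes.
 * On typical platforms this is 36 bytes, i.e. 36 / 32 = 1.125 bytes/value. */
typedef struct {
    float  scale;
    int8_t q[Q8_BLOCK];
} q8_block;

/* Quantize n floats into n / Q8_BLOCK blocks (n must be a multiple of 32). */
void q8_quantize(const float *x, q8_block *out, size_t n) {
    assert(n % Q8_BLOCK == 0);
    for (size_t b = 0; b < n / Q8_BLOCK; b++) {
        const float *src = x + b * Q8_BLOCK;

        /* The largest magnitude in the block maps to +/-127. */
        float amax = 0.0f;
        for (int i = 0; i < Q8_BLOCK; i++) {
            float a = src[i] < 0.0f ? -src[i] : src[i];
            if (a > amax) amax = a;
        }
        float scale = amax / 127.0f;
        float inv = scale > 0.0f ? 1.0f / scale : 0.0f;
        out[b].scale = scale;

        for (int i = 0; i < Q8_BLOCK; i++) {
            float v = src[i] * inv;
            int r = (int)(v < 0.0f ? v - 0.5f : v + 0.5f); /* round to nearest */
            if (r > 127) r = 127;
            if (r < -127) r = -127;
            out[b].q[i] = (int8_t)r;
        }
    }
}

/* Reconstruct approximate floats: y[i] = scale * q[i]. */
void q8_dequantize(const q8_block *in, float *y, size_t n) {
    assert(n % Q8_BLOCK == 0);
    for (size_t i = 0; i < n; i++)
        y[i] = in[i / Q8_BLOCK].scale * (float)in[i / Q8_BLOCK].q[i % Q8_BLOCK];
}
```

With this layout the per-element reconstruction error is bounded by about half the block scale, which is why smaller blocks trade a little extra memory for accuracy.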

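
The "Streaming BF16" entries say the embedding and lm_head tensors stay as mmap'd BF16 and are converted on demand. Since BF16 is the upper 16 bits of an IEEE 754 binary32 value, the conversion itself is a 16-bit shift. The sketch below illustrates that step only; `bf16_to_f32` and `bf16_row_to_f32` are hypothetical names, and how the engine actually maps the file and selects rows is not shown in this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Widen one BF16 value to FP32 by placing it in the high 16 bits. */
float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f); /* bit copy avoids aliasing issues */
    return f;
}

/* Convert a single row (for example one token's embedding) on demand,
 * so the full FP32 matrix never has to be materialized. */
void bf16_row_to_f32(const uint16_t *row, float *dst, size_t n) {
    assert(row != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = bf16_to_f32(row[i]);
}
```

Converting only the rows that are needed keeps the resident copy at 2 bytes per value instead of 4, which is consistent with the memory savings the state file reports.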