
![TurboQuant Hero](docs/assets/hero.png)

- **LLM inference engine in pure C. 82 tok/s. Zero dependencies.**
+ **Multi-architecture LLM inference engine in pure C. Zero dependencies.**

- Load → Generate → Done. No Python. No GPU. Just one binary.
+ Qwen3.5 + Gemma 3 supported. Gemma 4 ready.

[![Build](https://img.shields.io/badge/build-passing-brightgreen)]()
[![Tests](https://img.shields.io/badge/tests-70%2B%20pass-brightgreen)]()
[![License](https://img.shields.io/badge/license-Apache%202.0-blue)]()
- [![Speed](https://img.shields.io/badge/82%20tok%2Fs%20(Q4)-Qwen3.5--0.8B-blue)]()
+ [![Qwen3.5](https://img.shields.io/badge/Qwen3.5--0.8B-82%20tok%2Fs-blue)]()
+ [![Gemma3](https://img.shields.io/badge/Gemma3--270M-176%20tok%2Fs-blue)]()
+
+ ### Supported Models
+
+ | Model | Params | Speed (Q4, 6 threads) | Verified |
+ |-------|--------|-----------------------|----------|
+ | **Qwen3.5-0.8B** | 752M | 82 tok/s | logits 0.999 cosine vs PyTorch |
+ | **Gemma 3 270M** | 270M | 176 tok/s | per-layer exact match vs PyTorch |

### llama.cpp vs TurboQuant — Fair Q4 Benchmark

@@ -82,15 +90,14 @@ that uses artificial neural networks to learn complex patterns...
│ tq_run                                           │
│   TQM → mmap load → forward → stream tokens      │
│                                                   │
- │  ┌─── Forward Pass ────────────────────────────┐ │
- │  │ DeltaNet (18 layers, recurrent)             │ │
- │  │ Self-Attention (6 layers, GQA + RoPE)       │ │
- │  │ SwiGLU FFN (all 24 layers)                  │ │
+ │  ┌─── Architecture Dispatch ───────────────────┐ │
+ │  │ Qwen3.5: DeltaNet + Self-Attention + SwiGLU │ │
+ │  │ Gemma 3: Sliding Window + GQA + GeGLU       │ │
│  │ KV Cache: TurboQuant Q4 quantized           │ │
│  │ Attention: Integer Q4×Q8 (2.9x vs FP32)     │ │
│  └─────────────────────────────────────────────┘ │
│                                                   │
- │  Q4 Weights ─── NEON matmul ─── Multi-threaded   │
+ │  Q4 Weights ─── NEON matmul ─── Thread pool      │
└───────────────────────────────────────────────────┘
```

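As an aside on the diagram above: the new "Architecture Dispatch" stage amounts to selecting a per-model forward pass from a tag read out of the model file. A minimal sketch of that idea, using hypothetical names (`tq_arch_t`, `qwen35_forward`, `gemma3_forward`) rather than the engine's actual API:

```c
/* Hypothetical sketch of architecture dispatch; the engine's real type and
 * function names may differ. One entry point routes to the block stack that
 * was selected when the model header was read. */
#include <stddef.h>

typedef enum {
    TQ_ARCH_QWEN35,   /* DeltaNet + self-attention + SwiGLU */
    TQ_ARCH_GEMMA3,   /* sliding-window attention + GQA + GeGLU */
    TQ_ARCH_GEMMA4    /* reserved ("Gemma 4 ready") */
} tq_arch_t;

typedef struct tq_model tq_model;   /* opaque model handle (assumed) */

/* Per-architecture forward passes, assumed to exist elsewhere in the engine. */
float *qwen35_forward(tq_model *m, const int *tokens, int n_tokens);
float *gemma3_forward(tq_model *m, const int *tokens, int n_tokens);

float *tq_forward(tq_model *m, tq_arch_t arch, const int *tokens, int n_tokens) {
    switch (arch) {
    case TQ_ARCH_QWEN35: return qwen35_forward(m, tokens, n_tokens);
    case TQ_ARCH_GEMMA3: return gemma3_forward(m, tokens, n_tokens);
    default:             return NULL;   /* unknown or not-yet-wired architecture */
    }
}
```

Keeping the model-specific code behind one seam like this is presumably what "Gemma 4 ready" refers to: a third case at the dispatch point, with the quantization, KV cache, and tokenizer layers unchanged.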
@@ -106,13 +113,18 @@ that uses artificial neural networks to learn complex patterns...

### Real Model Validated

- Tested on [Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B) — actual inference, not synthetic:
+ Both architectures verified against PyTorch — actual inference, not synthetic:

```
- "1+1=" → "2" ✓
- "The capital of France is" → "Paris" ✓
- "What is deep learning?" → correct paragraph ✓
- Logits cosine vs PyTorch → 0.999 ✓
+ Qwen3.5-0.8B:
+   "1+1=" → "2" ✓
+   "What is deep learning?" → correct paragraph ✓
+   Logits cosine vs PyTorch → 0.999 ✓
+
+ Gemma 3 270M:
+   "1+1=" → "2" ✓
+   Forward pass → per-layer exact match ✓
+   176 tok/s (Q4, 6 threads) ✓
```

---
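For context, the "Logits cosine vs PyTorch → 0.999" line is a plain cosine similarity between the C engine's output logits and reference logits exported from PyTorch over the same prompt. A minimal sketch of that check (the export path and function name here are assumptions, not part of the repo's documented API):

```c
#include <math.h>
#include <stddef.h>

/* Cosine similarity between the engine's logits and a PyTorch reference
 * dump; a value near 1.0 means the two forward passes agree. */
double logits_cosine(const float *ours, const float *reference, size_t n) {
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (size_t i = 0; i < n; i++) {
        dot    += (double)ours[i] * reference[i];
        norm_a += (double)ours[i] * ours[i];
        norm_b += (double)reference[i] * reference[i];
    }
    if (norm_a == 0.0 || norm_b == 0.0) return 0.0;
    return dot / (sqrt(norm_a) * sqrt(norm_b));
}
```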
@@ -175,14 +187,13 @@ scores = tq.attention(query, compressed, seq_len, dim, TurboQuant.UNIFORM_4B)

## Under the Hood

- - **8,500+ lines of C** — complete inference engine, no wrappers
+ - **Multi-architecture** — Qwen3.5 (DeltaNet hybrid) + Gemma 3 (sliding window), Gemma 4 ready
+ - **9,000+ lines of C** — complete inference engine, no wrappers
- **8 quantization types** — Uniform, Mixed Precision, PolarQuant, QJL, TurboQuant
- **TQM format** — pre-quantized binary model, mmap instant load
- - **DeltaNet + Self-Attention** — Qwen3.5 hybrid architecture in pure C
- - **BPE tokenizer** — HuggingFace compatible (248K vocab, embedded in TQM)
+ - **Dual tokenizer** — GPT-2 byte-level BPE + SentencePiece, auto-detected
- **Q4×Q8 integer attention** — ARM vdotq_s32, no float dequantization
- **Thread pool** — zero-overhead dispatch with NEON 2-row batching
- - **Repetition penalty** — prevents degenerate output loops
- **20 test suites, 70+ tests** — ASan + UBSan + TSan clean

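The "Q4×Q8 integer attention" bullet above refers to the ARMv8.2 dot-product instruction exposed as `vdotq_s32`. A sketch of the core block dot product under an assumed layout (32 packed 4-bit weights per 16-byte block, one float scale per block, zero point of 8); TurboQuant's real block format and kernel may differ:

```c
/* Sketch of a Q4xQ8 integer dot product using the ARMv8.2 dot-product
 * extension (compile with -march=armv8.2-a+dotprod). Not the actual kernel. */
#include <arm_neon.h>
#include <stdint.h>

float q4q8_block_dot(const uint8_t *w_packed,   /* 16 bytes = 32 nibbles */
                     float w_scale,
                     const int8_t  *act,        /* 32 int8 activations   */
                     float act_scale) {
    uint8x16_t packed = vld1q_u8(w_packed);

    /* Unpack nibbles to signed int8 with an assumed zero point of 8. */
    int8x16_t lo = vsubq_s8(
        vreinterpretq_s8_u8(vandq_u8(packed, vdupq_n_u8(0x0F))), vdupq_n_s8(8));
    int8x16_t hi = vsubq_s8(
        vreinterpretq_s8_u8(vshrq_n_u8(packed, 4)), vdupq_n_s8(8));

    /* Integer multiply-accumulate: 4-way int8 dot products into int32 lanes,
     * so no per-weight float dequantization happens in the hot loop. */
    int32x4_t acc = vdupq_n_s32(0);
    acc = vdotq_s32(acc, lo, vld1q_s8(act));       /* low nibbles  vs act[0..15]  (assumed order) */
    acc = vdotq_s32(acc, hi, vld1q_s8(act + 16));  /* high nibbles vs act[16..31] (assumed order) */

    /* One float multiply per block instead of one per weight. */
    return (float)vaddvq_s32(acc) * w_scale * act_scale;
}
```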
---
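"mmap instant load" in the TQM bullet means the pre-quantized weights are read in place from the mapped file instead of being parsed and converted at startup. A minimal sketch of that loading pattern, with a hypothetical `tqm_header` struct and "TQM1" magic (the real TQM layout is not documented here):

```c
/* Sketch of mmap-based model loading: map the file read-only, sanity-check a
 * header, and use the quantized weights directly from the mapping. */
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct {
    char     magic[4];      /* assumed magic, e.g. "TQM1" */
    uint32_t n_layers;
    uint32_t vocab_size;
    uint64_t weights_off;   /* byte offset of the Q4 weight blocks */
} tqm_header;

const void *tqm_open(const char *path, size_t *size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size < (off_t)sizeof(tqm_header)) {
        close(fd);
        return NULL;
    }

    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* mapping stays valid after close */
    if (base == MAP_FAILED) return NULL;

    const tqm_header *h = (const tqm_header *)base;
    if (memcmp(h->magic, "TQM1", 4) != 0) {
        munmap(base, (size_t)st.st_size);
        return NULL;
    }

    *size_out = (size_t)st.st_size;
    return base;                    /* weights are used in place from the map */
}
```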
@@ -194,11 +205,12 @@ Day 1 morning: Empty directory
Day 1 noon: KV cache compression library (8 types, A/B tested)
Day 1 evening: Full inference engine (model load → generate)
Day 1 night: 82 tok/s, matching llama.cpp on single-thread
+ Day 2: Gemma 3 support, multi-architecture engine

- Lines of C: 8,500+
+ Lines of C: 9,000+
Test suites: 20 (70+ tests)
- Commits: 55+
- Speed: 0.8 → 82 tok/s (Q4, llama.cpp parity)
+ Architectures: Qwen3.5 + Gemma 3 (Gemma 4 ready)
+ Speed: 82 tok/s (Qwen3.5), 176 tok/s (Gemma 3)
```

---