
Commit 9015a23

unamedkr and claude committed
README: multi-architecture engine with Gemma 3 + Qwen3.5
- Supported models table with verified speeds
- Architecture dispatch diagram updated
- Dual tokenizer, Gemma 4 ready messaging
- Updated stats: 9,000+ lines, 2 architectures

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 6542ec6 commit 9015a23

2 files changed

Lines changed: 50 additions & 30 deletions


README.ko.md

Lines changed: 18 additions & 10 deletions
@@ -2,14 +2,22 @@
 
 ![TurboQuant Hero](docs/assets/hero.png)
 
-**Pure C LLM inference engine. 82 tok/s. No external dependencies.**
+**Multi-architecture LLM inference engine. Pure C. No external dependencies.**
 
-Load → Generate → Done. No Python. No GPU. Just one binary.
+Qwen3.5 + Gemma 3 supported. Gemma 4 ready.
 
 [![Build](https://img.shields.io/badge/build-passing-brightgreen)]()
 [![Tests](https://img.shields.io/badge/tests-70%2B%20pass-brightgreen)]()
 [![License](https://img.shields.io/badge/license-Apache%202.0-blue)]()
-[![Speed](https://img.shields.io/badge/82%20tok%2Fs%20(Q4)-Qwen3.5--0.8B-blue)]()
+[![Qwen3.5](https://img.shields.io/badge/Qwen3.5--0.8B-82%20tok%2Fs-blue)]()
+[![Gemma3](https://img.shields.io/badge/Gemma3--270M-176%20tok%2Fs-blue)]()
+
+### Supported Models
+
+| Model | Params | Speed (Q4, 6T) | Verified |
+|-------|--------|----------------|----------|
+| **Qwen3.5-0.8B** | 752M | 82 tok/s | 0.999 cosine vs PyTorch |
+| **Gemma 3 270M** | 270M | 176 tok/s | per-layer match vs PyTorch |
 
 ### llama.cpp vs TurboQuant — Fair Q4 Benchmark

@@ -174,14 +182,13 @@ scores = tq.attention(query, compressed, seq_len, dim, TurboQuant.UNIFORM_4B)
 
 ## Technical Details
 
-- **8,500+ lines of C** — complete inference engine, not a wrapper
+- **Multi-architecture** — Qwen3.5 (DeltaNet hybrid) + Gemma 3 (sliding window), Gemma 4 ready
+- **9,000+ lines of C** — complete inference engine, not a wrapper
 - **8 quantization types** — Uniform, Mixed Precision, PolarQuant, QJL, TurboQuant
 - **TQM format** — pre-quantized binary, instant mmap loading
-- **DeltaNet + Self-Attention** — Qwen3.5 hybrid architecture in pure C
-- **BPE tokenizer** — HuggingFace compatible (248K vocab, embedded in TQM)
+- **Dual tokenizer** — GPT2 byte-level BPE + SentencePiece auto-detect
 - **Q4×Q8 integer attention** — ARM vdotq_s32, no float dequantization
 - **Thread pool** — zero-overhead dispatch + NEON 2-row batching
-- **Repetition penalty** — prevents degenerate output
 - **20 test suites, 70+ tests** — ASan + UBSan + TSan clean
 
 ---
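
The "TQM format" bullet above leans on mmap for its instant loading claim. Below is a minimal sketch of that general technique; the header struct, magic field, and the `tqm_map` name are hypothetical illustrations, not the actual TQM spec:

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical header; the real TQM layout is not shown in this commit. */
typedef struct {
    uint32_t magic;       /* e.g. "TQM1" */
    uint32_t n_layers;
    uint64_t weights_off; /* byte offset of the quantized weight blobs */
} tqm_header_t;

/* Map the whole file read-only; weights are used in place, so "loading"
 * is O(1) and the first forward pass faults pages in on demand. */
const tqm_header_t *tqm_map(const char *path, size_t *size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }
    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping survives the close */
    if (base == MAP_FAILED) return NULL;
    *size_out = (size_t)st.st_size;
    return (const tqm_header_t *)base;
}
```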
@@ -193,11 +200,12 @@ scores = tq.attention(query, compressed, seq_len, dim, TurboQuant.UNIFORM_4B)
 Day 1 afternoon: KV cache compression library (8 types, A/B tested)
 Day 1 evening: Complete inference engine (model load → text generation)
 Day 1 night: 82 tok/s, llama.cpp single-thread parity
+Day 2: Gemma 3 support, multi-architecture engine
 
-Lines of C: 8,500+
+Lines of C: 9,000+
 Test suites: 20 (70+ tests)
-Commits: 55+
-Speed: 0.8 → 82 tok/s (Q4, llama.cpp parity)
+Architectures: Qwen3.5 + Gemma 3 (Gemma 4 ready)
+Speed: 82 tok/s (Qwen3.5), 176 tok/s (Gemma3)
 ```
 
 ---

README.md

Lines changed: 32 additions & 20 deletions
@@ -2,14 +2,22 @@
 
 ![TurboQuant Hero](docs/assets/hero.png)
 
-**LLM inference engine in pure C. 82 tok/s. Zero dependencies.**
+**Multi-architecture LLM inference engine in pure C. Zero dependencies.**
 
-Load → Generate → Done. No Python. No GPU. Just one binary.
+Qwen3.5 + Gemma 3 supported. Gemma 4 ready.
 
 [![Build](https://img.shields.io/badge/build-passing-brightgreen)]()
 [![Tests](https://img.shields.io/badge/tests-70%2B%20pass-brightgreen)]()
 [![License](https://img.shields.io/badge/license-Apache%202.0-blue)]()
-[![Speed](https://img.shields.io/badge/82%20tok%2Fs%20(Q4)-Qwen3.5--0.8B-blue)]()
+[![Qwen3.5](https://img.shields.io/badge/Qwen3.5--0.8B-82%20tok%2Fs-blue)]()
+[![Gemma3](https://img.shields.io/badge/Gemma3--270M-176%20tok%2Fs-blue)]()
+
+### Supported Models
+
+| Model | Params | Speed (Q4, 6T) | Verified |
+|-------|--------|----------------|----------|
+| **Qwen3.5-0.8B** | 752M | 82 tok/s | logits 0.999 cosine vs PyTorch |
+| **Gemma 3 270M** | 270M | 176 tok/s | per-layer exact match vs PyTorch |
 
 ### llama.cpp vs TurboQuant — Fair Q4 Benchmark

@@ -82,15 +90,14 @@ that uses artificial neural networks to learn complex patterns...
 │ tq_run │
 │ TQM → mmap load → forward → stream tokens │
 │ │
-│ ┌─── Forward Pass ────────────────────────────┐ │
-│ │ DeltaNet (18 layers, recurrent) │ │
-│ │ Self-Attention (6 layers, GQA + RoPE) │ │
-│ │ SwiGLU FFN (all 24 layers) │ │
+│ ┌─── Architecture Dispatch ─────────────────┐ │
+│ │ Qwen3.5: DeltaNet + Self-Attention + SwiGLU│ │
+│ │ Gemma 3: Sliding Window + GQA + GeGLU │ │
 │ │ KV Cache: TurboQuant Q4 quantized │ │
 │ │ Attention: Integer Q4×Q8 (2.9x vs FP32) │ │
 │ └─────────────────────────────────────────────┘ │
 │ │
-│ Q4 Weights ─── NEON matmul ─── Multi-threaded
+│ Q4 Weights ─── NEON matmul ─── Thread pool
 └─────────────────────────────────────────────────────┘
 ```

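The new "Architecture Dispatch" box implies a single forward loop that branches per model family. A minimal sketch of that pattern follows, assuming hypothetical identifiers throughout (`tq_arch_t` and the per-layer kernels are declared but not defined here); this is not TurboQuant's actual API:

```c
#include <stdbool.h>

/* Hypothetical types and kernel declarations; bodies elided in this sketch. */
typedef enum { TQ_ARCH_QWEN35, TQ_ARCH_GEMMA3 } tq_arch_t;
typedef struct { tq_arch_t arch; int n_layers; } tq_model_t;

bool is_deltanet_layer(const tq_model_t *m, int layer); /* e.g. 18 of 24 layers */
void deltanet_step(const tq_model_t *m, int layer, float *x);
void self_attention(const tq_model_t *m, int layer, float *x); /* GQA + RoPE */
void sliding_window_attention(const tq_model_t *m, int layer, float *x);
void swiglu_ffn(const tq_model_t *m, int layer, float *x);
void geglu_ffn(const tq_model_t *m, int layer, float *x);

/* One dispatch per layer: Qwen3.5 mixes DeltaNet and self-attention layers,
 * Gemma 3 uses sliding-window attention; each family keeps its own FFN. */
static void layer_forward(const tq_model_t *m, int layer, float *x) {
    switch (m->arch) {
    case TQ_ARCH_QWEN35:
        if (is_deltanet_layer(m, layer)) deltanet_step(m, layer, x);
        else                             self_attention(m, layer, x);
        swiglu_ffn(m, layer, x);
        break;
    case TQ_ARCH_GEMMA3:
        sliding_window_attention(m, layer, x);
        geglu_ffn(m, layer, x);
        break;
    }
}
```

Under this shape, "Gemma 4 ready" would presumably mean adding one more enum case and its kernels without touching the dispatch loop.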
@@ -106,13 +113,18 @@ that uses artificial neural networks to learn complex patterns...
 
 ### Real Model Validated
 
-Tested on [Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B) — actual inference, not synthetic:
+Both architectures verified against PyTorch — actual inference, not synthetic:
 
 ```
-"1+1=" → "2" ✓
-"The capital of France is" → "Paris" ✓
-"What is deep learning?" → correct paragraph ✓
-Logits cosine vs PyTorch → 0.999 ✓
+Qwen3.5-0.8B:
+  "1+1=" → "2" ✓
+  "What is deep learning?" → correct paragraph ✓
+  Logits cosine vs PyTorch → 0.999 ✓
+
+Gemma 3 270M:
+  "1+1=" → "2" ✓
+  Forward pass → per-layer exact match ✓
+  176 tok/s (Q4, 6 threads) ✓
 ```
 
 ---
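
The "Logits cosine vs PyTorch → 0.999" line is a standard parity check: compare the C engine's final logits against a PyTorch reference over the full vocabulary. A minimal sketch of such a metric; the function name and calling convention are illustrative only, not from the repo:

```c
#include <math.h>
#include <stddef.h>

/* Cosine similarity between two logits vectors; link with -lm. */
double logits_cosine(const float *a, const float *b, size_t n) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (size_t i = 0; i < n; i++) {
        dot += (double)a[i] * b[i];
        na  += (double)a[i] * a[i];
        nb  += (double)b[i] * b[i];
    }
    return dot / (sqrt(na) * sqrt(nb)); /* ≈ 0.999 reported for Qwen3.5-0.8B */
}
```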
@@ -175,14 +187,13 @@ scores = tq.attention(query, compressed, seq_len, dim, TurboQuant.UNIFORM_4B)
 
 ## Under the Hood
 
-- **8,500+ lines of C** — complete inference engine, no wrappers
+- **Multi-architecture** — Qwen3.5 (DeltaNet hybrid) + Gemma 3 (sliding window), Gemma 4 ready
+- **9,000+ lines of C** — complete inference engine, no wrappers
 - **8 quantization types** — Uniform, Mixed Precision, PolarQuant, QJL, TurboQuant
 - **TQM format** — pre-quantized binary model, mmap instant load
-- **DeltaNet + Self-Attention** — Qwen3.5 hybrid architecture in pure C
-- **BPE tokenizer** — HuggingFace compatible (248K vocab, embedded in TQM)
+- **Dual tokenizer** — GPT2 byte-level BPE + SentencePiece auto-detect
 - **Q4×Q8 integer attention** — ARM vdotq_s32, no float dequantization
 - **Thread pool** — zero-overhead dispatch with NEON 2-row batching
-- **Repetition penalty** — prevents degenerate output loops
 - **20 test suites, 70+ tests** — ASan + UBSan + TSan clean
 
 ---
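
On the "Q4×Q8 integer attention" bullet: `vdotq_s32` is the ARMv8.2-A dot-product intrinsic that accumulates four int8×int8 products into each int32 lane, so a quantized dot product can stay in integers until one final per-block scale. Below is a sketch of that pattern under an assumed block layout (32 elements, two Q4 nibbles per byte, one scale per block); TurboQuant's real kernel layout is not shown in this diff:

```c
#include <arm_neon.h>
#include <stdint.h>

/* Requires the dotprod extension: compile with -march=armv8.2-a+dotprod. */
static inline float q4q8_dot_block(const uint8_t *q4, float q4_scale,
                                   const int8_t *q8, float q8_scale) {
    uint8x16_t packed = vld1q_u8(q4); /* 16 bytes = 32 Q4 nibbles */
    /* Split low/high nibbles and recenter [0,15] -> [-8,7]. */
    int8x16_t lo = vsubq_s8(
        vreinterpretq_s8_u8(vandq_u8(packed, vdupq_n_u8(0x0F))), vdupq_n_s8(8));
    int8x16_t hi = vsubq_s8(
        vreinterpretq_s8_u8(vshrq_n_u8(packed, 4)), vdupq_n_s8(8));
    int32x4_t acc = vdupq_n_s32(0);
    acc = vdotq_s32(acc, lo, vld1q_s8(q8));       /* elements 0..15  */
    acc = vdotq_s32(acc, hi, vld1q_s8(q8 + 16));  /* elements 16..31 */
    /* One float multiply per block; no per-element dequantization. */
    return (float)vaddvq_s32(acc) * q4_scale * q8_scale;
}
```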
@@ -194,11 +205,12 @@ Day 1 morning: Empty directory
 Day 1 noon: KV cache compression library (8 types, A/B tested)
 Day 1 evening: Full inference engine (model load → generate)
 Day 1 night: 82 tok/s, matching llama.cpp on single-thread
+Day 2: Gemma 3 support, multi-architecture engine
 
-Lines of C: 8,500+
+Lines of C: 9,000+
 Test suites: 20 (70+ tests)
-Commits: 55+
-Speed: 0.8 → 82 tok/s (Q4, llama.cpp parity)
+Architectures: Qwen3.5 + Gemma 3 (Gemma 4 ready)
+Speed: 82 tok/s (Qwen3.5), 176 tok/s (Gemma3)
 ```
 
 ---
