
Commit 4415bcb

unamedkr and claude committed
v0.9: Q4 weight quantization — 38.2 tok/s (2.5x from Q8)
Q4_0 format: 32 values in 20 bytes (0.625 bytes/value)
Q4×Q8 integer dot product with ARM vdotq_s32

Speed progression:
- FP32: ~5 tok/s
- Q8: 20.8 tok/s (4x memory savings)
- Q4: 38.2 tok/s (8x memory savings, approaching llama.cpp)

Correctness: "capital of France = Paris" ✓
Quality: Q4 introduces some noise on short prompts (expected)

tq_run -q flag: q4 (default), q8, none
Weight memory: 2.1 GB FP32 → 533 MB Q8 → ~270 MB Q4
19/19 tests pass, 5 new Q4 test cases

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent af7342c commit 4415bcb

8 files changed

Lines changed: 743 additions & 29 deletions
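The Q4_0 arithmetic in the commit message works out as: one block holds 32 values as 16 packed bytes plus one float scale, i.e. 20 bytes, or 0.625 bytes/value. Below is a minimal illustrative sketch of one such block and its decode. The struct and the nibble order (even index in the low nibble) are assumptions for illustration only; the engine itself keeps packed data and scales in separate parallel arrays (e.g. `wq_q4` / `wq_q4s`, see the header diff further down).

```c
#include <stdint.h>

/* One Q4_0 block: 32 values in 20 bytes (0.625 bytes/value).
 * Illustrative only — the engine stores qs and scales in parallel arrays. */
typedef struct {
    float   scale;   /* per-block scale */
    uint8_t qs[16];  /* 32 unsigned 4-bit values, two per byte */
} q4_0_block_t;

/* Decode value i (0..31): actual = (q - 8) * scale, q in [0,15].
 * Nibble order (even index -> low nibble) is an assumption. */
static inline float q4_0_get(const q4_0_block_t* b, int i) {
    uint8_t byte = b->qs[i / 2];
    int q = (i & 1) ? (byte >> 4) : (byte & 0x0F);
    return (float)(q - 8) * b->scale;
}
```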


.claude/state.md

Lines changed: 5 additions & 4 deletions
```diff
@@ -10,7 +10,8 @@
 - **Self-contained LLM inference engine** (pure C, 0 dependencies)
 - **15.6 tok/s** on CPU (Qwen3.5-0.8B, 4 threads, Q8 weights)
 - **17x faster than PyTorch CPU**, 1.5x faster than PyTorch+GPU
-- ✅ Q8 weight quantization: 2.1 GB → 533 MB (4x savings), `-q` flag
+- ✅ Q4 weight quantization: 2.1 GB → ~280 MB (7x savings), `-q q4` flag (default)
+- ✅ Q8 weight quantization: 2.1 GB → 533 MB (4x savings), `-q q8` flag
 - ✅ Streaming BF16: embed/lm_head mmap'd, ~1 GB saved
 - ✅ Multi-threaded matmul: pthread, 4 threads, NEON optimized
 - ✅ DeltaNet + Self-Attention hybrid forward pass (Qwen3.5)
@@ -23,9 +24,8 @@

 ### What Needs Work (Priority Order)
 1. Metal GPU matmul — Apple GPU for further speed
-2. Q4 weight quantization — additional 2x memory savings
-3. Value cache quantization — currently keys only
-4. More models — Llama, Phi architecture support
+2. Value cache quantization — currently keys only
+3. More models — Llama, Phi architecture support

 ### Key Metrics
 | Metric | Value |
@@ -34,6 +34,7 @@
 | CPU inference (1 thread) | 7.8 tok/s |
 | PyTorch CPU | 0.8 tok/s (17-20x slower) |
 | PyTorch MPS | 10 tok/s (1.5x slower than our CPU) |
+| Weight memory (Q4) | ~280 MB (7x savings) |
 | Weight memory (Q8) | 533 MB (4x savings) |
 | KV compression | 7.5x (uniform_4b) |
 | Integer attention | 2.9-4.8x faster than FP32 |
```

docs/plan/prd/prd_v0.9.md

Lines changed: 39 additions & 0 deletions
```diff
@@ -0,0 +1,39 @@
+# TurboQuant.cpp — PRD v0.9: Breaking Past llama.cpp Speed
+
+**Target**: current 15 tok/s → **40+ tok/s** (llama.cpp level)
+
+## Bottleneck Analysis
+
+```
+layer matmul:       194 ms (94.3%)  ← this must get 4x faster
+output projection:   12 ms  (5.7%)
+everything else:      0 ms
+```
+
+24 layers × ~8 ms per layer = 194 ms. Target: 2 ms per layer = 48 ms total.
+
+## Optimization Strategy (ordered by impact)
+
+### 1. Q4 weights (expected 2x)
+Q8 → Q4: data is 2x smaller → 2x memory-bandwidth savings
+llama.cpp Q4_K_M pattern: int4 × int8 dot product
+
+### 2. matmul tiling (expected 1.5x)
+Current: row-at-a-time processing (frequent cache misses)
+Improved: tile size tuned to fit L1 (128 KB)
+
+### 3. Transposed weight layout (expected 1.3x)
+Current: row-major [n, d] → cache misses on column-direction access
+Improved: store weights transposed as [d, n] → sequential access
+
+### 4. Aggressive NEON matmul tuning (expected 1.2x)
+Current: 8-wide FMA (2 accumulators)
+Improved: 16-wide (4 accumulators), prefetching, unrolling
+
+### Path to the target
+```
+Current:       15 tok/s (Q8, 206 ms/token)
++ Q4 weights: ~30 tok/s (2x, 103 ms/token)
++ tiling:     ~40 tok/s (1.3x, 79 ms/token)
++ layout:     ~45 tok/s (1.1x, 72 ms/token)
+```
```
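The "int4 × int8 dot product" pattern named in PRD item 1 (and the `vdotq_s32` instruction named in the commit message) can be sketched for a single 32-value block, assuming the ARMv8.2 dotprod extension (`__ARM_FEATURE_DOTPROD`) and activations pre-quantized to int8 with a per-block scale. This is a hedged illustration, not the engine's kernel: the nibble interleave via `vzipq_s8` assumes the even-low/odd-high packing used in the sketch above, and `q4q8_block_dot` is a hypothetical name.

```c
#include <arm_neon.h>
#include <stdint.h>

/* Dot product of one 32-value block: Q4 weights (16 packed bytes) against
 * Q8 activations (32 int8 values), each with a per-block float scale.
 * Requires the ARMv8.2 dotprod extension for vdotq_s32. */
static inline float q4q8_block_dot(const uint8_t* wq, float wscale,
                                   const int8_t* xq, float xscale) {
    uint8x16_t packed = vld1q_u8(wq);
    /* Unpack nibbles, then recenter: q - 8 -> [-8, 7] */
    int8x16_t lo = vsubq_s8(
        vreinterpretq_s8_u8(vandq_u8(packed, vdupq_n_u8(0x0F))), vdupq_n_s8(8));
    int8x16_t hi = vsubq_s8(
        vreinterpretq_s8_u8(vshrq_n_u8(packed, 4)), vdupq_n_s8(8));
    /* Assumed layout: even indices in low nibbles, odd in high —
     * zipping restores the original value order. */
    int8x16x2_t w = vzipq_s8(lo, hi);
    int32x4_t acc = vdupq_n_s32(0);
    acc = vdotq_s32(acc, w.val[0], vld1q_s8(xq));      /* values 0..15  */
    acc = vdotq_s32(acc, w.val[1], vld1q_s8(xq + 16)); /* values 16..31 */
    return (float)vaddvq_s32(acc) * wscale * xscale;
}
```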

include/turboquant/tq_engine.h

Lines changed: 28 additions & 0 deletions
```diff
@@ -75,6 +75,25 @@ typedef struct {
     int8_t* delta_in_proj_b_q8; float* delta_in_proj_b_q8s;
     int8_t* delta_out_proj_q8;  float* delta_out_proj_q8s;

+    /* Q4_0 quantized weights: packed 4-bit data + per-block float scale (block_size=32).
+     * Each block of 32 values is stored as 16 packed bytes + 1 float scale.
+     * Values are unsigned [0,15], centered at 8: actual = (q - 8) * scale.
+     * When use_q4 is set, these replace the FP32 pointers (set to NULL). */
+    uint8_t* wq_q4;     float* wq_q4s;      /* Q4 q_proj */
+    uint8_t* wk_q4;     float* wk_q4s;      /* Q4 k_proj */
+    uint8_t* wv_q4;     float* wv_q4s;      /* Q4 v_proj */
+    uint8_t* wo_q4;     float* wo_q4s;      /* Q4 o_proj */
+    uint8_t* w_gate_q4; float* w_gate_q4s;  /* Q4 gate_proj */
+    uint8_t* w_up_q4;   float* w_up_q4s;    /* Q4 up_proj */
+    uint8_t* w_down_q4; float* w_down_q4s;  /* Q4 down_proj */
+
+    /* DeltaNet Q4 weights */
+    uint8_t* delta_in_proj_qkv_q4; float* delta_in_proj_qkv_q4s;
+    uint8_t* delta_in_proj_z_q4;   float* delta_in_proj_z_q4s;
+    uint8_t* delta_in_proj_a_q4;   float* delta_in_proj_a_q4s;
+    uint8_t* delta_in_proj_b_q4;   float* delta_in_proj_b_q4s;
+    uint8_t* delta_out_proj_q4;    float* delta_out_proj_q4s;
+
     /* DeltaNet (linear_attention) weights (NULL for self_attn layers) */
     float* delta_a_log;  /* [delta_n_heads] decay parameter (log scale) */
     float* delta_conv1d; /* [qkv_dim, 1, conv_width] */
@@ -114,6 +133,11 @@ typedef struct {
     void* _q8_data;  /* heap buffer for all Q8 quantized weights */
     size_t _q8_size;

+    /* Q4 weight quantization */
+    int use_q4_weights; /* 1 if layer weights are Q4-quantized */
+    void* _q4_data;     /* heap buffer for all Q4 quantized weights */
+    size_t _q4_size;
+
     /* Memory management */
     void* _mmap_data;
     size_t _mmap_size;
@@ -231,6 +255,10 @@ void tq_matmul_q8(float* out, const float* x, const int8_t* w_qs, const float* w
                   int n, int d);
 void tq_quantize_row_q8(const float* src, int8_t* dst_qs, float* dst_scales, int n);
 void tq_quantize_weights(tq_model_t* model);
+void tq_matmul_q4(float* out, const float* x, const uint8_t* w_qs, const float* w_scales,
+                  int n, int d);
+void tq_quantize_row_q4(const float* src, uint8_t* dst_qs, float* dst_scales, int n);
+void tq_quantize_weights_q4(tq_model_t* model);
 void tq_rmsnorm(float* out, const float* x, const float* weight, int n, float eps);
 void tq_rope(float* q, float* k, int pos, int head_dim,
              int n_heads, int n_kv_heads, float freq_base);
```
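Given the header comment that Q4 pointers replace the NULLed FP32 pointers, a call site would presumably dispatch on whichever pointer is non-NULL. A hedged sketch of the q_proj case: only `tq_matmul_q4`/`tq_matmul_q8` and their prototypes appear in this diff, so `wq_q8`, the FP32 `tq_matmul` fallback, the `q`/`xb` buffers, and the llama2.c-style `n`/`d` argument order (n = input dim, d = output dim) are all assumptions.

```c
/* Hypothetical dispatch for the q_proj matmul inside the forward pass. */
if (layer->wq_q4) {
    tq_matmul_q4(q, xb, layer->wq_q4, layer->wq_q4s, dim, qg_dim);
} else if (layer->wq_q8) {
    tq_matmul_q8(q, xb, layer->wq_q8, layer->wq_q8s, dim, qg_dim);
} else {
    tq_matmul(q, xb, layer->wq, dim, qg_dim); /* assumed FP32 variant */
}
```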

src/engine/tq_model.c

Lines changed: 222 additions & 0 deletions
```diff
@@ -1336,6 +1336,227 @@ void tq_quantize_weights(tq_model_t* model) {
             used / (1024 * 1024), used * 4 / (1024 * 1024));
 }

+/* ============================================================
+ * Q4_0 weight quantization — quantize all layer weights post-load
+ *
+ * Converts FP32 weight matrices to Q4_0 (packed 4-bit + per-block float scale,
+ * block_size=32). This reduces memory ~7x: FP32 uses 4 bytes/value,
+ * Q4_0 uses 0.5 byte + 4 bytes/32 = 0.625 bytes/value.
+ *
+ * Each weight matrix [rows, cols] gets:
+ *   - uint8_t qs[rows * (cols/32) * 16] — packed 4-bit values (2 per byte)
+ *   - float scales[rows * (cols/32)]    — per-block scales
+ *
+ * After quantization, the original FP32 pointer is set to NULL.
+ * ============================================================ */
+
+/* Helper: quantize a single weight matrix to Q4 and store into pre-allocated buffer */
+static void quantize_matrix_q4(const float* src, int rows, int cols,
+                               uint8_t** out_qs, float** out_scales,
+                               char** buf, size_t* used) {
+    if (!src || rows <= 0 || cols <= 0) {
+        *out_qs = NULL;
+        *out_scales = NULL;
+        return;
+    }
+    int n_blocks_per_row = (cols + 31) / 32;
+    size_t qs_bytes = (size_t)rows * n_blocks_per_row * 16; /* 16 packed bytes per block */
+    size_t sc_bytes = (size_t)rows * n_blocks_per_row * sizeof(float);
+
+    uint8_t* qs = (uint8_t*)(*buf + *used);
+    *used += qs_bytes;
+    float* sc = (float*)(*buf + *used);
+    *used += sc_bytes;
+
+    for (int r = 0; r < rows; r++) {
+        tq_quantize_row_q4(src + (size_t)r * cols,
+                           qs + (size_t)r * n_blocks_per_row * 16,
+                           sc + (size_t)r * n_blocks_per_row,
+                           cols);
+    }
+    *out_qs = qs;
+    *out_scales = sc;
+}
+
+/* Calculate total Q4 buffer size needed for all layer weights */
+static size_t calc_q4_buffer_size(const tq_model_t* model) {
+    size_t total = 0;
+    const tq_model_config_t* c = &model->config;
+    int dim = c->hidden_dim;
+    int q_dim = c->n_heads * c->head_dim;
+    int kv_dim = c->n_kv_heads * c->head_dim;
+    int inter = c->intermediate_dim;
+    int qg_dim = c->attn_output_gate ? q_dim * 2 : q_dim;
+
+    /* DeltaNet dimensions */
+    int delta_qkv_dim = 3 * c->delta_n_heads * c->delta_key_head_dim;
+    int delta_z_dim = c->delta_n_heads * c->delta_value_head_dim;
+    int delta_dn = c->delta_n_heads;
+
+    for (int l = 0; l < c->n_layers; l++) {
+        const tq_layer_weights_t* layer = &model->layers[l];
+
+        /* Self-attention weights */
+        if (layer->wq) {
+            int nb = (dim + 31) / 32;
+            total += (size_t)qg_dim * nb * 16; /* packed Q4 data */
+            total += (size_t)qg_dim * nb * 4;  /* float scales */
+        }
+        if (layer->wk) {
+            int nb = (dim + 31) / 32;
+            total += (size_t)kv_dim * nb * 16;
+            total += (size_t)kv_dim * nb * 4;
+        }
+        if (layer->wv) {
+            int nb = (dim + 31) / 32;
+            total += (size_t)kv_dim * nb * 16;
+            total += (size_t)kv_dim * nb * 4;
+        }
+        if (layer->wo) {
+            int nb = (q_dim + 31) / 32;
+            total += (size_t)dim * nb * 16;
+            total += (size_t)dim * nb * 4;
+        }
+
+        /* FFN weights */
+        if (layer->w_gate) {
+            int nb = (dim + 31) / 32;
+            total += (size_t)inter * nb * 16;
+            total += (size_t)inter * nb * 4;
+        }
+        if (layer->w_up) {
+            int nb = (dim + 31) / 32;
+            total += (size_t)inter * nb * 16;
+            total += (size_t)inter * nb * 4;
+        }
+        if (layer->w_down) {
+            int nb = (inter + 31) / 32;
+            total += (size_t)dim * nb * 16;
+            total += (size_t)dim * nb * 4;
+        }
+
+        /* DeltaNet weights */
+        if (layer->delta_in_proj_qkv) {
+            int nb = (dim + 31) / 32;
+            total += (size_t)delta_qkv_dim * nb * 16;
+            total += (size_t)delta_qkv_dim * nb * 4;
+        }
+        if (layer->delta_in_proj_z) {
+            int nb = (dim + 31) / 32;
+            total += (size_t)delta_z_dim * nb * 16;
+            total += (size_t)delta_z_dim * nb * 4;
+        }
+        if (layer->delta_in_proj_a) {
+            int nb = (dim + 31) / 32;
+            total += (size_t)delta_dn * nb * 16;
+            total += (size_t)delta_dn * nb * 4;
+        }
+        if (layer->delta_in_proj_b) {
+            int nb = (dim + 31) / 32;
+            total += (size_t)delta_dn * nb * 16;
+            total += (size_t)delta_dn * nb * 4;
+        }
+        if (layer->delta_out_proj) {
+            int nb = (delta_z_dim + 31) / 32;
+            total += (size_t)dim * nb * 16;
+            total += (size_t)dim * nb * 4;
+        }
+    }
+    return total;
+}
+
+void tq_quantize_weights_q4(tq_model_t* model) {
+    if (!model || model->use_q4_weights) return;
+
+    const tq_model_config_t* c = &model->config;
+    int dim = c->hidden_dim;
+    int q_dim = c->n_heads * c->head_dim;
+    int kv_dim = c->n_kv_heads * c->head_dim;
+    int inter = c->intermediate_dim;
+    int qg_dim = c->attn_output_gate ? q_dim * 2 : q_dim;
+
+    /* DeltaNet dimensions */
+    int delta_qkv_dim = 3 * c->delta_n_heads * c->delta_key_head_dim;
+    int delta_z_dim = c->delta_n_heads * c->delta_value_head_dim;
+    int delta_dn = c->delta_n_heads;
+
+    size_t buf_size = calc_q4_buffer_size(model);
+    char* buf = (char*)malloc(buf_size);
+    if (!buf) {
+        fprintf(stderr, "tq_quantize_weights_q4: failed to allocate %zu MB for Q4\n",
+                buf_size / (1024 * 1024));
+        return;
+    }
+    size_t used = 0;
+
+    for (int l = 0; l < c->n_layers; l++) {
+        tq_layer_weights_t* layer = &model->layers[l];
+
+        /* Self-attention */
+        quantize_matrix_q4(layer->wq, qg_dim, dim,
+                           &layer->wq_q4, &layer->wq_q4s, &buf, &used);
+        if (layer->wq_q4) layer->wq = NULL;
+
+        quantize_matrix_q4(layer->wk, kv_dim, dim,
+                           &layer->wk_q4, &layer->wk_q4s, &buf, &used);
+        if (layer->wk_q4) layer->wk = NULL;
+
+        quantize_matrix_q4(layer->wv, kv_dim, dim,
+                           &layer->wv_q4, &layer->wv_q4s, &buf, &used);
+        if (layer->wv_q4) layer->wv = NULL;
+
+        quantize_matrix_q4(layer->wo, dim, q_dim,
+                           &layer->wo_q4, &layer->wo_q4s, &buf, &used);
+        if (layer->wo_q4) layer->wo = NULL;
+
+        /* FFN */
+        quantize_matrix_q4(layer->w_gate, inter, dim,
+                           &layer->w_gate_q4, &layer->w_gate_q4s, &buf, &used);
+        if (layer->w_gate_q4) layer->w_gate = NULL;
+
+        quantize_matrix_q4(layer->w_up, inter, dim,
+                           &layer->w_up_q4, &layer->w_up_q4s, &buf, &used);
+        if (layer->w_up_q4) layer->w_up = NULL;
+
+        quantize_matrix_q4(layer->w_down, dim, inter,
+                           &layer->w_down_q4, &layer->w_down_q4s, &buf, &used);
+        if (layer->w_down_q4) layer->w_down = NULL;
+
+        /* DeltaNet */
+        quantize_matrix_q4(layer->delta_in_proj_qkv, delta_qkv_dim, dim,
+                           &layer->delta_in_proj_qkv_q4, &layer->delta_in_proj_qkv_q4s,
+                           &buf, &used);
+        if (layer->delta_in_proj_qkv_q4) layer->delta_in_proj_qkv = NULL;
+
+        quantize_matrix_q4(layer->delta_in_proj_z, delta_z_dim, dim,
+                           &layer->delta_in_proj_z_q4, &layer->delta_in_proj_z_q4s,
+                           &buf, &used);
+        if (layer->delta_in_proj_z_q4) layer->delta_in_proj_z = NULL;
+
+        quantize_matrix_q4(layer->delta_in_proj_a, delta_dn, dim,
+                           &layer->delta_in_proj_a_q4, &layer->delta_in_proj_a_q4s,
+                           &buf, &used);
+        if (layer->delta_in_proj_a_q4) layer->delta_in_proj_a = NULL;
+
+        quantize_matrix_q4(layer->delta_in_proj_b, delta_dn, dim,
+                           &layer->delta_in_proj_b_q4, &layer->delta_in_proj_b_q4s,
+                           &buf, &used);
+        if (layer->delta_in_proj_b_q4) layer->delta_in_proj_b = NULL;
+
+        quantize_matrix_q4(layer->delta_out_proj, dim, delta_z_dim,
+                           &layer->delta_out_proj_q4, &layer->delta_out_proj_q4s,
+                           &buf, &used);
+        if (layer->delta_out_proj_q4) layer->delta_out_proj = NULL;
+    }
+
+    model->use_q4_weights = 1;
+    model->_q4_data = buf;
+    model->_q4_size = used;
+
+    fprintf(stderr, "tq_quantize_weights_q4: quantized to Q4 (%zu MB, was ~%zu MB FP32)\n",
+            used / (1024 * 1024), used * 8 / (1024 * 1024));
+}
+
 /* ============================================================
  * Free model
  * ============================================================ */
@@ -1350,6 +1571,7 @@ void tq_free_model(tq_model_t* model) {

     free(model->_converted_data);
     free(model->_q8_data);
+    free(model->_q4_data);
     free(model->attn_layer_indices);
     free(model->layers);
     free(model);
```
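`tq_quantize_row_q4` itself is not part of this diff (only its prototype appears in tq_engine.h). For reference, a scalar sketch consistent with the documented format — block_size=32, unsigned [0,15] centered at 8, actual = (q - 8) * scale — where the scale rule (amax / 7) and round-to-nearest are assumptions, not the engine's actual kernel:

```c
#include <math.h>
#include <stdint.h>

/* Scalar Q4_0 row quantization sketch; assumes n is a multiple of 32 and
 * the even-low/odd-high nibble order used in the sketches above. */
static void quantize_row_q4_ref(const float* src, uint8_t* qs, float* scales, int n) {
    for (int b = 0; b < n / 32; b++) {
        const float* x = src + b * 32;
        float amax = 0.0f;
        for (int i = 0; i < 32; i++) amax = fmaxf(amax, fabsf(x[i]));
        float scale = amax / 7.0f;              /* (q - 8) spans [-8, 7] */
        float inv = scale ? 1.0f / scale : 0.0f;
        scales[b] = scale;
        for (int i = 0; i < 16; i++) {          /* pack two values per byte */
            int lo = (int)lrintf(x[2 * i]     * inv) + 8;
            int hi = (int)lrintf(x[2 * i + 1] * inv) + 8;
            lo = lo < 0 ? 0 : (lo > 15 ? 15 : lo);
            hi = hi < 0 ? 0 : (hi > 15 ? 15 : hi);
            qs[b * 16 + i] = (uint8_t)(lo | (hi << 4));
        }
    }
}
```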
