-**Extreme KV cache compression for LLM inference. Zero dependencies. Pure C.**
+**LLM inference engine with extreme KV cache compression. Zero dependencies. Pure C.**

-Run **3x longer contexts** on the same hardware — or serve **3x more users** at the same cost.
+**14 tok/s on CPU** — 17x faster than PyTorch on CPU and 1.4x faster than PyTorch on the Apple GPU (MPS).

---

## Results at a Glance

-| | FP16 (Baseline) | TurboQuant |
+| | PyTorch | TurboQuant.cpp |
|---|---|---|
+| **Inference Speed (CPU)** | 0.8 tok/s | **14 tok/s** (17x faster) |
+| **Inference Speed (GPU)** | 10 tok/s (MPS) | **14 tok/s (CPU only!)** |
| **KV Cache Size** | 7.00 GB | **0.93 GB** (87% saved) |
-| **Attention Speed** | 1.0x | **2.9-4.8x faster** |
-| **Max Context (24GB GPU)** | 164K tokens | **540K tokens** |
-| **Quality (Cosine)** | 1.000 | **0.994** (A+) |
+| **Dependencies** | PyTorch + transformers | **0** (pure C) |
+| **Quality (cosine)** | 1.000 | **0.994** (A+) |

-> Measured on Llama-3.2-3B @ 64K context. Validated on [Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B) real inference.
+> Measured on [Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B). The CPU-only engine is faster than PyTorch running on the Apple GPU (MPS).

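The two non-speed rows in the table (cache size and quality) can be illustrated with a short, self-contained C sketch. This is an illustration only, not TurboQuant.cpp's actual code or storage format: it assumes 2-bit codes with one FP16-sized scale per group of 128 values (about 2.1 bits per value, roughly the 7.5x reduction shown above), packs them into plain byte arrays, and checks reconstruction quality with the same kind of cosine-similarity score the Quality row uses, on synthetic data.

```c
/*
 * Illustrative sketch only: per-group low-bit quantization of a KV-like
 * buffer plus a cosine-similarity quality check. Group size, bit width,
 * packing, and the synthetic data are assumptions for this example and
 * are not TurboQuant.cpp's actual format.
 */
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define GROUP 128   /* values per quantization group (assumed)              */
#define BITS  2     /* bits per code: 16 -> ~2.125 bits incl. the scale     */

/* Quantize one group of floats to 2-bit codes plus one shared scale. */
static void quant_group(const float *x, uint8_t *codes, float *scale)
{
    float amax = 1e-12f;
    for (int i = 0; i < GROUP; i++)
        if (fabsf(x[i]) > amax) amax = fabsf(x[i]);
    *scale = amax / 1.5f;                  /* symmetric levels -1.5..+1.5   */
    for (int i = 0; i < GROUP; i++) {
        int q = (int)lroundf(x[i] / *scale + 1.5f);   /* map to 0..3        */
        if (q < 0) q = 0;
        if (q > 3) q = 3;
        codes[i / 4] |= (uint8_t)(q << ((i % 4) * BITS));
    }
}

/* Dequantize one group back to floats. */
static void dequant_group(const uint8_t *codes, float scale, float *y)
{
    for (int i = 0; i < GROUP; i++) {
        int q = (codes[i / 4] >> ((i % 4) * BITS)) & 3;
        y[i] = ((float)q - 1.5f) * scale;
    }
}

/* Cosine similarity between original and reconstructed values. */
static float cosine(const float *a, const float *b, int n)
{
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < n; i++) {
        dot += (double)a[i] * b[i];
        na  += (double)a[i] * a[i];
        nb  += (double)b[i] * b[i];
    }
    return (float)(dot / (sqrt(na) * sqrt(nb) + 1e-12));
}

int main(void)
{
    enum { N = 1 << 16 };                       /* pretend KV slice          */
    float   *kv     = malloc(N * sizeof *kv);
    float   *rec    = malloc(N * sizeof *rec);
    uint8_t *codes  = calloc(N / 4, 1);         /* 2 bits per value, zeroed  */
    float   *scales = malloc(N / GROUP * sizeof *scales);

    for (int i = 0; i < N; i++)                 /* synthetic data for demo   */
        kv[i] = sinf(0.01f * i) * (1.0f + 0.1f * (float)(i % 7));

    for (int g = 0; g < N / GROUP; g++) {
        quant_group(kv + g * GROUP, codes + g * GROUP / 4, &scales[g]);
        dequant_group(codes + g * GROUP / 4, scales[g], rec + g * GROUP);
    }

    double fp16_bytes = N * 2.0;                          /* baseline FP16   */
    double q_bytes    = N / 4.0 + (N / GROUP) * 2.0;      /* codes + scales  */
    printf("compression: %.1fx (%.0f%% saved)\n",
           fp16_bytes / q_bytes, 100.0 * (1.0 - q_bytes / fp16_bytes));
    printf("cosine similarity: %.4f\n", cosine(kv, rec, N));

    free(kv); free(rec); free(codes); free(scales);
    return 0;
}
```

The design point worth noting is the group size: one 2-byte scale per 128 values adds only 0.125 bits per value, so grouped 2-bit storage stays close to the theoretical 8x limit while still adapting to each group's dynamic range.
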
---