The "30-second experience" that was missing:
    from quantcpp import Model
    m = Model.from_pretrained("SmolLM2-135M")
    print(m.ask("What is gravity?"))
Three changes:
1. Model.from_pretrained(name) — auto-downloads a GGUF model from
HuggingFace Hub on first use, caches at ~/.cache/quantcpp/. Uses
only urllib (zero new dependencies). Progress bar included.
Currently ships SmolLM2-135M (~135 MB) as the starter model.
2. quantcpp.download(name) — standalone download function for scripts.
3. Auto chat template wrapping — ask() and generate() now wrap the
user prompt with ChatML-style tokens (<|im_start|>user/assistant)
so instruct models actually produce output instead of 0 tokens.
Previously the raw prompt went to the model without template,
causing instruct-tuned models to generate nothing.
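Item 1 could be implemented roughly as below. This is a hedged sketch only: the helper name `cached_download`, its signature, and the URL argument are illustrative, not the actual quantcpp API; the only parts taken from the commit are urllib-only downloading and the `~/.cache/quantcpp/` cache location.

```python
import os
import urllib.request

def cached_download(name: str, url: str) -> str:
    """Sketch: fetch a GGUF file once, then reuse it from ~/.cache/quantcpp/.
    (Hypothetical helper -- not the real quantcpp implementation.)"""
    cache_dir = os.path.expanduser("~/.cache/quantcpp")
    os.makedirs(cache_dir, exist_ok=True)
    dest = os.path.join(cache_dir, f"{name}.gguf")
    if not os.path.exists(dest):              # first use: download
        tmp = dest + ".part"                  # never leave a half-written cache entry
        urllib.request.urlretrieve(url, tmp)  # stdlib only, zero new dependencies
        os.replace(tmp, dest)                 # atomic rename into the cache
    return dest                               # every later call is a cache hit
```

The `.part` + `os.replace` dance is one common way to keep an interrupted download from poisoning the cache; the actual project may do this differently.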
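The template wrapping in item 3 amounts to something like the following. The function name is illustrative; the token strings are the ChatML markers the commit names, and instruct-tuned models that were trained on this turn structure need it to know a reply is expected:

```python
def wrap_chatml(user_prompt: str) -> str:
    """Sketch of ChatML-style wrapping: mark the user turn, then open an
    assistant turn so the model starts generating a reply."""
    return (
        "<|im_start|>user\n"
        f"{user_prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
```

Feeding the model the raw prompt instead of this wrapped form is why instruct models previously emitted 0 tokens: without an open assistant turn, the likeliest next token is end-of-turn.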
README rewritten:
- First line: "The SQLite of LLMs" (not "KV-compressed inference")
- Quick start: 3 lines of Python, no model download instructions
- Drop the "SQLite of LLMs" metaphor into the hero section
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
-<h3 align="center">The single-header C engine for KV-compressed LLM inference</h3>
+<h3 align="center">The SQLite of LLMs</h3>
+<p align="center"><b>Add AI to any C project with a single 16K-line file. Zero dependencies.</b></p>
 
 <p align="center">
-Production: <code>uniform_4b</code> KV cache (4–7x compression at +6% PPL on Llama 3.2 3B).<br>
-Research: building blocks for <a href="https://arxiv.org/abs/2504.19874">TurboQuant</a>, <a href="https://arxiv.org/abs/2502.02617">PolarQuant</a>, <a href="https://arxiv.org/abs/2406.03482">QJL</a> — 7 KV quantization types in one engine.<br>
-72K LOC pure C, zero dependencies. Ships as <a href="#-single-header-mode"><b>quant.h</b></a> — drop one file into any project.<br>
-Runs everywhere a C compiler does: <b>iOS · Android · WASM · MSVC · microcontrollers</b>.
+Drop <a href="#-single-header-mode"><code>quant.h</code></a> (one file, 646 KB) into your project and get LLM inference.<br>
+No CMake, no submodules, no package managers. Just <code>cc app.c -lm</code>.<br>
+Runs everywhere a C compiler does: <b>iOS, Android, WASM, microcontrollers, MSVC</b>.<br>
+Built-in <a href="#kv-cache-compression">KV cache compression</a>: 7x memory reduction at fp32-parity speed.
 </p>
 
 <p align="center">
@@ -25,7 +26,7 @@
 
 ---
 
-## Install
+## Quick Start (30 seconds)
 
 ```bash
 pip install quantcpp
@@ -34,17 +35,21 @@ pip install quantcpp
 ```python
 from quantcpp import Model
 
-m = Model("path/to/your.gguf")  # any GGUF file you have on disk
-print(m.ask("What is 2+2?"))
+# Downloads a small model automatically (~135 MB, one-time)
+m = Model.from_pretrained("SmolLM2-135M")
+print(m.ask("What is gravity?"))
+```
+
+That's it. No API key, no GPU, no configuration. The model downloads once and is cached at `~/.cache/quantcpp/`.
 
-# Streaming
+**Bring your own model:**
+```python
+m = Model("path/to/any-model.gguf")  # any GGUF file works
 for tok in m.generate("Once upon a time"):
     print(tok, end="", flush=True)
 ```
 
-Pre-built wheels for Linux x86_64, Linux aarch64, macOS arm64 (Python 3.9–3.13). Other platforms fall back to source distribution which compiles `quant.h` automatically — no external dependencies, just a C compiler.
-
-> **Note (v0.8.x):** the Python bindings currently default to `kv_compress=0` (no KV compression). KV compression is fully working in the CLI `quant` binary; bringing it to the bindings is tracked for v0.8.2 (regenerated single-header). See [CHANGELOG](CHANGELOG.md#081--2026-04-09-python-bindings-hotfix) for details.
+Pre-built wheels for Linux x86_64/aarch64, macOS arm64 (Python 3.9-3.13). Other platforms compile from source automatically.