Commit b4949fe

unamedkr and claude committed

Add PRD v1.1 and WBS v1.1: long context proof strategy

Phase A: 3B+ model support (dynamic buffers, Gemma 4B)
Phase B: Long context benchmark (KV memory comparison vs llama.cpp)
Phase C: Release v0.1.0 + community relaunch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent: 9015a23

# TurboQuant.cpp v1.1 PRD — Long Context Proof

**Version**: 1.1
**Date**: 2026-03-31
**Status**: Draft
**Author**: Product / Engineering

---
## Overview

TurboQuant.cpp v1.1 proves the practical value of KV cache compression by supporting 3B+ parameter models and demonstrating measurable memory savings at long context lengths (8K-32K tokens). The release culminates in a public benchmark showing that TurboQuant continues inference where llama.cpp runs out of memory.

Current state: 47 stars, 9 forks, two toy-sized models (270M, 0.8B), no GitHub Release, no long-context proof. v1.1 closes all three gaps: practical model support, a long-context proof, and a first public release.

---
## Objectives

1. **Practical model support**: Run at least one 3B+ model with verified output quality (cosine similarity > 0.99 vs reference).
2. **Long context proof**: Produce a reproducible benchmark showing 5-7x KV memory reduction at 32K context, including an OOM crossover chart.
3. **First public release**: Ship GitHub Release v0.1.0 with pre-built binaries and benchmark data.
4. **Community traction**: Post benchmark results to r/LocalLLaMA and Hacker News with concrete data.

---
## Scope

### Phase A: Larger Model Support

**Goal**: Make TurboQuant useful beyond toy models.

**Target model** (pick one, in priority order):
1. Llama 3.2 3B — widest community adoption
2. Qwen3.5-3B — existing Qwen architecture support reduces work
3. Gemma 3 4B — existing Gemma architecture support
**Required changes**:

| Area | Current Limitation | Required Change |
|------|--------------------|-----------------|
| Buffer allocation | Stack-allocated `float[4096]` arrays | Dynamic allocation based on model config (`n_embd`, `n_head`, `n_ff`) |
| Intermediate dimensions | Hardcoded or capped at 4096 | Read from model config, allocate at init time |
| Weight loading | Assumes small weight files | Streaming/mmap loading for multi-GB safetensors (see the mmap sketch after the deliverables list) |
| Memory budget | No tracking | Add peak memory tracking and reporting |
| KV cache sizing | Sized for small models | Scale with `n_layers * n_heads * head_dim * max_seq_len` |
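
The buffer-allocation and KV-sizing rows are the core of this phase. Below is a minimal sketch of config-driven heap allocation under stated assumptions: the struct and field names (`tq_model_config`, `tq_layer_buffers`, `attn_in`, `ffn_inter`) are hypothetical placeholders rather than the existing TurboQuant API, and the KV buffer is shown as plain `float` even though the real cache holds quantized values.

```c
/* Sketch only -- illustrative names, not the existing TurboQuant API. */
#include <stdlib.h>

typedef struct {
    int n_layers;
    int n_heads;
    int head_dim;
    int n_embd;       /* hidden size read from the model config */
    int n_ff;         /* feed-forward intermediate size */
    int max_seq_len;  /* longest context the KV cache must hold */
} tq_model_config;

typedef struct {
    float *attn_in;    /* [n_embd]  replaces a stack float[4096] */
    float *ffn_inter;  /* [n_ff]    replaces a hardcoded 4096 cap */
    float *kv_cache;   /* per-layer K and V; quantized in the real engine */
    size_t kv_bytes;   /* tracked for peak-memory reporting */
} tq_layer_buffers;

static int tq_buffers_init(tq_layer_buffers *b, const tq_model_config *c) {
    /* K and V: 2 * n_heads * head_dim elements per token, per layer. */
    size_t kv_elems = (size_t)2 * c->n_heads * c->head_dim * c->max_seq_len;

    b->attn_in   = malloc((size_t)c->n_embd * sizeof(float));
    b->ffn_inter = malloc((size_t)c->n_ff * sizeof(float));
    b->kv_cache  = malloc(kv_elems * sizeof(float));
    b->kv_bytes  = kv_elems * sizeof(float);

    if (!b->attn_in || !b->ffn_inter || !b->kv_cache) {
        free(b->attn_in);
        free(b->ffn_inter);
        free(b->kv_cache);
        return -1;  /* caller reports the failed memory budget */
    }
    return 0;
}
```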
**Deliverables**:
- [ ] Refactor all stack-allocated per-layer buffers to heap allocation sized from model config
- [ ] Implement or extend architecture dispatch for the chosen 3B model
- [ ] Converter script (`tq_convert`) handles 3B+ safetensors
- [ ] End-to-end inference produces coherent text verified against PyTorch reference
- [ ] Document supported model in README
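
The weight-loading row in the table above calls for streaming/mmap loading of multi-GB safetensors. A minimal sketch follows, assuming a plain POSIX read-only mapping (provided by libc on both target platforms, macOS and Ubuntu); the function name `tq_map_file` is hypothetical, and error handling is kept deliberately thin.

```c
/* Sketch only: map a multi-GB safetensors file read-only so weights are
 * paged in on demand rather than copied into heap buffers up front. */
#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static const void *tq_map_file(const char *path, size_t *size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (base == MAP_FAILED) return NULL;

    *size_out = (size_t)st.st_size;
    return base;  /* caller parses the safetensors header from this pointer
                     and munmap()s the region at shutdown */
}
```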
### Phase B: Long Context Benchmark

**Goal**: Prove KV compression matters with hard numbers.

**Benchmark design**:
- **Context lengths**: 1K, 2K, 4K, 8K, 16K, 32K tokens
- **Measurements**: KV cache memory (bytes), peak RSS, tokens/sec, output quality (a peak-RSS sampling sketch follows this list)
- **Comparison**: TurboQuant (PolarQuant 3-bit KV) vs llama.cpp (FP16 KV)
- **Hardware**: 8GB RAM machine (or constrained via `ulimit`)
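
Peak RSS can be sampled externally (e.g. with the OS `time` utility) or from inside the benchmark binary; the sketch below takes the in-process route via POSIX `getrusage()`. The function name `tq_peak_rss_bytes` is hypothetical, and the KV-cache byte count itself would come from the engine's own allocation tracking rather than from this call.

```c
/* Sketch only: in-process peak RSS for the benchmark CSV. getrusage() is
 * POSIX; ru_maxrss is reported in kilobytes on Linux and in bytes on macOS. */
#include <stdint.h>
#include <sys/resource.h>

static uint64_t tq_peak_rss_bytes(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) return 0;
#ifdef __APPLE__
    return (uint64_t)ru.ru_maxrss;           /* already bytes on macOS */
#else
    return (uint64_t)ru.ru_maxrss * 1024u;   /* kilobytes -> bytes on Linux */
#endif
}
```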
**Key experiments**:

1. **Memory scaling chart**: X-axis = context length, Y-axis = KV cache memory. Two lines: TurboQuant vs llama.cpp. Should show a roughly constant 5-7x ratio, with the absolute gap widening linearly with context length.

2. **OOM crossover**: Find the context length N where llama.cpp exceeds available memory but TurboQuant still runs. For a 3B model on 8GB RAM, this crossover should be around 16K-32K tokens.

3. **Quality preservation**: At each context length, measure output cosine similarity to prove compression does not degrade quality at long contexts (a minimal similarity routine is sketched below).
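
The quality-preservation check reduces to a cosine similarity between the engine's output logits and an uncompressed reference (success criterion 1 uses a PyTorch reference). A minimal routine is sketched below; `tq_cosine_sim` is a hypothetical name, and double accumulators are used so the metric stays stable over long vectors.

```c
/* Sketch only: cosine similarity between two logit vectors. */
#include <math.h>
#include <stddef.h>

static double tq_cosine_sim(const float *a, const float *b, size_t n) {
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (size_t i = 0; i < n; i++) {
        dot    += (double)a[i] * (double)b[i];
        norm_a += (double)a[i] * (double)a[i];
        norm_b += (double)b[i] * (double)b[i];
    }
    if (norm_a == 0.0 || norm_b == 0.0) return 0.0;
    return dot / (sqrt(norm_a) * sqrt(norm_b));
}
```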
**Deliverables**:
- [ ] `bench/long_context.sh` — automated benchmark script
- [ ] `bench/plot_memory.py` — generates PNG chart from benchmark data
- [ ] CSV output with raw numbers for reproducibility
- [ ] Chart showing memory crossover point
- [ ] Quality metrics at each context length
### Phase C: Release and Community

**Goal**: Make the proof visible.

**GitHub Release v0.1.0**:
- [ ] Tag `v0.1.0` on main
- [ ] Pre-built binaries: macOS ARM64 (`tq_run`, `tq_convert`)
- [ ] Pre-built binaries: Ubuntu x86-64 (via GitHub Actions or cross-compile)
- [ ] CHANGELOG.md with feature summary
- [ ] Release notes include benchmark chart and key numbers

**README update**:
- [ ] Long context benchmark chart (the memory scaling PNG)
- [ ] Updated model table with the 3B model
- [ ] "Why KV compression matters" section with concrete numbers

**Community posts**:
- [ ] r/LocalLLaMA post: lead with the OOM crossover chart, explain what KV cache compression enables
- [ ] Hacker News: "Show HN" with the benchmark as the hook
- [ ] Prepare responses for expected questions (quality loss, model support, vs llama.cpp)

---
## Non-Goals

- GPU or Metal backend support
- New quantization types beyond existing PolarQuant/QJL/Uniform
- Speed improvements or multi-threading optimization
- Supporting more than one new model architecture
- GGUF format compatibility
- Speculative decoding or other advanced inference features
- Windows support

---
## Stakeholders

| Role | Who | Interest |
|------|-----|----------|
| Developer / Maintainer | Core team | Architecture decisions, implementation |
| Early adopters | GitHub stargazers (47) | Want to run useful models, need proof of value |
| r/LocalLLaMA community | ~500K members | Care about practical benchmarks, memory efficiency |
| Potential contributors | Forks (9) | Need clear architecture, build instructions, tests |

---
## User Personas

**Alex — Memory-Constrained Hobbyist**
Runs LLMs on an 8GB laptop. Cannot use llama.cpp for long conversations because the KV cache eats all available RAM. Wants to chat with a 3B model at 16K+ context without OOM.

**Sam — LLM Framework Evaluator**
Evaluates inference engines for integration. Needs benchmark data comparing memory usage. Will not consider TurboQuant without reproducible numbers on a real model.

---
## Technical Constraints

- **Language**: Pure C11 core, no external dependencies (libc/libm only)
- **Backward compatibility**: Must not break existing Qwen3.5-0.8B and Gemma 3 270M support
- **SIMD**: All NEON code must have a scalar fallback for x86 CI
- **Platforms**: macOS ARM64 (primary), Ubuntu x86-64 (CI and release)
- **Testing**: All new code must have unit tests. Existing tests must continue to pass.

---
## Success Criteria

| # | Criterion | Measurement | Target |
|---|-----------|-------------|--------|
| 1 | 3B+ model runs | End-to-end inference, coherent output | Cosine sim > 0.99 vs PyTorch |
| 2 | KV memory reduction | Measured at 32K context | 5-7x less than llama.cpp |
| 3 | OOM crossover | llama.cpp OOMs at N tokens, TurboQuant does not | Demonstrated on 8GB RAM |
| 4 | GitHub Release | v0.1.0 published with binaries | macOS ARM64 + Ubuntu x86-64 |
| 5 | Community reception | Reddit/HN post with benchmark data | Positive reception, not spam-filtered |

---
## Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| 3B model has unsupported ops (e.g., new attention variant) | Medium | High | Start with Qwen3.5-3B, which shares its architecture with the existing 0.8B support |
| Dynamic allocation refactor breaks existing models | Medium | High | Run existing test suite after every refactor step; keep old paths as fallback |
| Quality degrades at long context with 3-bit KV | Low | High | Measure cosine similarity at each context length; fall back to 4-bit if needed |
| llama.cpp does not actually OOM at the expected context length | Medium | Medium | Use `ulimit -v` to constrain memory; pick a hardware/model combo where the crossover is clear |
| Reddit post gets spam-filtered again | Medium | Low | Build karma first; post from an established account; follow subreddit rules exactly |

---
## Execution Order

Phases are sequential with clear gates:

```
Phase A (Week 1-2)         Phase B (Week 3)          Phase C (Week 4)
─────────────────          ──────────────            ──────────────
Buffer refactor        --> Benchmark script      --> Tag v0.1.0
3B model support       --> Memory measurement    --> Build binaries
Verify output quality  --> OOM crossover test    --> Update README
                           Generate chart        --> Community posts
```

**Gate A->B**: 3B model produces coherent output with verified quality.
**Gate B->C**: Benchmark chart shows clear memory advantage and OOM crossover.

---
## Appendix: KV Cache Memory Math

For a 3B-class model with an illustrative config (32 layers, 32 heads, head_dim 128):

```
KV cache per token = 2 * n_layers * n_heads * head_dim * dtype_size
                   = 2 * 32 * 32 * 128 * 2 bytes (FP16)
                   = 524,288 bytes per token

At 32K context:
  FP16 KV:     524,288 * 32,768       = 16 GB  (exceeds 8GB RAM)
  3-bit TQ KV: 524,288 * 32,768 / 5.3 = ~3 GB  (fits in 8GB RAM)
```

The 5.3x factor is the FP16-to-3-bit storage ratio (16 bits / 3 bits ≈ 5.3). This is the crossover: at 32K context on 8GB RAM, llama.cpp cannot run. TurboQuant can.
