# TurboQuant.cpp v1.1 PRD — Long Context Proof

**Version**: 1.1
**Date**: 2026-03-31
**Status**: Draft
**Author**: Product / Engineering

---

## Overview

TurboQuant.cpp v1.1 proves the practical value of KV cache compression by supporting 3B+ parameter models and demonstrating measurable memory savings at long context lengths (8K-32K tokens). The release culminates in a public benchmark showing that TurboQuant continues inference where llama.cpp runs out of memory.

Current state: 47 stars, 9 forks, two toy-sized models (270M, 0.8B), no GitHub Release, and no long-context proof. v1.1 closes all three gaps.

---

## Objectives

1. **Practical model support**: Run at least one 3B+ model with verified output quality (cosine similarity > 0.99 vs reference).
2. **Long context proof**: Produce a reproducible benchmark showing 5-7x KV memory reduction at 32K context, including an OOM crossover chart.
3. **First public release**: Ship GitHub Release v0.1.0 with pre-built binaries and benchmark data.
4. **Community traction**: Post benchmark results to r/LocalLLaMA and Hacker News with concrete data.

---

## Scope

### Phase A: Larger Model Support

**Goal**: Make TurboQuant useful beyond toy models.

**Target model** (pick one, in priority order):
1. Llama 3.2 3B — widest community adoption
2. Qwen3.5-3B — existing Qwen architecture support reduces work
3. Gemma 3 4B — existing Gemma architecture support

**Required changes**:

| Area | Current Limitation | Required Change |
|------|-------------------|-----------------|
| Buffer allocation | Stack-allocated `float[4096]` arrays | Dynamic allocation based on model config (`n_embd`, `n_head`, `n_ff`); see the sketch after this table |
| Intermediate dimensions | Hardcoded or capped at 4096 | Read from model config, allocate at init time |
| Weight loading | Assumes small weight files | Streaming/mmap loading for multi-GB safetensors |
| Memory budget | No tracking | Add peak memory tracking and reporting |
| KV cache sizing | Sized for small models | Scale with `2 * n_layers * n_heads * head_dim * max_seq_len * bytes_per_element` |

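As a rough illustration of the buffer-allocation change, the sketch below sizes per-layer scratch buffers from the model config at init time instead of using fixed `float[4096]` arrays. The names (`tq_config`, `tq_buffers`, `tq_buffers_init`) are illustrative, not the existing TurboQuant API.

```c
#include <stdlib.h>

typedef struct {
    int n_embd;   /* embedding width, read from model config */
    int n_head;   /* attention heads */
    int n_ff;     /* feed-forward hidden size */
} tq_config;

typedef struct {
    float *attn_scratch;  /* n_embd floats */
    float *ffn_scratch;   /* n_ff floats */
} tq_buffers;

/* Allocate once at init time, sized from the config; returns 0 on success. */
static int tq_buffers_init(tq_buffers *b, const tq_config *c) {
    b->attn_scratch = malloc((size_t)c->n_embd * sizeof(float));
    b->ffn_scratch  = malloc((size_t)c->n_ff   * sizeof(float));
    if (!b->attn_scratch || !b->ffn_scratch) {
        free(b->attn_scratch);
        free(b->ffn_scratch);
        return -1;
    }
    return 0;
}

static void tq_buffers_free(tq_buffers *b) {
    free(b->attn_scratch);
    free(b->ffn_scratch);
}
```
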
**Deliverables**:
- [ ] Refactor all stack-allocated per-layer buffers to heap allocation sized from model config
- [ ] Implement or extend architecture dispatch for the chosen 3B model
- [ ] Converter script (`tq_convert`) handles 3B+ safetensors (see the mmap sketch after this list)
- [ ] End-to-end inference produces coherent text verified against PyTorch reference
- [ ] Document supported model in README

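A minimal sketch of the streaming/mmap direction for multi-GB weight files, assuming POSIX mmap on macOS/Linux; `tq_map_weights` is a hypothetical helper name, not an existing entry point. Mapping the file read-only lets tensor data be paged in on demand instead of being copied into RAM up front.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a weight file read-only; returns NULL on failure. */
static const void *tq_map_weights(const char *path, size_t *out_size) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }

    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after close */
    if (base == MAP_FAILED) return NULL;

    *out_size = (size_t)st.st_size;
    return base;  /* tensor views point into this region; munmap() at shutdown */
}
```
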
### Phase B: Long Context Benchmark

**Goal**: Prove KV compression matters with hard numbers.

**Benchmark design**:
- **Context lengths**: 1K, 2K, 4K, 8K, 16K, 32K tokens
- **Measurements**: KV cache memory (bytes), peak RSS (see the sketch after this list), tokens/sec, output quality
- **Comparison**: TurboQuant (PolarQuant 3-bit KV) vs llama.cpp (FP16 KV)
- **Hardware**: 8GB RAM machine (or constrained via `ulimit`)

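For the peak-RSS measurement, a minimal sketch using `getrusage` follows; the helper name `tq_peak_rss_bytes` is hypothetical. Note the platform quirk: `ru_maxrss` is reported in bytes on macOS but in kilobytes on Linux.

```c
#include <sys/resource.h>

/* Return peak resident set size in bytes, or -1 on failure. */
static long long tq_peak_rss_bytes(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) return -1;
#ifdef __APPLE__
    return (long long)ru.ru_maxrss;           /* macOS reports bytes */
#else
    return (long long)ru.ru_maxrss * 1024LL;  /* Linux reports kilobytes */
#endif
}
```
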
**Key experiments**:

1. **Memory scaling chart**: X-axis = context length, Y-axis = KV cache memory. Two lines: TurboQuant vs llama.cpp. Should show a ~5-7x gap widening linearly with context length.

2. **OOM crossover**: Find the context length N where llama.cpp exceeds available memory but TurboQuant still runs. For a 3B model on 8GB RAM, this crossover should be around 16K-32K tokens.

3. **Quality preservation**: At each context length, measure output cosine similarity (see the sketch after this list) to verify that compression does not degrade quality at long contexts.

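A minimal sketch of the quality metric, assuming TurboQuant logits and a PyTorch reference dump are compared as flat float arrays; the function name is illustrative, not an existing test helper.

```c
#include <math.h>
#include <stddef.h>

/* Cosine similarity between two vectors; the target is > 0.99 vs the reference. */
static double cosine_similarity(const float *a, const float *b, size_t n) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (size_t i = 0; i < n; i++) {
        dot += (double)a[i] * (double)b[i];
        na  += (double)a[i] * (double)a[i];
        nb  += (double)b[i] * (double)b[i];
    }
    if (na == 0.0 || nb == 0.0) return 0.0;
    return dot / (sqrt(na) * sqrt(nb));
}
```
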
**Deliverables**:
- [ ] `bench/long_context.sh` — automated benchmark script
- [ ] `bench/plot_memory.py` — generates PNG chart from benchmark data
- [ ] CSV output with raw numbers for reproducibility
- [ ] Chart showing memory crossover point
- [ ] Quality metrics at each context length

### Phase C: Release and Community

**Goal**: Make the proof visible.

**GitHub Release v0.1.0**:
- [ ] Tag `v0.1.0` on main
- [ ] Pre-built binaries: macOS ARM64 (`tq_run`, `tq_convert`)
- [ ] Pre-built binaries: Ubuntu x86-64 (via GitHub Actions or cross-compile)
- [ ] CHANGELOG.md with feature summary
- [ ] Release notes include benchmark chart and key numbers

**README update**:
- [ ] Long context benchmark chart (the memory scaling PNG)
- [ ] Updated model table with 3B model
- [ ] "Why KV compression matters" section with concrete numbers

**Community posts**:
- [ ] r/LocalLLaMA post: lead with the OOM crossover chart, explain what KV cache compression enables
- [ ] Hacker News: "Show HN" with the benchmark as the hook
- [ ] Prepare responses for expected questions (quality loss, model support, vs llama.cpp)

---

## Non-Goals

- GPU or Metal backend support
- New quantization types beyond existing PolarQuant/QJL/Uniform
- Speed improvements or multi-threading optimization
- Supporting more than one new model architecture
- GGUF format compatibility
- Speculative decoding or other advanced inference features
- Windows support

---

## Stakeholders

| Role | Who | Interest |
|------|-----|----------|
| Developer / Maintainer | Core team | Architecture decisions, implementation |
| Early adopters | GitHub stargazers (47) | Want to run useful models, need proof of value |
| r/LocalLLaMA community | ~500K members | Care about practical benchmarks, memory efficiency |
| Potential contributors | Forks (9) | Need clear architecture, build instructions, tests |

---

## User Personas

**Alex — Memory-Constrained Hobbyist**
Runs LLMs on an 8GB laptop. Cannot use llama.cpp for long conversations because the KV cache eats all available RAM. Wants to chat with a 3B model at 16K+ context without OOM.

**Sam — LLM Framework Evaluator**
Evaluates inference engines for integration. Needs benchmark data comparing memory usage. Will not consider TurboQuant without reproducible numbers on a real model.

---

## Technical Constraints

- **Language**: Pure C11 core, no external dependencies (libc/libm only)
- **Backward compatibility**: Must not break existing Qwen3.5-0.8B and Gemma 3 270M support
- **SIMD**: All NEON code must have a scalar fallback for x86 CI (see the sketch after this list)
- **Platforms**: macOS ARM64 (primary), Ubuntu x86-64 (CI and release)
- **Testing**: All new code must have unit tests. Existing tests must continue to pass.

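As an illustration of the SIMD constraint, the same dot product with a NEON path for ARM64 and a scalar path for x86 CI might look like the sketch below (hypothetical helper, not existing TurboQuant code).

```c
#include <stddef.h>
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Dot product: NEON on AArch64, plain C everywhere else (and for the tail). */
static float tq_dot(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    size_t i = 0;
#if defined(__aarch64__) && defined(__ARM_NEON)
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4) {
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    }
    sum = vaddvq_f32(acc);
#endif
    for (; i < n; i++) sum += a[i] * b[i];  /* scalar fallback / tail */
    return sum;
}
```
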
---

## Success Criteria

| # | Criterion | Measurement | Target |
|---|-----------|-------------|--------|
| 1 | 3B+ model runs | End-to-end inference, coherent output | Cosine sim > 0.99 vs PyTorch |
| 2 | KV memory reduction | Measured at 32K context | 5-7x smaller than llama.cpp FP16 KV |
| 3 | OOM crossover | llama.cpp OOMs at N tokens, TurboQuant does not | Demonstrated on 8GB RAM |
| 4 | GitHub Release | v0.1.0 published with binaries | macOS ARM64 + Ubuntu x86-64 |
| 5 | Community reception | Reddit/HN post with benchmark data | Positive reception, not spam-filtered |

---

## Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| 3B model has unsupported ops (e.g., new attention variant) | Medium | High | Start with Qwen3.5-3B which shares architecture with existing 0.8B support |
| Dynamic allocation refactor breaks existing models | Medium | High | Run existing test suite after every refactor step; keep old paths as fallback |
| Quality degrades at long context with 3-bit KV | Low | High | Measure cosine similarity at each context length; fall back to 4-bit if needed |
| llama.cpp does not actually OOM at expected context length | Medium | Medium | Use `ulimit -v` to constrain memory; pick hardware/model combo where crossover is clear |
| Reddit post gets spam-filtered again | Medium | Low | Build karma first; post from established account; follow subreddit rules exactly |

---

## Execution Order

Phases are sequential with clear gates:

```
Phase A (Week 1-2)         Phase B (Week 3)         Phase C (Week 4)
──────────────────         ────────────────         ────────────────
Buffer refactor        --> Benchmark script     --> Tag v0.1.0
3B model support       --> Memory measurement   --> Build binaries
Verify output quality  --> OOM crossover test   --> Update README
                           Generate chart       --> Community posts
```

**Gate A->B**: 3B model produces coherent output with verified quality.
**Gate B->C**: Benchmark chart shows clear memory advantage and OOM crossover.

---

## Appendix: KV Cache Memory Math

For a 3B model with typical config (32 layers, 32 heads, 128 head_dim):

```
KV cache per token = 2 * n_layers * n_heads * head_dim * dtype_size
                   = 2 * 32 * 32 * 128 * 2 bytes (FP16)
                   = 524,288 bytes per token

At 32K context:
  FP16 KV:     524,288 * 32,768 = 16 GB        (exceeds 8GB RAM)
  3-bit TQ KV: 524,288 * 32,768 / 5.3 = ~3 GB  (fits in 8GB RAM)
```

This is the crossover. At 32K context on 8GB RAM, llama.cpp cannot run. TurboQuant can.
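
The same arithmetic as a small self-contained program (a hypothetical helper mirroring the numbers above; `bits_per_element` ignores any per-block quantization metadata):

```c
#include <stdint.h>
#include <stdio.h>

/* Total KV cache size: 2 (K and V) * layers * heads * head_dim * element size * tokens. */
static uint64_t kv_cache_bytes(int n_layers, int n_heads, int head_dim,
                               int n_tokens, double bits_per_element) {
    double per_token = 2.0 * n_layers * n_heads * head_dim * (bits_per_element / 8.0);
    return (uint64_t)(per_token * n_tokens);
}

int main(void) {
    double gib = 1024.0 * 1024.0 * 1024.0;
    printf("FP16  @32K: %.2f GiB\n", kv_cache_bytes(32, 32, 128, 32768, 16.0) / gib);
    printf("3-bit @32K: %.2f GiB\n", kv_cache_bytes(32, 32, 128, 32768, 3.0) / gib);
    return 0;
}
```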