# TurboQuant.cpp — Session State

**Last updated**: 2026-03-29 (v0.9.2 TQM format for instant model loading)
**Last commit**: pending

## Speed Progression
llama.cpp Q4_K_M: ~50 tok/s ← target

- `src/engine/tq_ops.c` — Added tq_matmul_q4_preq(), fixed unused var warning
- `include/turboquant/tq_engine.h` — Added tq_matmul_q4_preq() declaration

## v0.9.2 Changes — TQM Format (Instant Model Loading)

### Problem
Loading a safetensors BF16 model requires: mmap → parse JSON → BF16→FP32 convert → Q4 quantize.
This takes ~6s for a 0.8B model. Goal: <0.5s via a pre-quantized, mmap-ready format.

### Solution: TQM (TurboQuant Model) binary format
- 512-byte packed header (`tqm_header_t`) with the full model config
- Embedded tokenizer.json (raw bytes, variable size)
- Pre-quantized Q4 weights + FP32 norms + BF16 embeddings
- All sections 64-byte aligned for efficient mmap access
- Zero-copy loading: weight pointers point directly into the mmap'd file

### Components Implemented
1. **Format definition** (`include/turboquant/tq_engine.h`)
   - `tqm_header_t` — 512-byte packed struct with magic, config, section offsets
   - `TQM_MAGIC` (0x4D515454 = "TTQM"), `TQM_VERSION` (1), `TQM_ALIGN` (64)

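A header with this contract can be sketched as below. Only the magic, version, alignment constant, and 512-byte total come from these notes; every config and offset field here is a hypothetical stand-in for the real layout in `tq_engine.h`.

```c
#include <stdint.h>

/* Sketch of a 512-byte packed TQM header. Only magic, version, and the
 * 512-byte total are from the notes; the config/offset fields are
 * hypothetical stand-ins for the real tqm_header_t. */
#pragma pack(push, 1)
typedef struct {
    uint32_t magic;             /* TQM_MAGIC = 0x4D515454 ("TTQM" as LE bytes) */
    uint32_t version;           /* TQM_VERSION = 1 */
    uint32_t n_layers;          /* hypothetical model-config fields */
    uint32_t n_heads;
    uint32_t d_model;
    uint32_t vocab_size;
    uint64_t tokenizer_offset;  /* section offsets, 64-byte (TQM_ALIGN) aligned */
    uint64_t tokenizer_size;
    uint64_t weights_offset;
    uint8_t  reserved[464];     /* pad total size to exactly 512 bytes */
} tqm_header_t;
#pragma pack(pop)

_Static_assert(sizeof(tqm_header_t) == 512, "TQM header must be 512 bytes");
```

The static assert mirrors the test suite's header-size check, and a reserved tail leaves room to grow the config without bumping the format version.
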
2. **TQM loader** (`src/engine/tq_model.c`)
   - `tq_load_tqm()` — mmap the file, cast the header, set weight pointers directly
   - Zero malloc for weights, zero conversion — all pointers into mmap'd data
   - `tq_load_model()` auto-detects TQM vs safetensors by magic bytes

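In outline, the zero-copy path can look like this sketch (a hypothetical `demo_header_t` with a single weights offset, not the real loader): mmap the file, validate the magic, and return a pointer straight into the mapping with no allocation or copying.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical minimal header; the real tqm_header_t carries far more. */
typedef struct {
    uint32_t magic;           /* expect 0x4D515454 */
    uint32_t version;
    uint64_t weights_offset;  /* aligned offset of the weight section */
} demo_header_t;

/* Map the file and return a pointer to the weights inside the mapping.
 * Zero malloc, zero conversion; the sketch keeps the mapping alive for
 * the process lifetime (a real loader would track base/size for munmap). */
const uint8_t *demo_load(const char *path, size_t *out_file_size) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size < (off_t)sizeof(demo_header_t)) {
        close(fd);
        return NULL;
    }
    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping remains valid after close */
    if (base == MAP_FAILED) return NULL;
    const demo_header_t *hdr = (const demo_header_t *)base;
    if (hdr->magic != 0x4D515454u) {
        munmap(base, (size_t)st.st_size);
        return NULL;
    }
    *out_file_size = (size_t)st.st_size;
    return (const uint8_t *)base + hdr->weights_offset;
}
```

With this shape, the format auto-detect reduces to reading the first four bytes and dispatching on whether they match TQM_MAGIC.
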
3. **TQM saver** (`src/engine/tq_model.c`)
   - `tq_save_tqm()` — writes header + tokenizer + Q4 weights sequentially
   - Handles BF16 embed passthrough and on-the-fly FP32→BF16 conversion
   - Supports tied/untied output weights

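The on-the-fly FP32→BF16 step amounts to keeping the top 16 bits of the float. A sketch with round-to-nearest-even (whether `tq_save_tqm()` rounds or truncates isn't stated here):

```c
#include <stdint.h>
#include <string.h>

/* Convert FP32 to BF16 by rounding off the low 16 mantissa bits
 * (round-to-nearest-even). NaN payloads are not special-cased. */
static uint16_t fp32_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);         /* type-pun safely via memcpy */
    bits += 0x7FFFu + ((bits >> 16) & 1u);  /* round half to even */
    return (uint16_t)(bits >> 16);
}
```

BF16 keeps FP32's full exponent range, which is why embeddings can be stored this way without any rescaling.
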
4. **Converter tool** (`tools/tq_convert.c`)
   - CLI: `tq_convert model.safetensors tokenizer.json -o model.tqm`
   - 3-step pipeline: load → quantize Q4 → write TQM

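The quantize step in that pipeline could follow a block-wise scheme like the sketch below: 32-element blocks, one FP32 scale per block, two 4-bit codes packed per byte. The block size, symmetric range, and names are assumptions; these notes don't spell out TurboQuant's actual Q4 layout.

```c
#include <stdint.h>

#define QBLOCK 32  /* assumed block size */

typedef struct {
    float   scale;          /* per-block dequantization scale */
    uint8_t q[QBLOCK / 2];  /* 32 4-bit codes, two per byte */
} q4_block_t;

/* Quantize one block of 32 floats to 4 bits each: map [-amax, amax]
 * onto the integer range [-7, 7], then bias by 8 into 0..15 for packing. */
void quantize_q4_block(const float *x, q4_block_t *out) {
    float amax = 0.0f;
    for (int i = 0; i < QBLOCK; i++) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > amax) amax = a;
    }
    out->scale = amax / 7.0f;
    float inv = out->scale != 0.0f ? 1.0f / out->scale : 0.0f;
    for (int i = 0; i < QBLOCK; i += 2) {
        float v0 = x[i] * inv, v1 = x[i + 1] * inv;
        int lo = (int)(v0 + (v0 < 0.0f ? -0.5f : 0.5f)) + 8;  /* low nibble  */
        int hi = (int)(v1 + (v1 < 0.0f ? -0.5f : 0.5f)) + 8;  /* high nibble */
        out->q[i / 2] = (uint8_t)(lo | (hi << 4));
    }
}
```

Dequantization is then `(nibble - 8) * scale` per element.
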
5. **Tokenizer from memory** (`src/engine/tq_tokenizer.c`)
   - `tq_load_tokenizer_from_memory()` — parses JSON from a buffer
   - `tq_load_tokenizer_from_tqm()` — extracts the embedded tokenizer from a .tqm file
   - `tq_run` auto-loads the embedded tokenizer when no -t flag is given

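A common way to structure such a pair, and plausibly what the split above does, is to make the file loader a thin wrapper that reads bytes and delegates to the buffer entry point. `parse_tokenizer_json` and the `demo_` names below are stand-ins, not the real API.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct { size_t n_bytes; } demo_tokenizer_t;

/* Stand-in for the real JSON parsing in tq_tokenizer.c. */
static int parse_tokenizer_json(const char *buf, size_t len, demo_tokenizer_t *t) {
    if (len == 0 || buf[0] != '{') return -1;  /* trivial sanity check */
    t->n_bytes = len;
    return 0;
}

/* Buffer entry point: the TQM path hands the embedded tokenizer bytes here. */
int demo_load_tokenizer_from_memory(const char *buf, size_t len, demo_tokenizer_t *t) {
    return parse_tokenizer_json(buf, len, t);
}

/* File entry point: just read the bytes, then delegate to the buffer path. */
int demo_load_tokenizer_from_file(const char *path, demo_tokenizer_t *t) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return -1; }
    long len = ftell(f);
    rewind(f);
    char *buf = malloc(len > 0 ? (size_t)len : 1);
    int rc = -1;
    if (buf && len >= 0 && fread(buf, 1, (size_t)len, f) == (size_t)len)
        rc = demo_load_tokenizer_from_memory(buf, (size_t)len, t);
    free(buf);
    fclose(f);
    return rc;
}
```

With this shape, embedding the tokenizer costs no new parsing code: the tokenizer section can be read straight out of the .tqm mapping and fed to the from-memory path.
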
6. **Tests** (`tests/test_tqm.cpp`)
   - Header size verification (512 bytes)
   - Magic value verification
   - Save/load roundtrip with a synthetic model (norm + Q4 weight byte-exact match)
   - Format auto-detect (tq_load_model dispatches correctly)
   - Tokenizer from-memory loading
   - All 20 tests pass (6 of them new TQM tests)

### Files Modified/Created
- `include/turboquant/tq_engine.h` — tqm_header_t, tq_load_tqm, tq_save_tqm, tq_load_tokenizer_from_memory/tqm
- `src/engine/tq_model.c` — tq_load_tqm(), tq_save_tqm(), auto-detect in tq_load_model()
- `src/engine/tq_tokenizer.c` — tq_load_tokenizer_from_memory(), tq_load_tokenizer_from_tqm()
- `tools/tq_convert.c` — NEW converter tool
- `tools/tq_run.c` — auto-load embedded tokenizer from TQM
- `tests/test_tqm.cpp` — NEW test file (6 tests)
- `CMakeLists.txt` — added tq_convert build target

## What Needs Work
1. Measure actual speed improvement (need a model file for tq_run)
2. Q4 quality on short prompts