
Commit 2a20461

TheTom authored and wel97459 committed
docs: tighten README — add turbo2, missing features, paper links
Cross-checked against commit log diff vs upstream + paper corpus:

- add turbo2 KV-cache format (was already in codebase but missed)
- add auto-asymmetric K/V compression policy (PR TheTom#53)
- add Boundary V (layer-aware experimental V compression for turbo2-V)
- add Sparse V dequantization (on all Metal targets)
- add turbo VEC flash attention +9% decode on CUDA
- add V2.1 fused Metal TQ kernels
- add TurboFlash Apple10 known-limitation caveat
- add Vulkan SET_ROWS support for turbo2/turbo4
- note turbo block size moved from 32 to 128

Direct paper links inline (TheTom/turboquant_plus papers) instead of vague references. Drop the tqkit cross-reference per scope.
1 parent 123e88e commit 2a20461

1 file changed

Lines changed: 16 additions & 9 deletions


README.md

@@ -12,43 +12,50 @@
1212

1313
### New quantization types
1414

15-
- **TQ3_1S** / **TQ4_1S** weight quantization formats (added to `GGMLQuantizationType` + `LlamaFileType` enums) — smaller VRAM than `q8_0` at the 3.5-bit / 4.5-bit tier.
16-
- **turbo3** / **turbo4** KV-cache formats — Walsh-Hadamard rotation + polar codebook, ~4.6× compression at <1.5% PPL loss.
15+
- **TQ3_1S** / **TQ4_1S** weight quantization formats — added to `GGMLQuantizationType` + `LlamaFileType` enums, with V2.1 fused Metal kernels. Smaller VRAM than `q8_0` at the 3.5-bit / 4.5-bit tier. ([weight-compression-tq4](https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/weight-compression-tq4.md))
16+
- **turbo2** / **turbo3** / **turbo4** KV-cache formats — Walsh-Hadamard rotation + polar codebook quantization. Block size 128 (originally 32, increased after the block-size experiment). turbo4 went through significant rehabilitation from a broken state to beating `q4_0`. ([attn-rotation-and-ppl-artifact](https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/attn-rotation-and-ppl-artifact.md), [block-size-experiment](https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/block-size-experiment.md), [turbo4-resurrection](https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/turbo4-resurrection.md), [why-mse-fails-for-kv-quantization](https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/why-mse-fails-for-kv-quantization.md))
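The rotation step named in the KV-cache bullet can be illustrated in isolation. The sketch below is a plain-Python orthonormal fast Walsh-Hadamard transform applied to one 128-element block. It is not the fork's kernel: the function name `fwht` and the toy input are illustrative assumptions, and the polar codebook stage is not shown. The usual motivation for rotating before quantizing is that a single outlier gets spread evenly across the block, so a low-bit codebook sees a flatter range.

```python
import math

BLOCK_SIZE = 128  # block size stated in the README; everything else here is illustrative


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length list."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly stage: combine elements that sit h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform norm-preserving
    return [v * scale for v in out]


# One large outlier in an otherwise empty block.
block = [0.0] * BLOCK_SIZE
block[5] = 8.0

rotated = fwht(block)
peak_before = max(abs(v) for v in block)    # 8.0
peak_after = max(abs(v) for v in rotated)   # 8 / sqrt(128), about 0.707
restored = fwht(rotated)                    # the orthonormal transform is its own inverse
```

Because the transform is orthonormal, the block's energy is unchanged and applying it a second time recovers the original values, which is why no separate inverse routine is needed in this sketch.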
17+
18+
### Compression policies
19+
20+
- **Auto-asymmetric K/V compression** — recognizes that V tolerates aggressive compression while K does not; defaults pick complementary K/V codecs rather than symmetric. ([asymmetric-kv-compression](https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/asymmetric-kv-compression.md))
21+
- **Boundary V (experimental, layer-aware)** — auto-enabled for `turbo2-V`. Protects layers where aggressive V quantization breaks quality, leaves the rest aggressively compressed. ([layer-aware-v-compression](https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/layer-aware-v-compression.md), [moe-v-compression-frontier](https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/moe-v-compression-frontier.md))
22+
- **Sparse V dequantization** — skip V dequant for positions whose softmax attention weight falls below threshold. Enabled on all Metal targets. ([sparse-v-dequant](https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/sparse-v-dequant.md))
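The Sparse V idea in the last bullet (skip V dequantization where the softmax weight is below a threshold) can be sketched in a few lines of Python. This is a reading of the one-line description, not the Metal implementation: the threshold value, the `dequant_toy` codec, and the choice not to renormalize the kept weights are all assumptions made for illustration.

```python
import math


def dequant_toy(block):
    """Hypothetical stand-in codec: one float scale times small integer codes."""
    scale, codes = block
    return [scale * c for c in codes]


def sparse_v_attention(logits, v_blocks, threshold=1e-3):
    """Weighted sum of V rows, skipping dequantization of low-weight positions.

    Returns (output, skipped). Kept weights are not renormalized; that is an
    assumption of this sketch.
    """
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # numerically stable softmax
    total = sum(exps)
    output, skipped = None, 0
    for e, block in zip(exps, v_blocks):
        weight = e / total
        if weight < threshold:
            skipped += 1  # no dequant work for this position
            continue
        row = dequant_toy(block)
        if output is None:
            output = [0.0] * len(row)
        for j, x in enumerate(row):
            output[j] += weight * x
    return output or [], skipped


# Three cached positions; only the first carries meaningful attention weight.
logits = [10.0, 0.0, -20.0]
v_blocks = [(0.5, [2, -4]), (0.25, [8, 8]), (1.0, [3, 3])]
output, skipped = sparse_v_attention(logits, v_blocks)
# Positions 1 and 2 fall below the threshold, so skipped == 2
# and output is close to [1.0, -2.0].
```

The saving comes from the `continue` branch: for long contexts most positions carry negligible weight, so most V blocks are never decoded.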
1723

1824
### Backend kernels
1925

20-
- **Metal**: TurboFlash attention kernel for `turbo3` KV-cache decode, `dk=512` FA kernel instances for Gemma 4, TQ weight ops, concurrency handling.
21-
- **CUDA**: `dp4a` kernel for `TQ4_1S` (3.5× faster, 240 t/s vs 68 baseline), warp-cooperative TQ dequant (16× less compute per block), multi-token + multi-GPU kernels, load-time `TQ4_1S → q8_0` conversion path.
22-
- **HIP / ROCm**: AMD RDNA3 (gfx1100), RDNA4, CDNA3 (MI300X/gfx942), CDNA4 (MI355X/gfx950). `ggml_cuda_dp4a` replaces `__dp4a` for portability; scalar half path for `TQ4_1S` on AMD; pool bypass for FA f16 temp buffers; VEC FA path forced for quantized KV.
23-
- **Vulkan**: compute-shader `turbo3` KV cache + coopmat flash attention.
26+
- **Metal**: TurboFlash attention kernel for `turbo3` KV-cache decode (currently OFF by default due to Apple10 corruption — re-enable via env knob once verified on your hardware); `dk=512` FA kernel instances for Gemma 4; V2.1 fused TQ weight kernels; Sparse V enabled across the family; concurrency handling. ([m5-max-stress-test](https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/m5-max-stress-test.md))
27+
- **CUDA**: turbo VEC flash attention (+9% decode); `dp4a` kernel for `TQ4_1S` (3.5× faster, 240 t/s vs 68 baseline); warp-cooperative TQ dequant (16× less compute per block); multi-token + multi-GPU kernels; load-time `TQ4_1S → q8_0` conversion path.
28+
- **HIP / ROCm**: AMD RDNA3 (gfx1100), RDNA4, CDNA3 (MI300X/gfx942), CDNA4 (MI355X/gfx950). `ggml_cuda_dp4a` replaces `__dp4a` for portability; scalar half path for `TQ4_1S` on AMD; pool bypass for FA f16 temp buffers; VEC FA path forced for quantized KV. ([cross-engine-mi300x](https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/cross-engine-mi300x.md))
29+
- **Vulkan**: compute-shader `turbo3` KV cache + coopmat flash attention; `SET_ROWS` support for `turbo2_0`/`turbo4_0`; `TQ4_1S` weight support.
2430

2531
### Flash Attention integration
2632

2733
- `TURBO2_0` added to the FA auto-enable check.
2834
- Mixed `f16/bf16 + q8_0` KV pairs work without requiring `GGML_CUDA_FA_ALL_QUANTS`.
2935
- CPU `vec_dot` heap-allocation fix for turbo / TQ types at `n > 4096`.
36+
- F16-K + TURBO-V dispatch cases on CUDA fattn.
3037

3138
### Model-family compatibility
3239

3340
- 256-expert MoE kernel instantiations (large MoE shape support).
3441
- Gemma 4 specifics: `dk512` Metal FA kernels, op concurrency, MoE token routing.
42+
- Speculative decoding cherry-picked for hybrid model architectures.
3543
- RPC `GGML_OP_COUNT` assertion fix.
3644

3745
### Memory + stability
3846

3947
- Apple Silicon memory-explosion fix.
40-
- Bench-suite for CUDA dp4a + Metal TurboFlash + Vulkan coopmat.
48+
- TurboFlash off-by-default on Apple10 (known corruption — see Metal section).
4149

4250
### Using TurboQuant types
4351

44-
`turbo3` / `turbo4` KV-cache quantization is selected at runtime via the existing `--cache-type-k` / `--cache-type-v` flags. `TQ3_1S` / `TQ4_1S` weight quantization is selected at conversion time by passing the type to `llama-quantize`.
52+
`turbo2` / `turbo3` / `turbo4` KV-cache quantization is selected at runtime via the existing `--cache-type-k` / `--cache-type-v` flags. `TQ3_1S` / `TQ4_1S` weight quantization is selected at conversion time by passing the type to `llama-quantize`.
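As a rough illustration of the two selection points described above, the Python helpers below assemble argument lists for a runtime launch and for a conversion run. Only `--cache-type-k`, `--cache-type-v`, `llama-quantize`, and the type names come from this README; the `llama-cli` binary name, the `-m` flag, the file paths, the asymmetric `q8_0`/`turbo3` pairing, and the exact type strings each tool accepts are assumptions to verify against your build.

```python
def kv_cache_args(model_path, k_type="q8_0", v_type="turbo3", binary="llama-cli"):
    """Build a runtime command line that selects KV-cache types.

    The K/V pairing is only an example of an asymmetric choice, not a
    documented default of the fork.
    """
    return [
        binary,
        "-m", model_path,
        "--cache-type-k", k_type,
        "--cache-type-v", v_type,
    ]


def quantize_args(src_gguf, dst_gguf, weight_type="TQ4_1S", binary="llama-quantize"):
    """Build a conversion command line: input file, output file, then the type."""
    return [binary, src_gguf, dst_gguf, weight_type]


# Hypothetical file names, for illustration only.
run_cmd = kv_cache_args("model-tq4_1s.gguf")
convert_cmd = quantize_args("model-f16.gguf", "model-tq4_1s.gguf")
```

Either list can be passed to `subprocess.run(cmd, check=True)` once the paths and type strings have been checked against the binaries you actually built.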
4553

4654
Where this fork diverges, it diverges additively — existing `q4_K`, `q8_0`, `f16` paths are untouched. Everything below this point is the upstream llama.cpp README.
4755

4856
### Fork-specific references
4957

5058
- Codec design + calibration + papers: [TheTom/turboquant_plus](https://github.com/TheTom/turboquant_plus)
51-
- Cross-backend bench results: [TheTom/tqkit](https://github.com/TheTom/tqkit)
5259

5360
---
5461
