
Commit becc846

spiritbuun and claude committed

docs: competitive analysis findings + experiments #69-84

Head-to-head benchmarks vs TheTom and Duster. Key finding: TBQ is accidentally 1-bit quantization with temperature scaling. 16 new experiment action items from analysis.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1 parent c440e67


benchmark-results.md

Test: wikitext-2-raw, 64 chunks @2K, 8 chunks @8K, 4 chunks @32K/64K

When K and V used different quant types, V always decoded with the compiled-in codebook regardless of the TURBO_TCQ_CB/CB2 env vars. This caused an encode/decode mismatch (PPL 9.65 instead of 6.63). Fix: added codebook loading to the V dequant branches with separate static bool guards.
## Head-to-Head Comparison: Ours vs TheTom vs Duster (2026-03-31)

Same model (Qwen3.5-27B Q6_K), same wikitext-2 test file, same server (RTX 3090).

- Ours: `/root/llama-tcq-clean` (master, TCQ codebooks compiled-in)
- TheTom: `github.com/TheTom/llama-cpp-turboquant` feature/turboquant-kv-cache @ 8ad0f00
- Duster: `github.com/dusterbloom/llama-cpp-turboquant-cuda` feature/turboquant-kv-cache @ a7a6d10
### 3-bit PPL (turbo3 K+V uniform)

| Impl | @2K (64ch) | @8K (8ch) | @32K (4ch) | @64K (4ch) |
|------|------------|-----------|------------|------------|
| **ours (TCQ)** | **6.507** | **6.883** | **7.005** | 7.053 |
| TheTom turbo3 | 6.548 | 6.934 | 7.089 | 7.114 |
| Duster turbo3 | 6.562 | 6.917 | 7.088 | 7.115 |
| Duster TBQ3 | 6.565 | 6.921 | 7.056 | **7.034** |

TCQ wins at 2K-32K. Duster's TBQ3 (SRHT + Lloyd-Max) overtakes at 64K (7.034 vs 7.053).
### 2-bit PPL (turbo2 K+V uniform)

| Impl | @2K (64ch) | @8K (8ch) | @32K (4ch) | @64K (4ch) |
|------|------------|-----------|------------|------------|
| **ours (TCQ)** | 6.742 | 7.266 | 7.294 | 7.484 |
| TheTom turbo2 | **6.739** | 7.386 | 7.478 | 7.652 |
| Duster turbo2 | 16.558 | 18.560 | 18.435 | 17.302 |
| **Duster TBQ2** | 6.798 | **7.233** | **7.186** | **7.332** |

Duster's turbo2 is broken (PPL 16-18). Duster's TBQ2 beats everyone at 8K+ context. Our TCQ with the best codebook (not tested here) reaches 6.708 @2K and 7.222 @64K, still behind TBQ2 at 32K. TheTom's turbo2 degrades the most at long context.
### 3-bit Speed (tok/s)

| Impl | pp=512 | pp=8K | pp=32K | decode |
|------|--------|-------|--------|--------|
| ours (TCQ) | 892 | 878 | 796 | 28.7 |
| **TheTom** | **1137** | **1109** | **989** | **30.8** |
| Duster turbo3 | 1131 | 1102 | 986 | 30.1 |
| Duster TBQ3 | FAIL | FAIL | FAIL | FAIL |

TheTom is 27% faster on prefill and 7% faster on decode; Duster's turbo3 matches it. Duster's TBQ3 fails in llama-bench (context creation error).
### 2-bit Speed (tok/s)

| Impl | pp=512 | pp=8K | pp=32K | decode |
|------|--------|-------|--------|--------|
| ours (TCQ) | 981 | 957 | 859 | 29.4 |
| TheTom turbo2 | **1151** | FAIL | FAIL | **30.4** |
| Duster turbo2 | 1135 | 1105 | 988 | 30.6 |
| Duster TBQ2 | FAIL | FAIL | FAIL | FAIL |

TheTom's turbo2 fails at 8K+ in llama-bench. Duster's turbo2 is the fastest (despite its broken PPL).
### Summary

**Quality**: Our TCQ leads at 3-bit across all contexts. At 2-bit, Duster's TBQ2 (SRHT + Lloyd-Max) beats everyone at 8K+ context, and its TBQ3 also overtakes at 64K.

**Speed**: We are 20-27% slower on prefill and ~7% slower on decode vs TheTom/Duster. This is the main gap to close.

**Compression**: All turbo3 implementations are 3.25 bpv; all turbo2 are 2.25 bpv. Duster's TBQ types may differ; we need to verify their exact bpv.
### Turbo4 PPL (4-bit, K+V uniform)

| Impl | @2K (64ch) | @8K (8ch) | @32K (4ch) | @64K (4ch) |
|------|------------|-----------|------------|------------|
| ours | 6.498 | 6.865 | 6.942 | 6.940 |
| TheTom turbo4 | 6.552 | 6.972 | 7.056 | 7.058 |
| Duster turbo4 | 6.498 | 6.865 | 6.942 | 6.940 |
| **Duster TBQ4** | **6.492** | **6.856** | **6.920** | **6.909** |

Ours and Duster's turbo4 are identical (same code). TheTom's is ~0.1 worse (missing inverse-FWHT prefill dequant?). Duster's TBQ4 is marginally the best everywhere.
### Turbo4 Speed (tok/s)

| Impl | pp=512 | pp=8K | pp=32K | decode |
|------|--------|-------|--------|--------|
| ours | 1135 | 1100 | 978 | 30.0 |
| TheTom | 1134 | 1107 | 986 | 30.7 |
| Duster turbo4 | 1135 | 1103 | 973 | 30.1 |

All three are identical: **the speed gap is TCQ-specific, not general**.
### Overall Competitive Assessment

**Quality rankings by bitrate:**
- 4-bit: Duster TBQ4 > ours = Duster turbo4 > TheTom (we're tied for 2nd)
- 3-bit: **ours (TCQ) wins at 2K-32K**, Duster TBQ3 wins at 64K
- 2-bit: Duster TBQ2 wins at 8K+, ours wins at 2K, TheTom is worst

**Speed gap is TCQ-only:**
- turbo3 TCQ: 20-27% slower prefill, ~7% slower decode (Viterbi encode overhead)
- turbo2 TCQ: similar pattern
- turbo4: identical speed across all implementations (no TCQ)

**Key insight:** TCQ's quality advantage comes at a speed cost. The Viterbi encode path is the bottleneck: turbo3/turbo4 without TCQ run at the same speed across all repos.

experiments.md

…for all turbo dequant kernels (turbo3, turbo4) in both prefill and decode paths.
**Concept**: Run the Viterbi trellis over 256 elements (full head_dim) instead of two independent 128-element groups. A longer trellis gives better coding gain. Only benefits head_dim=256 models.
**Priority**: Medium. A nice paper result showing TCQ scales with block length.
### 69. Temperature scaling — attention sharpening via norm inflation `ready`
**Source**: Competitive analysis 2026-03-31. Duster's TBQ accidentally inflates norms 2.77x, acting as an attention temperature of T = 1/2.77 ≈ 0.36. This sharpens attention and helps at long context.
**Concept**: Multiply `corrected_norm` by alpha in the TCQ encode kernel. Try alpha = 1.5, 2.0, 2.5, 2.77. Combines our 976x better MSE with TBQ's temperature benefit.
**Change**: One line in `turbo-quant-cuda.cuh`: scale the stored norm after correction (see the sketch below).
**Test**: PPL grid at 2K/8K/32K/64K for each alpha. If the optimal alpha differs by context length, consider making alpha a runtime parameter.
**Expected**: Beat TBQ at ALL context lengths. This is the single highest-impact experiment.
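A minimal sketch of the change, assuming the encode kernel computes a `corrected_norm` value just before storing it; `TCQ_NORM_ALPHA` and the variable names are illustrative, not the real identifiers in `turbo-quant-cuda.cuh`:

```cuda
// Hypothetical one-line change inside the TCQ encode kernel (names invented).
#define TCQ_NORM_ALPHA 2.0f   // grid: 1.5, 2.0, 2.5, 2.77

// ... existing norm correction above ...
const float stored_norm = corrected_norm * TCQ_NORM_ALPHA;
// Inflating stored K norms scales every Q.K logit by alpha, which acts as an
// effective softmax temperature T = 1/alpha (alpha = 2.77 -> T ~ 0.36).
```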
### 70. Asymmetric K/V norm scaling — raw norm for K, corrected for V `ready`
**Source**: Competitive analysis quality findings. K temperature helps attention routing; V accuracy helps output quality.
**Concept**: Remove norm correction for K only (store the raw L2 norm); keep the correction for V. K gets temperature scaling "for free" while V stays MSE-optimal.
**Change**: Conditional in the encode kernel based on the `iq_is_k` flag (already available in set-rows.cu).
**Test**: PPL at 2K/8K/32K/64K. Compare with symmetric alpha scaling (#69).
### 71. Native `vec_dot_fattn_vec_KQ_turbo3_tcq` — inline TCQ decode in FA `ready`
**Source**: Competitive analysis speed gap. Duster has `vec_dot_fattn_vec_KQ_tbq3_0` for TBQ; we dequant all KV to f16 first.
**Concept**: Read the 9-bit state from the bitstream, look up `codebook[state] * norm`, and accumulate the dot product inline in FA (sketched below). No intermediate f16 buffer needed.
**Change**: New function in fattn-common.cuh. Simpler than TBQ's vec_dot (no inverse Hadamard).
**Expected**: ~7% decode speedup and reduced VRAM at long context (eliminates the O(context) f16 buffer).
**Note**: This is #66 (fused attention+dequant) but scoped specifically to TCQ decode.
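A hedged sketch of the shape of that function. `get_state9()`, the row layout (per-row fp16 norm followed by the trellis bitstream), and the 512-entry codebook are assumptions for illustration, not the actual fattn-common.cuh API:

```cuda
// Illustrative decode-and-dot for one K row; all names are stand-ins.
__device__ float vec_dot_KQ_turbo3_tcq_sketch(
        const uint8_t * k_row, const float * Q, const int head_dim,
        const float * codebook /* 512 entries, one per 9-bit state */) {
    const float norm = __half2float(*(const half *) k_row);
    const uint8_t * bits = k_row + sizeof(half);

    float sum = 0.0f;
    for (int i = 0; i < head_dim; ++i) {
        const uint32_t state = get_state9(bits, i); // 9-bit trellis state of element i
        sum += codebook[state] * Q[i];              // dequant inline, no f16 buffer
    }
    return norm * sum; // norm factored out: one multiply per row, not per element
}
```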
### 72. Chunked cuBLAS GEMM prefill `ready`
**Source**: Competitive analysis speed gap. Duster's implementation: a 3-kernel pipeline (init, softmax-update, finalize) plus `cublasGemmStridedBatchedEx`.
**Concept**: Dequant 4096 KV tokens at a time to f16, then use cuBLAS for Q@K^T and P@V with online softmax between chunks (reference sketch below). NOT TCQ-specific: this works for all quant types.
**Change**: New prefill path in fattn.cu. Reference: Duster's fattn.cu lines 502-1005.
**Expected**: 20-27% prefill speedup. Enables 350K+ context on a single RTX 3090.
**Note**: Our per-chunk TCQ dequant is FASTER than TBQ's (no inverse Hadamard needed).
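As a reference for the math, here is a self-contained scalar CPU version of the online-softmax chunk loop (single query row). In the real path the two inner dot products become the per-chunk cuBLAS GEMMs on dequantized f16 buffers; everything below is our sketch, not Duster's code:

```cuda
#include <algorithm>
#include <cmath>
#include <vector>

// Online softmax over KV chunks: running max m, normalizer l, output O.
std::vector<float> chunked_attention(
        const std::vector<float> & q,              // [d]
        const std::vector<std::vector<float>> & K, // [n_kv][d]
        const std::vector<std::vector<float>> & V, // [n_kv][d]
        const size_t chunk = 4096) {
    const size_t d = q.size();
    std::vector<float> O(d, 0.0f);
    float m = -INFINITY, l = 0.0f;

    for (size_t c0 = 0; c0 < K.size(); c0 += chunk) {
        const size_t c1 = std::min(c0 + chunk, K.size());
        for (size_t t = c0; t < c1; ++t) {           // a GEMM over the chunk in CUDA
            float s = 0.0f;
            for (size_t i = 0; i < d; ++i) s += q[i] * K[t][i]; // Q @ K^T
            const float m_new = std::max(m, s);
            const float scale = std::exp(m - m_new); // rescale old accumulators
            const float p     = std::exp(s - m_new);
            for (size_t i = 0; i < d; ++i) O[i] = O[i] * scale + p * V[t][i]; // P @ V
            l = l * scale + p;
            m = m_new;
        }
    }
    for (size_t i = 0; i < d; ++i) O[i] /= l;        // finalize step
    return O;
}
```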
### 73. Parallelize TCQ encode thread-0 serial sections `ready`
**Source**: Competitive analysis encode speed. Thread 0 does the FWHT rotation, backtracking, and bitpacking alone.
**Concept**: The FWHT is 128 elements × 7 stages, so use 64+ threads in a butterfly pattern (sketched below; already done for #63, but only for non-TCQ). For bitpacking, each thread packs its own segment.
**Change**: turbo-quant-cuda.cuh TCQ encode kernel.
**Expected**: ~5% encode speedup.
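A sketch of the 128-point FWHT butterfly with 64 threads; `v` is assumed to be the shared-memory staging buffer the encode kernel already uses (name illustrative):

```cuda
// Each of the 7 stages pairs indices that differ in one bit; 64 threads each
// own one (i, j) butterfly per stage.
__device__ void fwht128_parallel(float * v /* shared memory, 128 floats */) {
    const int t = threadIdx.x;               // assumes >= 64 threads per block
    for (int h = 1; h < 128; h <<= 1) {      // 7 stages
        if (t < 64) {
            const int i = (t / h) * (2 * h) + (t % h);
            const int j = i + h;
            const float a = v[i];
            const float b = v[j];
            v[i] = a + b;
            v[j] = a - b;
        }
        __syncthreads();                     // stage barrier
    }
    // any 1/sqrt(128) normalization the kernel applies stays where it is
}
```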
### 74. TCQ error decorrelation via element permutation `needs-research`
**Source**: Competitive analysis quality findings. The TCQ trellis (right-shift, k=3) shares 6 of 9 state bits between consecutive positions, so errors are correlated (autocorrelation ~0.15-0.30 at lag 1). Correlated errors average out more slowly in Q@K dot products.
**Concept**: Apply a fixed permutation (e.g., bit-reversal, sketched below) to element indices after the FWHT, before trellis encoding. This decorrelates errors across the d_k dimension without changing MSE; decode applies the matching inverse permutation.
**Risk**: Medium. Needs careful verification that the permuted trellis still converges; it may also interact with codebook optimality.
**Test**: Measure lag-1 autocorrelation before/after; PPL at 2K-64K.
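For the bit-reversal example: 128 elements means 7 index bits, and bit-reversal is its own inverse, so the same function serves encode and decode. A sketch:

```cuda
// Bit-reversal of a 7-bit index (128-element block). Involution: applying it
// twice is the identity, so decode reuses the same mapping.
__device__ __host__ inline int bitrev7(int i) {
    int r = 0;
    for (int b = 0; b < 7; ++b) {
        r = (r << 1) | (i & 1);
        i >>= 1;
    }
    return r; // e.g. 1 -> 64: trellis neighbors land 64 apart in the block
}
// encode: feed the trellis x[bitrev7(pos)]; decode: write y[bitrev7(pos)].
```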
### 75. Lloyd-Max boundaries as TCQ initial state prior `needs-research`
**Source**: Duster's TBQ uses textbook-optimal N(0,1) Lloyd-Max centroids, which are MSE-optimal for scalar Gaussian quantization.
**Concept**: Use the Lloyd-Max bin boundaries to inform the TCQ Viterbi initial state distribution or the trellis path-metric initialization. The optimal scalar quantizer's boundaries partition the space in a way that might help Viterbi converge faster, or to better paths.
**Risk**: Speculative: the trellis structure already constrains state transitions heavily.
### 76. Optimal temperature grid search across context lengths `ready`
**Source**: Follow-up to #69. TBQ4 = 6.909, TBQ3 = 7.034, TBQ2 = 7.332 are monotonically ordered by temperature severity, suggesting the optimal alpha varies with bit-rate.
**Concept**: Sweep alpha from 1.0 to 3.0 in 0.25 steps for turbo2_tcq, turbo3_tcq, and turbo4. Measure PPL at 2K/8K/32K/64K for each. Find the Pareto-optimal alpha per bit-rate.
**Test**: 4 alphas × 3 quant types × 4 contexts = 48 benchmark runs; ~2 hrs on the server.
### 77. Verify turbo4 quality gap vs TBQ4 `ready`
**Source**: Competitive analysis: TBQ4 beats turbo4 by 0.01-0.03 PPL everywhere, even though turbo4 has no TCQ (it is plain scalar 4-bit quantization).
**Concept**: Investigate whether TBQ4's advantage comes from (a) Lloyd-Max centroids being better than uniform for post-FWHT data, (b) the temperature scaling effect, or (c) something else. Try replacing turbo4's centroids with Lloyd-Max N(0,1) 16-point centroids (generator sketched below).
**Test**: PPL comparison at 2K/8K/32K/64K.
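A self-contained generator for the Lloyd-Max N(0,1) 16-point centroids named above, using the standard fixed-point iteration (boundaries are centroid midpoints; centroids are the Gaussian conditional mean of each cell). Our sketch, initialized uniformly:

```cuda
#include <cmath>
#include <cstdio>

static double pdf(double x) { return std::exp(-0.5 * x * x) / std::sqrt(2.0 * 3.14159265358979323846); }
static double cdf(double x) { return 0.5 * std::erfc(-x / std::sqrt(2.0)); }

int main() {
    const int N = 16;
    double c[N], b[N + 1];
    for (int i = 0; i < N; ++i) c[i] = -3.0 + 6.0 * (i + 0.5) / N; // uniform init
    for (int it = 0; it < 200; ++it) {
        b[0] = -1e9; b[N] = 1e9;                       // open outer cells
        for (int i = 1; i < N; ++i) b[i] = 0.5 * (c[i - 1] + c[i]);
        for (int i = 0; i < N; ++i) {                  // E[x | b_i < x < b_{i+1}]
            c[i] = (pdf(b[i]) - pdf(b[i + 1])) / (cdf(b[i + 1]) - cdf(b[i]));
        }
    }
    for (int i = 0; i < N; ++i) printf("% .6f\n", c[i]);
    return 0;
}
```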
### 78. Measure TCQ error autocorrelation empirically `ready`
**Source**: Finding #29 from the competitive analysis. The theoretical prediction is a lag-1 autocorrelation of ~0.15-0.30, but it has never been measured.
**Concept**: Dump quantization errors from the TCQ encode and compute autocorrelation at lags 1-10 (estimator sketched below). Compare with scalar Lloyd-Max (should be ~0) to confirm the theoretical prediction.
**Change**: Add a dump mode to the encode kernel (env-var trigger) plus a Python analysis script.
**Test**: Correlate the measured autocorrelation with PPL degradation at long context.
### 79. TBQ-style encode as TCQ fallback for speed-critical path `needs-research`
**Source**: Duster's TBQ encode is fully parallel (one binary search per element, ~660 B shared memory) vs our Viterbi (128 sequential barrier-synced iterations, 34.5 KB shared memory).
**Concept**: Offer Lloyd-Max scalar quantization as a fast encode path (e.g., for streaming/real-time), with TCQ Viterbi as the quality path, selected by a runtime flag. Zero code-sharing issues: the two encoders write the same bitstream format if we match centroids.
**Risk**: Quality regression vs TCQ. Only useful if encode speed matters (batch inference, very long prompts).
### 80. Padding non-128 head_dim completion `ready`
**Source**: Infrastructure comparison: Duster has this done; ours is "in progress" (#58/#64).
**Status**: Code changes done; needs final build + test verification.
**Test**: Verify PPL on a head_dim=128 model (e.g., Llama-3) and a head_dim=256 model (Qwen3.5).
### 81. Sparse V dequant integration with TCQ `ready`
**Source**: #65 is planned but has not been tested with the TCQ path. TheTom showed +22.8% decode at 32K.
**Concept**: Skip V dequant + accumulation when all attention weights in a block are below a threshold (sketched after this item). Works with both the f16-dequant path and the future native vec_dot (#71).
**Change**: ~3 lines in fattn-vec.cuh. Purely additive.
**Expected**: +22.8% decode at 32K, more at longer context. Zero quality impact.
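The skip condition could look roughly like this inside the FA vec kernel's loop over KV blocks; `kq_max_block`, `kq_max_running`, and `SKIP_EPS` are illustrative names, not the actual fattn-vec.cuh variables:

```cuda
// Hypothetical block-skip guard (the ~3-line change the item describes).
// Softmax weights are exp(kq - kq_max_running); if even the largest logit in
// this block is negligible after the shift, skip V dequant + accumulation.
const float kq_max_block = warp_reduce_max(kq_val);
if (expf(kq_max_block - kq_max_running) < SKIP_EPS) {
    continue; // all weights in the block are below the threshold
}
```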
### 82. Replicate TBQ's exact 1-bit behavior for validation `needs-research`
**Source**: Bombshell finding: TBQ is accidentally 1-bit. 100% of post-FWHT values map to the 2 inner bins.
**Concept**: Implement explicit 1-bit quantization (just store sign + raw norm, sketched below) and verify it matches TBQ3's PPL exactly. If it does, that conclusively proves the temperature theory and means TBQ's 3-bit storage wastes 2 bits.
**Test**: PPL should match TBQ3 at all context lengths. If so, we can publish this finding.
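A minimal sketch of the explicit 1-bit encoder: one sign bit per post-FWHT element plus the raw (uninflated) norm per 128-element block. Layout and names are ours, for illustration:

```cuda
// Store 128 sign bits (16 bytes) + the raw fp16 L2 norm. Decode reconstructs
// every element as sign * norm / sqrt(128): equal magnitudes summing to the
// stored norm imply |x_i| = norm / sqrt(128).
__device__ void encode_1bit_block(const float * x /* 128 post-FWHT values */,
                                  uint8_t * out   /* 18 bytes */) {
    float ss = 0.0f;
    for (int i = 0; i < 128; ++i) ss += x[i] * x[i];

    for (int b = 0; b < 16; ++b) {
        uint8_t byte = 0;
        for (int j = 0; j < 8; ++j) {
            byte |= (uint8_t) (x[8 * b + j] >= 0.0f) << j; // sign bit per element
        }
        out[b] = byte;
    }
    const half norm = __float2half(sqrtf(ss)); // raw norm, no correction
    memcpy(out + 16, &norm, sizeof(half));
}
```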
### 83. Adaptive temperature per layer `needs-research`
**Source**: From the MSE-PPL divergence investigation: Q anisotropy varies wildly per layer (layer 19: effective rank 1.9; layer 20: rank 114). The optimal attention temperature likely varies per layer too.
**Concept**: Store a per-layer alpha in constant memory and apply different temperature scaling per layer, based on that layer's Q effective rank.
**Risk**: Requires knowing Q statistics at quantization time (not available in llama.cpp's online KV quantize). Could use precomputed per-model tables.
### 84. 350K+ context validation `planned`
**Source**: Duster's chunked cuBLAS GEMM enables 350K+ context on a single 3090; we currently OOM much earlier.
**Concept**: After implementing #72, benchmark at 128K/256K/350K. Verify PPL doesn't degrade; measure speed.
**Depends**: #72 (chunked cuBLAS GEMM prefill).
### DeltaKV (#44b) — inter-token residual compression `dropped`
**Paper**: arXiv:2602.08005 (Feb 2026). Learned MLP compressor, strided reference tokens, global L2 retrieval.
**Analysis**: Requires training (~8 GPU hours per model), learned projections (MLP weights per layer), and a full framework rewrite (Sparse-vLLM). Fundamentally incompatible with our fixed-codebook approach. The per-token reference lookup is O(S) per token, not feasible in a CUDA kernel during SET_ROWS. **Verdict: wrong paradigm for llama.cpp integration.**
