Head-to-head benchmarks vs TheTom and Duster. Key finding: TBQ is
accidentally 1-bit quantization with temperature scaling. 16 new
experiment action items from analysis.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

experiments.md (+91 lines)

**Concept**: Run Viterbi trellis over 256 elements (full head_dim) instead of two independent 128-element groups. Longer trellis = better coding gain. Only benefits head_dim=256 models.

**Priority**: Medium — nice paper result showing TCQ scales with block length.
### 69. Temperature scaling — attention sharpening via norm inflation `ready`

**Source**: Competitive analysis 2026-03-31. Duster's TBQ accidentally inflates norms 2.77x, acting as attention temperature T=0.36. This sharpens attention and helps at long context.

**Concept**: Multiply `corrected_norm` by alpha in TCQ encode kernel. Try alpha = 1.5, 2.0, 2.5, 2.77. Combines our 976x better MSE with TBQ's temperature benefit.

**Change**: One line in `turbo-quant-cuda.cuh` — scale the stored norm after correction.

**Test**: PPL grid at 2K/8K/32K/64K for each alpha. If optimal alpha differs by context length, consider making alpha a runtime parameter.

**Expected**: Beat TBQ at ALL context lengths. This is the single highest-impact experiment.
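
A minimal NumPy sketch (hypothetical shapes, illustrative data) of why inflating the stored K norm by `alpha` behaves like softmax temperature `1/alpha`:

```python
import numpy as np

d = 128                              # head_dim (assumed)
rng = np.random.default_rng(0)
q = rng.standard_normal(d)
K = rng.standard_normal((16, d))     # 16 cached key vectors (toy cache)

def attn_weights(q, K, alpha=1.0):
    # Scaling the dequantized K rows by alpha scales every logit by alpha,
    # which is equivalent to softmax temperature T = 1/alpha.
    logits = (K * alpha) @ q / np.sqrt(d)
    e = np.exp(logits - logits.max())
    return e / e.sum()

base  = attn_weights(q, K, alpha=1.0)
sharp = attn_weights(q, K, alpha=2.77)   # TBQ's accidental norm inflation (T ~ 0.36)
print(base.max(), sharp.max())           # same ranking, mass concentrates on top keys
```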
### 70. Asymmetric K/V norm scaling — raw norm for K, corrected for V `ready`

**Source**: Competitive analysis quality findings. K temperature helps attention routing, V accuracy helps output quality.

**Concept**: Remove norm correction for K only (store raw L2 norm), keep correction for V. K gets temperature scaling "for free", V stays MSE-optimal.

**Change**: Conditional in encode kernel based on `iq_is_k` flag (already available in set-rows.cu).

**Test**: PPL at 2K/8K/32K/64K. Compare with symmetric alpha scaling (#69).
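
A hedged toy sketch (Python) of the asymmetric policy only — the correction factor and function name are placeholders, not the real encoder:

```python
import numpy as np

def stored_norm(x, is_k):
    """Toy model of the norm bookkeeping only (not the real TCQ bitstream).

    is_k=True  -> store the raw L2 norm (K gets the temperature effect "for free").
    is_k=False -> store the corrected, MSE-optimal norm for V.
    The 0.95 correction factor is purely illustrative; the real value comes
    from the encoder's norm-correction step.
    """
    raw = float(np.linalg.norm(x))
    corrected = raw * 0.95
    return raw if is_k else corrected

x = np.random.randn(128)
print(stored_norm(x, is_k=True), stored_norm(x, is_k=False))
```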
### 71. Native `vec_dot_fattn_vec_KQ_turbo3_tcq` — inline TCQ decode in FA `ready`

**Source**: Competitive analysis speed gap. Duster has `vec_dot_fattn_vec_KQ_tbq3_0` for TBQ. We dequant all KV to f16 first.

**Concept**: Read 9-bit state from bitstream → `codebook[state] * norm` → accumulate dot product inline in FA. No intermediate f16 buffer needed.

**Change**: New function in fattn-common.cuh. Simpler than TBQ's vec_dot (no inverse Hadamard).

**Expected**: ~7% decode speedup, reduced VRAM at long context (eliminates O(context) f16 buffer).

**Note**: This is #66 (fused attention+dequant) but specifically scoped to TCQ decode only.
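
A rough NumPy model (hypothetical codebook values and bit layout) of the decode-and-dot idea — look up the codebook value per trellis state, scale by the block norm, and accumulate, never materializing an f16 row:

```python
import numpy as np

# 2^9 = 512 states; values here are random stand-ins for the real codebook.
CODEBOOK = np.random.randn(512).astype(np.float32)

def kq_dot_inline(states, norm, q):
    """states: per-element 9-bit trellis states for one K row (already unpacked
    here for simplicity; the real kernel reads them from the packed bitstream).
    norm: stored block norm. q: query vector."""
    acc = 0.0
    for s, qv in zip(states, q):
        k_elem = CODEBOOK[s] * norm   # dequantize one K element on the fly
        acc += k_elem * qv            # accumulate the K·Q dot product
    return acc

d = 128
states = np.random.randint(0, 512, size=d)
print(kq_dot_inline(states, norm=1.7, q=np.random.randn(d)))
```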
### 72. Chunked cuBLAS GEMM prefill `ready`

**Concept**: Dequant 4096 KV tokens at a time to f16, use cuBLAS for Q@K^T and P@V with online softmax between chunks. NOT TCQ-specific — works for all quant types.

**Change**: New prefill path in fattn.cu. Reference: Duster's fattn.cu lines 502-1005.

**Expected**: 20-27% prefill speedup. Enables 350K+ context on single RTX 3090.

**Note**: Our TCQ dequant per chunk is FASTER than TBQ's (no inverse Hadamard needed).
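
An illustrative NumPy sketch (not the CUDA path) of chunked attention with an online softmax between chunks, which is the numerical core of the proposal:

```python
import numpy as np

def chunked_attention(q, K, V, chunk=4096):
    """q: (d,), K/V: (S, d). Processes KV chunk by chunk, carrying running
    softmax statistics so the result matches full attention without ever
    holding all logits at once."""
    d = q.shape[0]
    m = -np.inf          # running max logit
    l = 0.0              # running sum of exp(logit - m)
    acc = np.zeros(d)    # running weighted sum of V
    for start in range(0, K.shape[0], chunk):
        Kc, Vc = K[start:start + chunk], V[start:start + chunk]  # "dequantized" chunk
        s = Kc @ q / np.sqrt(d)                                  # Q@K^T for this chunk
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ Vc                               # P@V for this chunk
        m = m_new
    return acc / l

S, d = 10000, 128
q, K, V = np.random.randn(d), np.random.randn(S, d), np.random.randn(S, d)
logits = K @ q / np.sqrt(d)
ref = (np.exp(logits - logits.max()) / np.exp(logits - logits.max()).sum()) @ V
print(np.allclose(chunked_attention(q, K, V), ref))   # True
```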
### 73. Parallelize TCQ encode thread-0 serial sections `ready`

**Concept**: FWHT is 128 elements × 7 stages — use 64+ threads (butterfly pattern, already done for #63 but only for non-TCQ). Bitpacking: each thread packs its own segment.
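
A reference NumPy sketch of the 7-stage in-place FWHT butterfly; every (a, b) pair within a stage is independent, which is what makes the 64-thread parallelization straightforward:

```python
import numpy as np

def fwht_inplace(x):
    """In-place fast Walsh-Hadamard transform; length must be a power of two.
    For 128 elements there are 7 stages; within a stage every (a, b) pair is
    independent, so a CUDA block can assign one pair per thread per stage."""
    n = len(x)
    h = 1
    while h < n:
        for start in range(0, n, h * 2):
            for i in range(start, start + h):
                a, b = x[i], x[i + h]
                x[i], x[i + h] = a + b, a - b
        h *= 2
    return x

x = np.random.randn(128)
y = fwht_inplace(x.copy())
# Applying the transform twice recovers the input up to a factor of n.
print(np.allclose(fwht_inplace(y.copy()) / 128, x))
```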
### 74. TCQ error decorrelation via element permutation `needs-research`

**Source**: Competitive analysis quality findings. TCQ trellis (right-shift, k=3) shares 6/9 state bits between consecutive positions → correlated errors. Autocorrelation ~0.15-0.30 at lag 1. Correlated errors average out more slowly in Q@K dot products.

**Concept**: Apply fixed permutation (e.g., bit-reversal) to element indices after FWHT, before trellis encoding. Decorrelates errors across the d_k dimension without changing MSE. Matching inverse permutation in decode.

**Risk**: Medium — needs careful verification that permuted trellis still converges. May interact with codebook optimality.

**Test**: Measure lag-1 autocorrelation before/after, PPL at 2K-64K.
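
A small Python sketch of a bit-reversal permutation over 128 indices and its inverse — the kind of fixed reordering proposed here; the exact permutation choice is open:

```python
import numpy as np

def bit_reversal_perm(n_bits):
    """Permutation mapping index i to the integer formed by reversing its bits.
    For 128 elements, n_bits = 7."""
    n = 1 << n_bits
    return np.array([int(f"{i:0{n_bits}b}"[::-1], 2) for i in range(n)])

perm = bit_reversal_perm(7)
x = np.random.randn(128)

encoded_order = x[perm]                       # reorder after FWHT, before the trellis
decoded = encoded_order[np.argsort(perm)]     # matching inverse permutation on decode
print(np.allclose(decoded, x))                # True — lossless reordering, MSE unchanged
```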
### 75. Lloyd-Max boundaries as TCQ initial state prior `needs-research`

**Source**: Duster's TBQ uses textbook-optimal N(0,1) Lloyd-Max centroids. These are MSE-optimal for scalar Gaussian quantization.

**Concept**: Use Lloyd-Max bin boundaries to inform TCQ Viterbi initial state distribution or trellis path metric initialization. The optimal scalar quantizer boundaries partition the space in a way that might help Viterbi converge faster or to better paths.

**Risk**: Speculative — trellis structure already constrains state transitions heavily.
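
For reference, a short Lloyd-Max iteration for an N(0,1) source (Python, sampled rather than closed-form), producing the kind of centroids and boundaries that TBQ uses and that this experiment would reuse as a prior:

```python
import numpy as np

def lloyd_max_gaussian(levels=8, iters=100, n_samples=1_000_000, seed=0):
    """Approximate Lloyd-Max centroids and decision boundaries for N(0,1)
    via Lloyd's algorithm on a large sample."""
    rng = np.random.default_rng(seed)
    x = np.sort(rng.standard_normal(n_samples))
    centroids = np.linspace(-2, 2, levels)                  # initial guess
    for _ in range(iters):
        boundaries = (centroids[:-1] + centroids[1:]) / 2   # nearest-neighbor bins
        bins = np.searchsorted(boundaries, x)
        centroids = np.array([x[bins == k].mean() for k in range(levels)])
    return centroids, boundaries

centroids, boundaries = lloyd_max_gaussian(levels=8)
print(np.round(centroids, 3))   # close to the textbook 3-bit Gaussian Lloyd-Max centroids
```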
### 76. Optimal temperature grid search across context lengths `ready`

**Source**: Follow-up to #69. TBQ4=6.909, TBQ3=7.034, TBQ2=7.332 are monotonically ordered by temperature severity, suggesting optimal alpha varies with bit-rate.

**Concept**: Sweep alpha from 1.0 to 3.0 in 0.25 steps for turbo2_tcq, turbo3_tcq, and turbo4. Measure PPL at 2K/8K/32K/64K for each. Find Pareto-optimal alpha per bit-rate.
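
A sketch of the sweep driver (Python), assuming a local build where the norm-scaling alpha is read from an environment variable — `TCQ_ALPHA` is hypothetical, not an existing llama.cpp option, and the model/dataset paths are placeholders:

```python
import itertools
import os
import subprocess

ALPHAS   = [1.0 + 0.25 * i for i in range(9)]        # 1.0 .. 3.0
CONTEXTS = [2048, 8192, 32768, 65536]
TYPES    = ["turbo2_tcq", "turbo3_tcq", "turbo4"]    # this fork's KV cache types

for qtype, alpha, ctx in itertools.product(TYPES, ALPHAS, CONTEXTS):
    env = dict(os.environ, TCQ_ALPHA=str(alpha))     # hypothetical runtime knob
    cmd = ["./llama-perplexity", "-m", "model.gguf", "-f", "wiki.test.raw",
           "-c", str(ctx), "--cache-type-k", qtype, "--cache-type-v", qtype]
    print(" ".join(cmd), f"alpha={alpha}")
    # subprocess.run(cmd, env=env, check=True)       # uncomment to actually run
```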
### 77. Why TBQ4 beats turbo4 `needs-research`

**Source**: Competitive analysis: TBQ4 beats turbo4 by 0.01-0.03 PPL everywhere. turbo4 has no TCQ — just scalar 4-bit quantization.

**Concept**: Investigate whether TBQ4's advantage comes from (a) Lloyd-Max centroids being better than uniform for post-FWHT data, (b) the temperature scaling effect, or (c) something else. Try replacing turbo4 centroids with Lloyd-Max N(0,1) 16-point centroids.
### 78. Measure TCQ error autocorrelation empirically `ready`

**Source**: Finding #29 from competitive analysis. Theoretical prediction of lag-1 autocorrelation ~0.15-0.30, but never measured.

**Concept**: Dump quantization errors from TCQ encode, compute autocorrelation at lags 1-10. Compare with scalar Lloyd-Max (should be ~0). Confirm the theoretical prediction.

**Change**: Add dump mode to encode kernel (env var trigger), Python analysis script.

**Test**: Correlate measured autocorrelation with PPL degradation at long context.
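
The analysis half is straightforward; a sketch of the Python script, assuming the dump is a flat float32 file of per-element quantization errors in encode order (file name and layout are placeholders):

```python
import numpy as np

def autocorr(errors, max_lag=10):
    """Lag-k autocorrelation of a 1-D error sequence for k = 1..max_lag."""
    e = errors - errors.mean()
    denom = np.dot(e, e)
    return np.array([np.dot(e[:-k], e[k:]) / denom for k in range(1, max_lag + 1)])

# Placeholder dump: one float32 error per encoded element, in encode order.
errors = np.fromfile("tcq_errors.f32", dtype=np.float32)

# Reshape to (blocks, 128) so correlation is measured within a trellis block,
# not across block boundaries.
blocks = errors[: len(errors) // 128 * 128].reshape(-1, 128)
lag1 = np.array([autocorr(b, max_lag=1)[0] for b in blocks])
print("mean lag-1 autocorrelation:", lag1.mean())
```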
### 79. TBQ-style encode as TCQ fallback for speed-critical path `needs-research`

**Source**: Duster's TBQ encode is fully parallel (one binary search per element, ~660B shared mem) vs our Viterbi (128 sequential barrier-synced iterations, 34.5KB shared mem).

**Concept**: Offer Lloyd-Max scalar quantize as a fast encode path (e.g., for streaming/real-time), with TCQ Viterbi as the quality path. Runtime flag to select. Zero code-sharing issues — the two encoders write the same bitstream format if we match centroids.

**Risk**: Quality regression vs TCQ. Only useful if encode speed matters (batch inference, very long prompts).
### 80. head_dim=256 support — finish and verify `ready`

**Source**: Infrastructure comparison — Duster has it done, ours is "in progress" (#58/#64).

**Status**: Code changes done, needs final build+test verification.

**Test**: Verify PPL on a head_dim=128 model (e.g., Llama-3) and a head_dim=256 model (Qwen3.5).
### 81. Sparse V dequant integration with TCQ `ready`

**Source**: #65 planned but not tested with TCQ path. TheTom showed +22.8% decode at 32K.

**Concept**: Skip V dequant+accumulation when all attention weights in a block < threshold. Works with both f16-dequant path and future native vec_dot (#71).

**Change**: ~3 lines in fattn-vec.cuh. Purely additive.

**Expected**: +22.8% decode at 32K, more at longer context. Zero quality impact.
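
A toy NumPy illustration of the skip rule — if every attention weight in a V block falls below a threshold, that block's dequant and accumulation are skipped entirely (block size and threshold values are illustrative):

```python
import numpy as np

def sparse_pv(weights, V, block=32, threshold=1e-4):
    """weights: (S,) softmaxed attention weights; V: (S, d).
    Accumulates P@V block by block, skipping blocks whose weights are all tiny."""
    out = np.zeros(V.shape[1])
    skipped = 0
    for start in range(0, len(weights), block):
        w = weights[start:start + block]
        if w.max() < threshold:                 # whole block negligible -> skip V dequant
            skipped += 1
            continue
        out += w @ V[start:start + block]
    return out, skipped

S, d = 32768, 128
w = np.random.dirichlet(np.full(S, 0.05))       # peaked attention distribution
V = np.random.randn(S, d)
approx, skipped = sparse_pv(w, V)
print(skipped, np.abs(approx - w @ V).max())    # many skipped blocks, negligible error
```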
### 82. Replicate TBQ's exact 1-bit behavior for validation `needs-research`

**Source**: Bombshell finding — TBQ is accidentally 1-bit. 100% of post-FWHT values map to the 2 inner bins.

**Concept**: Implement explicit 1-bit quantization (just store sign + raw norm) and verify it matches TBQ3 PPL exactly. If it does, this conclusively proves the temperature theory and means TBQ's 3-bit storage wastes 2 bits.

**Test**: PPL should match TBQ3 at all context lengths. If so, we can publish this finding.
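
The explicit 1-bit reference is simple enough to prototype offline; a NumPy sketch (per-block sign + raw L2 norm, 128-element block assumed):

```python
import numpy as np

def onebit_encode(x):
    """Store only the sign of each element plus the raw block L2 norm."""
    signs = np.sign(x).astype(np.int8)
    signs[signs == 0] = 1
    return signs, float(np.linalg.norm(x))

def onebit_decode(signs, norm):
    # Every reconstructed element gets magnitude norm / sqrt(d), so the block
    # norm is preserved exactly — only per-element shape information is lost.
    d = len(signs)
    return signs * (norm / np.sqrt(d))

x = np.random.randn(128)
signs, norm = onebit_encode(x)
x_hat = onebit_decode(signs, norm)
print(np.linalg.norm(x_hat), norm)             # identical norms
print(np.mean(np.sign(x_hat) == np.sign(x)))   # 1.0 — all signs preserved
```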
### 83. Adaptive temperature per layer `needs-research`

**Source**: From MSE-PPL divergence investigation — Q anisotropy varies wildly per layer (layer 19: rank 1.9, layer 20: rank 114). Optimal attention temperature likely varies per layer too.

**Concept**: Store per-layer alpha in constant memory. Use different temperature scaling per layer based on that layer's Q effective rank.

**Risk**: Requires knowing Q statistics at quantization time (not available in llama.cpp's online KV quantize). Could use precomputed per-model tables.
### 84. 350K+ context validation `planned`

**Source**: Duster's chunked cuBLAS GEMM enables 350K+ on single 3090. Currently we OOM much earlier.

**Concept**: After implementing #72, benchmark at 128K/256K/350K. Verify PPL doesn't degrade, measure speed.

**Analysis**: Requires training (~8 GPU hours per model), learned projections (MLP weights per layer), and a full framework rewrite (Sparse-vLLM). Fundamentally incompatible with our fixed-codebook approach. The per-token reference lookup is O(S) per token, not feasible in a CUDA kernel during SET_ROWS. **Verdict: wrong paradigm for llama.cpp integration.**