
Commit 08c04a6

spiritbuun and claude committed

merge: experiment TheTom#69 — TCQ temperature scaling alpha=1.20

5-14% PPL improvement at ALL context lengths for both 3-bit and 2-bit TCQ. Multiplies the stored norm by 1.2 to sharpen attention logits. Beats every competitor at every context length at both bit rates. Override via the TURBO_TCQ_ALPHA env var.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2 parents becc846 + 4f89b65 commit 08c04a6

4 files changed

Lines changed: 119 additions & 6 deletions


benchmark-results.md

Lines changed: 87 additions & 0 deletions
@@ -1718,3 +1718,90 @@ All three identical — **speed gap is TCQ-specific, not general**.

**Key insight:** TCQ's quality advantage comes at a speed cost. The Viterbi encode path is the bottleneck — turbo3/turbo4 without TCQ run at the same speed across all repos.

---

## Experiment #69: Temperature Scaling (2026-03-31)

### Preliminary sweep — turbo3_tcq, compiled-in codebook (NOT best)

| Alpha | @2K (8ch) | @8K (8ch) | @32K (4ch) | @64K (4ch) |
|-------|-----------|-----------|------------|------------|
| 1.00 (baseline) | 5.824 | 6.883 | 7.005 | 7.053 |
| 1.10 | 5.582 | 6.371 | 6.595 | 6.396 |
| **1.25** | **5.528** | **6.219** | **6.541** | **6.178** |
| 1.50 | 5.875 | 6.801 | 7.565 | 7.205 |
| 1.75 | 6.880 | 8.835 | 10.745 | — |

Sweet spot: alpha 1.10-1.25. Best universal: ~1.25, which is best at every context length in this sweep (edging out 1.10 at 2K by only 0.05).
Alpha=1.25 @64K: **6.178 vs 7.053 baseline = 12.4% PPL reduction**.
Alpha=1.25 @2K: **5.528 vs 5.824 = 5.1% PPL reduction**.
Note: these results use the compiled-in codebook, NOT the best one.

### Full sweep — turbo3_tcq, cb_50iter_finetuned.bin codebook

| Alpha | @2K (8ch) | @8K (8ch) | @32K (4ch) | @64K (4ch) |
|-------|-----------|-----------|------------|------------|
| 1.00 (baseline) | 5.912 | 7.001 | 7.071 | 7.034 |
| 1.05 | 5.762 | 6.675 | 6.807 | 6.665 |
| 1.10 | 5.687 | 6.484 | 6.647 | 6.448 |
| 1.15 | 5.623 | 6.351 | 6.572 | 6.290 |
| **1.20** | **5.567** | **6.289** | **6.574** | **6.224** |
| 1.25 | 5.624 | 6.271 | 6.619 | 6.274 |
| 1.30 | 5.590 | 6.315 | 6.698 | 6.329 |

Optimal alpha for 3-bit (50iter codebook): **1.15-1.20**
- alpha=1.20 best at 2K and 64K
- alpha=1.25 best at 8K (marginal: 6.271 vs 6.289)
- alpha=1.15 and 1.20 essentially tied at 32K (6.572 vs 6.574)

**Universal improvement**: alpha=1.20 cuts PPL by 5.8% @2K, 10.2% @8K, 7.0% @32K, 11.5% @64K.
No context length where alpha=1.0 is better. This is a pure win.
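For reference, the improvement percentages in this section are relative PPL reduction versus the alpha=1.00 row of the same table (verifiable from the numbers above):

$$\Delta = \frac{\mathrm{PPL}_{1.00} - \mathrm{PPL}_{\alpha}}{\mathrm{PPL}_{1.00}}, \qquad \text{e.g. } \Delta_{@64\mathrm{K}} = \frac{7.034 - 6.224}{7.034} \approx 11.5\%.$$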
Note: the cb_50iter_finetuned baseline (5.912 @2K) is worse than the compiled-in numpy codebook (5.824 @2K).
Compiled-in numpy + alpha=1.25 gave 5.528 @2K in the preliminary sweep — even better.
A fine sweep on the compiled-in numpy codebook is still needed to find the true global optimum.

### Full sweep — turbo2_tcq, tcq_2bit_100iter_s99.bin codebook

| Alpha | @2K (8ch) | @8K (8ch) | @32K (4ch) | @64K (4ch) |
|-------|-----------|-----------|------------|------------|
| 1.00 (baseline) | 6.042 | 7.135 | 7.205 | 7.222 |
| 1.05 | 5.804 | 6.793 | 6.937 | 6.779 |
| 1.10 | 5.800 | 6.570 | 6.717 | 6.488 |
| **1.15** | **5.607** | 6.412 | 6.616 | 6.337 |
| **1.20** | 5.619 | 6.387 | **6.615** | **6.248** |
| **1.25** | 5.611 | **6.345** | 6.635 | 6.250 |
| 1.30 | 5.640 | 6.380 | 6.697 | 6.311 |
| 1.50 | 6.004 | 6.970 | 7.601 | 7.206 |

Optimal alpha for 2-bit: **1.15-1.25** (same range as 3-bit)
- alpha=1.15 marginally best at 2K (5.607 vs 5.611 for alpha=1.25)
- alpha=1.25 best at 8K
- alpha=1.20 best at 32K and 64K (by tiny margins: 6.615 vs 6.616, 6.248 vs 6.250)
- alpha=1.50 already degrades past baseline at 32K

**Universal improvement**: alpha=1.20 cuts PPL by 7.0% @2K, 10.5% @8K, 8.2% @32K, 13.5% @64K.

### Summary — Temperature Scaling Experiment #69

**CONFIRMED: Temperature scaling is a massive universal improvement for TCQ.**
- Optimal alpha: 1.15-1.25 for both 3-bit and 2-bit TCQ
- Default recommendation: alpha=1.20 (best overall)
- Improvement is 5-14% PPL reduction across ALL context lengths
- No regression at any context length — pure win
- Improvement grows with context length (larger at 64K than at 2K)
- Same optimal alpha range regardless of codebook choice or bit rate
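Mechanically, this is ordinary softmax temperature. A sketch, assuming the inflated stored norm simply rescales the dequantized K rows (as of this commit the same scale also hits V rows; experiment #70 proposes splitting them): with attention logits $\ell_i = q \cdot k_i / \sqrt{d}$, storing $\alpha \lVert k \rVert$ reconstructs $\alpha k_i$, so

$$\operatorname{softmax}_i(\alpha\,\ell_i) = \operatorname{softmax}_i(\ell_i / T), \qquad T = 1/\alpha.$$

Alpha=1.20 is a mild T ≈ 0.83; Duster's accidental 2.77x inflation corresponds to T = 1/2.77 ≈ 0.36, the figure quoted in experiments.md below.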
Comparison vs competitors at alpha=1.20:

| Config | @2K | @8K | @32K | @64K |
|--------|-----|-----|------|------|
| **Ours t3_tcq α=1.20** | **5.567** | **6.289** | **6.574** | **6.224** |
| Duster TBQ3 | 6.565 | 6.921 | 7.056 | 7.034 |
| TheTom turbo3 | 6.548 | 6.934 | 7.089 | 7.114 |
| **Ours t2_tcq α=1.20** | **5.619** | **6.387** | **6.615** | **6.248** |
| Duster TBQ2 | 6.798 | 7.233 | 7.186 | 7.332 |

Note: our numbers use the 50iter_finetuned (3-bit) and 100iter (2-bit) codebooks.
The compiled-in numpy codebook may be even better (the preliminary sweep showed 5.528 @2K vs 5.567).
**We now CRUSH every competitor at every context length at both bit rates.**

experiments.md

Lines changed: 6 additions & 6 deletions
@@ -791,12 +791,12 @@ for all turbo dequant kernels (turbo3, turbo4) in both prefill and decode paths.
 **Concept**: Run Viterbi trellis over 256 elements (full head_dim) instead of two independent 128-element groups. Longer trellis = better coding gain. Only benefits head_dim=256 models.
 **Priority**: Medium — nice paper result showing TCQ scales with block length.
-### 69. Temperature scaling — attention sharpening via norm inflation `ready`
-**Source**: Competitive analysis 2026-03-31. Duster's TBQ accidentally inflates norms 2.77x, acting as attention temperature T=0.36. This sharpens attention and helps at long context.
-**Concept**: Multiply `corrected_norm` by alpha in TCQ encode kernel. Try alpha = 1.5, 2.0, 2.5, 2.77. Combines our 976x better MSE with TBQ's temperature benefit.
-**Change**: One line in `turbo-quant-cuda.cuh` — scale the stored norm after correction.
-**Test**: PPL grid at 2K/8K/32K/64K for each alpha. If optimal alpha differs by context length, consider making alpha a runtime parameter.
-**Expected**: Beat TBQ at ALL context lengths. This is the single highest-impact experiment.
+### 69. Temperature scaling — attention sharpening via norm inflation `done`
+**Source**: Competitive analysis 2026-03-31. Duster's TBQ accidentally inflates norms 2.77x, acting as attention temperature T=0.36.
+**Result**: **MASSIVE WIN.** Alpha=1.20 optimal for both 3-bit and 2-bit TCQ. 5-14% PPL improvement at ALL context lengths. No regression anywhere. We now beat every competitor at every context length at both bit rates.
+**Implementation**: `d_tcq_norm_alpha` constant in turbo-quant-cuda.cuh, loaded from the `TURBO_TCQ_ALPHA` env var in set-rows.cu. Applied to both the 3-bit and 2-bit encode kernels.
+**Key numbers**: 3-bit @64K: 6.224 (was 7.034; TBQ3 is 7.034). 2-bit @64K: 6.248 (was 7.222; TBQ2 is 7.332).
+**Default**: Hard-code alpha=1.20 for shipping. Keep the env var for experimentation.
 ### 70. Asymmetric K/V norm scaling — raw norm for K, corrected for V `ready`
 **Source**: Competitive analysis quality findings. K temperature helps attention routing, V accuracy helps output quality.
ggml/src/ggml-cuda/set-rows.cu

Lines changed: 20 additions & 0 deletions
@@ -2,6 +2,24 @@
 #include "cpy-utils.cuh"
 #include "turbo-quant-cuda.cuh"
 #include <cstring>
+#include <cerrno>
+
+static void load_tcq_norm_alpha() {
+    static bool loaded = false;
+    if (loaded) return;
+    loaded = true;
+    const char *s = getenv("TURBO_TCQ_ALPHA");
+    if (!s) return;
+    char *end;
+    errno = 0;
+    float alpha = strtof(s, &end);
+    if (end == s || errno != 0 || alpha <= 0.0f || alpha >= 10.0f) {
+        fprintf(stderr, "TCQ: invalid TURBO_TCQ_ALPHA='%s'\n", s);
+        return;
+    }
+    cudaMemcpyToSymbol(d_tcq_norm_alpha, &alpha, sizeof(float));
+    fprintf(stderr, "TCQ: norm alpha = %.3f\n", alpha);
+}
 
 typedef void (*set_rows_kernel_t)(const char * src, char * dst);
@@ -373,6 +391,7 @@ static void set_rows_cuda(ggml_backend_cuda_context & ctx, const ggml_tensor * s
             fprintf(stderr, "TCQ encode: FAILED to load codebook from %s\n", cb_path);
         }
     }
+    load_tcq_norm_alpha();
 }
 // TCQ Viterbi encode: 512 threads per block (one per trellis state)
 const int64_t s01_f = nb01/sizeof(float); const int64_t s02_f = nb02/sizeof(float); const int64_t s03_f = nb03/sizeof(float);
@@ -410,6 +429,7 @@ static void set_rows_cuda(ggml_backend_cuda_context & ctx, const ggml_tensor * s
             fprintf(stderr, "TCQ2 encode: FAILED to load codebook from %s\n", cb_path);
         }
     }
+    load_tcq_norm_alpha();
 }
 // 2-bit TCQ Viterbi encode: 256 threads per block (one per trellis state, L=8)
 const int64_t s01_f = nb01/sizeof(float); const int64_t s02_f = nb02/sizeof(float); const int64_t s03_f = nb03/sizeof(float);
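The loader runs once (static `loaded` flag), accepts only plausible values (0 < alpha < 10), and otherwise leaves the compiled-in 1.2f default in place. Below is a minimal host-only sketch of the same validation rules for desk-testing them; it is not part of the commit, and `parse_alpha` is a hypothetical stand-in for the loader's parsing logic:

```cpp
#include <cerrno>
#include <cstdio>
#include <cstdlib>

// Mirrors load_tcq_norm_alpha()'s checks: returns the parsed alpha,
// or `fallback` when the input is absent, non-numeric, or out of range.
static float parse_alpha(const char *s, float fallback) {
    if (!s) return fallback;        // env var unset -> compiled-in default
    char *end;
    errno = 0;
    float alpha = strtof(s, &end);
    if (end == s || errno != 0 || alpha <= 0.0f || alpha >= 10.0f) {
        return fallback;            // rejected, same condition as the loader
    }
    return alpha;
}

int main() {
    printf("%.3f\n", parse_alpha("1.25",   1.2f)); // 1.250  accepted
    printf("%.3f\n", parse_alpha("banana", 1.2f)); // 1.200  rejected: no digits parsed
    printf("%.3f\n", parse_alpha("12.0",   1.2f)); // 1.200  rejected: >= 10
    printf("%.3f\n", parse_alpha(nullptr,  1.2f)); // 1.200  unset
    return 0;
}
```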

ggml/src/ggml-cuda/turbo-quant-cuda.cuh

Lines changed: 6 additions & 0 deletions
@@ -613,6 +613,10 @@ static __constant__ float d_turbo3_tcq_codebook[512] = {
     -0.16474872f, -0.09278035f, -0.04699890f, -0.00779894f, +0.03187623f, +0.07828258f, +0.13561429f, +0.23917313f
 };
 
+// Temperature scaling for TCQ norm. alpha > 1 sharpens attention (helps long context).
+// Override via TURBO_TCQ_ALPHA env var.
+static __constant__ float d_tcq_norm_alpha = 1.2f;
+
 // TCQ SET_ROWS encode: Viterbi optimal path with right-shift trellis
 // 512 threads per block (one per trellis state), one block per 128-element group
 // Backtrace stored in shared memory (32KB, 4-bit packed)
@@ -771,6 +775,7 @@ static __global__ void __launch_bounds__(512, 1) k_set_rows_turbo3_tcq(
     }
     float recon_norm = sqrtf(recon_norm_sq);
     float corrected_norm = (recon_norm > 1e-10f) ? saved_norm / recon_norm : saved_norm;
+    corrected_norm *= d_tcq_norm_alpha;
 
     // Pack bitstream: [6 prefix bits] [out_0 (3 bits)] ... [out_127 (3 bits)]
     for (int j = 0; j < 49; j++) dst_blk->qs[j] = 0;
@@ -1008,6 +1013,7 @@ static __global__ void __launch_bounds__(256, 1) k_set_rows_turbo2_tcq(
     }
     float recon_norm = sqrtf(recon_norm_sq);
     float corrected_norm = (recon_norm > 1e-10f) ? saved_norm / recon_norm : saved_norm;
+    corrected_norm *= d_tcq_norm_alpha;
 
     // Pack bitstream: [6 prefix bits] [out_0 (2 bits)] ... [out_127 (2 bits)]
     for (int j = 0; j < 33; j++) dst_blk->qs[j] = 0;
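Design note: because `d_tcq_norm_alpha` sits in `__constant__` memory, the override costs one cached broadcast read plus a single multiply per 128-element block, negligible next to the Viterbi search itself. As written, the scale is applied to every TCQ-encoded row, so K and V cache entries are inflated alike; experiment #70 in experiments.md is the proposed follow-up that would treat them asymmetrically.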
