
Commit 08c04a6

spiritbuun and claude committed

merge: experiment TheTom#69 — TCQ temperature scaling alpha=1.20

5-14% PPL improvement at ALL context lengths for both 3-bit and 2-bit TCQ. Multiplies the stored norm by 1.2 to sharpen attention logits. Beats every competitor at every context length at both bit rates. Override via the TURBO_TCQ_ALPHA env var.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2 parents becc846 + 4f89b65 commit 08c04a6

4 files changed

Lines changed: 119 additions & 6 deletions


benchmark-results.md

Lines changed: 87 additions & 0 deletions
@@ -1718,3 +1718,90 @@ All three identical — **speed gap is TCQ-specific, not general**.

**Key insight:** TCQ's quality advantage comes at a speed cost. The Viterbi encode path is the bottleneck — turbo3/turbo4 without TCQ run at the same speed across all repos.

---

## Experiment #69: Temperature Scaling (2026-03-31)

### Preliminary sweep — turbo3_tcq, compiled-in codebook (NOT best)

| Alpha | @2K (8ch) | @8K (8ch) | @32K (4ch) | @64K (4ch) |
|-------|-----------|-----------|------------|------------|
| 1.00 (baseline) | 5.824 | 6.883 | 7.005 | 7.053 |
| 1.10 | 5.582 | 6.371 | 6.595 | 6.396 |
| **1.25** | **5.528** | **6.219** | **6.541** | **6.178** |
| 1.50 | 5.875 | 6.801 | 7.565 | 7.205 |
| 1.75 | 6.880 | 8.835 | 10.745 | — |

Sweet spot: alpha 1.10-1.25. Best universal: ~1.25, which is best at every context length in this sweep (edging out 1.10 at 2K by only 0.05).
Alpha=1.25 @64K: **6.178 vs 7.053 baseline = 12.4% PPL reduction**.
Alpha=1.25 @2K: **5.528 vs 5.824 = 5.1% PPL reduction**.
Note: these results use the compiled-in codebook, NOT the best one.

### Full sweep — turbo3_tcq, cb_50iter_finetuned.bin codebook

| Alpha | @2K (8ch) | @8K (8ch) | @32K (4ch) | @64K (4ch) |
|-------|-----------|-----------|------------|------------|
| 1.00 (baseline) | 5.912 | 7.001 | 7.071 | 7.034 |
| 1.05 | 5.762 | 6.675 | 6.807 | 6.665 |
| 1.10 | 5.687 | 6.484 | 6.647 | 6.448 |
| 1.15 | 5.623 | 6.351 | 6.572 | 6.290 |
| **1.20** | **5.567** | **6.289** | **6.574** | **6.224** |
| 1.25 | 5.624 | 6.271 | 6.619 | 6.274 |
| 1.30 | 5.590 | 6.315 | 6.698 | 6.329 |

Optimal alpha for 3-bit (50iter codebook): **1.15-1.20**
- alpha=1.20 best at 2K and 64K
- alpha=1.25 best at 8K (marginal: 6.271 vs 6.289)
- alpha=1.15 and 1.20 essentially tied at 32K (6.572 vs 6.574)

**Universal improvement**: alpha=1.20 cuts PPL by 5.8% @2K, 10.2% @8K, 7.0% @32K, 11.5% @64K.
No context length where alpha=1.0 is better. This is a pure win.
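For reference, the improvement percentages in this section are relative PPL reduction versus the alpha=1.00 row of the same table (verifiable from the numbers above):

$$\Delta = \frac{\mathrm{PPL}_{1.00} - \mathrm{PPL}_{\alpha}}{\mathrm{PPL}_{1.00}}, \qquad \text{e.g. } \Delta_{@64\mathrm{K}} = \frac{7.034 - 6.224}{7.034} \approx 11.5\%.$$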
Note: the cb_50iter_finetuned baseline (5.912 @2K) is worse than the compiled-in numpy codebook (5.824 @2K).
Compiled-in numpy + alpha=1.25 gave 5.528 @2K in the preliminary sweep — even better.
A fine sweep on the compiled-in numpy codebook is still needed to find the true global optimum.

### Full sweep — turbo2_tcq, tcq_2bit_100iter_s99.bin codebook

| Alpha | @2K (8ch) | @8K (8ch) | @32K (4ch) | @64K (4ch) |
|-------|-----------|-----------|------------|------------|
| 1.00 (baseline) | 6.042 | 7.135 | 7.205 | 7.222 |
| 1.05 | 5.804 | 6.793 | 6.937 | 6.779 |
| 1.10 | 5.800 | 6.570 | 6.717 | 6.488 |
| **1.15** | **5.607** | 6.412 | 6.616 | 6.337 |
| **1.20** | 5.619 | 6.387 | **6.615** | **6.248** |
| **1.25** | 5.611 | **6.345** | 6.635 | 6.250 |
| 1.30 | 5.640 | 6.380 | 6.697 | 6.311 |
| 1.50 | 6.004 | 6.970 | 7.601 | 7.206 |

Optimal alpha for 2-bit: **1.15-1.25** (same range as 3-bit)
- alpha=1.15 marginally best at 2K (5.607 vs 5.611 for alpha=1.25)
- alpha=1.25 best at 8K
- alpha=1.20 best at 32K and 64K (by tiny margins: 6.615 vs 6.616, 6.248 vs 6.250)
- alpha=1.50 already degrades past baseline at 32K

**Universal improvement**: alpha=1.20 cuts PPL by 7.0% @2K, 10.5% @8K, 8.2% @32K, 13.5% @64K.

### Summary — Temperature Scaling Experiment #69

**CONFIRMED: Temperature scaling is a massive universal improvement for TCQ.**
- Optimal alpha: 1.15-1.25 for both 3-bit and 2-bit TCQ
- Default recommendation: alpha=1.20 (best overall)
- Improvement is 5-14% PPL reduction across ALL context lengths
- No regression at any context length — pure win
- Improvement grows with context length (larger at 64K than at 2K)
- Same optimal alpha range regardless of codebook choice or bit rate
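Mechanically, this is ordinary softmax temperature. A sketch, assuming the inflated stored norm simply rescales the dequantized K rows (as of this commit the same scale also hits V rows; experiment #70 proposes splitting them): with attention logits $\ell_i = q \cdot k_i / \sqrt{d}$, storing $\alpha \lVert k \rVert$ reconstructs $\alpha k_i$, so

$$\operatorname{softmax}_i(\alpha\,\ell_i) = \operatorname{softmax}_i(\ell_i / T), \qquad T = 1/\alpha.$$

Alpha=1.20 is a mild T ≈ 0.83; Duster's accidental 2.77x inflation corresponds to T = 1/2.77 ≈ 0.36, the figure quoted in experiments.md below.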
Comparison vs competitors at alpha=1.20:

| Config | @2K | @8K | @32K | @64K |
|--------|-----|-----|------|------|
| **Ours t3_tcq α=1.20** | **5.567** | **6.289** | **6.574** | **6.224** |
| Duster TBQ3 | 6.565 | 6.921 | 7.056 | 7.034 |
| TheTom turbo3 | 6.548 | 6.934 | 7.089 | 7.114 |
| **Ours t2_tcq α=1.20** | **5.619** | **6.387** | **6.615** | **6.248** |
| Duster TBQ2 | 6.798 | 7.233 | 7.186 | 7.332 |

Note: our numbers use the 50iter_finetuned (3-bit) and 100iter (2-bit) codebooks.
The compiled-in numpy codebook may be even better (the preliminary sweep showed 5.528 @2K vs 5.567).
**We now CRUSH every competitor at every context length at both bit rates.**

experiments.md

Lines changed: 6 additions & 6 deletions
@@ -791,12 +791,12 @@ for all turbo dequant kernels (turbo3, turbo4) in both prefill and decode paths.
 **Concept**: Run Viterbi trellis over 256 elements (full head_dim) instead of two independent 128-element groups. Longer trellis = better coding gain. Only benefits head_dim=256 models.
 **Priority**: Medium — nice paper result showing TCQ scales with block length.
-### 69. Temperature scaling — attention sharpening via norm inflation `ready`
-**Source**: Competitive analysis 2026-03-31. Duster's TBQ accidentally inflates norms 2.77x, acting as attention temperature T=0.36. This sharpens attention and helps at long context.
-**Concept**: Multiply `corrected_norm` by alpha in TCQ encode kernel. Try alpha = 1.5, 2.0, 2.5, 2.77. Combines our 976x better MSE with TBQ's temperature benefit.
-**Change**: One line in `turbo-quant-cuda.cuh` — scale the stored norm after correction.
-**Test**: PPL grid at 2K/8K/32K/64K for each alpha. If optimal alpha differs by context length, consider making alpha a runtime parameter.
-**Expected**: Beat TBQ at ALL context lengths. This is the single highest-impact experiment.
+### 69. Temperature scaling — attention sharpening via norm inflation `done`
+**Source**: Competitive analysis 2026-03-31. Duster's TBQ accidentally inflates norms 2.77x, acting as attention temperature T=0.36.
+**Result**: **MASSIVE WIN.** Alpha=1.20 optimal for both 3-bit and 2-bit TCQ. 5-14% PPL improvement at ALL context lengths. No regression anywhere. We now beat every competitor at every context length at both bit rates.
+**Implementation**: `d_tcq_norm_alpha` constant in turbo-quant-cuda.cuh, loaded from the `TURBO_TCQ_ALPHA` env var in set-rows.cu. Applied to both the 3-bit and 2-bit encode kernels.
+**Key numbers**: 3-bit @64K: 6.224 (was 7.034; TBQ3 is 7.034). 2-bit @64K: 6.248 (was 7.222; TBQ2 is 7.332).
+**Default**: Hard-code alpha=1.20 for shipping. Keep the env var for experimentation.
 ### 70. Asymmetric K/V norm scaling — raw norm for K, corrected for V `ready`
 **Source**: Competitive analysis quality findings. K temperature helps attention routing, V accuracy helps output quality.
ggml/src/ggml-cuda/set-rows.cu

Lines changed: 20 additions & 0 deletions
@@ -2,6 +2,24 @@
 #include "cpy-utils.cuh"
 #include "turbo-quant-cuda.cuh"
 #include <cstring>
+#include <cerrno>
+
+static void load_tcq_norm_alpha() {
+    static bool loaded = false;
+    if (loaded) return;
+    loaded = true;
+    const char *s = getenv("TURBO_TCQ_ALPHA");
+    if (!s) return;
+    char *end;
+    errno = 0;
+    float alpha = strtof(s, &end);
+    if (end == s || errno != 0 || alpha <= 0.0f || alpha >= 10.0f) {
+        fprintf(stderr, "TCQ: invalid TURBO_TCQ_ALPHA='%s'\n", s);
+        return;
+    }
+    cudaMemcpyToSymbol(d_tcq_norm_alpha, &alpha, sizeof(float));
+    fprintf(stderr, "TCQ: norm alpha = %.3f\n", alpha);
+}
 
 typedef void (*set_rows_kernel_t)(const char * src, char * dst);
@@ -373,6 +391,7 @@ static void set_rows_cuda(ggml_backend_cuda_context & ctx, const ggml_tensor * s
             fprintf(stderr, "TCQ encode: FAILED to load codebook from %s\n", cb_path);
         }
     }
+    load_tcq_norm_alpha();
 }
 // TCQ Viterbi encode: 512 threads per block (one per trellis state)
 const int64_t s01_f = nb01/sizeof(float); const int64_t s02_f = nb02/sizeof(float); const int64_t s03_f = nb03/sizeof(float);
@@ -410,6 +429,7 @@ static void set_rows_cuda(ggml_backend_cuda_context & ctx, const ggml_tensor * s
             fprintf(stderr, "TCQ2 encode: FAILED to load codebook from %s\n", cb_path);
         }
     }
+    load_tcq_norm_alpha();
 }
 // 2-bit TCQ Viterbi encode: 256 threads per block (one per trellis state, L=8)
 const int64_t s01_f = nb01/sizeof(float); const int64_t s02_f = nb02/sizeof(float); const int64_t s03_f = nb03/sizeof(float);
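The loader runs once (static `loaded` flag), accepts only plausible values (0 < alpha < 10), and otherwise leaves the compiled-in 1.2f default in place. Below is a minimal host-only sketch of the same validation rules for desk-testing them; it is not part of the commit, and `parse_alpha` is a hypothetical stand-in for the loader's parsing logic:

```cpp
#include <cerrno>
#include <cstdio>
#include <cstdlib>

// Mirrors load_tcq_norm_alpha()'s checks: returns the parsed alpha,
// or `fallback` when the input is absent, non-numeric, or out of range.
static float parse_alpha(const char *s, float fallback) {
    if (!s) return fallback;        // env var unset -> compiled-in default
    char *end;
    errno = 0;
    float alpha = strtof(s, &end);
    if (end == s || errno != 0 || alpha <= 0.0f || alpha >= 10.0f) {
        return fallback;            // rejected, same condition as the loader
    }
    return alpha;
}

int main() {
    printf("%.3f\n", parse_alpha("1.25",   1.2f)); // 1.250  accepted
    printf("%.3f\n", parse_alpha("banana", 1.2f)); // 1.200  rejected: no digits parsed
    printf("%.3f\n", parse_alpha("12.0",   1.2f)); // 1.200  rejected: >= 10
    printf("%.3f\n", parse_alpha(nullptr,  1.2f)); // 1.200  unset
    return 0;
}
```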

ggml/src/ggml-cuda/turbo-quant-cuda.cuh

Lines changed: 6 additions & 0 deletions
@@ -613,6 +613,10 @@ static __constant__ float d_turbo3_tcq_codebook[512] = {
     -0.16474872f, -0.09278035f, -0.04699890f, -0.00779894f, +0.03187623f, +0.07828258f, +0.13561429f, +0.23917313f
 };
 
+// Temperature scaling for TCQ norm. alpha > 1 sharpens attention (helps long context).
+// Override via TURBO_TCQ_ALPHA env var.
+static __constant__ float d_tcq_norm_alpha = 1.2f;
+
 // TCQ SET_ROWS encode: Viterbi optimal path with right-shift trellis
 // 512 threads per block (one per trellis state), one block per 128-element group
 // Backtrace stored in shared memory (32KB, 4-bit packed)
@@ -771,6 +775,7 @@ static __global__ void __launch_bounds__(512, 1) k_set_rows_turbo3_tcq(
     }
     float recon_norm = sqrtf(recon_norm_sq);
     float corrected_norm = (recon_norm > 1e-10f) ? saved_norm / recon_norm : saved_norm;
+    corrected_norm *= d_tcq_norm_alpha;
 
     // Pack bitstream: [6 prefix bits] [out_0 (3 bits)] ... [out_127 (3 bits)]
     for (int j = 0; j < 49; j++) dst_blk->qs[j] = 0;
@@ -1008,6 +1013,7 @@ static __global__ void __launch_bounds__(256, 1) k_set_rows_turbo2_tcq(
     }
     float recon_norm = sqrtf(recon_norm_sq);
     float corrected_norm = (recon_norm > 1e-10f) ? saved_norm / recon_norm : saved_norm;
+    corrected_norm *= d_tcq_norm_alpha;
 
     // Pack bitstream: [6 prefix bits] [out_0 (2 bits)] ... [out_127 (2 bits)]
     for (int j = 0; j < 33; j++) dst_blk->qs[j] = 0;
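Design note: because `d_tcq_norm_alpha` sits in `__constant__` memory, the override costs one cached broadcast read plus a single multiply per 128-element block, negligible next to the Viterbi search itself. As written, the scale is applied to every TCQ-encoded row, so K and V cache entries are inflated alike; experiment #70 in experiments.md is the proposed follow-up that would treat them asymmetrically.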
