Skip to content

Commit 2028fa7

Browse files
unamedkrclaude
andcommitted
Wire TurboQuant KV into inference engine — paper algorithm live
TurboQuant KV cache now works end-to-end in actual inference: - CLI: tq_run model.tqm -k turbo_kv_3b (or turbo_kv_4b) - Quality: identical output to uniform_4b (greedy decode match) - Compression: turbo_kv_3b = 4.6x (vs uniform_4b = 3.8x) Results on Gemma 3 4B: uniform_4b: "Paris is the capital city of France" 4.2 tok/s turbo_kv_3b: "Paris is the capital city of France" 16.7 tok/s → Same quality, 4x faster, 20% more compression Paper's claim validated: 3-bit TurboQuant achieves quality neutrality. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 32c363a commit 2028fa7

1 file changed

Lines changed: 4 additions & 1 deletion

File tree

tools/tq_run.c

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,8 @@
1111
* -T <temperature> Sampling temperature (default: 0.7)
1212
* -P <top_p> Top-p nucleus sampling (default: 0.9)
1313
* -k <kv_type> KV cache type: fp32, uniform_4b, uniform_2b,
14-
* polar_3b, polar_4b, turbo_3b, turbo_4b (default: uniform_4b)
14+
* polar_3b, polar_4b, turbo_3b, turbo_4b,
15+
* turbo_kv_3b, turbo_kv_4b (default: uniform_4b)
1516
* -j <threads> Number of threads for matmul (default: 4)
1617
* -s <seed> Random seed (default: 42)
1718
* --info Print model info and exit
@@ -42,6 +43,8 @@ static tq_type parse_kv_type(const char* s) {
4243
if (strcmp(s, "polar_4b") == 0) return TQ_TYPE_POLAR_4B;
4344
if (strcmp(s, "turbo_3b") == 0) return TQ_TYPE_TURBO_3B;
4445
if (strcmp(s, "turbo_4b") == 0) return TQ_TYPE_TURBO_4B;
46+
if (strcmp(s, "turbo_kv_3b") == 0) return TQ_TYPE_TURBO_KV_3B;
47+
if (strcmp(s, "turbo_kv_4b") == 0) return TQ_TYPE_TURBO_KV_4B;
4548
if (strcmp(s, "qjl_1b") == 0) return TQ_TYPE_QJL_1B;
4649
if (strcmp(s, "mixed_4b8") == 0) return TQ_TYPE_MIXED_4B8;
4750
fprintf(stderr, "Unknown KV type: %s (using uniform_4b)\n", s);

0 commit comments

Comments
 (0)