We're open-sourcing **TurboQuant.cpp**, a zero-dependency C/C++ library that compresses LLM KV caches from 16-bit to 2-4 bits — giving you **3x longer contexts on the same GPU**.
## The Problem
The KV cache is the #1 memory bottleneck in LLM inference. Running Llama-3.2-3B at 64K context? That's **7 GB** just for the KV cache at 16-bit precision, often more than the model weights.
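To see where the 7 GB figure comes from, here is a minimal sketch of the arithmetic. The function name `kv_cache_bytes` is hypothetical and not part of TurboQuant.cpp, and the shape parameters (28 layers, 8 KV heads, head dimension 128) are assumptions taken from the published Llama-3.2-3B configuration. The estimate counts only the stored key and value elements and ignores any scales or other metadata a quantized format would add.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a TurboQuant.cpp API): bytes needed to hold the
 * keys and values for `tokens` positions. Quantization metadata such as
 * per-block scales is NOT included. */
uint64_t kv_cache_bytes(uint64_t layers, uint64_t kv_heads, uint64_t head_dim,
                        uint64_t tokens, uint64_t bits_per_value)
{
    assert(bits_per_value > 0 && bits_per_value <= 16);

    /* One key vector and one value vector per KV head, per layer. */
    uint64_t values_per_token = 2 * layers * kv_heads * head_dim;

    return values_per_token * tokens * bits_per_value / 8;
}
```

Under those assumptions, `kv_cache_bytes(28, 8, 128, 65536, 16)` gives 7,516,192,768 bytes, which is exactly 7 GiB. The same call with `bits_per_value` set to 4 gives 1.75 GiB before metadata overhead.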
## What TurboQuant Does
One-line change. Same model. Same GPU. 3x more context.