You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
It compares MegaQuant methods against locally tested TurboQuant / RotorQuant / IsoQuant / PlanarQuant-style Python baseline implementations in a narrow GPT-2 KV-cache quality benchmark.
7
+
**Low-bit KV-cache compression experiments with honest metadata accounting.**
6
8
7
-
> Scope warning: this repository does **not** claim production speed, real VRAM savings, CUDA kernel quality, or general superiority across LLMs. The headline is an observed result in the benchmark described below.
9
+
## At a glance
8
10
9
-
## Current headline
11
+
In a local GPT-2 CPU/Python fake-quant benchmark, MegaQuant's best current point gives:
10
12
11
-
Current strongest method in this repo's benchmark:
13
+
| What you care about | Result | Compared with |
14
+
|---|---:|---|
15
+
| Modeled KV payload size |**19.6% of FP16**|**5.11x compression / 80.4% saving** vs FP16 |
16
+
| Attention-output quality |**+11.0% higher**| vs local `RotorQuant-3b` baseline (`0.942270` vs `0.848665`) |
17
+
| Memory cost vs RotorQuant-3b |**+4.35% more**|`3.130399` vs `3.000000` effective bits/dim |
18
+
19
+
Main method:
12
20
13
21
```text
14
22
affine_seven_level_3bit_g64_meta4
23
+
3.130399 effective bits/dim
24
+
0.942270 attention-output cosine
15
25
```
16
26
17
-
Full GPT-2 CPU/Python fake-quant benchmark result:
27
+
Need lower memory? The 2-bit Hadamard variant uses **24.8% less modeled memory than local RotorQuant-3b** (`2.255399` vs `3.000000` bits/dim) while landing in the same attention-output-cosine range in this benchmark (`0.851023` vs `0.848665`).
18
28
19
-
```text
20
-
effective_bits_per_dim = 3.130399
21
-
attn_out_cos_mean = 0.942270
22
-
score_cos_mean = 0.997257
23
-
```
29
+
## Scope
30
+
31
+
This repository is a **research proof-of-concept**, not a production inference engine.
24
32
25
-
Within this repository's current benchmark setup, this is the best observed quality/compression tradeoff among the tested methods.
33
+
The numbers above are:
26
34
27
-
## Plain-language comparison
35
+
- from a narrow GPT-2 KV-cache quality benchmark,
36
+
- CPU/Python fake-quant results,
37
+
- based on modeled `effective_bits_per_dim` including declared metadata,
38
+
- comparisons against local Python baseline implementations.
28
39
29
-
For readers who just want the headline numbers:
40
+
They are **not** claims about CUDA kernels, real VRAM, decode throughput, or general superiority across LLMs.
41
+
42
+
## Related repository
30
43
31
-
- vs **FP16 KV-cache payload**: `affine_seven_level_3bit_g64_meta4` uses about **19.6%** of the modeled payload, i.e. about **80.4% memory saving** and about **5.11x compression**.
32
-
- vs local **RotorQuant-3b** baseline in this benchmark: it uses about **4.35% more modeled memory** (`3.130399` vs `3.000000` effective bits/dim), but gives about **11.03% higher attention-output cosine** (`0.942270` vs `0.848665`).
33
-
- if you want a lower-memory point instead of the main quality point, `hadamard_affine_four_level_2bit_g64_meta8` uses about **24.8% less modeled memory** than local `RotorQuant-3b` (`2.255399` vs `3.000000` bits/dim) while showing slightly higher attention-output cosine in this benchmark (`0.851023` vs `0.848665`).
44
+
RAG/vector-index companion project:
34
45
35
-
These are benchmark-local modeled payload comparisons, not production VRAM or kernel-throughput claims.
@@ -44,12 +55,6 @@ Small implementation overheads such as padding, headers, estimator state, and ru
44
55
45
56
For simulated `meta4`/`meta8` methods, the public tables add a small conservative term for shared metadata-range parameters. This is still a modeled storage budget, not a packed-kernel measurement.
0 commit comments