Commit 6a3f7b0

Make README benchmark summary clearer
1 parent fa35fc0 commit 6a3f7b0

1 file changed

Lines changed: 29 additions & 24 deletions

README.md

@@ -1,38 +1,49 @@
 # MegaQuant KV-Cache
 
-MegaQuant KV-Cache is a **CPU/Python research proof-of-concept** for metadata-aware low-bit KV-cache quantization.
+[![Research PoC](https://img.shields.io/badge/status-research%20PoC-blue)](#scope)
+[![CPU/Python](https://img.shields.io/badge/benchmark-CPU%2FPython-lightgrey)](#scope)
+[![Metadata aware](https://img.shields.io/badge/accounting-metadata--aware-green)](#what-effective_bits_per_dim-means)
 
-It compares MegaQuant methods against locally tested TurboQuant / RotorQuant / IsoQuant / PlanarQuant-style Python baseline implementations in a narrow GPT-2 KV-cache quality benchmark.
+**Low-bit KV-cache compression experiments with honest metadata accounting.**
 
-> Scope warning: this repository does **not** claim production speed, real VRAM savings, CUDA kernel quality, or general superiority across LLMs. The headline is an observed result in the benchmark described below.
+## At a glance
 
-## Current headline
+In a local GPT-2 CPU/Python fake-quant benchmark, MegaQuant's best current point gives:
 
-Current strongest method in this repo's benchmark:
+| What you care about | Result | Compared with |
+|---|---:|---|
+| Modeled KV payload size | **19.6% of FP16** | **5.11x compression / 80.4% saving** vs FP16 |
+| Attention-output quality | **+11.0% higher** | vs local `RotorQuant-3b` baseline (`0.942270` vs `0.848665`) |
+| Memory cost vs RotorQuant-3b | **+4.35% more** | `3.130399` vs `3.000000` effective bits/dim |
+
+Main method:
 
 ```text
 affine_seven_level_3bit_g64_meta4
+3.130399 effective bits/dim
+0.942270 attention-output cosine
 ```
 
-Full GPT-2 CPU/Python fake-quant benchmark result:
+Need lower memory? The 2-bit Hadamard variant uses **24.8% less modeled memory than local RotorQuant-3b** (`2.255399` vs `3.000000` bits/dim) while landing in the same attention-output-cosine range in this benchmark (`0.851023` vs `0.848665`).
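
The percentages in the new `At a glance` table and in the 2-bit paragraph follow directly from the quoted bits/dim and cosine values. A minimal Python sketch that recomputes them is given here as a sanity check. The function name `headline_ratios` and the constant names are illustrative and not part of the repository, and the FP16 reference is assumed to be a plain 16 bits per dimension with no metadata.

```python
# Values copied from the README diff; nothing here is measured.
FP16_BITS = 16.0        # assumption: FP16 payload is 16 bits/dim, no metadata
MAIN_BITS = 3.130399    # affine_seven_level_3bit_g64_meta4
MAIN_COS = 0.942270
ROTOR_BITS = 3.000000   # local RotorQuant-3b baseline
ROTOR_COS = 0.848665
HADAMARD_BITS = 2.255399  # 2-bit Hadamard variant
HADAMARD_COS = 0.851023


def pct_change(new: float, ref: float) -> float:
    """Relative change of `new` against `ref`, in percent."""
    return 100.0 * (new - ref) / ref


def headline_ratios() -> dict:
    """Recompute the README's headline percentages from the quoted values."""
    return {
        "payload_pct_of_fp16": 100.0 * MAIN_BITS / FP16_BITS,
        "compression_vs_fp16": FP16_BITS / MAIN_BITS,
        "saving_pct_vs_fp16": 100.0 * (1.0 - MAIN_BITS / FP16_BITS),
        "main_cos_gain_pct": pct_change(MAIN_COS, ROTOR_COS),
        "main_bits_overhead_pct": pct_change(MAIN_BITS, ROTOR_BITS),
        "hadamard_bits_change_pct": pct_change(HADAMARD_BITS, ROTOR_BITS),
        "hadamard_cos_gain_pct": pct_change(HADAMARD_COS, ROTOR_COS),
    }


if __name__ == "__main__":
    for name, value in headline_ratios().items():
        print(f"{name}: {value:.4f}")
```

After rounding, the recomputed figures agree with the README text: about 19.6% of FP16, 5.11x compression, an 11.03% cosine gain at 4.35% more bits for the main method, and 24.8% fewer bits for the 2-bit variant.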
 
-```text
-effective_bits_per_dim = 3.130399
-attn_out_cos_mean = 0.942270
-score_cos_mean = 0.997257
-```
+## Scope
+
+This repository is a **research proof-of-concept**, not a production inference engine.
 
-Within this repository's current benchmark setup, this is the best observed quality/compression tradeoff among the tested methods.
+The numbers above are:
 
-## Plain-language comparison
+- from a narrow GPT-2 KV-cache quality benchmark,
+- CPU/Python fake-quant results,
+- based on modeled `effective_bits_per_dim` including declared metadata,
+- comparisons against local Python baseline implementations.
 
-For readers who just want the headline numbers:
+They are **not** claims about CUDA kernels, real VRAM, decode throughput, or general superiority across LLMs.
+
+## Related repository
 
-- vs **FP16 KV-cache payload**: `affine_seven_level_3bit_g64_meta4` uses about **19.6%** of the modeled payload, i.e. about **80.4% memory saving** and about **5.11x compression**.
-- vs local **RotorQuant-3b** baseline in this benchmark: it uses about **4.35% more modeled memory** (`3.130399` vs `3.000000` effective bits/dim), but gives about **11.03% higher attention-output cosine** (`0.942270` vs `0.848665`).
-- if you want a lower-memory point instead of the main quality point, `hadamard_affine_four_level_2bit_g64_meta8` uses about **24.8% less modeled memory** than local `RotorQuant-3b` (`2.255399` vs `3.000000` bits/dim) while showing slightly higher attention-output cosine in this benchmark (`0.851023` vs `0.848665`).
+RAG/vector-index companion project:
 
-These are benchmark-local modeled payload comparisons, not production VRAM or kernel-throughput claims.
+- https://github.com/CrazyAngelm/megaquant-rag-compress
 
 ## What `effective_bits_per_dim` means
 
@@ -44,12 +55,6 @@ Small implementation overheads such as padding, headers, estimator state, and ru
 
 For simulated `meta4`/`meta8` methods, the public tables add a small conservative term for shared metadata-range parameters. This is still a modeled storage budget, not a packed-kernel measurement.
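
The hunk does not show the README's actual formula, so the following is only an illustrative reading of the quoted figures, not the repository's accounting. It assumes that `g64` means groups of 64 values and that `metaN` means two N-bit parameters stored per group; both are guesses from the method names, and `payload_plus_group_metadata` is a hypothetical helper. Under those assumptions, the sketch computes how much of each quoted value is left unexplained.

```python
# Quoted values from the diff: (payload bits, metadata bits, effective bits/dim).
QUOTED = {
    "affine_seven_level_3bit_g64_meta4": (3, 4, 3.130399),
    "hadamard_affine_four_level_2bit_g64_meta8": (2, 8, 2.255399),
}


def payload_plus_group_metadata(
    base_bits: float,
    meta_bits: float,
    group_size: int = 64,       # assumption: "g64" = 64 values per group
    params_per_group: int = 2,  # assumption: e.g. one scale and one offset
) -> float:
    """Bits per dimension for the payload plus per-group metadata."""
    return base_bits + params_per_group * meta_bits / group_size


def residuals() -> dict:
    """Quoted effective bits/dim minus the assumed payload-plus-metadata model."""
    return {
        name: quoted - payload_plus_group_metadata(base, meta)
        for name, (base, meta, quoted) in QUOTED.items()
    }


if __name__ == "__main__":
    for name, extra in residuals().items():
        print(f"{name}: unexplained {extra:.6f} bits/dim")
```

Under these assumptions both methods leave the same residual of about 0.0054 bits/dim, which would be consistent with a single small shared term added on top of per-group metadata. The README's own definition should be treated as authoritative.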
 
-## Related repository
-
-RAG/vector-index companion project:
-
-- https://github.com/CrazyAngelm/megaquant-rag-compress
-
 ## Benchmark setup
 
 ```text
