
Commit 352a02e

unamedkr and claude committed
Fix misleading PyTorch comparison: clarify F32 vs Q4 conditions
Community feedback pointed out that the 59x claim was not apples-to-apples. Added F32/Q4 labels, removed multiplier claims, added disclaimer notes, and noted the upcoming llama.cpp Q4-vs-Q4 fair benchmark.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 668f017 commit 352a02e
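
For context on the fair benchmark the commit message promises: a tokens-per-second figure is just a timed decode loop, so both engines can be measured behind the same harness. A minimal C++ sketch under stated assumptions (the `generate_token` callback and the warm-up and sample counts are hypothetical, not TurboQuant's API):

```cpp
#include <chrono>
#include <functional>

// Hypothetical timing harness, not TurboQuant's real API: each engine is
// wrapped as a "produce one token" callback so both sides of a Q4-vs-Q4
// comparison are measured identically.
double tokens_per_second(const std::function<void()>& generate_token,
                         int warmup = 8, int measured = 256) {
    for (int i = 0; i < warmup; ++i) generate_token();   // exclude load/warm-up cost
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < measured; ++i) generate_token();
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    return measured / dt.count();
}
```

A fair run would then hold the model, prompt, thread count, and quantization format (Q4 vs Q4) constant across engines.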

2 files changed: 24 additions & 16 deletions


README.ko.md

Lines changed: 12 additions & 8 deletions
@@ -9,13 +9,15 @@
 [![Build](https://img.shields.io/badge/build-passing-brightgreen)]()
 [![Tests](https://img.shields.io/badge/tests-70%2B%20pass-brightgreen)]()
 [![License](https://img.shields.io/badge/license-Apache%202.0-blue)]()
-[![Speed](https://img.shields.io/badge/47%20tok%2Fs-Qwen3.5--0.8B-blue)]()
+[![Speed](https://img.shields.io/badge/47%20tok%2Fs%20(Q4)-Qwen3.5--0.8B-blue)]()
 
 ```
-PyTorch CPU: 0.8 tok/s
-PyTorch GPU: 10 tok/s
-TurboQuant CPU: 47 tok/s ← 59x faster, no GPU needed
+PyTorch CPU (F32): 0.8 tok/s
+PyTorch GPU (F32): 10 tok/s
+TurboQuant CPU (Q4): 47 tok/s ← no GPU needed
 ```
+> **Note:** PyTorch runs F32, TurboQuant runs Q4 — not an apples-to-apples comparison.
+> The key contribution is KV cache compression (7.5x) and integer attention, not beating unquantized PyTorch.
 
 ---
 
@@ -45,15 +47,17 @@ that uses artificial neural networks to learn complex patterns...
 
 ## Why TurboQuant?
 
-| | PyTorch | TurboQuant.cpp |
+| | PyTorch (F32) | TurboQuant.cpp (Q4) |
 |---|---|---|
-| **Speed** | 0.8 tok/s | **47 tok/s** (59x) |
+| **Speed** | 0.8 tok/s | **47 tok/s** |
 | **Loading** | 3 sec | **0.3 sec** (mmap) |
-| **Weight Memory** | 1.7 GB | **270 MB** (Q4) |
+| **Weight Memory** | 1.7 GB (F32) | **270 MB** (Q4) |
 | **KV Cache** | Full size | **7.5x compressed** |
 | **Dependencies** | PyTorch, transformers | **None** |
 | **Binary** | ~2 GB installed | **~1 MB** |
-| **Quality** | Baseline | **0.999 cosine** (vs PyTorch) |
+| **Quality** | Baseline (F32) | **0.999 cosine similarity** |
+
+> The speed difference is largely due to Q4 quantization. A fair Q4-vs-Q4 benchmark against llama.cpp is in preparation.
 
 ---
 
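
The weight-memory row above (1.7 GB F32 vs 270 MB Q4) comes down to storing each weight as a 4-bit code plus a shared per-block scale. A minimal C++ sketch of generic 4-bit block quantization; the block size, symmetric scaling, and all names are illustrative assumptions, not TurboQuant's actual format:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative 4-bit block quantization, not TurboQuant's on-disk format:
// every 32 floats become one f32 scale plus 16 bytes of packed nibbles,
// i.e. roughly 4.5 bits/weight instead of 32, which is the mechanism
// behind reductions like the table's 1.7 GB to 270 MB.
struct Q4Block {
    float        scale;
    std::uint8_t codes[16];  // 32 signed 4-bit values, two per byte
};

Q4Block quantize_block(const float w[32]) {
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(w[i]));
    Q4Block b{};
    b.scale = amax / 7.0f;                        // codes span [-7, 7]
    const float inv = (b.scale != 0.0f) ? 1.0f / b.scale : 0.0f;
    for (int i = 0; i < 32; i += 2) {
        int lo = std::clamp((int)std::lround(w[i]     * inv), -7, 7) + 8;
        int hi = std::clamp((int)std::lround(w[i + 1] * inv), -7, 7) + 8;
        b.codes[i / 2] = (std::uint8_t)((hi << 4) | lo);
    }
    return b;
}

// Dequantization is one multiply per value when it is needed at all;
// matmuls can also run directly on the integer codes.
float q4_get(const Q4Block& b, int i) {
    int nib = (i & 1) ? (b.codes[i / 2] >> 4) : (b.codes[i / 2] & 0x0F);
    return (float)(nib - 8) * b.scale;
}
```

The same 4-bit idea applied to cached keys and values, combined with arithmetic on the integer codes, is the general shape of the KV-cache compression and integer attention the note calls the key contribution.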

README.md

Lines changed: 12 additions & 8 deletions
@@ -9,13 +9,15 @@ Load → Generate → Done. No Python. No GPU. Just one binary.
 [![Build](https://img.shields.io/badge/build-passing-brightgreen)]()
 [![Tests](https://img.shields.io/badge/tests-70%2B%20pass-brightgreen)]()
 [![License](https://img.shields.io/badge/license-Apache%202.0-blue)]()
-[![Speed](https://img.shields.io/badge/47%20tok%2Fs-Qwen3.5--0.8B-blue)]()
+[![Speed](https://img.shields.io/badge/47%20tok%2Fs%20(Q4)-Qwen3.5--0.8B-blue)]()
 
 ```
-PyTorch CPU: 0.8 tok/s
-PyTorch GPU: 10 tok/s
-TurboQuant CPU: 47 tok/s ← 59x faster, no GPU needed
+PyTorch CPU (F32): 0.8 tok/s
+PyTorch GPU (F32): 10 tok/s
+TurboQuant CPU (Q4): 47 tok/s ← no GPU needed
 ```
+> **Note:** PyTorch runs F32, TurboQuant runs Q4 — not an apples-to-apples comparison.
+> The real contribution is KV cache compression (7.5x) and integer attention, not beating unquantized PyTorch.
 
 ---
 
@@ -45,15 +47,17 @@ that uses artificial neural networks to learn complex patterns...
 
 ## Why TurboQuant?
 
-| | PyTorch | TurboQuant.cpp |
+| | PyTorch (F32) | TurboQuant.cpp (Q4) |
 |---|---|---|
-| **Speed** | 0.8 tok/s | **47 tok/s** (59x) |
+| **Speed** | 0.8 tok/s | **47 tok/s** |
 | **Loading** | 3 sec | **0.3 sec** (mmap) |
-| **Weight Memory** | 1.7 GB | **270 MB** (Q4) |
+| **Weight Memory** | 1.7 GB (F32) | **270 MB** (Q4) |
 | **KV Cache** | Full size | **7.5x compressed** |
 | **Dependencies** | PyTorch, transformers, torch | **None** |
 | **Binary Size** | ~2 GB installed | **~1 MB** |
-| **Quality** | Baseline | **0.999 cosine** (vs PyTorch) |
+| **Quality** | Baseline (F32) | **0.999 cosine similarity** |
+
+> Speed difference is largely due to Q4 quantization. A fair Q4-vs-Q4 benchmark against llama.cpp is planned.
 
 ---
 