
Commit 46e69c9

unamedkr and claude committed
Apply real-model insights: recommendation engine + README rankings
Updated tq_recommend_strategy() based on real Qwen3.5-0.8B A/B findings:
- uniform_4b as default (cosine 0.994, community validated)
- mixed_4b8 for large head_dim with outliers (cosine 0.994)
- uniform_2b for max compression (cosine 0.953, A grade on real data)
- QJL/PolarQuant deprioritized (uniform is better at the same bit count)

README quantization types table now ranked by real model results:
1. uniform_4b (A+, 7.5x): default production
2. mixed_4b8 (A+, 6.4x): outlier-heavy models
3. uniform_2b (A, 14.2x): surprisingly good
4-6. turbo/polar/qjl: research only

Added recommended configurations block (en + ko)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 2daef20 commit 46e69c9
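The selection priorities the commit message describes can be sketched as follows. This is an illustrative stand-in, not the library's real `tq_recommend_strategy()` (whose actual signature is not shown on this page); the parameter names and the `head_dim >= 128` cutoff for "large head_dim" are assumptions.

```python
# Hedged sketch of the commit's recommendation priorities. All names and the
# head_dim threshold are illustrative assumptions, not the real library API.

def recommend_strategy(head_dim: int, has_outliers: bool,
                       max_compression: bool = False) -> str:
    if max_compression:
        return "uniform_2b"   # cosine 0.953, 14.2x, A grade on real data
    if has_outliers and head_dim >= 128:
        return "mixed_4b8"    # cosine 0.994, 6.4x, for outlier-heavy heads
    return "uniform_4b"       # cosine 0.994, 7.5x, community-validated default
```

The ordering mirrors the A/B findings above: uniform min-max schemes first, with the mixed scheme reserved for the outlier case and 2-bit only when compression is the priority.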

15 files changed

Lines changed: 93 additions & 23 deletions

README.ko.md

Lines changed: 20 additions & 8 deletions
````diff
@@ -166,14 +166,26 @@ scores = tq.attention(query, quantized, 512, 128, TurboQuant.UNIFORM_4B)
 
 ## Quantization Types
 
-| Type | Bits | Algorithm | Compression | Quality | Recommended Use |
-|------|------|-----------|-------------|---------|-----------------|
-| `uniform_4b` | 4 | Min-Max | 7.5x | A+ (0.995) | **Production (community recommended)** |
-| `mixed_4b8` | ~5 | 4-bit + fp16 outliers | 6.4x | A+ | Data with many outliers |
-| `uniform_2b` | 2 | Min-Max | 14.2x | B+ (0.855) | Extreme compression |
-| `turbo_3b` | 3 | Polar+QJL | 4.6x | B+ (0.917) | Balanced |
-| `polar_4b` | 4 | PolarQuant | 7.1x | B (0.827) | Research |
-| `qjl_1b` | 1 | QJL sign hash | 12.8x | C (0.702) | Ultra-extreme compression |
+Ranked by **real Qwen3.5-0.8B A/B test results** (not synthetic data):
+
+| Rank | Type | Bits | Compression | Real Cosine | Grade | Recommended Use |
+|------|------|------|-------------|-------------|-------|-----------------|
+| 1 | **`uniform_4b`** | 4 | 7.5x | **0.994** | **A+** | **Default production choice** |
+| 2 | **`mixed_4b8`** | ~5 | 6.4x | **0.994** | **A+** | Models with severe outliers |
+| 3 | **`uniform_2b`** | 2 | 14.2x | **0.953** | **A** | Extreme compression (surprisingly good) |
+| 4 | `turbo_3b` | 3 | 4.6x | 0.934 | B+ | Research |
+| 5 | `polar_4b` | 4 | 7.1x | 0.893 | B | Research |
+| 6 | `qjl_1b` | 1 | 25.6x | 0.744 | C | Not recommended |
+
+**Recommended configurations:**
+```
+Best quality: uniform_4b (cosine 0.994, 7.5x)
+Best value: K4V2 (key=4b, value=2b) (cosine ~0.97, 9.8x)
+Extreme compression: uniform_2b (cosine 0.953, 14.2x)
+With RHT: RHT + uniform_4b (additional 1.8x MSE improvement)
+```
+
+> **Community validated** (r/LocalLLaMA, llama.cpp #20969): Simple min-max (`uniform_4b`) outperforms QJL/PolarQuant in practice. `uniform_2b` also reaches an A grade on real models at 14x compression.
 
 > **Community validated** (r/LocalLLaMA, llama.cpp #20969): `uniform_4b` outperforms QJL-based methods in practice. QJL increases variance, which hurts the attention softmax.
 
````
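To make the ranked table concrete, here is a minimal pure-Python sketch of the `uniform_4b` scheme: min-max quantization with bin-centered reconstruction. It is an illustration under assumed details (list-of-floats input, a synthetic test signal), not the library's packed-tensor implementation.

```python
import math

def quantize_uniform_4b(x):
    """Min-max quantize a list of floats to 4-bit codes (16 bins)."""
    lo, hi = min(x), max(x)
    scale = (hi - lo) / 16 or 1.0          # bin width; guard against flat input
    codes = [min(15, int((v - lo) / scale)) for v in x]
    return codes, lo, scale

def dequantize_uniform_4b(codes, lo, scale):
    # Bin-centered reconstruction: return the middle of each bin, which
    # halves the worst-case rounding error versus using the bin's left edge.
    return [lo + (c + 0.5) * scale for c in codes]

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb)

# Quantize a smooth synthetic signal and measure reconstruction quality.
x = [3.0 * math.sin(0.7 * i) + math.cos(1.3 * i) for i in range(256)]
codes, lo, scale = quantize_uniform_4b(x)
x_hat = dequantize_uniform_4b(codes, lo, scale)
print(cosine(x, x_hat))   # close to 1.0 at only 4 bits per value
```

Even this toy version stays above 0.99 cosine on smooth data, which is consistent with the intuition behind the table: simple min-max loses very little at 4 bits.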

README.md

Lines changed: 20 additions & 10 deletions
````diff
@@ -172,16 +172,26 @@ Measured on Apple M-series (ARM NEON):
 
 ## Quantization Types
 
-| Type | Bits | Algorithm | Compression | Quality | Best For |
-|------|------|-----------|-------------|---------|----------|
-| `uniform_4b` | 4 | Min-Max | 7.5x | A+ (0.995) | **Production (recommended)** |
-| `mixed_4b8` | ~5 | 4-bit + fp16 outliers | 6.4x | A+ | Data with outliers |
-| `uniform_2b` | 2 | Min-Max | 14.2x | B+ (0.855) | Max compression |
-| `turbo_3b` | 3 | Polar+QJL | 4.6x | B+ (0.917) | Balanced |
-| `polar_4b` | 4 | PolarQuant | 7.1x | B (0.827) | Research |
-| `qjl_1b` | 1 | QJL Sign Hash | 12.8x | C (0.702) | Extreme compression |
-
-> **Community finding** (r/LocalLLaMA, llama.cpp #20969): `uniform_4b` with bin-centered reconstruction outperforms QJL-based methods in practice. QJL increases variance which hurts attention softmax.
+Ranked by **real Qwen3.5-0.8B A/B test results** (not synthetic data):
+
+| Rank | Type | Bits | Compression | Real Cosine | Grade | Recommended For |
+|------|------|------|-------------|-------------|-------|-----------------|
+| 1 | **`uniform_4b`** | 4 | 7.5x | **0.994** | **A+** | **Default production choice** |
+| 2 | **`mixed_4b8`** | ~5 | 6.4x | **0.994** | **A+** | Models with extreme outliers |
+| 3 | **`uniform_2b`** | 2 | 14.2x | **0.953** | **A** | Max compression (surprisingly good) |
+| 4 | `turbo_3b` | 3 | 4.6x | 0.934 | B+ | Research |
+| 5 | `polar_4b` | 4 | 7.1x | 0.893 | B | Research |
+| 6 | `qjl_1b` | 1 | 25.6x | 0.744 | C | Not recommended |
+
+**Recommended configurations:**
+```
+Best quality: uniform_4b (cosine 0.994, 7.5x)
+Best balance: K4V2 (key=4b, value=2b) (cosine ~0.97, 9.8x)
+Max compression: uniform_2b (cosine 0.953, 14.2x)
+With RHT: RHT + uniform_4b (MSE 1.8x better)
+```
+
+> **Community validated** (r/LocalLLaMA, llama.cpp #20969): Simple min-max (`uniform_4b`) outperforms QJL and PolarQuant in practice. QJL increases variance which hurts attention softmax. `uniform_2b` at 14x compression achieves A grade on real models.
 
 ---
 
````
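The `mixed_4b8` entry ranked second above (a 4-bit base with fp16 outliers) can also be illustrated with a short sketch. The function names, the dict-based storage, and the 1% outlier fraction are assumptions for illustration, not the library's API or format.

```python
# Hedged sketch of the mixed 4-bit + full-precision-outlier idea. All names
# and the outlier_frac default are illustrative assumptions.

def quantize_mixed_4b8(x, outlier_frac=0.01):
    """4-bit min-max quantization, keeping the largest-|v| values unquantized."""
    n_out = max(1, int(len(x) * outlier_frac))
    by_mag = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    outliers = {i: x[i] for i in by_mag[:n_out]}      # kept in full precision
    inliers = [x[i] for i in range(len(x)) if i not in outliers]
    lo, hi = min(inliers), max(inliers)
    scale = (hi - lo) / 16 or 1.0
    codes = {i: max(0, min(15, int((x[i] - lo) / scale)))
             for i in range(len(x)) if i not in outliers}
    return codes, outliers, lo, scale

def dequantize_mixed_4b8(n, codes, outliers, lo, scale):
    # Outliers come back exactly; everything else at its 4-bit bin center.
    return [outliers[i] if i in outliers else lo + (codes[i] + 0.5) * scale
            for i in range(n)]

# Demo: one huge activation spike survives exactly. Plain uniform_4b would
# stretch the min-max range around the spike and crush all other values
# into a handful of bins.
x = [0.01 * i for i in range(100)]
x[7] = 50.0
codes, outliers, lo, scale = quantize_mixed_4b8(x)
x_hat = dequantize_mixed_4b8(len(x), codes, outliers, lo, scale)
```

Excluding the outliers before computing `lo`/`hi` is the whole point of the scheme: the quantization range is fit to the well-behaved bulk of the data.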

File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
