
Commit a6a7d6a

unamedkr and claude committed
Document integer-domain attention breakthrough in README + CHANGELOG
Highlight v0.7 achievement: 2.9-4.8x faster than FP32 on CPU

- README.md: new "Integer-Domain Attention" performance section with benchmark table, technical explanation, Google comparison
- README.ko.md: Korean mirror with the same data
- CHANGELOG.md: "Highlights" section added at top

Key messaging:
- "2.9-4.8x faster than FP32" (honest, NEON-vs-NEON comparison)
- "Q4 data 8x smaller → L1 cache → no bandwidth bottleneck"
- "Same principle as Google's 8x on H100, applied to CPU"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent edc5053 commit a6a7d6a

3 files changed

Lines changed: 70 additions & 11 deletions


CHANGELOG.md

Lines changed: 23 additions & 0 deletions
@@ -2,6 +2,29 @@
 
 ## [0.1.0] — 2026-03-29
 
+### Highlights
+
+- **Integer-domain attention**: 2.9-4.8x faster than FP32 on Apple Silicon (ARM NEON `vdotq_s32`)
+- **Real model validated**: Qwen3.5-0.8B KV cache, cosine 0.994 (A+)
+- **8 quantization types** including mixed precision outlier and RHT pre-rotation
+- **K/V asymmetric**: independent key/value bit allocation (K4V2 = 9.8x compression)
+- **Community validated**: r/LocalLLaMA findings integrated
+
+### Integer-Domain Attention (v0.7)
+
+The single biggest performance breakthrough: instead of dequantizing Q4 keys to FP32,
+quantize the query to Q8 and compute integer dot products directly.
+
+```
+Before (v0.6): Q4 key → dequantize → FP32 dot  = 0.49x vs FP32 (SLOWER)
+After  (v0.7): Q4 key × Q8 query → integer dot = 2.9-4.8x vs FP32 (FASTER)
+```
+
+Fair NEON-vs-NEON benchmark (Apple M-series, median of 7 runs):
+- dim=128, seq=2048: FP32 22.8μs → Int Q4×Q8 7.8μs (2.9x)
+- dim=256, seq=2048: FP32 57.7μs → Int Q4×Q8 12.5μs (4.6x)
+- Larger head_dim benefits more (Q4 data fits in L1 cache)
+
 ### Core Library
 - 7 quantization types: PolarQuant (3/4b), QJL (1b), TurboQuant (3/4b), Uniform (2/4b)
 - Direct attention kernels: QJL Hamming distance, PolarQuant cos/sin LUT (no dequantization needed)
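The changelog entry compresses the whole idea into two lines, so here is a minimal sketch of the query-side half: quantizing one FP32 query row to Q8 with a single symmetric scale. The function name and scaling scheme are illustrative assumptions, not the library's actual API.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative sketch (not the library API): symmetric per-row Q8 quantization
// of the query, so that dot_fp32(q, k) ≈ q_scale * k_scale * dot_int(q8, k4).
// The query is quantized ONCE per decode step and reused against every cached
// key, which is why this cost amortizes to almost nothing.
float quantize_query_q8(const float* q, int8_t* q8, int dim) {
    float amax = 0.0f;
    for (int i = 0; i < dim; ++i) amax = std::max(amax, std::fabs(q[i]));
    const float scale = (amax > 0.0f) ? amax / 127.0f : 1.0f;
    for (int i = 0; i < dim; ++i)
        q8[i] = static_cast<int8_t>(std::lrintf(q[i] / scale));
    return scale;  // kept around to rescale integer scores back to FP32
}
```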

README.ko.md

Lines changed: 23 additions & 6 deletions
@@ -78,18 +78,35 @@ ctest --test-dir build
 
 ---
 
-## Performance Numbers
+## Performance: Integer-Domain Attention — 2.9-4.8x Faster Than FP32
 
-Measured on Apple M-series (ARM NEON):
+**The key point**: instead of dequantizing keys to FP32, quantize the query to INT8 and compute Q4×Q8 integer dot products with ARM `vdotq_s32`. It is not just smaller; it is **faster than FP32**.
+
+Apple M-series (ARM NEON), fair NEON-vs-NEON comparison, median of 7 runs:
+
+```
+dim    seq    | FP32 NEON  | Int Q4×Q8  | Speedup
+────── ────── | ────────── | ────────── | ───────
+128    512    | 5.6 μs     | 2.0 μs     | 2.9x
+128    2048   | 22.8 μs    | 7.8 μs     | 2.9x
+128    8192   | 91.8 μs    | 31.6 μs    | 2.9x
+256    512    | 15.0 μs    | 3.1 μs     | 4.8x
+256    2048   | 57.7 μs    | 12.5 μs    | 4.6x
+```
+
+**Why it is faster**: Q4 data is 8x smaller, so it fits in L1 cache and the memory bandwidth bottleneck disappears. A single `vdotq_s32` computes 16 int8×int8 products, the same per-instruction throughput as `vfmaq_f32` but on 8x denser data.
+
+> Google reports 8x on an H100 GPU. The 2.9-4.8x on an Apple Silicon CPU is achieved by the same principle: minimizing data movement.
+
+### Overall Metrics
 
 | Metric | Value |
 |--------|-------|
+| **Attention speedup** | **2.9-4.8x** (vs FP32, integer Q4×Q8) |
 | Quantize throughput | **1.4 M elements/ms** |
-| Attention throughput | **137 K queries/sec** |
 | Compression ratio | **7.53x** (uniform_4b) |
-| SIMD speedup (NEON) | **4.0x** (vs generic) |
-| Roundtrip MSE | **0.0014** (target < 0.01) |
-| Attention cosine | **0.998** (synthetic), **0.991** (real model) |
+| Roundtrip MSE | **0.0013** |
+| Attention cosine | **0.994** (real Qwen3.5-0.8B) |
 
 ---
 
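To make the `vdotq_s32` claim concrete, the inner loop for one Q8 query row against one key row (already widened from Q4 nibbles to int8) could look like the sketch below. This is a hypothetical helper assuming `dim % 16 == 0`, symmetric scales, and a target with the ARM dot-product extension; it is not the library's kernel.

```cpp
#include <arm_neon.h>
#include <cstdint>

// Sketch only: Q4×Q8 integer dot product after the key is widened to int8.
// Each vdotq_s32 folds 16 int8×int8 products into four int32 accumulators.
float int_dot_q4q8(const int8_t* q8_query, const int8_t* s8_key,
                   int dim, float q_scale, float k_scale) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < dim; i += 16) {
        int8x16_t q = vld1q_s8(q8_query + i);  // 16 query values
        int8x16_t k = vld1q_s8(s8_key + i);    // 16 key values
        acc = vdotq_s32(acc, q, k);            // 4 lanes × 4-way dot product
    }
    // One FP32 multiply per score restores the real magnitude.
    return static_cast<float>(vaddvq_s32(acc)) * q_scale * k_scale;
}
```

Apple Silicon supports the dot-product extension out of the box; on other ARM targets, something like `-march=armv8.2-a+dotprod` is needed for `vdotq_s32` to be available.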

README.md

Lines changed: 24 additions & 5 deletions
@@ -129,16 +129,35 @@ Install: `pip install -e bindings/python`
 
 ## Performance
 
-Measured on Apple M-series (ARM NEON):
+### Integer-Domain Attention: 2.9-4.8x Faster Than FP32
+
+**The breakthrough**: instead of dequantizing keys to FP32, we quantize the query to INT8 and compute Q4×Q8 integer dot products using ARM `vdotq_s32`. This is **faster than FP32**, not just smaller.
+
+Measured on Apple M-series (ARM NEON), fair NEON-vs-NEON comparison, median of 7 runs:
+
+```
+dim    seq    | FP32 NEON  | Int Q4×Q8  | Speedup
+────── ────── | ────────── | ────────── | ───────
+128    512    | 5.6 μs     | 2.0 μs     | 2.9x
+128    2048   | 22.8 μs    | 7.8 μs     | 2.9x
+128    8192   | 91.8 μs    | 31.6 μs    | 2.9x
+256    512    | 15.0 μs    | 3.1 μs     | 4.8x
+256    2048   | 57.7 μs    | 12.5 μs    | 4.6x
+```
+
+**Why it's faster**: Q4 data is 8x smaller → fits in L1 cache → no memory bandwidth bottleneck. A single `vdotq_s32` computes 16 int8×int8 products, the same per-instruction throughput as `vfmaq_f32` but on 8x denser data.
+
+> Google reports 8x on H100 GPU. Our CPU result of 2.9-4.8x on Apple Silicon is achieved via the same principle: reduced data movement.
+
+### Overall Metrics
 
 | Metric | Value |
 |--------|-------|
+| **Attention speedup** | **2.9-4.8x** faster than FP32 (integer Q4×Q8) |
 | Quantize throughput | **1.4 M elements/ms** |
-| Attention throughput | **137 K queries/sec** |
 | Compression ratio | **7.53x** (uniform_4b) |
-| SIMD speedup (NEON) | **4.0x** vs generic |
-| Roundtrip MSE | **0.0014** (target < 0.01) |
-| Attention cosine | **0.998** (synthetic), **0.991** (real model) |
+| Roundtrip MSE | **0.0013** |
+| Attention cosine | **0.994** (real Qwen3.5-0.8B) |
 
 ---
 
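One step both README sections leave implicit: Q4 keys pack two 4-bit codes per byte, so the kernel has to widen nibbles to int8 before `vdotq_s32` can consume them. A minimal NEON sketch, assuming an offset-8 packed layout (an illustrative assumption; the library's actual Q4 packing may differ):

```cpp
#include <arm_neon.h>
#include <cstdint>

// Assumed layout (illustrative): each byte stores two unsigned nibbles and
// code - 8 gives the signed Q4 value. 16 packed bytes → 32 int8 values.
static inline void unpack_q4_to_s8(const uint8_t* packed,
                                   int8x16_t* lo, int8x16_t* hi) {
    uint8x16_t raw  = vld1q_u8(packed);
    uint8x16_t lo_u = vandq_u8(raw, vdupq_n_u8(0x0F));  // low nibbles
    uint8x16_t hi_u = vshrq_n_u8(raw, 4);               // high nibbles
    const int8x16_t bias = vdupq_n_s8(8);               // re-center to [-8, 7]
    *lo = vsubq_s8(vreinterpretq_s8_u8(lo_u), bias);
    *hi = vsubq_s8(vreinterpretq_s8_u8(hi_u), bias);
}
```

This widening costs a handful of bitwise ops per 32 values, so it stays negligible next to the 8x reduction in bytes loaded, which is where the speedup actually comes from.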
