
Commit a6a7d6a

unamedkr and claude committed
Document integer-domain attention breakthrough in README + CHANGELOG
Highlight v0.7 achievement: 2.9-4.8x faster than FP32 on CPU

- README.md: new "Integer-Domain Attention" performance section with benchmark table, technical explanation, Google comparison
- README.ko.md: Korean mirror with the same data
- CHANGELOG.md: "Highlights" section added at top

Key messaging:
- "2.9-4.8x faster than FP32" (honest, NEON-vs-NEON comparison)
- "Q4 data 8x smaller → L1 cache → no bandwidth bottleneck"
- "Same principle as Google's 8x on H100, applied to CPU"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent edc5053 commit a6a7d6a

3 files changed

Lines changed: 70 additions & 11 deletions


CHANGELOG.md

Lines changed: 23 additions & 0 deletions
@@ -2,6 +2,29 @@
 
 ## [0.1.0] — 2026-03-29
 
+### Highlights
+
+- **Integer-domain attention**: 2.9-4.8x faster than FP32 on Apple Silicon (ARM NEON `vdotq_s32`)
+- **Real model validated**: Qwen3.5-0.8B KV cache, cosine 0.994 (A+)
+- **8 quantization types** including mixed precision outlier and RHT pre-rotation
+- **K/V asymmetric**: independent key/value bit allocation (K4V2 = 9.8x compression)
+- **Community validated**: r/LocalLLaMA findings integrated
+
+### Integer-Domain Attention (v0.7)
+
+The single biggest performance breakthrough: instead of dequantizing Q4 keys to FP32,
+quantize the query to Q8 and compute integer dot products directly.
+
+```
+Before (v0.6): Q4 key → dequantize → FP32 dot  = 0.49x vs FP32 (SLOWER)
+After  (v0.7): Q4 key × Q8 query → integer dot = 2.9-4.8x vs FP32 (FASTER)
+```
+
+Fair NEON-vs-NEON benchmark (Apple M-series, median of 7 runs):
+- dim=128, seq=2048: FP32 22.8μs → Int Q4×Q8 7.8μs (2.9x)
+- dim=256, seq=2048: FP32 57.7μs → Int Q4×Q8 12.5μs (4.6x)
+- Larger head_dim benefits more (Q4 data fits in L1 cache)
+
 ### Core Library
 - 7 quantization types: PolarQuant (3/4b), QJL (1b), TurboQuant (3/4b), Uniform (2/4b)
 - Direct attention kernels: QJL Hamming distance, PolarQuant cos/sin LUT (no dequantization needed)
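The changelog entry compresses the whole idea into two lines, so here is a minimal sketch of the query-side half: quantizing one FP32 query row to Q8 with a single symmetric scale. The function name and scaling scheme are illustrative assumptions, not the library's actual API.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative sketch (not the library API): symmetric per-row Q8 quantization
// of the query, so that dot_fp32(q, k) ≈ q_scale * k_scale * dot_int(q8, k4).
// The query is quantized ONCE per decode step and reused against every cached
// key, which is why this cost amortizes to almost nothing.
float quantize_query_q8(const float* q, int8_t* q8, int dim) {
    float amax = 0.0f;
    for (int i = 0; i < dim; ++i) amax = std::max(amax, std::fabs(q[i]));
    const float scale = (amax > 0.0f) ? amax / 127.0f : 1.0f;
    for (int i = 0; i < dim; ++i)
        q8[i] = static_cast<int8_t>(std::lrintf(q[i] / scale));
    return scale;  // kept around to rescale integer scores back to FP32
}
```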

README.ko.md

Lines changed: 23 additions & 6 deletions
@@ -78,18 +78,35 @@ ctest --test-dir build
 
 ---
 
-## Performance Numbers
+## Performance: Integer-Domain Attention — 2.9-4.8x Faster Than FP32
 
-Measured on Apple M-series (ARM NEON):
+**The key point**: instead of dequantizing keys to FP32, quantize the query to INT8 and compute Q4×Q8 integer dot products with ARM `vdotq_s32`. It is not just smaller; it is **faster than FP32**.
+
+Apple M-series (ARM NEON), fair NEON-vs-NEON comparison, median of 7 runs:
+
+```
+dim    seq    | FP32 NEON  | Int Q4×Q8  | Speedup
+────── ────── | ────────── | ────────── | ───────
+128    512    | 5.6 μs     | 2.0 μs     | 2.9x
+128    2048   | 22.8 μs    | 7.8 μs     | 2.9x
+128    8192   | 91.8 μs    | 31.6 μs    | 2.9x
+256    512    | 15.0 μs    | 3.1 μs     | 4.8x
+256    2048   | 57.7 μs    | 12.5 μs    | 4.6x
+```
+
+**Why it is faster**: Q4 data is 8x smaller, so it fits in L1 cache and the memory bandwidth bottleneck disappears. A single `vdotq_s32` computes 16 int8×int8 products, the same per-instruction throughput as `vfmaq_f32` but on 8x denser data.
+
+> Google reports 8x on an H100 GPU. The 2.9-4.8x on an Apple Silicon CPU is achieved by the same principle: minimizing data movement.
+
+### Overall Metrics
 
 | Metric | Value |
 |--------|-------|
+| **Attention speedup** | **2.9-4.8x** (vs FP32, integer Q4×Q8) |
 | Quantize throughput | **1.4 M elements/ms** |
-| Attention throughput | **137 K queries/sec** |
 | Compression ratio | **7.53x** (uniform_4b) |
-| SIMD speedup (NEON) | **4.0x** (vs generic) |
-| Roundtrip MSE | **0.0014** (target < 0.01) |
-| Attention cosine | **0.998** (synthetic), **0.991** (real model) |
+| Roundtrip MSE | **0.0013** |
+| Attention cosine | **0.994** (real Qwen3.5-0.8B) |
 
 ---
 
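To make the `vdotq_s32` claim concrete, the inner loop for one Q8 query row against one key row (already widened from Q4 nibbles to int8) could look like the sketch below. This is a hypothetical helper assuming `dim % 16 == 0`, symmetric scales, and a target with the ARM dot-product extension; it is not the library's kernel.

```cpp
#include <arm_neon.h>
#include <cstdint>

// Sketch only: Q4×Q8 integer dot product after the key is widened to int8.
// Each vdotq_s32 folds 16 int8×int8 products into four int32 accumulators.
float int_dot_q4q8(const int8_t* q8_query, const int8_t* s8_key,
                   int dim, float q_scale, float k_scale) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < dim; i += 16) {
        int8x16_t q = vld1q_s8(q8_query + i);  // 16 query values
        int8x16_t k = vld1q_s8(s8_key + i);    // 16 key values
        acc = vdotq_s32(acc, q, k);            // 4 lanes × 4-way dot product
    }
    // One FP32 multiply per score restores the real magnitude.
    return static_cast<float>(vaddvq_s32(acc)) * q_scale * k_scale;
}
```

Apple Silicon supports the dot-product extension out of the box; on other ARM targets, something like `-march=armv8.2-a+dotprod` is needed for `vdotq_s32` to be available.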

README.md

Lines changed: 24 additions & 5 deletions
@@ -129,16 +129,35 @@ Install: `pip install -e bindings/python`
 
 ## Performance
 
-Measured on Apple M-series (ARM NEON):
+### Integer-Domain Attention: 2.9-4.8x Faster Than FP32
+
+**The breakthrough**: instead of dequantizing keys to FP32, we quantize the query to INT8 and compute Q4×Q8 integer dot products using ARM `vdotq_s32`. This is **faster than FP32**, not just smaller.
+
+Measured on Apple M-series (ARM NEON), fair NEON-vs-NEON comparison, median of 7 runs:
+
+```
+dim    seq    | FP32 NEON  | Int Q4×Q8  | Speedup
+────── ────── | ────────── | ────────── | ───────
+128    512    | 5.6 μs     | 2.0 μs     | 2.9x
+128    2048   | 22.8 μs    | 7.8 μs     | 2.9x
+128    8192   | 91.8 μs    | 31.6 μs    | 2.9x
+256    512    | 15.0 μs    | 3.1 μs     | 4.8x
+256    2048   | 57.7 μs    | 12.5 μs    | 4.6x
+```
+
+**Why it's faster**: Q4 data is 8x smaller → fits in L1 cache → no memory bandwidth bottleneck. A single `vdotq_s32` computes 16 int8×int8 products, the same per-instruction throughput as `vfmaq_f32` but on 8x denser data.
+
+> Google reports 8x on H100 GPU. Our CPU result of 2.9-4.8x on Apple Silicon is achieved via the same principle: reduced data movement.
+
+### Overall Metrics
 
 | Metric | Value |
 |--------|-------|
+| **Attention speedup** | **2.9-4.8x** faster than FP32 (integer Q4×Q8) |
 | Quantize throughput | **1.4 M elements/ms** |
-| Attention throughput | **137 K queries/sec** |
 | Compression ratio | **7.53x** (uniform_4b) |
-| SIMD speedup (NEON) | **4.0x** vs generic |
-| Roundtrip MSE | **0.0014** (target < 0.01) |
-| Attention cosine | **0.998** (synthetic), **0.991** (real model) |
+| Roundtrip MSE | **0.0013** |
+| Attention cosine | **0.994** (real Qwen3.5-0.8B) |
 
 ---
 
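One step both README sections leave implicit: Q4 keys pack two 4-bit codes per byte, so the kernel has to widen nibbles to int8 before `vdotq_s32` can consume them. A minimal NEON sketch, assuming an offset-8 packed layout (an illustrative assumption; the library's actual Q4 packing may differ):

```cpp
#include <arm_neon.h>
#include <cstdint>

// Assumed layout (illustrative): each byte stores two unsigned nibbles and
// code - 8 gives the signed Q4 value. 16 packed bytes → 32 int8 values.
static inline void unpack_q4_to_s8(const uint8_t* packed,
                                   int8x16_t* lo, int8x16_t* hi) {
    uint8x16_t raw  = vld1q_u8(packed);
    uint8x16_t lo_u = vandq_u8(raw, vdupq_n_u8(0x0F));  // low nibbles
    uint8x16_t hi_u = vshrq_n_u8(raw, 4);               // high nibbles
    const int8x16_t bias = vdupq_n_s8(8);               // re-center to [-8, 7]
    *lo = vsubq_s8(vreinterpretq_s8_u8(lo_u), bias);
    *hi = vsubq_s8(vreinterpretq_s8_u8(hi_u), bias);
}
```

This widening costs a handful of bitwise ops per 32 values, so it stays negligible next to the 8x reduction in bytes loaded, which is where the speedup actually comes from.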
