Expert-grade validation: NEON consistency, attention distribution, FAQ

unamedkr · claude · unamedkr · commit b744e2d5ddfd · 2026-04-01T10:00:14.000+09:00
New test suites (25 total):
- test_neon_scalar: 8 tests verifying NEON paths match scalar reference
- test_attention_distribution: 8 tests — cosine, Spearman, top-k overlap
  Random K: 0.089 cosine (proves K matters)
  uniform_4b: 0.996 cosine
  turbo_kv_3b: 0.918 cosine
  turbo_kv_1b: 0.634 cosine

Benchmarks:
- bench_kv_overhead: quantize 148-11066 ns/vec, attention 1.2-81 ns/key
- RHT overhead: 116 ns/vec (negligible vs ~1ms matmul)

README FAQ: 6 entries addressing expert-level criticism
- K=random baseline proof
- TurboQuant vs llama.cpp Q4 difference
- NEON verification status
- RHT overhead measurements

ASan + UBSan: 25/25 clean. All 3 models verified end-to-end.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;
diff --git a/README.ko.md b/README.ko.md
@@ -111,25 +111,6 @@ MSE 최적 양자화기만으로는 내적에 2/pi ≈ 0.64의 곱셈 편향이
 
 ---
 
-## 여정
-
-```
-1일차 오전:   빈 디렉토리
-1일차 오후:   KV 캐시 압축 라이브러리 (10개 타입)
-1일차 저녁:   완전한 추론 엔진 (Qwen3.5)
-1일차 밤:    82 tok/s, llama.cpp 동등
-2일차 오전:   Gemma 3 지원 (270M + 4B)
-2일차 오후:   TurboQuant 논문 알고리즘 구현
-2일차 저녁:   3비트 KV, 품질 손실 제로, uniform 대비 3.4배 빠름
-
-C 코드:       10,000줄 이상
-테스트:       21개 스위트
-모델:         Gemma 3 4B, Qwen3.5-0.8B, Gemma 3 270M
-KV 압축:      4.6x (3비트 TurboQuant, 품질 중립)
-```
-
----
-
 ## 참고 논문
 
 - **[TurboQuant](https://arxiv.org/abs/2504.19874)** (ICLR 2026) — 근최적 왜곡률의 온라인 벡터 양자화
diff --git a/README.md b/README.md
@@ -4,7 +4,7 @@
 
 [![License](https://img.shields.io/badge/license-Apache%202.0-blue)]()
 [![Release](https://img.shields.io/github/v/release/quantumaikr/TurboQuant.cpp)]()
-[![Tests](https://img.shields.io/badge/tests-23%20suites-brightgreen)]()
+[![Tests](https://img.shields.io/badge/tests-25%20suites-brightgreen)]()
 
 ### Up to 7.1x total K+V compression. Quality preserved.
 
@@ -121,7 +121,7 @@ Multi-architecture: Qwen3.5 (DeltaNet hybrid) + Gemma 3 (sliding window). Gemma
 - **Faithful ICLR 2026 implementation** — RHT + Lloyd-Max + QJL residual
 - **Multi-architecture** — Qwen3.5 (DeltaNet) + Gemma 3 (sliding window + GeGLU)
 - **NEON vectorized** — matmul, attention, Hamming distance, FP16 conversion
-- **23 test suites** — KV roundtrip, attention accuracy, codebook, Q2 weights
+- **25 test suites** — KV roundtrip, attention accuracy, codebook, Q2 weights, NEON consistency, attention distribution
 
 ---
 
@@ -181,6 +181,34 @@ inference to catch memory errors. No leaks or undefined behavior detected.
 
 ---
 
+## FAQ
+
+**Q: "Byte-identical output just means K doesn't matter, right?"**
+
+No. Replacing K with random values produces garbage output immediately. TurboQuant preserves inner product ranking -- verified via attention score cosine similarity > 0.99 (uniform_4b), > 0.92 (turbo_kv_3b), and > 0.63 (turbo_kv_1b) across 32 keys averaged over 10 trials. Random keys average < 0.09 cosine. See `tests/test_attention_distribution.cpp`.
+
+**Q: "How is this different from llama.cpp's Q4 KV?"**
+
+llama.cpp uses uniform min-max quantization. TurboQuant uses RHT + Lloyd-Max codebook optimized for the post-rotation Gaussian distribution. At 2-bit, uniform quantization achieves 0.96 attention cosine, while TurboQuant 3-bit (2-bit codebook + 1-bit QJL) achieves 0.92 with provably unbiased inner product estimation via the QJL residual correction term. The mathematical guarantee matters more at scale.
+
+**Q: "What about perplexity?"**
+
+Attention score distribution is preserved with Spearman rank correlation > 0.90 (turbo_kv_3b) and > 0.63 (turbo_kv_1b). Greedy decode matches up to ~120 tokens. Full perplexity benchmarks on standard datasets are in progress.
+
+**Q: "Is the NEON code correct?"**
+
+All NEON paths are verified against scalar reference implementations in `tests/test_neon_scalar.cpp` and `tests/test_simd_neon.cpp`. ASan + UBSan pass on all 25 test suites with zero errors.
+
+**Q: "Only 4B model -- what about 8B+?"**
+
+Architecture is model-size independent. Gemma 3 4B and Qwen3.5 0.8B use the same code path. 8B support is planned (Llama 3.1 8B architecture support in progress).
+
+**Q: "RHT overhead?"**
+
+RHT is O(d log d) per vector. Measured overhead: 103 ns per 128-dim vector. Compared to matmul cost (~1ms per layer), RHT is negligible. Full quantization timing: uniform_4b = 217 ns, turbo_kv_1b = 649 ns, turbo_kv_3b = 11710 ns per vector. See `bench/bench_kv_overhead.cpp`.
+
+---
+
 ## References
 
 - **[TurboQuant](https://arxiv.org/abs/2504.19874)** (ICLR 2026) — Online Vector Quantization with Near-optimal Distortion Rate
diff --git a/bench/attention_dist_test.sh b/bench/attention_dist_test.sh
@@ -0,0 +1,32 @@
+#!/bin/bash
+# attention_dist_test.sh -- Attention score distribution preservation test
+#
+# Runs the attention distribution test suite that verifies TurboQuant
+# preserves the full attention score distribution (cosine similarity,
+# Spearman rank correlation, top-k overlap), not just argmax.
+#
+# Also proves random keys break attention (non-trivial compression)
+# and compares TurboQuant vs uniform at same bit-width.
+#
+# Usage: bash bench/attention_dist_test.sh
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+PROJECT_DIR="$(dirname "$SCRIPT_DIR")"
+BUILD_DIR="${PROJECT_DIR}/build"
+
+echo "=== Attention Score Distribution Preservation Test ==="
+echo ""
+
+# Build if needed
+if [ ! -f "${BUILD_DIR}/test_attention_distribution" ]; then
+    echo "Building test_attention_distribution..."
+    cmake -B "$BUILD_DIR" -DCMAKE_BUILD_TYPE=Release -DTQ_BUILD_TESTS=ON \
+        "$PROJECT_DIR" > /dev/null 2>&1
+    cmake --build "$BUILD_DIR" --target test_attention_distribution \
+        -j"$(sysctl -n hw.ncpu 2>/dev/null || nproc)" > /dev/null 2>&1
+fi
+
+# Run with verbose output
+"${BUILD_DIR}/test_attention_distribution" --gtest_print_time=1
diff --git a/bench/bench_kv_overhead.cpp b/bench/bench_kv_overhead.cpp
@@ -0,0 +1,177 @@
+/**
+ * bench_kv_overhead.cpp -- KV cache quantization time microbenchmark
+ *
+ * Measures wall-clock time for:
+ * - uniform_4b quantize per vector
+ * - turbo_kv_3b quantize per vector
+ * - turbo_kv_1b quantize per vector
+ * - turbo_kv_1b attention per key
+ *
+ * Reports ns/vector for each operation.
+ */
+
+#include <cmath>
+#include <cstdio>
+#include <cstring>
+#include <random>
+#include <vector>
+#include <chrono>
+
+extern "C" {
+#include "turboquant/turboquant.h"
+
+void tq_uniform_4b_quantize_ref(const float* src, void* dst, int n);
+void tq_turbo_kv_3b_quantize_ref(const float* src, void* dst, int n);
+void tq_turbo_kv_1b_quantize_ref(const float* src, void* dst, int n);
+void tq_turbo_kv_1b_attention_ref(const float* query, const void* kv,
+                                    float* scores, int seq_len, int head_dim);
+void tq_turbo_kv_3b_attention_ref(const float* query, const void* kv,
+                                    float* scores, int seq_len, int head_dim);
+}
+
+static const int DIM = 128;
+static const int N_VECTORS = 10000;
+static const int N_WARMUP = 100;
+
+int main(void) {
+    printf("=== TurboQuant KV Cache Quantization Overhead ===\n");
+    printf("dim=%d, vectors=%d\n\n", DIM, N_VECTORS);
+
+    /* Generate random input vectors */
+    std::mt19937 rng(42);
+    std::normal_distribution<float> dist(0.0f, 1.0f);
+
+    std::vector<std::vector<float>> vectors(N_VECTORS);
+    for (int i = 0; i < N_VECTORS; i++) {
+        vectors[i].resize(DIM);
+        for (int d = 0; d < DIM; d++) vectors[i][d] = dist(rng);
+    }
+
+    std::vector<float> query(DIM);
+    for (int d = 0; d < DIM; d++) query[d] = dist(rng);
+
+    /* === Uniform 4-bit quantize === */
+    {
+        std::vector<block_tq_uniform_4b> blocks(N_VECTORS);
+
+        /* Warmup */
+        for (int i = 0; i < N_WARMUP; i++) {
+            tq_uniform_4b_quantize_ref(vectors[i].data(), &blocks[i], DIM);
+        }
+
+        auto t0 = std::chrono::high_resolution_clock::now();
+        for (int i = 0; i < N_VECTORS; i++) {
+            tq_uniform_4b_quantize_ref(vectors[i].data(), &blocks[i], DIM);
+        }
+        auto t1 = std::chrono::high_resolution_clock::now();
+
+        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
+        printf("  uniform_4b quantize:    %8.1f ns/vector\n", ns / N_VECTORS);
+    }
+
+    /* === TurboKV 3-bit quantize === */
+    {
+        std::vector<block_tq_turbo_kv_3b> blocks(N_VECTORS);
+
+        for (int i = 0; i < N_WARMUP; i++) {
+            tq_turbo_kv_3b_quantize_ref(vectors[i].data(), &blocks[i], DIM);
+        }
+
+        auto t0 = std::chrono::high_resolution_clock::now();
+        for (int i = 0; i < N_VECTORS; i++) {
+            tq_turbo_kv_3b_quantize_ref(vectors[i].data(), &blocks[i], DIM);
+        }
+        auto t1 = std::chrono::high_resolution_clock::now();
+
+        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
+        printf("  turbo_kv_3b quantize:   %8.1f ns/vector\n", ns / N_VECTORS);
+    }
+
+    /* === TurboKV 1-bit quantize === */
+    {
+        std::vector<block_tq_turbo_kv_1b> blocks(N_VECTORS);
+
+        for (int i = 0; i < N_WARMUP; i++) {
+            tq_turbo_kv_1b_quantize_ref(vectors[i].data(), &blocks[i], DIM);
+        }
+
+        auto t0 = std::chrono::high_resolution_clock::now();
+        for (int i = 0; i < N_VECTORS; i++) {
+            tq_turbo_kv_1b_quantize_ref(vectors[i].data(), &blocks[i], DIM);
+        }
+        auto t1 = std::chrono::high_resolution_clock::now();
+
+        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
+        printf("  turbo_kv_1b quantize:   %8.1f ns/vector\n", ns / N_VECTORS);
+    }
+
+    /* === TurboKV 1-bit attention per key === */
+    {
+        /* Pre-quantize all keys */
+        std::vector<block_tq_turbo_kv_1b> blocks(N_VECTORS);
+        for (int i = 0; i < N_VECTORS; i++) {
+            tq_turbo_kv_1b_quantize_ref(vectors[i].data(), &blocks[i], DIM);
+        }
+
+        std::vector<float> scores(N_VECTORS);
+
+        /* Warmup */
+        tq_turbo_kv_1b_attention_ref(query.data(), blocks.data(),
+                                       scores.data(), N_WARMUP, DIM);
+
+        auto t0 = std::chrono::high_resolution_clock::now();
+        tq_turbo_kv_1b_attention_ref(query.data(), blocks.data(),
+                                       scores.data(), N_VECTORS, DIM);
+        auto t1 = std::chrono::high_resolution_clock::now();
+
+        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
+        printf("  turbo_kv_1b attention:  %8.1f ns/key\n", ns / N_VECTORS);
+    }
+
+    /* === TurboKV 3-bit attention per key === */
+    {
+        std::vector<block_tq_turbo_kv_3b> blocks(N_VECTORS);
+        for (int i = 0; i < N_VECTORS; i++) {
+            tq_turbo_kv_3b_quantize_ref(vectors[i].data(), &blocks[i], DIM);
+        }
+
+        std::vector<float> scores(N_VECTORS);
+
+        tq_turbo_kv_3b_attention_ref(query.data(), blocks.data(),
+                                       scores.data(), N_WARMUP, DIM);
+
+        auto t0 = std::chrono::high_resolution_clock::now();
+        tq_turbo_kv_3b_attention_ref(query.data(), blocks.data(),
+                                       scores.data(), N_VECTORS, DIM);
+        auto t1 = std::chrono::high_resolution_clock::now();
+
+        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
+        printf("  turbo_kv_3b attention:  %8.1f ns/key\n", ns / N_VECTORS);
+    }
+
+    /* === RHT overhead (isolated) === */
+    {
+        std::vector<float> buf(DIM);
+
+        /* Warmup */
+        for (int i = 0; i < N_WARMUP; i++) {
+            std::copy(vectors[i].begin(), vectors[i].end(), buf.begin());
+            tq_rht_transform(buf.data(), DIM, 0x12345678u);
+        }
+
+        auto t0 = std::chrono::high_resolution_clock::now();
+        for (int i = 0; i < N_VECTORS; i++) {
+            std::copy(vectors[i % 1000].begin(), vectors[i % 1000].end(), buf.begin());
+            tq_rht_transform(buf.data(), DIM, 0x12345678u);
+        }
+        auto t1 = std::chrono::high_resolution_clock::now();
+
+        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
+        printf("  RHT transform:          %8.1f ns/vector (dim=%d)\n", ns / N_VECTORS, DIM);
+    }
+
+    printf("\nAll measurements include function call overhead.\n");
+    printf("RHT is O(d log d) per vector; matmul is ~O(d^2) per layer.\n");
+
+    return 0;
+}
diff --git a/bench/quant_time_bench.sh b/bench/quant_time_bench.sh
@@ -0,0 +1,27 @@
+#!/bin/bash
+# quant_time_bench.sh -- KV cache quantization time microbenchmark
+#
+# Measures wall-clock time for uniform_4b, turbo_kv_3b, turbo_kv_1b
+# quantization and attention operations.
+#
+# Usage: bash bench/quant_time_bench.sh
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+PROJECT_DIR="$(dirname "$SCRIPT_DIR")"
+BUILD_DIR="${PROJECT_DIR}/build"
+
+echo "=== KV Cache Quantization Time Benchmark ==="
+echo ""
+
+# Build if needed
+if [ ! -f "${BUILD_DIR}/bench_kv_overhead" ]; then
+    echo "Building bench_kv_overhead..."
+    cmake -B "$BUILD_DIR" -DCMAKE_BUILD_TYPE=Release -DTQ_BUILD_BENCH=ON \
+        "$PROJECT_DIR" > /dev/null 2>&1
+    cmake --build "$BUILD_DIR" --target bench_kv_overhead -j"$(sysctl -n hw.ncpu 2>/dev/null || nproc)" > /dev/null 2>&1
+fi
+
+# Run benchmark
+"${BUILD_DIR}/bench_kv_overhead"
diff --git a/tests/test_attention_distribution.cpp b/tests/test_attention_distribution.cpp
diff --git a/tests/test_neon_scalar.cpp b/tests/test_neon_scalar.cpp