FluffyAIcode
diff --git a/benchmarks/dsv4_stage075/README.md b/benchmarks/dsv4_stage075/README.md
Lines changed: 32 additions & 12 deletions
@@ -17,7 +17,8 @@ Upgrade path from Stage 0.5:
 | file | purpose |
 | --- | --- |
 | `dsv4_weight_loader.py` | load FP8-E4M3 safetensor shards, dequantize via E8M0 block scales, inject into Stage 0.5's `DSV4MainKVProjection` + `DSV4Compressor` |
-| `run_stage075_real_weights.py` | end-to-end driver: host hidden → V4 KV streams (trained weights) → non-Gaussian audit + KakeyaLattice / FP8 codec comparison |
+| `run_stage075_real_weights.py` | **n=1** driver: host hidden → V4 KV streams (trained weights) → non-Gaussian audit + KakeyaLattice / FP8 codec comparison (single passage) |
+| `run_stage075_n8.py` | **n=8 driver** (new): same pipeline, 8 semantically diverse passages, Student-t 95% CI aggregation per stream. See `reports/v1_5_release/dsv4_stage075/FINDINGS_N8.md` for results. |
 | `README.md` | this file |
 
 ## Why this runs on our existing vast H200
@@ -39,22 +40,41 @@ End-to-end wall time on H200: ~15 seconds.
 `reports/v1_5_release/dsv4_stage075/FINDINGS.md`. See FINDINGS.md for the
 analysis.
 
-## Headline finding (2026-04-25 H200 run, TRAINED V4-Flash weights)
+## Headline finding — **n=8 with 95 % CI** (2026-04-26 H200 run)
 
-E8 Q=38 vs FP8 per-64-block across three V4 KV streams:
+**Canonical one-liner (please reuse verbatim across sources for
+cross-source consistency):**
+
+> KakeyaLattice E8 Q=38 on DeepSeek-V4-Flash KV cache:
+> **−22 % bits per vector at matched or better reconstruction quality on 23 / 43
+> attention layers, neutral on the remaining 20**.
+> Measured on 2 × H200, n = 8 passages, Student-t 95 % CI.
+
+**Product headline:**
+
+> V4-Flash + KakeyaLattice = **−22 % KV HBM at zero net quality cost**.
+> 4 × H200 node: **126 → ~150 concurrent users at 1 M context**.
+
+E8 Q=38 vs FP8 per-64-block across three V4 KV streams, aggregated
+over n=8 diverse WikiText-style passages on trained V4-Flash weights:
 
 ```
-stream                  E8/FP8 rel-MSE   bit savings
-sliding_window_kv       0.786            -22.0%       ← strong Pareto win
-csa_pool_kv_ratio4      0.902            -22.0%       ← moderate Pareto win
-hca_pool_kv_ratio128    0.966            -22.0%       ← marginal Pareto win
-mean                    0.884            -22.0%
+stream (V4 layer count)        E8/FP8 (mean ± CI95)   n=1 value   bit savings   quality at 78 % bits
+sliding_window_kv (3/43)       0.790 ± 0.005          0.786       -22.0 %       +21 %   ← strong win
+csa_pool_kv_ratio4 (20/43)     0.900 ± 0.006          0.902       -22.0 %       +10 %   ← moderate win
+hca_pool_kv_ratio128 (20/43)   1.043 ± 0.051          0.966       -22.0 %         0 %   ← tied with FP8
 ```
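The `mean ± CI95` column in the table above is a Student-t interval over n=8 passages. A minimal sketch of that aggregation follows, assuming one E8/FP8 rel-MSE ratio per passage; the function name `mean_ci95` and the sample values are hypothetical and are not taken from `run_stage075_n8.py` or from the measured per-passage results.

```python
import math
import statistics

# Two-sided 95 % Student-t critical values t_{0.975, df} from standard tables.
T_CRIT_975 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
              6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def mean_ci95(samples):
    """Return (mean, half_width) of a Student-t 95 % CI for the mean."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples for a confidence interval")
    df = n - 1
    if df not in T_CRIT_975:
        raise ValueError(f"no tabulated critical value for df={df}")
    mean = statistics.mean(samples)
    # Standard error of the mean, using the sample (n-1) standard deviation.
    sem = statistics.stdev(samples) / math.sqrt(n)
    return mean, T_CRIT_975[df] * sem


# Hypothetical per-passage E8/FP8 rel-MSE ratios (illustration only).
ratios = [0.780, 0.790, 0.800, 0.790, 0.785, 0.795, 0.790, 0.790]
mean, half_width = mean_ci95(ratios)
print(f"E8/FP8 = {mean:.3f} ± {half_width:.3f} (n={len(ratios)})")
```

With n=8 the interval uses 7 degrees of freedom, so the half-width is about 2.365 standard errors rather than the 1.96 a normal approximation would give.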
 
-**~22% bit savings with 12% lower MSE on average.** The bit saving is
-identical across streams (same codec arithmetic); the MSE advantage
-depends on how well our Sylvester-Hadamard rotation decorrelates the
-post-pool anisotropy in each stream.
+- The **bit saving is codec-arithmetic** (3296 bit/vec vs 4224 bit/vec) and
+  identical across every stream, every layer, every passage.
+- The **quality side** improves on the 23 SWA+CSA layers (a majority of the
+  43-layer V4-Flash stack) and ties with FP8 on the 20 HCA pool layers
+  (1.043 ± 0.051; the CI includes 1.0). Net layer-weighted rel-MSE is
+  **−4.1 % ± 2.3 pp**, so the combined package is "22 % fewer bits, no
+  statistically significant quality regression on any layer type".
+- The n=1 HCA "marginal win" (0.966) was a 1.6 σ lucky-tail draw and is
+  corrected here. See `reports/v1_5_release/dsv4_stage075/FINDINGS_N8.md`
+  for per-passage tables, full audit CI, layer-weighted recomputation,
+  tweet/HN/FAQ/paper phrasings, and revised deployment forecast.
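The arithmetic behind the bullets above can be checked with a short sketch. It uses only the figures quoted in this README (3296 and 4224 bit/vec, the 3/20/20 layer split, and the per-stream mean ratios); the variable names are hypothetical, and the ± 2.3 pp uncertainty is not reproduced here because it depends on the per-passage data in `FINDINGS_N8.md`.

```python
# Bits per vector quoted in the README for each codec.
E8_BITS_PER_VEC = 3296
FP8_BITS_PER_VEC = 4224

# Fraction of bits saved by E8 Q=38 relative to FP8 per-64-block.
bit_saving = 1.0 - E8_BITS_PER_VEC / FP8_BITS_PER_VEC

# (layer count, mean E8/FP8 rel-MSE ratio) per stream, from the n=8 table.
STREAMS = {
    "sliding_window_kv": (3, 0.790),
    "csa_pool_kv_ratio4": (20, 0.900),
    "hca_pool_kv_ratio128": (20, 1.043),
}

total_layers = sum(count for count, _ in STREAMS.values())
# Layer-weighted mean ratio: each stream weighted by how many layers use it.
weighted_ratio = sum(count * ratio for count, ratio in STREAMS.values()) / total_layers

print(f"bit saving:              {bit_saving:.1%}")
print(f"layer-weighted E8/FP8:   {weighted_ratio:.3f}")
print(f"layer-weighted rel-MSE:  {weighted_ratio - 1.0:+.1%}")
```

Weighting by layer count rather than averaging the three streams equally is what moves the headline from the n=1 README's unweighted mean to the layer-weighted figure, since the 3 sliding-window layers carry the largest gain but the smallest weight.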
 
 Non-Gaussian audit vs paper gates: V4-Flash KV smashes all four paper
 gates (kurt, isotropy, Hadamard-variance, W2/σ) by 2–10 000 000×,