bench(splat3d): re-run EWA-SYRK crossover at target-cpu=x86-64-v4

claude · claude · commit 2a35416c2cb8 · 2026-05-26T04:15:39.000Z
Correction: the prior RESULTS reported v3 numbers and wrongly attributed AVX-512 to runtime dispatch. F32x16 is compile-time-selected by target-cpu, so v3 measured AVX2.

Benches must run at the project's deployment tier v4 (AVX-512 native, F32x16 = __m512). The committed .cargo/config.toml stays v3 for GitHub/CI portability and is overridden locally via RUSTFLAGS=-Ctarget-cpu=x86-64-v4.

v4 numbers (Melem/s) at 1k/100k/1M:
  simd_x16    175/170/172
  scalar       85/76/82
  gemm_shape   90/85/87

Verdict unchanged and tier-robust (v3 within ~5%): simd_x16 is ~2x over both scalar and the BLAS-shape kernel, with no crossover. The EWA-SYRK backend is a pessimization at 3x3.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
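
The distinction behind this correction (a type whose backing is fixed by target-cpu at compile time, versus a feature probed on the host at run time) can be sketched as below. This is a minimal illustration, not code from the repository: `f32x16_backing` and `host_has_avx512f` are hypothetical helpers, and it assumes a toolchain on which `avx512f` is visible to `cfg!` (stable Rust 1.89 or later, as far as I know).

```rust
/// Hypothetical helper (not from the repository): reports which register
/// layout a 16-lane f32 vector type would be backed by. The branch is
/// resolved at compile time from the enabled target features, so a binary
/// built with target-cpu=x86-64-v3 takes the AVX2 arm on any host.
pub const fn f32x16_backing() -> &'static str {
    if cfg!(target_feature = "avx512f") {
        "1 x __m512 (AVX-512)"
    } else if cfg!(target_feature = "avx2") {
        "2 x __m256 (AVX2)"
    } else {
        "fallback"
    }
}

/// Runtime probe of the host CPU, for contrast. It can return true even
/// when the compile-time branch above selected the AVX2 layout.
#[cfg(target_arch = "x86_64")]
pub fn host_has_avx512f() -> bool {
    std::arch::is_x86_feature_detected!("avx512f")
}

/// On other architectures there is no x86 feature to detect.
#[cfg(not(target_arch = "x86_64"))]
pub fn host_has_avx512f() -> bool {
    false
}
```

Under these assumptions, building with `RUSTFLAGS=-Ctarget-cpu=x86-64-v4` flips `f32x16_backing()` to the `__m512` arm, while `host_has_avx512f()` depends only on the machine the binary runs on.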
diff --git a/benches/RESULTS.md b/benches/RESULTS.md
@@ -133,33 +133,37 @@ backend" — holds for the **3×3** EWA sandwich `Σ' = M·Σ·Mᵀ`.
 
 ### Hardware / build
 
-Container, default features, `.cargo/config.toml` `target-cpu=x86-64-v3`
-(AVX2 baseline); `sandwich_x16`'s per-function `#[target_feature(avx512f)]`
-+ LazyLock runtime dispatch selects AVX-512 on this host. No RUSTFLAGS.
+Container, AVX-512F+BW+VL. The committed `.cargo/config.toml` pins
+`target-cpu=x86-64-v3` (for GitHub/CI portability); **benches are run at the
+project's deployment tier `x86-64-v4`** (AVX-512 native — `F32x16` is a
+single `__m512`), via the documented override:
 
 ```bash
-cargo bench --features splat3d --bench ewa_syrk_crossover
+RUSTFLAGS="-Ctarget-cpu=x86-64-v4" \
+  cargo bench --features splat3d --bench ewa_syrk_crossover
 ```
 
-### `M·N·Mᵀ` sandwich — three kernel shapes (Melem/s, higher = better)
+### `M·N·Mᵀ` sandwich — three kernel shapes (Melem/s, higher = better) @ v4
 
 | N | scalar | `simd_x16` | `gemm_shape` (BLAS-shape) |
 |---|---|---|---|
-| 1 024 | 88.9 | **179.4** | 90.5 |
-| 100 000 | 84.1 | **170.0** | 85.6 |
-| 1 000 000 | 85.6 | **164.5** | 86.8 |
+| 1 024 | 85.2 | **175.2** | 90.1 |
+| 100 000 | 76.3 | **169.6** | 85.4 |
+| 1 000 000 | 81.9 | **172.0** | 87.1 |
 
 `gemm_shape` = two dense 3×3 matmuls per element (the shape a per-matrix
-BLAS call imposes), **in-process, no FFI**.
+BLAS call imposes), **in-process, no FFI**. The v3 baseline is within ~5% of
+these v4 numbers (scalar at 100 000: ~10%) for this transpose-bound 6-field
+kernel, so the verdict is tier-robust.
 
-### `project_batch` end-to-end (Melem/s)
+### `project_batch` end-to-end @ v4
 
 | N | throughput |
 |---|---|
-| 1 024 | 21.7 |
-| 100 000 | ~19.2 |
+| 1 024 | 12.1 Melem/s (84 µs) |
 
-(full pipeline incl. scalar SH eval; the sandwich is a fraction of this.)
+(full pipeline incl. scalar `sh_eval_deg3` per visible gaussian — SH eval
+dominates; the covariance sandwich is a small fraction of this.)
 
 ### Verdict — BLAS backend NOT justified at 3×3