Commit 2a35416

bench(splat3d): re-run EWA-SYRK crossover at target-cpu=x86-64-v4
Correction: the prior RESULTS reported v3 numbers and wrongly attributed AVX-512 to runtime dispatch. F32x16 is compile-time-selected by target-cpu, so v3 measured AVX2.

Benches must run at the project's deployment tier v4 (AVX-512 native, F32x16 = __m512); committed .cargo/config.toml stays v3 for GitHub/CI portability, overridden locally via RUSTFLAGS=-Ctarget-cpu=x86-64-v4.

v4 numbers (Melem/s): simd_x16 175/170/172 vs scalar 85/76/82 vs gemm_shape 90/85/87 at 1k/100k/1M.

Verdict unchanged and tier-robust (v3 within ~5%): simd_x16 ~2x over both scalar and the BLAS-shape, no crossover. The EWA-SYRK backend is a pessimization at 3x3.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
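The distinction the message draws (a width fixed by `target-cpu` at build time versus a feature detected on the host at run time) can be illustrated with a minimal sketch. The function names `compiled_tier` and `host_has_avx512f` are hypothetical and are not part of this repository; the sketch only reports what the compiler was told and what the CPU advertises, and does not reproduce how `F32x16` is actually defined.

```rust
/// Tier the binary was *compiled* for. `cfg!` is resolved by the compiler
/// from `-C target-cpu` / `-C target-feature`, so the answer is baked into
/// the binary and does not change with the machine it later runs on.
/// (Hypothetical helper, for illustration only.)
pub const fn compiled_tier() -> &'static str {
    if cfg!(target_feature = "avx512f") {
        "avx512"
    } else if cfg!(target_feature = "avx2") {
        "avx2"
    } else {
        "baseline"
    }
}

/// What the *host* CPU supports, queried at run time.
/// A `true` here says nothing about which width compile-time-selected
/// types were built with.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn host_has_avx512f() -> bool {
    std::arch::is_x86_feature_detected!("avx512f")
}

/// Non-x86 targets never report AVX-512.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn host_has_avx512f() -> bool {
    false
}
```

Under this sketch, a build at `x86-64-v3` running on an AVX-512 host would be expected to report a non-`avx512` compiled tier while `host_has_avx512f()` returns `true`, which is the mismatch the correction describes.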
1 parent d798b97 commit 2a35416

1 file changed

Lines changed: 17 additions & 13 deletions

benches/RESULTS.md
@@ -133,33 +133,37 @@ backend" — holds for the **3×3** EWA sandwich `Σ' = M·Σ·Mᵀ`.
 
 ### Hardware / build
 
-Container, default features, `.cargo/config.toml` `target-cpu=x86-64-v3`
-(AVX2 baseline); `sandwich_x16`'s per-function `#[target_feature(avx512f)]`
-+ LazyLock runtime dispatch selects AVX-512 on this host. No RUSTFLAGS.
+Container, AVX-512F+BW+VL. The committed `.cargo/config.toml` pins
+`target-cpu=x86-64-v3` (for GitHub/CI portability); **benches are run at the
+project's deployment tier `x86-64-v4`** (AVX-512 native — `F32x16` is a
+single `__m512`), via the documented override:
 
 ```bash
-cargo bench --features splat3d --bench ewa_syrk_crossover
+RUSTFLAGS="-Ctarget-cpu=x86-64-v4" \
+cargo bench --features splat3d --bench ewa_syrk_crossover
 ```
 
-### `M·N·Mᵀ` sandwich — three kernel shapes (Melem/s, higher = better)
+### `M·N·Mᵀ` sandwich — three kernel shapes (Melem/s, higher = better) @ v4
 
 | N | scalar | `simd_x16` | `gemm_shape` (BLAS-shape) |
 |---|---|---|---|
-| 1 024 | 88.9 | **179.4** | 90.5 |
-| 100 000 | 84.1 | **170.0** | 85.6 |
-| 1 000 000 | 85.6 | **164.5** | 86.8 |
+| 1 024 | 85.2 | **175.2** | 90.1 |
+| 100 000 | 76.3 | **169.6** | 85.4 |
+| 1 000 000 | 81.9 | **172.0** | 87.1 |
 
 `gemm_shape` = two dense 3×3 matmuls per element (the shape a per-matrix
-BLAS call imposes), **in-process, no FFI**.
+BLAS call imposes), **in-process, no FFI**. The v3 baseline is within ~5% of
+these v4 numbers for this transpose-bound 6-field kernel — the verdict is
+tier-robust.
 
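For readers unfamiliar with the two shapes being compared, here is a minimal scalar sketch of the 3×3 sandwich `M·Σ·Mᵀ` written both ways: as two dense 3×3 products (the "gemm shape"), and as a symmetric form that emits only the six unique fields. All names (`Mat3`, `Sym6`, `sandwich_gemm_shape`, `sandwich_sym6`) are hypothetical; this is not the benchmarked `simd_x16` or `gemm_shape` code, and it makes no performance claim.

```rust
/// Row-major 3x3 matrix (hypothetical alias for this sketch).
pub type Mat3 = [[f32; 3]; 3];
/// Upper triangle of a symmetric 3x3 matrix: [xx, xy, xz, yy, yz, zz].
pub type Sym6 = [f32; 6];

/// Expand the six stored fields into a full symmetric matrix.
fn expand(s: &Sym6) -> Mat3 {
    [[s[0], s[1], s[2]], [s[1], s[3], s[4]], [s[2], s[4], s[5]]]
}

/// Dense 3x3 product a * b.
fn matmul3(a: &Mat3, b: &Mat3) -> Mat3 {
    let mut out = [[0.0f32; 3]; 3];
    for i in 0..3 {
        for j in 0..3 {
            for k in 0..3 {
                out[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    out
}

/// Dense 3x3 transpose.
fn transpose3(m: &Mat3) -> Mat3 {
    let mut t = [[0.0f32; 3]; 3];
    for i in 0..3 {
        for j in 0..3 {
            t[j][i] = m[i][j];
        }
    }
    t
}

/// "gemm shape": two dense products, (M * S) * M^T, computing all nine outputs.
pub fn sandwich_gemm_shape(m: &Mat3, s: &Sym6) -> Mat3 {
    let t = matmul3(m, &expand(s));
    matmul3(&t, &transpose3(m))
}

/// Symmetric form: same first product, but only the six unique outputs,
/// each a dot product of a row of (M * S) with a row of M.
pub fn sandwich_sym6(m: &Mat3, s: &Sym6) -> Sym6 {
    let t = matmul3(m, &expand(s));
    let dot = |i: usize, j: usize| t[i][0] * m[j][0] + t[i][1] * m[j][1] + t[i][2] * m[j][2];
    [dot(0, 0), dot(0, 1), dot(0, 2), dot(1, 1), dot(1, 2), dot(2, 2)]
}
```

Because the result of `M·Σ·Mᵀ` is symmetric whenever `Σ` is, the six-field form loses no information; the dense form simply computes three entries twice.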
-### `project_batch` end-to-end (Melem/s)
+### `project_batch` end-to-end @ v4
 
 | N | throughput |
 |---|---|
-| 1 024 | 21.7 |
-| 100 000 | ~19.2 |
+| 1 024 | 12.1 Melem/s (84 µs) |
 
-(full pipeline incl. scalar SH eval; the sandwich is a fraction of this.)
+(full pipeline incl. scalar `sh_eval_deg3` per visible gaussian — SH eval
+dominates; the covariance sandwich is a small fraction of this.)
 
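The new `project_batch` row quotes both a throughput and a wall time. The two are related by a simple identity (elements per microsecond equals millions of elements per second), sketched below with hypothetical helper names. Plugging in the table's values gives about 12.19 Melem/s for 84 µs, or about 84.6 µs for 12.1 Melem/s, so the two published figures agree to within what looks like rounding.

```rust
/// Throughput in Melem/s from an element count and a wall time in µs.
/// elements / µs == 1e6 elements / s, so no extra scaling is needed.
/// (Hypothetical helper, for illustration only.)
pub fn melem_per_s(n_elems: u64, elapsed_us: f64) -> f64 {
    n_elems as f64 / elapsed_us
}

/// Inverse: wall time in µs implied by a throughput in Melem/s.
pub fn micros_for(n_elems: u64, melem_s: f64) -> f64 {
    n_elems as f64 / melem_s
}
```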
 ### Verdict — BLAS backend NOT justified at 3×3
 
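The "~2x over both scalar and the BLAS-shape" summary in the commit message can be re-derived from the v4 table in this diff. The sketch below hard-codes those three rows and computes the two ratios per row; the constant and function names are hypothetical and the numbers are copied from the table, not re-measured.

```rust
/// v4 rows from the diff: (N, scalar, simd_x16, gemm_shape), all in Melem/s.
pub const V4_ROWS: [(u32, f64, f64, f64); 3] = [
    (1_024, 85.2, 175.2, 90.1),
    (100_000, 76.3, 169.6, 85.4),
    (1_000_000, 81.9, 172.0, 87.1),
];

/// Returns (simd_x16 / scalar, simd_x16 / gemm_shape) for one row.
pub fn speedups(row: &(u32, f64, f64, f64)) -> (f64, f64) {
    let (_, scalar, simd, gemm) = *row;
    (simd / scalar, simd / gemm)
}
```

With these inputs the ratio over scalar falls between roughly 2.06 and 2.22, and the ratio over `gemm_shape` between roughly 1.94 and 1.99, at all three sizes, which matches the "no crossover" reading.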