Skip to content

Commit 44255ed

Browse files
committed
bench(splat3d): EWA-SYRK crossover — kill-or-justify the BLAS-backend premise
Adds benches/ewa_syrk_crossover.rs (W1 PR #3 of the cross-session program). Tests the 3DGS-EWA-SYRK-BLAS-MKL plan's premise — "projection is a BLAS workload in disguise → MKL/OpenBLAS/AMX backend for the covariance sandwich" — with a number instead of an assertion. Compares M·N·Mᵀ three ways over N = 1k/100k/1M: scalar hand upper-triangle sandwich, per element simd_x16 shipped SoA F32x16 sandwich_x16 (the renderer's kernel) gemm_shape two dense 3×3 matmuls per element — the shape a per-matrix CPU BLAS call imposes, in-process (no FFI ⇒ lower bound on cblas) plus project_batch end-to-end throughput. Result (target-cpu=x86-64-v3, AVX-512 via runtime dispatch): simd_x16 ~2× faster than BOTH scalar and gemm_shape at every size; no crossover up to 1M; gemm_shape ≈ scalar. Real MKL/cblas adds FFI on top. Verdict: the EWA-SYRK *backend* is a pessimization at 3×3/2×3 — fused SoA SIMD already wins. The plan row is idea-only (the sandwich IS SYRK-shaped) but the actionable backend is killed by measurement. Corroborates PR-3's predicted 1.5-2× SIMD-vs-scalar. Full numbers + verdict in benches/RESULTS.md. Evidence-grounded per the program: source (project.rs/spd3.rs whole-read) + measurement, not the plan-as-authority. Steelman (shared-W batched GEMM) left as a documented follow-up. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
1 parent 9ca4004 commit 44255ed

3 files changed

Lines changed: 242 additions & 0 deletions

File tree

Cargo.toml

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -175,6 +175,11 @@ name = "splat3d_bench"
175175
harness = false
176176
required-features = ["splat3d"]
177177

178+
[[bench]]
179+
name = "ewa_syrk_crossover"
180+
harness = false
181+
required-features = ["splat3d"]
182+
178183
[features]
179184
default = ["std", "hpc-extras"]
180185

benches/RESULTS.md

Lines changed: 56 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -123,3 +123,59 @@ The demo binary `examples/splat3d_flex.rs` and integration test
123123
Full-pipeline frame-time numbers (p50/p95/p99) await a Inria bicycle
124124
scene download — left as a follow-up for the dedicated benchmarking
125125
session against real-world data.
126+
127+
## EWA-SYRK crossover — kill-or-justify the BLAS-backend premise
128+
129+
Bench: `benches/ewa_syrk_crossover.rs`. Tests whether the
130+
`3DGS-EWA-SYRK-BLAS-MKL` plan's premise — "projection is a BLAS workload
131+
in disguise → route the covariance sandwich through an MKL/OpenBLAS/AMX
132+
backend" — holds for the **3×3** EWA sandwich `Σ' = M·Σ·Mᵀ`.
133+
134+
### Hardware / build
135+
136+
Container, default features, `.cargo/config.toml` `target-cpu=x86-64-v3`
137+
(AVX2 baseline); `sandwich_x16`'s per-function `#[target_feature(avx512f)]`
138+
+ LazyLock runtime dispatch selects AVX-512 on this host. No RUSTFLAGS.
139+
140+
```bash
141+
cargo bench --features splat3d --bench ewa_syrk_crossover
142+
```
143+
144+
### `M·N·Mᵀ` sandwich — three kernel shapes (Melem/s, higher = better)
145+
146+
| N | scalar | `simd_x16` | `gemm_shape` (BLAS-shape) |
147+
|---|---|---|---|
148+
| 1 024 | 88.9 | **179.4** | 90.5 |
149+
| 100 000 | 84.1 | **170.0** | 85.6 |
150+
| 1 000 000 | 85.6 | **164.5** | 86.8 |
151+
152+
`gemm_shape` = two dense 3×3 matmuls per element (the shape a per-matrix
153+
BLAS call imposes), **in-process, no FFI**.
154+
155+
### `project_batch` end-to-end (Melem/s)
156+
157+
| N | throughput |
158+
|---|---|
159+
| 1 024 | 21.7 |
160+
| 100 000 | ~19.2 |
161+
162+
(full pipeline incl. scalar SH eval; the sandwich is a fraction of this.)
163+
164+
### Verdict — BLAS backend NOT justified at 3×3
165+
166+
- `gemm_shape` is statistically identical to `scalar` and **~2× slower than
167+
the shipped `simd_x16`** at every size 1k→1M. **No crossover**; the gap is
168+
flat, not closing with batch size.
169+
- `gemm_shape` carries **no FFI** — a real `cblas`/MKL call adds marshalling
170+
+ dispatch on top, so it can only be worse. There is no efficient CPU
171+
batched-3×3 SYRK (that pattern is a GPU one).
172+
- ⇒ The EWA-SYRK *backend* (native/MKL/OpenBLAS/AMX dispatch for the
173+
covariance sandwich) is a **pessimization** at 3×3/2×3: fused SoA SIMD
174+
already wins. The plan row is **idea-only** — the sandwich *is*
175+
SYRK-shaped (true) but the actionable backend is killed by measurement.
176+
- Corroborates PR-3's predicted "1.5-2× SIMD-vs-scalar": `simd_x16` is ~
177+
over scalar at large N (transpose amortised, unlike the transpose-bound
178+
N=16 PR-1 microbench).
179+
- Steelman left open: `W·Σ·Wᵀ` has a *shared* `W` across gaussians → a
180+
batched shared-`W` GEMM is the one form that could differ; benched as a
181+
follow-up. Per-gaussian `J·Σ·Jᵀ` does not batch that way.

benches/ewa_syrk_crossover.rs

Lines changed: 181 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,181 @@
1+
//! EWA-SYRK crossover bench — kill-or-justify the "3DGS projection is a
2+
//! BLAS workload in disguise" premise of `3DGS-EWA-SYRK-BLAS-MKL-plan.md`.
3+
//!
4+
//! # The question
5+
//!
6+
//! The plan proposes routing the EWA covariance sandwich `Σ' = M·Σ·Mᵀ`
7+
//! through a batched SYRK/GEMM backend (native / MKL / OpenBLAS / AMX). But
8+
//! the sandwich matrices are **3×3** (world→camera) and **2×3** (camera→
9+
//! image), and `splat3d::project` already runs them 16-wide via the
10+
//! `crate::simd` SoA polyfill. Batched *tiny* GEMM is exactly the regime
11+
//! where CPU BLAS loses to a fused SIMD unroll (per-call overhead dominates;
12+
//! there is no efficient CPU batched-3×3 SYRK — that pattern is a GPU one).
13+
//!
14+
//! This bench measures it instead of asserting it.
15+
//!
16+
//! # What it compares (same `M·N·Mᵀ` math, three kernel shapes)
17+
//!
18+
//! - `scalar` — the hand-unrolled upper-triangle `spd3::sandwich`, per element.
19+
//! - `simd_x16` — the shipped `sandwich_x16` SoA path (the renderer's kernel).
20+
//! - `gemm_shape` — two dense 3×3 matmuls per element (`M·N` then `·Mᵀ`),
21+
//! with no SoA fusion. This models the **shape a CPU BLAS
22+
//! backend imposes** — one small dense GEMM per matrix.
23+
//!
24+
//! `gemm_shape` runs **in-process, with no FFI**, so it is a *lower bound*
25+
//! on a real `cblas`/MKL path: a genuine BLAS call adds argument marshalling
26+
//! + dispatch overhead on top. If `gemm_shape` already loses to `simd_x16`,
27+
//! the BLAS backend loses by more.
28+
//!
29+
//! Plus `project_batch` end-to-end throughput at the same batch sizes, so
30+
//! the sandwich cost is seen in proportion to the full pipeline.
31+
//!
32+
//! Sweep: 1 024 / 100 000 / 1 000 000 (all exact multiples of 16, no tail).
33+
//!
34+
//! Caveat (honest scope): this measures the *per-element* kernel shape. The
35+
//! one genuinely batchable step is `W·Σ·Wᵀ`, where `W` (the camera view
36+
//! rotation) is shared across all gaussians — a shared-`W` batched GEMM is a
37+
//! steelman left as a follow-up. The per-image Jacobian `J` differs per
38+
//! gaussian, so `J·Σ·Jᵀ` does not batch that way.
39+
40+
use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};
41+
use ndarray::hpc::splat3d::{project_batch, sandwich, sandwich_x16, Camera, Gaussian3D, GaussianBatch, ProjectedBatch, Spd3};
42+
43+
const SIZES: [usize; 3] = [1_024, 100_000, 1_000_000];
44+
45+
#[inline]
46+
fn rng(s: &mut u32) -> f32 {
47+
*s ^= *s << 13;
48+
*s ^= *s >> 17;
49+
*s ^= *s << 5;
50+
(*s as f32) / (u32::MAX as f32)
51+
}
52+
53+
/// `n` distinct SPD pairs via `from_scale_quat` (entry-wise different across
54+
/// all 6 SoA channels so the SIMD transpose does real work).
55+
fn build_spd_pairs(n: usize) -> (Vec<Spd3>, Vec<Spd3>) {
56+
let mut st = 0x1234_5678u32;
57+
let mut ms = Vec::with_capacity(n);
58+
let mut ns = Vec::with_capacity(n);
59+
for _ in 0..n {
60+
let sm = [0.3 + 1.5 * rng(&mut st), 0.3 + 1.5 * rng(&mut st), 0.3 + 1.5 * rng(&mut st)];
61+
let sn = [0.3 + 1.5 * rng(&mut st), 0.3 + 1.5 * rng(&mut st), 0.3 + 1.5 * rng(&mut st)];
62+
let mut qm = [rng(&mut st) - 0.5, rng(&mut st) - 0.5, rng(&mut st) - 0.5, rng(&mut st) - 0.5];
63+
let mut qn = [rng(&mut st) - 0.5, rng(&mut st) - 0.5, rng(&mut st) - 0.5, rng(&mut st) - 0.5];
64+
normalize_quat(&mut qm);
65+
normalize_quat(&mut qn);
66+
ms.push(Spd3::from_scale_quat(sm, qm));
67+
ns.push(Spd3::from_scale_quat(sn, qn));
68+
}
69+
(ms, ns)
70+
}
71+
72+
fn normalize_quat(q: &mut [f32; 4]) {
73+
let n = (q[0] * q[0] + q[1] * q[1] + q[2] * q[2] + q[3] * q[3]).sqrt().max(1e-12);
74+
for v in q.iter_mut() {
75+
*v /= n;
76+
}
77+
}
78+
79+
/// `M·N·Mᵀ` via two dense 3×3 matmuls — the shape a per-matrix BLAS call
80+
/// imposes (no upper-triangle fusion, no SoA). Output upper triangle, off-
81+
/// diagonals averaged to match `sandwich`'s symmetrisation.
82+
#[inline]
83+
fn sandwich_gemm_shape(m: &Spd3, n: &Spd3) -> Spd3 {
84+
let a = m.to_rows();
85+
let b = n.to_rows();
86+
let mut t = [[0.0f32; 3]; 3];
87+
for i in 0..3 {
88+
for j in 0..3 {
89+
t[i][j] = a[i][0] * b[0][j] + a[i][1] * b[1][j] + a[i][2] * b[2][j];
90+
}
91+
}
92+
// R = T · Mᵀ (Mᵀ[k][j] = A[j][k])
93+
let mut r = [[0.0f32; 3]; 3];
94+
for i in 0..3 {
95+
for j in 0..3 {
96+
r[i][j] = t[i][0] * a[j][0] + t[i][1] * a[j][1] + t[i][2] * a[j][2];
97+
}
98+
}
99+
Spd3::new(r[0][0], 0.5 * (r[0][1] + r[1][0]), 0.5 * (r[0][2] + r[2][0]), r[1][1], 0.5 * (r[1][2] + r[2][1]), r[2][2])
100+
}
101+
102+
fn build_gaussians(n: usize) -> GaussianBatch {
103+
let mut st = 0x9E37_79B9u32;
104+
let mut batch = GaussianBatch::with_capacity(n);
105+
for _ in 0..n {
106+
let mut g = Gaussian3D::unit();
107+
g.mean = [4.0 * rng(&mut st) - 2.0, 4.0 * rng(&mut st) - 2.0, 1.0 + 5.0 * rng(&mut st)];
108+
g.scale = [0.05 + 0.2 * rng(&mut st); 3];
109+
g.quat = [1.0, 0.0, 0.0, 0.0];
110+
g.opacity = 0.5 + 0.5 * rng(&mut st);
111+
batch.push(g);
112+
}
113+
batch
114+
}
115+
116+
fn bench_sandwich_paths(c: &mut Criterion) {
117+
let mut grp = c.benchmark_group("ewa_sandwich_paths");
118+
grp.sample_size(20);
119+
for &n in &SIZES {
120+
let (ms, ns) = build_spd_pairs(n);
121+
grp.throughput(Throughput::Elements(n as u64));
122+
123+
grp.bench_with_input(BenchmarkId::new("scalar", n), &n, |b, &n| {
124+
let mut out = vec![Spd3::ZERO; n];
125+
b.iter(|| {
126+
for i in 0..n {
127+
out[i] = sandwich(black_box(&ms[i]), black_box(&ns[i]));
128+
}
129+
black_box(&out[n - 1]);
130+
});
131+
});
132+
133+
grp.bench_with_input(BenchmarkId::new("simd_x16", n), &n, |b, &n| {
134+
let mut out = vec![Spd3::ZERO; n];
135+
b.iter(|| {
136+
let mc = ms.chunks_exact(16);
137+
let nc = ns.chunks_exact(16);
138+
let oc = out.chunks_exact_mut(16);
139+
for ((mm, nn), oo) in mc.zip(nc).zip(oc) {
140+
let ma: &[Spd3; 16] = mm.try_into().unwrap();
141+
let na: &[Spd3; 16] = nn.try_into().unwrap();
142+
let oa: &mut [Spd3; 16] = oo.try_into().unwrap();
143+
sandwich_x16(black_box(ma), black_box(na), oa);
144+
}
145+
black_box(&out[n - 1]);
146+
});
147+
});
148+
149+
grp.bench_with_input(BenchmarkId::new("gemm_shape", n), &n, |b, &n| {
150+
let mut out = vec![Spd3::ZERO; n];
151+
b.iter(|| {
152+
for i in 0..n {
153+
out[i] = sandwich_gemm_shape(black_box(&ms[i]), black_box(&ns[i]));
154+
}
155+
black_box(&out[n - 1]);
156+
});
157+
});
158+
}
159+
grp.finish();
160+
}
161+
162+
fn bench_project_batch(c: &mut Criterion) {
163+
let mut grp = c.benchmark_group("ewa_project_batch");
164+
grp.sample_size(10);
165+
let cam = Camera::identity_at_origin(1920, 1080);
166+
for &n in &SIZES {
167+
let batch = build_gaussians(n);
168+
let mut out = ProjectedBatch::with_capacity(batch.capacity);
169+
grp.throughput(Throughput::Elements(n as u64));
170+
grp.bench_with_input(BenchmarkId::from_parameter(n), &n, |b, _| {
171+
b.iter(|| {
172+
project_batch(black_box(&batch), black_box(&cam), &mut out);
173+
black_box(out.len);
174+
});
175+
});
176+
}
177+
grp.finish();
178+
}
179+
180+
criterion_group!(ewa_syrk, bench_sandwich_paths, bench_project_batch);
181+
criterion_main!(ewa_syrk);

0 commit comments

Comments
 (0)