Commit e2ec430

Merge pull request #213 from AdaWorldAPI/claude/splat-native-ultrasound-v1-fixes
docs(splat-native): address review feedback on #212 (ray segmentation + Cholesky scratch buffer)
2 parents 481205a + 3c78901 commit e2ec430

1 file changed

Lines changed: 30 additions & 12 deletions

.claude/plans/splat-native-ultrasound-simd-substrate-v1.md

@@ -48,16 +48,18 @@ pub fn batched_cholesky_3x3(
 );
 
 pub fn batched_mahalanobis(
-    query_xyz: &[f32],        // M × 3 query points
-    mu_xyz: &[f32],           // N × 3 Gaussian centroids
-    sigma_packed: &[f32],     // N × 6 packed Σ
-    out_dist_sq: &mut [f32],  // M × N output (squared Mahalanobis)
+    query_xyz: &[f32],             // M × 3 query points
+    mu_xyz: &[f32],                // N × 3 Gaussian centroids
+    sigma_packed: &[f32],          // N × 6 packed Σ
+    cholesky_scratch: &mut [f32],  // N × 6 — caller-provided packed-L scratch (24 MiB @ N=1M); function MUST NOT allocate (see §4.2)
+    out_dist_sq: &mut [f32],       // M × N output (squared Mahalanobis)
 );
 
 pub fn batched_opacity_blend(
-    sorted_amplitudes: &[f32], // N (front-to-back along view ray)
-    opacity_lut: &[u8; 256],   // amplitude → opacity LUT
-    out_alpha: &mut [u8],      // composited alpha per pixel
+    sorted_amplitudes: &[f32],  // flat; all rays' samples concatenated (front-to-back per ray)
+    ray_offsets: &[u32],        // length = n_rays + 1 (CSR-style); ray r's range is [ray_offsets[r]..ray_offsets[r+1]) (see §4.3)
+    opacity_lut: &[u8; 256],    // amplitude → opacity LUT
+    out_alpha: &mut [u8],       // length = n_rays — composited alpha per ray
 );
 
 pub fn batched_sh_eval_l3(
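
The `cholesky_scratch` parameter added above moves the packed-L cache out of the function so that it never allocates. A minimal scalar sketch of that contract follows. It is a reference illustration, not the SIMD kernel the plan describes. The function name `mahalanobis_sq_ref` is hypothetical, and the packing orders (`[S00, S01, S02, S11, S12, S22]` for Σ, `[L00, L10, L11, L20, L21, L22]` for L) are assumptions, since the diff does not spell them out.

```rust
/// Scalar reference sketch (hypothetical name; not the SIMD kernel from the plan).
/// Assumed packing, not confirmed by the source:
///   sigma_packed[6n..6n+6]     = [S00, S01, S02, S11, S12, S22]
///   cholesky_scratch[6n..6n+6] = [L00, L10, L11, L20, L21, L22]
pub fn mahalanobis_sq_ref(
    query_xyz: &[f32],            // length 3*M
    mu_xyz: &[f32],               // length 3*N
    sigma_packed: &[f32],         // length 6*N
    cholesky_scratch: &mut [f32], // length 6*N, owned by the caller
    out_dist_sq: &mut [f32],      // length M*N, row-major
) {
    let m = query_xyz.len() / 3;
    let n = mu_xyz.len() / 3;
    assert_eq!(sigma_packed.len(), 6 * n);
    assert_eq!(cholesky_scratch.len(), 6 * n);
    assert_eq!(out_dist_sq.len(), m * n);

    // Pass 1: factor every Σ = L·Lᵀ once per call, writing packed L into the
    // caller's scratch buffer. No heap allocation happens here.
    for (s, l) in sigma_packed
        .chunks_exact(6)
        .zip(cholesky_scratch.chunks_exact_mut(6))
    {
        let l00 = s[0].sqrt();
        let l10 = s[1] / l00;
        let l20 = s[2] / l00;
        let l11 = (s[3] - l10 * l10).sqrt();
        let l21 = (s[4] - l20 * l10) / l11;
        let l22 = (s[5] - l20 * l20 - l21 * l21).sqrt();
        l.copy_from_slice(&[l00, l10, l11, l20, l21, l22]);
    }

    // Pass 2: for each (query, Gaussian) pair, solve L·y = (q − μ) by forward
    // substitution; the squared Mahalanobis distance is ‖y‖².
    for (i, q) in query_xyz.chunks_exact(3).enumerate() {
        for (j, (mu, l)) in mu_xyz
            .chunks_exact(3)
            .zip(cholesky_scratch.chunks_exact(6))
            .enumerate()
        {
            let d = [q[0] - mu[0], q[1] - mu[1], q[2] - mu[2]];
            let y0 = d[0] / l[0];
            let y1 = (d[1] - l[1] * y0) / l[2];
            let y2 = (d[2] - l[3] * y0 - l[4] * y1) / l[5];
            let d2 = y0 * y0 + y1 * y1 + y2 * y2;
            // A degenerate Σ produces NaN or ±inf in L; map that to +inf so
            // the output stays sortable.
            out_dist_sq[i * n + j] = if d2.is_finite() { d2 } else { f32::INFINITY };
        }
    }
}
```

A caller would typically create the scratch once (for example `vec![0.0f32; 6 * n]` at engine init) and pass `&mut scratch` on every frame. For scale, `6 * 1_000_000 * size_of::<f32>()` is 24,000,000 bytes, which is 24 MB or roughly 22.9 MiB, so the "24 MiB" figure quoted in the diff is a rounded value.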
@@ -194,11 +196,15 @@ L₂₂ = √(Σ₂₂ - L₂₀² - L₂₁²)
 /// - Degenerate Σ (Cholesky NaN) yields f32::INFINITY (sortable; never NaN).
 /// - SIMD-batched 16×16 on AVX-512, 4×4 on NEON.
 pub fn batched_mahalanobis(
-    query_xyz: &[f32], mu_xyz: &[f32], sigma_packed: &[f32], out_dist_sq: &mut [f32],
+    query_xyz: &[f32],             // length 3*M
+    mu_xyz: &[f32],                // length 3*N
+    sigma_packed: &[f32],          // length 6*N (upper-triangle Σ per Gaussian)
+    cholesky_scratch: &mut [f32],  // length 6*N (caller-provided; holds packed L per Gaussian)
+    out_dist_sq: &mut [f32],       // length M*N (row-major)
 );
 ```
 
-**Implementation note:** internally calls `batched_cholesky_3x3` once on `sigma_packed`, caches L (heap-free via stack or caller-provided scratch), then triangular-solve + squared norm per (m, n) pair.
+**Implementation note:** internally calls `batched_cholesky_3x3` on `sigma_packed` once per call, writing packed L into `cholesky_scratch` (caller-provided; zero-allocation contract). The caller sizes the buffer as `6 * N * size_of::<f32>()` — for `N = 1_000_000` Gaussians this is **24 MiB**, which is not stack-feasible; callers must allocate it once at engine init and re-use across frames (matches the `splat-fit` / registration loop pattern). For small `N` (e.g. `N ≤ 8192`) callers MAY pass a stack-resident buffer. The function MUST NOT allocate internally.
 
 **Tests:**
 - Reference comparison against scipy `scipy.spatial.distance.mahalanobis` on random points + Σ.
@@ -226,14 +232,26 @@ pub fn batched_mahalanobis(
 /// - Composition: α_new = α_old + (1 - α_old) · α_i.
 /// - Internal accumulator: u16 with saturation; truncate to u8 at end.
 pub fn batched_opacity_blend(
-    sorted_amplitudes: &[f32], opacity_lut: &[u8; 256], out_alpha: &mut [u8],
+    sorted_amplitudes: &[f32],  // flat; contains all rays' samples concatenated
+    ray_offsets: &[u32],        // length = n_rays + 1 (CSR-style); ray r's range is [ray_offsets[r]..ray_offsets[r+1])
+    opacity_lut: &[u8; 256],
+    out_alpha: &mut [u8],       // length = n_rays
 );
 ```
 
+**Per-ray segmentation contract.** A renderer composites N independent view rays per frame; each ray has its own front-to-back-sorted Gaussian sequence. `ray_offsets` is a CSR-style prefix-sum (length `n_rays + 1`) so ray `r`'s amplitudes are `sorted_amplitudes[ray_offsets[r] as usize..ray_offsets[r+1] as usize]` and `out_alpha[r]` is its composited alpha. Constraints:
+- `ray_offsets[0] == 0` and `ray_offsets[n_rays] == sorted_amplitudes.len() as u32` (assert-on-debug).
+- A ray with `ray_offsets[r] == ray_offsets[r+1]` (empty) yields `out_alpha[r] = 0`.
+- Per-frame amplitude quantization (the 256-bucket LUT input) is computed by the caller from the per-frame max amplitude; `opacity_lut` is a frame-global constant for that pass.
+
+**Implementation note:** the SIMD inner loop processes one ray's range as a contiguous front-to-back sweep; rays are independent (no cross-ray data dependence) so the outer ray loop is trivially parallelizable.
+
 **Tests:**
-- Reference comparison against scalar reference for known sequences.
+- Reference comparison against scalar reference for known sequences (single-ray + multi-ray).
 - Saturation at full opacity (sequence of high-amplitude Gaussians → α = 255).
-- Empty sequence → α = 0.
+- Empty ray (`ray_offsets[r] == ray_offsets[r+1]`) → α = 0.
+- Multi-ray independence (concatenated rays produce same per-ray output as separate single-ray calls).
+- `ray_offsets` invariant violations (debug assert on `ray_offsets[0] != 0` or `ray_offsets[last] != amplitudes.len()`).
 - SIMD parity.
 
 ### 4.4 `batched_sh_eval_l3`
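
The per-ray segmentation contract introduced in the hunk above can be sketched as a scalar reference, of the kind the test list compares the SIMD path against. This is an illustration under stated assumptions, not the plan's implementation. The names `build_ray_offsets` and `opacity_blend_ref` are hypothetical. The sketch assumes the caller has already normalised amplitudes to `[0, 1]` by the per-frame maximum, so that the LUT bucket is `round(a * 255)`, and that the composition formula is evaluated in 0..=255 fixed point with rounding. The diff fixes neither detail.

```rust
/// Hypothetical caller-side helper: turn per-ray sample counts into the
/// CSR-style prefix sum the contract describes (length = n_rays + 1).
pub fn build_ray_offsets(counts: &[u32]) -> Vec<u32> {
    let mut offsets = Vec::with_capacity(counts.len() + 1);
    let mut running: u32 = 0;
    offsets.push(running);
    for &c in counts {
        running = running
            .checked_add(c)
            .expect("total sample count overflows u32");
        offsets.push(running);
    }
    offsets
}

/// Scalar reference sketch of front-to-back compositing per ray.
/// Assumption: amplitudes are pre-normalised to [0, 1] by the caller.
pub fn opacity_blend_ref(
    sorted_amplitudes: &[f32],
    ray_offsets: &[u32],
    opacity_lut: &[u8; 256],
    out_alpha: &mut [u8],
) {
    let n_rays = out_alpha.len();
    assert_eq!(ray_offsets.len(), n_rays + 1);
    // Invariants from the contract, checked in debug builds only.
    debug_assert_eq!(ray_offsets[0], 0);
    debug_assert_eq!(ray_offsets[n_rays] as usize, sorted_amplitudes.len());

    // Rays are independent, so this outer loop could be split across threads.
    for r in 0..n_rays {
        let range = ray_offsets[r] as usize..ray_offsets[r + 1] as usize;
        let mut acc: u16 = 0; // accumulated alpha in 0..=255
        for &a in &sorted_amplitudes[range] {
            // Assumed quantisation into one of the 256 LUT buckets.
            let bucket = (a.clamp(0.0, 1.0) * 255.0).round() as usize;
            let alpha_i = opacity_lut[bucket] as u16;
            // α_new = α_old + (1 − α_old)·α_i in fixed point, rounded.
            // (255 − acc) * alpha_i + 127 is at most 65152, so it fits in u16.
            let add = ((255 - acc) * alpha_i + 127) / 255;
            acc = acc.saturating_add(add).min(255);
            if acc == 255 {
                break; // fully opaque; later samples cannot change the result
            }
        }
        // An empty ray never enters the inner loop and writes 0.
        out_alpha[r] = acc as u8;
    }
}
```

With an identity LUT, three rays holding `[1.0, 0.2]`, nothing, and `[0.5, 0.5]` composite to 255, 0 and 192 respectively. Running the third ray alone with offsets `[0, 2]` gives the same 192, which is the multi-ray independence property listed in the tests.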
