Skip to content

Commit 0c1fb05

Browse files
committed
docs(splat-native): address review feedback on #212 (2 fixes)
Follow-up to PR #212 (merged). Addresses two codex review findings. ## Fix 1 — `batched_opacity_blend` needs ray segmentation (codex P1) The original signature took a single flat `sorted_amplitudes` slice and emitted a single `out_alpha[ray]` — but a renderer composites N independent view rays per frame, each with its own front-to-back- sorted Gaussian sequence. Without per-ray boundaries the implementation could not know which Gaussians belong to which output pixel, so it would either composite the same global sequence for every ray or guess boundaries outside the API. Adds a CSR-style `ray_offsets: &[u32]` prefix-sum (length `n_rays + 1`) that segments the flat amplitude buffer into per-ray ranges. Documented contract: - `ray_offsets[0] == 0` and `ray_offsets[n_rays] == sorted_amplitudes.len()` - Empty ray (`ray_offsets[r] == ray_offsets[r+1]`) yields `out_alpha[r] = 0` - Rays are independent (no cross-ray data dependence) — outer loop is trivially parallelizable - Per-frame amplitude quantization is caller-side; `opacity_lut` is a frame-global constant for that pass Adds three new tests: - Multi-ray independence (concatenated rays match per-ray calls) - Empty-ray boundary case (→ α = 0) - `ray_offsets` invariant debug-asserts ## Fix 2 — `batched_mahalanobis` needs scratch buffer for Cholesky cache (codex P2) The original implementation note said L was "heap-free via stack or caller-provided scratch" but the public signature had no scratch parameter. At the documented `N = 1_000_000` bench size, the Cholesky cache is `6 * N * size_of::<f32>() = 24 MiB` — not stack-feasible. The function would either have to allocate internally (breaking the zero-allocation contract) or recompute factors per query (breaking the throughput contract). Adds explicit `cholesky_scratch: &mut [f32]` parameter (length `6*N`) with documented sizing guidance: - `N ≤ 8192` MAY use a stack-resident buffer - `N > 8192` MUST allocate once at engine init and re-use across frames - The function MUST NOT allocate internally Matches the `splat-fit` engine and registration-loop pattern where the scratch is allocated once per `SplatFitActor` mailbox at boot. ## What's NOT in this PR - Source code: still none. Plan-spec only. - The W1c primitive-addition contract (all three backends mandatory, parity tests gate, VPABSB-correction-style degenerate-input documentation) is unchanged — the fix updates the two signatures but not the testing or backend invariants. ## Test plan - [x] Codex P1 (ray segmentation) — added per-ray offset + 3 new tests to the contract. - [x] Codex P2 (Mahalanobis scratch) — added `cholesky_scratch` parameter + sizing note + zero-allocation contract. - [x] Signatures rebalanced (each line-broken with one arg per line + sized comments) for readability. - [ ] Codex re-review on this PR.
1 parent 481205a commit 0c1fb05

1 file changed

Lines changed: 21 additions & 5 deletions

File tree

.claude/plans/splat-native-ultrasound-simd-substrate-v1.md

Lines changed: 21 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -194,11 +194,15 @@ L₂₂ = √(Σ₂₂ - L₂₀² - L₂₁²)
194194
/// - Degenerate Σ (Cholesky NaN) yields f32::INFINITY (sortable; never NaN).
195195
/// - SIMD-batched 16×16 on AVX-512, 4×4 on NEON.
196196
pub fn batched_mahalanobis(
197-
query_xyz: &[f32], mu_xyz: &[f32], sigma_packed: &[f32], out_dist_sq: &mut [f32],
197+
query_xyz: &[f32], // length 3*M
198+
mu_xyz: &[f32], // length 3*N
199+
sigma_packed: &[f32], // length 6*N (upper-triangle Σ per Gaussian)
200+
cholesky_scratch: &mut [f32], // length 6*N (caller-provided; holds packed L per Gaussian)
201+
out_dist_sq: &mut [f32], // length M*N (row-major)
198202
);
199203
```
200204

201-
**Implementation note:** internally calls `batched_cholesky_3x3` once on `sigma_packed`, caches L (heap-free via stack or caller-provided scratch), then triangular-solve + squared norm per (m, n) pair.
205+
**Implementation note:** internally calls `batched_cholesky_3x3` on `sigma_packed` once per call, writing packed L into `cholesky_scratch` (caller-provided; zero-allocation contract). The caller sizes the buffer as `6 * N * size_of::<f32>()` — for `N = 1_000_000` Gaussians this is **24 MiB**, which is not stack-feasible; callers must allocate it once at engine init and re-use across frames (matches the `splat-fit` / registration loop pattern). For small `N` (e.g. `N ≤ 8192`) callers MAY pass a stack-resident buffer. The function MUST NOT allocate internally.
202206

203207
**Tests:**
204208
- Reference comparison against scipy `scipy.spatial.distance.mahalanobis` on random points + Σ.
@@ -226,14 +230,26 @@ pub fn batched_mahalanobis(
226230
/// - Composition: α_new = α_old + (1 - α_old) · α_i.
227231
/// - Internal accumulator: u16 with saturation; truncate to u8 at end.
228232
pub fn batched_opacity_blend(
229-
sorted_amplitudes: &[f32], opacity_lut: &[u8; 256], out_alpha: &mut [u8],
233+
sorted_amplitudes: &[f32], // flat; contains all rays' samples concatenated
234+
ray_offsets: &[u32], // length = n_rays + 1 (CSR-style); ray r's range is [ray_offsets[r]..ray_offsets[r+1])
235+
opacity_lut: &[u8; 256],
236+
out_alpha: &mut [u8], // length = n_rays
230237
);
231238
```
232239

240+
**Per-ray segmentation contract.** A renderer composites N independent view rays per frame; each ray has its own front-to-back-sorted Gaussian sequence. `ray_offsets` is a CSR-style prefix-sum (length `n_rays + 1`) so ray `r`'s amplitudes are `sorted_amplitudes[ray_offsets[r] as usize..ray_offsets[r+1] as usize]` and `out_alpha[r]` is its composited alpha. Constraints:
241+
- `ray_offsets[0] == 0` and `ray_offsets[n_rays] == sorted_amplitudes.len() as u32` (assert-on-debug).
242+
- A ray with `ray_offsets[r] == ray_offsets[r+1]` (empty) yields `out_alpha[r] = 0`.
243+
- Per-frame amplitude quantization (the 256-bucket LUT input) is computed by the caller from the per-frame max amplitude; `opacity_lut` is a frame-global constant for that pass.
244+
245+
**Implementation note:** the SIMD inner loop processes one ray's range as a contiguous front-to-back sweep; rays are independent (no cross-ray data dependence) so the outer ray loop is trivially parallelizable.
246+
233247
**Tests:**
234-
- Reference comparison against scalar reference for known sequences.
248+
- Reference comparison against scalar reference for known sequences (single-ray + multi-ray).
235249
- Saturation at full opacity (sequence of high-amplitude Gaussians → α = 255).
236-
- Empty sequence → α = 0.
250+
- Empty ray (`ray_offsets[r] == ray_offsets[r+1]`) → α = 0.
251+
- Multi-ray independence (concatenated rays produce same per-ray output as separate single-ray calls).
252+
- `ray_offsets` invariant violations (debug assert on `ray_offsets[0] != 0` or `ray_offsets[last] != amplitudes.len()`).
237253
- SIMD parity.
238254

239255
### 4.4 `batched_sh_eval_l3`

0 commit comments

Comments
 (0)