Commit 25874e7

Merge pull request #167 from AdaWorldAPI/claude/pr-x4-splat-cascade-pre-sprint-prompt
PR-X1 SIMD-staged primitives + PR-X4 splat-cascade pre-sprint docs
2 parents c7b3f38 + b0f16b2 commit 25874e7

21 files changed

Lines changed: 1505 additions & 37 deletions

.claude/knowledge/hhtl-pr-x4-splat-cascade-pre-sprint-prompt.md

Lines changed: 108 additions & 21 deletions
@@ -235,7 +235,7 @@ the level-4 fix is the only blocker. **PR-X4 must NOT re-introduce its
235235
own Hilbert-3D encode** — wait for the A12b fix, then consume
236236
`linalg::hilbert::hilbert3d_encode`.
237237

238-
## Schedule slot — W4-W5 (5 workers), no new Q-marker
238+
## Schedule slot — W4-W5 (6 workers), no new Q-marker
239239

240240
The 8-week schedule from `hhtl-substrate-execution-prompt.md`:76-91
241241
already places PR-X4 at W4-W5 alongside PR-X12 (codec, 8 workers):
@@ -245,7 +245,7 @@ already places PR-X4 at W4-W5 alongside PR-X12 (codec, 8 workers):
245245
| W1-W2 | PR-X10 (linalg-core foundation) | 12 |
246246
| **W2.5/W3** | **PR-X1 + PR-X2 (GridLake) + PR-X14′ (contract)** ← Q-NEW-1/Q-NEW-2 pending | 4-10 depending on cell |
247247
| W3 | PR-X11 (jc consolidation) + PR-X13 (OGIT bridge) | 6 + 4 |
248-
| **W4-W5** | **PR-X12 (codec, 8) + PR-X4 (splat cascade, 5)** | **13** |
248+
| **W4-W5** | **PR-X12 (codec, 8) + PR-X4 (splat cascade, 6)** | **14** |
249249
| W6-W7 | PR-X9 (basin-codebook) | 6 |
250250
| W8 | Integration + canary | 3 |
251251

@@ -266,7 +266,7 @@ If GridLake + X14′ slip past W3 (e.g., Cell A-β extension), PR-X4
266266
starts late by the same margin. **No schedule extension owed by
267267
PR-X4 itself.**
268268

269-
## Worker spawn shape (5 workers)
269+
## Worker spawn shape (6 workers)
270270

271271
```
272272
A1: TileInstance v2 + BlockedGrid<SplatBinList, 1, 1> refactor (chain dep)
@@ -279,14 +279,25 @@ A1: TileInstance v2 + BlockedGrid<SplatBinList, 1, 1> refactor (chain dep)
279279
280280
├──→ A4: G2 INT4×32 packed dot (3 backends + parity test)
281281
282-
└──→ A5: G3 NARS truth-revision kernel + G4 fast_exp_x16
283-
precision audit (combined; A5 verdict determines if
284-
precise_exp_x16 follow-up needed)
282+
├──→ A5: G3 NARS truth-revision kernel + G4 fast_exp_x16
283+
│ precision audit (combined; A5 verdict determines if
284+
│ precise_exp_x16 follow-up needed)
285+
286+
└──→ A6: Railway smoke deployment — splat4d::cascade::frame_pipeline
287+
wired to HLS over a minimal axum/warp service, Prom metrics
288+
endpoint, FPS + jitter histogram surfaced in the player UI.
289+
Depends on A1 + A5 only (no cross-deps to A2/A3/A4) so the
290+
banal smoke test can ship even if A12b's L4 Hilbert fix
291+
slips past W3 — A6 exercises L1-L3 cascade and the
292+
composition closure, which is enough to falsify a latency
293+
regression.
285294
```
286295

287-
A1 is the only chain dep. A2-A5 are parallel after A1 lands.
296+
A1 is the only chain dep. A2-A6 are parallel after A1 lands.
288297
A2 has an additional gate dep on PR-X10 A12b's L4 fix landing on
289-
master (not just PR-X10 W2 completion).
298+
master (not just PR-X10 W2 completion). A6 has an additional dep on
299+
A5's composition closure being callable from the pipeline (needed for
300+
SG4); the alpha-only branch of A6 can ship without A5.
290301

291302
## Done criteria
292303

@@ -343,6 +354,60 @@ The sprint is done when ALL of the following hold:
343354
union currently green on master) still passes after the refactor.
344355
`cargo clippy -- -D warnings` clean.
345356

357+
7. **Smoke gates pass on Railway** (see § "Smoke acceptance gates"):
358+
SG1 ≥ 60 fps median, SG2 ≤ 20 ms p95, SG3 zero stutter events
359+
over 10 minutes, SG4 same envelope under the `splat4d-nars-compose`
360+
feature flag. A6 must be deployed and metrics scraped before the
361+
sprint closes.
362+
363+
## Smoke acceptance gates — Railway-hosted video player
364+
365+
The cascade ships not just as a refactor but as a service: a small
366+
Railway-deployed binary that streams a video through the splat4d
367+
pipeline. **Banality is the test.** If the cascade can stream a 1080p
368+
video on a Railway hobby tier without stuttering, every bundle is
369+
honoring its latency contract under sustained load — and any cliff
370+
that hides on a workstation surfaces in the deployment envelope.
371+
372+
### Why this beats synthetic benchmarks (PSNR, throughput-only)
373+
374+
- **PSNR is a number; stuttering is a sensation.** A dropped frame is
375+
unambiguous: you either see it or you don't. PSNR averages across
376+
frames and hides lane-traffic spikes — exactly the pathology bundle
377+
contracts are designed to prevent.
378+
- **Railway adds the deployment envelope for free.** Hobby-tier CPU
379+
caps, real network egress, container memory limits. Anything that
380+
hides on a workstation surfaces here.
381+
- **60 fps × 10 minutes = 36,000 frames, each inside a 16.7 ms budget.**
382+
No batch averaging, no "we'll catch up." The harshest possible test
383+
of per-bundle latency dressed up as the most boring deliverable.
384+
385+
### What ships in A6
386+
387+
- **Service**: minimal Railway-deployed binary (axum or warp), HTML5
388+
player + FPS counter + jitter histogram in the UI
389+
- **Server path**: video frames flow through
390+
`splat4d::cascade::frame_pipeline` — decode → B-Interleave-Transpose
391+
→ cascade L1..L4 (L1-L3 only if A12b slipped) → B-Compose (alpha) →
392+
emit
393+
- **Client**: stock `<video>` element over HLS, no special player
394+
- **Metrics**: median FPS, p95 frame time, stutter events (count of
395+
inter-frame gaps > 33 ms), exposed on a Prom endpoint
396+
397+
### Gates
398+
399+
| Gate | Target | What it proves |
400+
|---|---|---|
401+
| **SG1: median FPS** | ≥ 60 fps for a 1080p H.264 input, 10-minute Big Buck Bunny | Steady-state throughput; the cascade isn't dropping behind |
402+
| **SG2: p95 frame time** | ≤ 20 ms | No bundle is silently decomposing into its constituent ops |
403+
| **SG3: stutter count** | 0 events > 33 ms over 10 minutes | Every bundle honors its latency contract under sustained load |
404+
| **SG4: closure swap** | Same SG1-SG3 envelope with `splat4d-nars-compose` feature on | NARS-revision path has the same latency class as alpha, as designed |
405+
406+
SG4 is conditional on G3 (the NARS truth-revision kernel) being
407+
ULP-correct against the scalar reference; an SG4 failure with G3
408+
passing is evidence that the NARS B-Compose kernel has a worse latency
409+
class than alpha and must be re-staged before the W7 closure swap.
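
The three timing gates can be checked from a list of frame emit timestamps. The sketch below is illustrative only: `SmokeReport`, `smoke_report`, and `passes_sg1_to_sg3` are hypothetical names, and it assumes that "frame time" means the gap between consecutive emits and that median FPS is derived from the median gap. The real A6 service may define either differently.

```rust
/// Summary of the frame-timing gates SG1-SG3 (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct SmokeReport {
    pub median_fps: f64,
    pub p95_frame_ms: f64,
    pub stutter_events: usize,
}

/// Nearest-rank percentile over a non-empty, ascending-sorted slice.
fn percentile(sorted: &[f64], pct: f64) -> f64 {
    let rank = ((pct / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// Builds a report from frame emit times in milliseconds.
/// Returns `None` when fewer than two timestamps are available.
pub fn smoke_report(emit_ms: &[f64]) -> Option<SmokeReport> {
    if emit_ms.len() < 2 {
        return None;
    }
    // Inter-frame gaps; assumed here to be the "frame time" of SG2.
    let mut gaps: Vec<f64> = emit_ms.windows(2).map(|w| w[1] - w[0]).collect();
    // SG3: a stutter event is any inter-frame gap above 33 ms.
    let stutter_events = gaps.iter().filter(|&&g| g > 33.0).count();
    gaps.sort_by(|a, b| a.total_cmp(b));
    Some(SmokeReport {
        median_fps: 1000.0 / percentile(&gaps, 50.0),
        p95_frame_ms: percentile(&gaps, 95.0),
        stutter_events,
    })
}

/// SG1 >= 60 fps median, SG2 p95 <= 20 ms, SG3 zero stutter events.
pub fn passes_sg1_to_sg3(r: &SmokeReport) -> bool {
    r.median_fps >= 60.0 && r.p95_frame_ms <= 20.0 && r.stutter_events == 0
}
```

SG4 would reuse the same check with the `splat4d-nars-compose` feature enabled, since the gate is defined as "same envelope".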
410+
346411
## Forbidden constraints
347412

348413
Five invariants the sprint MUST NOT violate:
@@ -353,11 +418,29 @@ Five invariants the sprint MUST NOT violate:
353418
worker stubs the L4 path (returns `Err(NotReadyL4)`) and ships L1-L3
354419
addressing only.
355420

356-
2. **No `crate::simd::*` extension from inside PR-X4**. Any new SIMD
357-
primitive (e.g., a missing lane width for G2 INT4×32) must be
358-
proposed against `vertical-simd-consumer-contract.md` and land in
359-
ndarray's `src/simd_*.rs` before PR-X4 consumes it. PR-X4 must not
360-
reach for raw `std::arch::*` intrinsics.
421+
2. **PR-X4 consumes — and must not extend — the following SIMD bundles
422+
from `ndarray::simd`.** Each bundle is a fused multi-op transaction
423+
with its own latency budget; reaching past a bundle into raw
424+
`std::arch::*` intrinsics, or proposing new lane primitives without
425+
going through `vertical-simd-consumer-contract.md`, breaks the
426+
contract and re-introduces the bespoke-binner pathology v1 is
427+
leaving behind.
428+
429+
| Bundle | Composition | Cognitive role |
430+
|---|---|---|
431+
| **B-Splat** | `splat_f32x16`, `splat_i32x16` | Broadcast a Gaussian center / NARS truth-value across the 16 tile lanes of a single L_k cell. The identity of a single belief across its support. |
432+
| **B-Gather-FMA** | `gather_idx_f32x16 ∘ fmadd_f32x16` | Pick up the 16 neighbouring Gaussians of a tile and fuse-multiply-add their contributions in one shot. Evidence-aggregation across siblings. |
433+
| **B-Pack-Dot** | `pack_int4x32 ∘ dot_i4x32_to_i32 ∘ dequant_f32` | The INT4×32 packed dot of A4. SH-coefficient evaluation, NARS confidence × frequency products. Three backends (AVX-512 VNNI, NEON UDOT, scalar) with parity tests. |
434+
| **B-Cascade-Permute** | `shuffle_lanes_4x4 ∘ transpose_16x16` | Cross-tier rotation L_k → L_{k+1}. The 4×4 stride identity made executable; without this bundle the cascade is just a hierarchy of independent grids. |
435+
| **B-Compose** | `hreduce_sum_f32x16` for alpha; `revise_truth_f32x16` for NARS | Closure-swappable horizontal reduction. The `splat4d-nars-compose` feature gate selects which kernel binds; same lane width, same latency class, different algebra. |
436+
| **B-Interleave-Transpose** | `interleave_f32x16 ∘ transpose_inplace` | Row-major splat3d ↔ lane-major splat4d. Boundary primitive between v1 binner and v2 cascade. |
437+
438+
The forbidden thing is reaching past a bundle into its internal lane
439+
primitives — that breaks the latency contract that the A6 Railway
440+
smoke gates (SG2 p95 ≤ 20 ms, SG3 zero stutter) are designed to
441+
falsify. Missing bundles get proposed against
442+
`vertical-simd-consumer-contract.md` and land in `src/simd_*.rs`
443+
before PR-X4 consumes them; PR-X4 itself never adds a primitive.
361444

362445
3. **No write to lance-graph upstream**. PR-X4 lives entirely in
363446
ndarray (`src/hpc/splat3d_v2/`, `src/hpc/splat4d/`). It consumes
@@ -427,12 +510,16 @@ Five invariants the sprint MUST NOT violate:
427510
PR-X4 promotes splat3d from "bespoke 16×16 tile binner" to "typed
428511
multi-resolution cognitive evolution operator" with the
429512
(4×4)×(4×4)×(4×4)×(4×4) tier scheme as its load-bearing structural
430-
identity. Slots at W4-W5 (5 workers). Consumes GridLake +
513+
identity. Slots at W4-W5 (6 workers). Consumes GridLake +
431514
`lance-graph-contract::column` + PR-X10 A6/A8/A12b + PR-X11 jc Spd3.
432-
Gates on PR-X10 A12b's L4 Hilbert-3D P0-4 fix (`hilbert3d_encode([15,15,15], 4)
433-
→ 2925, expected 4095`). Ships four splat gaps (G1 deg-3 SH inquiry-
434-
direction, G2 INT4×32 packed dot, G3 NARS truth-revision kernel,
435-
G4 fast_exp_x16 precision audit). Alpha-compositing stays default;
436-
NARS-revision composition gated behind `splat4d-nars-compose` feature
437-
flag until W7 closure swap. CTU-mode encoder is PR-X9's deliverable,
438-
not PR-X4's.
515+
Consumes (and does not extend) six SIMD bundles from `ndarray::simd`:
516+
B-Splat, B-Gather-FMA, B-Pack-Dot, B-Cascade-Permute, B-Compose,
517+
B-Interleave-Transpose. Gates on PR-X10 A12b's L4 Hilbert-3D P0-4 fix
518+
(`hilbert3d_encode([15,15,15], 4) → 2925, expected 4095`). Ships four
519+
splat gaps (G1 deg-3 SH inquiry-direction, G2 INT4×32 packed dot, G3
520+
NARS truth-revision kernel, G4 fast_exp_x16 precision audit) and four
521+
Railway smoke gates (SG1 ≥ 60 fps, SG2 p95 ≤ 20 ms, SG3 zero stutter,
522+
SG4 same envelope under NARS feature flag). Alpha-compositing stays
523+
default; NARS-revision composition gated behind `splat4d-nars-compose`
524+
feature flag until W7 closure swap. CTU-mode encoder is PR-X9's
525+
deliverable, not PR-X4's.
Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
1+
# A1: TileInstance v2 + BlockedGrid<SplatBinList, 1, 1> refactor
2+
3+
Worker A1 of PR-X4 (W4-W5). **Chain dep** for A2-A6 — must land
4+
before any other worker spawns. Owns the structural port from the
5+
bespoke 16×16 binner to the typed BlockedGrid substrate.
6+
7+
## Scope
8+
9+
Lift the existing splat3d binner (`src/hpc/splat3d/{tile,frame,
10+
gaussian,project,raster,sh,spd3,ply,mod}.rs`) into `splat3d_v2/` as a
11+
sibling tree on `BlockedGrid<SplatBinList, 1, 1>` from PR-X3. Same
12+
algorithmic shape (project → bin → sort → rasterize) on the typed
13+
substrate, with the tier field A2 will populate.
14+
15+
## File moves
16+
17+
| v1 path | v2 path |
18+
|-------------------------------|--------------------------------------|
19+
| `splat3d/tile.rs` | `splat3d_v2/tile.rs` |
20+
| `splat3d/frame.rs` | `splat3d_v2/frame.rs` |
21+
| `splat3d/gaussian.rs` | `splat3d_v2/gaussian.rs` |
22+
| `splat3d/project.rs` | `splat3d_v2/project.rs` |
23+
| `splat3d/raster.rs` | `splat3d_v2/raster.rs` |
24+
| `splat3d/sh.rs` | `splat3d_v2/sh.rs` (A3 expands) |
25+
| `splat3d/spd3.rs` | `splat3d_v2/spd3.rs` |
26+
| `splat3d/mod.rs` | `splat3d_v2/mod.rs` |
27+
28+
Side-by-side per pr-x4-design § Q1 — `splat3d/` stays unchanged until
29+
W7 closure swap. Both compile.
30+
31+
## Verbatim struct (pre-sprint lines 95-103)
32+
33+
```rust
34+
#[repr(C, align(16))]
35+
pub struct TileInstance {
36+
pub tier: u8, // 1 = L1, 2 = L2, 3 = L3, 4 = L4
37+
pub _pad: [u8; 3],
38+
pub block_row: u16,
39+
pub block_col: u16,
40+
pub gaussian_id: u32,
41+
pub confidence: f32, // replaces depth — sort key, highest-first
42+
}
43+
```
44+
45+
A1 emits `tier == 1` only. L2-L4 emission is A2's deliverable, so
46+
**A1 is NOT gated on PR-X10 A12b's L4 Hilbert-3D fix.** For the
47+
graphics-compat layer, `confidence = 1.0 / (depth + EPS)` so
48+
highest-first sort recovers front-to-back order under the new key.
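
A minimal sketch of why the reciprocal key preserves draw order, assuming non-negative depths. The `EPS` value and both function names are hypothetical; only the formula `1.0 / (depth + EPS)` comes from the text above.

```rust
/// Hypothetical epsilon; the source does not state its value.
const EPS: f32 = 1e-6;

/// Graphics-compat sort key: nearer splats get higher confidence.
pub fn confidence_from_depth(depth: f32) -> f32 {
    1.0 / (depth + EPS)
}

/// Takes (gaussian_id, depth) pairs and returns the ids sorted by
/// confidence, highest first, which is front-to-back for depth >= 0.
pub fn front_to_back(splats: &[(u32, f32)]) -> Vec<u32> {
    let mut keyed: Vec<(u32, f32)> = splats
        .iter()
        .map(|&(id, depth)| (id, confidence_from_depth(depth)))
        .collect();
    // Descending order on the confidence key.
    keyed.sort_by(|a, b| b.1.total_cmp(&a.1));
    keyed.into_iter().map(|(id, _)| id).collect()
}
```

Because `1 / (d + EPS)` is strictly decreasing for `d >= 0`, sorting by it in descending order yields the same permutation as sorting by depth in ascending order.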
49+
50+
## BlockedGrid<SplatBinList, 1, 1> migration
51+
52+
`SplatBinList` is the per-block payload — `SmallVec<[TileInstance; 8]>`
53+
or equivalent — replacing v1's `Vec<TileInstance> + Vec<u32> prefix`
54+
hand-rolled CSR. The `<1, 1>` block-params mean **1×1 cells per
55+
substrate block**: each tile is its own atomic block. Cascade-tier
56+
striding belongs to A2.
57+
58+
Constructor: `BlockedGrid::<SplatBinList, 1, 1>::with_dims(rows, cols)`,
59+
populated by the two-pass count+emit pattern v1 uses. The packed-u64
60+
radix sort survives unchanged; the prefix-sum CSR is replaced by
61+
`BlockedGrid::iter_blocks()`.
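
For reference, the generic two-pass count+emit pattern behind a prefix-sum CSR looks roughly like the sketch below. This is not v1's code: `build_csr` and its `(tile, gaussian_id)` input shape are hypothetical, and the sketch only illustrates what the v2 port replaces with per-block `SplatBinList` payloads.

```rust
/// Builds a CSR layout from (tile_index, gaussian_id) hits.
/// Returns (prefix, ids): the ids of tile `t` are
/// `ids[prefix[t] as usize..prefix[t + 1] as usize]`.
pub fn build_csr(tile_count: usize, hits: &[(usize, u32)]) -> (Vec<u32>, Vec<u32>) {
    // Pass 1: count hits per tile, then take an exclusive prefix sum.
    let mut prefix = vec![0u32; tile_count + 1];
    for &(tile, _) in hits {
        prefix[tile + 1] += 1;
    }
    for t in 0..tile_count {
        prefix[t + 1] += prefix[t];
    }
    // Pass 2: emit each id at its tile's running write cursor.
    let mut cursor: Vec<u32> = prefix[..tile_count].to_vec();
    let mut ids = vec![0u32; hits.len()];
    for &(tile, id) in hits {
        ids[cursor[tile] as usize] = id;
        cursor[tile] += 1;
    }
    (prefix, ids)
}
```

The v1-vs-v2 parity test then amounts to comparing each CSR slice against the contents of the corresponding block.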
62+
63+
The PP-13 PR4 P0 boundary-tile fix (`floor + 1` instead of `ceil`) at
64+
`splat3d/tile.rs:241-243` MUST be preserved verbatim in the v2 port
65+
— a regression here silently breaks SG3.
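
To illustrate why the two roundings differ, here is a sketch of an exclusive upper tile bound under each rule. It does not reproduce the expression at `splat3d/tile.rs:241-243`; the 16-pixel tile width matches the 16×16 binner, and the function names are hypothetical.

```rust
/// Tile width in pixels for the 16×16 binner.
const TILE: f32 = 16.0;

/// Exclusive upper tile index using `ceil` (the pre-fix behaviour).
pub fn tile_end_ceil(max_px: f32) -> u32 {
    (max_px / TILE).ceil() as u32
}

/// Exclusive upper tile index using `floor + 1` (the fixed behaviour).
pub fn tile_end_floor_plus_one(max_px: f32) -> u32 {
    (max_px / TILE).floor() as u32 + 1
}
```

The two agree everywhere except when `max_px` lands exactly on a tile boundary. At `max_px = 32.0`, `ceil` gives 2 and skips tile 2, although pixel 32 belongs to it, while `floor + 1` gives 3 and covers it. A real implementation would also clamp the result to the grid dimensions.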
66+
67+
## SIMD bundles — B-Splat + B-Interleave-Transpose
68+
69+
A1 consumes exactly two bundles:
70+
71+
- **B-Splat** (`splat_f32x16`, `splat_i32x16`): broadcast a Gaussian
72+
center across 16 tile lanes during the bin step.
73+
- **B-Interleave-Transpose** (`interleave_f32x16 ∘ transpose_inplace`):
74+
the row-major splat3d ↔ lane-major splat4d boundary primitive. A1
75+
IS the v1↔v2 boundary, so this is its primary tool.
76+
77+
B-Gather-FMA and B-Cascade-Permute belong to A3 and A2 respectively. A1 must not
78+
reach past either of its own bundles into raw intrinsics; doing so breaks the latency contract that SG2 p95 checks.
79+
80+
## Parity tests
81+
82+
1. **v1-vs-v2 binner parity**: feed both binners the same 16-Gaussian
83+
fixture (seed `0xA1_B1_NA_RY`), assert emitted `TileInstance`
84+
streams agree on `(tile_id ↔ (block_row, block_col), gaussian_id,
85+
sort order under the depth/confidence transform)`.
86+
2. **Boundary-tile coverage regression**: the PP-13 PR4 P0 case —
87+
3σ-ellipse straddling a tile boundary — must produce identical
88+
per-tile splat counts in v2 as v1.
89+
90+
The 2370-test no-regression line (done-criteria #6) requires the v1
91+
suite to still pass with `splat3d/` untouched.
92+
93+
## Exit criteria — when A2-A6 may spawn
94+
95+
- [ ] `cargo test -p ndarray --lib splat3d_v2::` green
96+
- [ ] v1↔v2 binner parity on the 16-Gaussian fixture
97+
- [ ] Boundary-tile coverage regression passes
98+
- [ ] `cargo clippy -- -D warnings` clean
99+
- [ ] `splat3d_v2::TileInstance` + `BlockedGrid<SplatBinList,1,1>`
100+
exported from `splat3d_v2::mod`
101+
- [ ] A6's `frame_pipeline` skeleton can call into `splat3d_v2`
102+
without depending on A2..A5
103+
104+
No AABB or Hilbert dep, no SH or INT4 dep. A1 is the chain dep gate.
Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
1+
# A2: CascadeAddr + from_position/to_position_center
2+
3+
Worker A2 of PR-X4 (W4-W5). Spawns after A1's TileInstance v2
4+
refactor lands. **Hard gate on PR-X10 A12b's L4 Hilbert-3D fix
5+
landing on master.**
6+
7+
## Gate — PR-X10 A12b L4 fix
8+
9+
Verbatim symptom (pp13-brutally-honest-tester-verdict.md P0-4):
10+
`hilbert3d_encode([15,15,15], 4) → 2925, expected 4095`. A12b must
11+
ship the `NEXT_STATE`/`H_TO_XYZ` re-derivation from Hamilton 2006
12+
Table 2 + `round_trip_level4_exhaustive` (4096 cells × 4 µs ≈ 16 ms)
13+
before A2 starts. **A2 must NOT re-introduce a bespoke Hilbert-3D**
14+
(forbidden constraint #1).
15+
16+
If A12b slips past W3, A2 stubs the L4 path: `Err(NotReadyL4)`,
17+
ships L1-L3 addressing only. `parent()`/`children()` remain
18+
functional since they're pure nibble ops.
19+
20+
## API surface
21+
22+
```rust
23+
pub struct CascadeAddr(u16); // 4 nibbles, one per tier level
24+
25+
impl CascadeAddr {
26+
    // `l` is a 0-based nibble slot (0..=3); slot 3 is the 0xF000 nibble that `parent()` clears.
    pub fn level(&self, l: u8) -> u8 { ((self.0 >> (l * 4)) & 0xF) as u8 }
27+
pub fn parent(&self) -> CascadeAddr { CascadeAddr(self.0 & !0xF000) }
28+
pub fn children(&self) -> [CascadeAddr; 16] { ... }
29+
pub fn from_position(p: Vec3, bbox: AABB, level: u8) -> CascadeAddr {
30+
CascadeAddr(linalg::hilbert::hilbert3d_encode(p_quantised, level) as u16)
31+
}
32+
pub fn to_position_center(&self, bbox: AABB) -> Vec3 { ... }
33+
}
34+
```
35+
36+
The 4-nibble layout: one nibble per L1..L4 tier, 16 children per
37+
parent. `parent()` masks off the L4 nibble. `children()` enumerates
38+
all 16 nibble values at the L4 slot.
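
A runnable sketch of the nibble operations, under stated assumptions: the `children()` body is elided in the API surface above, so the version here is one plausible reading of "enumerates all 16 nibble values at the L4 slot", and `level()` is taken to use a 0-based slot index with slot 3 being the `0xF000` nibble.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CascadeAddr(pub u16);

impl CascadeAddr {
    /// Nibble at a 0-based slot (assumed: slot 3 is the L4 nibble).
    pub fn level(&self, slot: u8) -> u8 {
        debug_assert!(slot < 4, "a u16 holds exactly four nibbles");
        ((self.0 >> (u32::from(slot) * 4)) & 0xF) as u8
    }

    /// Clears the L4 nibble, as in the API surface above.
    pub fn parent(&self) -> CascadeAddr {
        CascadeAddr(self.0 & !0xF000)
    }

    /// Assumed body: every value 0..16 placed in the L4 nibble.
    pub fn children(&self) -> [CascadeAddr; 16] {
        let base = self.0 & !0xF000;
        std::array::from_fn(|i| CascadeAddr(base | ((i as u16) << 12)))
    }
}
```

Under this reading, `addr.children()[i].parent() == addr` holds only when the L4 nibble of `addr` is already zero, because `parent()` always clears that nibble. The parent/children round-trip test may therefore need to be restricted to such addresses.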
39+
40+
## AABB quantisation convention
41+
42+
Per `linalg::hilbert::hilbert3d_encode` contract, the input is a
43+
quantised 3-tuple of unsigned ints. At level `k`, each axis has
44+
`2^k` cells:
45+
46+
| level | cells/axis | index range | bits |
47+
|-------|------------|--------------|------|
48+
| 1 | 2 | [0, 8) | 3 |
49+
| 2 | 4 | [0, 64) | 6 |
50+
| 3 | 8 | [0, 512) | 9 |
51+
| 4 | 16 | [0, 4096) | 12 |
52+
53+
L4's 12-bit range fits within 3 of the 4 cascade-addr nibbles. NB:
54+
if A12b's actual encode returns a monolithic per-call index rather
55+
than packed cascade, A2 must call once per tier and assemble nibbles
56+
itself. **Flag this discrepancy with the A12b author at spawn.**
57+
58+
Quantisation: `q = floor((p.x - bbox.min.x) / (bbox.size.x) * (1 << level))`
59+
clamped to `[0, (1 << level) - 1]`. Same for y, z.
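
The quantisation rule for a single axis can be sketched as follows. `quantise_axis` is a hypothetical helper that takes plain floats instead of the `Vec3`/`AABB` types, whose definitions are not shown here, and it assumes `size > 0`.

```rust
/// Quantises one coordinate into `1 << level` cells along its axis,
/// clamping positions outside the box to the first or last cell.
pub fn quantise_axis(p: f32, min: f32, size: f32, level: u8) -> u32 {
    let cells = 1u32 << level;
    // floor((p - min) / size * cells), computed in f32.
    let q = ((p - min) / size * cells as f32).floor();
    // Clamp to [0, cells - 1] so that p == min + size maps to the last cell.
    q.clamp(0.0, (cells - 1) as f32) as u32
}
```

The clamp matters at the upper face of the box: without it, `p == min + size` would quantise to `1 << level`, one past the last valid cell.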
60+
61+
## Tests
62+
63+
- **Exhaustive level=4 round-trip** (16 cells per axis, 16³ = 4096 cells): for each
64+
of 4096 quantised positions, `decode(encode(p)) == p`. ~16 ms.
65+
- **Exhaustive level=1..3 round-trip**: already pass under current
66+
A12b — just verify under the splat4d call sites.
67+
- **AABB sanity**: the min and max corners of the AABB quantise to `0` and
68+
`(1 << level) - 1` respectively on each axis.
69+
- **parent/children round-trip**: for any addr,
70+
`addr.children()[i].parent() == addr` for all i.
71+
72+
## SIMD bundle — B-Cascade-Permute
73+
74+
A2 consumes one bundle:
75+
76+
- **B-Cascade-Permute** (`shuffle_lanes_4x4 ∘ transpose_16x16`): the
77+
cross-tier rotation L_k → L_{k+1}. The 4×4 stride identity made
78+
executable. Without this bundle the cascade is just a hierarchy
79+
of independent grids.
80+
81+
A2 must not reach past into raw shuffle intrinsics. If the bundle
82+
primitive is missing in `ndarray::simd`, file a pre-PR-X4 gating
83+
PR against the vertical-simd-consumer-contract before spawning.
84+
85+
## Exit criteria
86+
87+
- [ ] A12b's `hilbert3d_encode([15,15,15], 4) == 4095` and
88+
`round_trip_level4_exhaustive` green on master
89+
- [ ] A2's exhaustive level=4 round-trip green
90+
- [ ] `CascadeAddr::from_position` and `to_position_center`
91+
round-trip on 10K random positions within the unit AABB
92+
- [ ] `parent`/`children` round-trip exhaustive
93+
- [ ] L1-L3 addressing exercisable from A6's frame_pipeline (smoke
94+
gate the cascade addressing layer)
95+
- [ ] `cargo clippy -- -D warnings` clean
