|
| 1 | +# SIMD review fixes — 2026-05-13 |
| 2 | + |
| 3 | +> **Branch:** `claude/ndarray-simd-review-S0zXK` |
| 4 | +> **Driver:** 15-agent CCA2A fleet review (12 file-scoped + meta + brutal-reviewer + this PR). |
| 5 | +> **Fleet log:** [`AGENT_LOG.md`](./AGENT_LOG.md) |
| 6 | +
|
| 7 | +## What this PR fixes |
| 8 | + |
| 9 | +Three soundness/correctness bugs surfaced by the review fleet and confirmed |
| 10 | +real by the brutally-honest reviewer (which built the workspace and ran |
| 11 | +`cargo clippy --features rayon -- -D warnings` clean and `cargo test |
| 12 | +--features rayon --lib` 1783-pass before any change). Most other findings |
| 13 | +were either already-clean (project_ortho saturating-cast was already |
| 14 | +defined behavior post-Rust-1.45) or deferred (cosmetic-SIMD sweep, polyfill |
| 15 | +completion). |
| 16 | + |
| 17 | +| # | Bug | Severity | Fix | |
| 18 | +|---|---|---|---| |
| 19 | +| 1 | `simd_avx512::permute_bytes` calls `_mm512_permutexvar_epi8` (AVX-512VBMI) as safe `pub fn` with no gate. SIGILL on Skylake-X / Cascade Lake / Ice Lake-SP (which have AVX-512F but **not** VBMI). The doc comment claimed a fallback existed; none did. | **P0 SIGILL** | Added `avx512vbmi: bool` to `SimdCaps`. `permute_bytes` now runtime-branches via the singleton: VBMI hosts use the hardware intrinsic (gated `#[target_feature(enable = "avx512vbmi")]` inner unsafe leaf, Rust language requirement); non-VBMI AVX-512F hosts use a scalar fallback (mirrors the AVX2-tier fallback at `simd_avx2.rs:1435`). | |
| 20 | +| 2 | `simd_exp_f32(+Inf)` silently returned ~0.5 in release / panicked in debug. `pow2n_from_int` saturated `f32::INFINITY as i32` to `i32::MAX`, then `(i32::MAX + 127) as u32` wrapped, producing an arbitrary IEEE bit pattern via `f32::from_bits` that combined with the polynomial to `~0.5`. | **P1 silent-wrong-output** | Pre-clamp input domain to `[-87.336, 88.722]` in `simd_exp_f32` (the safe range where exp() is f32-representable). Defense in depth: `pow2n_from_int` also clamps `ni` to `[-126, 127]` before the +127 bias. NaN propagates naturally through the polynomial. Three regression tests added: `+Inf`, `-Inf`, and large-positive (`x=200`) — all assert finite output. | |
| 21 | +| 3 | `framebuffer::project_ortho` cast `(neg_f32) as usize` directly. **Reviewer correction:** Rust 1.45+ saturates float→int casts (NaN→0, <MIN→0, >MAX→MAX), so this was already defined behavior. The original commit message overstated it as "UB fix"; it's actually a clarity improvement that clamps in float domain so the intent is visible at the call site. Same observable behavior. | **clarity** | Pre-fix in float domain via `.clamp(0.0, screen_dim as f32 - 1)` before the cast. Functionally equivalent to the prior code; just makes the bounds explicit. | |
| 22 | + |
| 23 | +## What this PR does NOT fix (intentional) |
| 24 | + |
| 25 | +The reviewer flagged that the broader fleet over-alarmed. These were |
| 26 | +considered and explicitly deferred: |
| 27 | + |
| 28 | +- **"Cosmetic SIMD" sweep.** ~6 files (`byte_scan::byte_find_all_avx2`, |
| 29 | + `palette_codec::pack_generic_avx512`, `aabb::aabb_intersect_batch_sse41`, |
| 30 | + `renderer::apply_uniform_force`, `simd_ln_f32`) wear `#[target_feature]` |
| 31 | + decorations on scalar bodies. Real but the reviewer judged: not |
| 32 | + Bevy-blocking, real perf-only fix is to complete the polyfill (`U8x64` |
| 33 | + has 25 methods on AVX-512, 0 in `simd_avx2.rs`, 3 in scalar fallback). |
| 34 | + That's the keystone for a future hpc/* rewrite — separate work. |
| 35 | +- **AMX detection duplication.** `simd_amx::amx_available()` re-implements |
| 36 | + CPUID + XCR0 + Linux prctl detection that should fold into `SimdCaps`. |
| 37 | + The user explicitly asked to keep this PR surgical and not touch AMX |
| 38 | + byte-call tricks. Deferred. |
| 39 | +- **SAFETY-comment audit on `simd_avx512.rs`** (200-deficit). Reviewer |
| 40 | + judged: macro-generated, share one safety contract, adding 200 inline |
| 41 | + comments catches zero bugs. Defer. |
| 42 | + |
| 43 | +## Changes by file |
| 44 | + |
| 45 | +### `src/hpc/simd_caps.rs` |
| 46 | +- Added `avx512vbmi: bool` field to `SimdCaps` (previously absent — the |
| 47 | + reviewer's #1 missing-field finding). |
| 48 | +- Added `is_x86_feature_detected!("avx512vbmi")` to the x86_64 detect |
| 49 | + branch; `false` in the aarch64 + non-x86 stubs. |
| 50 | +- Strictly additive: every existing field unchanged. |
| 51 | + |
| 52 | +### `src/simd_avx512.rs` |
| 53 | +- `U8x64::permute_bytes`: rewrote to runtime-dispatch via |
| 54 | + `simd_caps().avx512vbmi`. VBMI path delegates to a new `unsafe fn |
| 55 | + permute_bytes_vbmi` leaf marked `#[target_feature(enable = |
| 56 | + "avx512vbmi")]` (Rust requires this attribute to call VBMI intrinsics |
| 57 | + from a function not compiled with VBMI globally — there is no other |
| 58 | + legal way). |
| 59 | +- AVX-512F-without-VBMI path: scalar fallback via `to_array` → |
| 60 | + permute → `from_array`. Same algorithm as `simd_avx2.rs:1435`. |
| 61 | +- Inner leaf `permute_bytes_vbmi` documented with explicit SAFETY |
| 62 | + contract referencing the `simd_caps()` gate. |
| 63 | +- No other intrinsic touched. AMX inline-asm encodings, `_mm512_*` calls |
| 64 | + in other methods, and the existing `#[target_feature]` annotations are |
| 65 | + all unchanged. |
| 66 | + |
| 67 | +### `src/simd.rs` |
| 68 | +- `simd_exp_f32`: pre-clamp input via `simd_clamp(splat(-87.336), |
| 69 | + splat(88.722))` before range reduction. Comment explains the bound is |
| 70 | + the f32-representable domain of exp(). |
| 71 | +- `pow2n_from_int`: clamp `ni` to `[-126, 127]` before bias addition. |
| 72 | + Defense in depth — caller already pre-clamps but this prevents future |
| 73 | + regressions if the caller's clamp is removed or bypassed. |
| 74 | +- Three new tests: `simd_exp_f32_handles_positive_infinity`, |
| 75 | + `simd_exp_f32_handles_negative_infinity`, |
| 76 | + `simd_exp_f32_handles_large_positive`. All assert finite, plausibly- |
| 77 | + scaled output. Pre-fix these would have shown garbage bit patterns |
| 78 | + (release) or panicked (debug). |
| 79 | + |
| 80 | +### `src/hpc/framebuffer.rs` |
| 81 | +- `project_ortho`: clamp coords in float domain before `as usize` cast. |
| 82 | + Functionally equivalent to the prior code (Rust 1.45+ saturates), but |
| 83 | + the bound is now visible at the call site rather than relying on the |
| 84 | + cast's saturating behavior + post-cast `.min`. |
| 85 | + |
| 86 | +### `.claude/board/AGENT_LOG.md` |
| 87 | +- New file. CCA2A file-blackboard for the 15-agent fleet review that |
| 88 | + produced this PR. APPEND-ONLY. Includes the fleet manifest and 13 |
| 89 | + agent entries (12 file-scoped + meta-orchestrator + brutally-honest |
| 90 | + reviewer). |
| 91 | + |
| 92 | +### `.claude/board/SIMD_REVIEW_FIXES_2026_05_13.md` |
| 93 | +- This file. PR documentation per request. |
| 94 | + |
| 95 | +## Test surface |
| 96 | + |
| 97 | +``` |
| 98 | +$ cargo test --features rayon --lib |
| 99 | +test result: ok. 1786 passed; 0 failed; 36 ignored; 0 measured |
| 100 | +
|
| 101 | +$ cargo clippy --features rayon -- -D warnings |
| 102 | +Finished `dev` profile [unoptimized + debuginfo] target(s) — 0 warnings |
| 103 | +``` |
| 104 | + |
| 105 | +Pre-PR: 1783 passing. Post-PR: 1786 passing (+3 simd_exp_f32 regression |
| 106 | +tests). No existing tests modified or removed. |
| 107 | + |
| 108 | +## Hardware test matrix |
| 109 | + |
| 110 | +| Target | Pre-PR `permute_bytes` | Post-PR `permute_bytes` | |
| 111 | +|---|---|---| |
| 112 | +| Sapphire Rapids (avx512f + avx512vbmi) | works (VBMI hardware path) | works (same VBMI path, now via dispatch) | |
| 113 | +| Skylake-X / Cascade Lake / Ice Lake-SP (avx512f, no VBMI) | **SIGILL** | works (scalar fallback) | |
| 114 | +| Pre-AVX-512 (avx2 only) | type unavailable (cfg-gated out) | type unavailable (unchanged) | |
| 115 | +| ARM aarch64 | type unavailable (unchanged) | type unavailable (unchanged) | |
| 116 | + |
| 117 | +`simd_exp_f32` regression tests cover any host capable of running the |
| 118 | +test suite — the bug was in the f32 cast logic, not the SIMD intrinsics. |
| 119 | + |
| 120 | +## Review fleet output |
| 121 | + |
| 122 | +15 agents, all entries in `.claude/board/AGENT_LOG.md`: |
| 123 | +- Agents #1-12: file-scoped reviews (Sonnet, parallel) |
| 124 | +- Agent M: meta-orchestrator synthesis (Opus) |
| 125 | +- Agent R: brutally-honest reviewer (Opus, ran the build) |
| 126 | + |
| 127 | +Pattern observed by the fleet but deferred: many `hpc/*` files use |
| 128 | +`#[target_feature(enable = "...")]` decorations on scalar code bodies |
| 129 | +("cosmetic SIMD"). Real perf work, but per the brutally-honest reviewer |
| 130 | +not Bevy-blocking. The keystone fix is completing the polyfill — every |
| 131 | +method on `U8x64` / `F32x8` / etc. that exists on AVX-512 must also |
| 132 | +exist on AVX2 and scalar, so consumers can write |
| 133 | +`crate::simd::U8x64::cmpeq_mask()` and have it work on any CPU. Then |
| 134 | +the cosmetic-SIMD wrappers can be deleted in favor of polyfill calls. |
| 135 | +That's the next session. |
0 commit comments