Commit f857a81

feat(simd): Phase 2 — wire simd_nightly into crate::simd::* dispatch
Phase 2 of the integration plan in
`.claude/knowledge/simd-dispatch-architecture.md`.

simd.rs
-------
Adds a top-priority `feature = "nightly-simd"` dispatch arm that
re-exports the full `simd_nightly::*` portable-SIMD type set through
`crate::simd::*`. No `target_arch` constraint: `core::simd` is portable,
so the same arm catches wasm32 / riscv / aarch64 / x86_64.

Tightens the predicate on every other type-re-export arm to
`not(feature = "nightly-simd")`:

* AVX-512 (avx512f)
* AVX-512BF16 (BF16x8/16 types)
* AVX2 baseline (the v3 default arm)
* U8x32 (cross-tier export)
* aarch64 NEON
* non-x86/non-aarch64 scalar fallback
* the inline `mod scalar` declaration itself

Result: when `cargo +nightly --features nightly-simd ...` is used, every
`use crate::simd::F32x16` call site routes to the portable-SIMD
implementation, and miri can actually execute it (it treats `_mm*`
intrinsics as opaque, but `core::simd::*` runs fine).

BF16 conversion FUNCTIONS (bf16_to_f32_batch etc.) are NOT gated under
the nightly arm: they're scalar/intrinsic functions taking primitive
slices, not the SIMD types, and they coexist cleanly with the portable
backend.

architecture doc
----------------
Parity matrix updated to reflect what `src/simd_avx2.rs` actually ships.
Previous matrix marked U8x64 / I8x64 / I16x32 / I32x16 / I64x8 / U16x32 /
U32x16 / U64x8 as ❌ in the AVX2 column. On survey those types exist via
the `avx2_int_type!` macro: full API-parity structs with
`[$elem; $lanes]` scalar storage (align 64). New 🟠 marker introduced to
distinguish "struct exists with API, storage is scalar" from "true
two-half SIMD composite" (🟡).

I8x32 / I16x16 also corrected: they share the AVX-512 `__m256i`
definition (re-exported through `simd_avx2`'s
`pub use crate::simd_avx512::{i16x16, i8x32, ...}`).

The remaining AVX2 vectorization gap (filling 🟠 → 🟡 with real two-half
SIMD ops) is tracked separately as TD-SIMD-3.
1 parent a18366a commit f857a81
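
The dispatch described in the commit message can be sketched in a few lines of stable Rust. This is a minimal illustration under stated assumptions, not the crate's actual code: `portable_backend`, `scalar_backend`, the 4-lane `F32x4` type, and `sum_lanes` are hypothetical stand-ins for `simd_nightly`, the existing backends, the real lane types, and a call site. Only the `nightly-simd` feature name and the `not(feature = "nightly-simd")` gating pattern come from the commit.

```rust
// Hypothetical stand-in for `simd_nightly` (the real one wraps `core::simd`).
#[allow(dead_code)]
mod portable_backend {
    #[derive(Clone, Copy, Debug, PartialEq)]
    pub struct F32x4(pub [f32; 4]);

    impl F32x4 {
        pub const BACKEND: &'static str = "portable";

        pub fn splat(v: f32) -> Self {
            Self([v; 4])
        }
    }
}

// Hypothetical stand-in for the intrinsics / scalar backends.
#[allow(dead_code)]
mod scalar_backend {
    #[derive(Clone, Copy, Debug, PartialEq)]
    pub struct F32x4(pub [f32; 4]);

    impl F32x4 {
        pub const BACKEND: &'static str = "scalar";

        pub fn splat(v: f32) -> Self {
            Self([v; 4])
        }
    }
}

// The facade: exactly one arm is compiled in, so the name `simd::F32x4`
// always resolves to a single type.
#[allow(unexpected_cfgs)]
pub mod simd {
    // Top-priority arm: no `target_arch` condition at all.
    #[cfg(feature = "nightly-simd")]
    pub use super::portable_backend::F32x4;

    // Every other arm carries `not(feature = "nightly-simd")`.
    #[cfg(not(feature = "nightly-simd"))]
    pub use super::scalar_backend::F32x4;
}

// A call site names only the facade type and never a specific backend.
pub fn sum_lanes(v: simd::F32x4) -> f32 {
    v.0.iter().sum()
}
```

Because `sum_lanes` imports the type through the facade, switching the feature on or off changes which backend it runs on without any edit to the call site, which is the property the commit relies on for miri runs.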

2 files changed

Lines changed: 44 additions & 36 deletions

.claude/knowledge/simd-dispatch-architecture.md

Lines changed: 22 additions & 12 deletions
@@ -128,26 +128,36 @@ chooses the source; the cargo config chooses how `simd.rs` chooses.
 
 ## 4. Parity matrix — typed lane primitives per backend
 
-Legend: ✅ native, 🟡 composed wrapper (two-half / four-quarter), 🔵
-scalar polyfill via `core::simd`, ❌ missing, ⛔ N/A for this arch.
+Legend: ✅ native, 🟡 composed wrapper (two-half / four-quarter), 🟠
+scalar polyfill (struct exists with full API but storage is `[$elem;
+$lanes]` — no SIMD execution), 🔵 portable-SIMD polyfill via
+`core::simd`, ❌ missing, ⛔ N/A for this arch.
+
+(Reality check 2026-05-20: many AVX2 int rows previously marked ❌ are
+actually 🟠 — `simd_avx2.rs` ships them via the `avx2_int_type!` macro
+as scalar-storage structs that match the AVX-512 API surface. The
+arithmetic is plain Rust under the hood; only the FLOAT wrappers in
+this column are true two-half SIMD composites. Filling in real AVX2
+vectorization for the int wrappers is its own piece of tech debt
+tracked as TD-SIMD-3.)
 
 | Lane type | `simd_avx512` (v4) | `simd_avx2` (v3) | `simd_neon` (aarch64) | `simd_nightly` | `scalar` |
 |---|---|---|---|---|---|
 | `F32x16` |`__m512` | 🟡 `(f32x8, f32x8)` | 🟡 `[float32x4_t; 4]` | 🔵 `core::simd::f32x16` |`[f32; 16]` |
 | `F32x8` |`__m256` ||| 🔵 ||
 | `F64x8` |`__m512d` | 🟡 `(f64x4, f64x4)` | 🟡 `[float64x2_t; 4]` | 🔵 ||
 | `F64x4` |`__m256d` ||| 🔵 ||
-| `U8x64` |`__m512i` | || 🔵 ||
+| `U8x64` |`__m512i` | 🟠 `[u8; 64]` polyfill || 🔵 ||
 | `U8x32` |`__m256i` |`__m256i` || 🔵 ||
-| `U16x32` |`__m512i` | || 🔵 ||
-| `U32x16` |`__m512i` | || 🔵 ||
-| `U64x8` |`__m512i` | || 🔵 ||
-| `I8x32` |`__m256i` | || 🔵 ||
-| `I8x64` |`__m512i` | || 🔵 ||
-| `I16x16` |`__m256i` | || 🔵 ||
-| `I16x32` |`__m512i` | || 🔵 ||
-| `I32x16` |`__m512i` | || 🔵 ||
-| `I64x8` |`__m512i` | || 🔵 ||
+| `U16x32` |`__m512i` | 🟠 `[u16; 32]` polyfill || 🔵 ||
+| `U32x16` |`__m512i` | 🟠 `[u32; 16]` polyfill || 🔵 ||
+| `U64x8` |`__m512i` | 🟠 `[u64; 8]` polyfill || 🔵 ||
+| `I8x32` |`__m256i` | `__m256i` (in `simd_avx512`) || 🔵 ||
+| `I8x64` |`__m512i` | 🟠 `[i8; 64]` polyfill || 🔵 ||
+| `I16x16` |`__m256i` | `__m256i` (in `simd_avx512`) || 🔵 ||
+| `I16x32` |`__m512i` | 🟠 `[i16; 32]` polyfill || 🔵 ||
+| `I32x16` |`__m512i` | 🟠 `[i32; 16]` polyfill || 🔵 ||
+| `I64x8` |`__m512i` | 🟠 `[i64; 8]` polyfill || 🔵 ||
 | `BF16x8` |`__m128bh` ||| 🔵 ||
 | `BF16x16` |`__m256bh` ||| 🔵 ||
 | `F16x16` || 🟡 `F16Scaler` (scalar) || 🔵 ||
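
To make the 🟠 category concrete, here is a rough sketch of what a scalar-storage lane type generated by a macro might look like. It is not the real `avx2_int_type!` macro, whose body is not shown in this commit: the macro name `scalar_int_type!`, the method set (`splat`, `from_array`, `to_array`), and the choice of wrapping addition are all assumptions made for illustration. Only the `[$elem; $lanes]` storage and the 64-byte alignment are taken from the commit message.

```rust
// Hypothetical macro: generates a struct with the API shape of a SIMD lane
// type, but backed by a plain array. No SIMD instructions are involved.
macro_rules! scalar_int_type {
    ($name:ident, $elem:ty, $lanes:expr) => {
        #[derive(Clone, Copy, Debug, PartialEq)]
        #[repr(align(64))] // same alignment as a 512-bit register type
        pub struct $name(pub [$elem; $lanes]);

        impl $name {
            pub const LANES: usize = $lanes;

            /// Broadcast one value to every lane.
            pub fn splat(v: $elem) -> Self {
                Self([v; $lanes])
            }

            pub fn from_array(a: [$elem; $lanes]) -> Self {
                Self(a)
            }

            pub fn to_array(self) -> [$elem; $lanes] {
                self.0
            }
        }

        impl core::ops::Add for $name {
            type Output = Self;

            // Lane-wise wrapping add, written as an ordinary scalar loop.
            // (Wrapping on overflow is an assumption of this sketch.)
            fn add(self, rhs: Self) -> Self {
                let mut out = self.0;
                for (o, r) in out.iter_mut().zip(rhs.0.iter()) {
                    *o = o.wrapping_add(*r);
                }
                Self(out)
            }
        }
    };
}

scalar_int_type!(U8x64, u8, 64);
scalar_int_type!(I32x16, i32, 16);
```

A type like this satisfies callers that only need the API surface, which is why the matrix distinguishes it from ❌ (no type at all) and from 🟡 (a composite that still executes SIMD instructions on each half).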

src/simd.rs

Lines changed: 22 additions & 24 deletions
@@ -210,23 +210,21 @@ pub const PREFERRED_I16_LANES: usize = 16;
 // * aarch64 → simd_neon backend.
 // * everything else (wasm32, riscv, etc.) → scalar fallback.
 
-// Note on the `nightly-simd` feature: it adds the `crate::simd_nightly`
-// module (a portable-simd backend wrapping `core::simd`) but does NOT
-// replace the intrinsics dispatch below. The polyfill ships full
-// type-parity with production (PR #146): 24 types covering F32x8/16,
-// F64x4/8, BF16x8/16, F16x16, I8x32/64, I16x16/32, I32x16, I64x8,
-// U8x32/64, U16x32, U32x8/16, U64x4/8, plus the F32/F64 mask types —
-// matches the 24 types defined in `simd_avx2.rs` + `simd_avx512.rs`.
-// Consumers who want miri-runnable SIMD code import from `simd_nightly`
-// explicitly today (e.g. `use ndarray::simd_nightly::F32x16`).
-//
-// The remaining work for Miri-clean coverage of `hpc::*` is wiring this
-// file's `pub use crate::simd_{avx512,avx2,neon}::*` re-exports to
-// route through `simd_nightly` under `cfg(miri)`. Once that lands,
-// every `use crate::simd::F32x16` call site becomes miri-checkable
-// without source changes. The polyfill itself is no longer the bottleneck.
+// Nightly-simd dispatch — when `feature = "nightly-simd"` is on, the
+// `crate::simd_nightly` portable backend (wrapping `core::simd::*`)
+// REPLACES the intrinsics arms below. This is a compile-time-dispatch
+// choice: opt in via `cargo +nightly --features nightly-simd ...` and
+// the same `use crate::simd::F32x16` call sites become miri-runnable.
+// No target_arch constraint — `core::simd` is portable, so this arm
+// is the one true backend on wasm32 / riscv / aarch64 / x86_64 alike
+// as soon as `nightly-simd` is on.
+#[cfg(feature = "nightly-simd")]
+pub use crate::simd_nightly::{
+    BF16x16, BF16x8, F16x16, F32Mask16, F32Mask8, F32x16, F32x8, F64Mask4, F64Mask8, F64x4, F64x8, I16x16, I16x32,
+    I32x16, I64x8, I8x32, I8x64, U16x32, U32x16, U32x8, U64x4, U64x8, U8x32, U8x64,
+};
 
-#[cfg(all(target_arch = "x86_64", target_feature = "avx512f"))]
+#[cfg(all(target_arch = "x86_64", target_feature = "avx512f", not(feature = "nightly-simd")))]
 pub use crate::simd_avx512::{
     f32x16,
     f32x8,
@@ -276,7 +274,7 @@ pub use crate::simd_avx512::{bf16_to_f32_batch, bf16_to_f32_scalar, f32_to_bf16_
 #[cfg(target_arch = "x86_64")]
 pub use crate::simd_avx512::{f32_to_bf16_batch_rne, f32_to_bf16_scalar_rne};
 // BF16 SIMD types only available when avx512bf16 is enabled at compile time
-#[cfg(all(target_arch = "x86_64", target_feature = "avx512bf16"))]
+#[cfg(all(target_arch = "x86_64", target_feature = "avx512bf16", not(feature = "nightly-simd")))]
 pub use crate::simd_avx512::{BF16x16, BF16x8};
 
 // AVX2 baseline arm — selected by the `x86-64-v3` cargo default. The
@@ -290,10 +288,10 @@ pub use crate::simd_avx512::{BF16x16, BF16x8};
 // `RUSTFLAGS="-D warnings"` env, which overrides our v3 config.toml,
 // landing on x86-64 baseline → the previous tighter `avx2` predicate
 // left no matching arm).
-#[cfg(all(target_arch = "x86_64", not(target_feature = "avx512f")))]
+#[cfg(all(target_arch = "x86_64", not(target_feature = "avx512f"), not(feature = "nightly-simd")))]
 pub use crate::simd_avx512::{f32x8, f64x4, i16x16, i8x32, F32x8, F64x4, I16x16, I8x32};
 
-#[cfg(all(target_arch = "x86_64", not(target_feature = "avx512f")))]
+#[cfg(all(target_arch = "x86_64", not(target_feature = "avx512f"), not(feature = "nightly-simd")))]
 pub use crate::simd_avx2::{
     f32x16, f64x8, i16x32, i32x16, i64x8, i8x64, u32x16, u64x8, u8x64, F32Mask16, F32x16, F64Mask8, F64x8, I16x32,
     I32x16, I64x8, I8x64, U16x32, U32x16, U64x8, U8x64,
@@ -304,14 +302,14 @@ pub use crate::simd_avx2::{
 // AVX2 ops, and on AVX-512 builds it's the half-register companion to
 // U8x64. Lives in simd_avx2.rs (single source of truth) and is re-exported
 // from both tier branches.
-#[cfg(target_arch = "x86_64")]
+#[cfg(all(target_arch = "x86_64", not(feature = "nightly-simd")))]
 pub use crate::simd_avx2::{u8x32, U8x32};
 
 // ============================================================================
 // Non-x86: scalar fallback types with identical API
 // ============================================================================
 
-#[cfg(not(target_arch = "x86_64"))]
+#[cfg(all(not(target_arch = "x86_64"), not(feature = "nightly-simd")))]
 pub(crate) mod scalar {
     use core::fmt;
     use core::ops::{
@@ -1587,15 +1585,15 @@ pub(crate) mod scalar {
 // in simd_neon::aarch64_simd (verified 2026-04-30, agent A7 — burn parity item 9).
 // Integer + 256-bit float types still come from the scalar fallback; they're
 // not on the critical path for f32 BLAS-1 / VML kernels.
-#[cfg(target_arch = "aarch64")]
+#[cfg(all(target_arch = "aarch64", not(feature = "nightly-simd")))]
 pub use crate::simd_neon::aarch64_simd::{f32x16, f64x8, F32Mask16, F32x16, F64Mask8, F64x8};
-#[cfg(target_arch = "aarch64")]
+#[cfg(all(target_arch = "aarch64", not(feature = "nightly-simd")))]
 pub use scalar::{
     f32x8, f64x4, i32x16, i64x8, u32x16, u64x8, u8x64, F32x8, F64x4, I32x16, I64x8, U16x32, U32x16, U64x8, U8x64,
 };
 
 // Other non-x86 targets (wasm, riscv, etc.): full scalar fallback.
-#[cfg(all(not(target_arch = "x86_64"), not(target_arch = "aarch64")))]
+#[cfg(all(not(target_arch = "x86_64"), not(target_arch = "aarch64"), not(feature = "nightly-simd")))]
 pub use scalar::{
     f32x16, f32x8, f64x4, f64x8, i16x16, i16x32, i32x16, i64x8, i8x32, i8x64, u32x16, u64x8, u8x64, F32Mask16, F32x16,
     F32x8, F64Mask8, F64x4, F64x8, I16x16, I16x32, I32x16, I64x8, I8x32, I8x64, U16x32, U32x16, U64x8, U8x64,
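
One way to sanity-check the predicates in this diff is to model them as plain booleans and enumerate every build configuration. The sketch below does that for the single name `F32x16`. It is an illustrative model, not code from the repository: the `Arch` and `Arm` enums and both functions are hypothetical, and it assumes the AVX-512 arm also exports `F32x16` (its export list is truncated in the diff). The `tightened` flag switches between the old predicates and the ones this commit introduces.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Arch {
    X86_64,
    Aarch64,
    Other,
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Arm {
    Nightly,
    Avx512,
    Avx2,
    Neon,
    Scalar,
}

/// Returns every re-export arm that would provide `F32x16` for a build.
/// `tightened = true` models the predicates after this commit, where each
/// non-nightly arm also requires `not(feature = "nightly-simd")`.
pub fn f32x16_providers(arch: Arch, avx512f: bool, nightly: bool, tightened: bool) -> Vec<Arm> {
    let mut arms = Vec::new();

    // The nightly arm has no `target_arch` condition.
    if nightly {
        arms.push(Arm::Nightly);
    }

    // The older arms stay active unless the tightened predicate sees the feature.
    let legacy_active = !(tightened && nightly);
    if legacy_active {
        match (arch, avx512f) {
            (Arch::X86_64, true) => arms.push(Arm::Avx512),
            (Arch::X86_64, false) => arms.push(Arm::Avx2),
            (Arch::Aarch64, _) => arms.push(Arm::Neon),
            (Arch::Other, _) => arms.push(Arm::Scalar),
        }
    }
    arms
}

/// All (arch, avx512f, nightly) combinations in the model.
pub fn all_builds() -> Vec<(Arch, bool, bool)> {
    let mut builds = Vec::new();
    for arch in [Arch::X86_64, Arch::Aarch64, Arch::Other] {
        for avx512f in [false, true] {
            for nightly in [false, true] {
                builds.push((arch, avx512f, nightly));
            }
        }
    }
    builds
}
```

With `tightened = true`, every configuration yields exactly one provider. With `tightened = false` and the feature on, two arms match, which in real code would mean two `pub use` items importing the same name into one module, something the Rust compiler rejects as a duplicate definition. That is the conflict the added `not(feature = "nightly-simd")` conditions avoid.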
