Date: 2026-05-20 · Status: design v1 (post PR #170 PR-X12 A1 discussion). Companion to:
vertical-simd-consumer-contract.md(W1a consumer contract),databend-ndarray-simd-prompt.md,ndarray-simd-trojan-horse-prompt.md.
ndarray::simd::* is the single public surface every cognitive-shader,
splat, codec, BLAS, and FFI consumer reaches for. The current dispatch
in src/simd.rs is compile-time-only with arms keyed off
target_feature = "avx512f" / target_arch = "aarch64" / scalar
fallback. .cargo/config.toml pins target-cpu = x86-64-v4, baking
AVX-512 into every compiled artifact.
The consequence surfaced on PR #170 (tests/1.95.0 CI run
26151746204/76920666348):
38 tests in simd_avx2, simd_amx, simd_ops, simd_soa SIGILL on
a GitHub runner without AVX-512 silicon, all timing out uniformly
~19 s — the symptom of "binary cannot execute" rather than assertion
failure. The same configuration also leaves simd_nightly/* (the
portable-SIMD polyfill backend) unreachable because no dispatch arm in
simd.rs re-exports from it.
This document pins the target architecture, captures the parity gaps, ranks the technical debt, and sequences the integration.
Each build mode is a conscious cargo invocation via a distinct
.cargo/config*.toml. No silent fallbacks, no surprise hardware
mismatch. Whoever builds with v3 / v4 / native / nightly-simd
chose it deliberately.
| Config file | target-cpu |
Dispatch strategy | Default? | Use case |
|---|---|---|---|---|
.cargo/config.toml |
x86-64-v3 (AVX2) |
compile-time → simd_avx2 |
✅ default, GitHub CI | portable artifact across all x86_64 silicon ≥ 2013 |
.cargo/config-avx512.toml |
x86-64-v4 (AVX-512) |
compile-time → simd_avx512 |
explicit | benchmarking, AVX-512 deployment |
.cargo/config-native.toml |
native |
compile-time, build-machine CPUID resolved at rustc invocation → whatever arm matches the build host | explicit | developer machine builds |
.cargo/config-nightly.toml (+ --features nightly-simd) |
x86-64-v3 (or any) |
compile-time → simd_nightly (std::simd::* polyfill) |
explicit | miri / cargo-careful / portable-SIMD experiments |
The aarch64 path is automatic: any target_arch = "aarch64" build
selects simd_neon regardless of the config above.
Runtime LazyLock dispatch is a separate, fifth opt-in mode used
when shipping a single release binary that must adapt at process
start across heterogeneous deployment silicon (one binary running on
AVX-512 + AVX2-only machines from the same artifact). It compiles all
backends in and uses LazyLock<CpuCaps> trampolines. Reserved for the
release-binary distribution path; never the dev / CI default.
Compile-time arms read like a cascade, not like priority overrides
— each cargo config sets exactly one target_feature / feature such
that exactly one arm matches. The order below is the source-of-truth
ranking the compiler walks:
// 1. Explicit portable-SIMD polyfill (nightly + opt-in feature)
#[cfg(all(feature = "nightly-simd", any(target_arch = "x86_64", target_arch = "aarch64")))]
pub use crate::simd_nightly::{F32x16, F64x8, U8x32, U8x64, U16x32, U32x16, U64x8, I8x32, I8x64, I16x16, I16x32, I32x16, I64x8, F32Mask16, F64Mask8, BF16x16, BF16x8};
// 2. AVX-512 (target_feature = "avx512f"; set by `v4` and `native` configs on AVX-512 hosts)
#[cfg(all(target_arch = "x86_64", target_feature = "avx512f", not(feature = "nightly-simd")))]
pub use crate::simd_avx512::{...};
// 3. AVX2 baseline (the v3 / GitHub-CI default)
#[cfg(all(target_arch = "x86_64", target_feature = "avx2", not(target_feature = "avx512f"), not(feature = "nightly-simd")))]
pub use crate::simd_avx2::{...};
// 4. NEON (aarch64)
#[cfg(all(target_arch = "aarch64", not(feature = "nightly-simd")))]
pub use crate::simd_neon::aarch64_simd::{...};
// 5. Scalar fallback (everything else: wasm32, riscv, x86_64 without AVX2, etc.)
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64", feature = "nightly-simd")))]
pub use scalar::{...};Runtime dispatch via LazyLock<CpuCaps> lives in a separate
simd_runtime module (TBD per § 7.1) reached by a --features runtime-dispatch
flag, mutually exclusive with the compile-time arms above.
crate::simd::* ← user-facing key registry (re-exports only)
│
├── simd.rs = dispatch arms; no implementation, only `pub use`
│
├── simd_ops.rs = slice-level ops over crate::simd::* primitives
│ (add_f32, scale_f64, array_chunks, …)
│
├── simd_avx512.rs = __m512* values, native 512-bit registers
├── simd_avx2.rs = __m256* values + (F32x16, F64x8) as two-half
│ wrappers (struct F32x16(pub f32x8, pub f32x8))
├── simd_neon.rs = float32x4_t / uint64x2_t natives + larger shapes
│ composed as [float32x4_t; 4] etc.
├── simd_nightly/ = std::simd::* polyfill — portable, miri-executable
│ ├── f32_types.rs F32x16, F32x8
│ ├── f64_types.rs F64x8, F64x4
│ ├── u8_types.rs U8x64, U8x32
│ ├── u_word_types.rs U16x32, U32x16, U64x8
│ ├── i8_types.rs I8x64, I8x32
│ ├── i_word_types.rs I16x16, I16x32, I32x16, I64x8
│ ├── bf16_types.rs BF16x16, BF16x8
│ ├── f16_types.rs F16x16
│ ├── masks.rs F32Mask16, F32Mask8, F64Mask4, F64Mask8
│ └── ops.rs op impls
└── scalar (inline `mod scalar` in simd.rs)
= pure-Rust fallback for unknown arch
Every simd_<arch>.rs is just a SOURCE of typed primitives. simd.rs
chooses the source; the cargo config chooses how simd.rs chooses.
Legend: ✅ native, 🟡 composed wrapper (two-half / four-quarter), 🔵
scalar polyfill via core::simd, ❌ missing, ⛔ N/A for this arch.
| Lane type | simd_avx512 (v4) |
simd_avx2 (v3) |
simd_neon (aarch64) |
simd_nightly |
scalar |
|---|---|---|---|---|---|
F32x16 |
✅ __m512 |
🟡 (f32x8, f32x8) |
🟡 [float32x4_t; 4] |
🔵 core::simd::f32x16 |
✅ [f32; 16] |
F32x8 |
✅ __m256 |
❌ | ⛔ | 🔵 | ✅ |
F64x8 |
✅ __m512d |
🟡 (f64x4, f64x4) |
🟡 [float64x2_t; 4] |
🔵 | ✅ |
F64x4 |
✅ __m256d |
❌ | ⛔ | 🔵 | ✅ |
U8x64 |
✅ __m512i |
❌ | ❌ | 🔵 | ✅ |
U8x32 |
✅ __m256i |
✅ __m256i |
❌ | 🔵 | ✅ |
U16x32 |
✅ __m512i |
❌ | ❌ | 🔵 | ✅ |
U32x16 |
✅ __m512i |
❌ | ❌ | 🔵 | ✅ |
U64x8 |
✅ __m512i |
❌ | ❌ | 🔵 | ✅ |
I8x32 |
✅ __m256i |
❌ | ❌ | 🔵 | ✅ |
I8x64 |
✅ __m512i |
❌ | ❌ | 🔵 | ✅ |
I16x16 |
✅ __m256i |
❌ | ❌ | 🔵 | ✅ |
I16x32 |
✅ __m512i |
❌ | ❌ | 🔵 | ✅ |
I32x16 |
✅ __m512i |
❌ | ❌ | 🔵 | ✅ |
I64x8 |
✅ __m512i |
❌ | ❌ | 🔵 | ✅ |
BF16x8 |
✅ __m128bh |
❌ | ❌ | ❌ | ✅ |
BF16x16 |
✅ __m256bh |
❌ | ❌ | 🔵 | ✅ |
F16x16 |
❌ | 🟡 F16Scaler (scalar) |
❌ | 🔵 | ✅ |
F32Mask16 |
✅ __mmask16 |
✅ u16 bitmask |
✅ u16 bitmask |
🔵 | ✅ |
F64Mask8 |
✅ __mmask8 |
✅ u8 bitmask |
✅ u8 bitmask |
🔵 | ✅ |
Aarch64-native narrower types (only useful directly when the
consumer wants 128-bit shapes): I8x16, I16x8, U8x16, U16x8,
U32x4, U64x2, I32x4, I64x2. These are not in the cross-arch
parity surface — consumers requesting 256-bit / 512-bit shapes go
through the composed wrappers.
- F32x16 + F64x8 are universal — all four backends ship them. Hot paths can rely on these without branching.
simd_avx2is the bottleneck. It only exposesF32x16,F64x8,F32Mask16,F64Mask8,U8x32, and anF16Scaler. Every other cross-arch lane is missing — making the v3 default config crash any consumer that reaches forU64x8,I32x16,U16x32, etc.- NEON is even sparser at the 256/512-bit level.
simd_nightlyis the most complete but is unreachable today becausesimd.rshas no arm wiringfeature = "nightly-simd"to its re-exports.scalarhas comprehensive cover and is the safest fallback for any arch the others miss, but lives inline insimd.rsrather than in a dedicatedsimd_scalar.rs. Symmetry would help.
Ranked by P0 (blocks current CI / consumers) → P3 (nice-to-have).
| ID | Severity | Description | Detection | Fix scope |
|---|---|---|---|---|
| TD-SIMD-1 | P0 | .cargo/config.toml defaults to x86-64-v4 → every CI runner without AVX-512 silicon SIGILLs on the first SIMD op. 38 tests fail at 19 s timeout each on tests/1.95.0. |
PR #170 CI run | Change default to x86-64-v3; add .cargo/config-avx512.toml for the opt-in AVX-512 path. ~5 LoC. |
| TD-SIMD-2 | P0 | simd_avx2.rs ships F32x16/F64x8/U8x32 only. Consumers requesting U64x8, I32x16, U16x32, BF16x16, etc. fail to compile on the v3 path. |
grep pub use crate::simd_avx2:: then cross-ref against the parity matrix |
Add the missing types as two-half wrappers (U64x8(pub u64x4, pub u64x4) etc.) over native __m256i halves. ~500 LoC. |
| TD-SIMD-3 | P1 | simd.rs has no dispatch arm for #[cfg(feature = "nightly-simd")] → the simd_nightly polyfill is unreachable. miri / cargo-careful jobs that should exercise the portable path fall through to whatever cfg cascade matches, never to std::simd::*. |
grep simd_nightly in simd.rs (returns 0 dispatch arms) |
Add the feature = "nightly-simd" arm at the top of the cascade per § 2. ~30 LoC. |
| TD-SIMD-4 | P1 | simd_neon.rs only ships F32x16 / F64x8 cross-arch shapes. Consumers reaching for U8x64, U64x8, I32x16, etc. on aarch64 have no path. |
grep + parity matrix | Compose larger shapes from native NEON 128-bit lanes (U8x64([uint8x16_t; 4]), U64x8([uint64x2_t; 4]), etc.). ~400 LoC. |
| TD-SIMD-5 | P1 | Scalar fallback inline in simd.rs (pub(crate) mod scalar) makes symmetry hard — every other backend is its own file. |
inspection | Promote to src/simd_scalar.rs; simd.rs becomes pure dispatch. ~mechanical refactor. |
| TD-SIMD-6 | P2 | No runtime-dispatch feature / simd_runtime module exists yet. Release-binary distribution to heterogeneous silicon requires recompile per target today. |
grep -r "LazyLock<CpuCaps>" only matches reporting code in simd.rs:52-55 |
New module wiring per-op trampolines from the compiled-in backends. ~300 LoC + one new cargo feature. |
| TD-SIMD-7 | P2 | Compile-time arms in simd.rs:153-194 are duplicated four times (one per type group: F32x16, F64x8, U8x32, BF16x16). Adding a new lane requires copy-pasting four #[cfg(...)] arms. |
inspection | Single source-of-truth macro emitting the arms. ~one macro_rules!, 50 LoC. |
| TD-SIMD-8 | P2 | F16Scaler in simd_avx2.rs:2566 is a scalar implementation masquerading as a SIMD type. Consumers using F16x16 on v3 get scalar perf without warning. |
grep F16Scaler |
Either gate F16x16 behind target_feature = "f16c" or rename / document the scalar nature. ~20 LoC + docs. |
| TD-SIMD-9 | P3 | No CI matrix entry for the nightly-simd polyfill path. |
.github/workflows/ci.yaml |
Add a nightly-simd-polyfill job that builds with --features nightly-simd on nightly rustc. ~20 LoC YAML. |
| TD-SIMD-10 | P3 | No CI matrix entry for .cargo/config-avx512.toml. AVX-512 deployment path silently bit-rots between PRs. |
.github/workflows/ci.yaml |
Add an avx-512-explicit job using a runner with AVX-512 silicon. ~20 LoC YAML; runner availability TBD. |
Each phase is a single-PR worker (sized for one Sonnet impl-sprint per
the .claude/EN/agents/worker-template.md shape). Phases sequence so
each lands a green CI; the next phase depends only on shipped state.
Goal: GitHub tests/1.95.0 job green. The default .cargo/config.toml
build runs end-to-end on AVX2-only silicon.
| # | Worker | Scope | Files | Acceptance |
|---|---|---|---|---|
| 1.1 | flip baseline | Change target-cpu from v4 → v3. Add .cargo/config-avx512.toml with the old v4 value. |
.cargo/config.toml, .cargo/config-avx512.toml |
cargo check clean on default; tests/1.95.0 no longer SIGILLs |
| 1.2 | AVX2 two-half wrappers — float | Add U8x64, U64x8, U32x16, U16x32, I8x32, I8x64, I16x16, I16x32, I32x16, I64x8 as two-half wrappers over native AVX2 __m256i halves. |
src/simd_avx2.rs |
per-type parity test vs simd_avx512 on AVX-512 host; per-type unit test on AVX2-only |
| 1.3 | simd.rs dispatch refresh | Add the AVX2 cfg arm wiring the new wrappers; tighten existing arms with the new precedence (per § 2). | src/simd.rs |
cargo check --features approx,serde,rayon clean on default config; cargo check clean on --config .cargo/config-avx512.toml |
After Phase 1, PR #170 (PR-X12 A1) and any future consumer PR ships green CI by default. AVX-512 testing becomes an explicit job.
Goal: cargo +nightly check --features nightly-simd reaches
simd_nightly/* via crate::simd::*. miri can execute the portable
path.
| # | Worker | Scope | Files | Acceptance |
|---|---|---|---|---|
| 2.1 | nightly-simd dispatch arm | Add #[cfg(feature = "nightly-simd")] arms in simd.rs re-exporting every typed lane from crate::simd_nightly::*. |
src/simd.rs |
crate::simd::F32x16 resolves to core::simd::f32x16 under the feature |
| 2.2 | nightly-simd parity tests | Run the existing simd_ops / simd_soa test suite against the polyfill backend. | src/simd_nightly/tests.rs |
all simd_ops + simd_soa tests pass under --features nightly-simd |
| 2.3 | CI matrix | Add nightly-simd-polyfill job to .github/workflows/ci.yaml. |
.github/workflows/ci.yaml |
job green on nightly rustc with the feature |
Goal: aarch64 build reaches the same cross-arch lane set as the v3 config.
| # | Worker | Scope | Files | Acceptance |
|---|---|---|---|---|
| 3.1 | NEON quartet wrappers | Compose U8x64, U64x8, U32x16, U16x32, I8x32, I8x64, I16x16, I16x32, I32x16, I64x8 from native 128-bit NEON lanes. |
src/simd_neon.rs |
parity vs simd_avx2 two-half wrappers on a 16-pair fixture |
| 3.2 | simd.rs aarch64 arms | Extend aarch64 arms to re-export the new types. |
src/simd.rs |
cargo check --target aarch64-unknown-linux-gnu clean |
| # | Worker | Scope | Files | Acceptance |
|---|---|---|---|---|
| 4.1 | scalar → file | Promote mod scalar to src/simd_scalar.rs. |
src/simd.rs, new src/simd_scalar.rs |
no behaviour change; cargo check clean on all configs |
| 4.2 | dispatch macro | Collapse the 4× duplicated #[cfg(...)] blocks into one macro. |
src/simd.rs |
adding a new lane type is one macro invocation |
| 4.3 | F16 honesty | Either rename F16Scaler or gate F16x16 behind f16c. |
src/simd_avx2.rs |
scalar perf no longer surprises hot-path consumers |
Goal: ship-once binaries that adapt across heterogeneous deployment silicon.
| # | Worker | Scope | Files | Acceptance |
|---|---|---|---|---|
| 5.1 | simd_runtime module |
New module compiling all backends in and selecting per-op trampolines via LazyLock<CpuCaps>. |
src/simd_runtime.rs |
one binary runs on AVX-512 + AVX2-only hosts from the same artifact |
| 5.2 | feature flag | New runtime-dispatch cargo feature, mutually exclusive with nightly-simd. |
Cargo.toml, src/simd.rs |
cargo check --features runtime-dispatch clean on the v3 baseline |
| 5.3 | CI matrix | Add a runtime-dispatch-portable job. |
.github/workflows/ci.yaml |
job green |
| # | Worker | Scope | Files | Acceptance |
|---|---|---|---|---|
| 6.1 | AVX-512 explicit job | Add avx-512-explicit to .github/workflows/ci.yaml using --config .cargo/config-avx512.toml. Requires AVX-512-capable runner. |
.github/workflows/ci.yaml |
green on the AVX-512 runner |
- Runtime trampoline cost class. Phase 5's per-op indirection
adds one indirect call per
F32x16::add(...). Acceptable for the typical 100+ cycle SIMD-op cost, but consumer benchmarks should sanity-check before declaring the path production-ready. feature = "nightly-simd"precedence. § 2 puts it at the top of the cascade; alternative reading is "polyfill is for miri only, so put it BELOW the arch-specific arms and only fire on non-x86_64, non-aarch64 targets." The current proposal matches the user's "explicit opt-in wins" framing; revisit if there's a use case for--features nightly-simdon an AVX-512 host wanting the AVX-512 path.- AMX status.
simd_amx.rs(Sapphire Rapids+ tile ops) is x86_64-only and orthogonal to the F32x16 / U8x64 cross-arch surface. Out of scope for this document; tracked under PR-X10 A6 (linalg::distance) follow-ups.
.claude/knowledge/vertical-simd-consumer-contract.md— W1a consumer contract every new SIMD primitive follows (struct methods on typed wrappers, three-backend parity test, saturating/overflow semantics documented)..claude/knowledge/databend-ndarray-simd-prompt.md— Databend integration consumer ofcrate::simd::*..claude/knowledge/ndarray-simd-trojan-horse-prompt.md— ClickHouse + Tantivy injection plan; depends on Phase 1 + 2 landing.src/simd.rslines 52-55 — existingis_x86_feature_detected!reporting (NOT dispatch) — repurpose for Phase 5 trampoline.src/simd_nightly/mod.rslines 37-44 — completepub useset ready to be wired intosimd.rsdispatch (Phase 2).
Default cargo config drops to x86-64-v3 (AVX2) → GitHub CI green by
default. .cargo/config-avx512.toml is the explicit AVX-512 path.
simd_avx2.rs needs ~10 missing two-half wrappers (P0, Phase 1).
simd.rs needs a nightly-simd dispatch arm so simd_nightly/*
becomes reachable (P1, Phase 2). NEON gets quartet wrappers (P1, Phase
3). Scalar / macros / runtime-dispatch / explicit-AVX-512 CI are
P2-P3 follow-ups (Phases 4-6). Each phase is one PR; landing
in order keeps every step green.