SIMD Dispatch Architecture — design, parity, tech debt, integration plan

Date: 2026-05-20 · Status: design v1 (post PR #170 PR-X12 A1 discussion). Companion to: vertical-simd-consumer-contract.md (W1a consumer contract), databend-ndarray-simd-prompt.md, ndarray-simd-trojan-horse-prompt.md.

1. Why this exists

ndarray::simd::* is the single public surface every cognitive-shader, splat, codec, BLAS, and FFI consumer reaches for. The current dispatch in src/simd.rs is compile-time-only with arms keyed off target_feature = "avx512f" / target_arch = "aarch64" / scalar fallback. .cargo/config.toml pins target-cpu = x86-64-v4, baking AVX-512 into every compiled artifact.

The consequence surfaced on PR #170 (tests/1.95.0 CI run 26151746204/76920666348): 38 tests in simd_avx2, simd_amx, simd_ops, simd_soa SIGILL on a GitHub runner without AVX-512 silicon, all timing out uniformly ~19 s — the symptom of "binary cannot execute" rather than assertion failure. The same configuration also leaves simd_nightly/* (the portable-SIMD polyfill backend) unreachable because no dispatch arm in simd.rs re-exports from it.

This document pins the target architecture, captures the parity gaps, ranks the technical debt, and sequences the integration.

2. Dispatch model — three build configs, one runtime mode

Each build mode is a conscious cargo invocation via a distinct .cargo/config*.toml. No silent fallbacks, no surprise hardware mismatch. Whoever builds with v3 / v4 / native / nightly-simd chose it deliberately.

Config file	`target-cpu`	Dispatch strategy	Default?	Use case
`.cargo/config.toml`	`x86-64-v3` (AVX2)	compile-time → `simd_avx2`	✅ default, GitHub CI	portable artifact across all x86_64 silicon ≥ 2013
`.cargo/config-avx512.toml`	`x86-64-v4` (AVX-512)	compile-time → `simd_avx512`	explicit	benchmarking, AVX-512 deployment
`.cargo/config-native.toml`	`native`	compile-time, build-machine CPUID resolved at rustc invocation → whatever arm matches the build host	explicit	developer machine builds
`.cargo/config-nightly.toml` (+ `--features nightly-simd`)	`x86-64-v3` (or any)	compile-time → `simd_nightly` (`std::simd::*` polyfill)	explicit	miri / cargo-careful / portable-SIMD experiments

The aarch64 path is automatic: any target_arch = "aarch64" build selects simd_neon regardless of the config above.

Runtime LazyLock dispatch is a separate, fifth opt-in mode used when shipping a single release binary that must adapt at process start across heterogeneous deployment silicon (one binary running on AVX-512 + AVX2-only machines from the same artifact). It compiles all backends in and uses LazyLock<CpuCaps> trampolines. Reserved for the release-binary distribution path; never the dev / CI default.

Dispatch precedence in `simd.rs`

Compile-time arms read like a cascade, not like priority overrides — each cargo config sets exactly one target_feature / feature such that exactly one arm matches. The order below is the source-of-truth ranking the compiler walks:

// 1. Explicit portable-SIMD polyfill (nightly + opt-in feature)
#[cfg(all(feature = "nightly-simd", any(target_arch = "x86_64", target_arch = "aarch64")))]
pub use crate::simd_nightly::{F32x16, F64x8, U8x32, U8x64, U16x32, U32x16, U64x8, I8x32, I8x64, I16x16, I16x32, I32x16, I64x8, F32Mask16, F64Mask8, BF16x16, BF16x8};

// 2. AVX-512 (target_feature = "avx512f"; set by `v4` and `native` configs on AVX-512 hosts)
#[cfg(all(target_arch = "x86_64", target_feature = "avx512f", not(feature = "nightly-simd")))]
pub use crate::simd_avx512::{...};

// 3. AVX2 baseline (the v3 / GitHub-CI default)
#[cfg(all(target_arch = "x86_64", target_feature = "avx2", not(target_feature = "avx512f"), not(feature = "nightly-simd")))]
pub use crate::simd_avx2::{...};

// 4. NEON (aarch64)
#[cfg(all(target_arch = "aarch64", not(feature = "nightly-simd")))]
pub use crate::simd_neon::aarch64_simd::{...};

// 5. Scalar fallback (everything else: wasm32, riscv, x86_64 without AVX2, etc.)
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64", feature = "nightly-simd")))]
pub use scalar::{...};

Runtime dispatch via LazyLock<CpuCaps> lives in a separate simd_runtime module (TBD per § 7.1) reached by a --features runtime-dispatch flag, mutually exclusive with the compile-time arms above.

3. Module roles

crate::simd::*                ← user-facing key registry (re-exports only)
        │
        ├── simd.rs           = dispatch arms; no implementation, only `pub use`
        │
        ├── simd_ops.rs       = slice-level ops over crate::simd::* primitives
        │                       (add_f32, scale_f64, array_chunks, …)
        │
        ├── simd_avx512.rs    = __m512* values, native 512-bit registers
        ├── simd_avx2.rs      = __m256* values + (F32x16, F64x8) as two-half
        │                       wrappers (struct F32x16(pub f32x8, pub f32x8))
        ├── simd_neon.rs      = float32x4_t / uint64x2_t natives + larger shapes
        │                       composed as [float32x4_t; 4] etc.
        ├── simd_nightly/     = std::simd::* polyfill — portable, miri-executable
        │   ├── f32_types.rs    F32x16, F32x8
        │   ├── f64_types.rs    F64x8, F64x4
        │   ├── u8_types.rs     U8x64, U8x32
        │   ├── u_word_types.rs U16x32, U32x16, U64x8
        │   ├── i8_types.rs     I8x64, I8x32
        │   ├── i_word_types.rs I16x16, I16x32, I32x16, I64x8
        │   ├── bf16_types.rs   BF16x16, BF16x8
        │   ├── f16_types.rs    F16x16
        │   ├── masks.rs        F32Mask16, F32Mask8, F64Mask4, F64Mask8
        │   └── ops.rs          op impls
        └── scalar (inline `mod scalar` in simd.rs)
                              = pure-Rust fallback for unknown arch

Every simd_<arch>.rs is just a SOURCE of typed primitives. simd.rs chooses the source; the cargo config chooses how simd.rs chooses.

4. Parity matrix — typed lane primitives per backend

Legend: ✅ native, 🟡 composed wrapper (two-half / four-quarter), 🔵 scalar polyfill via core::simd, ❌ missing, ⛔ N/A for this arch.

Lane type	`simd_avx512` (v4)	`simd_avx2` (v3)	`simd_neon` (aarch64)	`simd_nightly`	`scalar`
`F32x16`	✅ `__m512`	🟡 `(f32x8, f32x8)`	🟡 `[float32x4_t; 4]`	🔵 `core::simd::f32x16`	✅ `[f32; 16]`
`F32x8`	✅ `__m256`	❌	⛔	🔵	✅
`F64x8`	✅ `__m512d`	🟡 `(f64x4, f64x4)`	🟡 `[float64x2_t; 4]`	🔵	✅
`F64x4`	✅ `__m256d`	❌	⛔	🔵	✅
`U8x64`	✅ `__m512i`	❌	❌	🔵	✅
`U8x32`	✅ `__m256i`	✅ `__m256i`	❌	🔵	✅
`U16x32`	✅ `__m512i`	❌	❌	🔵	✅
`U32x16`	✅ `__m512i`	❌	❌	🔵	✅
`U64x8`	✅ `__m512i`	❌	❌	🔵	✅
`I8x32`	✅ `__m256i`	❌	❌	🔵	✅
`I8x64`	✅ `__m512i`	❌	❌	🔵	✅
`I16x16`	✅ `__m256i`	❌	❌	🔵	✅
`I16x32`	✅ `__m512i`	❌	❌	🔵	✅
`I32x16`	✅ `__m512i`	❌	❌	🔵	✅
`I64x8`	✅ `__m512i`	❌	❌	🔵	✅
`BF16x8`	✅ `__m128bh`	❌	❌	❌	✅
`BF16x16`	✅ `__m256bh`	❌	❌	🔵	✅
`F16x16`	❌	🟡 `F16Scaler` (scalar)	❌	🔵	✅
`F32Mask16`	✅ `__mmask16`	✅ `u16` bitmask	✅ `u16` bitmask	🔵	✅
`F64Mask8`	✅ `__mmask8`	✅ `u8` bitmask	✅ `u8` bitmask	🔵	✅

Aarch64-native narrower types (only useful directly when the consumer wants 128-bit shapes): I8x16, I16x8, U8x16, U16x8, U32x4, U64x2, I32x4, I64x2. These are not in the cross-arch parity surface — consumers requesting 256-bit / 512-bit shapes go through the composed wrappers.

Read of the matrix

F32x16 + F64x8 are universal — all four backends ship them. Hot paths can rely on these without branching.
simd_avx2 is the bottleneck. It only exposes F32x16, F64x8, F32Mask16, F64Mask8, U8x32, and an F16Scaler. Every other cross-arch lane is missing — making the v3 default config crash any consumer that reaches for U64x8, I32x16, U16x32, etc.
NEON is even sparser at the 256/512-bit level.
simd_nightly is the most complete but is unreachable today because simd.rs has no arm wiring feature = "nightly-simd" to its re-exports.
scalar has comprehensive cover and is the safest fallback for any arch the others miss, but lives inline in simd.rs rather than in a dedicated simd_scalar.rs. Symmetry would help.

5. Technical debt matrix

Ranked by P0 (blocks current CI / consumers) → P3 (nice-to-have).

ID	Severity	Description	Detection	Fix scope
TD-SIMD-1	P0	`.cargo/config.toml` defaults to `x86-64-v4` → every CI runner without AVX-512 silicon SIGILLs on the first SIMD op. 38 tests fail at 19 s timeout each on `tests/1.95.0`.	PR #170 CI run	Change default to `x86-64-v3`; add `.cargo/config-avx512.toml` for the opt-in AVX-512 path. ~5 LoC.
TD-SIMD-2	P0	`simd_avx2.rs` ships `F32x16`/`F64x8`/`U8x32` only. Consumers requesting `U64x8`, `I32x16`, `U16x32`, `BF16x16`, etc. fail to compile on the v3 path.	grep `pub use crate::simd_avx2::` then cross-ref against the parity matrix	Add the missing types as two-half wrappers (`U64x8(pub u64x4, pub u64x4)` etc.) over native `__m256i` halves. ~500 LoC.
TD-SIMD-3	P1	`simd.rs` has no dispatch arm for `#[cfg(feature = "nightly-simd")]` → the `simd_nightly` polyfill is unreachable. miri / cargo-careful jobs that should exercise the portable path fall through to whatever cfg cascade matches, never to `std::simd::*`.	grep `simd_nightly` in `simd.rs` (returns 0 dispatch arms)	Add the `feature = "nightly-simd"` arm at the top of the cascade per § 2. ~30 LoC.
TD-SIMD-4	P1	`simd_neon.rs` only ships `F32x16` / `F64x8` cross-arch shapes. Consumers reaching for `U8x64`, `U64x8`, `I32x16`, etc. on aarch64 have no path.	grep + parity matrix	Compose larger shapes from native NEON 128-bit lanes (`U8x64([uint8x16_t; 4])`, `U64x8([uint64x2_t; 4])`, etc.). ~400 LoC.
TD-SIMD-5	P1	Scalar fallback inline in `simd.rs` (`pub(crate) mod scalar`) makes symmetry hard — every other backend is its own file.	inspection	Promote to `src/simd_scalar.rs`; `simd.rs` becomes pure dispatch. ~mechanical refactor.
TD-SIMD-6	P2	No `runtime-dispatch` feature / `simd_runtime` module exists yet. Release-binary distribution to heterogeneous silicon requires recompile per target today.	`grep -r "LazyLock<CpuCaps>"` only matches reporting code in `simd.rs:52-55`	New module wiring per-op trampolines from the compiled-in backends. ~300 LoC + one new cargo feature.
TD-SIMD-7	P2	Compile-time arms in `simd.rs:153-194` are duplicated four times (one per type group: F32x16, F64x8, U8x32, BF16x16). Adding a new lane requires copy-pasting four `#[cfg(...)]` arms.	inspection	Single source-of-truth macro emitting the arms. ~one macro_rules!, 50 LoC.
TD-SIMD-8	P2	`F16Scaler` in `simd_avx2.rs:2566` is a scalar implementation masquerading as a SIMD type. Consumers using `F16x16` on v3 get scalar perf without warning.	grep `F16Scaler`	Either gate `F16x16` behind `target_feature = "f16c"` or rename / document the scalar nature. ~20 LoC + docs.
TD-SIMD-9	P3	No CI matrix entry for the `nightly-simd` polyfill path.	`.github/workflows/ci.yaml`	Add a `nightly-simd-polyfill` job that builds with `--features nightly-simd` on nightly rustc. ~20 LoC YAML.
TD-SIMD-10	P3	No CI matrix entry for `.cargo/config-avx512.toml`. AVX-512 deployment path silently bit-rots between PRs.	`.github/workflows/ci.yaml`	Add an `avx-512-explicit` job using a runner with AVX-512 silicon. ~20 LoC YAML; runner availability TBD.

6. Integration plan — sequenced sprints

Each phase is a single-PR worker (sized for one Sonnet impl-sprint per the .claude/EN/agents/worker-template.md shape). Phases sequence so each lands a green CI; the next phase depends only on shipped state.

Phase 1 — Unblock CI (P0 fixes)

Goal: GitHub tests/1.95.0 job green. The default .cargo/config.toml build runs end-to-end on AVX2-only silicon.

#	Worker	Scope	Files	Acceptance
1.1	flip baseline	Change `target-cpu` from `v4` → `v3`. Add `.cargo/config-avx512.toml` with the old `v4` value.	`.cargo/config.toml`, `.cargo/config-avx512.toml`	`cargo check` clean on default; `tests/1.95.0` no longer SIGILLs
1.2	AVX2 two-half wrappers — float	Add `U8x64`, `U64x8`, `U32x16`, `U16x32`, `I8x32`, `I8x64`, `I16x16`, `I16x32`, `I32x16`, `I64x8` as two-half wrappers over native AVX2 `__m256i` halves.	`src/simd_avx2.rs`	per-type parity test vs `simd_avx512` on AVX-512 host; per-type unit test on AVX2-only
1.3	simd.rs dispatch refresh	Add the AVX2 cfg arm wiring the new wrappers; tighten existing arms with the new precedence (per § 2).	`src/simd.rs`	`cargo check --features approx,serde,rayon` clean on default config; `cargo check` clean on `--config .cargo/config-avx512.toml`

After Phase 1, PR #170 (PR-X12 A1) and any future consumer PR ships green CI by default. AVX-512 testing becomes an explicit job.

Phase 2 — Unblock the polyfill (P1: `nightly-simd`)

Goal: cargo +nightly check --features nightly-simd reaches simd_nightly/* via crate::simd::*. miri can execute the portable path.

#	Worker	Scope	Files	Acceptance
2.1	nightly-simd dispatch arm	Add `#[cfg(feature = "nightly-simd")]` arms in `simd.rs` re-exporting every typed lane from `crate::simd_nightly::*`.	`src/simd.rs`	`crate::simd::F32x16` resolves to `core::simd::f32x16` under the feature
2.2	nightly-simd parity tests	Run the existing simd_ops / simd_soa test suite against the polyfill backend.	`src/simd_nightly/tests.rs`	all simd_ops + simd_soa tests pass under `--features nightly-simd`
2.3	CI matrix	Add `nightly-simd-polyfill` job to `.github/workflows/ci.yaml`.	`.github/workflows/ci.yaml`	job green on nightly rustc with the feature

Phase 3 — NEON parity (P1)

Goal: aarch64 build reaches the same cross-arch lane set as the v3 config.

#	Worker	Scope	Files	Acceptance
3.1	NEON quartet wrappers	Compose `U8x64`, `U64x8`, `U32x16`, `U16x32`, `I8x32`, `I8x64`, `I16x16`, `I16x32`, `I32x16`, `I64x8` from native 128-bit NEON lanes.	`src/simd_neon.rs`	parity vs `simd_avx2` two-half wrappers on a 16-pair fixture
3.2	simd.rs aarch64 arms	Extend `aarch64` arms to re-export the new types.	`src/simd.rs`	`cargo check --target aarch64-unknown-linux-gnu` clean

Phase 4 — Symmetry + ergonomics (P1/P2)

#	Worker	Scope	Files	Acceptance
4.1	scalar → file	Promote `mod scalar` to `src/simd_scalar.rs`.	`src/simd.rs`, new `src/simd_scalar.rs`	no behaviour change; `cargo check` clean on all configs
4.2	dispatch macro	Collapse the 4× duplicated `#[cfg(...)]` blocks into one macro.	`src/simd.rs`	adding a new lane type is one macro invocation
4.3	F16 honesty	Either rename `F16Scaler` or gate `F16x16` behind `f16c`.	`src/simd_avx2.rs`	scalar perf no longer surprises hot-path consumers

Phase 5 — Runtime dispatch (P2, opt-in)

Goal: ship-once binaries that adapt across heterogeneous deployment silicon.

#	Worker	Scope	Files	Acceptance
5.1	`simd_runtime` module	New module compiling all backends in and selecting per-op trampolines via `LazyLock<CpuCaps>`.	`src/simd_runtime.rs`	one binary runs on AVX-512 + AVX2-only hosts from the same artifact
5.2	feature flag	New `runtime-dispatch` cargo feature, mutually exclusive with `nightly-simd`.	`Cargo.toml`, `src/simd.rs`	`cargo check --features runtime-dispatch` clean on the v3 baseline
5.3	CI matrix	Add a `runtime-dispatch-portable` job.	`.github/workflows/ci.yaml`	job green

Phase 6 — CI matrix for explicit AVX-512 (P3)

#	Worker	Scope	Files	Acceptance
6.1	AVX-512 explicit job	Add `avx-512-explicit` to `.github/workflows/ci.yaml` using `--config .cargo/config-avx512.toml`. Requires AVX-512-capable runner.	`.github/workflows/ci.yaml`	green on the AVX-512 runner

7. Open questions

Runtime trampoline cost class. Phase 5's per-op indirection adds one indirect call per F32x16::add(...). Acceptable for the typical 100+ cycle SIMD-op cost, but consumer benchmarks should sanity-check before declaring the path production-ready.
feature = "nightly-simd" precedence. § 2 puts it at the top of the cascade; alternative reading is "polyfill is for miri only, so put it BELOW the arch-specific arms and only fire on non-x86_64, non-aarch64 targets." The current proposal matches the user's "explicit opt-in wins" framing; revisit if there's a use case for --features nightly-simd on an AVX-512 host wanting the AVX-512 path.
AMX status. simd_amx.rs (Sapphire Rapids+ tile ops) is x86_64-only and orthogonal to the F32x16 / U8x64 cross-arch surface. Out of scope for this document; tracked under PR-X10 A6 (linalg::distance) follow-ups.

8. Cross-references

.claude/knowledge/vertical-simd-consumer-contract.md — W1a consumer contract every new SIMD primitive follows (struct methods on typed wrappers, three-backend parity test, saturating/overflow semantics documented).
.claude/knowledge/databend-ndarray-simd-prompt.md — Databend integration consumer of crate::simd::*.
.claude/knowledge/ndarray-simd-trojan-horse-prompt.md — ClickHouse + Tantivy injection plan; depends on Phase 1 + 2 landing.
src/simd.rs lines 52-55 — existing is_x86_feature_detected! reporting (NOT dispatch) — repurpose for Phase 5 trampoline.
src/simd_nightly/mod.rs lines 37-44 — complete pub use set ready to be wired into simd.rs dispatch (Phase 2).

9. TL;DR

Default cargo config drops to x86-64-v3 (AVX2) → GitHub CI green by default. .cargo/config-avx512.toml is the explicit AVX-512 path. simd_avx2.rs needs ~10 missing two-half wrappers (P0, Phase 1). simd.rs needs a nightly-simd dispatch arm so simd_nightly/* becomes reachable (P1, Phase 2). NEON gets quartet wrappers (P1, Phase 3). Scalar / macros / runtime-dispatch / explicit-AVX-512 CI are P2-P3 follow-ups (Phases 4-6). Each phase is one PR; landing in order keeps every step green.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SIMD Dispatch Architecture — design, parity, tech debt, integration plan

1. Why this exists

2. Dispatch model — three build configs, one runtime mode

Dispatch precedence in `simd.rs`

3. Module roles

4. Parity matrix — typed lane primitives per backend

Read of the matrix

5. Technical debt matrix

6. Integration plan — sequenced sprints

Phase 1 — Unblock CI (P0 fixes)

Phase 2 — Unblock the polyfill (P1: `nightly-simd`)

Phase 3 — NEON parity (P1)

Phase 4 — Symmetry + ergonomics (P1/P2)

Phase 5 — Runtime dispatch (P2, opt-in)

Phase 6 — CI matrix for explicit AVX-512 (P3)

7. Open questions

8. Cross-references

9. TL;DR

FilesExpand file tree

simd-dispatch-architecture.md

Latest commit

History

simd-dispatch-architecture.md

File metadata and controls

SIMD Dispatch Architecture — design, parity, tech debt, integration plan

1. Why this exists

2. Dispatch model — three build configs, one runtime mode

Dispatch precedence in simd.rs

3. Module roles

4. Parity matrix — typed lane primitives per backend

Read of the matrix

5. Technical debt matrix

6. Integration plan — sequenced sprints

Phase 1 — Unblock CI (P0 fixes)

Phase 2 — Unblock the polyfill (P1: nightly-simd)

Phase 3 — NEON parity (P1)

Phase 4 — Symmetry + ergonomics (P1/P2)

Phase 5 — Runtime dispatch (P2, opt-in)

Phase 6 — CI matrix for explicit AVX-512 (P3)

7. Open questions

8. Cross-references

9. TL;DR

Dispatch precedence in `simd.rs`

Phase 2 — Unblock the polyfill (P1: `nightly-simd`)