Skip to content

Latest commit

 

History

History
294 lines (239 loc) · 18.7 KB

File metadata and controls

294 lines (239 loc) · 18.7 KB

SIMD Dispatch Architecture — design, parity, tech debt, integration plan

Date: 2026-05-20 · Status: design v1 (post PR #170 PR-X12 A1 discussion). Companion to: vertical-simd-consumer-contract.md (W1a consumer contract), databend-ndarray-simd-prompt.md, ndarray-simd-trojan-horse-prompt.md.

1. Why this exists

ndarray::simd::* is the single public surface every cognitive-shader, splat, codec, BLAS, and FFI consumer reaches for. The current dispatch in src/simd.rs is compile-time-only with arms keyed off target_feature = "avx512f" / target_arch = "aarch64" / scalar fallback. .cargo/config.toml pins target-cpu = x86-64-v4, baking AVX-512 into every compiled artifact.

The consequence surfaced on PR #170 (tests/1.95.0 CI run 26151746204/76920666348): 38 tests in simd_avx2, simd_amx, simd_ops, simd_soa SIGILL on a GitHub runner without AVX-512 silicon, all timing out uniformly ~19 s — the symptom of "binary cannot execute" rather than assertion failure. The same configuration also leaves simd_nightly/* (the portable-SIMD polyfill backend) unreachable because no dispatch arm in simd.rs re-exports from it.

This document pins the target architecture, captures the parity gaps, ranks the technical debt, and sequences the integration.

2. Dispatch model — three build configs, one runtime mode

Each build mode is a conscious cargo invocation via a distinct .cargo/config*.toml. No silent fallbacks, no surprise hardware mismatch. Whoever builds with v3 / v4 / native / nightly-simd chose it deliberately.

Config file target-cpu Dispatch strategy Default? Use case
.cargo/config.toml x86-64-v3 (AVX2) compile-time → simd_avx2 ✅ default, GitHub CI portable artifact across all x86_64 silicon ≥ 2013
.cargo/config-avx512.toml x86-64-v4 (AVX-512) compile-time → simd_avx512 explicit benchmarking, AVX-512 deployment
.cargo/config-native.toml native compile-time, build-machine CPUID resolved at rustc invocation → whatever arm matches the build host explicit developer machine builds
.cargo/config-nightly.toml (+ --features nightly-simd) x86-64-v3 (or any) compile-time → simd_nightly (std::simd::* polyfill) explicit miri / cargo-careful / portable-SIMD experiments

The aarch64 path is automatic: any target_arch = "aarch64" build selects simd_neon regardless of the config above.

Runtime LazyLock dispatch is a separate, fifth opt-in mode used when shipping a single release binary that must adapt at process start across heterogeneous deployment silicon (one binary running on AVX-512 + AVX2-only machines from the same artifact). It compiles all backends in and uses LazyLock<CpuCaps> trampolines. Reserved for the release-binary distribution path; never the dev / CI default.

Dispatch precedence in simd.rs

Compile-time arms read like a cascade, not like priority overrides — each cargo config sets exactly one target_feature / feature such that exactly one arm matches. The order below is the source-of-truth ranking the compiler walks:

// 1. Explicit portable-SIMD polyfill (nightly + opt-in feature)
#[cfg(all(feature = "nightly-simd", any(target_arch = "x86_64", target_arch = "aarch64")))]
pub use crate::simd_nightly::{F32x16, F64x8, U8x32, U8x64, U16x32, U32x16, U64x8, I8x32, I8x64, I16x16, I16x32, I32x16, I64x8, F32Mask16, F64Mask8, BF16x16, BF16x8};

// 2. AVX-512 (target_feature = "avx512f"; set by `v4` and `native` configs on AVX-512 hosts)
#[cfg(all(target_arch = "x86_64", target_feature = "avx512f", not(feature = "nightly-simd")))]
pub use crate::simd_avx512::{...};

// 3. AVX2 baseline (the v3 / GitHub-CI default)
#[cfg(all(target_arch = "x86_64", target_feature = "avx2", not(target_feature = "avx512f"), not(feature = "nightly-simd")))]
pub use crate::simd_avx2::{...};

// 4. NEON (aarch64)
#[cfg(all(target_arch = "aarch64", not(feature = "nightly-simd")))]
pub use crate::simd_neon::aarch64_simd::{...};

// 5. Scalar fallback (everything else: wasm32, riscv, x86_64 without AVX2, etc.)
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64", feature = "nightly-simd")))]
pub use scalar::{...};

Runtime dispatch via LazyLock<CpuCaps> lives in a separate simd_runtime module (TBD per § 7.1) reached by a --features runtime-dispatch flag, mutually exclusive with the compile-time arms above.

3. Module roles

crate::simd::*                ← user-facing key registry (re-exports only)
        │
        ├── simd.rs           = dispatch arms; no implementation, only `pub use`
        │
        ├── simd_ops.rs       = slice-level ops over crate::simd::* primitives
        │                       (add_f32, scale_f64, array_chunks, …)
        │
        ├── simd_avx512.rs    = __m512* values, native 512-bit registers
        ├── simd_avx2.rs      = __m256* values + (F32x16, F64x8) as two-half
        │                       wrappers (struct F32x16(pub f32x8, pub f32x8))
        ├── simd_neon.rs      = float32x4_t / uint64x2_t natives + larger shapes
        │                       composed as [float32x4_t; 4] etc.
        ├── simd_nightly/     = std::simd::* polyfill — portable, miri-executable
        │   ├── f32_types.rs    F32x16, F32x8
        │   ├── f64_types.rs    F64x8, F64x4
        │   ├── u8_types.rs     U8x64, U8x32
        │   ├── u_word_types.rs U16x32, U32x16, U64x8
        │   ├── i8_types.rs     I8x64, I8x32
        │   ├── i_word_types.rs I16x16, I16x32, I32x16, I64x8
        │   ├── bf16_types.rs   BF16x16, BF16x8
        │   ├── f16_types.rs    F16x16
        │   ├── masks.rs        F32Mask16, F32Mask8, F64Mask4, F64Mask8
        │   └── ops.rs          op impls
        └── scalar (inline `mod scalar` in simd.rs)
                              = pure-Rust fallback for unknown arch

Every simd_<arch>.rs is just a SOURCE of typed primitives. simd.rs chooses the source; the cargo config chooses how simd.rs chooses.

4. Parity matrix — typed lane primitives per backend

Legend: ✅ native, 🟡 composed wrapper (two-half / four-quarter), 🔵 scalar polyfill via core::simd, ❌ missing, ⛔ N/A for this arch.

Lane type simd_avx512 (v4) simd_avx2 (v3) simd_neon (aarch64) simd_nightly scalar
F32x16 __m512 🟡 (f32x8, f32x8) 🟡 [float32x4_t; 4] 🔵 core::simd::f32x16 [f32; 16]
F32x8 __m256 🔵
F64x8 __m512d 🟡 (f64x4, f64x4) 🟡 [float64x2_t; 4] 🔵
F64x4 __m256d 🔵
U8x64 __m512i 🔵
U8x32 __m256i __m256i 🔵
U16x32 __m512i 🔵
U32x16 __m512i 🔵
U64x8 __m512i 🔵
I8x32 __m256i 🔵
I8x64 __m512i 🔵
I16x16 __m256i 🔵
I16x32 __m512i 🔵
I32x16 __m512i 🔵
I64x8 __m512i 🔵
BF16x8 __m128bh
BF16x16 __m256bh 🔵
F16x16 🟡 F16Scaler (scalar) 🔵
F32Mask16 __mmask16 u16 bitmask u16 bitmask 🔵
F64Mask8 __mmask8 u8 bitmask u8 bitmask 🔵

Aarch64-native narrower types (only useful directly when the consumer wants 128-bit shapes): I8x16, I16x8, U8x16, U16x8, U32x4, U64x2, I32x4, I64x2. These are not in the cross-arch parity surface — consumers requesting 256-bit / 512-bit shapes go through the composed wrappers.

Read of the matrix

  • F32x16 + F64x8 are universal — all four backends ship them. Hot paths can rely on these without branching.
  • simd_avx2 is the bottleneck. It only exposes F32x16, F64x8, F32Mask16, F64Mask8, U8x32, and an F16Scaler. Every other cross-arch lane is missing — making the v3 default config crash any consumer that reaches for U64x8, I32x16, U16x32, etc.
  • NEON is even sparser at the 256/512-bit level.
  • simd_nightly is the most complete but is unreachable today because simd.rs has no arm wiring feature = "nightly-simd" to its re-exports.
  • scalar has comprehensive cover and is the safest fallback for any arch the others miss, but lives inline in simd.rs rather than in a dedicated simd_scalar.rs. Symmetry would help.

5. Technical debt matrix

Ranked by P0 (blocks current CI / consumers) → P3 (nice-to-have).

ID Severity Description Detection Fix scope
TD-SIMD-1 P0 .cargo/config.toml defaults to x86-64-v4 → every CI runner without AVX-512 silicon SIGILLs on the first SIMD op. 38 tests fail at 19 s timeout each on tests/1.95.0. PR #170 CI run Change default to x86-64-v3; add .cargo/config-avx512.toml for the opt-in AVX-512 path. ~5 LoC.
TD-SIMD-2 P0 simd_avx2.rs ships F32x16/F64x8/U8x32 only. Consumers requesting U64x8, I32x16, U16x32, BF16x16, etc. fail to compile on the v3 path. grep pub use crate::simd_avx2:: then cross-ref against the parity matrix Add the missing types as two-half wrappers (U64x8(pub u64x4, pub u64x4) etc.) over native __m256i halves. ~500 LoC.
TD-SIMD-3 P1 simd.rs has no dispatch arm for #[cfg(feature = "nightly-simd")] → the simd_nightly polyfill is unreachable. miri / cargo-careful jobs that should exercise the portable path fall through to whatever cfg cascade matches, never to std::simd::*. grep simd_nightly in simd.rs (returns 0 dispatch arms) Add the feature = "nightly-simd" arm at the top of the cascade per § 2. ~30 LoC.
TD-SIMD-4 P1 simd_neon.rs only ships F32x16 / F64x8 cross-arch shapes. Consumers reaching for U8x64, U64x8, I32x16, etc. on aarch64 have no path. grep + parity matrix Compose larger shapes from native NEON 128-bit lanes (U8x64([uint8x16_t; 4]), U64x8([uint64x2_t; 4]), etc.). ~400 LoC.
TD-SIMD-5 P1 Scalar fallback inline in simd.rs (pub(crate) mod scalar) makes symmetry hard — every other backend is its own file. inspection Promote to src/simd_scalar.rs; simd.rs becomes pure dispatch. ~mechanical refactor.
TD-SIMD-6 P2 No runtime-dispatch feature / simd_runtime module exists yet. Release-binary distribution to heterogeneous silicon requires recompile per target today. grep -r "LazyLock<CpuCaps>" only matches reporting code in simd.rs:52-55 New module wiring per-op trampolines from the compiled-in backends. ~300 LoC + one new cargo feature.
TD-SIMD-7 P2 Compile-time arms in simd.rs:153-194 are duplicated four times (one per type group: F32x16, F64x8, U8x32, BF16x16). Adding a new lane requires copy-pasting four #[cfg(...)] arms. inspection Single source-of-truth macro emitting the arms. ~one macro_rules!, 50 LoC.
TD-SIMD-8 P2 F16Scaler in simd_avx2.rs:2566 is a scalar implementation masquerading as a SIMD type. Consumers using F16x16 on v3 get scalar perf without warning. grep F16Scaler Either gate F16x16 behind target_feature = "f16c" or rename / document the scalar nature. ~20 LoC + docs.
TD-SIMD-9 P3 No CI matrix entry for the nightly-simd polyfill path. .github/workflows/ci.yaml Add a nightly-simd-polyfill job that builds with --features nightly-simd on nightly rustc. ~20 LoC YAML.
TD-SIMD-10 P3 No CI matrix entry for .cargo/config-avx512.toml. AVX-512 deployment path silently bit-rots between PRs. .github/workflows/ci.yaml Add an avx-512-explicit job using a runner with AVX-512 silicon. ~20 LoC YAML; runner availability TBD.

6. Integration plan — sequenced sprints

Each phase is a single-PR worker (sized for one Sonnet impl-sprint per the .claude/EN/agents/worker-template.md shape). Phases sequence so each lands a green CI; the next phase depends only on shipped state.

Phase 1 — Unblock CI (P0 fixes)

Goal: GitHub tests/1.95.0 job green. The default .cargo/config.toml build runs end-to-end on AVX2-only silicon.

# Worker Scope Files Acceptance
1.1 flip baseline Change target-cpu from v4v3. Add .cargo/config-avx512.toml with the old v4 value. .cargo/config.toml, .cargo/config-avx512.toml cargo check clean on default; tests/1.95.0 no longer SIGILLs
1.2 AVX2 two-half wrappers — float Add U8x64, U64x8, U32x16, U16x32, I8x32, I8x64, I16x16, I16x32, I32x16, I64x8 as two-half wrappers over native AVX2 __m256i halves. src/simd_avx2.rs per-type parity test vs simd_avx512 on AVX-512 host; per-type unit test on AVX2-only
1.3 simd.rs dispatch refresh Add the AVX2 cfg arm wiring the new wrappers; tighten existing arms with the new precedence (per § 2). src/simd.rs cargo check --features approx,serde,rayon clean on default config; cargo check clean on --config .cargo/config-avx512.toml

After Phase 1, PR #170 (PR-X12 A1) and any future consumer PR ships green CI by default. AVX-512 testing becomes an explicit job.

Phase 2 — Unblock the polyfill (P1: nightly-simd)

Goal: cargo +nightly check --features nightly-simd reaches simd_nightly/* via crate::simd::*. miri can execute the portable path.

# Worker Scope Files Acceptance
2.1 nightly-simd dispatch arm Add #[cfg(feature = "nightly-simd")] arms in simd.rs re-exporting every typed lane from crate::simd_nightly::*. src/simd.rs crate::simd::F32x16 resolves to core::simd::f32x16 under the feature
2.2 nightly-simd parity tests Run the existing simd_ops / simd_soa test suite against the polyfill backend. src/simd_nightly/tests.rs all simd_ops + simd_soa tests pass under --features nightly-simd
2.3 CI matrix Add nightly-simd-polyfill job to .github/workflows/ci.yaml. .github/workflows/ci.yaml job green on nightly rustc with the feature

Phase 3 — NEON parity (P1)

Goal: aarch64 build reaches the same cross-arch lane set as the v3 config.

# Worker Scope Files Acceptance
3.1 NEON quartet wrappers Compose U8x64, U64x8, U32x16, U16x32, I8x32, I8x64, I16x16, I16x32, I32x16, I64x8 from native 128-bit NEON lanes. src/simd_neon.rs parity vs simd_avx2 two-half wrappers on a 16-pair fixture
3.2 simd.rs aarch64 arms Extend aarch64 arms to re-export the new types. src/simd.rs cargo check --target aarch64-unknown-linux-gnu clean

Phase 4 — Symmetry + ergonomics (P1/P2)

# Worker Scope Files Acceptance
4.1 scalar → file Promote mod scalar to src/simd_scalar.rs. src/simd.rs, new src/simd_scalar.rs no behaviour change; cargo check clean on all configs
4.2 dispatch macro Collapse the 4× duplicated #[cfg(...)] blocks into one macro. src/simd.rs adding a new lane type is one macro invocation
4.3 F16 honesty Either rename F16Scaler or gate F16x16 behind f16c. src/simd_avx2.rs scalar perf no longer surprises hot-path consumers

Phase 5 — Runtime dispatch (P2, opt-in)

Goal: ship-once binaries that adapt across heterogeneous deployment silicon.

# Worker Scope Files Acceptance
5.1 simd_runtime module New module compiling all backends in and selecting per-op trampolines via LazyLock<CpuCaps>. src/simd_runtime.rs one binary runs on AVX-512 + AVX2-only hosts from the same artifact
5.2 feature flag New runtime-dispatch cargo feature, mutually exclusive with nightly-simd. Cargo.toml, src/simd.rs cargo check --features runtime-dispatch clean on the v3 baseline
5.3 CI matrix Add a runtime-dispatch-portable job. .github/workflows/ci.yaml job green

Phase 6 — CI matrix for explicit AVX-512 (P3)

# Worker Scope Files Acceptance
6.1 AVX-512 explicit job Add avx-512-explicit to .github/workflows/ci.yaml using --config .cargo/config-avx512.toml. Requires AVX-512-capable runner. .github/workflows/ci.yaml green on the AVX-512 runner

7. Open questions

  1. Runtime trampoline cost class. Phase 5's per-op indirection adds one indirect call per F32x16::add(...). Acceptable for the typical 100+ cycle SIMD-op cost, but consumer benchmarks should sanity-check before declaring the path production-ready.
  2. feature = "nightly-simd" precedence. § 2 puts it at the top of the cascade; alternative reading is "polyfill is for miri only, so put it BELOW the arch-specific arms and only fire on non-x86_64, non-aarch64 targets." The current proposal matches the user's "explicit opt-in wins" framing; revisit if there's a use case for --features nightly-simd on an AVX-512 host wanting the AVX-512 path.
  3. AMX status. simd_amx.rs (Sapphire Rapids+ tile ops) is x86_64-only and orthogonal to the F32x16 / U8x64 cross-arch surface. Out of scope for this document; tracked under PR-X10 A6 (linalg::distance) follow-ups.

8. Cross-references

  • .claude/knowledge/vertical-simd-consumer-contract.md — W1a consumer contract every new SIMD primitive follows (struct methods on typed wrappers, three-backend parity test, saturating/overflow semantics documented).
  • .claude/knowledge/databend-ndarray-simd-prompt.md — Databend integration consumer of crate::simd::*.
  • .claude/knowledge/ndarray-simd-trojan-horse-prompt.md — ClickHouse + Tantivy injection plan; depends on Phase 1 + 2 landing.
  • src/simd.rs lines 52-55 — existing is_x86_feature_detected! reporting (NOT dispatch) — repurpose for Phase 5 trampoline.
  • src/simd_nightly/mod.rs lines 37-44 — complete pub use set ready to be wired into simd.rs dispatch (Phase 2).

9. TL;DR

Default cargo config drops to x86-64-v3 (AVX2) → GitHub CI green by default. .cargo/config-avx512.toml is the explicit AVX-512 path. simd_avx2.rs needs ~10 missing two-half wrappers (P0, Phase 1). simd.rs needs a nightly-simd dispatch arm so simd_nightly/* becomes reachable (P1, Phase 2). NEON gets quartet wrappers (P1, Phase 3). Scalar / macros / runtime-dispatch / explicit-AVX-512 CI are P2-P3 follow-ups (Phases 4-6). Each phase is one PR; landing in order keeps every step green.