Skip to content

Commit 2c1942b

Browse files
committed
docs(simd): dispatch architecture + parity matrix + tech debt + integration plan
New design doc at .claude/knowledge/simd-dispatch-architecture.md covering four artifacts in one place: 1. Dispatch architecture — three explicit cargo configs (v3 default, v4 explicit, native explicit) + optional runtime LazyLock path. Each is a conscious cargo invocation; no silent fallback. 2. Parity matrix — typed lane primitives × backend (avx512 / avx2 / neon / nightly / scalar). Surfaces the AVX2 gap: only F32x16, F64x8, U8x32, F16Scaler exist; the other 14 cross-arch lanes are missing → CI SIGILL on the v3 baseline. 3. Technical debt matrix — 10 issues (TD-SIMD-1..10) ranked P0..P3. P0: default config bakes AVX-512 → CI SIGILL on AVX2-only runners. P0: AVX2 missing 10 two-half wrappers (U64x8, I32x16, …). P1: simd.rs has no `feature = "nightly-simd"` arm → simd_nightly/* unreachable despite being the most-complete backend. P1: NEON parity gap symmetric to AVX2. …through P3 CI-matrix entries. 4. Integration plan — six sequenced phases, each a single-PR worker: Phase 1 unblock CI (config flip + AVX2 wrappers + dispatch arms) Phase 2 unblock nightly-simd polyfill Phase 3 NEON parity Phase 4 scalar→file + macro + F16 honesty Phase 5 runtime dispatch (opt-in) Phase 6 AVX-512 explicit CI job Pinned by the discussion thread on PR #170 (CI run 26151746204): the 38 simd_avx2/simd_amx/simd_ops/simd_soa test failures at uniform 19 s timeouts are the v3-runner / v4-baked-binary SIGILL pattern. This doc captures the architecture target, the gaps, and the path.
1 parent 207fc20 commit 2c1942b

1 file changed

Lines changed: 294 additions & 0 deletions

File tree

Lines changed: 294 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,294 @@
1+
# SIMD Dispatch Architecture — design, parity, tech debt, integration plan
2+
3+
> Date: 2026-05-20 · Status: design v1 (post PR #170 PR-X12 A1 discussion).
4+
> Companion to: `vertical-simd-consumer-contract.md` (W1a consumer contract),
5+
> `databend-ndarray-simd-prompt.md`, `ndarray-simd-trojan-horse-prompt.md`.
6+
7+
## 1. Why this exists
8+
9+
`ndarray::simd::*` is the single public surface every cognitive-shader,
10+
splat, codec, BLAS, and FFI consumer reaches for. The current dispatch
11+
in `src/simd.rs` is **compile-time-only** with arms keyed off
12+
`target_feature = "avx512f"` / `target_arch = "aarch64"` / scalar
13+
fallback. `.cargo/config.toml` pins `target-cpu = x86-64-v4`, baking
14+
AVX-512 into every compiled artifact.
15+
16+
The consequence surfaced on PR #170 (`tests/1.95.0` CI run
17+
[26151746204/76920666348](https://github.com/AdaWorldAPI/ndarray/actions/runs/26151746204/job/76920666348)):
18+
**38 tests in `simd_avx2`, `simd_amx`, `simd_ops`, `simd_soa` SIGILL** on
19+
a GitHub runner without AVX-512 silicon, all timing out uniformly
20+
~19 s — the symptom of "binary cannot execute" rather than assertion
21+
failure. The same configuration also leaves `simd_nightly/*` (the
22+
portable-SIMD polyfill backend) unreachable because no dispatch arm in
23+
`simd.rs` re-exports from it.
24+
25+
This document pins the target architecture, captures the parity gaps,
26+
ranks the technical debt, and sequences the integration.
27+
28+
## 2. Dispatch model — three build configs, one runtime mode
29+
30+
Each build mode is a **conscious cargo invocation** via a distinct
31+
`.cargo/config*.toml`. No silent fallbacks, no surprise hardware
32+
mismatch. Whoever builds with `v3` / `v4` / `native` / `nightly-simd`
33+
chose it deliberately.
34+
35+
| Config file | `target-cpu` | Dispatch strategy | Default? | Use case |
36+
|---|---|---|---|---|
37+
| `.cargo/config.toml` | `x86-64-v3` (AVX2) | compile-time → `simd_avx2` | ✅ default, GitHub CI | portable artifact across all x86_64 silicon ≥ 2013 |
38+
| `.cargo/config-avx512.toml` | `x86-64-v4` (AVX-512) | compile-time → `simd_avx512` | explicit | benchmarking, AVX-512 deployment |
39+
| `.cargo/config-native.toml` | `native` | compile-time, build-machine CPUID resolved at rustc invocation → whatever arm matches the build host | explicit | developer machine builds |
40+
| `.cargo/config-nightly.toml` (+ `--features nightly-simd`) | `x86-64-v3` (or any) | compile-time → `simd_nightly` (`std::simd::*` polyfill) | explicit | miri / cargo-careful / portable-SIMD experiments |
41+
42+
The aarch64 path is automatic: any `target_arch = "aarch64"` build
43+
selects `simd_neon` regardless of the config above.
44+
45+
**Runtime LazyLock dispatch** is a separate, fifth opt-in mode used
46+
when shipping a single release binary that must adapt at process
47+
start across heterogeneous deployment silicon (one binary running on
48+
AVX-512 + AVX2-only machines from the same artifact). It compiles all
49+
backends in and uses `LazyLock<CpuCaps>` trampolines. Reserved for the
50+
release-binary distribution path; never the dev / CI default.
51+
52+
### Dispatch precedence in `simd.rs`
53+
54+
Compile-time arms read like a cascade, **not** like priority overrides
55+
— each cargo config sets exactly one `target_feature` / `feature` such
56+
that exactly one arm matches. The order below is the source-of-truth
57+
ranking the compiler walks:
58+
59+
```rust
60+
// 1. Explicit portable-SIMD polyfill (nightly + opt-in feature)
61+
#[cfg(all(feature = "nightly-simd", any(target_arch = "x86_64", target_arch = "aarch64")))]
62+
pub use crate::simd_nightly::{F32x16, F64x8, U8x32, U8x64, U16x32, U32x16, U64x8, I8x32, I8x64, I16x16, I16x32, I32x16, I64x8, F32Mask16, F64Mask8, BF16x16, BF16x8};
63+
64+
// 2. AVX-512 (target_feature = "avx512f"; set by `v4` and `native` configs on AVX-512 hosts)
65+
#[cfg(all(target_arch = "x86_64", target_feature = "avx512f", not(feature = "nightly-simd")))]
66+
pub use crate::simd_avx512::{...};
67+
68+
// 3. AVX2 baseline (the v3 / GitHub-CI default)
69+
#[cfg(all(target_arch = "x86_64", target_feature = "avx2", not(target_feature = "avx512f"), not(feature = "nightly-simd")))]
70+
pub use crate::simd_avx2::{...};
71+
72+
// 4. NEON (aarch64)
73+
#[cfg(all(target_arch = "aarch64", not(feature = "nightly-simd")))]
74+
pub use crate::simd_neon::aarch64_simd::{...};
75+
76+
// 5. Scalar fallback (everything else: wasm32, riscv, x86_64 without AVX2, etc.)
77+
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64", feature = "nightly-simd")))]
78+
pub use scalar::{...};
79+
```
80+
81+
Runtime dispatch via `LazyLock<CpuCaps>` lives in a separate
82+
`simd_runtime` module (TBD per § 7.1) reached by a `--features runtime-dispatch`
83+
flag, mutually exclusive with the compile-time arms above.
84+
85+
## 3. Module roles
86+
87+
```
88+
crate::simd::* ← user-facing key registry (re-exports only)
89+
90+
├── simd.rs = dispatch arms; no implementation, only `pub use`
91+
92+
├── simd_ops.rs = slice-level ops over crate::simd::* primitives
93+
│ (add_f32, scale_f64, array_chunks, …)
94+
95+
├── simd_avx512.rs = __m512* values, native 512-bit registers
96+
├── simd_avx2.rs = __m256* values + (F32x16, F64x8) as two-half
97+
│ wrappers (struct F32x16(pub f32x8, pub f32x8))
98+
├── simd_neon.rs = float32x4_t / uint64x2_t natives + larger shapes
99+
│ composed as [float32x4_t; 4] etc.
100+
├── simd_nightly/ = std::simd::* polyfill — portable, miri-executable
101+
│ ├── f32_types.rs F32x16, F32x8
102+
│ ├── f64_types.rs F64x8, F64x4
103+
│ ├── u8_types.rs U8x64, U8x32
104+
│ ├── u_word_types.rs U16x32, U32x16, U64x8
105+
│ ├── i8_types.rs I8x64, I8x32
106+
│ ├── i_word_types.rs I16x16, I16x32, I32x16, I64x8
107+
│ ├── bf16_types.rs BF16x16, BF16x8
108+
│ ├── f16_types.rs F16x16
109+
│ ├── masks.rs F32Mask16, F32Mask8, F64Mask4, F64Mask8
110+
│ └── ops.rs op impls
111+
└── scalar (inline `mod scalar` in simd.rs)
112+
= pure-Rust fallback for unknown arch
113+
```
114+
115+
Every `simd_<arch>.rs` is just a SOURCE of typed primitives. `simd.rs`
116+
chooses the source; the cargo config chooses how `simd.rs` chooses.
117+
118+
## 4. Parity matrix — typed lane primitives per backend
119+
120+
Legend: ✅ native, 🟡 composed wrapper (two-half / four-quarter), 🔵
121+
scalar polyfill via `core::simd`, ❌ missing, ⛔ N/A for this arch.
122+
123+
| Lane type | `simd_avx512` (v4) | `simd_avx2` (v3) | `simd_neon` (aarch64) | `simd_nightly` | `scalar` |
124+
|---|---|---|---|---|---|
125+
| `F32x16` |`__m512` | 🟡 `(f32x8, f32x8)` | 🟡 `[float32x4_t; 4]` | 🔵 `core::simd::f32x16` |`[f32; 16]` |
126+
| `F32x8` |`__m256` ||| 🔵 ||
127+
| `F64x8` |`__m512d` | 🟡 `(f64x4, f64x4)` | 🟡 `[float64x2_t; 4]` | 🔵 ||
128+
| `F64x4` |`__m256d` ||| 🔵 ||
129+
| `U8x64` |`__m512i` ||| 🔵 ||
130+
| `U8x32` |`__m256i` |`__m256i` || 🔵 ||
131+
| `U16x32` |`__m512i` ||| 🔵 ||
132+
| `U32x16` |`__m512i` ||| 🔵 ||
133+
| `U64x8` |`__m512i` ||| 🔵 ||
134+
| `I8x32` |`__m256i` ||| 🔵 ||
135+
| `I8x64` |`__m512i` ||| 🔵 ||
136+
| `I16x16` |`__m256i` ||| 🔵 ||
137+
| `I16x32` |`__m512i` ||| 🔵 ||
138+
| `I32x16` |`__m512i` ||| 🔵 ||
139+
| `I64x8` |`__m512i` ||| 🔵 ||
140+
| `BF16x8` |`__m128bh` |||||
141+
| `BF16x16` |`__m256bh` ||| 🔵 ||
142+
| `F16x16` || 🟡 `F16Scaler` (scalar) || 🔵 ||
143+
| `F32Mask16` |`__mmask16` |`u16` bitmask |`u16` bitmask | 🔵 ||
144+
| `F64Mask8` |`__mmask8` |`u8` bitmask |`u8` bitmask | 🔵 ||
145+
146+
**Aarch64-native narrower types** (only useful directly when the
147+
consumer wants 128-bit shapes): `I8x16`, `I16x8`, `U8x16`, `U16x8`,
148+
`U32x4`, `U64x2`, `I32x4`, `I64x2`. These are not in the cross-arch
149+
parity surface — consumers requesting 256-bit / 512-bit shapes go
150+
through the composed wrappers.
151+
152+
### Read of the matrix
153+
154+
- **F32x16 + F64x8 are universal** — all four backends ship them. Hot
155+
paths can rely on these without branching.
156+
- **`simd_avx2` is the bottleneck.** It only exposes `F32x16`, `F64x8`,
157+
`F32Mask16`, `F64Mask8`, `U8x32`, and an `F16Scaler`. Every other
158+
cross-arch lane is missing — making the v3 default config crash any
159+
consumer that reaches for `U64x8`, `I32x16`, `U16x32`, etc.
160+
- **NEON is even sparser** at the 256/512-bit level.
161+
- **`simd_nightly` is the most complete** but is unreachable today
162+
because `simd.rs` has no arm wiring `feature = "nightly-simd"` to its
163+
re-exports.
164+
- **`scalar`** has comprehensive cover and is the safest fallback for
165+
any arch the others miss, but lives inline in `simd.rs` rather than
166+
in a dedicated `simd_scalar.rs`. Symmetry would help.
167+
168+
## 5. Technical debt matrix
169+
170+
Ranked by P0 (blocks current CI / consumers) → P3 (nice-to-have).
171+
172+
| ID | Severity | Description | Detection | Fix scope |
173+
|---|---|---|---|---|
174+
| **TD-SIMD-1** | **P0** | `.cargo/config.toml` defaults to `x86-64-v4` → every CI runner without AVX-512 silicon SIGILLs on the first SIMD op. 38 tests fail at 19 s timeout each on `tests/1.95.0`. | PR #170 CI run | Change default to `x86-64-v3`; add `.cargo/config-avx512.toml` for the opt-in AVX-512 path. ~5 LoC. |
175+
| **TD-SIMD-2** | **P0** | `simd_avx2.rs` ships `F32x16`/`F64x8`/`U8x32` only. Consumers requesting `U64x8`, `I32x16`, `U16x32`, `BF16x16`, etc. fail to compile on the v3 path. | grep `pub use crate::simd_avx2::` then cross-ref against the parity matrix | Add the missing types as two-half wrappers (`U64x8(pub u64x4, pub u64x4)` etc.) over native `__m256i` halves. ~500 LoC. |
176+
| **TD-SIMD-3** | **P1** | `simd.rs` has no dispatch arm for `#[cfg(feature = "nightly-simd")]` → the `simd_nightly` polyfill is unreachable. miri / cargo-careful jobs that should exercise the portable path fall through to whatever cfg cascade matches, never to `std::simd::*`. | grep `simd_nightly` in `simd.rs` (returns 0 dispatch arms) | Add the `feature = "nightly-simd"` arm at the top of the cascade per § 2. ~30 LoC. |
177+
| **TD-SIMD-4** | **P1** | `simd_neon.rs` only ships `F32x16` / `F64x8` cross-arch shapes. Consumers reaching for `U8x64`, `U64x8`, `I32x16`, etc. on aarch64 have no path. | grep + parity matrix | Compose larger shapes from native NEON 128-bit lanes (`U8x64([uint8x16_t; 4])`, `U64x8([uint64x2_t; 4])`, etc.). ~400 LoC. |
178+
| **TD-SIMD-5** | **P1** | Scalar fallback inline in `simd.rs` (`pub(crate) mod scalar`) makes symmetry hard — every other backend is its own file. | inspection | Promote to `src/simd_scalar.rs`; `simd.rs` becomes pure dispatch. ~mechanical refactor. |
179+
| **TD-SIMD-6** | **P2** | No `runtime-dispatch` feature / `simd_runtime` module exists yet. Release-binary distribution to heterogeneous silicon requires recompile per target today. | `grep -r "LazyLock<CpuCaps>"` only matches reporting code in `simd.rs:52-55` | New module wiring per-op trampolines from the compiled-in backends. ~300 LoC + one new cargo feature. |
180+
| **TD-SIMD-7** | **P2** | Compile-time arms in `simd.rs:153-194` are duplicated four times (one per type group: F32x16, F64x8, U8x32, BF16x16). Adding a new lane requires copy-pasting four `#[cfg(...)]` arms. | inspection | Single source-of-truth macro emitting the arms. ~one macro_rules!, 50 LoC. |
181+
| **TD-SIMD-8** | **P2** | `F16Scaler` in `simd_avx2.rs:2566` is a scalar implementation masquerading as a SIMD type. Consumers using `F16x16` on v3 get scalar perf without warning. | grep `F16Scaler` | Either gate `F16x16` behind `target_feature = "f16c"` or rename / document the scalar nature. ~20 LoC + docs. |
182+
| **TD-SIMD-9** | **P3** | No CI matrix entry for the `nightly-simd` polyfill path. | `.github/workflows/ci.yaml` | Add a `nightly-simd-polyfill` job that builds with `--features nightly-simd` on nightly rustc. ~20 LoC YAML. |
183+
| **TD-SIMD-10** | **P3** | No CI matrix entry for `.cargo/config-avx512.toml`. AVX-512 deployment path silently bit-rots between PRs. | `.github/workflows/ci.yaml` | Add an `avx-512-explicit` job using a runner with AVX-512 silicon. ~20 LoC YAML; runner availability TBD. |
184+
185+
## 6. Integration plan — sequenced sprints
186+
187+
Each phase is a single-PR worker (sized for one Sonnet impl-sprint per
188+
the `.claude/EN/agents/worker-template.md` shape). Phases sequence so
189+
each lands a green CI; the next phase depends only on shipped state.
190+
191+
### Phase 1 — Unblock CI (P0 fixes)
192+
193+
**Goal:** GitHub `tests/1.95.0` job green. The default `.cargo/config.toml`
194+
build runs end-to-end on AVX2-only silicon.
195+
196+
| # | Worker | Scope | Files | Acceptance |
197+
|---|---|---|---|---|
198+
| 1.1 | flip baseline | Change `target-cpu` from `v4``v3`. Add `.cargo/config-avx512.toml` with the old `v4` value. | `.cargo/config.toml`, `.cargo/config-avx512.toml` | `cargo check` clean on default; `tests/1.95.0` no longer SIGILLs |
199+
| 1.2 | AVX2 two-half wrappers — float | Add `U8x64`, `U64x8`, `U32x16`, `U16x32`, `I8x32`, `I8x64`, `I16x16`, `I16x32`, `I32x16`, `I64x8` as two-half wrappers over native AVX2 `__m256i` halves. | `src/simd_avx2.rs` | per-type parity test vs `simd_avx512` on AVX-512 host; per-type unit test on AVX2-only |
200+
| 1.3 | simd.rs dispatch refresh | Add the AVX2 cfg arm wiring the new wrappers; tighten existing arms with the new precedence (per § 2). | `src/simd.rs` | `cargo check --features approx,serde,rayon` clean on default config; `cargo check` clean on `--config .cargo/config-avx512.toml` |
201+
202+
After Phase 1, PR #170 (PR-X12 A1) and any future consumer PR ships
203+
green CI by default. AVX-512 testing becomes an explicit job.
204+
205+
### Phase 2 — Unblock the polyfill (P1: `nightly-simd`)
206+
207+
**Goal:** `cargo +nightly check --features nightly-simd` reaches
208+
`simd_nightly/*` via `crate::simd::*`. miri can execute the portable
209+
path.
210+
211+
| # | Worker | Scope | Files | Acceptance |
212+
|---|---|---|---|---|
213+
| 2.1 | nightly-simd dispatch arm | Add `#[cfg(feature = "nightly-simd")]` arms in `simd.rs` re-exporting every typed lane from `crate::simd_nightly::*`. | `src/simd.rs` | `crate::simd::F32x16` resolves to `core::simd::f32x16` under the feature |
214+
| 2.2 | nightly-simd parity tests | Run the existing simd_ops / simd_soa test suite against the polyfill backend. | `src/simd_nightly/tests.rs` | all simd_ops + simd_soa tests pass under `--features nightly-simd` |
215+
| 2.3 | CI matrix | Add `nightly-simd-polyfill` job to `.github/workflows/ci.yaml`. | `.github/workflows/ci.yaml` | job green on nightly rustc with the feature |
216+
217+
### Phase 3 — NEON parity (P1)
218+
219+
**Goal:** aarch64 build reaches the same cross-arch lane set as the v3
220+
config.
221+
222+
| # | Worker | Scope | Files | Acceptance |
223+
|---|---|---|---|---|
224+
| 3.1 | NEON quartet wrappers | Compose `U8x64`, `U64x8`, `U32x16`, `U16x32`, `I8x32`, `I8x64`, `I16x16`, `I16x32`, `I32x16`, `I64x8` from native 128-bit NEON lanes. | `src/simd_neon.rs` | parity vs `simd_avx2` two-half wrappers on a 16-pair fixture |
225+
| 3.2 | simd.rs aarch64 arms | Extend `aarch64` arms to re-export the new types. | `src/simd.rs` | `cargo check --target aarch64-unknown-linux-gnu` clean |
226+
227+
### Phase 4 — Symmetry + ergonomics (P1/P2)
228+
229+
| # | Worker | Scope | Files | Acceptance |
230+
|---|---|---|---|---|
231+
| 4.1 | scalar → file | Promote `mod scalar` to `src/simd_scalar.rs`. | `src/simd.rs`, new `src/simd_scalar.rs` | no behaviour change; `cargo check` clean on all configs |
232+
| 4.2 | dispatch macro | Collapse the 4× duplicated `#[cfg(...)]` blocks into one macro. | `src/simd.rs` | adding a new lane type is one macro invocation |
233+
| 4.3 | F16 honesty | Either rename `F16Scaler` or gate `F16x16` behind `f16c`. | `src/simd_avx2.rs` | scalar perf no longer surprises hot-path consumers |
234+
235+
### Phase 5 — Runtime dispatch (P2, opt-in)
236+
237+
**Goal:** ship-once binaries that adapt across heterogeneous deployment
238+
silicon.
239+
240+
| # | Worker | Scope | Files | Acceptance |
241+
|---|---|---|---|---|
242+
| 5.1 | `simd_runtime` module | New module compiling all backends in and selecting per-op trampolines via `LazyLock<CpuCaps>`. | `src/simd_runtime.rs` | one binary runs on AVX-512 + AVX2-only hosts from the same artifact |
243+
| 5.2 | feature flag | New `runtime-dispatch` cargo feature, mutually exclusive with `nightly-simd`. | `Cargo.toml`, `src/simd.rs` | `cargo check --features runtime-dispatch` clean on the v3 baseline |
244+
| 5.3 | CI matrix | Add a `runtime-dispatch-portable` job. | `.github/workflows/ci.yaml` | job green |
245+
246+
### Phase 6 — CI matrix for explicit AVX-512 (P3)
247+
248+
| # | Worker | Scope | Files | Acceptance |
249+
|---|---|---|---|---|
250+
| 6.1 | AVX-512 explicit job | Add `avx-512-explicit` to `.github/workflows/ci.yaml` using `--config .cargo/config-avx512.toml`. Requires AVX-512-capable runner. | `.github/workflows/ci.yaml` | green on the AVX-512 runner |
251+
252+
## 7. Open questions
253+
254+
1. **Runtime trampoline cost class.** Phase 5's per-op indirection
255+
adds one indirect call per `F32x16::add(...)`. Acceptable for the
256+
typical 100+ cycle SIMD-op cost, but consumer benchmarks should
257+
sanity-check before declaring the path production-ready.
258+
2. **`feature = "nightly-simd"` precedence.** § 2 puts it at the top
259+
of the cascade; alternative reading is "polyfill is for miri only,
260+
so put it BELOW the arch-specific arms and only fire on non-x86_64,
261+
non-aarch64 targets." The current proposal matches the user's
262+
"explicit opt-in wins" framing; revisit if there's a use case for
263+
`--features nightly-simd` on an AVX-512 host wanting the AVX-512
264+
path.
265+
3. **AMX status.** `simd_amx.rs` (Sapphire Rapids+ tile ops) is
266+
x86_64-only and orthogonal to the F32x16 / U8x64 cross-arch surface.
267+
Out of scope for this document; tracked under PR-X10 A6
268+
(`linalg::distance`) follow-ups.
269+
270+
## 8. Cross-references
271+
272+
- `.claude/knowledge/vertical-simd-consumer-contract.md` — W1a consumer
273+
contract every new SIMD primitive follows (struct methods on typed
274+
wrappers, three-backend parity test, saturating/overflow semantics
275+
documented).
276+
- `.claude/knowledge/databend-ndarray-simd-prompt.md` — Databend
277+
integration consumer of `crate::simd::*`.
278+
- `.claude/knowledge/ndarray-simd-trojan-horse-prompt.md` — ClickHouse +
279+
Tantivy injection plan; depends on Phase 1 + 2 landing.
280+
- `src/simd.rs` lines 52-55 — existing `is_x86_feature_detected!`
281+
reporting (NOT dispatch) — repurpose for Phase 5 trampoline.
282+
- `src/simd_nightly/mod.rs` lines 37-44 — complete `pub use` set
283+
ready to be wired into `simd.rs` dispatch (Phase 2).
284+
285+
## 9. TL;DR
286+
287+
Default cargo config drops to **`x86-64-v3`** (AVX2) → GitHub CI green by
288+
default. **`.cargo/config-avx512.toml`** is the explicit AVX-512 path.
289+
`simd_avx2.rs` needs ~10 missing two-half wrappers (P0, Phase 1).
290+
`simd.rs` needs a `nightly-simd` dispatch arm so `simd_nightly/*`
291+
becomes reachable (P1, Phase 2). NEON gets quartet wrappers (P1, Phase
292+
3). Scalar / macros / runtime-dispatch / explicit-AVX-512 CI are
293+
P2-P3 follow-ups (Phases 4-6). Each phase is one PR; landing
294+
in order keeps every step green.

0 commit comments

Comments
 (0)