# ndarray Docker CPU Detection & SIMD Dispatch

## Three-Tier Build Strategy

| Target | Dockerfile | RUSTFLAGS | CPU features | Use case |
|---|---|---|---|---|
| **Portable (AVX2)** | `Dockerfile` | `-C target-cpu=x86-64-v3` | SSE4.2, AVX, AVX2, FMA, BMI1/2 | GitHub CI, general servers, cloud VMs |
| **AVX-512 pinned** | `Dockerfile.avx512` | `-C target-cpu=x86-64-v4` | + AVX-512F/BW/CD/DQ/VL | Skylake-X, Ice Lake, Sapphire Rapids, EPYC Genoa |
| **Local dev** | `.cargo/config.toml` | (per-repo) | Whatever the developer's CPU supports | Developer machines |

## How SIMD Dispatch Works

ndarray uses a **two-layer dispatch** model:

### Layer 1: Compile-time (`cfg(target_feature)`)

When built with `target-cpu=x86-64-v4`, AVX-512 is enabled for the whole crate
at compile time. Types in `simd_avx512.rs` use native `__m512` registers, so
every operation can be inlined with no dispatch overhead.

When built with `target-cpu=x86-64-v3`, AVX-512 is NOT enabled crate-wide, so
the native types are not selected. The polyfill in `simd_avx2.rs` provides the
same API (`F32x16`, `U8x64`, etc.) using pairs of `__m256` operations or scalar loops.

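The compile-time selection described above can be sketched as follows. This is a minimal, portable illustration and not the crate's actual code: the `backend` module, its `NAME` constant, `compiled_backend`, and the array-backed `F32x16` are all hypothetical, and plain `[f32; 8]` halves stand in for the two `__m256` registers the polyfill is described as using.

```rust
/// Hypothetical stand-in for the cfg-gated choice between the two source files.
#[cfg(all(target_arch = "x86_64", target_feature = "avx512f"))]
mod backend {
    pub const NAME: &str = "native-avx512";
}
#[cfg(not(all(target_arch = "x86_64", target_feature = "avx512f")))]
mod backend {
    pub const NAME: &str = "polyfill";
}

/// Which backend this build selected; the answer is fixed at compile time.
pub fn compiled_backend() -> &'static str {
    backend::NAME
}

/// Polyfill-style 16-lane vector built from two 8-lane halves
/// (assumption: each half models one 256-bit register).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct F32x16 {
    lo: [f32; 8],
    hi: [f32; 8],
}

impl F32x16 {
    pub fn splat(v: f32) -> Self {
        Self { lo: [v; 8], hi: [v; 8] }
    }

    /// Lane-wise `self * b + c`, done as two independent 8-lane operations.
    pub fn mul_add(self, b: Self, c: Self) -> Self {
        let half = |x: [f32; 8], y: [f32; 8], z: [f32; 8]| -> [f32; 8] {
            std::array::from_fn(|i| x[i].mul_add(y[i], z[i]))
        };
        Self {
            lo: half(self.lo, b.lo, c.lo),
            hi: half(self.hi, b.hi, c.hi),
        }
    }

    pub fn to_array(self) -> [f32; 16] {
        let mut out = [0.0; 16];
        out[..8].copy_from_slice(&self.lo);
        out[8..].copy_from_slice(&self.hi);
        out
    }
}
```

In a build with `-C target-cpu=x86-64-v3`, `compiled_backend()` returns `"polyfill"` because `avx512f` is not part of that level. Callers use the same `F32x16` API in both builds, which is the point of the polyfill.
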
### Layer 2: Runtime detection (`LazyLock<Tier>`)

Regardless of compile target, `src/simd.rs` detects the CPU once, on first use:

```rust
static TIER: LazyLock<Tier> = LazyLock::new(|| {
    // The x86 detection macro only compiles on x86 targets, so gate it.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx512f") { return Tier::Avx512; }
        if is_x86_feature_detected!("avx2") { return Tier::Avx2; }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("dotprod") { return Tier::NeonDotProd; }
    }
    Tier::Scalar
});
```

Functions marked `#[target_feature(enable = "avx512f")]` are compiled into
the binary even at `-C target-cpu=x86-64-v3` and dispatched at runtime via
the tier detection. This means an AVX2-compiled binary **still uses AVX-512
kernels** when running on AVX-512 hardware. The difference is that the
generic `F32x16` / `U8x64` types use the AVX2 fallback (pairs of 256-bit
ops) rather than native 512-bit registers.

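The per-function dispatch described above might look roughly like this. It is a sketch under assumptions, not ndarray's code: `popcount`, `popcount_avx2`, and `popcount_scalar` are hypothetical names, and the kernel enables `avx2` rather than `avx512f` so the example builds on a wide range of stable toolchains. The pattern is the same for a wider feature.

```rust
/// Portable fallback: count set bits one word at a time.
fn popcount_scalar(words: &[u64]) -> u64 {
    words.iter().map(|w| u64::from(w.count_ones())).sum()
}

/// Same loop, but compiled with AVX2 enabled for this function only.
/// The rest of the crate can still target a lower baseline.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn popcount_avx2(words: &[u64]) -> u64 {
    words.iter().map(|w| u64::from(w.count_ones())).sum()
}

/// Public entry point: pick a kernel based on what the running CPU reports.
pub fn popcount(words: &[u64]) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: the runtime check above guarantees AVX2 is available.
            return unsafe { popcount_avx2(words) };
        }
    }
    popcount_scalar(words)
}
```

Both kernels are always present in the binary. Only the branch taken inside `popcount` depends on the machine the binary runs on, so the result is identical on every CPU and only the speed differs.
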
### What this means in practice

```
x86-64-v3 binary on AVX-512 hardware:
  F32x16::mul_add      → AVX2 fallback (2× _mm256_fmadd_ps)
  hamming_distance_raw → AVX-512 VPOPCNTDQ (runtime-dispatched)
  bitwise::popcount    → AVX-512 VPOPCNTDQ (runtime-dispatched)
  ┌───────────────────────────────────┐
  │ Generic SIMD types: AVX2 path     │ ← compile-time
  │ Per-function kernels: AVX-512     │ ← runtime-detected
  └───────────────────────────────────┘

x86-64-v4 binary on AVX-512 hardware:
  F32x16::mul_add      → native __m512 (_mm512_fmadd_ps)
  hamming_distance_raw → same AVX-512 VPOPCNTDQ
  ┌───────────────────────────────────┐
  │ Everything: AVX-512 native        │ ← compile-time + runtime
  └───────────────────────────────────┘
  ~24% faster overall (no 256→512 splitting overhead)
```

## AMX Detection (Intel Advanced Matrix Extensions)

AMX is NOT part of any `x86-64-vN` microarchitecture level. It requires:
1. CPUID check: the AMX-TILE, AMX-INT8, and AMX-BF16 feature bits (leaf 7, subleaf 0, EDX bits 24, 25, and 22)
2. OS support via `_xgetbv(0)` bits 17/18 (XTILECFG + XTILEDATA)
3. Linux: `arch_prctl(ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA)` to request permission to use the tile data registers

Detection lives in `ndarray::hpc::amx_matmul::amx_available()`.
AMX kernels are always compiled in (they use inline assembly) and
gated at runtime. They work with any `-C target-cpu` setting.

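Steps 1 and 2 above can be sketched with the standard library alone. This is an illustration, not the body of `amx_available()`: the function and constant names below are hypothetical, and step 3 (the Linux `arch_prctl` request) is deliberately left out because it needs a raw syscall that `std` does not expose.

```rust
// CPUID leaf 7, subleaf 0, EDX feature bits.
const AMX_BF16: u32 = 1 << 22;
const AMX_TILE: u32 = 1 << 24;
const AMX_INT8: u32 = 1 << 25;
// XCR0 state-component bits the OS must have enabled.
const XCR0_XTILECFG: u64 = 1 << 17;
const XCR0_XTILEDATA: u64 = 1 << 18;

/// Step 1: all three AMX feature bits are set in CPUID.(7,0):EDX.
pub fn cpuid_has_amx(leaf7_edx: u32) -> bool {
    let need = AMX_BF16 | AMX_TILE | AMX_INT8;
    (leaf7_edx & need) == need
}

/// Step 2: the OS has enabled both tile state components in XCR0.
pub fn xcr0_has_tile_state(xcr0: u64) -> bool {
    let need = XCR0_XTILECFG | XCR0_XTILEDATA;
    (xcr0 & need) == need
}

/// Hardware plus OS check only; a true result does not yet mean the
/// process has permission to touch tile registers on Linux.
#[cfg(target_arch = "x86_64")]
#[allow(unused_unsafe)]
pub fn amx_probe_sketch() -> bool {
    use std::arch::x86_64::{__cpuid, __cpuid_count, _xgetbv};
    unsafe {
        // Leaf 7 must exist before it can be queried.
        if __cpuid(0).eax < 7 {
            return false;
        }
        // OSXSAVE (CPUID.1:ECX bit 27) must be set before XGETBV is legal.
        if (__cpuid(1).ecx & (1 << 27)) == 0 {
            return false;
        }
        cpuid_has_amx(__cpuid_count(7, 0).edx) && xcr0_has_tile_state(_xgetbv(0))
    }
}

/// AMX does not exist on other architectures.
#[cfg(not(target_arch = "x86_64"))]
pub fn amx_probe_sketch() -> bool {
    false
}
```

Keeping the bit tests in pure helper functions makes them checkable on machines without AMX, while the probe itself only reads CPU state and is safe to call anywhere.
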
## NEON (ARM / aarch64)

NEON is mandatory on aarch64, so it is always available. The distinction is:
- **NEON baseline** (ARMv8.0): `float32x4_t`, 4-wide f32
- **NEON dotprod** (optional from ARMv8.2, e.g. Pi 5 / A76+): `vdotq_s32`, 4× int8 throughput (four int8 products accumulated per 32-bit lane)

Detection: `is_aarch64_feature_detected!("dotprod")` in `simd.rs`.

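To make the dotprod line concrete, here is a scalar model of what a single `vdotq_s32(acc, a, b)` call computes. The function name `sdot_reference` is hypothetical and the code is a reference for the arithmetic only; it runs on any architecture and uses no NEON intrinsics.

```rust
/// Scalar model of one `vdotq_s32(acc, a, b)` call: each of the four i32
/// lanes accumulates the dot product of its own group of four i8 pairs.
pub fn sdot_reference(acc: [i32; 4], a: [i8; 16], b: [i8; 16]) -> [i32; 4] {
    let mut out = acc;
    for lane in 0..4 {
        for j in 0..4 {
            let idx = lane * 4 + j;
            // Widen to i32 before multiplying so the product cannot overflow.
            let product = i32::from(a[idx]) * i32::from(b[idx]);
            out[lane] = out[lane].wrapping_add(product);
        }
    }
    out
}
```

One instruction therefore performs sixteen int8 multiplies and folds them into four 32-bit accumulators, which is where the throughput gain over the baseline path comes from.
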
## Choosing the Right Dockerfile

```
┌─────────────────────────────────────────────────┐
│ Do you know your deployment hardware?           │
├───────────────┬─────────────────────────────────┤
│ No / Mixed    │ Use Dockerfile (AVX2 default)   │
│ AVX-512 only  │ Use Dockerfile.avx512 (+24%)    │
│ ARM / Pi      │ Use Dockerfile (NEON auto)      │
└───────────────┴─────────────────────────────────┘
```

## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `RUSTFLAGS` | (see Dockerfile) | Compiler flags including `-C target-cpu=...` |
| `CARGO_BUILD_JOBS` | (all cores) | Number of parallel compilation jobs; reduce it if the build runs out of memory |

## Verifying CPU Features at Runtime

```bash
# Inside the container:
grep -oP 'avx512\w+' /proc/cpuinfo | sort -u
# Or via Rust:
cargo run --example simd_caps   # prints detected SIMD tier
```

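For a check that does not depend on `/proc/cpuinfo`, the standard detection macros can be queried directly. The helper below is a hypothetical sketch and is not the contents of the `simd_caps` example; the list of features it probes is an assumption chosen to match the ones this document mentions.

```rust
/// Names of the SIMD features the running CPU reports, in the order checked.
pub fn detected_features() -> Vec<&'static str> {
    let mut found = Vec::new();
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") { found.push("avx2"); }
        if is_x86_feature_detected!("fma") { found.push("fma"); }
        if is_x86_feature_detected!("avx512f") { found.push("avx512f"); }
        if is_x86_feature_detected!("avx512vpopcntdq") { found.push("avx512vpopcntdq"); }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") { found.push("neon"); }
        if std::arch::is_aarch64_feature_detected!("dotprod") { found.push("dotprod"); }
    }
    found
}
```

Probing `avx512vpopcntdq` separately is worthwhile because it is not implied by `avx512f`: some AVX-512 CPUs, Skylake-X among them, report the foundation feature without the popcount extension.
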
## Build Examples

```bash
# Portable (AVX2): safe for GitHub CI, most cloud VMs
docker build -t ndarray-test .

# AVX-512 pinned: Sapphire Rapids, Ice Lake, EPYC Genoa
docker build -f Dockerfile.avx512 -t ndarray-avx512 .

# Override CPU target at build time (e.g., baseline for maximum compat).
# This only takes effect if the Dockerfile declares `ARG RUSTFLAGS`.
docker build --build-arg RUSTFLAGS="-C target-cpu=x86-64" -t ndarray-compat .
```