Commit ccd58f9

docs: Dockerfile.md — CPU detection & SIMD dispatch documentation
Comprehensive doc covering the three-tier build strategy (AVX2 default / AVX-512 pinned / local dev), two-layer dispatch model (compile-time cfg(target_feature) + runtime LazyLock<Tier>), AMX detection, NEON/ARM, how an AVX2 binary still uses AVX-512 kernels via runtime detection, and the ~24% performance gap between v3 and v4 builds. Also: Dockerfile + Dockerfile.avx512 headers now reference Dockerfile.md. https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
1 parent e84ce62 commit ccd58f9

3 files changed

Lines changed: 132 additions & 1 deletion

Dockerfile

Lines changed: 4 additions & 1 deletion
@@ -1,7 +1,10 @@
-# ndarray — Railway compile-test image
+# ndarray — Railway compile-test image (AVX2 default)
 # Verifies the HPC module builds cleanly (default + jit-native features)
 # Requires Rust 1.94.0 (LazyLock, simd_caps, modern std APIs)
 #
+# CPU detection & SIMD dispatch documentation: see Dockerfile.md
+# AVX-512 pinned variant: see Dockerfile.avx512
+#
 # Build: docker build -t ndarray-test .
 # Run: docker run --rm ndarray-test

Dockerfile.avx512

Lines changed: 3 additions & 0 deletions
@@ -5,6 +5,9 @@
 # ONLY deploy on AVX-512 hardware (Skylake-X, Ice Lake, Sapphire Rapids, EPYC Genoa).
 # Will SIGILL on older CPUs.
 #
+# CPU detection & SIMD dispatch documentation: see Dockerfile.md
+# Portable (AVX2) variant: see Dockerfile
+#
 # Build: docker build -f Dockerfile.avx512 -t ndarray-avx512 .
 # Run: docker run --rm ndarray-avx512

Dockerfile.md

Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
# ndarray Docker CPU Detection & SIMD Dispatch

## Three-Tier Build Strategy

| Target | Dockerfile | RUSTFLAGS | CPU features | Use case |
|---|---|---|---|---|
| **Portable (AVX2)** | `Dockerfile` | `-C target-cpu=x86-64-v3` | SSE4.2, AVX, AVX2, FMA, BMI1/2 | GitHub CI, general servers, cloud VMs |
| **AVX-512 pinned** | `Dockerfile.avx512` | `-C target-cpu=x86-64-v4` | + AVX-512F/BW/CD/DQ/VL | Skylake-X, Ice Lake, Sapphire Rapids, EPYC Genoa |
| **Local dev** | `.cargo/config.toml` | (per-repo) | Whatever the developer's CPU supports | Developer machines |

## How SIMD Dispatch Works

ndarray uses a **two-layer dispatch** model:

### Layer 1: Compile-time (`cfg(target_feature)`)

When built with `target-cpu=x86-64-v4`, the compiler enables AVX-512
intrinsics at compile time. Types in `simd_avx512.rs` use native `__m512`
values held in 512-bit registers: zero overhead, everything inlined.

When built with `target-cpu=x86-64-v3`, AVX-512 is NOT enabled for the crate
as a whole at compile time. The polyfill in `simd_avx2.rs` provides the same
API (`F32x16`, `U8x64`, etc.) using pairs of `__m256` operations or scalar loops.

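
To make the compile-time layer concrete, here is a minimal sketch of how one public type can hide different backends behind `cfg(target_feature)`. It is not the code in `simd_avx2.rs` or `simd_avx512.rs`: the `F32x16` below is a hypothetical array-backed stand-in, `backend()` is an illustrative helper, and the two-halves loop only imitates the "pair of `__m256`" shape with scalar arithmetic.

```rust
/// Hypothetical stand-in for a 16-lane f32 vector (not the real ndarray type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct F32x16(pub [f32; 16]);

impl F32x16 {
    /// Names the path chosen when this binary was compiled.
    /// `cfg!` is resolved by the compiler, so no CPU probing happens at runtime.
    pub const fn backend() -> &'static str {
        if cfg!(target_feature = "avx512f") {
            "avx512 (native 512-bit)"
        } else if cfg!(target_feature = "avx2") {
            "avx2 (2 x 256-bit)"
        } else {
            "scalar"
        }
    }

    /// Per-lane fused multiply-add: self * b + c.
    /// The two 8-lane halves mirror how a polyfill splits one 512-bit
    /// operation into two 256-bit ones; here each lane is plain scalar math.
    pub fn mul_add(self, b: Self, c: Self) -> Self {
        let mut out = [0.0f32; 16];
        for half in 0..2 {
            for lane in 0..8 {
                let i = half * 8 + lane;
                out[i] = self.0[i].mul_add(b.0[i], c.0[i]);
            }
        }
        F32x16(out)
    }
}

fn main() {
    let r = F32x16([2.0; 16]).mul_add(F32x16([3.0; 16]), F32x16([1.0; 16]));
    assert_eq!(r, F32x16([7.0; 16]));
    println!("compiled backend: {}", F32x16::backend());
}
```

Building this with `RUSTFLAGS="-C target-cpu=x86-64-v3"` versus `x86-64-v4` should change only the string that `backend()` reports; callers see the same API either way.
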
### Layer 2: Runtime detection (`LazyLock<Tier>`)

Regardless of compile target, `src/simd.rs` detects the CPU once, on first access:

```rust
static TIER: LazyLock<Tier> = LazyLock::new(|| {
    // Each probe is cfg-gated: the detection macros exist only on their own architecture.
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx512f") { return Tier::Avx512; }
        if is_x86_feature_detected!("avx2") { return Tier::Avx2; }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("dotprod") { return Tier::NeonDotProd; }
    }
    Tier::Scalar
});
```

Functions marked `#[target_feature(enable = "avx512f")]` are compiled into
the binary even at `-C target-cpu=x86-64-v3` and dispatched at runtime via
the tier detection. This means an AVX2-compiled binary **still uses AVX-512
kernels** when running on AVX-512 hardware. The difference is that the
generic `F32x16` / `U8x64` types use the AVX2 fallback (pairs of 256-bit
ops) rather than native 512-bit registers.

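
The runtime layer can be sketched end to end as follows. Everything here is illustrative rather than ndarray's actual code: the `Tier` enum mirrors the names used above, and `hamming`, `hamming_scalar`, and `hamming_avx2` are hypothetical functions. The sketch uses `avx2` instead of `avx512f` in `#[target_feature]` so that it builds on older stable toolchains; the dispatch pattern is the same.

```rust
use std::sync::LazyLock;

/// Hypothetical mirror of the `Tier` enum named in the text.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Tier { Avx512, Avx2, NeonDotProd, Scalar }

/// Probed once, on first access.
static TIER: LazyLock<Tier> = LazyLock::new(|| {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx512f") { return Tier::Avx512; }
        if is_x86_feature_detected!("avx2") { return Tier::Avx2; }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("dotprod") { return Tier::NeonDotProd; }
    }
    Tier::Scalar
});

/// Portable fallback: XOR each byte pair and count the differing bits.
fn hamming_scalar(a: &[u8], b: &[u8]) -> u64 {
    a.iter().zip(b).map(|(x, y)| u64::from((x ^ y).count_ones())).sum()
}

/// Same loop, but this one function is compiled with AVX2 enabled,
/// regardless of the crate-wide `-C target-cpu` setting.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn hamming_avx2(a: &[u8], b: &[u8]) -> u64 {
    a.iter().zip(b).map(|(x, y)| u64::from((x ^ y).count_ones())).sum()
}

/// Public entry point: picks a kernel based on the detected tier.
fn hamming(a: &[u8], b: &[u8]) -> u64 {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    match *TIER {
        // SAFETY: both tiers are only selected after AVX2-capable hardware was detected.
        #[cfg(target_arch = "x86_64")]
        Tier::Avx512 | Tier::Avx2 => unsafe { hamming_avx2(a, b) },
        _ => hamming_scalar(a, b),
    }
}

fn main() {
    let a = [0xFFu8, 0x0F, 0x00];
    let b = [0x00u8, 0x0F, 0x01];
    println!("tier = {:?}, distance = {}", *TIER, hamming(&a, &b));
}
```

The `unsafe` call is what makes the design work: the compiler cannot prove the CPU supports the enabled feature, so the caller vouches for it through the earlier detection.
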
### What this means in practice

```
x86-64-v3 binary on AVX-512 hardware:
  F32x16::mul_add      → AVX2 fallback (2× _mm256_fmadd_ps)
  hamming_distance_raw → AVX-512 VPOPCNTDQ (runtime-dispatched)
  bitwise::popcount    → AVX-512 VPOPCNTDQ (runtime-dispatched)
  ┌───────────────────────────────────┐
  │ Generic SIMD types: AVX2 path     │ ← compile-time
  │ Per-function kernels: AVX-512     │ ← runtime-detected
  └───────────────────────────────────┘

x86-64-v4 binary on AVX-512 hardware:
  F32x16::mul_add      → native __m512 (_mm512_fmadd_ps)
  hamming_distance_raw → same AVX-512 VPOPCNTDQ
  ┌───────────────────────────────────┐
  │ Everything: AVX-512 native        │ ← compile-time + runtime
  └───────────────────────────────────┘
  ~24% faster overall (no 256→512 splitting overhead)
```

## AMX Detection (Intel Advanced Matrix Extensions)

AMX is NOT part of any `x86-64-vN` `target-cpu` level. It requires:
1. CPUID check: leaf 7, subleaf 0, EDX bits AMX-BF16 (22), AMX-TILE (24), and AMX-INT8 (25)
2. OS support via `_xgetbv(0)` bits 17/18 (XTILECFG + XTILEDATA)
3. Linux: `arch_prctl(ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA)` to request permission to use the tile data registers

Detection lives in `ndarray::hpc::amx_matmul::amx_available()`.
AMX kernels are always compiled in (they use inline assembly) and
gated at runtime. They work with any `-C target-cpu` setting.

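
Steps 1 and 2 of the list above can be sketched with the standard library alone. This is not the body of `amx_available()`: the function and constant names below are hypothetical. Step 3 (the Linux `arch_prctl` permission request) is left out because it needs a raw syscall or the `libc` crate.

```rust
/// CPUID.(EAX=7,ECX=0):EDX feature bits.
const AMX_BF16: u32 = 1 << 22;
const AMX_TILE: u32 = 1 << 24;
const AMX_INT8: u32 = 1 << 25;
/// XCR0 state-component bits for the tile registers.
const XCR0_XTILECFG: u64 = 1 << 17;
const XCR0_XTILEDATA: u64 = 1 << 18;

/// Step 1: does the CPU advertise all three AMX features?
fn cpuid_reports_amx(leaf7_edx: u32) -> bool {
    let need = AMX_BF16 | AMX_TILE | AMX_INT8;
    (leaf7_edx & need) == need
}

/// Step 2: has the OS enabled both tile state components in XCR0?
fn os_enabled_tiles(xcr0: u64) -> bool {
    let need = XCR0_XTILECFG | XCR0_XTILEDATA;
    (xcr0 & need) == need
}

/// Hypothetical helper combining steps 1 and 2 on the running machine.
#[cfg(target_arch = "x86_64")]
#[allow(unused_unsafe)] // CPUID intrinsics are safe on newer toolchains
fn amx_hw_and_os_ready() -> bool {
    use std::arch::x86_64::{__cpuid, __cpuid_count, _xgetbv};
    // SAFETY: CPUID is available on every x86_64 CPU.
    let (max_leaf, leaf1_ecx) = unsafe { (__cpuid(0).eax, __cpuid(1).ecx) };
    // OSXSAVE (leaf 1, ECX bit 27) must be set before XGETBV may be executed.
    let osxsave = (leaf1_ecx & (1 << 27)) != 0;
    if max_leaf < 7 || !osxsave {
        return false;
    }
    let leaf7_edx = unsafe { __cpuid_count(7, 0) }.edx;
    // SAFETY: OSXSAVE was checked above, so XGETBV is usable from user mode.
    let xcr0 = unsafe { _xgetbv(0) };
    cpuid_reports_amx(leaf7_edx) && os_enabled_tiles(xcr0)
}

/// AMX does not exist on other architectures.
#[cfg(not(target_arch = "x86_64"))]
fn amx_hw_and_os_ready() -> bool {
    false
}

fn main() {
    println!("AMX reported by CPU and OS: {}", amx_hw_and_os_ready());
}
```

A `true` result here still does not mean tile instructions may be executed on Linux; the per-process permission request from step 3 has to succeed first.
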
## NEON (ARM / aarch64)

NEON is mandatory on aarch64, so it is always available. The distinction is:
- **NEON baseline** (ARMv8.0): `float32x4_t`, 4-wide f32
- **NEON dotprod** (ARMv8.2+, Pi 5 / A76+): `vdotq_s32`, 4× int8 throughput

Detection: `is_aarch64_feature_detected!("dotprod")` in `simd.rs`.

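
To show where the int8 gain comes from, the following sketch models in scalar Rust what a single `vdotq_s32` computes: each of the four `i32` lanes accumulates four `i8 × i8` products, so one instruction covers 16 multiplies. The function names are hypothetical, and the scalar loop is a reference model, not a NEON kernel.

```rust
/// Scalar model of `vdotq_s32(acc, a, b)`:
/// out[lane] = acc[lane] + sum of a[4*lane + k] * b[4*lane + k] for k in 0..4.
fn dot_accumulate(acc: [i32; 4], a: [i8; 16], b: [i8; 16]) -> [i32; 4] {
    let mut out = acc;
    for lane in 0..4 {
        for k in 0..4 {
            let i = lane * 4 + k;
            out[lane] += i32::from(a[i]) * i32::from(b[i]);
        }
    }
    out
}

/// Runtime probe for the dotprod extension on aarch64.
#[cfg(target_arch = "aarch64")]
fn has_dotprod() -> bool {
    std::arch::is_aarch64_feature_detected!("dotprod")
}

/// Other architectures never report the aarch64 extension.
#[cfg(not(target_arch = "aarch64"))]
fn has_dotprod() -> bool {
    false
}

fn main() {
    let out = dot_accumulate([0; 4], [1; 16], [2; 16]);
    println!("dotprod detected: {}, reference result: {:?}", has_dotprod(), out);
}
```

A real kernel would call the intrinsic only after `has_dotprod()` returns `true` and could use a model like this as its fallback and as a test oracle.
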
## Choosing the Right Dockerfile

```
┌─────────────────────────────────────────────────┐
│ Do you know your deployment hardware?           │
├───────────────┬─────────────────────────────────┤
│ No / Mixed    │ Use Dockerfile (AVX2 default)   │
│ AVX-512 only  │ Use Dockerfile.avx512 (+24%)    │
│ ARM / Pi      │ Use Dockerfile (NEON auto)      │
└───────────────┴─────────────────────────────────┘
```

## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `RUSTFLAGS` | (see Dockerfile) | Compiler flags including `-C target-cpu=...` |
| `CARGO_BUILD_JOBS` | (all cores) | Number of parallel compilation jobs; reduce it if the build runs out of memory |

## Verifying CPU Features at Runtime

```bash
# Inside the container:
grep -oP 'avx512\w+' /proc/cpuinfo | sort -u
# Or via Rust:
cargo run --example simd_caps # prints detected SIMD tier
```

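
For environments where neither `/proc/cpuinfo` nor the repository's example is at hand, a small standalone probe can serve the same purpose. The sketch below is not the `simd_caps` example; `feature_report` is a hypothetical function that simply lists a few of the features discussed in this document.

```rust
/// Reports whether each listed feature is present on the running x86_64 CPU.
#[cfg(target_arch = "x86_64")]
fn feature_report() -> Vec<(&'static str, bool)> {
    vec![
        ("avx2", is_x86_feature_detected!("avx2")),
        ("fma", is_x86_feature_detected!("fma")),
        ("avx512f", is_x86_feature_detected!("avx512f")),
        ("avx512bw", is_x86_feature_detected!("avx512bw")),
        ("avx512vl", is_x86_feature_detected!("avx512vl")),
        ("avx512vpopcntdq", is_x86_feature_detected!("avx512vpopcntdq")),
    ]
}

/// Same idea for aarch64: NEON baseline plus the dotprod extension.
#[cfg(target_arch = "aarch64")]
fn feature_report() -> Vec<(&'static str, bool)> {
    vec![
        ("neon", std::arch::is_aarch64_feature_detected!("neon")),
        ("dotprod", std::arch::is_aarch64_feature_detected!("dotprod")),
    ]
}

/// Other architectures: nothing to report in this sketch.
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
fn feature_report() -> Vec<(&'static str, bool)> {
    Vec::new()
}

fn main() {
    for (name, present) in feature_report() {
        println!("{name:<16} {}", if present { "yes" } else { "no" });
    }
}
```

Because the probes run on the host CPU, the output reflects the machine the container is scheduled on, not the `-C target-cpu` level the image was built with.
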
## Build Examples

```bash
# Portable (AVX2): safe for GitHub CI, most cloud VMs
docker build -t ndarray-test .

# AVX-512 pinned: Sapphire Rapids, Ice Lake, EPYC Genoa
docker build -f Dockerfile.avx512 -t ndarray-avx512 .

# Override CPU target at build time (e.g., baseline for maximum compat).
# This only takes effect if the Dockerfile declares `ARG RUSTFLAGS`.
docker build --build-arg RUSTFLAGS="-C target-cpu=x86-64" -t ndarray-compat .
```
