A complete high-performance numerical computing stack built on top of rust-ndarray/ndarray. 55 HPC modules, 880 tests, BLAS L1-L3, LAPACK, FFT, quantized inference, SIMD kernels from Intel AMX to Raspberry Pi NEON — stable Rust 1.94, zero nightly.
Deutsche Version | Full Feature Comparison (146 modules)
| System | Method | Throughput | Latency | Hardware | Watt |
|---|---|---|---|---|---|
| This fork — Sapphire Rapids | Palette u8 + AMX prefetch | ~3,200M/s | ~0.3 ns | Xeon w9-3595X | 350W |
| This fork — i7/i5 11th gen | Palette u8 (AVX-512) | 2,400M/s | 0.4 ns | i7-11700K | 65W |
| This fork — Raspberry Pi 4 | Palette u8 (NEON) | ~400M/s | ~2.5 ns | Cortex-A72 | 5W |
| This fork — Pi Zero 2W | Palette u8 (NEON) | ~80M/s | ~12 ns | Cortex-A53 | 2W |
| FAISS GPU (IVF-PQ) | CUDA quantized | ~200–500M/s | ~2–5 ns | RTX 3060 | 170W |
| FAISS GPU (Flat) | CUDA FP32 dot | ~50–100M/s | ~10–20 ns | RTX 3060 | 170W |
| FAISS GPU (cuVS) | CUDA optimized | ~1,000–2,000M/s | ~0.5–1 ns | H100 80GB | 700W |
| FAISS CPU (Flat) | AVX2 FP32 dot | ~50M/s | ~20 ns | i7 | 65W |
| FAISS CPU (IVF-PQ) | AVX2 quantized | ~100–200M/s | ~5–10 ns | i7 | 65W |
A $35 Raspberry Pi 4 at 5 watts matches or beats a $350 RTX 3060 at 170 watts. A Sapphire Rapids server outperforms an H100 at half the power. A $15 Pi Zero 2W at 2 watts still beats FAISS CPU Flat by 60%.
The trick: GPU must FP32-multiply, FP32-divide, and transfer over PCIe. We read one u8 from a 64KB table that lives in L1 cache. No transfer, no kernel launch, no floating point.
Five layers on top of upstream ndarray's array primitives:
SIMD Polyfill (simd.rs, simd_avx512.rs, simd_avx2.rs, simd_neon.rs) — std::simd-compatible types (F32x16, F64x8, U8x64, I32x16) on stable Rust via core::arch. Detection once via LazyLock<SimdCaps>, dispatch via frozen function pointer table (0.3ns per call).
Backend (backend/) — Pluggable BLAS: pure-Rust Goto-GEMM (default), Intel MKL (feature-gated), OpenBLAS (feature-gated). Native backend: 6×16 f32 + 6×8 f64 microkernels, cache-blocked L1/L2/L3, 16-thread split-borrow parallelism.
HPC Library (hpc/, 146 files) — BLAS L1-L3, LAPACK, FFT, VML, statistics, activations, quantized ops. Every module SIMD-accelerated through the frozen dispatch table.
Codec (fingerprint.rs, bgz17_bridge.rs, cam_pq.rs, palette_distance.rs) — Encoding stack for compressed inference: Fingerprint<256>, Base17, CAM-PQ, palette semiring. O(1) per token — table lookups replace matrix multiplication.
Burn Integration (crates/burn/) — SIMD-augmented burn-ndarray backend wiring F32x16 into tensor ops and activations.
| ISA | Upstream ndarray | This Fork | Speedup |
|---|---|---|---|
| AVX-512 (16×f32) | Scalar fallback | Native __m512 types |
~8× |
| AVX-512 VNNI (int8) | Scalar fallback | 64 MACs/instr + dispatch | ~32× |
| AVX-512 BF16 | Not available | Hardware + RNE emulation | new |
| AVX-512 VPOPCNTDQ | Scalar fallback | Native 512-bit popcount | ~16× |
| AMX (256 MACs) | Not available | Inline asm, stable Rust | ~128× |
| AVX2 + FMA (8×f32) | Via matrixmultiply | Goto-GEMM + dispatch | ~4× |
| AVX2 F16C | Not available | IEEE 754 f16 + precision toolkit | new |
| NEON (4×f32) | Scalar fallback | 3-tier: A53/A72/A76 | ~4× |
| NEON dotprod | Not available | vdotq_s32 (Pi 5) |
~16× |
| NEON fp16 | Not available | FCVTL/FCVTN via asm |
new |
Upstream on x86_64: → matrixmultiply crate (AVX2 if available, no AVX-512)
Upstream on aarch64: → Scalar (no NEON, no intrinsics)
Upstream on wasm: → Scalar
Fork on x86_64: → AVX-512 / AVX2 / SSE2 / Scalar (tiered, auto-detected)
Fork on aarch64: → NEON A76+dotprod / A72 2×pipe / A53 / Scalar (tiered)
Fork on wasm: → WASM SIMD128 (prepared) / Scalar
| Matrix Size | Upstream | This Fork | NumPy | PyTorch CPU | GPU (RTX 3060) |
|---|---|---|---|---|---|
| 512×512 | ~20 GFLOPS | 47 GFLOPS | ~45 | ~40 | ~1,200 |
| 1024×1024 | ~13 GFLOPS | 139 GFLOPS | ~120 | ~100 | ~3,500 |
| 2048×2048 | ~13 GFLOPS | ~150 GFLOPS | ~140 | ~130 | ~5,000 |
10.5× over upstream at 1024×1024 — matches NumPy OpenBLAS.
| Hardware | ISA | tok/s | 50-tok Latency | Power |
|---|---|---|---|---|
| Sapphire Rapids | AMX | 380,000 | 0.13 ms | 250W |
| Xeon | AVX-512 VNNI | 10K–50K | 1–5 ms | 150W |
| Pi 5 | NEON+dotprod | 2K–5K | 10–25 ms | 5W |
| Pi 4 | NEON dual | 500–2K | 25–100 ms | 5W |
| Format | Size | Max Error | Speed |
|---|---|---|---|
| f32 | 60 MB | — | — |
| f16 | 30 MB | 7.3e-6 | 94M/s |
| Scaled-f16 | 30 MB | 4.9e-6 | 91M/s |
| Double-f16 | 60 MB | 5.7e-8 | 42M/s |
- SIMD Polyfill on Stable —
F32x16/F64x8/U8x64viacore::arch, not nightlystd::simd - f16 Without Nightly —
u16carrier + F16C hardware / ARMFCVTLviaasm!() - AMX on Stable —
asm!(".byte ...")encoding, 256 MACs/instruction - Tiered ARM NEON — A53/A72/A76 with pipeline + big.LITTLE awareness
- 0.3ns Dispatch — LazyLock frozen fn-pointer table, no per-call branching
- BF16 RNE Bit-Exact — Pure AVX-512-F emulates
VCVTNEPS2BF16bit-for-bit - Cognitive Codec Stack — Fingerprint → Base17 → CAM-PQ → Palette → bgz7 (201GB → 685MB, O(1) inference)
use ndarray::Array2;
use ndarray::hpc::simd_caps::simd_caps;
let a = Array2::<f32>::ones((1024, 1024));
let c = a.dot(&a); // AVX-512 / AVX2 / NEON — auto
let caps = simd_caps();
if caps.neon { println!("{}", caps.arm_profile().name()); }cargo build --release # auto-detect
cargo build --release --target aarch64-unknown-linux-gnu # Pi 4
RUSTFLAGS="-C target-cpu=x86-64-v4" cargo build --release # AVX-512
cargo test # 880 tests| Repo | Role |
|---|---|
| lance-graph | Graph query + codec spine |
| home-automation-rs | Smart home + voice AI |
MIT OR Apache-2.0