ndarray — AdaWorldAPI HPC Expansion

A complete high-performance numerical computing stack built on top of rust-ndarray/ndarray. 55 HPC modules, 880 tests, BLAS L1-L3, LAPACK, FFT, quantized inference, SIMD kernels from Intel AMX to Raspberry Pi NEON — stable Rust 1.94, zero nightly.

Deutsche Version | Full Feature Comparison (146 modules)

Cosine Similarity: Us vs. GPU vs. Everyone

System	Method	Throughput	Latency	Hardware	Watt
This fork — Sapphire Rapids	Palette u8 + AMX prefetch	~3,200M/s	~0.3 ns	Xeon w9-3595X	350W
This fork — i7/i5 11th gen	Palette u8 (AVX-512)	2,400M/s	0.4 ns	i7-11700K	65W
This fork — Raspberry Pi 4	Palette u8 (NEON)	~400M/s	~2.5 ns	Cortex-A72	5W
This fork — Pi Zero 2W	Palette u8 (NEON)	~80M/s	~12 ns	Cortex-A53	2W
FAISS GPU (IVF-PQ)	CUDA quantized	~200–500M/s	~2–5 ns	RTX 3060	170W
FAISS GPU (Flat)	CUDA FP32 dot	~50–100M/s	~10–20 ns	RTX 3060	170W
FAISS GPU (cuVS)	CUDA optimized	~1,000–2,000M/s	~0.5–1 ns	H100 80GB	700W
FAISS CPU (Flat)	AVX2 FP32 dot	~50M/s	~20 ns	i7	65W
FAISS CPU (IVF-PQ)	AVX2 quantized	~100–200M/s	~5–10 ns	i7	65W

A $35 Raspberry Pi 4 at 5 watts matches or beats a $350 RTX 3060 at 170 watts. A Sapphire Rapids server outperforms an H100 at half the power. A $15 Pi Zero 2W at 2 watts still beats FAISS CPU Flat by 60%.

The trick: GPU must FP32-multiply, FP32-divide, and transfer over PCIe. We read one u8 from a 64KB table that lives in L1 cache. No transfer, no kernel launch, no floating point.

Core Architecture

Five layers on top of upstream ndarray's array primitives:

SIMD Polyfill (simd.rs, simd_avx512.rs, simd_avx2.rs, simd_neon.rs) — std::simd-compatible types (F32x16, F64x8, U8x64, I32x16) on stable Rust via core::arch. Detection once via LazyLock<SimdCaps>, dispatch via frozen function pointer table (0.3ns per call).

Backend (backend/) — Pluggable BLAS: pure-Rust Goto-GEMM (default), Intel MKL (feature-gated), OpenBLAS (feature-gated). Native backend: 6×16 f32 + 6×8 f64 microkernels, cache-blocked L1/L2/L3, 16-thread split-borrow parallelism.

HPC Library (hpc/, 146 files) — BLAS L1-L3, LAPACK, FFT, VML, statistics, activations, quantized ops. Every module SIMD-accelerated through the frozen dispatch table.

Codec (fingerprint.rs, bgz17_bridge.rs, cam_pq.rs, palette_distance.rs) — Encoding stack for compressed inference: Fingerprint<256>, Base17, CAM-PQ, palette semiring. O(1) per token — table lookups replace matrix multiplication.

Burn Integration (crates/burn/) — SIMD-augmented burn-ndarray backend wiring F32x16 into tensor ops and activations.

Upstream vs. Fork

ISA Coverage

ISA	Upstream ndarray	This Fork	Speedup
AVX-512 (16×f32)	Scalar fallback	Native `__m512` types	~8×
AVX-512 VNNI (int8)	Scalar fallback	64 MACs/instr + dispatch	~32×
AVX-512 BF16	Not available	Hardware + RNE emulation	new
AVX-512 VPOPCNTDQ	Scalar fallback	Native 512-bit popcount	~16×
AMX (256 MACs)	Not available	Inline asm, stable Rust	~128×
AVX2 + FMA (8×f32)	Via matrixmultiply	Goto-GEMM + dispatch	~4×
AVX2 F16C	Not available	IEEE 754 f16 + precision toolkit	new
NEON (4×f32)	Scalar fallback	3-tier: A53/A72/A76	~4×
NEON dotprod	Not available	`vdotq_s32` (Pi 5)	~16×
NEON fp16	Not available	`FCVTL`/`FCVTN` via asm	new

What Upstream Does on Each Target

Upstream on x86_64:  → matrixmultiply crate (AVX2 if available, no AVX-512)
Upstream on aarch64: → Scalar (no NEON, no intrinsics)
Upstream on wasm:    → Scalar

Fork on x86_64:      → AVX-512 / AVX2 / SSE2 / Scalar (tiered, auto-detected)
Fork on aarch64:     → NEON A76+dotprod / A72 2×pipe / A53 / Scalar (tiered)
Fork on wasm:        → WASM SIMD128 (prepared) / Scalar

Performance

GEMM

Matrix Size	Upstream	This Fork	NumPy	PyTorch CPU	GPU (RTX 3060)
512×512	~20 GFLOPS	47 GFLOPS	~45	~40	~1,200
1024×1024	~13 GFLOPS	139 GFLOPS	~120	~100	~3,500
2048×2048	~13 GFLOPS	~150 GFLOPS	~140	~130	~5,000

10.5× over upstream at 1024×1024 — matches NumPy OpenBLAS.

Codebook Inference

Hardware	ISA	tok/s	50-tok Latency	Power
Sapphire Rapids	AMX	380,000	0.13 ms	250W
Xeon	AVX-512 VNNI	10K–50K	1–5 ms	150W
Pi 5	NEON+dotprod	2K–5K	10–25 ms	5W
Pi 4	NEON dual	500–2K	25–100 ms	5W

f16 Weight Transcoding

Format	Size	Max Error	Speed
f32	60 MB	—	—
f16	30 MB	7.3e-6	94M/s
Scaled-f16	30 MB	4.9e-6	91M/s
Double-f16	60 MB	5.7e-8	42M/s

What We Build That Nobody Else Does

SIMD Polyfill on Stable — F32x16/F64x8/U8x64 via core::arch, not nightly std::simd
f16 Without Nightly — u16 carrier + F16C hardware / ARM FCVTL via asm!()
AMX on Stable — asm!(".byte ...") encoding, 256 MACs/instruction
Tiered ARM NEON — A53/A72/A76 with pipeline + big.LITTLE awareness
0.3ns Dispatch — LazyLock frozen fn-pointer table, no per-call branching
BF16 RNE Bit-Exact — Pure AVX-512-F emulates VCVTNEPS2BF16 bit-for-bit
Cognitive Codec Stack — Fingerprint → Base17 → CAM-PQ → Palette → bgz7 (201GB → 685MB, O(1) inference)

Quick Start

use ndarray::Array2;
use ndarray::hpc::simd_caps::simd_caps;

let a = Array2::<f32>::ones((1024, 1024));
let c = a.dot(&a);  // AVX-512 / AVX2 / NEON — auto

let caps = simd_caps();
if caps.neon { println!("{}", caps.arm_profile().name()); }

cargo build --release                                        # auto-detect
cargo build --release --target aarch64-unknown-linux-gnu     # Pi 4
RUSTFLAGS="-C target-cpu=x86-64-v4" cargo build --release   # AVX-512
cargo test                                                    # 880 tests

Ecosystem

Repo	Role
lance-graph	Graph query + codec spine
home-automation-rs	Smart home + voice AI

License

MIT OR Apache-2.0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ndarray — AdaWorldAPI HPC Expansion

Cosine Similarity: Us vs. GPU vs. Everyone

Core Architecture

Upstream vs. Fork

ISA Coverage

What Upstream Does on Each Target

Performance

GEMM

Codebook Inference

f16 Weight Transcoding

What We Build That Nobody Else Does

Quick Start

Ecosystem

License

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

ndarray — AdaWorldAPI HPC Expansion

Cosine Similarity: Us vs. GPU vs. Everyone

Core Architecture

Upstream vs. Fork

ISA Coverage

What Upstream Does on Each Target

Performance

GEMM

Codebook Inference

f16 Weight Transcoding

What We Build That Nobody Else Does

Quick Start

Ecosystem

License