ndarray — AdaWorldAPI HPC Expansion

A complete high-performance numerical computing stack built on top of rust-ndarray/ndarray. 55 HPC modules, 880 tests, BLAS L1-L3, LAPACK, FFT, quantized inference, SIMD kernels from Intel AMX to Raspberry Pi NEON — stable Rust 1.94, zero nightly.

Deutsche Version | Full Feature Comparison (146 modules)

Why This Exists

What	Us	GPU (RTX 3060)	GPU (H100)	NumPy CPU
Cosine similarity	2,400M/s (palette u8)	~300M/s (IVF-PQ)	~1,500M/s (cuVS)	~50M/s (dot)
GEMM 1024x1024	139 GFLOPS	3,500 GFLOPS	30,000 GFLOPS	120 GFLOPS
Codebook inference	2,000 tok/s @ 5W (Pi 4)	~100K tok/s @ 170W	~500K tok/s @ 700W	N/A
Energy efficiency	37M ops/s/W	1.8M ops/s/W	2.1M ops/s/W	1.8M ops/s/W
Startup latency	0 ms (no kernel launch)	2-10 ms	2-10 ms	50 ms (Python)
Hardware cost	$0 (runs on any CPU)	$350	$30,000	$0
PCIe transfer	None (data in L1 cache)	Required	Required	None
Rust stable	Yes (1.94)	CUDA toolkit	CUDA toolkit	Python

GPU wins at large dense GEMM. We win at everything else: similarity search, latency-sensitive inference, edge deployment, energy efficiency, and cost. A $35 Raspberry Pi 4 at 5 watts outperforms a $350 GPU at 170 watts for codebook inference — because table lookups don't need floating-point hardware.

Core Architecture

Five layers built on top of upstream ndarray's array primitives:

SIMD Polyfill (simd.rs, simd_avx512.rs, simd_avx2.rs, simd_neon.rs) — std::simd-compatible types (F32x16, F64x8, U8x64, I32x16) on stable Rust via core::arch. Detection once via LazyLock<SimdCaps>, dispatch via frozen function pointer table (0.3ns per call).

Backend (backend/) — Pluggable BLAS: pure-Rust Goto-GEMM (default), Intel MKL (feature-gated), OpenBLAS (feature-gated). Native backend: 6x16 f32 + 6x8 f64 microkernels, cache-blocked L1/L2/L3, 16-thread split-borrow parallelism.

HPC Library (hpc/, 146 files) — BLAS L1-L3, LAPACK, FFT, VML, statistics, activations, quantized ops. Every module SIMD-accelerated through the frozen dispatch table.

Codec (fingerprint.rs, bgz17_bridge.rs, cam_pq.rs, palette_distance.rs) — Encoding stack for compressed inference: Fingerprint<256>, Base17, CAM-PQ, palette semiring. O(1) per token — table lookups replace matrix multiplication.

Burn Integration (crates/burn/) — SIMD-augmented burn-ndarray backend wiring F32x16 into tensor ops and activations.

Upstream vs. Fork

ISA Coverage

ISA	Upstream ndarray	This Fork	Speedup
AVX-512 (16×f32)	Scalar fallback	Native `__m512` types	~8×
AVX-512 VNNI (int8)	Scalar fallback	64 MACs/instr + dispatch	~32×
AVX-512 BF16	Not available	Hardware + RNE emulation	new
AVX-512 VPOPCNTDQ	Scalar fallback	Native 512-bit popcount	~16×
AMX (256 MACs)	Not available	Inline asm, stable Rust	~128×
AVX2 + FMA (8×f32)	Via matrixmultiply	Goto-GEMM + dispatch	~4×
AVX2 F16C	Not available	IEEE 754 f16 + precision toolkit	new
NEON (4×f32)	Scalar fallback	3-tier: A53/A72/A76	~4×
NEON dotprod	Not available	`vdotq_s32` (Pi 5)	~16×
NEON fp16	Not available	`FCVTL`/`FCVTN` via asm	new

What Upstream Does on Each Target

Upstream on x86_64:  → matrixmultiply crate (AVX2 if available, no AVX-512)
Upstream on aarch64: → Scalar (no NEON, no intrinsics)
Upstream on wasm:    → Scalar

Fork on x86_64:      → AVX-512 / AVX2 / SSE2 / Scalar (tiered, auto-detected)
Fork on aarch64:     → NEON A76+dotprod / A72 2×pipe / A53 / Scalar (tiered)
Fork on wasm:        → WASM SIMD128 (prepared) / Scalar

Performance

GEMM

Matrix Size	Upstream	This Fork	NumPy	PyTorch CPU	GPU (RTX 3060)
512×512	~20 GFLOPS	47 GFLOPS	~45	~40	~1,200
1024×1024	~13 GFLOPS	139 GFLOPS	~120	~100	~3,500
2048×2048	~13 GFLOPS	~150 GFLOPS	~140	~130	~5,000

10.5× over upstream at 1024×1024 — matches NumPy OpenBLAS.

Codebook Inference

Hardware	ISA	tok/s	50-tok Latency	Power
Sapphire Rapids	AMX	380,000	0.13 ms	250W
Xeon	AVX-512 VNNI	10K–50K	1–5 ms	150W
Pi 5	NEON+dotprod	2K–5K	10–25 ms	5W
Pi 4	NEON dual	500–2K	25–100 ms	5W

Cosine via Palette Distance

Tier	Error	Speed	vs. GPU (RTX 3060)
Foveal (1/40σ)	0.4%	611M/s	~2× faster
Near (1σ)	8%	2,400M/s	~8× faster
F32 exact	0%	50M/s	6× slower
RTX 3060 IVF-PQ	~5%	~300M/s	baseline
H100 cuVS	~2%	~1,500M/s	5× our cost

611M cosine-equivalent lookups/sec using only integer operations. The 256×256 table (64KB) lives in L1 cache — no FP division, no multiplication, no PCIe transfer.

f16 Weight Transcoding

Format	Size	Max Error	Speed
f32	60 MB	—	—
f16	30 MB	7.3e-6	94M/s
Scaled-f16	30 MB	4.9e-6	91M/s
Double-f16	60 MB	5.7e-8	42M/s

What We Build That Nobody Else Does

SIMD Polyfill on Stable — F32x16/F64x8/U8x64 via core::arch, not nightly std::simd
f16 Without Nightly — u16 carrier + F16C hardware / ARM FCVTL via asm!()
AMX on Stable — asm!(".byte ...") encoding, 256 MACs/instruction
Tiered ARM NEON — A53/A72/A76 with pipeline + big.LITTLE awareness
0.3ns Dispatch — LazyLock frozen fn-pointer table, no per-call branching
BF16 RNE Bit-Exact — Pure AVX-512-F emulates VCVTNEPS2BF16 bit-for-bit
Cognitive Codec Stack — Fingerprint → Base17 → CAM-PQ → Palette → bgz7 (201GB → 685MB, O(1) inference)

Quick Start

use ndarray::Array2;
use ndarray::hpc::simd_caps::simd_caps;

let a = Array2::<f32>::ones((1024, 1024));
let c = a.dot(&a);  // AVX-512 / AVX2 / NEON — auto

let caps = simd_caps();
if caps.neon { println!("{}", caps.arm_profile().name()); }

cargo build --release                                        # auto-detect
cargo build --release --target aarch64-unknown-linux-gnu     # Pi 4
RUSTFLAGS="-C target-cpu=x86-64-v4" cargo build --release   # AVX-512
cargo test                                                    # 880 tests

Ecosystem

Repo	Role
lance-graph	Graph query + codec spine
home-automation-rs	Smart home + voice AI

License

MIT OR Apache-2.0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ndarray — AdaWorldAPI HPC Expansion

Why This Exists

Core Architecture

Upstream vs. Fork

ISA Coverage

What Upstream Does on Each Target

Performance

GEMM

Codebook Inference

Cosine via Palette Distance

f16 Weight Transcoding

What We Build That Nobody Else Does

Quick Start

Ecosystem

License

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

ndarray — AdaWorldAPI HPC Expansion

Why This Exists

Core Architecture

Upstream vs. Fork

ISA Coverage

What Upstream Does on Each Target

Performance

GEMM

Codebook Inference

Cosine via Palette Distance

f16 Weight Transcoding

What We Build That Nobody Else Does

Quick Start

Ecosystem

License