Date: February 25, 2026 Environment: Linux 4.4.0, Rust nightly 1.95.0, stable 1.93.1, x86_64
ndarray is the mature, widely-adopted n-dimensional array library for the Rust ecosystem (v0.17.2, 4.2k stars, ~32k dependents). It provides a NumPy-like API for general-purpose array operations with optional BLAS acceleration.
RustyNum is a specialized, pre-release numerical computation ecosystem (v0.1.0, 0 stars) focused on pure Rust SIMD (AVX-512/VNNI), BLAS/LAPACK/FFT replacement, and Hyperdimensional Computing (HDC/VSA) primitives -- all sharing zero-copy memory via a Blackboard architecture.
They overlap on basic array operations but serve fundamentally different niches.
| Aspect | ndarray | RustyNum |
|---|---|---|
| Core type | ArrayBase<S, D> -- generic over storage and dimensionality |
NumArray<T, S> -- generic over element type and SIMD backend |
| Dimensionality | Type-level (Ix1..Ix6, IxDyn) with compile-time checking |
Runtime shape vectors (Vec<usize>) |
| Memory model | Owned arrays, views, shared (Arc) | Owned flat Vec<T> + optional Blackboard (64-byte aligned arena) |
| SIMD strategy | Relies on LLVM autovectorization + matrixmultiply crate |
Explicit portable_simd (f32x16, f64x8, u8x64) |
| Parallelism | Optional rayon integration | Lock-free split_at_mut + thread::scope (built-in) |
| BLAS | Optional via cblas-sys (pluggable: OpenBLAS, MKL, etc.) |
Built-in pure Rust (rustyblas): cache-blocked Goto GEMM |
| Dependencies | matrixmultiply, rawpointer, num-traits, num-complex |
Zero runtime deps (core crates); smallvec only |
| Rust edition | Stable Rust 1.64+ | Nightly only (#![feature(portable_simd)]) |
| License | MIT/Apache-2.0 | Apache-2.0 |
ndarray uses a sophisticated type system with ArrayBase<S, D> where S controls ownership (owned, view, shared) and D controls dimensionality. This enables zero-cost abstractions: slicing returns views without copying, transpose is a stride manipulation, and the compiler enforces dimension correctness.
RustyNum uses a flat Vec<T> with runtime shape checking and dispatches to explicit SIMD kernels. It trades compile-time dimension safety for explicit hardware control (AVX-512 microkernels, VNNI int8 paths, VPOPCNTDQ hamming).
- N-dimensional arrays with shape/reshape
- Element-wise arithmetic (+, -, *, /)
- Dot product, matrix multiply
- Sum, mean, min, max
- Slicing and views
- Transpose
- Type-safe dimensionality (compile-time dimension checking)
- Views and borrowing (zero-copy slicing, split views, windows)
- Broadcasting (NumPy-style shape broadcasting)
azip!macro (lock-step multi-array iteration)Zipcombinator (efficient parallel traversal)- Rayon parallel iterators (
par_azip!,par_map_inplace) - Serde serialization support
no_stdsupport (withdefault-features = false)- Complex number support (
num-complex) - Stable Rust (no nightly required)
- Pluggable BLAS backends (OpenBLAS, MKL, Accelerate via
blas-src) - Mature ecosystem (ndarray-rand, ndarray-linalg, ndarray-stats)
- Explicit AVX-512 SIMD (
f32x16,u8x64-- not dependent on autovectorization) - Pure Rust BLAS L1/L2/L3 (rustyblas: sgemm, dgemm, int8_gemm, bf16_gemm)
- Pure Rust LAPACK (rustymkl: LU, Cholesky, QR factorization)
- Pure Rust FFT (radix-2 Cooley-Tukey)
- Pure Rust VML (vectorized exp, ln, sqrt, sin, cos)
- INT8 Quantized GEMM (AVX-512 VNNI
vpdpbusd, 64 MACs/instruction) - BF16 Mixed-Precision GEMM (half bandwidth, f32 accumulation)
- HDC/VSA primitives (BIND, BUNDLE, PERMUTE, Hamming distance)
- Adaptive cascade search (3-sigma/2-sigma early-exit, 99.7% candidate elimination)
- INT8 prefiltering (approximate stats, GEMM row pruning)
- Zero-copy Blackboard (64-byte aligned shared memory with split-borrow)
- Tiered compute dispatch (INT8 -> BF16 -> FP32 -> GPU, runtime HW detection)
- CogRecord (domain-specific 8KB container for holographic memory)
- Python bindings (via PyO3)
All benchmarks were run on the same machine with RUSTFLAGS="-C target-cpu=native" and --release optimization.
These results come from rustynum's own array_benchmarks.rs which tests all three libraries (rustynum, ndarray, nalgebra) using Criterion in the same process.
| Operation | Size | rustynum (ns) | ndarray (ns) | nalgebra (ns) | Winner |
|---|---|---|---|---|---|
| Vector Addition | 1,000 | 152 | 158 | 142 | nalgebra |
| 10,000 | 2,414 | 1,634 | 1,650 | ndarray | |
| 100,000 | 23,287 | 15,407 | 16,745 | ndarray | |
| Dot Product | 1,000 | 117 | 165 | 175 | rustynum |
| 10,000 | 1,463 | 2,060 | 2,388 | rustynum | |
| 100,000 | 14,523 | 20,719 | 26,956 | rustynum | |
| Mean | 1,000 | 143 | 124 | 1,625 | ndarray |
| 10,000 | 739 | 1,270 | 16,130 | rustynum | |
| 100,000 | 7,303 | 13,016 | 160,958 | rustynum | |
| Median | 1,000 | 901 | 745 | 743 | nalgebra |
| 10,000 | 8,707 | 7,602 | 7,362 | nalgebra | |
| 100,000 | 83,506 | 71,932 | 74,012 | ndarray |
Key takeaways:
- rustynum wins dot product by 1.4x at all sizes (explicit SIMD vs autovectorization)
- rustynum wins mean at 10K+ by 1.7-1.8x (SIMD reduction)
- ndarray wins addition at 10K+ by 1.5x (more efficient memory allocation/iteration)
- ndarray/nalgebra win median (sort-dominated; both use similar scalar sort)
| Size | rustynum-rs (ms) | ndarray (ms) | nalgebra (ms) | ndarray speedup over rustynum |
|---|---|---|---|---|
| 100x100 | 9.93 | 0.042 | 0.040 | 236x |
| 500x500 | 11.08 | 4.23 | 4.50 | 2.6x |
| 1000x1000 | 39.20 | 32.54 | 48.38 | 1.2x |
Note: The rustynum-rs matrix_multiply function (in the rustynum-rs crate) uses a simpler transpose-dot approach, not the cache-blocked Goto algorithm from rustyblas. The rustyblas::level3::sgemm is much faster:
| Size | rustyblas Old (ms) | rustyblas New Goto+MT (ms) | GFLOPS | ndarray (ms) | ndarray GFLOPS |
|---|---|---|---|---|---|
| 32x32 | 0.02 | 0.01 | 5.34 | 0.001 | 50.57 |
| 64x64 | 0.04 | 0.02 | 25.03 | 0.009 | 56.02 |
| 128x128 | 0.28 | 0.10 | 40.86 | 0.066 | 63.24 |
| 256x256 | 1.90 | 0.72 | 46.32 | 0.497 | 67.54 |
| 512x512 | 12.44 | 8.68 | 30.91 | 3.957 | 67.83 |
| 1024x1024 | 159.93 | 19.30 | 111.29 | 34.82 | 61.68 |
Analysis:
- At small sizes (<=256), ndarray's
matrixmultiplycrate is substantially faster (it has highly tuned kernels with careful cache blocking) - At 512x512, ndarray is still 2.2x faster (68 vs 31 GFLOPS)
- At 1024x1024, rustyblas overtakes ndarray: 111 GFLOPS vs 62 GFLOPS (1.8x faster) thanks to multithreading
- ndarray's
matrixmultiplyis single-threaded by default (can enablematrixmultiply-threadingfeature)
| Size | f32 (ns) | f64 (ns) | i32 (ns) |
|---|---|---|---|
| 4x4 | 101 | 86 | 111 |
| 8x8 | 95 | 120 | 373 |
| 16x16 | 303 | 454 | 2,373 |
| 32x32 | 1,281 | 2,232 | 16,567 |
| 64x64 | 8,958 | 16,322 | 113,109 |
| 127x127 | 64,517 | 124,762 | 982,257 |
| 10000 (mixed) | 5,363,748 | 10,499,984 | -- |
| Size | rustynum (ns) | ndarray (ns) | nalgebra (ns) | Winner |
|---|---|---|---|---|
| 100 | 1,735 | 2,091 | 2,330 | rustynum |
| 500 | 29,836 | 53,176 | 79,276 | rustynum |
| 1000 | 139,390 | 214,805 | 494,346 | rustynum |
rustynum wins matrix-vector multiply at all sizes by 1.2-2.3x (SIMD-optimized GEMV).
From the custom ndarray benchmark:
| Operation | Size | Time (ns) | Notes |
|---|---|---|---|
| Sum f32 | 1,000 | 125 | |
| Sum f32 | 10,000 | 1,260 | |
| Sum f32 | 100,000 | 12,697 | |
| Mean f64 | 10,000 | 1,429 | |
| Std f64 | 10,000 | 66,416 | Two-pass algorithm |
| Std f64 | 100,000 | 653,669 | |
| Zeros f32 | 10,000 | 262 | Very fast allocation |
| Ones f32 | 10,000 | 797 | |
| Linspace f64 | 10,000 | 2,543 | |
| Slice 100K->50K | -- | 11 | Zero-copy view |
| Transpose 1000x1000 | -- | 0.19 | Zero-cost stride swap |
| Sum axis=0 [1000x100] | -- | 15,104 | |
| Sum axis=1 [1000x100] | -- | 15,141 |
ndarray's transpose is a zero-cost operation (0.19 ns) -- it just swaps strides without copying data. RustyNum's transpose involves data movement.
These operations have no ndarray equivalent.
| Operation | SIMD (ns) | Naive Scalar (ns) | Speedup |
|---|---|---|---|
| XOR/BIND 8 KB | 230 | 6,471 | 28x |
| XOR/BIND 16 KB | 495 | 13,161 | 27x |
| XOR/BIND 32 KB | 1,624 | 26,192 | 16x |
| Hamming distance 8 KB | 121 | 2,164 | 18x |
| Hamming distance 16 KB | 236 | 4,338 | 18x |
| Hamming distance 32 KB | 827 | 8,632 | 10x |
| Bundle n=5 (8 KB) | 109,949 | 263,306 | 2.4x |
| Bundle n=16 (8 KB) | 277,796 | 861,417 | 3.1x |
| Bundle n=64 (8 KB) | 658,828 | 3,048,335 | 4.6x |
| Bundle n=1024 (8 KB) | 6,262,564 | 119,318,597 | 19x |
| Int8 dot product 1024D | 278 | -- | (VNNI) |
| Int8 cosine sim 1024D | 803 | -- | (VNNI) |
use ndarray::prelude::*;
let a = array![[1., 2.], [3., 4.]];
let b = Array2::<f64>::eye(2);
let c = a.dot(&b); // Matrix multiply
let view = a.slice(s![.., 0..1]); // Zero-copy slice
let mean = a.mean_axis(Axis(0)); // Axis reduction
let t = a.t(); // Zero-cost transpose
azip!((a in &a, b in &b) { ... }); // Lock-step iterationStrengths: Rich slicing DSL (s![] macro), broadcasting, views, parallel iterators, strong type safety, extensive documentation, large ecosystem.
use rustynum_rs::NumArrayF32;
let a = NumArrayF32::new(vec![1.0, 2.0, 3.0]);
let b = NumArrayF32::new(vec![4.0, 5.0, 6.0]);
let c = &a + &b; // SIMD addition
let dot = a.dot(&b); // SIMD dot product
let mean = a.mean(); // SIMD reduction
// HDC operations (no ndarray equivalent)
use rustynum_rs::NumArrayU8;
let bound = a_hdc ^ b_hdc; // XOR bind
let dist = a_hdc.hamming_distance(&b); // VPOPCNTDQStrengths: Explicit SIMD control, HDC/VSA primitives, INT8/BF16 quantized ops, zero external dependencies.
| Metric | ndarray | RustyNum |
|---|---|---|
| Version | 0.17.2 | 0.1.0 |
| crates.io | Published (64M+ downloads) | Not published |
| GitHub stars | ~4,200 | 0 |
| Dependents | ~31,900 | 0 |
| Contributors | Many (open source community) | 2-3 (private) |
| First commit | ~2015 | 2024 |
| Requires nightly | No (stable Rust 1.64+) | Yes (portable_simd) |
| Documentation | docs.rs, extensive | README only |
| Test coverage | Comprehensive | ~3,000+ test annotations |
| CI/CD | GitHub Actions | GitHub Actions + Miri |
no_std support |
Yes | No |
- You need a general-purpose N-dimensional array library
- Stable Rust is required
- You need broadcasting, views, and NumPy-like ergonomics
- You want a mature, well-documented crate with ecosystem support
- You need complex number support
- You need to interface with BLAS/LAPACK via established backends
- You're building a library that others will depend on
- You need HDC/VSA operations (BIND, BUNDLE, PERMUTE, Hamming distance)
- You need INT8/BF16 quantized computation (ML inference, embeddings)
- You need explicit AVX-512 SIMD control (not relying on autovectorization)
- You need BLAS/LAPACK/FFT without C dependencies (pure Rust deployment)
- You're working with CogRecord or holographic memory systems
- You need the adaptive cascade search pattern (early-exit database scans)
- You're OK with nightly Rust and a pre-release API
- You need dot product and mean at maximum throughput
The two libraries can be used together. RustyNum's benchmark suite already depends on ndarray for comparison. In practice:
- Use ndarray for general array manipulation, slicing, views, broadcasting
- Use RustyNum for hot-path SIMD operations, HDC primitives, and quantized inference
- Transfer data via flat
&[T]slices (both support this)
| Category | Winner | Margin |
|---|---|---|
| Vector addition (10K+) | ndarray | 1.5x |
| Dot product (all sizes) | rustynum | 1.4x |
| Mean (10K+) | rustynum | 1.7x |
| Median | ndarray | 1.1x |
| Matrix-vector multiply | rustynum | 1.5-2.3x |
| GEMM small (<=256) | ndarray | 2-10x |
| GEMM medium (512) | ndarray | 2.2x |
| GEMM large (1024, single-thread) | ndarray | ~1.8x |
| GEMM large (1024, multi-thread) | rustynum (rustyblas) | 1.8x |
| Transpose | ndarray | infinite (zero-cost) |
| Slicing | ndarray | zero-copy views |
| HDC/VSA ops | rustynum (exclusive) | N/A |
| INT8/BF16 GEMM | rustynum (exclusive) | N/A |
| LAPACK/FFT (pure Rust) | rustynum (exclusive) | N/A |
ndarray and RustyNum are not direct competitors but complementary tools:
-
ndarray is the de facto standard for N-dimensional arrays in Rust. It offers mature, well-tested, ergonomic array operations with excellent memory management (views, broadcasting) and works on stable Rust. For most general-purpose numerical computing in Rust, ndarray is the right choice.
-
RustyNum is a specialized performance toolkit focused on explicit SIMD, quantized computation, and domain-specific operations (HDC/VSA). Its strengths are in hot-path operations where explicit hardware control matters: dot products (1.4x faster), mean (1.7x faster), matrix-vector multiply (2.3x faster), and entirely unique capabilities like INT8 GEMM, adaptive cascade search, and HDC primitives. However, it requires nightly Rust, lacks ndarray's type safety and ergonomics, and is pre-release.
The ideal architecture for a high-performance numerical application in Rust might use ndarray for data management and general computation while calling into RustyNum/rustyblas for performance-critical inner loops and specialized operations.