Skip to content

Latest commit

 

History

History
320 lines (249 loc) · 14.9 KB

File metadata and controls

320 lines (249 loc) · 14.9 KB

RustyNum vs ndarray: Comprehensive Comparison

Date: February 25, 2026 Environment: Linux 4.4.0, Rust nightly 1.95.0, stable 1.93.1, x86_64


Executive Summary

ndarray is the mature, widely-adopted n-dimensional array library for the Rust ecosystem (v0.17.2, 4.2k stars, ~32k dependents). It provides a NumPy-like API for general-purpose array operations with optional BLAS acceleration.

RustyNum is a specialized, pre-release numerical computation ecosystem (v0.1.0, 0 stars) focused on pure Rust SIMD (AVX-512/VNNI), BLAS/LAPACK/FFT replacement, and Hyperdimensional Computing (HDC/VSA) primitives -- all sharing zero-copy memory via a Blackboard architecture.

They overlap on basic array operations but serve fundamentally different niches.


1. Architecture and Design Philosophy

Aspect ndarray RustyNum
Core type ArrayBase<S, D> -- generic over storage and dimensionality NumArray<T, S> -- generic over element type and SIMD backend
Dimensionality Type-level (Ix1..Ix6, IxDyn) with compile-time checking Runtime shape vectors (Vec<usize>)
Memory model Owned arrays, views, shared (Arc) Owned flat Vec<T> + optional Blackboard (64-byte aligned arena)
SIMD strategy Relies on LLVM autovectorization + matrixmultiply crate Explicit portable_simd (f32x16, f64x8, u8x64)
Parallelism Optional rayon integration Lock-free split_at_mut + thread::scope (built-in)
BLAS Optional via cblas-sys (pluggable: OpenBLAS, MKL, etc.) Built-in pure Rust (rustyblas): cache-blocked Goto GEMM
Dependencies matrixmultiply, rawpointer, num-traits, num-complex Zero runtime deps (core crates); smallvec only
Rust edition Stable Rust 1.64+ Nightly only (#![feature(portable_simd)])
License MIT/Apache-2.0 Apache-2.0

Key Architectural Differences

ndarray uses a sophisticated type system with ArrayBase<S, D> where S controls ownership (owned, view, shared) and D controls dimensionality. This enables zero-cost abstractions: slicing returns views without copying, transpose is a stride manipulation, and the compiler enforces dimension correctness.

RustyNum uses a flat Vec<T> with runtime shape checking and dispatches to explicit SIMD kernels. It trades compile-time dimension safety for explicit hardware control (AVX-512 microkernels, VNNI int8 paths, VPOPCNTDQ hamming).


2. Feature Comparison

Common Ground (both provide)

  • N-dimensional arrays with shape/reshape
  • Element-wise arithmetic (+, -, *, /)
  • Dot product, matrix multiply
  • Sum, mean, min, max
  • Slicing and views
  • Transpose

ndarray Exclusive Features

  • Type-safe dimensionality (compile-time dimension checking)
  • Views and borrowing (zero-copy slicing, split views, windows)
  • Broadcasting (NumPy-style shape broadcasting)
  • azip! macro (lock-step multi-array iteration)
  • Zip combinator (efficient parallel traversal)
  • Rayon parallel iterators (par_azip!, par_map_inplace)
  • Serde serialization support
  • no_std support (with default-features = false)
  • Complex number support (num-complex)
  • Stable Rust (no nightly required)
  • Pluggable BLAS backends (OpenBLAS, MKL, Accelerate via blas-src)
  • Mature ecosystem (ndarray-rand, ndarray-linalg, ndarray-stats)

RustyNum Exclusive Features

  • Explicit AVX-512 SIMD (f32x16, u8x64 -- not dependent on autovectorization)
  • Pure Rust BLAS L1/L2/L3 (rustyblas: sgemm, dgemm, int8_gemm, bf16_gemm)
  • Pure Rust LAPACK (rustymkl: LU, Cholesky, QR factorization)
  • Pure Rust FFT (radix-2 Cooley-Tukey)
  • Pure Rust VML (vectorized exp, ln, sqrt, sin, cos)
  • INT8 Quantized GEMM (AVX-512 VNNI vpdpbusd, 64 MACs/instruction)
  • BF16 Mixed-Precision GEMM (half bandwidth, f32 accumulation)
  • HDC/VSA primitives (BIND, BUNDLE, PERMUTE, Hamming distance)
  • Adaptive cascade search (3-sigma/2-sigma early-exit, 99.7% candidate elimination)
  • INT8 prefiltering (approximate stats, GEMM row pruning)
  • Zero-copy Blackboard (64-byte aligned shared memory with split-borrow)
  • Tiered compute dispatch (INT8 -> BF16 -> FP32 -> GPU, runtime HW detection)
  • CogRecord (domain-specific 8KB container for holographic memory)
  • Python bindings (via PyO3)

3. Benchmark Results

All benchmarks were run on the same machine with RUSTFLAGS="-C target-cpu=native" and --release optimization.

3.1 Head-to-Head: Vector Operations (Criterion, same benchmark binary)

These results come from rustynum's own array_benchmarks.rs which tests all three libraries (rustynum, ndarray, nalgebra) using Criterion in the same process.

Operation Size rustynum (ns) ndarray (ns) nalgebra (ns) Winner
Vector Addition 1,000 152 158 142 nalgebra
10,000 2,414 1,634 1,650 ndarray
100,000 23,287 15,407 16,745 ndarray
Dot Product 1,000 117 165 175 rustynum
10,000 1,463 2,060 2,388 rustynum
100,000 14,523 20,719 26,956 rustynum
Mean 1,000 143 124 1,625 ndarray
10,000 739 1,270 16,130 rustynum
100,000 7,303 13,016 160,958 rustynum
Median 1,000 901 745 743 nalgebra
10,000 8,707 7,602 7,362 nalgebra
100,000 83,506 71,932 74,012 ndarray

Key takeaways:

  • rustynum wins dot product by 1.4x at all sizes (explicit SIMD vs autovectorization)
  • rustynum wins mean at 10K+ by 1.7-1.8x (SIMD reduction)
  • ndarray wins addition at 10K+ by 1.5x (more efficient memory allocation/iteration)
  • ndarray/nalgebra win median (sort-dominated; both use similar scalar sort)

3.2 Matrix Multiply (GEMM) -- The Critical Benchmark

Criterion head-to-head (rustynum-rs matrix_multiply vs ndarray .dot())

Size rustynum-rs (ms) ndarray (ms) nalgebra (ms) ndarray speedup over rustynum
100x100 9.93 0.042 0.040 236x
500x500 11.08 4.23 4.50 2.6x
1000x1000 39.20 32.54 48.38 1.2x

Note: The rustynum-rs matrix_multiply function (in the rustynum-rs crate) uses a simpler transpose-dot approach, not the cache-blocked Goto algorithm from rustyblas. The rustyblas::level3::sgemm is much faster:

rustyblas Goto GEMM (cache-blocked + multithreaded)

Size rustyblas Old (ms) rustyblas New Goto+MT (ms) GFLOPS ndarray (ms) ndarray GFLOPS
32x32 0.02 0.01 5.34 0.001 50.57
64x64 0.04 0.02 25.03 0.009 56.02
128x128 0.28 0.10 40.86 0.066 63.24
256x256 1.90 0.72 46.32 0.497 67.54
512x512 12.44 8.68 30.91 3.957 67.83
1024x1024 159.93 19.30 111.29 34.82 61.68

Analysis:

  • At small sizes (<=256), ndarray's matrixmultiply crate is substantially faster (it has highly tuned kernels with careful cache blocking)
  • At 512x512, ndarray is still 2.2x faster (68 vs 31 GFLOPS)
  • At 1024x1024, rustyblas overtakes ndarray: 111 GFLOPS vs 62 GFLOPS (1.8x faster) thanks to multithreading
  • ndarray's matrixmultiply is single-threaded by default (can enable matrixmultiply-threading feature)

ndarray Standalone GEMM (from bench1)

Size f32 (ns) f64 (ns) i32 (ns)
4x4 101 86 111
8x8 95 120 373
16x16 303 454 2,373
32x32 1,281 2,232 16,567
64x64 8,958 16,322 113,109
127x127 64,517 124,762 982,257
10000 (mixed) 5,363,748 10,499,984 --

3.3 Matrix-Vector Multiply

Size rustynum (ns) ndarray (ns) nalgebra (ns) Winner
100 1,735 2,091 2,330 rustynum
500 29,836 53,176 79,276 rustynum
1000 139,390 214,805 494,346 rustynum

rustynum wins matrix-vector multiply at all sizes by 1.2-2.3x (SIMD-optimized GEMV).

3.4 ndarray Standalone Performance

From the custom ndarray benchmark:

Operation Size Time (ns) Notes
Sum f32 1,000 125
Sum f32 10,000 1,260
Sum f32 100,000 12,697
Mean f64 10,000 1,429
Std f64 10,000 66,416 Two-pass algorithm
Std f64 100,000 653,669
Zeros f32 10,000 262 Very fast allocation
Ones f32 10,000 797
Linspace f64 10,000 2,543
Slice 100K->50K -- 11 Zero-copy view
Transpose 1000x1000 -- 0.19 Zero-cost stride swap
Sum axis=0 [1000x100] -- 15,104
Sum axis=1 [1000x100] -- 15,141

ndarray's transpose is a zero-cost operation (0.19 ns) -- it just swaps strides without copying data. RustyNum's transpose involves data movement.

3.5 RustyNum-Exclusive: HDC/VSA Operations

These operations have no ndarray equivalent.

Operation SIMD (ns) Naive Scalar (ns) Speedup
XOR/BIND 8 KB 230 6,471 28x
XOR/BIND 16 KB 495 13,161 27x
XOR/BIND 32 KB 1,624 26,192 16x
Hamming distance 8 KB 121 2,164 18x
Hamming distance 16 KB 236 4,338 18x
Hamming distance 32 KB 827 8,632 10x
Bundle n=5 (8 KB) 109,949 263,306 2.4x
Bundle n=16 (8 KB) 277,796 861,417 3.1x
Bundle n=64 (8 KB) 658,828 3,048,335 4.6x
Bundle n=1024 (8 KB) 6,262,564 119,318,597 19x
Int8 dot product 1024D 278 -- (VNNI)
Int8 cosine sim 1024D 803 -- (VNNI)

4. API and Ergonomics

ndarray (Mature, NumPy-inspired)

use ndarray::prelude::*;

let a = array![[1., 2.], [3., 4.]];
let b = Array2::<f64>::eye(2);
let c = a.dot(&b);                    // Matrix multiply
let view = a.slice(s![.., 0..1]);     // Zero-copy slice
let mean = a.mean_axis(Axis(0));      // Axis reduction
let t = a.t();                        // Zero-cost transpose
azip!((a in &a, b in &b) { ... });   // Lock-step iteration

Strengths: Rich slicing DSL (s![] macro), broadcasting, views, parallel iterators, strong type safety, extensive documentation, large ecosystem.

RustyNum (Performance-first, SIMD-explicit)

use rustynum_rs::NumArrayF32;

let a = NumArrayF32::new(vec![1.0, 2.0, 3.0]);
let b = NumArrayF32::new(vec![4.0, 5.0, 6.0]);
let c = &a + &b;                      // SIMD addition
let dot = a.dot(&b);                   // SIMD dot product
let mean = a.mean();                   // SIMD reduction

// HDC operations (no ndarray equivalent)
use rustynum_rs::NumArrayU8;
let bound = a_hdc ^ b_hdc;            // XOR bind
let dist = a_hdc.hamming_distance(&b); // VPOPCNTDQ

Strengths: Explicit SIMD control, HDC/VSA primitives, INT8/BF16 quantized ops, zero external dependencies.


5. Ecosystem and Maturity

Metric ndarray RustyNum
Version 0.17.2 0.1.0
crates.io Published (64M+ downloads) Not published
GitHub stars ~4,200 0
Dependents ~31,900 0
Contributors Many (open source community) 2-3 (private)
First commit ~2015 2024
Requires nightly No (stable Rust 1.64+) Yes (portable_simd)
Documentation docs.rs, extensive README only
Test coverage Comprehensive ~3,000+ test annotations
CI/CD GitHub Actions GitHub Actions + Miri
no_std support Yes No

6. When to Use Which

Choose ndarray when:

  • You need a general-purpose N-dimensional array library
  • Stable Rust is required
  • You need broadcasting, views, and NumPy-like ergonomics
  • You want a mature, well-documented crate with ecosystem support
  • You need complex number support
  • You need to interface with BLAS/LAPACK via established backends
  • You're building a library that others will depend on

Choose RustyNum when:

  • You need HDC/VSA operations (BIND, BUNDLE, PERMUTE, Hamming distance)
  • You need INT8/BF16 quantized computation (ML inference, embeddings)
  • You need explicit AVX-512 SIMD control (not relying on autovectorization)
  • You need BLAS/LAPACK/FFT without C dependencies (pure Rust deployment)
  • You're working with CogRecord or holographic memory systems
  • You need the adaptive cascade search pattern (early-exit database scans)
  • You're OK with nightly Rust and a pre-release API
  • You need dot product and mean at maximum throughput

Complementary use:

The two libraries can be used together. RustyNum's benchmark suite already depends on ndarray for comparison. In practice:

  • Use ndarray for general array manipulation, slicing, views, broadcasting
  • Use RustyNum for hot-path SIMD operations, HDC primitives, and quantized inference
  • Transfer data via flat &[T] slices (both support this)

7. Performance Summary Table

Category Winner Margin
Vector addition (10K+) ndarray 1.5x
Dot product (all sizes) rustynum 1.4x
Mean (10K+) rustynum 1.7x
Median ndarray 1.1x
Matrix-vector multiply rustynum 1.5-2.3x
GEMM small (<=256) ndarray 2-10x
GEMM medium (512) ndarray 2.2x
GEMM large (1024, single-thread) ndarray ~1.8x
GEMM large (1024, multi-thread) rustynum (rustyblas) 1.8x
Transpose ndarray infinite (zero-cost)
Slicing ndarray zero-copy views
HDC/VSA ops rustynum (exclusive) N/A
INT8/BF16 GEMM rustynum (exclusive) N/A
LAPACK/FFT (pure Rust) rustynum (exclusive) N/A

8. Conclusion

ndarray and RustyNum are not direct competitors but complementary tools:

  • ndarray is the de facto standard for N-dimensional arrays in Rust. It offers mature, well-tested, ergonomic array operations with excellent memory management (views, broadcasting) and works on stable Rust. For most general-purpose numerical computing in Rust, ndarray is the right choice.

  • RustyNum is a specialized performance toolkit focused on explicit SIMD, quantized computation, and domain-specific operations (HDC/VSA). Its strengths are in hot-path operations where explicit hardware control matters: dot products (1.4x faster), mean (1.7x faster), matrix-vector multiply (2.3x faster), and entirely unique capabilities like INT8 GEMM, adaptive cascade search, and HDC primitives. However, it requires nightly Rust, lacks ndarray's type safety and ergonomics, and is pre-release.

The ideal architecture for a high-performance numerical application in Rust might use ndarray for data management and general computation while calling into RustyNum/rustyblas for performance-critical inner loops and specialized operations.