yscv Performance Benchmarks

Measured CPU-inference and single-operation benchmarks for yscv, compared against ONNX Runtime, PyTorch, NumPy, OpenCV, and ffmpeg.

Last updated: 2026-06-16 · commit fa80f66

This document has two parts. The CPU sections (Siamese tracker, single-op) are the current measurement focus: fixed hardware, pinned competitor versions, isolated per-op processes, regenerable from a script in the repo. They cover three hosts — AMD Ryzen 5 7500F (Zen 4), Orange Pi Zero 3 (Cortex-A53), and Apple M1. The Metal / video sections below them are retained for reference but were measured on different hardware and dates; where a number could not be reproduced on current tooling it is marked pending re-measurement. Treat those as provisional.

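Where yscv is at parity with or behind a competitor, this is stated plainly. Unless a table says otherwise, ratios are written as competitor / yscv (a ratio above 1.0 means yscv is faster). The Zen 4 tracker tables are the exception: ORT is ahead there, so they report yscv / ORT and say so next to the table.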


Siamese tracker — CPU inference vs ONNX Runtime

The primary end-to-end target. Model is a public two-tower Siamese single-object tracker exported to ONNX (~156 ops after graph optimization), with two inputs: a 1×3×128×128 template branch and a 1×3×256×256 search branch, fp32 zero-fill. On Zen 4 and the A53 the reported figure is the minimum wall-clock latency over 300 iterations (after warmup), which isolates the steady-state compute path from scheduler and allocator jitter. The M1 tables report p50 over the same 300 iterations.
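The measurement protocol described above can be sketched as a small timing harness. This is a minimal illustration, not the repo's benchmark driver: `measure_latency`, `fps`, the warmup count of 20, and the stand-in workload are all assumptions made for the example.

```python
import statistics
import time


def measure_latency(run_once, iters=300, warmup=20):
    """Return (min_ms, p50_ms) over `iters` timed calls after `warmup` untimed calls.

    The warmup count is an assumption; the document only says "after warmup".
    """
    for _ in range(warmup):
        run_once()
    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter_ns()
        run_once()
        samples_ms.append((time.perf_counter_ns() - start) / 1e6)
    return min(samples_ms), statistics.median(samples_ms)


def fps(latency_ms):
    """Frames per second implied by a per-frame latency in milliseconds."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    # Stand-in workload; a real run would call the model's inference function.
    def workload():
        return sum(i * i for i in range(10_000))

    min_ms, p50_ms = measure_latency(workload)
    print(f"min {min_ms:.3f} ms ({fps(min_ms):.0f} FPS), p50 {p50_ms:.3f} ms")
```

The minimum reflects the best steady-state run, while p50 also absorbs typical scheduler noise, so a min column and a p50 column should not be compared directly across hosts.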

x86 — AMD Ryzen 5 7500F (Zen 4, 6C/12T)

yscv (commit 241f36c) vs ONNX Runtime 1.24.4, CPUExecutionProvider, on the same host.

Here the ratio is yscv / ORT (yscv's slowdown factor); >1.0 means yscv is slower, since on this model ORT is ahead.

| Threads | yscv min | yscv FPS | ORT min | yscv / ORT |
|---|---|---|---|---|
| 1 | 8.63 ms | 116 | 8.03 ms | 1.07× |
| 4 | 3.06 ms | 327 | 2.35 ms | 1.30× |
| 6 | 2.52 ms | 396 | 1.74 ms | 1.45× |

Honest reading. yscv is roughly 7% behind ORT single-threaded (8.63 vs 8.03 ms) and the gap widens with thread count: ORT scales better across cores on this model, reaching 1.74 ms at 6 threads against yscv's 2.52 ms (ORT is ~1.45× faster at 6T). The single-thread compute path is close; the deficit is in multi-thread scaling, not per-core kernel throughput. The remaining structural difference is layout — ORT runs NCHWc throughout while yscv runs NHWC. See docs/onnx-cpu-kernels.md for the per-op hot-path map.
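The scaling argument can be checked directly from the Zen 4 table. The sketch below uses only the latencies reported there; the function names are illustrative.

```python
# Zen 4 tracker minimum latencies in ms, copied from the table above.
YSCV_MS = {1: 8.63, 4: 3.06, 6: 2.52}
ORT_MS = {1: 8.03, 4: 2.35, 6: 1.74}


def slowdown(yscv_ms, ort_ms):
    """yscv / ORT ratio; above 1.0 means yscv is slower."""
    return yscv_ms / ort_ms


def scaling(latencies_ms, threads):
    """Speedup of the `threads`-thread run over the same runtime's 1-thread run."""
    return latencies_ms[1] / latencies_ms[threads]


if __name__ == "__main__":
    for t in (1, 4, 6):
        print(
            f"{t}T: yscv/ORT {slowdown(YSCV_MS[t], ORT_MS[t]):.2f}x, "
            f"yscv scaling {scaling(YSCV_MS, t):.2f}x, "
            f"ORT scaling {scaling(ORT_MS, t):.2f}x"
        )
```

At 6 threads yscv reaches about 3.4× its own single-thread speed against about 4.6× for ORT, which is where the 1.07× gap at 1T grows to 1.45×.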

ARM — Orange Pi Zero 3 (Cortex-A53, 4C)

yscv (commit d8d43ea on the device) vs ONNX Runtime 1.26.0 (CPUExecutionProvider), same host, same model and inputs. On this edge-ARM target yscv is faster than ORT at both thread counts. This is the inverse of the x86 picture, and the A53 is the project's actual deployment target.

Ratio here is ORT / yscv (>1.0 means yscv is faster).

| Threads | yscv min | yscv FPS | ORT min | ORT / yscv |
|---|---|---|---|---|
| 1 | 321 ms | 3.1 | 496 ms | 1.54× |
| 4 | 101 ms | 9.9 | 161 ms | 1.59× |

Reference context (prior measurement, not from this round): yscv has also measured near-parity with XNNPACK on the A53 at roughly 307 ms / 1T and 98.5 ms / 4T; cited for scale only.

Apple Silicon — Apple M1 (4 P-cores + 4 E-cores)

yscv vs ONNX Runtime 1.19.2 (CPUExecutionProvider), same host, same model and inputs, fp32 zero-fill, p50 over 300 iterations. On Apple Silicon yscv is well ahead of ORT-CPU at every thread count.

Ratio here is ORT / yscv (>1.0 means yscv is faster).

| Threads | yscv p50 | yscv FPS | ORT p50 | ORT / yscv |
|---|---|---|---|---|
| 1 | 15.37 ms | 65 | 29.80 ms | 1.94× |
| 4 | 5.46 ms | 183 | 23.64 ms | 4.33× |
| 8 | 6.56 ms | 153 | 37.52 ms | 5.72× |

Four threads is the sweet spot — the M1's four performance cores saturate there, and pushing to eight pulls in the efficiency cores, which regresses both runtimes.


Single-operation compute (1 thread)

x86 — AMD Ryzen 5 7500F (Zen 4)

Per-operation isolated microbenchmarks at 1 thread, 1000 iterations after 200 warmup, each op in a fresh process. yscv / PyTorch / ONNX Runtime figures are generated by run-single-compute.sh and recorded in single-compute-zen4-2026-06-09.md (commit 241f36c, PyTorch 2.12.0, ONNX Runtime 1.26.0). The NumPy column is produced separately by bench_numpy_single_ops.py (NumPy 2.4.6). All times are p50 microseconds; lower is faster.

| Operation | Shape | yscv | NumPy | PyTorch | ORT |
|---|---|---|---|---|---|
| add | 1024×1024 | 100 | 99 | 98 | 111 |
| mul | 1024×1024 | 97 | 96 | 97 | 116 |
| sum | 1024×1024 | 29 | 114 | 32 | 83 |
| max | 1024×1024 | 30 | 36 | 79 | 56 |
| add (broadcast last dim) | 1024×1024 + 1024 | 72 | 148 | 67 | 91 |
| sub (broadcast row − matrix) | 1024 − 1024×1024 | 69 | 144 | 68 | 93 |
| exp | 1024×1024 | 110 | 364 | 634 | 138 |
| relu | 921600 | 55 | 248 | 57 | 64 |
| sigmoid | 921600 | 74 | 521 | 364 | 202 |
| tanh | 1024×1024 | 188 | 288 | 2004 | 208 |
| gelu (sigmoid approx) | 1024×1024 | 176 | 4325 | 3384 | 419 |
| silu | 1024×1024 | 162 | 3026 | 350 | 227 |
| softmax | 32×1000 | 7 | 29 | 15 | 9 |
| log_softmax | 32×1000 | 7 | 38 | 14 | 9 |
| softmax | 512×256 | 26 | 143 | 63 | 36 |
| layer_norm | 512×256 | 11 | 187 | 39 | 111 |
| batch_norm | 1×64×64×3 ↔ 1×3×64×64 | 2 | 10 | 8 | 3 |

Honest reading.

  • Memory-bound elementwise (add, mul, broadcast): on same-shape add and mul, yscv is at parity with NumPy and PyTorch (all within 2 µs). These ops are limited by memory bandwidth, not arithmetic, so the single-thread implementations converge. yscv is marginally slower than PyTorch on add (100 vs 98 µs) and the broadcast add (72 vs 67 µs), and within the 1 µs parity band on broadcast sub (69 vs 68 µs). NumPy is about 2× slower than yscv on both broadcast variants (148 and 144 µs). yscv is faster than ORT on all of them (1.1–1.4×).
  • Transcendentals / activations: yscv's polynomial-approximation kernels win substantially over PyTorch (exp ~5.8×, tanh ~10.7×, gelu ~19×, sigmoid ~4.9×, silu ~2.2×). They also beat NumPy on all five, by margins that range from ~1.5× on tanh and ~3.3× on exp to ~7× on sigmoid, ~19× on silu, and ~25× on gelu. These are approximations to within a float tolerance, not bit-exact transcendental functions; that is the tradeoff that buys the speed.
  • Reductions / normalization: yscv wins on sum, max, softmax, layer_norm, and batch_norm against all three (batch_norm vs ORT, at 2 vs 3 µs, falls inside the 1 µs parity band).
  • vs ORT-CPU overall: yscv is ahead on every op in the table, by roughly 1.1–2.9× (and 10× on layer_norm, where ORT is unexpectedly slow at this shape).
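To make the "approximation to within a float tolerance" tradeoff concrete, here is a minimal sketch of the general technique: range reduction followed by a short polynomial. It is not yscv's kernel. The degree-6 Taylor coefficients are an assumption chosen for simplicity (production kernels typically use tuned minimax coefficients and SIMD), and `exp_poly` is a hypothetical name.

```python
import math

LN2 = math.log(2.0)
# Taylor coefficients 1/k! for k = 6 down to 0, ordered for Horner's rule.
COEFFS = [1.0 / math.factorial(k) for k in range(6, -1, -1)]


def exp_poly(x):
    """Approximate e**x by writing x = n*ln(2) + r with |r| <= ln(2)/2."""
    n = round(x / LN2)
    r = x - n * LN2
    acc = 0.0
    for c in COEFFS:  # Horner evaluation of the degree-6 polynomial in r
        acc = acc * r + c
    return math.ldexp(acc, n)  # scale by 2**n


def max_relative_error(lo=-10.0, hi=10.0, steps=20_001):
    """Largest relative error of exp_poly against math.exp on a uniform grid."""
    worst = 0.0
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        exact = math.exp(x)
        worst = max(worst, abs(exp_poly(x) - exact) / exact)
    return worst


if __name__ == "__main__":
    print(f"max relative error on [-10, 10]: {max_relative_error():.2e}")
```

The polynomial replaces a library call with a handful of multiply-adds per element, which is why this style of kernel vectorizes well. The cost is a small, bounded relative error instead of a correctly rounded result.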

gelu uses the sigmoid approximation x · sigmoid(1.702·x); yscv uses NHWC for batch_norm while PyTorch and ORT use NCHW over the same data volume.
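For readers comparing the gelu row against other GELU implementations, the sketch below contrasts the sigmoid approximation named above with the erf-based definition. The function names are illustrative, and the scan range and step count are arbitrary choices.

```python
import math


def gelu_exact(x):
    """Reference GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_sigmoid(x):
    """Sigmoid approximation used in the benchmark: x * sigmoid(1.702 * x)."""
    return x / (1.0 + math.exp(-1.702 * x))


def max_abs_error(lo=-6.0, hi=6.0, steps=12_001):
    """Largest absolute difference between the two forms on a uniform grid."""
    worst = 0.0
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        worst = max(worst, abs(gelu_sigmoid(x) - gelu_exact(x)))
    return worst


if __name__ == "__main__":
    print(f"max |gelu_sigmoid - gelu_exact| on [-6, 6]: {max_abs_error():.4f}")
```

The two forms differ by roughly 0.02 at worst, so the gelu timings in these tables apply to the sigmoid form specifically and not to an erf-based GELU.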

Reproduction

# yscv / PyTorch / ONNX Runtime (writes the dated markdown snapshot)
RAYON_NUM_THREADS=1 YSCV_POOL_SPIN_US=200 ITERS=1000 WARMUP=200 \
  OUT=benchmarks/single-compute-zen4-$(date -u +%F).md \
  bash benchmarks/run-single-compute.sh

# NumPy column (single thread)
python3 benchmarks/python/bench_numpy_single_ops.py --iters 1000 --threads 1

The yscv/PyTorch/ORT snapshot records git commit and dirty state, toolchain and Python runtime versions, raw per-backend logs under artifacts/, and min / p50 / avg rows per backend. Each operation is measured in a fresh process: PyTorch full-suite single-process runs showed allocator/cache contamination on later memory-bound ops, so isolated per-op p50 is the source of truth. yscv uses YSCV_POOL=yscv and YSCV_POOL_SPIN_US=200 for this tight standalone loop; the normal inference default is unchanged. p50 deltas within 1 µs are reported as parity.
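The per-op isolation and the 1 µs parity rule can be sketched as follows. This is not run-single-compute.sh: the child script, the two stand-in ops, and the names `p50_us_isolated` and `verdict` are assumptions made for illustration, and treating a delta of exactly 1 µs as parity is one reading of "within 1 µs".

```python
import subprocess
import sys

# Code run inside each fresh interpreter: time one op, print its p50 in microseconds.
CHILD = r"""
import statistics, sys, time
op, iters, warmup = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])
data = [float(i) for i in range(100_000)]
ops = {"sum": lambda: sum(data), "max": lambda: max(data)}
fn = ops[op]
for _ in range(warmup):
    fn()
samples = []
for _ in range(iters):
    t0 = time.perf_counter_ns()
    fn()
    samples.append((time.perf_counter_ns() - t0) / 1e3)
print(statistics.median(samples))
"""


def p50_us_isolated(op, iters=100, warmup=20):
    """Measure one op in its own process so earlier ops cannot warm or pollute its state."""
    out = subprocess.run(
        [sys.executable, "-c", CHILD, op, str(iters), str(warmup)],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())


def verdict(yscv_us, other_us, tolerance_us=1.0):
    """Classify two p50 values using the 1 µs parity band."""
    if abs(yscv_us - other_us) <= tolerance_us:
        return "parity"
    return "yscv faster" if yscv_us < other_us else "yscv slower"


if __name__ == "__main__":
    for name in ("sum", "max"):
        print(name, f"{p50_us_isolated(name):.1f} us")
    print(verdict(69, 68), "|", verdict(100, 98), "|", verdict(29, 32))
```

Launching a process per op costs startup time, but it keeps allocator and cache state from one op out of the next measurement.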

ARM — Orange Pi Zero 3 (Cortex-A53)

Same per-op isolated methodology on the A53 (yscv compute_gap, PyTorch 2.12.0+cpu, ONNX Runtime 1.26.0, NumPy 2.4.6; 300 iterations, p50 µs, 1 thread).

| Operation | Shape | yscv | NumPy | PyTorch | ORT |
|---|---|---|---|---|---|
| add | 1024×1024 | 6645 | 6690 | 8250 | 6938 |
| mul | 1024×1024 | 6647 | 7317 | 7810 | 7295 |
| exp | 1024×1024 | 13161 | 41029 | 23915 | 17716 |
| relu | 921600 | 3406 | 5924 | 3012 | 3280 |
| sigmoid | 921600 | 4100 | 47008 | 22677 | 16139 |
| tanh | 1024×1024 | 5907 | 21555 | 44393 | 16507 |
| gelu (sigmoid approx) | 1024×1024 | 7321 | 79255 | 37741 | 25720 |
| silu | 1024×1024 | 5989 | 67883 | 27541 | 23827 |
| softmax | 512×256 | 1069 | 7276 | 3212 | 3413 |
| layer_norm | 512×256 | 508 | 4495 | 1561 | 2401 |
| batch_norm | 1×3×64×64 | 23 | 221 | 186 | 88 |

Honest reading (A53). On the weak in-order core, memory-bound elementwise ops are close across backends, limited by the Pi's DRAM bandwidth rather than arithmetic. On add, yscv, NumPy, and ORT are within about 5% of each other (PyTorch trails at 8250 µs). On relu, both PyTorch and ORT edge ahead of yscv (3012 and 3280 vs 3406 µs), while NumPy trails at 5924 µs. yscv's NEON polynomial-approximation kernels then pull far ahead on transcendentals/activations, listed here as vs ORT / vs PyTorch / vs NumPy: sigmoid ~3.9× / ~5.5× / ~11×; tanh ~2.8× / ~7.5× / ~3.6×; gelu ~3.5× / ~5.2× / ~10.8×; silu ~4.0× / ~4.6× / ~11.3×. softmax, layer_norm, and batch_norm are 3–9.6× faster than every backend. On ARM yscv beats NumPy, PyTorch, and ORT-CPU on every op except relu, consistent with the tracker result, where yscv is 1.5–1.6× faster than ORT on the A53.

Reproduce on the device:

cargo build --release -p yscv-llm-bench --bin compute_gap
RAYON_NUM_THREADS=1 ./target/release/compute_gap --iters 300
python3 benchmarks/python/bench_ort_single_ops.py   --iters 300 --threads 1
python3 benchmarks/python/bench_numpy_single_ops.py --iters 300 --threads 1
python3 benchmarks/python/bench_torch_single_ops.py --iters 300 --threads 1

Apple Silicon — Apple M1

Same per-op isolated methodology on the M1 (yscv compute_gap, PyTorch 2.8.0, ONNX Runtime 1.19.2, NumPy 2.0.2; 1000 iterations after 200 warmup, p50 µs, 1 thread), recorded in single-compute-m1-2026-06-16.md.

| Operation | Shape | yscv | NumPy | PyTorch | ORT |
|---|---|---|---|---|---|
| add | 1024×1024 | 95 | 141 | 143 | 141 |
| mul | 1024×1024 | 92 | 140 | 144 | 146 |
| sum | 1024×1024 | 67 | 176 | 51 | 127 |
| max | 1024×1024 | 57 | 54 | 167 | 87 |
| add (broadcast last dim) | 1024×1024 + 1024 | 128 | 154 | 101 | 143 |
| sub (broadcast row − matrix) | 1024 − 1024×1024 | 131 | 155 | 101 | 145 |
| exp | 1024×1024 | 446 | 1763 | 979 | 739 |
| relu | 921600 | 81 | 326 | 79 | 81 |
| sigmoid | 921600 | 184 | 1800 | 981 | 520 |
| tanh | 1024×1024 | 436 | 1054 | 3727 | 573 |
| gelu (sigmoid approx) | 1024×1024 | 442 | 2280 | 1366 | 756 |
| silu | 1024×1024 | 436 | 2193 | 1149 | 752 |
| softmax | 32×1000 | 23 | 77 | 47 | 28 |
| log_softmax | 32×1000 | 23 | 87 | 40 | 28 |
| softmax | 512×256 | 76 | 303 | 193 | 110 |
| layer_norm | 512×256 | 55 | 213 | 95 | 263 |
| batch_norm | 1×64×64×3 ↔ 1×3×64×64 | 2 | 13 | 10 | 4 |

Honest reading (M1). Against ORT-CPU yscv is ahead on every op except relu (81 vs 81 µs, parity), by ~1.1–2.8× on most ops and 4.8× on layer_norm. Against PyTorch the picture is mixed: yscv wins decisively on the polynomial-approximation activations (tanh ~8.5×, sigmoid ~5.3×, gelu ~3.1×, silu ~2.6×) but is slower on sum (67 vs 51 µs) and the broadcast add/sub (128 vs 101, 131 vs 101 µs), where PyTorch's reduction/broadcast kernels are better tuned for this core; relu is near parity (81 vs 79 µs). Against NumPy yscv wins on every op except max, where NumPy is slightly ahead (54 vs 57 µs). As on the other hosts, the activation wins come from polynomial approximations (NEON on this host) that trade a float-tolerance error for speed.


Methodology summary

  • Hardware: AMD Ryzen 5 7500F (Zen 4, 6C/12T) for x86; Orange Pi Zero 3 (Cortex-A53, 4C) for ARM; Apple M1 (4 P + 4 E cores) for Apple Silicon. Rust 1.95 stable, --release with codegen-units = 1, mimalloc global allocator.
  • Tracker: wall-clock latency over 300 iterations after warmup, fp32, dual input as described above. Competitor is ONNX Runtime CPUExecutionProvider (plus CoreMLExecutionProvider for the M1 GPU comparison). The Zen 4 and A53 tables report min; the M1 tables report p50.
  • Single-op: p50 of 1000 iterations after 200 warmup (300 on the A53), each op isolated in its own process to avoid cross-op cache/allocator contamination.
  • Ratios are competitor / yscv (>1.0 means yscv is faster), except in the Zen 4 tracker tables, which report yscv / ORT (>1.0 means yscv is slower).
  • All landings in the kernel path are bitwise-identical or 1-ULP-close to the reference; the suite builds clean on x86_64 + aarch64 and passes with and without BLAS.
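A check for "bitwise-identical or 1-ULP-close" on float32 values can be sketched as below. This is an illustrative helper, not the project's test code; the names are hypothetical and NaN inputs are not handled.

```python
import struct


def f32_bits(x):
    """Bit pattern of x rounded to float32, as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def ordered(bits):
    """Map float32 bits to a signed integer that increases with the float value.

    Both +0.0 and -0.0 map to 0, so they count as zero ULPs apart.
    """
    if bits & 0x80000000:
        return -(bits & 0x7FFFFFFF)
    return bits


def ulp_distance(a, b):
    """Number of representable float32 steps between a and b."""
    return abs(ordered(f32_bits(a)) - ordered(f32_bits(b)))


def close_within_ulps(a, b, max_ulps=1):
    """True when a and b are bitwise identical or at most `max_ulps` apart."""
    return ulp_distance(a, b) <= max_ulps


if __name__ == "__main__":
    # Build the float32 value one step above 1.0 by incrementing its bit pattern.
    next_up = struct.unpack("<f", struct.pack("<I", f32_bits(1.0) + 1))[0]
    print(ulp_distance(1.0, next_up), close_within_ulps(1.0, 1.001))
```

Comparing in ULPs instead of with an absolute tolerance keeps the check meaningful across magnitudes, since the spacing between adjacent float32 values grows with the exponent.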

Reference benchmarks (macOS / Metal / video)

Most of the sections below were measured on Apple Silicon hardware; the RK3588, Zen 4 historical arc, and RTX 4060 sections name their own hosts. The Metal tracker tables are current, while the YOLO-inference, elementwise, and video sections were measured on earlier dates with older tooling and are individually marked pending re-measurement. Treat those as provisional.

Tensor Elementwise Ops (1M f32, vs NumPy) — macOS, measured Apr 2026, pending re-measurement

Operation yscv NumPy Ratio Status
add 0.128ms 0.142ms 1.11× WIN
sub 0.154ms 0.142ms 0.92× PARITY
mul 0.134ms 0.142ms 1.06× PARITY
sum 0.020ms 0.172ms 8.6× WIN
max 0.020ms 0.053ms 2.7× WIN
min 0.020ms 0.053ms 2.7× WIN
exp 0.389ms 1.704ms 4.4× WIN
relu 0.082ms 0.402ms 4.9× WIN
argmax <0.001ms 0.429ms >400× WIN
gt/eq/lt _into 0.116-0.130ms 0.314ms 2.5× WIN
transpose 512² 0.112ms 0.184ms 1.6× WIN

Tensor Unary Ops (1M f32, vs NumPy) — macOS, measured Apr 2026, pending re-measurement

Operation yscv NumPy Ratio Status
abs 0.080ms 0.088ms 1.1× WIN
neg 0.080ms ~0.126ms 1.6× WIN
floor 0.077ms 0.088ms 1.1× WIN
ceil 0.077ms ~0.350ms 4.5× WIN
round 0.077ms ~0.350ms 4.5× WIN
sign 0.099ms ~0.350ms 3.5× WIN
reciprocal 0.083ms ~0.200ms 2.4× WIN
clamp 0.090ms ~0.350ms 3.9× WIN
sqrt 0.156ms 0.163ms 1.04× PARITY
ln 0.370ms ~1.200ms 3.2× WIN

Activations (vs PyTorch 2.8, CPU 1-thread) — macOS, measured Apr 2026, pending re-measurement

Current Zen 4 single-thread activation numbers are in the single-operation compute section above; this macOS table is retained for reference only.

Operation yscv PyTorch Ratio Status
sigmoid 921K f32 0.217ms 1.296ms 6.0× WIN
softmax 512×256 0.098ms 0.216ms 2.2× WIN
relu 921K f32 0.069ms 0.105ms 1.5× WIN
layer_norm 512×256 0.065ms 0.117ms 1.8× WIN
gelu 2.522ms WIN (old: 0.333ms vs ~0.400ms)

MatMul & Conv2d (vs PyTorch 2.8, CPU 1-thread) — macOS, measured Apr 2026, pending re-measurement

Operation yscv PyTorch Ratio Status
matmul 128² 0.0055ms 0.0062ms 1.13× WIN
conv2d 32² 3×3 0.074ms 0.080ms 1.08× WIN

Normalization (vs PyTorch 2.8, CPU 1-thread) — macOS, measured Apr 2026, pending re-measurement

Operation yscv PyTorch Ratio Status
layer_norm 512×256 0.065ms 0.117ms 1.80× WIN
batch_norm 64²×16 0.028ms 0.045ms 1.61× WIN

u8 Image Processing (640×480, vs OpenCV 4.13) — macOS, measured Apr 2026, pending re-measurement

Operation yscv OpenCV Ratio Status
resize nearest 320→640 0.048ms 0.157ms 3.27× WIN
resize bilinear 320→640 0.068ms 0.201ms 2.96× WIN
sobel 3×3 0.074ms 0.169ms 2.28× WIN
dilate 3×3 0.031ms 0.047ms 1.52× WIN
erode 3×3 0.030ms 0.051ms 1.70× WIN
box blur 3×3 0.049ms 0.071ms 1.45× WIN
grayscale 0.025ms 0.030ms 1.20× WIN
gaussian 3×3 0.049ms 0.063ms 1.29× WIN
median 3×3 0.029ms 0.072ms 2.48× WIN

f32 Image Processing (ImageF32, 480×640, vs OpenCV) — macOS, measured Apr 2026, pending re-measurement

Operation yscv OpenCV Ratio Status
grayscale 0.022ms 0.027ms 1.23× WIN
gaussian 3×3 0.051ms 0.113ms 2.22× WIN
box blur 3×3 0.049ms 0.131ms 2.67× WIN
dilate 3×3 0.047ms 0.104ms 2.21× WIN
sobel 3×3 0.055ms 0.297ms 5.40× WIN
threshold 0.015ms 0.017ms 1.13× WIN

Video Decode (vs ffmpeg, single-threaded) — macOS, measured Apr 2026, pending re-measurement

H.264 and HEVC MP4 decode. Pure Rust decoder vs ffmpeg libavcodec (C, ffmpeg -threads 1).

Test methodology:

  • Hardware: Apple M-series (unified memory, NEON SIMD)
  • Build: --release, LTO=thin, codegen-units=1
  • Both decoders single-threaded for fair comparison
  • Best of 5 runs, cold CPU between runs
  • ffmpeg command: ffmpeg -threads 1 -benchmark -i <file> -f null -
  • yscv command: cargo run --release --example bench_video_decode -- <file>
  • Correctness: all frames decoded, pixel_range [0,255], frame count matches ffprobe
  • Memory: streaming reader (27MB RSS for 41MB file, O(1) relative to file size)
  • Date: April 2026

H.264

Video Frames yscv ffmpeg Ratio Pixels
H.264 Baseline 1080p 300 324ms 519ms 1.60× [0, 255] ✓
H.264 High 1080p 300 332ms 760ms 2.28× [0, 255] ✓
Real Camera H.264 1080p60 1100 1187ms 5372ms 4.52× [0, 255] ✓

HEVC (full color — chroma MC enabled)

Video Frames yscv ffmpeg Ratio Pixels
HEVC Main 1080p P/B 5s 300 575ms 806ms 1.40× [0, 255] ✓
HEVC Main 1080p P/B 10s 600 1288ms 1808ms 1.40× [0, 255] ✓
HEVC Main 1080p I-only 180 1538ms 1483ms 0.97× [0, 255] ✓

Key observations

H.264:

  • 1.6–4.5× faster than ffmpeg across all profiles — pure Rust with SIMD IDCT/dequant (NEON + SSE2), rayon parallel deblocking, skip-aware edge filtering, zero-copy reference frames
  • Real camera 1080p60: 4.5× faster — 1100 frames decoded in 1.2 seconds
  • Full pixel range [0, 255] on all supported profiles
  • Weighted prediction, 8x8 DCT (High profile), sub-MB partitions (16x8, 8x16, 8x8)
  • Streaming reader: O(1) memory — 27MB RSS for 41MB file (no full-file loading)

HEVC:

  • 1.4× faster than ffmpeg on P/B frames (full color decode with chroma MC + deblock + SAO + YUV→RGB)
  • I-frame near-parity (0.97×) — intra-only content is CABAC-bound
  • Full color output: chroma motion compensation with 4-tap filter, real YUV420→RGB
  • All profiles decode correctly ([0, 255]) including 10-bit Main10 (u16 DPB)
  • BS=0 edge skip eliminates ~85% of deblock work on inter-coded frames

Memory:

  • Streaming MP4 reader: reads only moov box at open (1-5MB), samples lazily via seek
  • 41MB H.264 file: 27MB RSS (< file size)
  • 3.2MB HEVC file: 129MB RSS (DPB + recon buffers for 1080p)
  • No unbounded growth — DPB bounded by SPS, all buffers reused across frames

Optimizations applied:

  • Branchless CABAC: mask-based MPS/LPS selection, packed transition tables (128-entry lookup), CLZ batch renormalize, 32-bit buffered bit reader
  • Unsafe hot paths: get_unchecked for all CABAC table lookups, ptr::add for deblock filter, pre-computed scan/context tables, branchless sign (val ^ -sign) + sign
  • Zero-copy frame management: reusable mv_field, CU list, Y-plane, recon buffers across frames
  • NEON (29 blocks) + SSE2 (31 blocks): MC 8-tap horizontal/vertical filter, bipred/unipred clip, DC intra prediction, dequant, DCT 16x16/32x32, i16→u8 saturation, Y→grayscale RGB interleave
  • Deblock: BS=0 skip (pred_mode grid), pre-computed tc/beta thresholds, early whole-edge skip, luma-only mode (skip chroma deblock)
  • SAO: CTU-only 4KB stack buffer (not full-frame copy)
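The branchless sign expression in the list above relies on the two's-complement identity −v = ~v + 1. A minimal sketch follows; Python integers behave like two's complement under `^`, and a Rust or C kernel would apply the same expression to fixed-width signed integers. `apply_sign` is a hypothetical name.

```python
def apply_sign(val, sign):
    """Return -val when sign == 1 and val when sign == 0, with no branch.

    sign == 1: val ^ -1 flips every bit (~val), and adding 1 gives -val.
    sign == 0: val ^ 0 leaves val unchanged, and adding 0 is a no-op.
    """
    return (val ^ -sign) + sign


if __name__ == "__main__":
    for v in (-3, 0, 7):
        print(v, apply_sign(v, 0), apply_sign(v, 1))
```

Because the sign bit from the bitstream is data-dependent and hard to predict, replacing the conditional negate with arithmetic avoids a branch misprediction per decoded coefficient.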

Supported formats:

  • H.264: Baseline (CAVLC), Main (CABAC), High (CABAC + 8x8 transform), I/P/B slices, weighted prediction, sub-MB partitions, scaling lists, parallel deblocking
  • HEVC: Main, Main10 (10-bit u16 DPB), I/P/B slices, CABAC, deblocking + SAO, CTU quad-tree, tiles (parsed), chroma residual parsing
  • MP4 container: avcC/hvcC parameter extraction, stbl/stco/stsz sample table navigation
  • MKV/WebM container: EBML demuxer with track/cluster parsing
  • Annex B raw stream parser (H.264 + HEVC)
  • SIMD: NEON (aarch64) 29 blocks, SSE2 (x86_64) 31 blocks — full cross-architecture coverage
  • Parallelism: rayon parallel deblocking, skip-aware edge filtering, zero-copy reference frames

Video (vs OpenCV) — macOS, measured Apr 2026, pending re-measurement

Operation yscv OpenCV Ratio
YUV420→RGB 1080p 0.166ms 0.178ms 1.07×

Additional Operations (Apple Silicon, March 2026) — pending re-measurement

Operation Time
Tensor add 100K 0.0143ms
Tensor mul 100K 0.0118ms
Broadcast add 0.226ms
Broadcast mul 0.211ms
matmul 128² 0.0055ms
matmul rect 96×192×64 0.0036ms
ReLU 921K f32 0.069ms (threaded: 0.062ms)
Sigmoid 921K f32 0.217ms
Add 921K same-shape 0.126ms
BatchNorm 64²×16 0.028ms (threaded: 0.023ms)
Softmax 512×256 0.098ms (threaded: 0.063ms)
LayerNorm 512×256 0.065ms (threaded: 0.044ms)
Conv2d 32² 3×3 0.074ms
MaxPool 120×160 0.159ms (threaded: 0.096ms)
Grayscale u8 0.025ms
Resize nearest u8 0.048ms
Resize bilinear u8 0.068ms
Dilate u8 0.031ms
Erode u8 0.030ms
Box blur u8 0.049ms
Sobel u8 0.074ms
Autograd backward 32² 0.0041ms
Autograd broadcast 0.0067ms
Model linear batch32 0.000905ms
Model linear+relu+linear 0.0024ms
SGD step batch16 0.0096ms
SGD step batch64 0.0147ms
Detect people 0.060ms
Detect faces 0.165ms
Detect heatmap 0.046ms
Track 0.487ms
Recognize query 0.000448ms
CLI people pipeline 0.075ms
CLI face pipeline 0.162ms

ONNX Inference (YOLOv8n / YOLO11n, 640×640 input) — Apple M1, measured Mar–Apr 2026, pending re-measurement

End-to-end model inference benchmarks against onnxruntime, Apple CoreML, and tract. Methodology: 50 timed runs after warmup, min reported. Apple M1 MacBook Air.

CPU Inference

| Runtime | YOLOv8n | YOLO11n | Notes |
|---|---|---|---|
| yscv | 30.4ms | 33.7ms | Pure Rust, NHWC layout, BLAS matmul |
| onnxruntime 1.19 CPU | 37.4ms | 35.2ms* | *Requires opset 21 conversion; native opset 22 fails |
| onnxruntime 1.19 CoreML | 15.5ms | 47.6ms* | CoreML accelerator; YOLO11n perf degrades with partial coverage |
| tract 0.21 | 217.2ms | FAILED | TDim parse error |

yscv CPU is 1.2× faster than onnxruntime on YOLOv8n (30.4ms vs 37.4ms) and comparable on YOLO11n (33.7ms vs 35.2ms). onnxruntime requires manual opset downgrade (22→21) for YOLO11n; yscv handles opset 22 natively.

yscv MPSGraph is 4× faster than ORT CoreML on YOLOv8n (3.5ms vs 15.5ms). ORT CoreML on YOLO11n degrades to 47.6ms due to partial operator coverage.

GPU Inference (Metal, Apple M1)

| Runtime | YOLOv8n | YOLO11n | Notes |
|---|---|---|---|
| yscv MPSGraph | 3.5ms | 5.0ms | Whole-model graph compilation, single GPU dispatch |
| yscv Metal per-op | 12.1ms | 12.6ms | Per-op command buffer, Winograd + MPS GEMM |
| onnxruntime CoreML | 14.2ms | FAILED | Apple Neural Engine delegation |

MPSGraph compiles the entire ONNX model into an MPSGraphExecutable and runs it as a single GPU dispatch — eliminating per-op encoder transitions. 4× faster than CoreML on YOLOv8n.

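yscv is the only runtime here that runs both YOLOv8n and YOLO11n on GPU without modifying the model. ORT CoreML fails on the native YOLO11n export (opset 22); the 47.6 ms CoreML figure in the CPU Inference table was obtained only after a manual downgrade to opset 21.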

Pipelined Throughput (MPSGraph submit/wait, Apple M1)

The pipelined API (submit_mpsgraph_plan + wait_mpsgraph_plan) triple-buffers input/output buffers and overlaps CPU marshaling with GPU compute. Sustained per-frame wall-time (300 iter, Siamese tracker, 2 inputs @ 1×3×128×128 + 1×3×256×256, fp16):

| Mode | p50 | p99 | Sustained FPS |
|---|---|---|---|
| yscv sync (--pipeline 1) | 1.26 ms | 1.80 ms | 792 |
| yscv --pipeline 2 | 0.37 ms | 0.48 ms | 2688 |
| yscv --pipeline 3 | 0.47 ms | 0.65 ms | 2151 |
| yscv --pipeline 4 | 0.56 ms | 0.64 ms | 1776 |
| onnxruntime CoreML MLProgram | 1.62 ms | 1.98 ms | 617 |

Depth 2 is the throughput sweet spot (4.4× vs ORT CoreML) and also has the lowest p50 and p99 in the table (0.37 / 0.48 ms). Depth 4 has the narrowest p50-to-p99 spread (0.56 to 0.64 ms) but is slower in absolute terms. Pipeline depth is chosen via the YSCV_MPS_PIPELINE env var (default 3, clamped 1..=8). The API itself is safe regardless: submit_mpsgraph_plan back-pressures if the caller has more outstanding handles than the pipeline depth.
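The submit/wait pattern with back-pressure can be sketched in a few lines. This is a conceptual model only: `PipelinedRunner` and `run_stream` are hypothetical names, a single worker thread stands in for the GPU queue, and nothing here reflects the actual signatures of submit_mpsgraph_plan or wait_mpsgraph_plan.

```python
import threading
from collections import deque
from concurrent.futures import ThreadPoolExecutor


class PipelinedRunner:
    """Hypothetical stand-in for a submit/wait API with a bounded pipeline depth."""

    def __init__(self, run_fn, depth=3):
        self.depth = max(1, min(8, depth))  # clamp to 1..=8, as described in the text
        self._slots = threading.Semaphore(self.depth)
        self._worker = ThreadPoolExecutor(max_workers=1)  # plays the role of the GPU queue
        self._run_fn = run_fn
        self._lock = threading.Lock()
        self._in_flight = 0
        self.max_in_flight = 0

    def submit(self, frame):
        self._slots.acquire()  # back-pressure: blocks once `depth` handles are outstanding
        with self._lock:
            self._in_flight += 1
            self.max_in_flight = max(self.max_in_flight, self._in_flight)
        return self._worker.submit(self._run_fn, frame)

    def wait(self, handle):
        result = handle.result()
        with self._lock:
            self._in_flight -= 1
        self._slots.release()
        return result

    def close(self):
        self._worker.shutdown(wait=True)


def run_stream(runner, frames):
    """Keep up to `depth` frames in flight; results come back in submission order."""
    pending, results = deque(), []
    for frame in frames:
        if len(pending) == runner.depth:
            results.append(runner.wait(pending.popleft()))
        pending.append(runner.submit(frame))
    while pending:
        results.append(runner.wait(pending.popleft()))
    return results


if __name__ == "__main__":
    runner = PipelinedRunner(lambda x: x * 2, depth=2)
    print(run_stream(runner, range(6)), runner.max_in_flight)
    runner.close()
```

In this sketch a single-threaded caller that submits past the depth without waiting would block inside `submit`, so `run_stream` always drains the oldest handle first. The overlap comes from preparing the next frame while earlier ones are still executing.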

Siamese Tracker — Full Backend Comparison (Apple M1)

Same two-tower Siamese tracker (inputs 1×3×128×128 + 1×3×256×256, fp32 zero-fill on CPU, fp16 on GPU). Compares every backend yscv ships against the corresponding ORT provider on the identical host. p50 over 300 iterations after warmup; yscv vs ORT 1.19.2.

| Backend | yscv | ORT 1.19.2 | yscv vs ORT |
|---|---|---|---|
| CPU 1T | 65 FPS (15.37 ms) | 34 FPS (29.80 ms) | 1.94× |
| CPU 4T | 183 FPS (5.46 ms) | 42 FPS (23.64 ms) | 4.33× |
| GPU sync | 792 FPS (1.26 ms) | 617 FPS (1.62 ms, CoreML) | 1.28× |
| GPU pipelined×2 | 2688 FPS (0.37 ms) | | 4.4× over ORT CoreML |
| GPU peak burst | 3623 FPS (0.28 ms min) | | |

Why yscv CPU beats ORT CPU on Apple Silicon: Accelerate.framework dispatches BLAS through Apple's AMX (Advanced Matrix eXtensions) block, a dedicated matrix accelerator inside the CPU complex, separate from the Neural Engine. ORT's CPUExecutionProvider uses its own general-purpose SIMD kernels that don't hit AMX, so it leaves ~1.9× throughput on the table single-thread, widening to 4.3× at four threads as ORT scales poorly across the M1's P-cores. On Intel the opposite holds: ORT's CPU kernels (MLAS in the default build) are highly tuned for AVX-512, and yscv (which dispatches through OpenBLAS there) is typically 2-3× slower.

Why yscv Metal beats ORT CoreML on Apple Silicon: ORT's CoreMLExecutionProvider compiles the graph to CoreML and routes compatible ops to the Apple Neural Engine (ANE) + Metal hybrid. On the Siamese tracker 216/219 ops run on CoreML; the remaining 3 fall back to CPU. Every fallback crosses a CPU↔accelerator boundary, costing synchronization + marshalling. yscv's MPSGraph path compiles 100% of the graph to pure Metal and avoids the hybrid overhead entirely. Pipelining (buffered submit/wait) then overlaps CPU marshal with GPU compute for a further ~2.7× sustained throughput at depth 3 (792 to 2151 FPS), or ~3.4× at depth 2 (2688 FPS).

Pipelined Throughput (RKNN submit/wait, RK3588)

RknnPipelinedPool applies the same pattern to Rockchip NPU cores: one slot per NpuCoreMask, pre-allocated + pre-bound RknnMem per input and output, back-pressured submit/wait. On RK3588 the pool can drive all 3 NPU cores concurrently; on RV1106 pass &[Core0] for a cleanly-typed single-slot async path.

On-device numbers (YOLO / Siamese tracker, int8-quantized .rknn) will be added once captured against a physical Rock 4D. Relative gains are expected to mirror the MPSGraph path — pipeline depth equal to NPU-core count ≈ 3× sync throughput, with tail latency tightening as CPU and NPU stop serialising their handshake.

Metal Pipeline Architecture

The Metal backend compiles an ONNX graph into a sequence of MetalOps executed in a single fused command buffer. Key optimizations (in order of impact):

| Optimization | Impact | Description |
|---|---|---|
| Winograd F(4×4, 3×3) | ~40% of GPU time | 2.25× FLOP reduction for stride-1 3×3 convs; SIMD group matrix multiply with f32 accumulation |
| F16 inter-op pipeline | Halves bandwidth | All intermediate buffers use f16; weights pre-packed as f16 at compile time |
| NEON input upload | Eliminates GPU cast | CPU-side fcvtn+st3 converts f32 NCHW → f16 NHWC faster than GPU kernel |
| Conv+SiLU+Add fusion | Fewer ops | Residual addition and activation fused into conv write-back epilogue |
| Vectorized f16 kernels | Better throughput | All utility ops (concat, split, permute, resize) use half4 vectorized I/O |
| Concat fusion | Eliminates copies | Conv outputs write directly into concat buffer via out_stride/out_offset |
| Detection head fusion | Fused permute+concat | NHWC→NCHW permutations + spatial concat fused into single NhwcToFlatConcat kernel |
| Zero-cost buffer aliasing | No-op reshapes | Reshape/Flatten/Squeeze/Unsqueeze alias existing buffers |
| Parallel softmax | Threadgroup reduction | Adaptive threadgroup size (32/128/256) with shared-memory reduction |
| Widened SiLU look-ahead | Fewer Metal ops | Detects SiLU patterns up to 5 nodes ahead (detection head interleaving) |
| In-place SiLU/Binary | Fewer buffers | Dead input buffers reused as output for elementwise ops |
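As background for the Winograd row, the standard multiplication count for a 2-D Winograd tile F(m×m, r×r) can be worked out as below. This counts only the elementwise multiplies per output tile and ignores the input, filter, and output transforms, so it is an upper bound on the saving; `winograd_multiply_reduction` is a hypothetical helper.

```python
def winograd_multiply_reduction(m, r):
    """Ratio of direct to Winograd multiplications for an m×m output tile and r×r filter.

    Direct convolution needs m*m*r*r multiplies per tile and channel pair;
    Winograd needs (m + r - 1)**2 elementwise multiplies.
    """
    direct = (m * m) * (r * r)
    winograd = (m + r - 1) ** 2
    return direct / winograd


if __name__ == "__main__":
    for m in (2, 4):
        print(f"F({m}x{m}, 3x3): {winograd_multiply_reduction(m, 3):.2f}x")
```

Under this count F(2×2, 3×3) gives 2.25× and F(4×4, 3×3) gives 4×. The 2.25× figure quoted in the table may therefore reflect a different accounting, for example one that includes transform overhead.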

Metal Per-Op Distribution (YOLOv8n: 110 ops, YOLO11n: 204 ops)

| Op Type | YOLOv8n | YOLO11n |
|---|---|---|
| ConvWinograd (3×3 stride=1) | 32 | 28 |
| MpsConv (MPS GEMM for 1×1+) | 30 | 51 |
| Concat | 13 | 34 |
| SplitFused | 8 | 9 |
| CpuReshape (GPU permute) | 4 | 13 |
| Binary/BroadcastBinary | 6 | 28 |
| DepthwiseConv | | 7 |
| Other (MaxPool, Resize, etc.) | 17 | 34 |

VballNetGrid Inference (DSConv model, 16.3 GFLOP) — Apple Silicon, measured Mar–Apr 2026, pending re-measurement

Model: VballNetGridV1b — 13 DSConvBlocks (depthwise 3×3 + pointwise 1×1), 4 MaxPool, head Conv+Sigmoid. Input [1, 9, 432, 768], output [1, 27, 27, 48], 42 ONNX nodes.

Optimization Progression (Apple Silicon)

| Stage | Time | FPS | Speedup | What changed |
|---|---|---|---|---|
| yscv BEFORE | 558 ms | 1.7 | | Single-threaded, scalar depthwise |
| + Multi-threading | 257 ms | 3.9 | 2.1× | ParallelElementwiseConfig::default() in public API |
| + SIMD depthwise | 124.1 ms | 8.1 | 4.5× | NEON/AVX/SSE vectorized depthwise conv |
| onnxruntime CPU | 196.7 ms | 5.1 | | CPUExecutionProvider baseline |
| onnxruntime CoreML CPU_ONLY | 8.6 ms | 116 | | BNNS/AMX via CoreML delegate |
| yscv Metal per-op | 47.3 ms | 21.1 | 11.8× | Metal-native fused pipeline, MPS GEMM |
| yscv MPSGraph | 7.8 ms | 128 | 71.5× | Whole-model GPU graph compilation |

Key Takeaway

yscv CPU (124.1ms) is 1.6× faster than onnxruntime CPU (196.7ms) on depthwise-separable models — no special flags needed. MPSGraph (7.8ms) beats CoreML CPU_ONLY (8.6ms) which uses Apple's dedicated AMX coprocessor via BNNS — a 1.1× speedup. MPSGraph provides 16× over CPU, reaching 128 FPS on Apple Silicon.

ONNX Siamese Tracker (Zen 4 historical arc, AMD Ryzen 5 7500F, 6C/12T, fp32 CPU)

Historical optimization arc, retained for context. The current Zen 4 tracker figures are in the Siamese tracker — CPU inference section at the top of this document (min-of-300, commit 241f36c). The p50 numbers below predate that re-measurement and use a different run protocol.

Model: Siamese tracker, 156 ops after graph optimization, two input branches (input.1 1×3×128×128 template, input.249 1×3×256×256 search) joined in connect_model. Primary fp32 CPU benchmark target of the S./A./R.* perf arc (Apr 2026, 19 sessions).

Methodology: RAYON_NUM_THREADS=N ./onnx-fps --iters 500, median of 3 runs per thread-count, bitwise-identical outputs across all Ns. ORT 1.24.4 CPUExecutionProvider as the reference.

| Threads | yscv p50 | ORT p50 | gap | yscv scaling | ORT scaling |
|---|---|---|---|---|---|
| 1 | 11.43 ms | 8.05 ms | 1.42× | 1.00× | 1.00× |
| 2 | 6.55 ms | 4.42 ms | 1.48× | 1.74× | 1.82× |
| 4 | 4.15 ms | 2.36 ms | 1.76× | 2.75× | 3.41× |
| 6 | 3.66 ms | 1.74 ms | 2.10× | 3.12× | 4.62× |
| 8 | 3.87 ms | 2.28 ms | 1.70× | 2.95× | 3.53× |
| 12 | 4.02 ms | 1.93 ms | 2.08× | 2.84× | 4.16× |

6T is the sweet spot (physical-core count). Beyond 6T, SMT contention hurts both engines; 12T is strictly worse than 6T.

Where the remaining gap lives (6T profile, sequential sums)

| op | yscv | ORT | gap | ratio |
|---|---|---|---|---|
| Conv | 6.31 ms (78 ops) | 2.22 ms (114 ops) | +4.09 ms | 2.84× |
| MatMul | 0.19 ms (2) | 0.04 ms (2) | +0.14 ms | 4.17× |
| Reshape | 0.11 ms (5) | 0.01 ms (5) | +0.10 ms | 8.01× |
| Reorder (NCHWc) | | 0.05 ms (7) | ORT-only | |

Conv accounts for 94% of the gap. Most of it sits in mid-sized pointwise and inverted-bottleneck layers; ORT uses NCHWc layout throughout while yscv runs NHWC.

Key kernel optimizations

The custom CPU path layers several fusions and SIMD kernels on top of the blocked GEMM:

  • AVX 8×8 / NEON 4×4 NCHW↔NHWC block transposes for layout conversion.
  • AVX-512 / AVX2 / NEON depthwise row kernels with fused activation.
  • Row-level parallelism for the first 3×3 stride-2 layer.
  • Streaming FusedPwDw (PW-expand → DW 3×3) with register-blocked accumulators, so the expanded intermediate never hits DRAM.
  • FusedTransposeMatMul, mirroring ORT's MatmulTransposeFusion.

All landings are bitwise-identical or 1-ULP-close to the reference; the suite builds clean on x86_64 + aarch64 and passes with and without BLAS.

GPU rerun on the same host (2026-04-25)

Same model and inputs, RTX 4060 added through --features gpu (wgpu backend over Vulkan) and onnxruntime-gpu 1.25 with CUDAExecutionProvider:

| backend | p50 | min | output 882 max | drift vs CPU ref |
|---|---|---|---|---|
| ORT CUDA EP fp32 (cuDNN, Tensor Cores) | 1.42 ms | 1.40 ms | 48.9193 | −0.17 (TF32) |
| yscv gpu fp16 (wgpu Vulkan) | 5.25 ms | 5.18 ms | 48.1491 | −0.94 |
| yscv gpu fp32 (YSCV_GPU_FP32=1) | 5.82 ms | 5.72 ms | 49.0915 | −1 ULP |
| yscv CPU 6T no-BLAS | 3.05 ms | 2.84 ms | 49.0920 | +1 ULP |
| ORT CPU 1T | 8.07 ms | 8.03 ms | 49.0916 | reference |

Two takeaways:

  • yscv fp32 GPU output is 1-ULP from the CPU reference, while ORT CUDA EP drifts −0.17 because cuDNN auto-uses Tensor Cores in TF32 / mixed precision on Ampere+ and downcasts implicitly. That is, our fp32 GPU path is the more numerically faithful one against the CPU reference.
  • ORT CUDA EP is 4× faster than yscv wgpu — structural (cuDNN ships shape-specific kernels and uses Tensor Cores via cooperative_matrix, wgpu compute shaders are vendor-portable WGSL with no MMA path). Closing this requires either a vendor Vulkan extension wgpu doesn't expose or a separate cuda-backend.

For Conv-heavy small-batch graphs like this tracker, the CPU runner (3.05 ms) beats wgpu (5.25 ms) on the same host, because GPU launch latency dominates the actual compute. wgpu starts to win at batch ≥ 4 or when CPU inference takes ≥ 30 ms. See gpu-backend-guide.md for the full positioning matrix and YSCV_GPU_FP32 env-flag docs.

Latest rerun (2026-04-25, post-R10 correctness fix)

R10 fixed a silently dropped residual_tile in microkernel_4x8_dispatch on the x86 1×NR-tile path: the SIMD/ASM 4×8 variants never received the residual pointer and dropped the add for Conv+Add shapes whose n left an 8-wide scalar tail. Output 882 max went 84.78 → 49.0920 (ORT: 49.0916, 1-ULP FP-ordering drift). Perf also improved because the non-BLAS path is now fully correct on tracker shapes.

| build | 1T p50 | 6T p50 | 12T p50 | output 882 max |
|---|---|---|---|---|
| yscv no-BLAS (default for onnx-fps) | 11.22 ms | 3.17 ms | 3.34 ms | 49.0920 ✓ |
| yscv with BLAS (OpenBLAS 0.3.31) | 13.00 ms | 8.75 ms | 9.01 ms | 49.0922 ✓ |
| ORT 1.24.4 CPU | 8.07 ms | 1.74 ms | 1.91 ms | 49.0916 ✓ |

yscv no-BLAS vs ORT: 1T 1.39×, 6T 1.82×, 12T 1.75× behind.

On this graph BLAS is a net regression — 2.76× slower at 6T than the non-BLAS path. Root causes: matmul_2d_slices_fused with BLAS splits into blas_sgemm + apply_epilogue_fallback (two passes over out, ~5.5 ms of extra L2/L3 traffic at 6T), and the whole arc (R4/R7/R9/A2) only fires on the non-BLAS branch. OPENBLAS_NUM_THREADS=1 does not close the gap, so it's not pure thread oversubscription — rayon workers block serially on sgemm instead of running their own A/B tiles in parallel.

See feature-flags.md for the full when-to-enable / when-to-disable BLAS checklist.

Cross-Platform SIMD Coverage

| Operation | NEON | SSE | AVX |
|---|---|---|---|
| Tensor binary/unary (1M f32) | ✅ 4× unroll | ✅ 4-wide | ✅ 4× unroll (32 elem) |
| Activations (sigmoid/tanh/silu) | ✅ 3-term poly | ✅ poly | ✅ poly |
| Softmax/LogSoftmax | ✅ fused | ✅ fused | ✅ fused |
| MatMul | ✅ BLAS | ✅ BLAS | ✅ BLAS + FMA |
| Conv2d 3×3 | ✅ direct NEON | ✅ direct SSE | ✅ im2col + BLAS |
| Depthwise Conv2d | ✅ 4-wide FMA | ✅ 4-wide | ✅ 8-wide |
| u8 morphology/filter/sobel | ✅ 16B/iter | ✅ 16B/iter | ✅ 32B/iter (AVX2) |
| f32 filter/morphology/geometry | ✅ 4-wide | ✅ 4-wide | ✅ 8-wide |
| Median u8 | ✅ sort network | ✅ sort network | |
| YUV→RGB | ✅ NEON + GCD | ✅ SSE + threads | ✅ AVX2 + threads |
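
Per-ISA coverage like the table above is usually wired up through a dispatch function that picks a variant at runtime. The following is a minimal sketch of that pattern, assuming an elementwise add as the operation. The function names are hypothetical, and the AVX2 variant relies on compiler auto-vectorization rather than hand-written intrinsics.

```rust
/// Portable fallback, used on every architecture.
fn add_scalar(a: &[f32], b: &[f32], out: &mut [f32]) {
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}

/// Same loop, compiled with AVX2 enabled so the compiler may emit
/// 8-wide vector instructions. Only sound to call on CPUs with AVX2.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn add_avx2(a: &[f32], b: &[f32], out: &mut [f32]) {
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}

/// Dispatch entry point: detect CPU features at runtime, then call the
/// widest variant available, falling back to the scalar path.
#[inline]
fn add(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert!(a.len() == out.len() && b.len() == out.len());
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was confirmed on the line above.
            unsafe { add_avx2(a, b, out) };
            return;
        }
    }
    add_scalar(a, b, out);
}
```

On non-x86 targets the `cfg` block compiles away and `add` reduces to the scalar path. A real NEON or SSE column would add further `cfg`-gated variants in the same shape.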

Optimization Techniques

  • 315 #[target_feature]-gated SIMD functions with runtime CPU detection
  • All dispatch functions #[inline] for cross-crate inlining
  • AlignedVec::uninitialized — skip output zeroing in hot paths
  • ImageU8/ImageF32 — zero-overhead wrappers bypass Tensor allocation
  • GCD dispatch_apply — macOS near-zero threading (~0.3µs)
  • mimalloc — thread-local arena pools
  • Fused kernels — single-pass softmax, sigmoid, attention
  • im2col + BLAS — Accelerate/OpenBLAS for matmul/conv2d/conv3d
  • Flash Attention — tiled O(Br×Bc) memory, online softmax
  • Integer GEMM — quantized matmul with i32 accumulation (no dequant overhead)
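
The "online softmax" mentioned in the list can be sketched in a few lines. This is an illustration of the general technique, not yscv's fused kernel: `online_softmax` is a hypothetical name and the sketch assumes finite inputs. The running maximum and the running sum are maintained in a single sweep, with the sum rescaled whenever the maximum grows.

```rust
/// Softmax with a single statistics pass: instead of one pass for the max
/// and another for the sum of exponentials, both are tracked together.
/// Assumes all inputs are finite.
fn online_softmax(x: &[f32]) -> Vec<f32> {
    let mut running_max = f32::NEG_INFINITY;
    let mut running_sum = 0.0f32;
    for &v in x {
        let new_max = running_max.max(v);
        // Rescale the old sum to the new max, then add the new term.
        running_sum = running_sum * (running_max - new_max).exp() + (v - new_max).exp();
        running_max = new_max;
    }
    // Normalization pass using the final statistics.
    x.iter().map(|&v| (v - running_max).exp() / running_sum).collect()
}
```

The same rescaling step is what lets tiled attention process one block of scores at a time without materializing the full score matrix.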

Test Infrastructure

Test Suite Summary (April 2026)

| Crate | Tests | Coverage |
|---|---|---|
| yscv-model | 365 | Serialization, training loops, data loading, distributed |
| yscv-imgproc | 225 | All u8/f32 ops, SIMD paths, color conversion |
| yscv-video | 230 | H.264/HEVC decode (Main + Main10 + Rext + weighted prediction + tiles + WPP + chroma deblock/SAO), MP4/MKV parsing, HW detect |
| yscv-tensor | 207 | Elementwise, matmul, broadcast, BLAS dispatch |
| yscv-kernels | 120 | CPU ops, GPU backend, SIMD activation, GEMM |
| yscv-autograd | 106 | Forward/backward graph, all op gradients |
| yscv-eval | 95 | COCO/YOLO/VOC/KITTI/WiderFace/MOT metrics |
| yscv-onnx | 166 | Per-operator coverage for all 122 CPU dispatch arms, fusion regressions, quantization, vision ops |
| yscv-optim | 76 | SGD, Adam, LR schedulers, weight decay |
| yscv-detect | 60 | YOLOv8/v11 decode, NMS, letterbox |
| yscv-track | 57 | Hungarian, Kalman, IoU matching |
| yscv-cli | 42 | Config parsing, diagnostics |
| yscv-recognize | 16 | Embedding extraction, cosine similarity |
| Total | 1808 | |

CI Matrix

| Platform | Runner | Features | What's Tested |
|---|---|---|---|
| macOS (ARM) | macos-latest | default + videotoolbox | Full workspace + HW decode |
| Linux (x86) | ubuntu-latest | default + gpu | Full workspace + WGPU |
| Linux (ARM) | ubuntu-24.04-arm | default | Full workspace + NEON |
| Windows | windows-latest | default | Full workspace |

CI Jobs

  • workspace-compat: Multi-platform build + test
  • quality: fmt + clippy + CLI integration + benchmark gates + eval format verification
  • miri: Unsafe code soundness (yscv-tensor, yscv-kernels)
  • hw-decode: VideoToolbox (macOS), SW fallback (Linux/Windows)
  • benchmark gates: Criterion microbenchmarks for 12 crates with trend tracking
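
A benchmark gate of the kind listed above boils down to comparing a fresh measurement against a stored baseline with some tolerance. The sketch below is hypothetical: `Gate`, `check_gate`, and the tolerance parameter are illustrative names, and the actual baseline file format and thresholds used in CI are not shown in this document.

```rust
/// Outcome of comparing one benchmark against its baseline
/// (hypothetical type, not the CI script's real data model).
#[derive(Debug, PartialEq)]
enum Gate {
    Pass,
    Regressed { ratio: f64 },
}

/// Fail the gate when the current time exceeds the baseline by more than
/// `tolerance` (for example 0.05 for a 5% allowance against run-to-run noise).
fn check_gate(baseline_ns: f64, current_ns: f64, tolerance: f64) -> Gate {
    let ratio = current_ns / baseline_ns;
    if ratio > 1.0 + tolerance {
        Gate::Regressed { ratio }
    } else {
        Gate::Pass
    }
}
```

Trend tracking would then record `ratio` per commit rather than only the pass/fail result, so that slow drifts below the tolerance remain visible.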

Test Data

  • Video: examples/src/CENSUSWITHOUTLOGO.mp4 (41MB, H.264 1080p60, 1100 frames)
  • Images: examples/src/testtraffic.png, testtraffic2.png (3.4-3.5MB)
  • Models: examples/src/slowwork/yolo{v8n,11n}.onnx (10-12MB, gitignored)
  • Eval samples: benchmarks/eval-* (all formats, <10KB each)
  • Fuzz corpus: fuzz/corpus/ (H.264, HEVC, MKV seed files)
  • Baselines: benchmarks/ci-baseline-*.txt, trend-baseline-*.tsv

How to Run

# Linux-local CI subset before pushing
bash scripts/check-ci-local.sh

# Broader local pass: release tests, extended proptests, UX, benchmark gates
bash scripts/check-ci-local.sh --full

# Add local cross-target smoke where toolchains are available
bash scripts/check-ci-local.sh --cross --all-features

# Full workspace test (all crates; see the Test Suite Summary for counts)
cargo test

# Single crate
cargo test -p yscv-video

# Video decode benchmark
cargo run --release --example bench_video_decode -- examples/src/CENSUSWITHOUTLOGO.mp4

# Compare with ffmpeg
ffmpeg -threads 1 -benchmark -i examples/src/CENSUSWITHOUTLOGO.mp4 -f null -

# Criterion microbenchmarks
cargo bench -p yscv-kernels
cargo bench -p yscv-imgproc

# Miri soundness check
cargo +nightly miri test -p yscv-tensor --lib

# Fuzz testing
cd fuzz && cargo fuzz run fuzz_h264_nal -- -max_total_time=60