yscv Performance Benchmarks

Measured CPU-inference and single-operation benchmarks for yscv, compared against ONNX Runtime, PyTorch, NumPy, OpenCV, and ffmpeg.

Last updated: 2026-06-16 · commit fa80f66

This document has two parts. The CPU sections (Siamese tracker, single-op) are the current measurement focus: fixed hardware, pinned competitor versions, isolated per-op processes, regenerable from a script in the repo. They cover three hosts — AMD Ryzen 5 7500F (Zen 4), Orange Pi Zero 3 (Cortex-A53), and Apple M1. The Metal / video sections below them are retained for reference but were measured on different hardware and dates; where a number could not be reproduced on current tooling it is marked pending re-measurement. Treat those as provisional.

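Where yscv is at parity with or behind a competitor, this is stated plainly. Unless a table says otherwise, ratios are written as competitor / yscv (a ratio above 1.0 means yscv is faster); the Zen 4 tracker table is the one exception and reports yscv / ORT.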

Siamese tracker — CPU inference vs ONNX Runtime

The primary end-to-end target. The model is a public two-tower Siamese single-object tracker exported to ONNX (~156 ops after graph optimization), with two inputs: a 1×3×128×128 template branch and a 1×3×256×256 search branch, fp32 zero-fill. On the Zen 4 and A53 hosts the reported figure is the minimum wall-clock latency over 300 iterations (after warmup), which isolates the steady-state compute path from scheduler and allocator jitter; the M1 table reports p50 over the same 300 iterations.
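As a rough illustration of the measurement protocol described above (warmup, then a fixed number of timed iterations reduced to min or p50), the sketch below shows one way such a loop could look. The function names `time_iterations` and `min_and_p50` are hypothetical and are not taken from the yscv benchmark harness; the median convention for even sample counts (mean of the two middle values) is also an assumption.

```rust
use std::time::Instant;

/// Runs `run` for `warmup` untimed iterations, then `iters` timed ones.
/// Returns one wall-clock sample per timed iteration, in milliseconds.
/// (Hypothetical helper, not part of the yscv harness.)
pub fn time_iterations<F: FnMut()>(mut run: F, warmup: usize, iters: usize) -> Vec<f64> {
    for _ in 0..warmup {
        run();
    }
    (0..iters)
        .map(|_| {
            let start = Instant::now();
            run();
            start.elapsed().as_secs_f64() * 1e3
        })
        .collect()
}

/// Reduces latency samples to (min, p50). Returns None for an empty slice.
/// For an even number of samples the median is the mean of the two middle
/// values; this convention is an assumption of the sketch.
pub fn min_and_p50(samples: &[f64]) -> Option<(f64, f64)> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    let p50 = if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    };
    Some((sorted[0], p50))
}
```

The minimum is the least noisy estimate of the steady-state compute path, while p50 says more about typical latency under scheduler jitter, which is why the two statistics are not directly comparable across tables.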

x86 — AMD Ryzen 5 7500F (Zen 4, 6C/12T)

yscv (commit 241f36c) vs ONNX Runtime 1.24.4, CPUExecutionProvider, on the same host.

Here the ratio is yscv / ORT (yscv's slowdown factor); >1.0 means yscv is slower, since on this model ORT is ahead.

Threads	yscv min	yscv FPS	ORT min	yscv / ORT
1	8.63 ms	116	8.03 ms	1.07×
4	3.06 ms	327	2.35 ms	1.30×
6	2.52 ms	396	1.74 ms	1.45×

Honest reading. yscv is roughly 7% behind ORT single-threaded (8.63 vs 8.03 ms) and the gap widens with thread count: ORT scales better across cores on this model, reaching 1.74 ms at 6 threads against yscv's 2.52 ms (ORT is ~1.45× faster at 6T). The single-thread compute path is close; the deficit is in multi-thread scaling, not per-core kernel throughput. The remaining structural difference is layout — ORT runs NCHWc throughout while yscv runs NHWC. See docs/onnx-cpu-kernels.md for the per-op hot-path map.
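The scaling claim can be checked directly from the table: going from 1 to 6 threads, yscv improves by 8.63 / 2.52 ≈ 3.42× while ORT improves by 8.03 / 1.74 ≈ 4.61×. The helpers below are a minimal sketch of that arithmetic; the function names are illustrative only, and the values are derived from the table rather than newly measured.

```rust
/// Frames per second implied by a per-frame latency in milliseconds.
pub fn fps_from_ms(latency_ms: f64) -> f64 {
    1000.0 / latency_ms
}

/// Latency ratio a / b: how many times slower `a` is than `b`.
pub fn latency_ratio(a_ms: f64, b_ms: f64) -> f64 {
    a_ms / b_ms
}

/// Speedup of one runtime when going from 1 thread to N threads.
pub fn thread_scaling(one_thread_ms: f64, n_thread_ms: f64) -> f64 {
    one_thread_ms / n_thread_ms
}
```

With the Zen 4 numbers, `thread_scaling(8.63, 2.52)` and `thread_scaling(8.03, 1.74)` give the two scaling factors, and `latency_ratio(2.52, 1.74)` reproduces the 1.45× entry in the 6-thread row.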

ARM — Orange Pi Zero 3 (Cortex-A53, 4C)

yscv (commit d8d43ea on the device) vs ONNX Runtime 1.26.0 (CPUExecutionProvider), same host, same model and inputs. On this edge-ARM target yscv is faster than ORT at both thread counts — the inverse of the x86 picture, and the project's actual deployment target.

Ratio here is ORT / yscv (>1.0 means yscv is faster).

Threads	yscv min	yscv FPS	ORT min	ORT / yscv
1	321 ms	3.1	496 ms	1.54×
4	101 ms	9.9	161 ms	1.59×

Reference context (prior measurement, not from this round): yscv has also measured near-parity with XNNPACK on the A53 at roughly 307 ms / 1T and 98.5 ms / 4T; cited for scale only.

Apple Silicon — Apple M1 (4 P-cores + 4 E-cores)

yscv vs ONNX Runtime 1.19.2 (CPUExecutionProvider), same host, same model and inputs, fp32 zero-fill, p50 over 300 iterations. On Apple Silicon yscv is well ahead of ORT-CPU at every thread count.

Ratio here is ORT / yscv (>1.0 means yscv is faster).

Threads	yscv p50	yscv FPS	ORT p50	ORT / yscv
1	15.37 ms	65	29.80 ms	1.94×
4	5.46 ms	183	23.64 ms	4.33×
8	6.56 ms	153	37.52 ms	5.72×

Four threads is the sweet spot — the M1's four performance cores saturate there, and pushing to eight pulls in the efficiency cores, which regresses both runtimes.

Single-operation compute (1 thread)

x86 — AMD Ryzen 5 7500F (Zen 4)

Per-operation isolated microbenchmarks at 1 thread, 1000 iterations after 200 warmup, each op in a fresh process. yscv / PyTorch / ONNX Runtime figures are generated by run-single-compute.sh and recorded in single-compute-zen4-2026-06-09.md (commit 241f36c, PyTorch 2.12.0, ONNX Runtime 1.26.0). The NumPy column is produced separately by bench_numpy_single_ops.py (NumPy 2.4.6). All times are p50 microseconds; lower is faster.

Operation	Shape	yscv	NumPy	PyTorch	ORT
add	1024×1024	100	99	98	111
mul	1024×1024	97	96	97	116
sum	1024×1024	29	114	32	83
max	1024×1024	30	36	79	56
add (broadcast last dim)	1024×1024 + 1024	72	148	67	91
sub (broadcast row − matrix)	1024 − 1024×1024	69	144	68	93
exp	1024×1024	110	364	634	138
relu	921600	55	248	57	64
sigmoid	921600	74	521	364	202
tanh	1024×1024	188	288	2004	208
gelu (sigmoid approx)	1024×1024	176	4325	3384	419
silu	1024×1024	162	3026	350	227
softmax	32×1000	7	29	15	9
log_softmax	32×1000	7	38	14	9
softmax	512×256	26	143	63	36
layer_norm	512×256	11	187	39	111
batch_norm	1×64×64×3 ↔ 1×3×64×64	2	10	8	3

Honest reading.

Memory-bound elementwise (add, mul, broadcast): on add and mul yscv is at parity with NumPy and PyTorch; these ops are limited by memory bandwidth, not arithmetic, so all three single-thread implementations converge. On the broadcast variants yscv is about 2× faster than NumPy (72 vs 148, 69 vs 144 µs) and close to PyTorch. yscv is marginally slower than PyTorch on add (100 vs 98 µs) and the broadcast variants (72 vs 67, 69 vs 68 µs). It is faster than ORT on all of them (1.1–1.4×).
Transcendentals / activations: yscv's polynomial-approximation kernels win substantially over PyTorch (exp ~5.8×, tanh ~10.7×, gelu ~19×, sigmoid ~4.9×, silu ~2.2×) and also beat NumPy, by margins that range from ~1.5× on tanh and ~3.3× on exp up to ~19× on silu and ~25× on gelu. These are approximations to within a float tolerance, not bit-exact transcendental functions; that is the tradeoff that buys the speed.
Reductions / normalization: yscv wins on sum, max, softmax, layer_norm, and batch_norm against all three; the one borderline case is batch_norm vs ORT (2 vs 3 µs), which falls inside the 1 µs parity band.
vs ORT-CPU overall: yscv is ahead on every op in the table apart from that batch_norm parity case, by roughly 1.1–2.9× (and 10× on layer_norm, where ORT is unexpectedly slow at this shape).

gelu uses the sigmoid approximation x · sigmoid(1.702·x); yscv uses NHWC for batch_norm while PyTorch and ORT use NCHW over the same data volume.
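For reference, the sigmoid approximation named above can be written as a scalar function. This is a plain `exp`-based sketch of the formula x · sigmoid(1.702·x), not yscv's SIMD kernel (which, per the text, uses polynomial approximations); the function names are illustrative.

```rust
/// Logistic sigmoid, computed with the standard library `exp`.
pub fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// GELU via the sigmoid approximation: x * sigmoid(1.702 * x).
/// Scalar reference only; a production kernel would vectorize this and
/// likely replace `exp` with a cheaper approximation.
pub fn gelu_sigmoid_approx(x: f32) -> f32 {
    x * sigmoid(1.702 * x)
}
```

Because every backend in the table is asked to compute this same formula, the gelu row compares implementations of one approximation rather than an approximation against exact erf-based GELU.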

Reproduction

# yscv / PyTorch / ONNX Runtime (writes the dated markdown snapshot)
RAYON_NUM_THREADS=1 YSCV_POOL_SPIN_US=200 ITERS=1000 WARMUP=200 \
  OUT=benchmarks/single-compute-zen4-$(date -u +%F).md \
  bash benchmarks/run-single-compute.sh

# NumPy column (single thread)
python3 benchmarks/python/bench_numpy_single_ops.py --iters 1000 --threads 1

The yscv/PyTorch/ORT snapshot records git commit and dirty state, toolchain and Python runtime versions, raw per-backend logs under artifacts/, and min / p50 / avg rows per backend. Each operation is measured in a fresh process: PyTorch full-suite single-process runs showed allocator/cache contamination on later memory-bound ops, so isolated per-op p50 is the source of truth. yscv uses YSCV_POOL=yscv and YSCV_POOL_SPIN_US=200 for this tight standalone loop; the normal inference default is unchanged. p50 deltas within 1 µs are reported as parity.
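The 1 µs parity rule can be expressed as a small classifier over a pair of p50 values. The sketch below is illustrative only: the `Verdict` enum and `classify` function are hypothetical names, and the band is passed in as a parameter rather than hard-coded.

```rust
/// Outcome of comparing a yscv p50 against a competitor p50.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Verdict {
    Faster,
    Parity,
    Slower,
}

/// Classifies yscv against another backend. Deltas of at most
/// `parity_band_us` microseconds are reported as parity.
pub fn classify(yscv_us: u32, other_us: u32, parity_band_us: u32) -> Verdict {
    if yscv_us.abs_diff(other_us) <= parity_band_us {
        Verdict::Parity
    } else if yscv_us < other_us {
        Verdict::Faster
    } else {
        Verdict::Slower
    }
}
```

Applied to the Zen 4 table with a 1 µs band, broadcast sub against PyTorch (69 vs 68 µs) is parity, add against PyTorch (100 vs 98 µs) is slower, and sum against NumPy (29 vs 114 µs) is faster.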

ARM — Orange Pi Zero 3 (Cortex-A53)

Same per-op isolated methodology on the A53 (yscv compute_gap, PyTorch 2.12.0+cpu, ONNX Runtime 1.26.0, NumPy 2.4.6; 300 iterations, p50 µs, 1 thread).

Operation	Shape	yscv	NumPy	PyTorch	ORT
add	1024×1024	6645	6690	8250	6938
mul	1024×1024	6647	7317	7810	7295
exp	1024×1024	13161	41029	23915	17716
relu	921600	3406	5924	3012	3280
sigmoid	921600	4100	47008	22677	16139
tanh	1024×1024	5907	21555	44393	16507
gelu (sigmoid approx)	1024×1024	7321	79255	37741	25720
silu	1024×1024	5989	67883	27541	23827
softmax	512×256	1069	7276	3212	3413
layer_norm	512×256	508	4495	1561	2401
batch_norm	1×3×64×64	23	221	186	88

Honest reading (A53). On the weak in-order core, memory-bound elementwise ops (add, relu) are limited by the Pi's DRAM bandwidth, not arithmetic, so the spread is small: on add yscv, NumPy, and ORT are within ~5% of each other (PyTorch is ~24% slower), and on relu both PyTorch and ORT edge ahead of yscv (3012 and 3280 vs 3406 µs). yscv's NEON polynomial-approximation kernels then pull far ahead on transcendentals/activations: sigmoid ~3.9× vs ORT, ~5.5× vs PyTorch, ~11× vs NumPy; tanh ~2.8× / ~7.5× / ~3.6×; gelu ~3.5× / ~5.2× / ~10.8×; silu ~4.0× / ~4.6× / ~11.3×; plus softmax / layer_norm / batch_norm 3–10× over every backend. On ARM yscv beats NumPy, PyTorch, and ORT-CPU on every op except relu, consistent with the tracker result, where yscv is 1.5–1.6× faster than ORT on the A53.

Reproduce on the device:

cargo build --release -p yscv-llm-bench --bin compute_gap
RAYON_NUM_THREADS=1 ./target/release/compute_gap --iters 300
python3 benchmarks/python/bench_ort_single_ops.py   --iters 300 --threads 1
python3 benchmarks/python/bench_numpy_single_ops.py --iters 300 --threads 1
python3 benchmarks/python/bench_torch_single_ops.py --iters 300 --threads 1

Apple Silicon — Apple M1

Same per-op isolated methodology on the M1 (yscv compute_gap, PyTorch 2.8.0, ONNX Runtime 1.19.2, NumPy 2.0.2; 1000 iterations after 200 warmup, p50 µs, 1 thread), recorded in single-compute-m1-2026-06-16.md.

Operation	Shape	yscv	NumPy	PyTorch	ORT
add	1024×1024	95	141	143	141
mul	1024×1024	92	140	144	146
sum	1024×1024	67	176	51	127
max	1024×1024	57	54	167	87
add (broadcast last dim)	1024×1024 + 1024	128	154	101	143
sub (broadcast row − matrix)	1024 − 1024×1024	131	155	101	145
exp	1024×1024	446	1763	979	739
relu	921600	81	326	79	81
sigmoid	921600	184	1800	981	520
tanh	1024×1024	436	1054	3727	573
gelu (sigmoid approx)	1024×1024	442	2280	1366	756
silu	1024×1024	436	2193	1149	752
softmax	32×1000	23	77	47	28
log_softmax	32×1000	23	87	40	28
softmax	512×256	76	303	193	110
layer_norm	512×256	55	213	95	263
batch_norm	1×64×64×3 ↔ 1×3×64×64	2	13	10	4

Honest reading (M1). Against ORT-CPU yscv is ahead on every op except relu (81 vs 81 µs, parity), by ~1.1–2.8× on the other ops and 4.8× on layer_norm. Against PyTorch the picture is mixed: yscv wins decisively on the polynomial-approximation activations (tanh ~8.5×, sigmoid ~5.3×, gelu ~3.1×, silu ~2.6×) but is slower on sum (67 vs 51 µs) and the broadcast add/sub (128 vs 101, 131 vs 101 µs), where PyTorch's reduction/broadcast kernels are better tuned for this core; relu is effectively parity (81 vs 79 µs). Against NumPy yscv wins on every op except max, where NumPy is marginally ahead (54 vs 57 µs). As on the other hosts, the activation wins come from polynomial approximations (NEON on this host) that trade a float-tolerance error for speed.

Methodology summary

Hardware: AMD Ryzen 5 7500F (Zen 4, 6C/12T) for x86; Orange Pi Zero 3 (Cortex-A53, 4C) for ARM; Apple M1 (4 P + 4 E cores) for Apple Silicon. Rust 1.95 stable, --release with codegen-units = 1, mimalloc global allocator.
Tracker: wall-clock latency over 300 iterations after warmup, fp32, dual input as described above. Competitor is ONNX Runtime CPUExecutionProvider (plus CoreMLExecutionProvider for the M1 GPU comparison). The Zen 4 and A53 tables report min; the M1 tables report p50.
Single-op: p50 of 1000 iterations after 200 warmup (300 on the A53), each op isolated in its own process to avoid cross-op cache/allocator contamination.
Ratios are competitor / yscv (>1.0 means yscv is faster), except in the Zen 4 tracker table, which reports yscv / ORT (>1.0 means yscv is slower).
All landings in the kernel path are bitwise-identical or 1-ULP-close to the reference; the suite builds clean on x86_64 + aarch64 and passes with and without BLAS.

Reference benchmarks (macOS / Metal / video)

The sections below were measured on Apple-Silicon hardware; the Metal tracker tables are current, while the YOLO-inference, elementwise, and video sections were measured on earlier dates with older tooling and are individually marked pending re-measurement. Treat those as provisional.

Tensor Elementwise Ops (1M f32, vs NumPy) — macOS, measured Apr 2026, pending re-measurement

Operation	yscv	NumPy	Ratio	Status
add	0.128ms	0.142ms	1.11×	WIN
sub	0.154ms	0.142ms	0.92×	PARITY
mul	0.134ms	0.142ms	1.06×	PARITY
sum	0.020ms	0.172ms	8.6×	WIN
max	0.020ms	0.053ms	2.7×	WIN
min	0.020ms	0.053ms	2.7×	WIN
exp	0.389ms	1.704ms	4.4×	WIN
relu	0.082ms	0.402ms	4.9×	WIN
argmax	<0.001ms	0.429ms	>400×	WIN
gt/eq/lt _into	0.116-0.130ms	0.314ms	2.5×	WIN
transpose 512²	0.112ms	0.184ms	1.6×	WIN

Tensor Unary Ops (1M f32, vs NumPy) — macOS, measured Apr 2026, pending re-measurement

Operation	yscv	NumPy	Ratio	Status
abs	0.080ms	0.088ms	1.1×	WIN
neg	0.080ms	~0.126ms	1.6×	WIN
floor	0.077ms	0.088ms	1.1×	WIN
ceil	0.077ms	~0.350ms	4.5×	WIN
round	0.077ms	~0.350ms	4.5×	WIN
sign	0.099ms	~0.350ms	3.5×	WIN
reciprocal	0.083ms	~0.200ms	2.4×	WIN
clamp	0.090ms	~0.350ms	3.9×	WIN
sqrt	0.156ms	0.163ms	1.04×	PARITY
ln	0.370ms	~1.200ms	3.2×	WIN

Activations (vs PyTorch 2.8, CPU 1-thread) — macOS, measured Apr 2026, pending re-measurement

Current Zen 4 single-thread activation numbers are in the single-operation compute section above; this macOS table is retained for reference only.

Operation	yscv	PyTorch	Ratio	Status
sigmoid 921K f32	0.217ms	1.296ms	6.0×	WIN
softmax 512×256	0.098ms	0.216ms	2.2×	WIN
relu 921K f32	0.069ms	0.105ms	1.5×	WIN
layer_norm 512×256	0.065ms	0.117ms	1.8×	WIN
gelu	—	2.522ms	—	WIN (old: 0.333ms vs ~0.400ms)

MatMul & Conv2d (vs PyTorch 2.8, CPU 1-thread) — macOS, measured Apr 2026, pending re-measurement

Operation	yscv	PyTorch	Ratio	Status
matmul 128²	0.0055ms	0.0062ms	1.13×	WIN
conv2d 32² 3×3	0.074ms	0.080ms	1.08×	WIN

Normalization (vs PyTorch 2.8, CPU 1-thread) — macOS, measured Apr 2026, pending re-measurement

Operation	yscv	PyTorch	Ratio	Status
layer_norm 512×256	0.065ms	0.117ms	1.80×	WIN
batch_norm 64²×16	0.028ms	0.045ms	1.61×	WIN

u8 Image Processing (640×480, vs OpenCV 4.13) — macOS, measured Apr 2026, pending re-measurement

Operation	yscv	OpenCV	Ratio	Status
resize nearest 320→640	0.048ms	0.157ms	3.27×	WIN
resize bilinear 320→640	0.068ms	0.201ms	2.96×	WIN
sobel 3×3	0.074ms	0.169ms	2.28×	WIN
dilate 3×3	0.031ms	0.047ms	1.52×	WIN
erode 3×3	0.030ms	0.051ms	1.70×	WIN
box blur 3×3	0.049ms	0.071ms	1.45×	WIN
grayscale	0.025ms	0.030ms	1.20×	WIN
gaussian 3×3	0.049ms	0.063ms	1.29×	WIN
median 3×3	0.029ms	0.072ms	2.48×	WIN

f32 Image Processing (ImageF32, 480×640, vs OpenCV) — macOS, measured Apr 2026, pending re-measurement

Operation	yscv	OpenCV	Ratio	Status
grayscale	0.022ms	0.027ms	1.23×	WIN
gaussian 3×3	0.051ms	0.113ms	2.22×	WIN
box blur 3×3	0.049ms	0.131ms	2.67×	WIN
dilate 3×3	0.047ms	0.104ms	2.21×	WIN
sobel 3×3	0.055ms	0.297ms	5.40×	WIN
threshold	0.015ms	0.017ms	1.13×	WIN

Video Decode (vs ffmpeg, single-threaded) — macOS, measured Apr 2026, pending re-measurement

H.264 and HEVC MP4 decode. Pure Rust decoder vs ffmpeg libavcodec (C, ffmpeg -threads 1).

Test methodology:

Hardware: Apple M-series (unified memory, NEON SIMD)
Build: --release, LTO=thin, codegen-units=1
Both decoders single-threaded for fair comparison
Best of 5 runs, cold CPU between runs
ffmpeg command: ffmpeg -threads 1 -benchmark -i <file> -f null -
yscv command: cargo run --release --example bench_video_decode -- <file>
Correctness: all frames decoded, pixel_range [0,255], frame count matches ffprobe
Memory: streaming reader (27MB RSS for 41MB file, O(1) relative to file size)
Date: April 2026

H.264

Video	Frames	yscv	ffmpeg	Ratio	Pixels
H.264 Baseline 1080p	300	324ms	519ms	1.60×	[0, 255] ✓
H.264 High 1080p	300	332ms	760ms	2.28×	[0, 255] ✓
Real Camera H.264 1080p60	1100	1187ms	5372ms	4.52×	[0, 255] ✓

HEVC (full color — chroma MC enabled)

Video	Frames	yscv	ffmpeg	Ratio	Pixels
HEVC Main 1080p P/B 5s	300	575ms	806ms	1.40×	[0, 255] ✓
HEVC Main 1080p P/B 10s	600	1288ms	1808ms	1.40×	[0, 255] ✓
HEVC Main 1080p I-only	180	1538ms	1483ms	0.97×	[0, 255] ✓

Key observations

H.264:

1.6–4.5× faster than ffmpeg across all profiles — pure Rust with SIMD IDCT/dequant (NEON + SSE2), rayon parallel deblocking, skip-aware edge filtering, zero-copy reference frames
Real camera 1080p60: 4.5× faster — 1100 frames decoded in 1.2 seconds
Full pixel range [0, 255] on all supported profiles
Weighted prediction, 8x8 DCT (High profile), sub-MB partitions (16x8, 8x16, 8x8)
Streaming reader: O(1) memory — 27MB RSS for 41MB file (no full-file loading)

HEVC:

1.4× faster than ffmpeg on P/B frames (full color decode with chroma MC + deblock + SAO + YUV→RGB)
I-frame near-parity (0.97×) — intra-only content is CABAC-bound
Full color output: chroma motion compensation with 4-tap filter, real YUV420→RGB
All profiles decode correctly ([0, 255]) including 10-bit Main10 (u16 DPB)
BS=0 edge skip eliminates ~85% of deblock work on inter-coded frames

Memory:

Streaming MP4 reader: reads only moov box at open (1-5MB), samples lazily via seek
41MB H.264 file: 27MB RSS (< file size)
3.2MB HEVC file: 129MB RSS (DPB + recon buffers for 1080p)
No unbounded growth — DPB bounded by SPS, all buffers reused across frames

Optimizations applied:

Branchless CABAC: mask-based MPS/LPS selection, packed transition tables (128-entry lookup), CLZ batch renormalize, 32-bit buffered bit reader
Unsafe hot paths: get_unchecked for all CABAC table lookups, ptr::add for deblock filter, pre-computed scan/context tables, branchless sign (val ^ -sign) + sign
Zero-copy frame management: reusable mv_field, CU list, Y-plane, recon buffers across frames
NEON (29 blocks) + SSE2 (31 blocks): MC 8-tap horizontal/vertical filter, bipred/unipred clip, DC intra prediction, dequant, DCT 16x16/32x32, i16→u8 saturation, Y→grayscale RGB interleave
Deblock: BS=0 skip (pred_mode grid), pre-computed tc/beta thresholds, early whole-edge skip, luma-only mode (skip chroma deblock)
SAO: CTU-only 4KB stack buffer (not full-frame copy)
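The branchless sign expression listed above, (val ^ -sign) + sign, is the standard two's-complement conditional negate: with sign = 1 the XOR flips every bit and the add completes the negation, and with sign = 0 both steps are no-ops. A minimal Rust sketch follows; the function name is illustrative and the code is not copied from the decoder.

```rust
/// Returns `val` when `sign == 0` and `-val` when `sign == 1`,
/// without branching. `sign` must be 0 or 1.
pub fn apply_sign(val: i32, sign: i32) -> i32 {
    debug_assert!(sign == 0 || sign == 1);
    // -sign is 0 (no-op XOR) or all ones (bitwise NOT of val);
    // adding sign then turns !val into -val.
    (val ^ sign.wrapping_neg()).wrapping_add(sign)
}
```

This form suits CABAC residual decoding because the sign bit arrives as a decoded bin, so the magnitude can be signed without a data-dependent branch.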

Supported formats:

H.264: Baseline (CAVLC), Main (CABAC), High (CABAC + 8x8 transform), I/P/B slices, weighted prediction, sub-MB partitions, scaling lists, parallel deblocking
HEVC: Main, Main10 (10-bit u16 DPB), I/P/B slices, CABAC, deblocking + SAO, CTU quad-tree, tiles (parsed), chroma residual parsing
MP4 container: avcC/hvcC parameter extraction, stbl/stco/stsz sample table navigation
MKV/WebM container: EBML demuxer with track/cluster parsing
Annex B raw stream parser (H.264 + HEVC)
SIMD: NEON (aarch64) 29 blocks, SSE2 (x86_64) 31 blocks — full cross-architecture coverage
Parallelism: rayon parallel deblocking, skip-aware edge filtering, zero-copy reference frames

Video (vs OpenCV) — macOS, measured Apr 2026, pending re-measurement

Operation	yscv	OpenCV	Ratio
YUV420→RGB 1080p	0.166ms	0.178ms	1.07×

Additional Operations (Apple Silicon, March 2026) — pending re-measurement

Operation	Time
Tensor add 100K	0.0143ms
Tensor mul 100K	0.0118ms
Broadcast add	0.226ms
Broadcast mul	0.211ms
matmul 128²	0.0055ms
matmul rect 96×192×64	0.0036ms
ReLU 921K f32	0.069ms (threaded: 0.062ms)
Sigmoid 921K f32	0.217ms
Add 921K same-shape	0.126ms
BatchNorm 64²×16	0.028ms (threaded: 0.023ms)
Softmax 512×256	0.098ms (threaded: 0.063ms)
LayerNorm 512×256	0.065ms (threaded: 0.044ms)
Conv2d 32² 3×3	0.074ms
MaxPool 120×160	0.159ms (threaded: 0.096ms)
Grayscale u8	0.025ms
Resize nearest u8	0.048ms
Resize bilinear u8	0.068ms
Dilate u8	0.031ms
Erode u8	0.030ms
Box blur u8	0.049ms
Sobel u8	0.074ms
Autograd backward 32²	0.0041ms
Autograd broadcast	0.0067ms
Model linear batch32	0.000905ms
Model linear+relu+linear	0.0024ms
SGD step batch16	0.0096ms
SGD step batch64	0.0147ms
Detect people	0.060ms
Detect faces	0.165ms
Detect heatmap	0.046ms
Track	0.487ms
Recognize query	0.000448ms
CLI people pipeline	0.075ms
CLI face pipeline	0.162ms

ONNX Inference (YOLOv8n / YOLO11n, 640×640 input) — Apple M1, measured Mar–Apr 2026, pending re-measurement

End-to-end model inference benchmarks against onnxruntime, Apple CoreML, and tract. Methodology: 50 timed runs after warmup, min reported. Apple M1 MacBook Air.

CPU Inference

Runtime	YOLOv8n	YOLO11n	Notes
yscv	30.4ms	33.7ms	Pure Rust, NHWC layout, BLAS matmul
onnxruntime 1.19 CPU	37.4ms	35.2ms*	*Requires opset 21 conversion; native opset 22 fails
onnxruntime 1.19 CoreML	15.5ms	47.6ms*	CoreML accelerator; YOLO11n perf degrades with partial coverage
tract 0.21	217.2ms	FAILED	TDim parse error

yscv CPU is 1.2× faster than onnxruntime on YOLOv8n (30.4ms vs 37.4ms) and comparable on YOLO11n (33.7ms vs 35.2ms). onnxruntime requires manual opset downgrade (22→21) for YOLO11n; yscv handles opset 22 natively.

yscv MPSGraph is 4× faster than ORT CoreML on YOLOv8n (3.5ms vs 15.5ms). ORT CoreML on YOLO11n degrades to 47.6ms due to partial operator coverage.

GPU Inference (Metal, Apple M1)

Runtime	YOLOv8n	YOLO11n	Notes
yscv MPSGraph	3.5ms	5.0ms	Whole-model graph compilation, single GPU dispatch
yscv Metal per-op	12.1ms	12.6ms	Per-op command buffer, Winograd + MPS GEMM
onnxruntime CoreML	14.2ms	FAILED	Apple Neural Engine delegation

MPSGraph compiles the entire ONNX model into an MPSGraphExecutable and runs it as a single GPU dispatch — eliminating per-op encoder transitions. 4× faster than CoreML on YOLOv8n.

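yscv is the only runtime here that runs both YOLOv8n and YOLO11n on GPU from the native exports. ORT CoreML fails on the native opset-22 YOLO11n model; the 47.6 ms figure in the CPU Inference table comes from the opset-21 conversion.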

Pipelined Throughput (MPSGraph submit/wait, Apple M1)

The pipelined API (submit_mpsgraph_plan + wait_mpsgraph_plan) triple-buffers input/output buffers and overlaps CPU marshaling with GPU compute. Sustained per-frame wall-time (300 iter, Siamese tracker, 2 inputs @ 1×3×128×128 + 1×3×256×256, fp16):

Mode	p50	p99	Sustained FPS
yscv sync (`--pipeline 1`)	1.26 ms	1.80 ms	792
yscv `--pipeline 2`	0.37 ms	0.48 ms	2688
yscv `--pipeline 3`	0.47 ms	0.65 ms	2151
yscv `--pipeline 4`	0.56 ms	0.64 ms	1776
onnxruntime CoreML MLProgram	1.62 ms	1.98 ms	617

Depth 2 is the throughput sweet spot (4.4× vs ORT CoreML) and also has the lowest p99 (0.48 ms); depth 4 trades raw p50 for the narrowest p50-to-p99 spread (0.56 → 0.64 ms). Pipeline depth is chosen via the YSCV_MPS_PIPELINE env var (default 3, clamped 1..=8). The API itself is safe regardless: submit_mpsgraph_plan back-pressures if the caller has more outstanding handles than the pipeline depth.
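The "default 3, clamped 1..=8" rule for YSCV_MPS_PIPELINE could be implemented along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, and treating an unparsable value as "use the default" is an assumption, since the text does not say how yscv handles malformed input.

```rust
/// Default pipeline depth when the variable is unset (per the text above).
pub const DEFAULT_DEPTH: usize = 3;

/// Parses a raw setting into a pipeline depth in 1..=8.
/// Assumption: unset or unparsable input falls back to the default.
pub fn parse_pipeline_depth(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .unwrap_or(DEFAULT_DEPTH)
        .clamp(1, 8)
}

/// Reads the depth from the YSCV_MPS_PIPELINE environment variable.
pub fn pipeline_depth_from_env() -> usize {
    parse_pipeline_depth(std::env::var("YSCV_MPS_PIPELINE").ok().as_deref())
}
```

Keeping the parsing separate from the environment lookup makes the clamping rule testable without mutating process-wide state.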

Siamese Tracker — Full Backend Comparison (Apple M1)

Same two-tower Siamese tracker (inputs 1×3×128×128 + 1×3×256×256, fp32 zero-fill on CPU, fp16 on GPU). Compares every backend yscv ships against the corresponding ORT provider on the identical host. p50 over 300 iterations after warmup; yscv vs ORT 1.19.2.

Backend	yscv	ORT 1.19.2	yscv vs ORT
CPU 1T	65 FPS (15.37 ms)	34 FPS (29.80 ms)	1.94×
CPU 4T	183 FPS (5.46 ms)	42 FPS (23.64 ms)	4.33×
GPU sync	792 FPS (1.26 ms)	617 FPS (1.62 ms, CoreML)	1.28×
GPU pipelined×2	2688 FPS (0.37 ms)	—	4.4× over ORT CoreML
GPU peak burst	3623 FPS (0.28 ms min)	—	—

Why yscv CPU beats ORT CPU on Apple Silicon: Accelerate.framework dispatches BLAS through Apple's AMX (Advanced Matrix eXtensions) block, a dedicated matrix accelerator inside the CPU complex, separate from the Neural Engine. ORT's CPUExecutionProvider uses its own general-purpose SIMD kernels that don't hit AMX, so it leaves ~1.9× throughput on the table single-thread, widening to 4.3× at four threads as ORT scales poorly across the M1's P-cores. On Intel the opposite holds: ORT's CPU kernels (MLAS) are highly tuned for AVX-512, where yscv (which dispatches through OpenBLAS) is typically 2-3× slower.

Why yscv Metal beats ORT CoreML on Apple Silicon: ORT's CoreMLExecutionProvider compiles the graph to CoreML and routes compatible ops to the Apple Neural Engine (ANE) + Metal hybrid. On the Siamese tracker 216/219 ops run on CoreML; the remaining 3 fall back to CPU. Every fallback crosses a CPU↔accelerator boundary, costing synchronization + marshalling. yscv's MPSGraph path compiles 100% of the graph to pure Metal and avoids the hybrid overhead entirely. Pipelining (submit/wait) then overlaps CPU marshal with GPU compute for roughly another 2.7× on sustained throughput at depth 3 (792 → 2151 FPS), or 3.4× at depth 2 (2688 FPS).

Pipelined Throughput (RKNN submit/wait, RK3588)

RknnPipelinedPool applies the same pattern to Rockchip NPU cores: one slot per NpuCoreMask, pre-allocated + pre-bound RknnMem per input and output, back-pressured submit/wait. On RK3588 the pool can drive all 3 NPU cores concurrently; on RV1106 pass &[Core0] for a cleanly-typed single-slot async path.
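The back-pressured submit/wait pattern can be illustrated without any accelerator. The sketch below uses a worker thread as a stand-in for the NPU or GPU and bounded standard-library channels for the slots; it makes no claim about how RknnPipelinedPool or submit_mpsgraph_plan are implemented, and `run_pipelined` is a hypothetical name.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Pushes `frames` through a worker with at most `depth` frames in flight.
/// The worker (a stand-in for the accelerator) doubles each value.
/// Results come back in submission order.
pub fn run_pipelined(depth: usize, frames: Vec<u32>) -> Vec<u32> {
    let depth = depth.max(1);
    let (submit_tx, submit_rx) = sync_channel::<u32>(depth);
    let (done_tx, done_rx) = sync_channel::<u32>(depth);

    let worker = thread::spawn(move || {
        for frame in submit_rx {
            if done_tx.send(frame * 2).is_err() {
                break;
            }
        }
    });

    let mut results = Vec::with_capacity(frames.len());
    let mut in_flight = 0usize;
    for frame in frames {
        // Back-pressure: once `depth` frames are outstanding, wait for
        // the oldest result before submitting another frame.
        if in_flight == depth {
            results.push(done_rx.recv().expect("worker exited early"));
            in_flight -= 1;
        }
        submit_tx.send(frame).expect("worker exited early");
        in_flight += 1;
    }
    drop(submit_tx); // signals the worker that no more frames are coming

    for _ in 0..in_flight {
        results.push(done_rx.recv().expect("worker exited early"));
    }
    worker.join().expect("worker panicked");
    results
}
```

Because the caller never holds more than `depth` outstanding frames, neither bounded channel can fill up while the other side is blocked, which is the property that lets the producer prepare the next input while the worker is busy.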

On-device numbers (YOLO / Siamese tracker, int8-quantized .rknn) will be added once captured against a physical Rock 4D. Relative gains are expected to mirror the MPSGraph path — pipeline depth equal to NPU-core count ≈ 3× sync throughput, with tail latency tightening as CPU and NPU stop serialising their handshake.

Metal Pipeline Architecture

The Metal backend compiles an ONNX graph into a sequence of MetalOps executed in a single fused command buffer. Key optimizations (in order of impact):

Optimization	Impact	Description
Winograd F(4×4, 3×3)	~40% of GPU time	2.25× FLOP reduction for stride-1 3×3 convs; SIMD group matrix multiply with f32 accumulation
F16 inter-op pipeline	Halves bandwidth	All intermediate buffers use f16; weights pre-packed as f16 at compile time
NEON input upload	Eliminates GPU cast	CPU-side `fcvtn`+`st3` converts f32 NCHW → f16 NHWC faster than GPU kernel
Conv+SiLU+Add fusion	Fewer ops	Residual addition and activation fused into conv write-back epilogue
Vectorized f16 kernels	Better throughput	All utility ops (concat, split, permute, resize) use `half4` vectorized I/O
Concat fusion	Eliminates copies	Conv outputs write directly into concat buffer via `out_stride`/`out_offset`
Detection head fusion	Fused permute+concat	NHWC→NCHW permutations + spatial concat fused into single `NhwcToFlatConcat` kernel
Zero-cost buffer aliasing	No-op reshapes	Reshape/Flatten/Squeeze/Unsqueeze alias existing buffers
Parallel softmax	Threadgroup reduction	Adaptive threadgroup size (32/128/256) with shared-memory reduction
Widened SiLU look-ahead	Fewer Metal ops	Detects SiLU patterns up to 5 nodes ahead (detection head interleaving)
In-place SiLU/Binary	Fewer buffers	Dead input buffers reused as output for elementwise ops
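As background for the Winograd row, the idealized multiplication saving of F(m×m, r×r) is m²r² / (m + r − 1)²: a direct convolution needs m²·r² multiplies per output tile, while the Winograd form needs (m + r − 1)² element-wise multiplies in the transformed domain. The sketch below computes that ratio; it counts only those multiplies and ignores the input, filter, and output transforms, so realized FLOP savings are lower. Under this idealized count, F(2×2, 3×3) gives 2.25× and F(4×4, 3×3) gives 4×. The function name is illustrative.

```rust
/// Idealized multiply reduction of Winograd F(m x m, r x r) over direct
/// convolution, ignoring the cost of the transforms.
pub fn winograd_multiply_reduction(m: u32, r: u32) -> f64 {
    let direct = (m * m * r * r) as f64; // multiplies per m x m output tile
    let tile = m + r - 1; // transformed tile edge length
    let winograd = (tile * tile) as f64; // element-wise multiplies per tile
    direct / winograd
}
```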

Metal Per-Op Distribution (YOLOv8n: 110 ops, YOLO11n: 204 ops)

Op Type	YOLOv8n	YOLO11n
ConvWinograd (3×3 stride=1)	32	28
MpsConv (MPS GEMM for 1×1+)	30	51
Concat	13	34
SplitFused	8	9
CpuReshape (GPU permute)	4	13
Binary/BroadcastBinary	6	28
DepthwiseConv	—	7
Other (MaxPool, Resize, etc.)	17	34

VballNetGrid Inference (DSConv model, 16.3 GFLOP) — Apple Silicon, measured Mar–Apr 2026, pending re-measurement

Model: VballNetGridV1b — 13 DSConvBlocks (depthwise 3×3 + pointwise 1×1), 4 MaxPool, head Conv+Sigmoid. Input [1, 9, 432, 768], output [1, 27, 27, 48], 42 ONNX nodes.

Optimization Progression (Apple Silicon)

Stage	Time	FPS	Speedup	What changed
yscv BEFORE	558 ms	1.8	—	Single-threaded, scalar depthwise
+ Multi-threading	257 ms	3.9	2.2×	`ParallelElementwiseConfig::default()` in public API
+ SIMD depthwise	124.1 ms	8.1	4.5×	NEON/AVX/SSE vectorized depthwise conv
onnxruntime CPU	196.7 ms	5.1	—	CPUExecutionProvider baseline
onnxruntime CoreML CPU_ONLY	8.6 ms	116	—	BNNS/AMX via CoreML delegate
yscv Metal per-op	47.3 ms	21.1	11.8×	Metal-native fused pipeline, MPS GEMM
yscv MPSGraph	7.8 ms	128	71.5×	Whole-model GPU graph compilation

Key Takeaway

yscv CPU (124.1ms) is 1.6× faster than onnxruntime CPU (196.7ms) on depthwise-separable models — no special flags needed. MPSGraph (7.8ms) beats CoreML CPU_ONLY (8.6ms) which uses Apple's dedicated AMX coprocessor via BNNS — a 1.1× speedup. MPSGraph provides 16× over CPU, reaching 128 FPS on Apple Silicon.

ONNX Siamese Tracker (Zen 4 historical arc, AMD Ryzen 5 7500F, 6C/12T, fp32 CPU)

Historical optimization arc, retained for context. The current Zen 4 tracker figures are in the Siamese tracker — CPU inference section at the top of this document (min-of-300, commit 241f36c). The p50 numbers below predate that re-measurement and use a different run protocol.

Model: Siamese tracker, 156 ops after graph optimization, two input branches (input.1 1×3×128×128 template, input.249 1×3×256×256 search) joined in connect_model. Primary fp32 CPU benchmark target of the S./A./R.* perf arc (Apr 2026, 19 sessions).

Methodology: RAYON_NUM_THREADS=N ./onnx-fps --iters 500, median of 3 runs per thread-count, bitwise-identical outputs across all Ns. ORT 1.24.4 CPUExecutionProvider as the reference.

Threads	yscv p50	ORT p50	gap	yscv scaling	ORT scaling
1	11.43 ms	8.05 ms	1.42×	1.00×	1.00×
2	6.55 ms	4.42 ms	1.48×	1.74×	1.82×
4	4.15 ms	2.36 ms	1.76×	2.75×	3.41×
6	3.66 ms	1.74 ms	2.10×	3.12×	4.62×
8	3.87 ms	2.28 ms	1.70×	2.95×	3.53×
12	4.02 ms	1.93 ms	2.08×	2.84×	4.16×

6T is the sweet spot (physical-core count). Beyond 6T, SMT contention hurts both engines; 12T is strictly worse than 6T.

Where the remaining gap lives (6T profile, sequential sums)

op	yscv	ORT	gap	ratio
Conv	6.31 ms (78 ops)	2.22 ms (114 ops)	+4.09 ms	2.84×
MatMul	0.19 ms (2)	0.04 ms (2)	+0.14 ms	4.17×
Reshape	0.11 ms (5)	0.01 ms (5)	+0.10 ms	8.01×
Reorder (NCHWc)	—	0.05 ms (7)	—	ORT-only

Conv accounts for 94% of the gap; the bulk of it lives in mid-sized pointwise and inverted-bottleneck layers. ORT uses NCHWc layout throughout while yscv runs NHWC.

Key kernel optimizations

The custom CPU path layers several fusions and SIMD kernels on top of the blocked GEMM:

AVX 8×8 / NEON 4×4 NCHW↔NHWC block transposes for layout conversion.
AVX-512 / AVX2 / NEON depthwise row kernels with fused activation.
Row-level parallelism for the first 3×3 stride-2 layer.
Streaming FusedPwDw (PW-expand → DW 3×3) with register-blocked accumulators, so the expanded intermediate never hits DRAM.
FusedTransposeMatMul, mirroring ORT's MatmulTransposeFusion.

All landings are bitwise-identical or 1-ULP-close to the reference; the suite builds clean on x86_64 + aarch64 and passes with and without BLAS.

GPU rerun on the same host (2026-04-25)

Same model and inputs, RTX 4060 added through --features gpu (wgpu backend over Vulkan) and onnxruntime-gpu 1.25 with CUDAExecutionProvider:

backend	p50	min	output `882` max	drift vs CPU ref
ORT CUDA EP fp32 (cuDNN, Tensor Cores)	1.42 ms	1.40 ms	48.9193	−0.17 (TF32)
yscv `gpu` fp16 (wgpu Vulkan)	5.25 ms	5.18 ms	48.1491	−0.94
yscv `gpu` fp32 (`YSCV_GPU_FP32=1`)	5.82 ms	5.72 ms	49.0915	−1 ULP
yscv CPU 6T no-BLAS	3.05 ms	2.84 ms	49.0920	+1 ULP
ORT CPU 1T	8.07 ms	8.03 ms	49.0916	reference

Two takeaways:

yscv fp32 GPU output is 1-ULP from the CPU reference, while ORT CUDA EP drifts −0.17 because cuDNN auto-uses Tensor Cores in TF32 / mixed precision on Ampere+ and downcasts implicitly. That is, our fp32 GPU path is the more numerically faithful one against the CPU reference.
ORT CUDA EP is 4× faster than yscv wgpu — structural (cuDNN ships shape-specific kernels and uses Tensor Cores via cooperative_matrix, wgpu compute shaders are vendor-portable WGSL with no MMA path). Closing this requires either a vendor Vulkan extension wgpu doesn't expose or a separate cuda-backend.

For Conv-heavy small-batch graphs like this tracker, the CPU runner (3.05 ms) beats wgpu (5.25 ms) on the same host, because GPU launch latency dominates the actual compute. wgpu starts to win at batch ≥ 4 or when CPU inference takes ≥ 30 ms. See gpu-backend-guide.md for the full positioning matrix and YSCV_GPU_FP32 env-flag docs.

Latest rerun (2026-04-25, post-R10 correctness fix)

R10 fixed a silent-drop residual_tile in microkernel_4x8_dispatch on the x86 1×NR-tile path — SIMD/ASM 4×8 variants never received the residual pointer and dropped the add for Conv+Add shapes whose n left an 8-wide scalar tail. Output 882 max went 84.78 → 49.0920 (ORT: 49.0916, 1-ULP FP-ordering drift). Perf also improved because non-BLAS path is now fully correct on tracker shapes.

build	1T p50	6T p50	12T p50	output 882 max
yscv no-BLAS (default for onnx-fps)	11.22 ms	3.17 ms	3.34 ms	49.0920 ✓
yscv with BLAS (OpenBLAS 0.3.31)	13.00 ms	8.75 ms	9.01 ms	49.0922 ✓
ORT 1.24.4 CPU	8.07 ms	1.74 ms	1.91 ms	49.0916 ✓

yscv no-BLAS vs ORT: 1T 1.39×, 6T 1.82×, 12T 1.75× behind.

On this graph BLAS is a net regression — 2.76× slower at 6T than the non-BLAS path. Root causes: matmul_2d_slices_fused with BLAS splits into blas_sgemm + apply_epilogue_fallback (two passes over out, ~5.5 ms of extra L2/L3 traffic at 6T), and the whole arc (R4/R7/R9/A2) only fires on the non-BLAS branch. OPENBLAS_NUM_THREADS=1 does not close the gap, so it's not pure thread oversubscription — rayon workers block serially on sgemm instead of running their own A/B tiles in parallel.

See feature-flags.md for the full when-to-enable / when-to-disable BLAS checklist.

Cross-Platform SIMD Coverage

Operation	NEON	SSE	AVX
Tensor binary/unary (1M f32)	✅ 4× unroll	✅ 4-wide	✅ 4× unroll (32 elem)
Activations (sigmoid/tanh/silu)	✅ 3-term poly	✅ poly	✅ poly
Softmax/LogSoftmax	✅ fused	✅ fused	✅ fused
MatMul	✅ BLAS	✅ BLAS	✅ BLAS + FMA
Conv2d 3×3	✅ direct NEON	✅ direct SSE	✅ im2col + BLAS
Depthwise Conv2d	✅ 4-wide FMA	✅ 4-wide	✅ 8-wide
u8 morphology/filter/sobel	✅ 16B/iter	✅ 16B/iter	✅ 32B/iter (AVX2)
f32 filter/morphology/geometry	✅ 4-wide	✅ 4-wide	✅ 8-wide
Median u8	✅ sort network	✅ sort network	—
YUV→RGB	✅ NEON + GCD	✅ SSE + threads	✅ AVX2 + threads
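One common way to fuse the statistics pass of softmax is the online formulation, which tracks the running maximum and the rescaled running sum in a single sweep and then normalizes. The scalar sketch below shows that technique for finite inputs; it is not taken from yscv, whose fused kernels are not described in detail here, and the function name is illustrative.

```rust
/// Numerically stable softmax using the online max/sum recurrence.
/// One sweep gathers the max and the sum of exp(x - max); a second
/// sweep writes the normalized outputs. Assumes finite inputs.
pub fn online_softmax(xs: &[f32]) -> Vec<f32> {
    if xs.is_empty() {
        return Vec::new();
    }
    let mut running_max = f32::NEG_INFINITY;
    let mut running_sum = 0.0f32;
    for &x in xs {
        let new_max = running_max.max(x);
        // Rescale the sum accumulated so far to the new maximum,
        // then add the current element's contribution.
        running_sum = running_sum * (running_max - new_max).exp() + (x - new_max).exp();
        running_max = new_max;
    }
    xs.iter()
        .map(|&x| (x - running_max).exp() / running_sum)
        .collect()
}
```

Subtracting the running maximum keeps every exponent at or below zero, so large logits such as 1000 and 1001 do not overflow.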

Optimization Techniques

315 #[target_feature]-gated SIMD functions with runtime CPU detection
All dispatch functions #[inline] for cross-crate inlining
AlignedVec::uninitialized — skip output zeroing in hot paths
ImageU8/ImageF32 — zero-overhead wrappers bypass Tensor allocation
GCD dispatch_apply — macOS near-zero threading (~0.3µs)
mimalloc — thread-local arena pools
Fused kernels — single-pass softmax, sigmoid, attention
im2col + BLAS — Accelerate/OpenBLAS for matmul/conv2d/conv3d
Flash Attention — tiled O(Br×Bc) memory, online softmax
Integer GEMM — quantized matmul with i32 accumulation (no dequant overhead)
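The first item in the list, `#[target_feature]`-gated functions behind runtime CPU detection, follows a standard Rust pattern, sketched below on a toy reduction. The function names are hypothetical and the AVX2 body is a plain loop rather than hand-written intrinsics; only the dispatch structure is the point.

```rust
/// Sums a slice, choosing an implementation at runtime.
pub fn sum_f32(data: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified on this CPU just above.
            return unsafe { sum_f32_avx2(data) };
        }
    }
    sum_f32_scalar(data)
}

/// Compiled with AVX2 enabled for this function only. A real kernel
/// would use `std::arch` intrinsics here; this body is a placeholder.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_f32_avx2(data: &[f32]) -> f32 {
    data.iter().sum()
}

/// Portable fallback used when no gated variant applies.
pub fn sum_f32_scalar(data: &[f32]) -> f32 {
    data.iter().sum()
}
```

The gated function is `unsafe` to call because executing it on a CPU without the feature is undefined behavior, so the safe wrapper owns the detection check.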

Test Infrastructure

Test Suite Summary (April 2026)

Crate	Tests	Coverage
yscv-model	365	Serialization, training loops, data loading, distributed
yscv-imgproc	225	All u8/f32 ops, SIMD paths, color conversion
yscv-video	230	H.264/HEVC decode (Main + Main10 + Rext + weighted prediction + tiles + WPP + chroma deblock/SAO), MP4/MKV parsing, HW detect
yscv-tensor	207	Elementwise, matmul, broadcast, BLAS dispatch
yscv-kernels	120	CPU ops, GPU backend, SIMD activation, GEMM
yscv-autograd	106	Forward/backward graph, all op gradients
yscv-eval	95	COCO/YOLO/VOC/KITTI/WiderFace/MOT metrics
yscv-onnx	166	Per-operator coverage for all 122 CPU dispatch arms, fusion regressions, quantization, vision ops
yscv-optim	76	SGD, Adam, LR schedulers, weight decay
yscv-detect	60	YOLOv8/v11 decode, NMS, letterbox
yscv-track	57	Hungarian, Kalman, IoU matching
yscv-cli	42	Config parsing, diagnostics
yscv-recognize	16	Embedding extraction, cosine similarity
Total	1808

CI Matrix

Platform	Runner	Features	What's Tested
macOS (ARM)	macos-latest	default + videotoolbox	Full workspace + HW decode
Linux (x86)	ubuntu-latest	default + gpu	Full workspace + WGPU
Linux (ARM)	ubuntu-24.04-arm	default	Full workspace + NEON
Windows	windows-latest	default	Full workspace

CI Jobs

workspace-compat: Multi-platform build + test
quality: fmt + clippy + CLI integration + benchmark gates + eval format verification
miri: Unsafe code soundness (yscv-tensor, yscv-kernels)
hw-decode: VideoToolbox (macOS), SW fallback (Linux/Windows)
benchmark gates: Criterion microbenchmarks for 12 crates with trend tracking

Test Data

Video: examples/src/CENSUSWITHOUTLOGO.mp4 (41MB, H.264 1080p60, 1100 frames)
Images: examples/src/testtraffic.png, testtraffic2.png (3.4-3.5MB)
Models: examples/src/slowwork/yolo{v8n,11n}.onnx (10-12MB, gitignored)
Eval samples: benchmarks/eval-* (all formats, <10KB each)
Fuzz corpus: fuzz/corpus/ (H.264, HEVC, MKV seed files)
Baselines: benchmarks/ci-baseline-*.txt, trend-baseline-*.tsv

How to Run

# Linux-local CI subset before pushing
bash scripts/check-ci-local.sh

# Broader local pass: release tests, extended proptests, UX, benchmark gates
bash scripts/check-ci-local.sh --full

# Add local cross-target smoke where toolchains are available
bash scripts/check-ci-local.sh --cross --all-features

# Full workspace test (1808 tests per the April 2026 summary above)
cargo test

# Single crate
cargo test -p yscv-video

# Video decode benchmark
cargo run --release --example bench_video_decode -- examples/src/CENSUSWITHOUTLOGO.mp4

# Compare with ffmpeg
ffmpeg -threads 1 -benchmark -i examples/src/CENSUSWITHOUTLOGO.mp4 -f null -

# Criterion microbenchmarks
cargo bench -p yscv-kernels
cargo bench -p yscv-imgproc

# Miri soundness check
cargo +nightly miri test -p yscv-tensor --lib

# Fuzz testing
cd fuzz && cargo fuzz run fuzz_h264_nal -- -max_total_time=60
