Measured CPU-inference and single-operation benchmarks for yscv, compared against ONNX Runtime, PyTorch, NumPy, OpenCV, and ffmpeg.
Last updated: 2026-06-16 · commit fa80f66
This document has two parts. The CPU sections (Siamese tracker, single-op) are the current measurement focus: fixed hardware, pinned competitor versions, isolated per-op processes, regenerable from a script in the repo. They cover three hosts — AMD Ryzen 5 7500F (Zen 4), Orange Pi Zero 3 (Cortex-A53), and Apple M1. The Metal / video sections below them are retained for reference but were measured on different hardware and dates; where a number could not be reproduced on current tooling it is marked pending re-measurement. Treat those as provisional.
Where yscv is at parity with or behind a competitor, this is stated plainly.
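Unless a table states a different convention, ratios are written as
competitor / yscv (a ratio above 1.0 means yscv is faster). The Zen 4 tracker
table is the exception in the CPU sections: it reports yscv / ORT.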
The primary end-to-end target. Model is a public two-tower Siamese
single-object tracker exported to ONNX (~156 ops after graph optimization),
with two inputs: a 1×3×128×128 template branch and a 1×3×256×256 search
branch, fp32 zero-fill. The reported figure is the minimum wall-clock
latency over 300 iterations (after warmup), which isolates the steady-state
compute path from scheduler and allocator jitter.
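The reporting rule above can be sketched as a small timing harness. This is a minimal illustration, not the repo's benchmark script: `measure`, the stand-in workload, and the warmup count of 20 are assumptions (the text only says "after warmup").

```python
import statistics
import time

def measure(fn, iters=300, warmup=20):
    """Return (min, p50) wall-clock latency in milliseconds for calling fn()."""
    for _ in range(warmup):              # warmup runs are discarded
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return min(samples), statistics.median(samples)

# Stand-in workload (hypothetical); a real run would invoke the model here.
lo, p50 = measure(lambda: sum(range(10_000)))
print(f"min {lo:.3f} ms, p50 {p50:.3f} ms")
```

The minimum is the fastest observed steady-state run, while the median also absorbs some scheduler noise, so min and p50 figures from different tables should not be compared directly.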
yscv (commit 241f36c) vs ONNX Runtime 1.24.4, CPUExecutionProvider, on the
same host.
Here the ratio is yscv / ORT (yscv's slowdown factor); >1.0 means yscv is
slower, since on this model ORT is ahead.
| Threads | yscv min | yscv FPS | ORT min | yscv / ORT |
|---|---|---|---|---|
| 1 | 8.63 ms | 116 | 8.03 ms | 1.07× |
| 4 | 3.06 ms | 327 | 2.35 ms | 1.30× |
| 6 | 2.52 ms | 396 | 1.74 ms | 1.45× |
Honest reading. yscv is roughly 7% behind ORT single-threaded
(8.63 vs 8.03 ms) and the gap widens with thread count: ORT scales better
across cores on this model, reaching 1.74 ms at 6 threads against yscv's
2.52 ms (ORT is ~1.45× faster at 6T). The single-thread compute path is close;
the deficit is in multi-thread scaling, not per-core kernel throughput. The
remaining structural difference is layout — ORT runs NCHWc throughout while
yscv runs NHWC. See docs/onnx-cpu-kernels.md for the
per-op hot-path map.
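The ratio and scaling figures behind this reading can be recomputed from the table. The sketch below uses only the latencies listed above; the helper names `slowdown` and `scaling` are illustrative, not part of yscv.

```python
# Minimum latencies in ms from the Zen 4 table: threads -> (yscv, ORT).
zen4 = {1: (8.63, 8.03), 4: (3.06, 2.35), 6: (2.52, 1.74)}

def slowdown(yscv_ms, ort_ms):
    """yscv / ORT; values above 1.0 mean yscv is slower."""
    return yscv_ms / ort_ms

def scaling(latencies_ms, threads):
    """Speedup of the N-thread run over the 1-thread run."""
    return latencies_ms[1] / latencies_ms[threads]

yscv = {t: pair[0] for t, pair in zen4.items()}
ort = {t: pair[1] for t, pair in zen4.items()}
for t in zen4:
    print(f"{t}T  yscv/ORT {slowdown(yscv[t], ort[t]):.2f}x  "
          f"yscv scaling {scaling(yscv, t):.2f}x  "
          f"ORT scaling {scaling(ort, t):.2f}x")
```

Going from 1 to 6 threads, yscv speeds up by about 3.4× and ORT by about 4.6×, which is the scaling deficit described above.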
yscv (commit d8d43ea on the device) vs ONNX Runtime 1.26.0
(CPUExecutionProvider), same host, same model and inputs. On this edge-ARM
target yscv is faster than ORT at both thread counts — the inverse of the
x86 picture, and the project's actual deployment target.
Ratio here is ORT / yscv (>1.0 means yscv is faster).
| Threads | yscv min | yscv FPS | ORT min | ORT / yscv |
|---|---|---|---|---|
| 1 | 321 ms | 3.1 | 496 ms | 1.54× |
| 4 | 101 ms | 9.9 | 161 ms | 1.59× |
Reference context (prior measurement, not from this round): yscv has also measured near-parity with XNNPACK on the A53 at roughly 307 ms / 1T and 98.5 ms / 4T; cited for scale only.
yscv vs ONNX Runtime 1.19.2 (CPUExecutionProvider), same host, same model and
inputs, fp32 zero-fill, p50 over 300 iterations. On Apple Silicon yscv is well
ahead of ORT-CPU at every thread count.
Ratio here is ORT / yscv (>1.0 means yscv is faster).
| Threads | yscv p50 | yscv FPS | ORT p50 | ORT / yscv |
|---|---|---|---|---|
| 1 | 15.37 ms | 65 | 29.80 ms | 1.94× |
| 4 | 5.46 ms | 183 | 23.64 ms | 4.33× |
| 8 | 6.56 ms | 153 | 37.52 ms | 5.72× |
Four threads is the sweet spot — the M1's four performance cores saturate there, and pushing to eight pulls in the efficiency cores, which regresses both runtimes.
Per-operation isolated microbenchmarks at 1 thread, 1000 iterations after
200 warmup, each op in a fresh process. yscv / PyTorch / ONNX Runtime figures
are generated by run-single-compute.sh
and recorded in
single-compute-zen4-2026-06-09.md
(commit 241f36c, PyTorch 2.12.0, ONNX Runtime 1.26.0). The NumPy column is
produced separately by
bench_numpy_single_ops.py
(NumPy 2.4.6). All times are p50 microseconds; lower is faster.
| Operation | Shape | yscv | NumPy | PyTorch | ORT |
|---|---|---|---|---|---|
| add | 1024×1024 | 100 | 99 | 98 | 111 |
| mul | 1024×1024 | 97 | 96 | 97 | 116 |
| sum | 1024×1024 | 29 | 114 | 32 | 83 |
| max | 1024×1024 | 30 | 36 | 79 | 56 |
| add (broadcast last dim) | 1024×1024 + 1024 | 72 | 148 | 67 | 91 |
| sub (broadcast row − matrix) | 1024 − 1024×1024 | 69 | 144 | 68 | 93 |
| exp | 1024×1024 | 110 | 364 | 634 | 138 |
| relu | 921600 | 55 | 248 | 57 | 64 |
| sigmoid | 921600 | 74 | 521 | 364 | 202 |
| tanh | 1024×1024 | 188 | 288 | 2004 | 208 |
| gelu (sigmoid approx) | 1024×1024 | 176 | 4325 | 3384 | 419 |
| silu | 1024×1024 | 162 | 3026 | 350 | 227 |
| softmax | 32×1000 | 7 | 29 | 15 | 9 |
| log_softmax | 32×1000 | 7 | 38 | 14 | 9 |
| softmax | 512×256 | 26 | 143 | 63 | 36 |
| layer_norm | 512×256 | 11 | 187 | 39 | 111 |
| batch_norm | 1×64×64×3 ↔ 1×3×64×64 | 2 | 10 | 8 | 3 |
Honest reading.
- Memory-bound elementwise (`add`, `mul`, broadcast): on `add` and `mul` yscv is at parity with NumPy and PyTorch; these ops are limited by memory bandwidth, not arithmetic, so all three single-thread implementations converge. yscv is marginally slower than PyTorch on `add` (100 vs 98 µs) and the broadcast variants (72 vs 67, 69 vs 68 µs), but about 2× faster than NumPy on the broadcast variants (72 vs 148, 69 vs 144 µs). It is faster than ORT on all of them (1.1–1.4×).
- Transcendentals / activations: yscv's polynomial-approximation kernels win substantially over PyTorch (`exp` ~5.8×, `tanh` ~10.7×, `gelu` ~19×, `sigmoid` ~4.9×, `silu` ~2.2×) and beat NumPy on all of them, with the largest margins on `gelu` (~25×) and `silu` (~19×). These are approximations to within a float tolerance, not bit-exact transcendental functions; that is the tradeoff that buys the speed.
- Reductions / normalization: yscv wins on `sum`, `max`, `softmax`, `layer_norm`, and `batch_norm` against all three.
- vs ORT-CPU overall: yscv is ahead on every op in the table, by roughly 1.1–2.9× (and 10× on `layer_norm`, where ORT is unexpectedly slow at this shape).
gelu uses the sigmoid approximation x · sigmoid(1.702·x); yscv uses NHWC
for batch_norm while PyTorch and ORT use NCHW over the same data volume.
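To make the `gelu` row concrete, the sketch below evaluates the sigmoid approximation quoted above against the erf-based GELU definition. It illustrates the formula only; it is not yscv's kernel, and the scan range and step are arbitrary choices.

```python
import math

def gelu_erf(x):
    """Reference GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_sigmoid(x):
    """Sigmoid approximation: x * sigmoid(1.702 * x)."""
    return x / (1.0 + math.exp(-1.702 * x))

# Scan [-6, 6] in steps of 0.01 and record the largest absolute deviation.
grid = [i / 100.0 for i in range(-600, 601)]
max_err = max(abs(gelu_sigmoid(x) - gelu_erf(x)) for x in grid)
print(f"max |gelu_sigmoid - gelu_erf| on [-6, 6]: {max_err:.4f}")
```

The deviation is an absolute error of the formula itself, separate from any additional error introduced by a polynomial sigmoid kernel.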
# yscv / PyTorch / ONNX Runtime (writes the dated markdown snapshot)
RAYON_NUM_THREADS=1 YSCV_POOL_SPIN_US=200 ITERS=1000 WARMUP=200 \
OUT=benchmarks/single-compute-zen4-$(date -u +%F).md \
bash benchmarks/run-single-compute.sh
# NumPy column (single thread)
python3 benchmarks/python/bench_numpy_single_ops.py --iters 1000 --threads 1

The yscv/PyTorch/ORT snapshot records git commit and dirty state, toolchain
and Python runtime versions, raw per-backend logs under artifacts/, and
min / p50 / avg rows per backend. Each operation is measured in a fresh
process: PyTorch full-suite single-process runs showed allocator/cache
contamination on later memory-bound ops, so isolated per-op p50 is the source
of truth. yscv uses YSCV_POOL=yscv and YSCV_POOL_SPIN_US=200 for this tight
standalone loop; the normal inference default is unchanged. p50 deltas within
1 µs are reported as parity.
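The parity rule can be written down as a small classifier. This is a sketch: `classify` is a hypothetical helper, and treating a delta of exactly 1 µs as parity is an assumption, since the text does not say whether the bound is inclusive.

```python
def classify(yscv_us, other_us, tol_us=1.0):
    """Label a p50 comparison; deltas within tol_us count as parity."""
    delta = other_us - yscv_us
    if abs(delta) <= tol_us:             # assumption: the bound is inclusive
        return "parity"
    return "yscv faster" if delta > 0 else "yscv slower"

# (op, yscv p50, PyTorch p50) in microseconds, from the Zen 4 table.
rows = [("add", 100, 98), ("mul", 97, 97), ("exp", 110, 634)]
for op, yscv_us, torch_us in rows:
    print(f"{op:4s} {classify(yscv_us, torch_us)}")
```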
Same per-op isolated methodology on the A53 (yscv compute_gap, PyTorch
2.12.0+cpu, ONNX Runtime 1.26.0, NumPy 2.4.6; 300 iterations, p50 µs, 1 thread).
| Operation | Shape | yscv | NumPy | PyTorch | ORT |
|---|---|---|---|---|---|
| add | 1024×1024 | 6645 | 6690 | 8250 | 6938 |
| mul | 1024×1024 | 6647 | 7317 | 7810 | 7295 |
| exp | 1024×1024 | 13161 | 41029 | 23915 | 17716 |
| relu | 921600 | 3406 | 5924 | 3012 | 3280 |
| sigmoid | 921600 | 4100 | 47008 | 22677 | 16139 |
| tanh | 1024×1024 | 5907 | 21555 | 44393 | 16507 |
| gelu (sigmoid approx) | 1024×1024 | 7321 | 79255 | 37741 | 25720 |
| silu | 1024×1024 | 5989 | 67883 | 27541 | 23827 |
| softmax | 512×256 | 1069 | 7276 | 3212 | 3413 |
| layer_norm | 512×256 | 508 | 4495 | 1561 | 2401 |
| batch_norm | 1×3×64×64 | 23 | 221 | 186 | 88 |
Honest reading (A53). On the weak in-order core, memory-bound elementwise
(add, mul, relu) is close to parity for most backends, limited by the Pi's
DRAM bandwidth rather than arithmetic. yscv is fastest on add and mul; on relu
PyTorch and ORT edge ahead (3012 and 3280 µs vs yscv's 3406 µs) while NumPy
trails (5924 µs). yscv's NEON polynomial-approximation kernels then pull
far ahead on transcendentals/activations: sigmoid ~3.9× vs ORT, ~5.5× vs
PyTorch, ~11× vs NumPy; tanh ~2.8× / ~7.5× / ~3.6×; gelu ~3.5× / ~5.2× /
~10.8×; silu ~4.0× / ~4.6× / ~11.3×; plus softmax / layer_norm /
batch_norm 3–9× over every backend. On ARM yscv beats NumPy, PyTorch, and
ORT-CPU on every op outside memory-bound parity — consistent with the tracker
result, where yscv is 1.5–1.6× faster than ORT on the A53.
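One way to sanity-check the memory-bound reading is to convert the `add` timings into an effective bandwidth. The sketch below is a back-of-envelope estimate, not a measurement: it assumes fp32 elements and counts two input reads plus one output write, ignoring any write-allocate traffic.

```python
# p50 times in microseconds for `add` on 1024x1024, from the A53 table.
add_p50_us = {"yscv": 6645, "NumPy": 6690, "PyTorch": 8250, "ORT": 6938}

ELEMS = 1024 * 1024
BYTES_PER_ELEM = 4     # assumption: fp32 elements
ARRAYS_TOUCHED = 3     # assumption: two inputs read, one output written

def effective_gb_per_s(p50_us):
    """Bytes moved divided by elapsed time, in GB/s (1 GB = 1e9 bytes)."""
    bytes_moved = ELEMS * BYTES_PER_ELEM * ARRAYS_TOUCHED
    return bytes_moved / (p50_us * 1e-6) / 1e9

for name, us in add_p50_us.items():
    print(f"{name:8s} {effective_gb_per_s(us):.2f} GB/s")
```

Under these assumptions yscv, NumPy, and ORT all land within a few percent of each other on `add`, which is the pattern expected when the memory system rather than the kernel sets the pace.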
Reproduce on the device:
cargo build --release -p yscv-llm-bench --bin compute_gap
RAYON_NUM_THREADS=1 ./target/release/compute_gap --iters 300
python3 benchmarks/python/bench_ort_single_ops.py --iters 300 --threads 1
python3 benchmarks/python/bench_numpy_single_ops.py --iters 300 --threads 1
python3 benchmarks/python/bench_torch_single_ops.py --iters 300 --threads 1

Same per-op isolated methodology on the M1 (yscv compute_gap, PyTorch 2.8.0,
ONNX Runtime 1.19.2, NumPy 2.0.2; 1000 iterations after 200 warmup, p50 µs,
1 thread), recorded in
single-compute-m1-2026-06-16.md.
| Operation | Shape | yscv | NumPy | PyTorch | ORT |
|---|---|---|---|---|---|
| add | 1024×1024 | 95 | 141 | 143 | 141 |
| mul | 1024×1024 | 92 | 140 | 144 | 146 |
| sum | 1024×1024 | 67 | 176 | 51 | 127 |
| max | 1024×1024 | 57 | 54 | 167 | 87 |
| add (broadcast last dim) | 1024×1024 + 1024 | 128 | 154 | 101 | 143 |
| sub (broadcast row − matrix) | 1024 − 1024×1024 | 131 | 155 | 101 | 145 |
| exp | 1024×1024 | 446 | 1763 | 979 | 739 |
| relu | 921600 | 81 | 326 | 79 | 81 |
| sigmoid | 921600 | 184 | 1800 | 981 | 520 |
| tanh | 1024×1024 | 436 | 1054 | 3727 | 573 |
| gelu (sigmoid approx) | 1024×1024 | 442 | 2280 | 1366 | 756 |
| silu | 1024×1024 | 436 | 2193 | 1149 | 752 |
| softmax | 32×1000 | 23 | 77 | 47 | 28 |
| log_softmax | 32×1000 | 23 | 87 | 40 | 28 |
| softmax | 512×256 | 76 | 303 | 193 | 110 |
| layer_norm | 512×256 | 55 | 213 | 95 | 263 |
| batch_norm | 1×64×64×3 ↔ 1×3×64×64 | 2 | 13 | 10 | 4 |
Honest reading (M1). Against ORT-CPU yscv is ahead on every op except
relu (81 vs 81 µs, parity), by ~1.1–2.8× on the activations and 4.8× on
layer_norm. Against PyTorch the picture is mixed: yscv wins decisively on the
polynomial-approximation activations (tanh ~8.5×, sigmoid ~5.3×, gelu
~3.1×, silu ~2.6×) but is slower on sum (67 vs 51 µs) and the broadcast
add/sub (128 vs 101, 131 vs 101 µs), where PyTorch's reduction/broadcast
kernels are better tuned for this core; relu is parity. Against NumPy yscv
wins on every op except max, where it is marginally slower (57 vs 54 µs). As
on the other hosts, the activation wins are the NEON polynomial
approximations trading a float-tolerance error for speed.
- Hardware: AMD Ryzen 5 7500F (Zen 4, 6C/12T) for x86; Orange Pi Zero 3 (Cortex-A53, 4C) for ARM; Apple M1 (4 P + 4 E cores) for Apple Silicon. Rust 1.95 stable, `--release` with `codegen-units = 1`, mimalloc global allocator.
- Tracker: wall-clock latency over 300 iterations after warmup, fp32, dual input as described above. Competitor is ONNX Runtime `CPUExecutionProvider` (plus `CoreMLExecutionProvider` for the M1 GPU comparison). The Zen 4 and A53 tables report min; the M1 tables report p50.
- Single-op: p50 of 1000 iterations after 200 warmup (300 on the A53), each op isolated in its own process to avoid cross-op cache/allocator contamination.
- Ratios are `competitor / yscv`; >1.0 means yscv is faster. The Zen 4 tracker table is the exception and reports yscv / ORT.
- All landings in the kernel path are bitwise-identical or 1-ULP-close to the reference; the suite builds clean on x86_64 + aarch64 and passes with and without BLAS.
The sections below were measured on Apple-Silicon hardware; the Metal tracker tables are current, while the YOLO-inference, elementwise, and video sections were measured on earlier dates with older tooling and are individually marked pending re-measurement. Treat those as provisional.
| Operation | yscv | NumPy | Ratio | Status |
|---|---|---|---|---|
| add | 0.128ms | 0.142ms | 1.11× | WIN |
| sub | 0.154ms | 0.142ms | 0.92× | PARITY |
| mul | 0.134ms | 0.142ms | 1.06× | PARITY |
| sum | 0.020ms | 0.172ms | 8.6× | WIN |
| max | 0.020ms | 0.053ms | 2.7× | WIN |
| min | 0.020ms | 0.053ms | 2.7× | WIN |
| exp | 0.389ms | 1.704ms | 4.4× | WIN |
| relu | 0.082ms | 0.402ms | 4.9× | WIN |
| argmax | <0.001ms | 0.429ms | >400× | WIN |
| gt/eq/lt _into | 0.116-0.130ms | 0.314ms | 2.5× | WIN |
| transpose 512² | 0.112ms | 0.184ms | 1.6× | WIN |
| Operation | yscv | NumPy | Ratio | Status |
|---|---|---|---|---|
| abs | 0.080ms | 0.088ms | 1.1× | WIN |
| neg | 0.080ms | ~0.126ms | 1.6× | WIN |
| floor | 0.077ms | 0.088ms | 1.1× | WIN |
| ceil | 0.077ms | ~0.350ms | 4.5× | WIN |
| round | 0.077ms | ~0.350ms | 4.5× | WIN |
| sign | 0.099ms | ~0.350ms | 3.5× | WIN |
| reciprocal | 0.083ms | ~0.200ms | 2.4× | WIN |
| clamp | 0.090ms | ~0.350ms | 3.9× | WIN |
| sqrt | 0.156ms | 0.163ms | 1.04× | PARITY |
| ln | 0.370ms | ~1.200ms | 3.2× | WIN |
Current Zen 4 single-thread activation numbers are in the single-operation compute section above; this macOS table is retained for reference only.
| Operation | yscv | PyTorch | Ratio | Status |
|---|---|---|---|---|
| sigmoid 921K f32 | 0.217ms | 1.296ms | 6.0× | WIN |
| softmax 512×256 | 0.098ms | 0.216ms | 2.2× | WIN |
| relu 921K f32 | 0.069ms | 0.105ms | 1.5× | WIN |
| layer_norm 512×256 | 0.065ms | 0.117ms | 1.8× | WIN |
| gelu | — | 2.522ms | — | WIN (old: 0.333ms vs ~0.400ms) |
| Operation | yscv | PyTorch | Ratio | Status |
|---|---|---|---|---|
| matmul 128² | 0.0055ms | 0.0062ms | 1.13× | WIN |
| conv2d 32² 3×3 | 0.074ms | 0.080ms | 1.08× | WIN |
| Operation | yscv | PyTorch | Ratio | Status |
|---|---|---|---|---|
| layer_norm 512×256 | 0.065ms | 0.117ms | 1.80× | WIN |
| batch_norm 64²×16 | 0.028ms | 0.045ms | 1.61× | WIN |
| Operation | yscv | OpenCV | Ratio | Status |
|---|---|---|---|---|
| resize nearest 320→640 | 0.048ms | 0.157ms | 3.27× | WIN |
| resize bilinear 320→640 | 0.068ms | 0.201ms | 2.96× | WIN |
| sobel 3×3 | 0.074ms | 0.169ms | 2.28× | WIN |
| dilate 3×3 | 0.031ms | 0.047ms | 1.52× | WIN |
| erode 3×3 | 0.030ms | 0.051ms | 1.70× | WIN |
| box blur 3×3 | 0.049ms | 0.071ms | 1.45× | WIN |
| grayscale | 0.025ms | 0.030ms | 1.20× | WIN |
| gaussian 3×3 | 0.049ms | 0.063ms | 1.29× | WIN |
| median 3×3 | 0.029ms | 0.072ms | 2.48× | WIN |
f32 Image Processing (ImageF32, 480×640, vs OpenCV) — macOS, measured Apr 2026, pending re-measurement
| Operation | yscv | OpenCV | Ratio | Status |
|---|---|---|---|---|
| grayscale | 0.022ms | 0.027ms | 1.23× | WIN |
| gaussian 3×3 | 0.051ms | 0.113ms | 2.22× | WIN |
| box blur 3×3 | 0.049ms | 0.131ms | 2.67× | WIN |
| dilate 3×3 | 0.047ms | 0.104ms | 2.21× | WIN |
| sobel 3×3 | 0.055ms | 0.297ms | 5.40× | WIN |
| threshold | 0.015ms | 0.017ms | 1.13× | WIN |
H.264 and HEVC MP4 decode. Pure Rust decoder vs ffmpeg libavcodec (C, ffmpeg -threads 1).
Test methodology:
- Hardware: Apple M-series (unified memory, NEON SIMD)
- Build: `--release`, LTO=thin, codegen-units=1
- Both decoders single-threaded for fair comparison
- Best of 5 runs, cold CPU between runs
- ffmpeg command: `ffmpeg -threads 1 -benchmark -i <file> -f null -`
- yscv command: `cargo run --release --example bench_video_decode -- <file>`
- Correctness: all frames decoded, pixel_range [0,255], frame count matches ffprobe
- Memory: streaming reader (27MB RSS for 41MB file, O(1) relative to file size)
- Date: April 2026
| Video | Frames | yscv | ffmpeg | Ratio | Pixels |
|---|---|---|---|---|---|
| H.264 Baseline 1080p | 300 | 324ms | 519ms | 1.60× | [0, 255] ✓ |
| H.264 High 1080p | 300 | 332ms | 760ms | 2.28× | [0, 255] ✓ |
| Real Camera H.264 1080p60 | 1100 | 1187ms | 5372ms | 4.52× | [0, 255] ✓ |
| Video | Frames | yscv | ffmpeg | Ratio | Pixels |
|---|---|---|---|---|---|
| HEVC Main 1080p P/B 5s | 300 | 575ms | 806ms | 1.40× | [0, 255] ✓ |
| HEVC Main 1080p P/B 10s | 600 | 1288ms | 1808ms | 1.40× | [0, 255] ✓ |
| HEVC Main 1080p I-only | 180 | 1538ms | 1483ms | 0.97× | [0, 255] ✓ |
H.264:
- 1.6–4.5× faster than ffmpeg across all profiles — pure Rust with SIMD IDCT/dequant (NEON + SSE2), rayon parallel deblocking, skip-aware edge filtering, zero-copy reference frames
- Real camera 1080p60: 4.5× faster — 1100 frames decoded in 1.2 seconds
- Full pixel range [0, 255] on all supported profiles
- Weighted prediction, 8x8 DCT (High profile), sub-MB partitions (16x8, 8x16, 8x8)
- Streaming reader: O(1) memory — 27MB RSS for 41MB file (no full-file loading)
HEVC:
- 1.4× faster than ffmpeg on P/B frames (full color decode with chroma MC + deblock + SAO + YUV→RGB)
- I-frame near-parity (0.97×) — intra-only content is CABAC-bound
- Full color output: chroma motion compensation with 4-tap filter, real YUV420→RGB
- All profiles decode correctly ([0, 255]) including 10-bit Main10 (u16 DPB)
- BS=0 edge skip eliminates ~85% of deblock work on inter-coded frames
Memory:
- Streaming MP4 reader: reads only moov box at open (1-5MB), samples lazily via seek
- 41MB H.264 file: 27MB RSS (< file size)
- 3.2MB HEVC file: 129MB RSS (DPB + recon buffers for 1080p)
- No unbounded growth — DPB bounded by SPS, all buffers reused across frames
Optimizations applied:
- Branchless CABAC: mask-based MPS/LPS selection, packed transition tables (128-entry lookup), CLZ batch renormalize, 32-bit buffered bit reader
- Unsafe hot paths: `get_unchecked` for all CABAC table lookups, `ptr::add` for deblock filter, pre-computed scan/context tables, branchless sign `(val ^ -sign) + sign`
- Zero-copy frame management: reusable mv_field, CU list, Y-plane, recon buffers across frames
- NEON (29 blocks) + SSE2 (31 blocks): MC 8-tap horizontal/vertical filter, bipred/unipred clip, DC intra prediction, dequant, DCT 16x16/32x32, i16→u8 saturation, Y→grayscale RGB interleave
- Deblock: BS=0 skip (pred_mode grid), pre-computed tc/beta thresholds, early whole-edge skip, luma-only mode (skip chroma deblock)
- SAO: CTU-only 4KB stack buffer (not full-frame copy)
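The branchless sign expression in the list above can be checked in isolation. The sketch below is written in Python for illustration only (the decoder itself is Rust); `apply_sign` and `apply_sign_branchy` are hypothetical names, and `sign` is assumed to be a 0/1 flag as in a decoded sign bit.

```python
def apply_sign(val, sign):
    """Return -val if sign == 1 and val if sign == 0, without a branch.

    -sign is 0 or -1 (all bits set), so the XOR either leaves val unchanged
    or flips every bit; adding sign then completes the two's-complement
    negation.
    """
    return (val ^ -sign) + sign

def apply_sign_branchy(val, sign):
    """Reference version with an explicit branch."""
    return -val if sign else val

# Both versions agree on a few representative values.
for val in (0, 1, 5, -7, 255):
    for sign in (0, 1):
        assert apply_sign(val, sign) == apply_sign_branchy(val, sign)
print(apply_sign(5, 1), apply_sign(5, 0))
```

The same identity holds for fixed-width two's-complement integers, which is what makes it usable in a SIMD or scalar hot loop.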
Supported formats:
- H.264: Baseline (CAVLC), Main (CABAC), High (CABAC + 8x8 transform), I/P/B slices, weighted prediction, sub-MB partitions, scaling lists, parallel deblocking
- HEVC: Main, Main10 (10-bit u16 DPB), I/P/B slices, CABAC, deblocking + SAO, CTU quad-tree, tiles (parsed), chroma residual parsing
- MP4 container: avcC/hvcC parameter extraction, stbl/stco/stsz sample table navigation
- MKV/WebM container: EBML demuxer with track/cluster parsing
- Annex B raw stream parser (H.264 + HEVC)
- SIMD: NEON (aarch64) 29 blocks, SSE2 (x86_64) 31 blocks — full cross-architecture coverage
- Parallelism: rayon parallel deblocking, skip-aware edge filtering, zero-copy reference frames
| Operation | yscv | OpenCV | Ratio |
|---|---|---|---|
| YUV420→RGB 1080p | 0.166ms | 0.178ms | 1.07× |
| Operation | Time |
|---|---|
| Tensor add 100K | 0.0143ms |
| Tensor mul 100K | 0.0118ms |
| Broadcast add | 0.226ms |
| Broadcast mul | 0.211ms |
| matmul 128² | 0.0055ms |
| matmul rect 96×192×64 | 0.0036ms |
| ReLU 921K f32 | 0.069ms (threaded: 0.062ms) |
| Sigmoid 921K f32 | 0.217ms |
| Add 921K same-shape | 0.126ms |
| BatchNorm 64²×16 | 0.028ms (threaded: 0.023ms) |
| Softmax 512×256 | 0.098ms (threaded: 0.063ms) |
| LayerNorm 512×256 | 0.065ms (threaded: 0.044ms) |
| Conv2d 32² 3×3 | 0.074ms |
| MaxPool 120×160 | 0.159ms (threaded: 0.096ms) |
| Grayscale u8 | 0.025ms |
| Resize nearest u8 | 0.048ms |
| Resize bilinear u8 | 0.068ms |
| Dilate u8 | 0.031ms |
| Erode u8 | 0.030ms |
| Box blur u8 | 0.049ms |
| Sobel u8 | 0.074ms |
| Autograd backward 32² | 0.0041ms |
| Autograd broadcast | 0.0067ms |
| Model linear batch32 | 0.000905ms |
| Model linear+relu+linear | 0.0024ms |
| SGD step batch16 | 0.0096ms |
| SGD step batch64 | 0.0147ms |
| Detect people | 0.060ms |
| Detect faces | 0.165ms |
| Detect heatmap | 0.046ms |
| Track | 0.487ms |
| Recognize query | 0.000448ms |
| CLI people pipeline | 0.075ms |
| CLI face pipeline | 0.162ms |
ONNX Inference (YOLOv8n / YOLO11n, 640×640 input) — Apple M1, measured Mar–Apr 2026, pending re-measurement
End-to-end model inference benchmarks against onnxruntime, Apple CoreML, and tract. Methodology: 50 timed runs after warmup, min reported. Apple M1 MacBook Air.
| Runtime | YOLOv8n | YOLO11n | Notes |
|---|---|---|---|
| yscv | 30.4ms | 33.7ms | Pure Rust, NHWC layout, BLAS matmul |
| onnxruntime 1.19 CPU | 37.4ms | 35.2ms* | *Requires opset 21 conversion; native opset 22 fails |
| onnxruntime 1.19 CoreML | 15.5ms | 47.6ms* | CoreML accelerator; YOLO11n perf degrades with partial coverage |
| tract 0.21 | 217.2ms | FAILED | TDim parse error |
yscv CPU is 1.2× faster than onnxruntime on YOLOv8n (30.4ms vs 37.4ms) and comparable on YOLO11n (33.7ms vs 35.2ms). onnxruntime requires manual opset downgrade (22→21) for YOLO11n; yscv handles opset 22 natively.
yscv MPSGraph is 4× faster than ORT CoreML on YOLOv8n (3.5ms vs 15.5ms). ORT CoreML on YOLO11n degrades to 47.6ms due to partial operator coverage.
| Runtime | YOLOv8n | YOLO11n | Notes |
|---|---|---|---|
| yscv MPSGraph | 3.5ms | 5.0ms | Whole-model graph compilation, single GPU dispatch |
| yscv Metal per-op | 12.1ms | 12.6ms | Per-op command buffer, Winograd + MPS GEMM |
| onnxruntime CoreML | 14.2ms | FAILED | Apple Neural Engine delegation |
MPSGraph compiles the entire ONNX model into an MPSGraphExecutable and runs it as a single GPU dispatch — eliminating per-op encoder transitions. 4× faster than CoreML on YOLOv8n.
yscv is the only runtime that runs both YOLOv8n and YOLO11n on GPU. CoreML fails on YOLO11n (opset 22).
The pipelined API (submit_mpsgraph_plan + wait_mpsgraph_plan) triple-buffers input/output buffers and overlaps CPU marshaling with GPU compute. Sustained per-frame wall-time (300 iter, Siamese tracker, 2 inputs @ 1×3×128×128 + 1×3×256×256, fp16):
| Mode | p50 | p99 | Sustained FPS |
|---|---|---|---|
| yscv sync (`--pipeline 1`) | 1.26 ms | 1.80 ms | 792 |
| yscv `--pipeline 2` | 0.37 ms | 0.48 ms | 2688 |
| yscv `--pipeline 3` | 0.47 ms | 0.65 ms | 2151 |
| yscv `--pipeline 4` | 0.56 ms | 0.64 ms | 1776 |
| onnxruntime CoreML MLProgram | 1.62 ms | 1.98 ms | 617 |
Depth 2 is the throughput sweet spot (4.4× vs ORT CoreML) and also has the lowest absolute p99 (0.48 ms); depth 4 gives up p50 for the narrowest p50-to-p99 spread (0.56 ms to 0.64 ms). Pipeline depth is chosen via the YSCV_MPS_PIPELINE env var (default 3, clamped 1..=8). The API itself is safe regardless: submit_mpsgraph_plan back-pressures if the caller has more outstanding handles than the pipeline depth.
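The depth selection and back-pressure behaviour can be sketched generically. The code below is an illustration in Python, not the yscv API: `pipeline_depth` and `Pipeline` are hypothetical names, a single worker thread stands in for the GPU queue, falling back to the default on unparsable input is an assumption, and back-pressure is modelled as `submit` blocking while `depth` frames are still unfinished.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor

def pipeline_depth(env=os.environ):
    """Read YSCV_MPS_PIPELINE with default 3, clamped to 1..=8."""
    try:
        depth = int(env.get("YSCV_MPS_PIPELINE", "3"))
    except ValueError:
        depth = 3                        # assumption: bad input uses the default
    return max(1, min(8, depth))

class Pipeline:
    def __init__(self, depth, compute):
        self._slots = threading.Semaphore(depth)
        self._pool = ThreadPoolExecutor(max_workers=1)  # stand-in for the GPU queue
        self._compute = compute

    def submit(self, frame):
        self._slots.acquire()            # blocks while `depth` frames are unfinished
        handle = self._pool.submit(self._compute, frame)
        handle.add_done_callback(lambda _: self._slots.release())
        return handle

    def wait(self, handle):
        return handle.result()

    def close(self):
        self._pool.shutdown(wait=True)

pipe = Pipeline(pipeline_depth({"YSCV_MPS_PIPELINE": "2"}), lambda frame: frame * 2)
handles = [pipe.submit(i) for i in range(5)]
results = [pipe.wait(h) for h in handles]
pipe.close()
print(results)
```

Bounding the number of unfinished submissions is what keeps a fast producer from queuing unbounded work, independent of the depth that happens to give the best throughput.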
Same two-tower Siamese tracker (inputs 1×3×128×128 + 1×3×256×256, fp32 zero-fill on CPU, fp16 on GPU). Compares every backend yscv ships against the corresponding ORT provider on the identical host. p50 over 300 iterations after warmup; yscv vs ORT 1.19.2.
| Backend | yscv | ORT 1.19.2 | yscv vs ORT |
|---|---|---|---|
| CPU 1T | 65 FPS (15.37 ms) | 34 FPS (29.80 ms) | 1.94× |
| CPU 4T | 183 FPS (5.46 ms) | 42 FPS (23.64 ms) | 4.33× |
| GPU sync | 792 FPS (1.26 ms) | 617 FPS (1.62 ms, CoreML) | 1.28× |
| GPU pipelined×2 | 2688 FPS (0.37 ms) | — | 4.4× over ORT CoreML |
| GPU peak burst | 3623 FPS (0.28 ms min) | — | — |
Why yscv CPU beats ORT CPU on Apple Silicon: Accelerate.framework
dispatches BLAS through Apple's AMX (Advanced Matrix eXtensions) block
— a dedicated matrix accelerator inside the CPU complex, separate from
the Neural Engine. ORT's CPUExecutionProvider uses its own
general-purpose SIMD kernels that don't hit AMX, so it leaves ~1.9×
throughput on the table single-thread, widening to 4.3× at four threads
as ORT scales poorly across the M1's P-cores. On Intel the opposite holds:
ORT's oneDNN is highly tuned for AVX-512, and there yscv (which dispatches
through OpenBLAS) is typically 2–3× slower.
Why yscv Metal beats ORT CoreML on Apple Silicon: ORT's
CoreMLExecutionProvider compiles the graph to CoreML and routes
compatible ops to the Apple Neural Engine (ANE) + Metal hybrid. On
the Siamese tracker 216/219 ops run on CoreML; the remaining 3 fall
back to CPU. Every fallback crosses a CPU↔accelerator boundary,
costing synchronization + marshalling. yscv's MPSGraph path compiles
100% of the graph to pure Metal and avoids the hybrid overhead
entirely. Pipelining (3-buffered submit/wait) then overlaps CPU
marshal with GPU compute for another 2× on sustained throughput.
RknnPipelinedPool applies the same pattern to Rockchip NPU cores: one slot per NpuCoreMask, pre-allocated + pre-bound RknnMem per input and output, back-pressured submit/wait. On RK3588 the pool can drive all 3 NPU cores concurrently; on RV1106 pass &[Core0] for a cleanly-typed single-slot async path.
On-device numbers (YOLO / Siamese tracker, int8-quantized .rknn) will be added once captured against a physical Rock 4D. Relative gains are expected to mirror the MPSGraph path — pipeline depth equal to NPU-core count ≈ 3× sync throughput, with tail latency tightening as CPU and NPU stop serialising their handshake.
The Metal backend compiles an ONNX graph into a sequence of MetalOps executed in a single
fused command buffer. Key optimizations (in order of impact):
| Optimization | Impact | Description |
|---|---|---|
| Winograd F(4×4, 3×3) | ~40% of GPU time | 2.25× FLOP reduction for stride-1 3×3 convs; SIMD group matrix multiply with f32 accumulation |
| F16 inter-op pipeline | Halves bandwidth | All intermediate buffers use f16; weights pre-packed as f16 at compile time |
| NEON input upload | Eliminates GPU cast | CPU-side fcvtn+st3 converts f32 NCHW → f16 NHWC faster than GPU kernel |
| Conv+SiLU+Add fusion | Fewer ops | Residual addition and activation fused into conv write-back epilogue |
| Vectorized f16 kernels | Better throughput | All utility ops (concat, split, permute, resize) use half4 vectorized I/O |
| Concat fusion | Eliminates copies | Conv outputs write directly into concat buffer via out_stride/out_offset |
| Detection head fusion | Fused permute+concat | NHWC→NCHW permutations + spatial concat fused into single NhwcToFlatConcat kernel |
| Zero-cost buffer aliasing | No-op reshapes | Reshape/Flatten/Squeeze/Unsqueeze alias existing buffers |
| Parallel softmax | Threadgroup reduction | Adaptive threadgroup size (32/128/256) with shared-memory reduction |
| Widened SiLU look-ahead | Fewer Metal ops | Detects SiLU patterns up to 5 nodes ahead (detection head interleaving) |
| In-place SiLU/Binary | Fewer buffers | Dead input buffers reused as output for elementwise ops |
| Op Type | YOLOv8n | YOLO11n |
|---|---|---|
| ConvWinograd (3×3 stride=1) | 32 | 28 |
| MpsConv (MPS GEMM for 1×1+) | 30 | 51 |
| Concat | 13 | 34 |
| SplitFused | 8 | 9 |
| CpuReshape (GPU permute) | 4 | 13 |
| Binary/BroadcastBinary | 6 | 28 |
| DepthwiseConv | — | 7 |
| Other (MaxPool, Resize, etc.) | 17 | 34 |
VballNetGrid Inference (DSConv model, 16.3 GFLOP) — Apple Silicon, measured Mar–Apr 2026, pending re-measurement
Model: VballNetGridV1b — 13 DSConvBlocks (depthwise 3×3 + pointwise 1×1), 4 MaxPool, head Conv+Sigmoid.
Input [1, 9, 432, 768], output [1, 27, 27, 48], 42 ONNX nodes.
| Stage | Time | FPS | Speedup | What changed |
|---|---|---|---|---|
| yscv BEFORE | 558 ms | 1.7 | — | Single-threaded, scalar depthwise |
| + Multi-threading | 257 ms | 3.9 | 2.1× | ParallelElementwiseConfig::default() in public API |
| + SIMD depthwise | 124.1 ms | 8.1 | 4.5× | NEON/AVX/SSE vectorized depthwise conv |
| onnxruntime CPU | 196.7 ms | 5.1 | — | CPUExecutionProvider baseline |
| onnxruntime CoreML CPU_ONLY | 8.6 ms | 116 | — | BNNS/AMX via CoreML delegate |
| yscv Metal per-op | 47.3 ms | 21.1 | 11.8× | Metal-native fused pipeline, MPS GEMM |
| yscv MPSGraph | 7.8 ms | 128 | 71.5× | Whole-model GPU graph compilation |
yscv CPU (124.1 ms) is 1.6× faster than onnxruntime CPU (196.7 ms) on this depthwise-separable model, with no special flags needed. MPSGraph (7.8 ms) is 1.1× faster than CoreML CPU_ONLY (8.6 ms), which uses Apple's dedicated AMX coprocessor via BNNS. MPSGraph is also 16× faster than the yscv CPU path, reaching 128 FPS on Apple Silicon.
Historical optimization arc, retained for context. The current Zen 4 tracker figures are in the Siamese tracker — CPU inference section at the top of this document (min-of-300, commit
241f36c). The p50 numbers below predate that re-measurement and use a different run protocol.
Model: Siamese tracker, 156 ops after graph optimization, two input branches
(input.1 1×3×128×128 template, input.249 1×3×256×256 search)
joined in connect_model. Primary fp32 CPU benchmark target of the
S./A./R.* perf arc (Apr 2026, 19 sessions).
Methodology: RAYON_NUM_THREADS=N ./onnx-fps --iters 500, median of
3 runs per thread-count, bitwise-identical outputs across all Ns.
ORT 1.24.4 CPUExecutionProvider as the reference.
| Threads | yscv p50 | ORT p50 | gap | yscv scaling | ORT scaling |
|---|---|---|---|---|---|
| 1 | 11.43 ms | 8.05 ms | 1.42× | 1.00× | 1.00× |
| 2 | 6.55 ms | 4.42 ms | 1.48× | 1.74× | 1.82× |
| 4 | 4.15 ms | 2.36 ms | 1.76× | 2.75× | 3.41× |
| 6 | 3.66 ms | 1.74 ms | 2.10× | 3.12× | 4.62× |
| 8 | 3.87 ms | 2.28 ms | 1.70× | 2.95× | 3.53× |
| 12 | 4.02 ms | 1.93 ms | 2.08× | 2.84× | 4.16× |
6T is the sweet spot (physical-core count). Beyond 6T, SMT contention hurts both engines; 12T is strictly worse than 6T.
| op | yscv | ORT | gap | ratio |
|---|---|---|---|---|
| Conv | 6.31 ms (78 ops) | 2.22 ms (114 ops) | +4.09 ms | 2.84× |
| MatMul | 0.19 ms (2) | 0.04 ms (2) | +0.14 ms | 4.17× |
| Reshape | 0.11 ms (5) | 0.01 ms (5) | +0.10 ms | 8.01× |
| Reorder (NCHWc) | — | 0.05 ms (7) | — | ORT-only |
Conv accounts for 94% of the gap, most of it in mid-sized pointwise and inverted-bottleneck layers; ORT uses NCHWc layout throughout while yscv runs NHWC.
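The 94% figure follows from the per-op gaps in the table. The sketch below assumes the total gap is the sum of the three listed positive gaps and leaves out the ORT-only Reorder row.

```python
# Per-op time gaps in ms (yscv minus ORT), from the profile table above.
gaps_ms = {"Conv": 4.09, "MatMul": 0.14, "Reshape": 0.10}

total = sum(gaps_ms.values())
shares = {op: gap / total for op, gap in gaps_ms.items()}
for op, share in shares.items():
    print(f"{op:8s} {share:.0%}")
```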
The custom CPU path layers several fusions and SIMD kernels on top of the blocked GEMM:
- AVX 8×8 / NEON 4×4 NCHW↔NHWC block transposes for layout conversion.
- AVX-512 / AVX2 / NEON depthwise row kernels with fused activation.
- Row-level parallelism for the first 3×3 stride-2 layer.
- Streaming `FusedPwDw` (PW-expand → DW 3×3) with register-blocked accumulators, so the expanded intermediate never hits DRAM.
- `FusedTransposeMatMul`, mirroring ORT's `MatmulTransposeFusion`.
All landings are bitwise-identical or 1-ULP-close to the reference; the suite builds clean on x86_64 + aarch64 and passes with and without BLAS.
Same model and inputs, RTX 4060 added through --features gpu (wgpu
backend over Vulkan) and onnxruntime-gpu 1.25 with CUDAExecutionProvider:
| backend | p50 | min | output 882 max | drift vs CPU ref |
|---|---|---|---|---|
| ORT CUDA EP fp32 (cuDNN, Tensor Cores) | 1.42 ms | 1.40 ms | 48.9193 | −0.17 (TF32) |
| yscv gpu fp16 (wgpu Vulkan) | 5.25 ms | 5.18 ms | 48.1491 | −0.94 |
| yscv gpu fp32 (`YSCV_GPU_FP32=1`) | 5.82 ms | 5.72 ms | 49.0915 | −1 ULP |
| yscv CPU 6T no-BLAS | 3.05 ms | 2.84 ms | 49.0920 | +1 ULP |
| ORT CPU 1T | 8.07 ms | 8.03 ms | 49.0916 | reference |
Two takeaways:
- yscv fp32 GPU output is 1-ULP from the CPU reference, while ORT CUDA EP drifts −0.17 because cuDNN auto-uses Tensor Cores in TF32 / mixed precision on Ampere+ and downcasts implicitly. That is, our fp32 GPU path is the more numerically faithful one against the CPU reference.
- ORT CUDA EP is roughly 4× faster than yscv wgpu (1.42 ms vs 5.25–5.82 ms). The gap is structural: cuDNN ships shape-specific kernels and uses Tensor Cores via `cooperative_matrix`, whereas wgpu compute shaders are vendor-portable WGSL with no MMA path. Closing it requires either a vendor Vulkan extension wgpu doesn't expose or a separate `cuda-backend`.
For Conv-heavy small-batch graphs like this tracker, the CPU runner
(3.05 ms) beats wgpu (5.25 ms) on the same host because GPU launch latency
dominates the actual compute. wgpu starts to win at batch ≥ 4 or once CPU
inference takes ≥ 30 ms. See
gpu-backend-guide.md
for the full positioning matrix and YSCV_GPU_FP32 env-flag docs.
R10 fixed a silently dropped residual_tile in microkernel_4x8_dispatch
on the x86 1×NR-tile path: the SIMD/ASM 4×8 variants never received the
residual pointer, so they dropped the add for Conv+Add shapes whose n
left an 8-wide scalar tail. Output 882 max went from 84.78 to 49.0920
(ORT: 49.0916, 1-ULP FP-ordering drift). Performance also improved
because the non-BLAS path is now fully correct on tracker shapes.
| build | 1T p50 | 6T p50 | 12T p50 | output 882 max |
|---|---|---|---|---|
| yscv no-BLAS (default for onnx-fps) | 11.22 ms | 3.17 ms | 3.34 ms | 49.0920 ✓ |
| yscv with BLAS (OpenBLAS 0.3.31) | 13.00 ms | 8.75 ms | 9.01 ms | 49.0922 ✓ |
| ORT 1.24.4 CPU | 8.07 ms | 1.74 ms | 1.91 ms | 49.0916 ✓ |
yscv no-BLAS vs ORT: 1T 1.39×, 6T 1.82×, 12T 1.75× behind.
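The slowdown factors are plain ratios of the p50 values in the table. A minimal sketch of the arithmetic, with hypothetical helper names:

```rust
/// Slowdown of yscv relative to ORT; a value above 1.0 means yscv is slower.
fn slowdown(yscv_ms: f64, ort_ms: f64) -> f64 {
    yscv_ms / ort_ms
}

/// Rounds to two decimals, matching how the ratios are printed.
fn round2(x: f64) -> f64 {
    (x * 100.0).round() / 100.0
}
```

Applied to the table rows: 11.22 / 8.07 gives 1.39 at 1T, 3.17 / 1.74 gives 1.82 at 6T, and 3.34 / 1.91 gives 1.75 at 12T.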
On this graph BLAS is a net regression — 2.76× slower at 6T than
the non-BLAS path. Root causes: matmul_2d_slices_fused with BLAS
splits into blas_sgemm + apply_epilogue_fallback (two passes over
out, ~5.5 ms of extra L2/L3 traffic at 6T), and the whole arc
(R4/R7/R9/A2) only fires on the non-BLAS branch. OPENBLAS_NUM_THREADS=1
does not close the gap, so it's not pure thread oversubscription —
rayon workers block serially on sgemm instead of running their own
A/B tiles in parallel.
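The two-pass cost can be seen in a toy version of the same structure. The sketch below is not yscv code: the function names are hypothetical, the GEMM is a naive triple loop, and the epilogue is assumed to be bias + ReLU. It contrasts a GEMM followed by a separate epilogue sweep over `out` with a fused variant that finishes each row while it is still hot in cache.

```rust
/// Naive row-major GEMM: C[m×n] = A[m×k] · B[k×n].
fn gemm(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for p in 0..k {
            let aip = a[i * k + p];
            for j in 0..n {
                c[i * n + j] += aip * b[p * n + j];
            }
        }
    }
    c
}

/// Two passes: GEMM writes `out`, then a second sweep re-reads and
/// re-writes all of `out` to apply bias + ReLU.
fn gemm_then_epilogue(a: &[f32], b: &[f32], bias: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut out = gemm(a, b, m, k, n);
    for i in 0..m {
        for j in 0..n {
            out[i * n + j] = (out[i * n + j] + bias[j]).max(0.0);
        }
    }
    out
}

/// Fused: the epilogue runs on each row right after it is accumulated,
/// so `out` is not traversed a second time from memory.
fn gemm_fused(a: &[f32], b: &[f32], bias: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        let row = &mut out[i * n..(i + 1) * n];
        for p in 0..k {
            let aip = a[i * k + p];
            for j in 0..n {
                row[j] += aip * b[p * n + j];
            }
        }
        for j in 0..n {
            row[j] = (row[j] + bias[j]).max(0.0);
        }
    }
    out
}
```

Both variants produce the same values; the difference is only in memory traffic, which is the effect the text attributes to the blas_sgemm + apply_epilogue_fallback split.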
See feature-flags.md
for the full when-to-enable / when-to-disable BLAS checklist.
| Operation | NEON | SSE | AVX |
|---|---|---|---|
| Tensor binary/unary (1M f32) | ✅ 4× unroll | ✅ 4-wide | ✅ 4× unroll (32 elem) |
| Activations (sigmoid/tanh/silu) | ✅ 3-term poly | ✅ poly | ✅ poly |
| Softmax/LogSoftmax | ✅ fused | ✅ fused | ✅ fused |
| MatMul | ✅ BLAS | ✅ BLAS | ✅ BLAS + FMA |
| Conv2d 3×3 | ✅ direct NEON | ✅ direct SSE | ✅ im2col + BLAS |
| Depthwise Conv2d | ✅ 4-wide FMA | ✅ 4-wide | ✅ 8-wide |
| u8 morphology/filter/sobel | ✅ 16B/iter | ✅ 16B/iter | ✅ 32B/iter (AVX2) |
| f32 filter/morphology/geometry | ✅ 4-wide | ✅ 4-wide | ✅ 8-wide |
| Median u8 | ✅ sort network | ✅ sort network | — |
| YUV→RGB | ✅ NEON + GCD | ✅ SSE + threads | ✅ AVX2 + threads |
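Choosing among the per-ISA variants in the table at runtime typically follows the pattern below. This is a minimal sketch, not yscv's dispatch code: the function names are hypothetical, only an AVX path is shown, and non-x86 targets simply fall through to the scalar version. `is_x86_feature_detected!` and the `std::arch` intrinsics are standard-library items.

```rust
/// Portable fallback: out[i] = a[i] + b[i].
fn add_scalar(a: &[f32], b: &[f32], out: &mut [f32]) {
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}

/// AVX variant: 8 f32 lanes per iteration, scalar tail for the remainder.
/// Callers must ensure AVX is available and all slices have equal length.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn add_avx(a: &[f32], b: &[f32], out: &mut [f32]) {
    use std::arch::x86_64::{_mm256_add_ps, _mm256_loadu_ps, _mm256_storeu_ps};
    let n = out.len();
    let mut i = 0;
    while i + 8 <= n {
        // SAFETY: i + 8 <= n and all three slices have length n.
        unsafe {
            let va = _mm256_loadu_ps(a.as_ptr().add(i));
            let vb = _mm256_loadu_ps(b.as_ptr().add(i));
            _mm256_storeu_ps(out.as_mut_ptr().add(i), _mm256_add_ps(va, vb));
        }
        i += 8;
    }
    add_scalar(&a[i..], &b[i..], &mut out[i..]);
}

/// Dispatcher: checks CPU features at runtime and picks a variant.
fn add_f32(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert!(a.len() == out.len() && b.len() == out.len());
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was verified just above.
            unsafe { add_avx(a, b, out) };
            return;
        }
    }
    add_scalar(a, b, out);
}
```

The feature check is cached by the standard library after the first call, so the dispatcher adds only a cheap branch per invocation.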
- 315 #[target_feature]-gated SIMD functions with runtime CPU detection
- All dispatch functions #[inline] for cross-crate inlining
- AlignedVec::uninitialized: skip output zeroing in hot paths
- ImageU8/ImageF32: zero-overhead wrappers bypass Tensor allocation
- GCD dispatch_apply: macOS near-zero threading (~0.3µs)
- mimalloc: thread-local arena pools
- Fused kernels: single-pass softmax, sigmoid, attention
- im2col + BLAS: Accelerate/OpenBLAS for matmul/conv2d/conv3d
- Flash Attention: tiled O(Br×Bc) memory, online softmax
- Integer GEMM: quantized matmul with i32 accumulation (no dequant overhead)
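The online softmax mentioned in the list computes the running maximum and the running sum in a single sweep, rescaling the sum whenever the maximum grows. The sketch below is a scalar illustration of that recurrence, not the yscv kernel; the function name is hypothetical and inputs are assumed to be finite.

```rust
/// Online softmax: one sweep tracks the running max `m` and the sum
/// `s` of exp(x - m), rescaling `s` when `m` increases. A final sweep
/// normalizes. Assumption: all inputs are finite.
fn online_softmax(x: &[f32]) -> Vec<f32> {
    let mut m = f32::NEG_INFINITY;
    let mut s = 0.0f32;
    for &v in x {
        let m_new = m.max(v);
        // Rescale the old sum to the new max, then add the new term.
        s = s * (m - m_new).exp() + (v - m_new).exp();
        m = m_new;
    }
    x.iter().map(|&v| (v - m).exp() / s).collect()
}
```

Compared with the textbook formulation (one pass for the max, one for the sum, one to normalize), this saves a full read of the input. The same recurrence is what lets tiled attention process keys block by block without materializing the full score matrix.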
| Crate | Tests | Coverage |
|---|---|---|
| yscv-model | 365 | Serialization, training loops, data loading, distributed |
| yscv-imgproc | 225 | All u8/f32 ops, SIMD paths, color conversion |
| yscv-video | 230 | H.264/HEVC decode (Main + Main10 + Rext + weighted prediction + tiles + WPP + chroma deblock/SAO), MP4/MKV parsing, HW detect |
| yscv-tensor | 207 | Elementwise, matmul, broadcast, BLAS dispatch |
| yscv-kernels | 120 | CPU ops, GPU backend, SIMD activation, GEMM |
| yscv-autograd | 106 | Forward/backward graph, all op gradients |
| yscv-eval | 95 | COCO/YOLO/VOC/KITTI/WiderFace/MOT metrics |
| yscv-onnx | 166 | Per-operator coverage for all 122 CPU dispatch arms, fusion regressions, quantization, vision ops |
| yscv-optim | 76 | SGD, Adam, LR schedulers, weight decay |
| yscv-detect | 60 | YOLOv8/v11 decode, NMS, letterbox |
| yscv-track | 57 | Hungarian, Kalman, IoU matching |
| yscv-cli | 42 | Config parsing, diagnostics |
| yscv-recognize | 16 | Embedding extraction, cosine similarity |
| Total | 1808 | |
| Platform | Runner | Features | What's Tested |
|---|---|---|---|
| macOS (ARM) | macos-latest | default + videotoolbox | Full workspace + HW decode |
| Linux (x86) | ubuntu-latest | default + gpu | Full workspace + WGPU |
| Linux (ARM) | ubuntu-24.04-arm | default | Full workspace + NEON |
| Windows | windows-latest | default | Full workspace |
- workspace-compat: Multi-platform build + test
- quality: fmt + clippy + CLI integration + benchmark gates + eval format verification
- miri: Unsafe code soundness (yscv-tensor, yscv-kernels)
- hw-decode: VideoToolbox (macOS), SW fallback (Linux/Windows)
- benchmark gates: Criterion microbenchmarks for 12 crates with trend tracking
- Video: examples/src/CENSUSWITHOUTLOGO.mp4 (41MB, H.264 1080p60, 1100 frames)
- Images: examples/src/testtraffic.png, testtraffic2.png (3.4-3.5MB)
- Models: examples/src/slowwork/yolo{v8n,11n}.onnx (10-12MB, gitignored)
- Eval samples: benchmarks/eval-* (all formats, <10KB each)
- Fuzz corpus: fuzz/corpus/ (H.264, HEVC, MKV seed files)
- Baselines: benchmarks/ci-baseline-*.txt, trend-baseline-*.tsv
# Linux-local CI subset before pushing
bash scripts/check-ci-local.sh
# Broader local pass: release tests, extended proptests, UX, benchmark gates
bash scripts/check-ci-local.sh --full
# Add local cross-target smoke where toolchains are available
bash scripts/check-ci-local.sh --cross --all-features
# Full workspace test (1693 tests)
cargo test
# Single crate
cargo test -p yscv-video
# Video decode benchmark
cargo run --release --example bench_video_decode -- examples/src/CENSUSWITHOUTLOGO.mp4
# Compare with ffmpeg
ffmpeg -threads 1 -benchmark -i examples/src/CENSUSWITHOUTLOGO.mp4 -f null -
# Criterion microbenchmarks
cargo bench -p yscv-kernels
cargo bench -p yscv-imgproc
# Miri soundness check
cargo +nightly miri test -p yscv-tensor --lib
# Fuzz testing
cd fuzz && cargo fuzz run fuzz_h264_nal -- -max_total_time=60