Every yscv-* crate has Cargo feature flags that gate optional code paths:
GPU backends, hardware codecs, real-time scheduling, platform-specific
acceleration. This is the canonical reference — what each flag does,
what it requires on the host, how to turn it on.
For specific backends with lots of surface area (MPSGraph, wgpu) see the dedicated guides linked inline.
yscv (umbrella)
├── blas default ON — BLAS matmul dispatch (Accelerate / OpenBLAS)
├── gpu wgpu cross-platform GPU → gpu-backend-guide.md
├── metal-backend Apple MPSGraph (macOS GPU) → mpsgraph-guide.md
├── rknn Rockchip NPU → edge-deployment.md
├── rknn-validate Dry-run RKNN load at config time
├── realtime SCHED_FIFO + affinity + mlockall + governor
├── native-camera V4L2 / AVFoundation / MediaFoundation
├── mkl Intel MKL for x86 matmul
├── armpl ARM Performance Libraries for aarch64 matmul
├── drm Linux KMS display output (yscv-video)
├── mpp Rockchip MPP hardware H.264/H.265 encoder (yscv-video-mpp)
├── videotoolbox Apple VideoToolbox HW decode (macOS/iOS)
├── vaapi Intel/AMD VA-API HW decode (Linux)
├── nvdec NVIDIA NVDEC HW decode (Linux/Windows)
├── mediafoundation Microsoft MediaFoundation HW decode (Windows)
└── profile Per-op ONNX profiling instrumentation
In your Cargo.toml:
[dependencies]
yscv = { version = "0.1", features = ["rknn", "realtime"] }
# or:
yscv-onnx = { version = "0.1", features = ["metal-backend"] }
# or per-crate for finer control:
yscv-kernels = { version = "0.1", features = ["gpu"] }
yscv-video = { version = "0.1", features = ["native-camera", "vaapi"] }On the command line:
cargo build --release --features "rknn realtime"
cargo run --release --features "metal-backend gpu" --example bench_mpsgraph
cargo test --workspace --features "rknn metal-backend gpu realtime rknn-validate"These change which execution path run_onnx_model / dispatch_frame /
RknnPipelinedPool route to. Multiple can be combined — the dispatcher
picks at runtime based on your TOML/API choice.
Default ON in the yscv-kernels / yscv-onnx crates; default OFF in
the private/onnx-fps bench harness. Links a CBLAS implementation for
SGEMM dispatch:
- macOS —
Accelerate.framework(built into OS, zero setup). Uses Apple's AMX matrix accelerator; ~2× faster than OpenBLAS here. - Linux —
libopenblas.so.0. Install withsudo apt install libopenblas-dev(Debian/Ubuntu) orsudo dnf install openblas-devel(Fedora). - Windows — OpenBLAS via vcpkg.
vcpkg install openblas:x64-windowsand setOPENBLAS_PATHorVCPKG_ROOTenv.
Turn ON when:
- Transformer attention (
QKᵀ,AV— big square-ish GEMM). - ResNet / ViT classification (FC 2048×1000 output).
- LLM K/V projection, output head, any decoder step.
- Graphs where one MatMul is tens of millions of FLOPs and there's no adjacent fusion opportunity.
- macOS anywhere —
Accelerate+ AMX is hard to beat and cheap to link.
Turn OFF when:
- Conv-heavy graphs (object detectors, Siamese trackers, U-Nets, segmentation). Our blocked GEMM fuses Conv+Bias+Residual+Activation in registers; BLAS can't — it writes GEMM output first, then a separate pass adds bias/residual and applies activation.
- You're on Linux / Windows with OpenBLAS 0.3.x and hit the
oversubscription wall — our non-BLAS path hands all work to rayon
and scales cleanly; OpenBLAS's internal threadpool fights rayon
even with
OPENBLAS_NUM_THREADS=1. - You care about single-binary deployment and don't want to carry
libopenblasinto a container image.
Measured on the public Siamese tracker (Conv 85 %, DW 9 %, MatMul 5 %), Zen 4 6C/12T, 2026-04 arc state:
| path | 1T p50 | 6T p50 |
|---|---|---|
| yscv no-BLAS | 11.22 ms | 3.17 ms |
| yscv with BLAS | 13.00 ms | 8.75 ms |
| ONNX Runtime 1.24 | 8.07 ms | 1.74 ms |
BLAS makes this workload 2.76× slower at 6T because it breaks
the fused-epilogue path (row_gemm_set_parallel_fused,
blocked_gemm MR=4×24 / MR=12×32 with in-register residual+bias+
activation) — all the R4 / R7 / R9 / A2 landings in the April arc
only fire on the non-BLAS branch.
Runtime CPU identity is centralized in yscv-cpu. Raw
is_*_feature_detected! probes are allowed only in
crates/yscv-cpu/src/detect_*; scripts/check-runtime-dispatch.sh
enforces that the rest of the workspace reads cached
host_cpu().features.
For benchmark logs and bug reports:
println!("{}", yscv_kernels::dispatch_report());For structured tooling:
let dispatch = yscv_kernels::runtime_dispatch_report();
let config = yscv_kernels::runtime_config_report();runtime_dispatch_report() includes the cached CPU identity, standalone
SIMD paths, matmul/conv dispatch gates, INT8 paths, and
RuntimeConfigReport. runtime_config_report() records every active
YSCV_* environment override visible to the process in stable name order,
so new A/B knobs show up automatically without updating a hand-written
allowlist.
Default OFF. Set YSCV_AVX512_SGEMM=1 to enable.
Uses the full 32-ZMM register file (24 accumulators, 2 B-load,
1 A-broadcast, 5 epilogue scratch). Epilogue supports: bias, residual,
Relu (all in asm), SiLU (ZMM post-store pass via bit-trick exp).
m-tail (<12 rows) handled via tail_tile copy path.
Session prepack also produces NR=32 layout (PackedB::data_nr32) when
enabled, eliminating per-inference B packing.
Shapes routed through 12×32: n%32==0 && m≥12 && k≥16. Shapes that fall through: routed to MR=4×24 AVX2 (default).
Zen 4 note: 4×24 AVX2 wins on AMD Zen 4 (verified 2026-05-14). Measured regression: +156µs 1T vs 4×24 baseline even with session prepack. Zen 4 executes 2 YMM FMAs/clock vs 1 ZMM FMA/clock, giving 4×24 a 2× throughput advantage for the tracker's shape mix. Enable on Intel (Sapphire Rapids+) or AMD Zen 5 where ZMM is native.
quantize_tracker --format qdq defaults to YSCV_QUANT_FAST=1 behavior:
constant QDQ weight dequantizers are folded into fp32 initializers and QDQ
pairs between Conv-like nodes are stripped. This keeps the exported tracker
model runnable while restoring loader-time Conv layout normalisation and weight
prepacking for yscv benchmarks. Regular, grouped, and depthwise Conv weights
are exported in standard OIHW quantized form before the yscv-fast cleanup folds
constant weight DQ nodes for reload; dead quantized tensors/scales are pruned
after fold/strip. Set YSCV_QUANT_FAST=0 when debugging the fully explicit
QDQ graph.
Runtime A/B knob for quantized ONNX execution. By default yscv folds
matching DequantizeLinear -> [Relu] -> QuantizeLinear boundaries into a
single quant-domain action and records those executions in
quant_runtime_stats(). Set YSCV_QUANT_INT8_FAST=0 to force the explicit
Q/DQ boundary path while leaving standard QLinearConv / QLinearMatMul
kernels enabled. This is a debug/measurement escape hatch, not a production
tuning flag.
On Transformer workloads the tradeoff inverts: one big sgemm per
attention head is the bulk of the work, and Accelerate / MKL
outperform yscv's hand-tuned blocked GEMM there.
# Library default — keeps BLAS on, good for Transformer / classifier
yscv = "0.1"
# Conv-heavy deployment — no-BLAS is faster, no external lib
yscv = { version = "0.1", default-features = false }See performance-benchmarks.md for the
tracker-specific numbers. On small-channel depthwise/pointwise models
like the tracker, the custom blocked-GEMM + streaming-fused Conv paths
beat a generic BLAS call, so the non-BLAS build is often faster there.
Enables compile_mpsgraph_plan / run_mpsgraph_plan / pipelined
submit+wait API. Requires macOS 13+, best on Apple Silicon.
Full guide: mpsgraph-guide.md — when to use,
API, pipelining, troubleshooting, performance numbers.
Quick example:
use yscv_onnx::{compile_mpsgraph_plan, run_mpsgraph_plan};
let plan = compile_mpsgraph_plan(&model, &[("images", &sample)])?;
let outputs = run_mpsgraph_plan(&plan, &[("images", bytes)])?;Vulkan / Metal / DX12 / OpenGL via wgpu. Works on AMD, NVIDIA, Intel,
Apple, Adreno, Mali — single Rust binary.
Full guide: gpu-backend-guide.md — platform
selection, compiled plans, multi-GPU, NixOS LD setup, fp16/fp32
toggle, troubleshooting.
Quick example:
use yscv_onnx::run_onnx_model_gpu;
let outputs = run_onnx_model_gpu(&model, inputs)?;Runtime flags:
YSCV_GPU_FP32=1— disable opportunistic fp16 (SHADER_F16), force every kernel + buffer through fp32. Output becomes 1-ULP identical to the CPU runner; latency rises ~12-15 % on memory-bound graphs. Use for accuracy A/B; ship default fp16.WGPU_BACKEND={vulkan|dx12|metal|gl}— pin the backend (wgpu env; default is "best-of"). Useful for forcing software fallback in CI.WGPU_ADAPTER_NAME=...— pick a specific GPU when multiple are visible.
Links librknnrt.so at runtime via dlopen. Builds on any platform;
runs only on Rockchip SoCs with the runtime lib installed. Covers
RK3588 / RK3576 / RV1106.
Full guide: edge-deployment.md — DMA-BUF
zero-copy, SRAM, MPP zero-copy from hardware decoder, dynamic-shape
matmul, custom OpenCL ops.
Quick example:
use yscv_kernels::{RknnPipelinedPool, NpuCoreMask};
let pool = RknnPipelinedPool::new(&model_bytes,
&[NpuCoreMask::Core0, NpuCoreMask::Core1, NpuCoreMask::Core2])?;Host setup on the Rockchip board:
# librknnrt.so from rockchip-linux/rknn-toolkit2 release
sudo cp librknnrt.so /usr/lib/aarch64-linux-gnu/
sudo ldconfigOptional companion to rknn. When enabled, PipelineConfig::validate_models
calls RknnBackend::load at config time to catch corrupt-but-magic-OK
.rknn files before any RT threads start.
Adds ~100 ms to startup. Worth it in production.
yscv-pipeline = { version = "0.1", features = ["rknn", "rknn-validate"] }Replaces OpenBLAS with Intel's MKL library. 2-3× faster matmul on Intel CPUs (AVX-512 + per-cpu dispatched kernels).
Setup:
# Install Intel oneAPI Base Toolkit (includes MKL)
# https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html
# Or on Debian/Ubuntu:
sudo apt install intel-mkl
# Point cargo at it:
export MKLROOT=/opt/intel/oneapi/mkl/latestThen:
yscv = { version = "0.1", features = ["mkl"] }No code changes. Matmul and Conv silently take the MKL path.
When to use: Intel servers (Xeon, Core i9), Windows/Linux x86 dev.
When NOT to use: Apple Silicon (Accelerate beats MKL), AMD (MKL
historically pessimises on non-Intel, set MKL_DEBUG_CPU_TYPE=5 to
force AVX2 path).
Same idea, for aarch64 Linux. Arm's hand-tuned NEON/SVE matmul kernels. Biggest win on AWS Graviton, Ampere, or any server ARM.
Setup:
# Download from https://developer.arm.com/downloads/-/arm-performance-libraries
# Install to /opt/arm/armpl-xxx
export ARMPL_DIR=/opt/arm/armpl-23.10yscv = { version = "0.1", features = ["armpl"] }When to use: aarch64 Linux servers.
When NOT to use: Apple Silicon (use default Accelerate), Rockchip
(use rknn for NPU instead of CPU matmul).
Wires Linux real-time scheduling primitives through the pipeline
framework. When set, PipelineConfig with [realtime] section will
apply:
- SCHED_FIFO — real-time priority for dispatch threads
- CPU affinity — pin threads to specific cores, stop cache bouncing
- mlockall — prevent memory paging (no swap-out glitches mid-flight)
- cpufreq governor — write
"performance"to every core, kill DVFS step-up latency
Setup:
[dependencies]
yscv-pipeline = { version = "0.1", features = ["realtime"] }In your TOML:
[realtime]
sched_fifo = true
cpu_governor = "performance"
prio.dispatch = 70
affinity.dispatch = [4, 5, 6]Linux capabilities required:
- CAP_SYS_NICE for SCHED_FIFO
- CAP_SYS_ADMIN for cpufreq governor writes
- Process memlock limit raised for mlockall
Production setup via systemd unit:
[Service]
ExecStart=/usr/local/bin/my-pipeline config.toml
AmbientCapabilities=CAP_SYS_NICE CAP_SYS_ADMIN
LimitMEMLOCK=infinity
LimitRTPRIO=99Or via setcap on the binary:
sudo setcap cap_sys_nice,cap_sys_admin+ep ./my-pipelineGraceful fallback: if the process lacks a capability, the framework logs a warning and continues without that specific feature. Your pipeline runs; real-time guarantees are partial.
Verify what actually applied after startup by checking the stderr log:
[yscv-pipeline] realtime partial: sched_fifo=true affinity=true mlockall=true governor_cores=8
If sched_fifo=false: missing CAP_SYS_NICE. If mlockall=false:
memlock limit too low. If governor_cores=0: missing CAP_SYS_ADMIN.
On non-Linux hosts (macOS, Windows) the feature compiles but all
primitives return NotSupported — safe to leave enabled in cross-
platform code.
Native platform camera APIs for live capture. Wraps nokhwa crate
which handles the three platforms uniformly.
Setup per platform:
- Linux — V4L2 always available, no extra install. Permissions:
user must be in
videogroup (sudo usermod -aG video $USER). - macOS — AVFoundation, works out of the box. On first access
macOS prompts for camera permission (
Info.plistNSCameraUsageDescription needed for app bundles; console apps typically prompt in Terminal). - Windows — MediaFoundation, works out of the box. UWP apps need
Webcamcapability in the manifest.
Setup in Cargo:
yscv-video = { version = "0.1", features = ["native-camera"] }Usage:
use yscv_video::{CameraConfig, V4l2Camera}; // Linux
let mut cam = V4l2Camera::open(CameraConfig {
device: "/dev/video0".into(),
width: 1280, height: 720,
fps: 60,
format: yscv_video::PixelFormat::Nv12,
})?;
loop {
let frame = cam.capture_frame()?;
// process frame…
}On Rockchip boards: V4L2 works; for zero-copy to NPU use
VIDIOC_EXPBUF to get a DMA-BUF fd, then RknnBackend::wrap_fd to
bind it as an NPU input. Full recipe in edge-deployment.md.
Platform-specific HW decoders on top of the software H.264/HEVC/AV1 decoders. yscv automatically detects availability at runtime; software decode is always the fallback.
| Flag | Platform | Codecs |
|---|---|---|
videotoolbox |
macOS / iOS | H.264, HEVC, ProRes |
vaapi |
Linux | H.264, HEVC, AV1 (on Intel ≥ Xe, AMD ≥ RDNA2) |
nvdec |
Linux / Windows + NVIDIA | H.264, HEVC, AV1 (Ampere+) |
mediafoundation |
Windows | H.264, HEVC, AV1 (hardware-dependent) |
Setup:
yscv-video = { version = "0.1", features = ["videotoolbox"] } # macOS
yscv-video = { version = "0.1", features = ["vaapi"] } # Linux/Intel/AMD
yscv-video = { version = "0.1", features = ["nvdec"] } # Linux/Windows NVIDIA
yscv-video = { version = "0.1", features = ["mediafoundation"] }# WindowsSystem deps per platform:
- macOS VideoToolbox: built-in, no install.
- Linux VA-API:
sudo apt install libva-dev intel-media-va-driver(ormesa-va-driversfor AMD). Check withvainfo. - NVIDIA NVDEC: ships with the NVIDIA driver.
nvidia-smimust work. - Windows MediaFoundation: built into Windows 10+.
Usage — automatic detection via HwDecoder:
use yscv_video::{HwDecoder, VideoCodec};
let mut decoder = HwDecoder::new(VideoCodec::H264)?;
// Tries VideoToolbox/VAAPI/NVDEC/MediaFoundation in that order, falls
// back to software if none available. Use it the same way either way.
let frame = decoder.decode(packet)?;See video-pipeline.md for container parsing
(MP4/MKV), zero-copy decode, and performance numbers vs ffmpeg.
Direct framebuffer output via DRM/KMS. For headless edge devices
displaying video without X/Wayland. Used by pipeline output.kind = "drm".
Setup:
# User needs access to /dev/dri/card0
sudo usermod -aG video $USERyscv-video = { version = "0.1", features = ["drm"] }TOML config:
[output]
kind = "drm"
connector = "HDMI-A-1"
mode = "720p60"Hardware H.264/H.265 encoding on Rockchip SoCs via
librockchip_mpp.so (also dlopen). Used by pipeline
encoder.kind = "mpp-h264".
Setup on Rockchip board:
sudo apt install librockchip-mpp1 # or from rockchip-linux/mpp releasesyscv-video-mpp = { version = "0.1", features = ["mpp"] }Instruments every op in the ONNX runner with wall-time measurement. Useful to find which op is a bottleneck.
yscv-onnx = { version = "0.1", features = ["profile"] }use yscv_onnx::profile_onnx_model_cpu;
// Runs one unfused inference and prints a per-op timing table plus a
// "Top <filter> layers" detail table. Conv rows show the dispatched
// microkernel as `runner-dispatch` or `runner-dispatch/kernel-sub-path`:
// 0.42ms [1,3,128,128] → [1,16,64,64] k=[3,3] s=[2,2] via nhwc-padded/first-layer-rgb at stem
// 0.10ms [1,64,40,40] → [1,64,40,40] k=[3,3] s=[1,1] via dw-nhwc-padded/dw-avx512 at neck.dw
// 0.31ms [64,128] → [64,96] via blocked-mr4 at head.proj (MatMul)
profile_onnx_model_cpu(&model, inputs)?;Overhead is small (~1% per op) but leave it off in production — the counters persist across runs.
Runtime knobs (no rebuild needed):
YSCV_PROFILE_FILTER=<spec>— narrow the detail table and the JSON to a node subset. The spec is comma-separated: a bare token matches an op type exactly (Conv,Conv,MatMul),name:<substr>matches a node-name substring (name:head). Unset → the default Conv detail table and an unfiltered JSON.YSCV_PROFILE_JSON=<path>— additionally write a machine-readable per-node JSON (name/op/ms/shapes; Conv nodes also carrykernel_shape,stridesand the dispatchedkernel). Consumed byscripts/gap_diff.py.YSCV_RUNNER_PROFILE=<path>— aggregate per-node timings over the fused runner path across a whole bench loop; dump withdump_runner_profile. Each node carries the same dispatched-kernellabel as the CPU profiler (Conv sub-path / MatMul-Gemm family); fused streaming kernels (e.g.FusedDwPw) bypass the instrumented dispatch and carry none.
Diff two profiles with scripts/gap_diff.py a.json b.json: --filter takes
the same spec as YSCV_PROFILE_FILTER, and --per-node diffs node-by-node,
flagging a dispatched-kernel change as old→new — ideal for a
baseline-vs-change comparison of the same model.
Cargo features select backend code at build time. For ONNX CPU micro-tuning and A/B you also have runtime env toggles.
Most used knobs:
YSCV_FUSED_PW_DW_STREAM_OFF=1— disable PW->DW streaming fused path.YSCV_FUSED_PW_DW_PW2X_OFF=1— disable NEON PW2X inner-loop variant.YSCV_FUSED_PW_DW_W_TILE=<N>— override fused PW->DW strip-mining tile.YSCV_FUSED_PW_DW_DW_INTERIOR=1— opt into the experimental AVX-512 stride-1/stride-2 depthwise interior fast path inside PW->DW fusion.YSCV_FUSED_DW_PW_STREAM_OFF=1— disable DW->PW streaming fused path.YSCV_FUSED_DW_PW_STREAM_PADDED=1— enable padded streaming variant on non-x86; x86_64 enables it by default.YSCV_FUSED_DW_PW_STREAM_PADDED_OFF=1— disable x86_64 default padded DW->PW streaming.YSCV_FUSED_DW_PW_ROW_BATCH=<N>— override DW->PW streaming y-band size.YSCV_DIRECT_CONV_WORK_MAX=<N>— direct-3x3 routing threshold.YSCV_NO_POINTWISE_16X16_DIRECT=1— disable the direct NHWC 1×1K=16,N=16pointwise kernel with fused bias/residual/activation epilogue.YSCV_NO_POINTWISE_NX16_DIRECT=1— disable the single-thread direct NHWC residual pointwise kernel for small-M,N % 16 == 0ConvAdd blocks.YSCV_POOL=yscv— route standalonepar_chunks_mut_dispatchkernels through the pinned yscv thread pool instead of Rayon. This is useful for single-op CPU benches and lowers the softmax parallel threshold for the low-overhead pool; the default remains Rayon unless this env var is set.YSCV_POOL_SPIN_US=<N>— keep yscv-pool workers in a short idle spin before parking. Default is0, which is tuned for normal inference graphs. For tight standalone single-op microbench loops withYSCV_POOL=yscv, values around200can remove the last park/unpark overhead on Zen 4.YSCV_X86_MEMORY_SIMD=avx2— on x86/x86_64 with AVX-512 available, route memory-bound standalone elementwise/ReLU dispatch through the 256-bit AVX path for A/B measurement. Default keeps AVX-512 enabled.YSCV_NO_AARCH64_LOW_K_BLOCKED=1andYSCV_AARCH64_LOW_K_BLOCKED_MIN_WORK_FMAS=<N>— low-k blocked matmul route.YSCV_NO_X86_LOW_K_BLOCKED=1— disable the x86 low-k pointwise route through the blocked 4x24/4x16 AVX+FMA GEMM kernels.
Symmetric QLinearConv depthwise 3x3/5x5 stride-1 nodes, plus measured-win
stride-2 tracker shapes, route through the multi-arch INT8 depthwise kernel
automatically. Unsupported QLinear shapes fall back to the portable path, so
there is no runtime flag for this route.
VNNI-friendly explicit QLinear pointwise/MatMul weights (K >= 4,
N % 16 == 0) are also packed once at model load into the AVX-512 VNNI 4x16
RHS layout; this is part of the standard runtime index and has no separate
feature flag. Entry QuantizeLinear nodes that produce QLinear activation
edges write direct i8 side-table tensors via the shared scalar/AVX2/AVX-512F/
NEON quantizer; x86 paths pack SIMD lanes directly to i8, avoiding the old scalar collection path for real quant-domain
storage.
Canonical reference (with defaults, scope, and reproduction commands):
onnx-cpu-kernels.md.
# Pure CPU, zero system deps (works anywhere, slowest)
yscv = { version = "0.1", default-features = false }
# Apple Silicon dev — everything Apple has
yscv = { version = "0.1", features = ["metal-backend", "videotoolbox", "native-camera"] }
# Intel server — best CPU inference
yscv = { version = "0.1", features = ["mkl", "gpu", "vaapi"] }
# NVIDIA server
yscv = { version = "0.1", features = ["gpu", "nvdec"] }
# Rockchip drone / FPV — full edge stack
yscv-pipeline = { version = "0.1", features = ["rknn", "rknn-validate", "realtime"] }
yscv-video = { version = "0.1", features = ["drm"] }
yscv-video-mpp = { version = "0.1", features = ["mpp"] }
# Windows desktop
yscv = { version = "0.1", features = ["gpu", "mediafoundation"] }
# Cross-platform CV library (no accelerators, works anywhere including WASM)
yscv = { version = "0.1", default-features = false }For flags with lots of surface area, the table above links out:
mpsgraph-guide.md—metal-backend(MPSGraph API, pipelining)gpu-backend-guide.md—gpu(wgpu, multi-GPU, compiled plans)edge-deployment.md—rknn(DMA-BUF, SRAM, custom ops)pipeline-config.md—realtimeTOML schemavideo-pipeline.md— HW decode + container parsing
If a feature works on one platform but not another, open an issue with:
- Feature list from your
Cargo.toml rustc --version,uname -a, GPU / CPU model- Exact error or symptom
cargo tree -e features -p <crate>output — shows what got resolved