Feature flags — what they do and how to set them up

Every yscv-* crate has Cargo feature flags that gate optional code paths: GPU backends, hardware codecs, real-time scheduling, platform-specific acceleration. This is the canonical reference — what each flag does, what it requires on the host, how to turn it on.

For specific backends with lots of surface area (MPSGraph, wgpu) see the dedicated guides linked inline.

TL;DR — the map

yscv (umbrella)
├── blas            default ON — BLAS matmul dispatch (Accelerate / OpenBLAS)
├── gpu             wgpu cross-platform GPU            → gpu-backend-guide.md
├── metal-backend   Apple MPSGraph (macOS GPU)         → mpsgraph-guide.md
├── rknn            Rockchip NPU                       → edge-deployment.md
├── rknn-validate   Dry-run RKNN load at config time
├── realtime        SCHED_FIFO + affinity + mlockall + governor
├── native-camera   V4L2 / AVFoundation / MediaFoundation
├── mkl             Intel MKL for x86 matmul
├── armpl           ARM Performance Libraries for aarch64 matmul
├── drm             Linux KMS display output (yscv-video)
├── mpp             Rockchip MPP hardware H.264/H.265 encoder (yscv-video-mpp)
├── videotoolbox    Apple VideoToolbox HW decode (macOS/iOS)
├── vaapi           Intel/AMD VA-API HW decode (Linux)
├── nvdec           NVIDIA NVDEC HW decode (Linux/Windows)
├── mediafoundation Microsoft MediaFoundation HW decode (Windows)
└── profile         Per-op ONNX profiling instrumentation

How to set features

In your Cargo.toml:

[dependencies]
yscv = { version = "0.1", features = ["rknn", "realtime"] }
# or:
yscv-onnx = { version = "0.1", features = ["metal-backend"] }
# or per-crate for finer control:
yscv-kernels = { version = "0.1", features = ["gpu"] }
yscv-video   = { version = "0.1", features = ["native-camera", "vaapi"] }

On the command line:

cargo build --release --features "rknn realtime"
cargo run   --release --features "metal-backend gpu" --example bench_mpsgraph
cargo test  --workspace --features "rknn metal-backend gpu realtime rknn-validate"

Backend flags (pick what you're deploying on)

These change which execution path run_onnx_model / dispatch_frame / RknnPipelinedPool route to. Multiple can be combined — the dispatcher picks at runtime based on your TOML/API choice.
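
A rough sketch of how the two layers fit together: Cargo features decide what is compiled in, and the runtime choice is then resolved against that set. The `Backend` enum, `compiled_backends`, and `select_backend` are hypothetical names for illustration, not yscv API; only the feature names come from the map above, and treating the CPU path as always present is an assumption.

```rust
/// Hypothetical backend identifiers; the names mirror the Cargo feature flags.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Cpu,
    Gpu,
    Metal,
    Rknn,
}

/// Backends compiled into this binary. `cfg!` is evaluated at build time,
/// so a feature that was not enabled contributes nothing here.
#[allow(unexpected_cfgs)]
fn compiled_backends() -> Vec<Backend> {
    let mut out = vec![Backend::Cpu]; // assumption: the CPU path is always built
    if cfg!(feature = "gpu") {
        out.push(Backend::Gpu);
    }
    if cfg!(feature = "metal-backend") {
        out.push(Backend::Metal);
    }
    if cfg!(feature = "rknn") {
        out.push(Backend::Rknn);
    }
    out
}

/// Resolve a runtime request (for example from a config file) against the
/// set of backends that were compiled in.
fn select_backend(requested: &str, compiled: &[Backend]) -> Result<Backend, String> {
    let wanted = match requested {
        "cpu" => Backend::Cpu,
        "gpu" => Backend::Gpu,
        "metal-backend" => Backend::Metal,
        "rknn" => Backend::Rknn,
        other => return Err(format!("unknown backend `{other}`")),
    };
    if compiled.contains(&wanted) {
        Ok(wanted)
    } else {
        Err(format!("backend `{requested}` was not enabled at build time"))
    }
}

fn main() {
    let compiled = compiled_backends();
    println!("compiled backends: {compiled:?}");
    println!("cpu  -> {:?}", select_backend("cpu", &compiled));
    println!("rknn -> {:?}", select_backend("rknn", &compiled));
}
```

The point of the split is that asking for a backend that was not built should fail loudly at config time rather than silently fall back.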

`blas` — BLAS-accelerated matmul

Default ON in the yscv-kernels / yscv-onnx crates; default OFF in the private/onnx-fps bench harness. Links a CBLAS implementation for SGEMM dispatch:

macOS — Accelerate.framework (built into OS, zero setup). Uses Apple's AMX matrix accelerator; ~2× faster than OpenBLAS here.
Linux — libopenblas.so.0. Install with sudo apt install libopenblas-dev (Debian/Ubuntu) or sudo dnf install openblas-devel (Fedora).
Windows — OpenBLAS via vcpkg. vcpkg install openblas:x64-windows and set OPENBLAS_PATH or VCPKG_ROOT env.

Turn ON when:

Transformer attention (QKᵀ, AV — big square-ish GEMM).
ResNet / ViT classification (FC 2048×1000 output).
LLM K/V projection, output head, any decoder step.
Graphs where one MatMul is tens of millions of FLOPs and there's no adjacent fusion opportunity.
macOS anywhere — Accelerate + AMX is hard to beat and cheap to link.

Turn OFF when:

Conv-heavy graphs (object detectors, Siamese trackers, U-Nets, segmentation). Our blocked GEMM fuses Conv+Bias+Residual+Activation in registers; BLAS can't — it writes GEMM output first, then a separate pass adds bias/residual and applies activation.
You're on Linux / Windows with OpenBLAS 0.3.x and hit the oversubscription wall — our non-BLAS path hands all work to rayon and scales cleanly; OpenBLAS's internal threadpool fights rayon even with OPENBLAS_NUM_THREADS=1.
You care about single-binary deployment and don't want to carry libopenblas into a container image.
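
To make the fusion argument concrete, here is a toy comparison (not yscv's kernel): the same `relu(A·B + bias + residual)` computed once with the epilogue applied while the accumulator is still in a local variable, and once BLAS-style, with a plain GEMM followed by a second pass over the output. The function names and the row-major layout are assumptions for illustration.

```rust
/// Fused: bias, residual and ReLU are applied before the accumulator is
/// stored, so each output element is written once and never re-read.
fn gemm_fused(a: &[f32], b: &[f32], bias: &[f32], residual: &[f32],
              m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = (acc + bias[j] + residual[i * n + j]).max(0.0);
        }
    }
    c
}

/// Unfused: pass 1 stores the raw GEMM result, pass 2 re-reads all of C
/// to add bias and residual and apply the activation.
fn gemm_then_epilogue(a: &[f32], b: &[f32], bias: &[f32], residual: &[f32],
                      m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
    for i in 0..m {
        for j in 0..n {
            let idx = i * n + j;
            c[idx] = (c[idx] + bias[j] + residual[idx]).max(0.0);
        }
    }
    c
}

fn main() {
    // A is 2x3, B is 3x2, both row-major.
    let a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let b = [1.0, 0.0, 0.0, 1.0, 1.0, 1.0];
    let bias = [-20.0, 1.0];
    let residual = [0.0, 0.0, 15.0, 0.0];
    let fused = gemm_fused(&a, &b, &bias, &residual, 2, 3, 2);
    let unfused = gemm_then_epilogue(&a, &b, &bias, &residual, 2, 3, 2);
    assert_eq!(fused, unfused);
    println!("{fused:?}");
}
```

Both functions return the same values; the difference is the extra read and write of every output element in the second one, which is the memory traffic the fused path avoids.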

Measured on the public Siamese tracker (Conv 85 %, DW 9 %, MatMul 5 %), Zen 4 6C/12T, 2026-04 arc state:

| Path | 1T p50 | 6T p50 |
| --- | --- | --- |
| yscv no-BLAS | 11.22 ms | 3.17 ms |
| yscv with BLAS | 13.00 ms | 8.75 ms |
| ONNX Runtime 1.24 | 8.07 ms | 1.74 ms |

BLAS makes this workload 2.76× slower at 6T because it breaks the fused-epilogue path (row_gemm_set_parallel_fused, blocked_gemm MR=4×24 / MR=12×32 with in-register residual+bias+activation). All the R4 / R7 / R9 / A2 landings in the April arc fire only on the non-BLAS branch.
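
The ratios follow directly from the table; the small check below recomputes them. The latencies are copied from the table above and nothing here is a new measurement; the helper names are made up for this example.

```rust
/// p50 latencies in milliseconds from the table above: (1 thread, 6 threads).
const NO_BLAS: (f64, f64) = (11.22, 3.17);
const WITH_BLAS: (f64, f64) = (13.00, 8.75);
const ORT: (f64, f64) = (8.07, 1.74);

/// How many times slower `slow` is than `fast`.
fn slowdown(slow: f64, fast: f64) -> f64 {
    slow / fast
}

/// Speedup from 1 thread to 6 threads for one row of the table.
fn scaling(row: (f64, f64)) -> f64 {
    row.0 / row.1
}

fn main() {
    println!("BLAS vs no-BLAS at 6T: {:.2}x slower", slowdown(WITH_BLAS.1, NO_BLAS.1));
    println!("1T -> 6T scaling, no-BLAS:      {:.2}x", scaling(NO_BLAS));
    println!("1T -> 6T scaling, with BLAS:    {:.2}x", scaling(WITH_BLAS));
    println!("1T -> 6T scaling, ONNX Runtime: {:.2}x", scaling(ORT));
}
```

Read this way, the no-BLAS build scales about 3.5× from one to six threads while the BLAS build scales about 1.5×, which is consistent with the oversubscription note above.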

CPU dispatch inspection

Runtime CPU identity is centralized in yscv-cpu. Raw is_*_feature_detected! probes are allowed only in crates/yscv-cpu/src/detect_*; scripts/check-runtime-dispatch.sh enforces that the rest of the workspace reads cached host_cpu().features.

For benchmark logs and bug reports:

println!("{}", yscv_kernels::dispatch_report());

For structured tooling:

let dispatch = yscv_kernels::runtime_dispatch_report();
let config = yscv_kernels::runtime_config_report();

runtime_dispatch_report() includes the cached CPU identity, standalone SIMD paths, matmul/conv dispatch gates, INT8 paths, and RuntimeConfigReport. runtime_config_report() records every active YSCV_* environment override visible to the process in stable name order, so new A/B knobs show up automatically without updating a hand-written allowlist.
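
The "no hand-written allowlist" idea can be sketched with the standard library alone: filter the environment by prefix and keep the result in a sorted map. `collect_yscv_overrides` is a hypothetical helper, not the implementation behind `runtime_config_report()`.

```rust
use std::collections::BTreeMap;

/// Collect every `YSCV_*` variable from an iterator of (name, value) pairs.
/// A BTreeMap iterates in key order, so the report order is stable and any
/// new `YSCV_*` knob shows up without touching this code.
fn collect_yscv_overrides<I>(vars: I) -> BTreeMap<String, String>
where
    I: IntoIterator<Item = (String, String)>,
{
    vars.into_iter()
        .filter(|(name, _)| name.starts_with("YSCV_"))
        .collect()
}

fn main() {
    // std::env::vars() yields the process environment as (String, String)
    // pairs; it panics on non-Unicode entries, which a real tool would handle.
    for (name, value) in collect_yscv_overrides(std::env::vars()) {
        println!("{name}={value}");
    }
}
```

Taking an iterator instead of reading the environment directly keeps the helper easy to test with fixed input.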

`YSCV_AVX512_SGEMM` — AVX-512 MR=12×NR=32 GEMM kernel (x86_64)

Default OFF. Set YSCV_AVX512_SGEMM=1 to enable.

Uses the full 32-ZMM register file (24 accumulators, 2 B-load, 1 A-broadcast, 5 epilogue scratch). Epilogue supports: bias, residual, Relu (all in asm), SiLU (ZMM post-store pass via bit-trick exp). m-tail (<12 rows) handled via tail_tile copy path. Session prepack also produces NR=32 layout (PackedB::data_nr32) when enabled, eliminating per-inference B packing.

Shapes routed through 12×32: n%32==0 && m≥12 && k≥16. Shapes that fall through: routed to MR=4×24 AVX2 (default).
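
The routing rule above reduces to a small predicate. This is a sketch under assumptions: `GemmKernel`, `route_gemm`, and `avx512_sgemm_enabled` are illustrative names, the check for exactly `"1"` is a guess at how the variable is parsed, and real dispatch would also consult CPU feature detection.

```rust
/// Which microkernel a GEMM of shape (m, n, k) would take under the rule above.
#[derive(Debug, PartialEq, Eq)]
enum GemmKernel {
    Avx512Mr12Nr32,
    Avx2Mr4Nr24,
}

/// `avx512_enabled` stands in for "YSCV_AVX512_SGEMM=1 was set".
fn route_gemm(m: usize, n: usize, k: usize, avx512_enabled: bool) -> GemmKernel {
    if avx512_enabled && n % 32 == 0 && m >= 12 && k >= 16 {
        GemmKernel::Avx512Mr12Nr32
    } else {
        // Everything else falls through to the default 4x24 AVX2 kernel.
        GemmKernel::Avx2Mr4Nr24
    }
}

/// Assumption: only the literal value "1" turns the kernel on (default OFF).
fn avx512_sgemm_enabled(value: Option<&str>) -> bool {
    value == Some("1")
}

fn main() {
    let enabled = avx512_sgemm_enabled(std::env::var("YSCV_AVX512_SGEMM").ok().as_deref());
    for (m, n, k) in [(64, 96, 128), (8, 96, 128), (64, 48, 128), (64, 96, 8)] {
        println!("m={m} n={n} k={k} -> {:?}", route_gemm(m, n, k, enabled));
    }
}
```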

Zen 4 note: 4×24 AVX2 wins on AMD Zen 4 (verified 2026-05-14). Measured regression with the 12×32 kernel: +156 µs at 1T vs the 4×24 baseline, even with session prepack. Zen 4 executes 512-bit ops on 256-bit datapaths (2 YMM FMAs/clock vs 1 ZMM FMA/clock), so ZMM brings no peak-FMA gain there and the 4×24 kernel comes out ahead on the tracker's shape mix. Enable on Intel (Sapphire Rapids+) or AMD Zen 5, where ZMM is native.

`YSCV_QUANT_FAST`

quantize_tracker --format qdq defaults to YSCV_QUANT_FAST=1 behavior: constant QDQ weight dequantizers are folded into fp32 initializers and QDQ pairs between Conv-like nodes are stripped. This keeps the exported tracker model runnable while restoring loader-time Conv layout normalisation and weight prepacking for yscv benchmarks. Regular, grouped, and depthwise Conv weights are exported in standard OIHW quantized form before the yscv-fast cleanup folds constant weight DQ nodes for reload; dead quantized tensors/scales are pruned after fold/strip. Set YSCV_QUANT_FAST=0 when debugging the fully explicit QDQ graph.

`YSCV_QUANT_INT8_FAST`

Runtime A/B knob for quantized ONNX execution. By default yscv folds matching DequantizeLinear -> [Relu] -> QuantizeLinear boundaries into a single quant-domain action and records those executions in quant_runtime_stats(). Set YSCV_QUANT_INT8_FAST=0 to force the explicit Q/DQ boundary path while leaving standard QLinearConv / QLinearMatMul kernels enabled. This is a debug/measurement escape hatch, not a production tuning flag.
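
Why the fold is safe can be shown with a toy model. Assuming "matching" means the DequantizeLinear and QuantizeLinear share the same scale and zero point, DQ -> Relu -> Q is equivalent to clamping the int8 value at the zero point. The functions below are illustrative and are not yscv's kernels.

```rust
/// DequantizeLinear: int8 -> f32.
fn dequantize(q: i8, scale: f32, zero_point: i8) -> f32 {
    (q as i32 - zero_point as i32) as f32 * scale
}

/// QuantizeLinear: f32 -> int8 with rounding and saturation.
/// ONNX specifies round-half-to-even; `round()` differs only on exact ties,
/// which do not occur for the values produced by `dequantize` here.
fn quantize(x: f32, scale: f32, zero_point: i8) -> i8 {
    let q = (x / scale).round() as i32 + zero_point as i32;
    q.clamp(-128, 127) as i8
}

/// Explicit path: DequantizeLinear -> Relu -> QuantizeLinear.
fn explicit_dq_relu_q(q: i8, scale: f32, zero_point: i8) -> i8 {
    let x = dequantize(q, scale, zero_point).max(0.0);
    quantize(x, scale, zero_point)
}

/// Folded path: with identical scale and zero point on both sides, ReLU in
/// the real domain is a clamp at the zero point in the quantized domain.
fn folded_relu(q: i8, zero_point: i8) -> i8 {
    q.max(zero_point)
}

fn main() {
    let (scale, zero_point) = (0.05f32, -3i8);
    let mismatches = (i8::MIN..=i8::MAX)
        .filter(|&q| explicit_dq_relu_q(q, scale, zero_point) != folded_relu(q, zero_point))
        .count();
    println!("mismatches over all int8 inputs: {mismatches}");
}
```

Without the optional Relu the same boundary folds to the identity, which is why skipping the float round trip changes nothing but the cost.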

For blas, the tradeoff inverts on Transformer workloads: one big SGEMM per attention head is the bulk of the work, and Accelerate / MKL outperform yscv's hand-tuned blocked GEMM there.

# Library default — keeps BLAS on, good for Transformer / classifier
yscv = "0.1"

# Conv-heavy deployment — no-BLAS is faster, no external lib
yscv = { version = "0.1", default-features = false }

See performance-benchmarks.md for the tracker-specific numbers. On small-channel depthwise/pointwise models like the tracker, the custom blocked-GEMM + streaming-fused Conv paths beat a generic BLAS call, so the non-BLAS build is often faster there.

`metal-backend` — Apple MPSGraph (macOS GPU, yscv's fastest backend on Apple Silicon)

Enables compile_mpsgraph_plan / run_mpsgraph_plan / pipelined submit+wait API. Requires macOS 13+, best on Apple Silicon.

Full guide: mpsgraph-guide.md — when to use, API, pipelining, troubleshooting, performance numbers.

Quick example:

use yscv_onnx::{compile_mpsgraph_plan, run_mpsgraph_plan};
let plan = compile_mpsgraph_plan(&model, &[("images", &sample)])?;
let outputs = run_mpsgraph_plan(&plan, &[("images", bytes)])?;

`gpu` — wgpu cross-platform GPU (Linux / Windows / macOS)

Vulkan / Metal / DX12 / OpenGL via wgpu. Works on AMD, NVIDIA, Intel, Apple, Adreno, Mali — single Rust binary.

Full guide: gpu-backend-guide.md — platform selection, compiled plans, multi-GPU, NixOS LD setup, fp16/fp32 toggle, troubleshooting.

Quick example:

use yscv_onnx::run_onnx_model_gpu;
let outputs = run_onnx_model_gpu(&model, inputs)?;

Runtime flags:

YSCV_GPU_FP32=1 — disable opportunistic fp16 (SHADER_F16), force every kernel + buffer through fp32. Output becomes 1-ULP identical to the CPU runner; latency rises ~12-15 % on memory-bound graphs. Use for accuracy A/B; ship default fp16.
WGPU_BACKEND={vulkan|dx12|metal|gl} — pin the backend (wgpu env; default is "best-of"). Useful for forcing software fallback in CI.
WGPU_ADAPTER_NAME=... — pick a specific GPU when multiple are visible.

`rknn` — Rockchip NPU (edge deployment)

Links librknnrt.so at runtime via dlopen. Builds on any platform; runs only on Rockchip SoCs with the runtime lib installed. Covers RK3588 / RK3576 / RV1106.

Full guide: edge-deployment.md — DMA-BUF zero-copy, SRAM, MPP zero-copy from hardware decoder, dynamic-shape matmul, custom OpenCL ops.

Quick example:

use yscv_kernels::{RknnPipelinedPool, NpuCoreMask};
let pool = RknnPipelinedPool::new(&model_bytes,
    &[NpuCoreMask::Core0, NpuCoreMask::Core1, NpuCoreMask::Core2])?;

Host setup on the Rockchip board:

# librknnrt.so from rockchip-linux/rknn-toolkit2 release
sudo cp librknnrt.so /usr/lib/aarch64-linux-gnu/
sudo ldconfig

`rknn-validate` — dry-run model load at startup

Optional companion to rknn. When enabled, PipelineConfig::validate_models calls RknnBackend::load at config time to catch corrupt-but-magic-OK .rknn files before any RT threads start.

Adds ~100 ms to startup. Worth it in production.

yscv-pipeline = { version = "0.1", features = ["rknn", "rknn-validate"] }

Platform-acceleration flags (silently faster if set)

`mkl` — Intel MKL for x86 matmul

Replaces OpenBLAS with Intel's MKL library. 2-3× faster matmul on Intel CPUs (AVX-512 + per-cpu dispatched kernels).

Setup:

# Install Intel oneAPI Base Toolkit (includes MKL)
# https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html

# Or on Debian/Ubuntu:
sudo apt install intel-mkl

# Point cargo at it:
export MKLROOT=/opt/intel/oneapi/mkl/latest

Then:

yscv = { version = "0.1", features = ["mkl"] }

No code changes. Matmul and Conv silently take the MKL path.

When to use: Intel servers (Xeon, Core i9), Windows/Linux x86 dev. When NOT to use: Apple Silicon (Accelerate beats MKL) or AMD (MKL has historically pessimised non-Intel CPUs; on older MKL releases that still honor it, setting MKL_DEBUG_CPU_TYPE=5 forces the AVX2 path).

`armpl` — ARM Performance Libraries

Same idea, for aarch64 Linux. Arm's hand-tuned NEON/SVE matmul kernels. Biggest win on AWS Graviton, Ampere, or any server ARM.

Setup:

# Download from https://developer.arm.com/downloads/-/arm-performance-libraries
# Install to /opt/arm/armpl-xxx
export ARMPL_DIR=/opt/arm/armpl-23.10

yscv = { version = "0.1", features = ["armpl"] }

When to use: aarch64 Linux servers. When NOT to use: Apple Silicon (use default Accelerate), Rockchip (use rknn for NPU instead of CPU matmul).

Real-time scheduling — `realtime`

Wires Linux real-time scheduling primitives through the pipeline framework. When the feature is enabled, a PipelineConfig with a [realtime] section applies:

SCHED_FIFO — real-time priority for dispatch threads
CPU affinity — pin threads to specific cores, stop cache bouncing
mlockall — prevent memory paging (no swap-out glitches mid-flight)
cpufreq governor — write "performance" to every core, kill DVFS step-up latency

Setup:

[dependencies]
yscv-pipeline = { version = "0.1", features = ["realtime"] }

In your TOML:

[realtime]
sched_fifo = true
cpu_governor = "performance"
prio.dispatch = 70
affinity.dispatch = [4, 5, 6]

Linux capabilities required:

CAP_SYS_NICE for SCHED_FIFO
CAP_SYS_ADMIN for cpufreq governor writes
Process memlock limit raised for mlockall

Production setup via systemd unit:

[Service]
ExecStart=/usr/local/bin/my-pipeline config.toml
AmbientCapabilities=CAP_SYS_NICE CAP_SYS_ADMIN
LimitMEMLOCK=infinity
LimitRTPRIO=99

Or via setcap on the binary:

sudo setcap cap_sys_nice,cap_sys_admin+ep ./my-pipeline

Graceful fallback: if the process lacks a capability, the framework logs a warning and continues without that specific feature. Your pipeline runs; real-time guarantees are partial.

Verify what actually applied after startup by checking the stderr log:

[yscv-pipeline] realtime partial: sched_fifo=true affinity=true mlockall=true governor_cores=8

If sched_fifo=false: missing CAP_SYS_NICE. If mlockall=false: memlock limit too low. If governor_cores=0: missing CAP_SYS_ADMIN.
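
For automated health checks, the status line can be turned into the same diagnosis mechanically. This sketch assumes the log format is exactly the sample line shown above; `RealtimeStatus`, `parse_status`, and `hints` are hypothetical names, not part of yscv-pipeline.

```rust
/// Fields from the `[yscv-pipeline] realtime ...` status line shown above.
#[derive(Debug, Default, PartialEq, Eq)]
struct RealtimeStatus {
    sched_fifo: bool,
    affinity: bool,
    mlockall: bool,
    governor_cores: u32,
}

/// Parse the `key=value` tokens of the status line; other tokens are ignored.
fn parse_status(line: &str) -> RealtimeStatus {
    let mut s = RealtimeStatus::default();
    for token in line.split_whitespace() {
        let Some((key, value)) = token.split_once('=') else { continue };
        match key {
            "sched_fifo" => s.sched_fifo = value == "true",
            "affinity" => s.affinity = value == "true",
            "mlockall" => s.mlockall = value == "true",
            "governor_cores" => s.governor_cores = value.parse().unwrap_or(0),
            _ => {}
        }
    }
    s
}

/// Map each missing primitive to the likely cause listed above.
fn hints(s: &RealtimeStatus) -> Vec<&'static str> {
    let mut out = Vec::new();
    if !s.sched_fifo {
        out.push("sched_fifo=false: missing CAP_SYS_NICE");
    }
    if !s.mlockall {
        out.push("mlockall=false: memlock limit too low");
    }
    if s.governor_cores == 0 {
        out.push("governor_cores=0: missing CAP_SYS_ADMIN");
    }
    out
}

fn main() {
    let line = "[yscv-pipeline] realtime partial: sched_fifo=false affinity=true mlockall=true governor_cores=8";
    let status = parse_status(line);
    println!("affinity applied: {}", status.affinity);
    for hint in hints(&status) {
        println!("{hint}");
    }
}
```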

On non-Linux hosts (macOS, Windows) the feature compiles but all primitives return NotSupported, so it is safe to leave enabled in cross-platform code.

Camera capture — `native-camera`

Native platform camera APIs for live capture. Wraps the nokhwa crate, which handles the three platforms uniformly.

Setup per platform:

Linux — V4L2 always available, no extra install. Permissions: user must be in video group (sudo usermod -aG video $USER).
macOS — AVFoundation, works out of the box. On first access macOS prompts for camera permission (Info.plist NSCameraUsageDescription needed for app bundles; console apps typically prompt in Terminal).
Windows — MediaFoundation, works out of the box. UWP apps need Webcam capability in the manifest.

Setup in Cargo:

yscv-video = { version = "0.1", features = ["native-camera"] }

Usage:

use yscv_video::{CameraConfig, V4l2Camera};  // Linux

let mut cam = V4l2Camera::open(CameraConfig {
    device: "/dev/video0".into(),
    width: 1280, height: 720,
    fps: 60,
    format: yscv_video::PixelFormat::Nv12,
})?;
loop {
    let frame = cam.capture_frame()?;
    // process frame…
}

On Rockchip boards: V4L2 works; for zero-copy to NPU use VIDIOC_EXPBUF to get a DMA-BUF fd, then RknnBackend::wrap_fd to bind it as an NPU input. Full recipe in edge-deployment.md.

Hardware video decode — `videotoolbox` / `vaapi` / `nvdec` / `mediafoundation`

Platform-specific HW decoders on top of the software H.264/HEVC/AV1 decoders. yscv automatically detects availability at runtime; software decode is always the fallback.

| Flag | Platform | Codecs |
| --- | --- | --- |
| `videotoolbox` | macOS / iOS | H.264, HEVC, ProRes |
| `vaapi` | Linux | H.264, HEVC, AV1 (on Intel ≥ Xe, AMD ≥ RDNA2) |
| `nvdec` | Linux / Windows + NVIDIA | H.264, HEVC, AV1 (Ampere+) |
| `mediafoundation` | Windows | H.264, HEVC, AV1 (hardware-dependent) |

Setup:

yscv-video = { version = "0.1", features = ["videotoolbox"] }   # macOS
yscv-video = { version = "0.1", features = ["vaapi"] }          # Linux/Intel/AMD
yscv-video = { version = "0.1", features = ["nvdec"] }          # Linux/Windows NVIDIA
yscv-video = { version = "0.1", features = ["mediafoundation"] } # Windows

System deps per platform:

macOS VideoToolbox: built-in, no install.
Linux VA-API: sudo apt install libva-dev intel-media-va-driver (or mesa-va-drivers for AMD). Check with vainfo.
NVIDIA NVDEC: ships with the NVIDIA driver. nvidia-smi must work.
Windows MediaFoundation: built into Windows 10+.

Usage — automatic detection via HwDecoder:

use yscv_video::{HwDecoder, VideoCodec};

let mut decoder = HwDecoder::new(VideoCodec::H264)?;
// Tries VideoToolbox/VAAPI/NVDEC/MediaFoundation in that order, falls
// back to software if none available. Use it the same way either way.
let frame = decoder.decode(packet)?;
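
The probe-then-fall-back behaviour described in that comment can be modelled in a few lines. This is only a sketch: `DecodeBackend`, `PROBE_ORDER`, and `pick_decoder` are hypothetical, and the closure stands in for whatever runtime availability check HwDecoder performs.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DecodeBackend {
    VideoToolbox,
    Vaapi,
    Nvdec,
    MediaFoundation,
    Software,
}

/// Probe order taken from the comment above; software is the final fallback.
const PROBE_ORDER: [DecodeBackend; 4] = [
    DecodeBackend::VideoToolbox,
    DecodeBackend::Vaapi,
    DecodeBackend::Nvdec,
    DecodeBackend::MediaFoundation,
];

/// Return the first hardware backend that reports itself available,
/// or `Software` when none does.
fn pick_decoder(is_available: impl Fn(DecodeBackend) -> bool) -> DecodeBackend {
    PROBE_ORDER
        .iter()
        .copied()
        .find(|&b| is_available(b))
        .unwrap_or(DecodeBackend::Software)
}

fn main() {
    // Pretend to be a Linux host where both VA-API and NVDEC are usable.
    let linux_nvidia =
        |b: DecodeBackend| matches!(b, DecodeBackend::Vaapi | DecodeBackend::Nvdec);
    println!("{:?}", pick_decoder(linux_nvidia));
    // No hardware decoder at all: software fallback.
    println!("{:?}", pick_decoder(|_| false));
}
```

With a fixed order, a host that exposes several decoders always gets the earliest one in the list.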

See video-pipeline.md for container parsing (MP4/MKV), zero-copy decode, and performance numbers vs ffmpeg.

Miscellaneous flags

`drm` — Linux KMS display output (yscv-video)

Direct framebuffer output via DRM/KMS. For headless edge devices displaying video without X/Wayland. Used by pipeline output.kind = "drm".

Setup:

# User needs access to /dev/dri/card0
sudo usermod -aG video $USER

yscv-video = { version = "0.1", features = ["drm"] }

TOML config:

[output]
kind = "drm"
connector = "HDMI-A-1"
mode = "720p60"

`mpp` — Rockchip MPP hardware encoder (yscv-video-mpp)

Hardware H.264/H.265 encoding on Rockchip SoCs via librockchip_mpp.so (also loaded at runtime via dlopen). Used by pipeline encoder.kind = "mpp-h264".

Setup on Rockchip board:

sudo apt install librockchip-mpp1     # or from rockchip-linux/mpp releases

yscv-video-mpp = { version = "0.1", features = ["mpp"] }

`profile` — per-op ONNX profiling

Instruments every op in the ONNX runner with wall-time measurement. Useful to find which op is a bottleneck.

yscv-onnx = { version = "0.1", features = ["profile"] }

use yscv_onnx::profile_onnx_model_cpu;
// Runs one unfused inference and prints a per-op timing table plus a
// "Top <filter> layers" detail table. Conv rows show the dispatched
// microkernel as `runner-dispatch` or `runner-dispatch/kernel-sub-path`:
//   0.42ms  [1,3,128,128] → [1,16,64,64]  k=[3,3] s=[2,2] via nhwc-padded/first-layer-rgb at stem
//   0.10ms  [1,64,40,40]  → [1,64,40,40]  k=[3,3] s=[1,1] via dw-nhwc-padded/dw-avx512 at neck.dw
//   0.31ms  [64,128]      → [64,96]                       via blocked-mr4 at head.proj   (MatMul)
profile_onnx_model_cpu(&model, inputs)?;

Overhead is small (~1% per op) but leave it off in production — the counters persist across runs.

Runtime knobs (no rebuild needed):

YSCV_PROFILE_FILTER=<spec> — narrow the detail table and the JSON to a node subset. The spec is comma-separated: a bare token matches an op type exactly (Conv, Conv,MatMul), name:<substr> matches a node-name substring (name:head). Unset → the default Conv detail table and an unfiltered JSON.
YSCV_PROFILE_JSON=<path> — additionally write a machine-readable per-node JSON (name/op/ms/shapes; Conv nodes also carry kernel_shape, strides and the dispatched kernel). Consumed by scripts/gap_diff.py.
YSCV_RUNNER_PROFILE=<path> — aggregate per-node timings over the fused runner path across a whole bench loop; dump with dump_runner_profile. Each node carries the same dispatched-kernel label as the CPU profiler (Conv sub-path / MatMul-Gemm family); fused streaming kernels (e.g. FusedDwPw) bypass the instrumented dispatch and carry none.
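
The filter spec is small enough to model directly. A sketch under assumptions: terms are combined with OR, whitespace around tokens is trimmed, and `FilterTerm`, `parse_filter`, and `node_matches` are illustrative names rather than the profiler's own code.

```rust
/// One term of a filter spec.
#[derive(Debug, PartialEq, Eq)]
enum FilterTerm {
    /// Bare token: matches the op type exactly.
    Op(String),
    /// `name:<substr>`: matches any node whose name contains the substring.
    NameContains(String),
}

/// Split a comma-separated spec into terms, skipping empty tokens.
fn parse_filter(spec: &str) -> Vec<FilterTerm> {
    spec.split(',')
        .map(str::trim)
        .filter(|t| !t.is_empty())
        .map(|t| match t.strip_prefix("name:") {
            Some(sub) => FilterTerm::NameContains(sub.to_string()),
            None => FilterTerm::Op(t.to_string()),
        })
        .collect()
}

/// A node passes if any term matches (OR semantics are an assumption).
fn node_matches(terms: &[FilterTerm], op: &str, name: &str) -> bool {
    terms.iter().any(|t| match t {
        FilterTerm::Op(o) => o.as_str() == op,
        FilterTerm::NameContains(s) => name.contains(s.as_str()),
    })
}

fn main() {
    let terms = parse_filter("Conv,name:head");
    for (op, name) in [("Conv", "stem"), ("MatMul", "head.proj"), ("Relu", "neck.dw")] {
        println!("{op} {name}: {}", node_matches(&terms, op, name));
    }
}
```

Because op matching is exact, `Conv` does not select `ConvTranspose`; use a `name:` term when a substring match is wanted.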

Diff two profiles with scripts/gap_diff.py a.json b.json: --filter takes the same spec as YSCV_PROFILE_FILTER, and --per-node diffs node-by-node, flagging a dispatched-kernel change as old→new — ideal for a baseline-vs-change comparison of the same model.

Runtime perf env toggles (ONNX CPU kernels)

Cargo features select backend code at build time. For ONNX CPU micro-tuning and A/B you also have runtime env toggles.

Most used knobs:

YSCV_FUSED_PW_DW_STREAM_OFF=1 — disable PW->DW streaming fused path.
YSCV_FUSED_PW_DW_PW2X_OFF=1 — disable NEON PW2X inner-loop variant.
YSCV_FUSED_PW_DW_W_TILE=<N> — override fused PW->DW strip-mining tile.
YSCV_FUSED_PW_DW_DW_INTERIOR=1 — opt into the experimental AVX-512 stride-1/stride-2 depthwise interior fast path inside PW->DW fusion.
YSCV_FUSED_DW_PW_STREAM_OFF=1 — disable DW->PW streaming fused path.
YSCV_FUSED_DW_PW_STREAM_PADDED=1 — enable padded streaming variant on non-x86; x86_64 enables it by default.
YSCV_FUSED_DW_PW_STREAM_PADDED_OFF=1 — disable x86_64 default padded DW->PW streaming.
YSCV_FUSED_DW_PW_ROW_BATCH=<N> — override DW->PW streaming y-band size.
YSCV_DIRECT_CONV_WORK_MAX=<N> — direct-3x3 routing threshold.
YSCV_NO_POINTWISE_16X16_DIRECT=1 — disable the direct NHWC 1×1 K=16,N=16 pointwise kernel with fused bias/residual/activation epilogue.
YSCV_NO_POINTWISE_NX16_DIRECT=1 — disable the single-thread direct NHWC residual pointwise kernel for small-M, N % 16 == 0 ConvAdd blocks.
YSCV_POOL=yscv — route standalone par_chunks_mut_dispatch kernels through the pinned yscv thread pool instead of Rayon. This is useful for single-op CPU benches and lowers the softmax parallel threshold for the low-overhead pool; the default remains Rayon unless this env var is set.
YSCV_POOL_SPIN_US=<N> — keep yscv-pool workers in a short idle spin before parking. Default is 0, which is tuned for normal inference graphs. For tight standalone single-op microbench loops with YSCV_POOL=yscv, values around 200 can remove the last park/unpark overhead on Zen 4.
YSCV_X86_MEMORY_SIMD=avx2 — on x86/x86_64 with AVX-512 available, route memory-bound standalone elementwise/ReLU dispatch through the 256-bit AVX path for A/B measurement. Default keeps AVX-512 enabled.
YSCV_NO_AARCH64_LOW_K_BLOCKED=1 and YSCV_AARCH64_LOW_K_BLOCKED_MIN_WORK_FMAS=<N> — low-k blocked matmul route.
YSCV_NO_X86_LOW_K_BLOCKED=1 — disable the x86 low-k pointwise route through the blocked 4x24/4x16 AVX+FMA GEMM kernels.

Symmetric QLinearConv depthwise 3x3/5x5 stride-1 nodes, plus measured-win stride-2 tracker shapes, route through the multi-arch INT8 depthwise kernel automatically. Unsupported QLinear shapes fall back to the portable path, so there is no runtime flag for this route. VNNI-friendly explicit QLinear pointwise/MatMul weights (K >= 4, N % 16 == 0) are also packed once at model load into the AVX-512 VNNI 4x16 RHS layout; this is part of the standard runtime index and has no separate feature flag. Entry QuantizeLinear nodes that produce QLinear activation edges write direct i8 side-table tensors via the shared scalar/AVX2/AVX-512F/NEON quantizer; x86 paths pack SIMD lanes directly to i8, avoiding the old scalar collection path for real quant-domain storage.

Canonical reference (with defaults, scope, and reproduction commands): onnx-cpu-kernels.md.

Combination recipes

# Pure CPU, zero system deps (works anywhere, slowest)
yscv = { version = "0.1", default-features = false }

# Apple Silicon dev — everything Apple has
yscv = { version = "0.1", features = ["metal-backend", "videotoolbox", "native-camera"] }

# Intel server — best CPU inference
yscv = { version = "0.1", features = ["mkl", "gpu", "vaapi"] }

# NVIDIA server
yscv = { version = "0.1", features = ["gpu", "nvdec"] }

# Rockchip drone / FPV — full edge stack
yscv-pipeline  = { version = "0.1", features = ["rknn", "rknn-validate", "realtime"] }
yscv-video     = { version = "0.1", features = ["drm"] }
yscv-video-mpp = { version = "0.1", features = ["mpp"] }

# Windows desktop
yscv = { version = "0.1", features = ["gpu", "mediafoundation"] }

# Cross-platform CV library (no accelerators, works anywhere including WASM)
yscv = { version = "0.1", default-features = false }

Deep-dive guides

For flags with lots of surface area, the map at the top links out to these guides:

mpsgraph-guide.md — metal-backend (MPSGraph API, pipelining)
gpu-backend-guide.md — gpu (wgpu, multi-GPU, compiled plans)
edge-deployment.md — rknn (DMA-BUF, SRAM, custom ops)
pipeline-config.md — realtime TOML schema
video-pipeline.md — HW decode + container parsing

Reporting flag bugs

If a feature works on one platform but not another, open an issue with:

Feature list from your Cargo.toml
rustc --version, uname -a, GPU / CPU model
Exact error or symptom
cargo tree -e features -p <crate> output — shows what got resolved
