This document is the canonical capability inventory for Python-ecosystem replacement.
- Lock the target surface needed to replace practical Python CV/DL workflows.
- Track implementation status by capability area.
- Make gaps explicit and tie each gap to concrete framework work.
Implemented: production-ready baseline exists in workspace crates with tests.Partial: baseline exists but parity/scale/perf is incomplete.Planned: no production-grade implementation yet.
| Area | Python Parity Target | Status | yscv Surface Today | Primary Gap To Close |
|---|---|---|---|---|
| Tensor core and layout | NumPy tensor creation/layout/broadcast/reduction baseline | Implemented | yscv-tensor (115 ops on Tensor in ops/: creation, broadcast, add/sub/mul/div/pow/min/max elementwise, neg/abs/exp/ln/sqrt/reciprocal/sign/floor/ceil/round/clamp/scale/add_scalar unary, sum/mean/max_value/min_value/argmax/argmin/var/std_dev + axis reductions, transpose_2d/permute/unsqueeze/squeeze/cat/stack/select/narrow shape ops, eq/gt/lt/all_finite comparisons, where_cond/masked_fill/scatter/gather/scatter_add/topk/triu/tril/eye/repeat advanced ops, FP16/BF16 native dtype storage) |
— |
| CPU numeric kernels | NumPy/SciPy/PyTorch CPU operator runtime | Implemented | yscv-kernels (adaptive + configurable threaded CPU matmul, elementwise/relu/sigmoid/gelu/silu/mish/dropout, softmax/log-softmax/logsumexp/layer-norm/rms-norm/group-norm/batch-norm, NHWC pooling/convolution/depthwise-conv/separable-conv/deformable-conv/transposed-conv, scaled dot-product attention, embedding lookup; GPU backend via wgpu: tiled matmul, elementwise, relu, sigmoid, softmax, conv2d, pooling, batch_norm, layer_norm, transpose compute shaders; GPU buffer pool; multi-device scheduling) |
Deepen SIMD/zero-copy hot-path coverage |
| Autograd | PyTorch/TensorFlow eager grad baseline | Implemented | yscv-autograd (61 Op enum variants in node.rs: add/sub/mul/div/neg/exp/log/sqrt/sigmoid/tanh/abs/pow/clamp/leaky_relu/mean/relu/matmul_2d/sum/softmax/log_softmax/transpose_2d/reshape/unsqueeze/squeeze/conv1d/conv2d/conv3d/conv_transpose/depthwise_conv2d/batch_norm/layer_norm/group_norm/instance_norm/max_pool2d/avg_pool2d/global_avg_pool/adaptive_pool/multi_head_attention/embedding/lstm/gru/rnn/dropout/upsample/pixel_shuffle/residual/einsum/grid_sample/pad/flip/repeat/cat/select/narrow/gather/gather_elements/scatter_add/sum_axis/mean_axis/flatten/expand/slice, gradient checkpointing, GPU backward ops) |
— |
| Optimizers | Core optimizer families used in CV training | Implemented | yscv-optim (8 optimizers: Sgd, Adam, AdamW, RAdam, RmsProp, Adagrad, Lamb, Lars; Lookahead meta-optimizer; 11 LR schedulers: StepLr, CosineAnnealingLr, LinearWarmupLr, OneCycleLr, ExponentialLr, MultiStepLr, ReduceOnPlateauLr, CyclicLr, CosineWarmRestartLr, PolynomialLr, SequentialLr; gradient clipping utilities) |
— |
| Model/layer runtime | Typical CV model execution stack | Implemented | yscv-model (39 ModelLayer variants in layers/mod.rs: Linear, ReLU, LeakyReLU, Sigmoid, Tanh, Dropout, Conv1d, Conv2d, Conv3d, DepthwiseConv2d, SeparableConv2d, ConvTranspose2d, DeformableConv2d, BatchNorm2d, LayerNorm, GroupNorm, InstanceNorm, MaxPool2d, AvgPool2d, GlobalAvgPool2d, AdaptiveAvgPool2d, AdaptiveMaxPool2d, Flatten, Softmax, Embedding, LoraLinear, PixelShuffle, Upsample, GELU, SiLU, Mish, PReLU, ResidualBlock, Rnn, Lstm, Gru, MultiHeadAttention, TransformerEncoder, FeedForward; architecture blocks: BottleneckBlock, SqueezeExciteBlock, MbConvBlock, UNetEncoder/Decoder, FpnNeck, AnchorFreeHead, PatchEmbedding, VisionTransformer; quantization: INT8 symmetric/asymmetric/per-channel; inference: batched_inference, BatchCollector; weight management: save/load/inspect; model zoo: 17 architectures — ResNet18/34/50/101, VGG16/19, MobileNetV2, EfficientNetB0, AlexNet, ViTTiny/Base/Large, DeiTTiny — with ModelHub remote download) |
— |
| Training loop stack | Practical supervised CV training workflows | Implemented | yscv-model + yscv-optim + yscv-autograd (high-level Trainer with TrainerConfig, validation split, EarlyStopping/BestModelCheckpoint callbacks; 17 loss functions: mse, mae, huber, hinge, bce, nll, cross_entropy, label_smoothing_cross_entropy, focal, dice, triplet, contrastive, cosine_embedding, ctc, smooth_l1, kl_div, distillation; per-optimizer train-step/epoch helpers; deterministic NHWC augmentation; weighted/shuffled/class-balanced batch pipeline; batch mixup/cutmix; eval/train mode; mixed-precision training with DynamicLossScaler; distributed training with AllReduceAggregator/ParameterServer/TopKCompressor/TCP transport; LoRA fine-tuning; EMA; LR finder; TensorBoard writer; StreamingDataLoader; gradient clipping) |
— |
| Data transforms | torchvision.transforms-style composable preprocessing/augmentation | Implemented | yscv-model (Compose pipeline with 8 transforms: Normalize, ScaleValues, PermuteDims, Resize, CenterCrop, RandomHorizontalFlip, GaussianBlur; ImageAugmentationPipeline with 12+ ops; DataLoader with prefetch and samplers) |
Expand transform catalogue |
| Image processing primitives | OpenCV/scikit-image/Pillow core transforms | Implemented | yscv-imgproc (159 free public functions in ops/: grayscale/HSV/BGR/LAB/YUV conversions, nearest+bilinear resize, box/Gaussian/median/bilateral blur, morphology suite, Sobel/Laplacian/Canny edges, binary/Otsu/adaptive thresholding, connected-component labeling, template matching (SSD+NCC), contour detection, filter2d, pad/crop, histogram+equalize+CLAHE, flip/rotate, affine+perspective warp, Harris/FAST corners, Hough lines/circles, Gaussian pyramid, distance transform, integral image, NMS, ORB/SIFT/SURF/BRIEF descriptors, Lucas-Kanade+Farneback optical flow, watershed, HOG descriptors, convex hull, min-area rect, homography (4pt DLT + RANSAC), ellipse fitting, Douglas-Peucker, inpainting, camera calibration, stereo matching, drawing primitives) |
— |
| Video/Camera I/O | OpenCV-like cross-platform capture and frame pipeline | Implemented | yscv-video (native camera I/O via nokhwa (V4L2/AVFoundation/MediaFoundation); H.264 decoder: full Baseline/Main/High profile, I/P/B slices, CAVLC, CABAC (branchless), DCT + dequant, motion compensation, deblocking, weighted prediction, sub-MB partitions, scaling lists; HEVC/H.265 decoder: Main + Main10 (10-bit), I/P/B slices, CTU quad-tree, 35 intra modes, chroma motion compensation, CABAC, SAO, deblocking, tiles; hardware decode backends: VideoToolbox, VAAPI, NVDEC, MediaFoundation; MP4 + MKV/WebM container parsing; streaming Mp4VideoReader + FrameStream; audio metadata for AAC/ALAC/Opus/Vorbis/MP3/FLAC; YUV420→RGB8 BT.601 conversion with NEON+SSE2+AVX2 paths) |
— |
| Detection runtime | Practical object/face detection baseline | Implemented | yscv-detect (full YOLOv8 + YOLOv11 ONNX inference pipelines: detect_yolov8_from_rgb, decode_yolov8_output, decode_yolov11_output, yolov8_coco_config, yolov11_coco_config, letterbox_preprocess; ModelDetector config with postprocess_detections and preprocess_rgb8_for_model; heatmap detection (detect_from_heatmap); NMS hard/soft/batched; anchor generation; RoI pool/align; scratch buffer reuse; detect_people_from_rgb8 / detect_faces_from_rgb8 helpers) |
— |
| Recognition runtime | Embedding-based identity matching | Implemented | yscv-recognize (Recognizer with cosine similarity, gallery management, enroll/remove/recognize API, VP-Tree ANN indexing via build_index()/search_indexed(), snapshot JSON persistence, slice-based recognition) |
— |
| Multi-object tracking | Stable ID tracking and counting baseline | Implemented | yscv-track (DeepSortTracker, ByteTracker, KalmanFilter 8-state, Hungarian assignment, TrackerConfig, re-identification: ReIdExtractor trait, ColorHistogramReId, ReIdGallery with distance threshold matching) |
— |
| Evaluation stack | Detection/tracking/counting metric and benchmark tooling | Implemented | yscv-eval (37 public metric/eval functions: mAP/COCO detection, HOTA/IDF1/MOTA/MOTP tracking, accuracy/precision/recall/F1/ROC/AUC classification, MAE/RMSE/MAPE/R2 regression, DICE/IoU/PSNR/SSIM/top-k general, counting metrics; 8 dataset adapters under dataset/: COCO, JSONL, OpenImages, YOLO, VOC, KITTI, WIDERFACE, MOT; camera diagnostics; timing stats; pipeline benchmark thresholds) |
— |
| Dataset adapters | Typical annotation formats used in CV training/eval | Implemented | Training: JSONL/CSV/ImageManifest/ImageFolder in yscv-model; Evaluation: JSONL/COCO/OpenImages/YOLO/VOC/KITTI/WIDERFACE detection + JSONL/MOT tracking in yscv-eval (8 formats in dataset/) |
— |
| Model portability | ONNX import/export compatibility | Implemented | yscv-onnx (122 CPU op match arms in runner/dispatch.rs covering opset 22; quantized runtime ops, trig/hyperbolic, logical, activation, spatial, misc; NCHW<->NHWC conversion; graph optimizer with Conv+ReLU/BN+ReLU fusion; dynamic shape; OnnxDtype enum with Float32/Float16/Int8/UInt8/Int32/Int64/Bool, OnnxTensorData with quantize/dequantize; CPU perf varies by model/hardware: ~1.07× behind ORT single-thread on x86 Zen 4 and ~1.5-1.6× faster than ORT on the Orange Pi Zero 3 (Cortex-A53) Siamese tracker (see performance-benchmarks.md); supports YOLO11n opset 22 where onnxruntime/tract fail); yscv-model ONNX export bridge |
— |
| Inference deployment ops | Graph optimization, batching, quantization, packaging | Implemented | ONNX graph optimization (Conv+BN folding, constant folding, Conv+ReLU/BN+ReLU fusion), quantized ONNX runtime, INT8 quantization, dynamic batching (batched_inference, BatchCollector), binary weight save/load, SafeTensors support |
Model packaging/export tooling |
| Accelerator backends | CUDA/Metal/Vulkan/other acceleration parity track | Implemented | yscv-kernels gpu feature: wgpu compute backend (Vulkan/Metal/DX12) with tiled matmul, elementwise, activation, conv2d, pooling, batch_norm, layer_norm, softmax, transpose shaders; multi-device scheduling (MultiGpuBackend, SchedulingStrategy::RoundRobin/DataParallel/Manual); GPU buffer pool; Metal-native pipeline (metal-backend feature): compiled execution plans with tiled f16 conv GEMM (BM=64, BN=64, BK=16, 16 accumulators/thread), f16 weight pre-packing (half bandwidth), parallel softmax with shared-memory reduction, fast_divmod im2col, zero-cost buffer aliasing, fused command buffers; MPSGraph pipelined submit/wait API (triple-buffered by default via YSCV_MPS_PIPELINE, multi-input models, f16 end-to-end with NEON vcvt_f32_f16 widening, zero ObjC allocs in hot path — measured ~4.4× ORT CoreML throughput on the Siamese tracker, Apple M1) |
f16 inter-op buffers, further CoreML parity |
| Distributed training | Multi-node gradient synchronization | Implemented | yscv-model (DistributedConfig, GradientAggregator trait, AllReduceAggregator ring all-reduce, ParameterServer centralized aggregation, InProcessTransport, TcpTransport with coordinator/worker roles and length-prefixed protocol, TopKCompressor gradient compression, distributed_train_step helper) |
Scale testing on real clusters |
| API/release governance | Semver policy + compatibility notes + publish automation | Implemented | docs/api-stability.md (stability tiers, semver policy, release checklist, publish order), CHANGELOG.md, scripts/publish.sh (dependency-ordered publish), scripts/bump-version.sh |
— |
| End-to-end reference apps | Usable Rust apps proving framework viability | Implemented | yscv-cli, apps/bench, apps/camera-face-tool, apps/llm-bench, 26 examples in examples/src/ (incl. train_linear, train_cnn, image_pipeline, yolo_detect, yolo_finetune, GPU/Metal/MPSGraph benches) |
Expand real-world reference scenarios |
- After every feature change, update this matrix if coverage status/gaps changed.
- Do not mark
Implementedwithout executable evidence (tests/benchmarks/repro run). - Keep
context.mdandAGENTS.mdaligned with this matrix.