Sample iOS app for NMS-free YOLO object detection models (YOLO26, YOLOv10), with on-device multi-object tracking (ByteTrack).
| Model | Size | Architecture | Year |
|---|---|---|---|
| YOLO26s | 18 MB | End-to-end, NMS-free | 2026 |
| YOLOv10s | 14 MB | Dual assignment, NMS-free | 2024 |
These models output direct predictions without requiring Non-Maximum Suppression post-processing.
- Camera — Real-time detection + tracking with FPS/latency stats
- Photo — Pick an image from library and run detection
- Video — Frame-by-frame detection + tracking with progress bar
- Track toggle — Tap Track/Raw in Camera or Video to switch between tracked (persistent IDs, smoothed boxes) and raw detections
Tracking is implemented in pure Swift (`Tracker.swift`) using ByteTrack (Zhang et al., ECCV 2022, arXiv 2110.06864) — currently the mobile sweet spot for multi-object tracking:
- No appearance network. Pure motion (per-track 8D constant-velocity Kalman filter) + IoU association. No second neural net on the ANE/GPU.
- Two-stage association. High-confidence detections are matched first; low-confidence detections are then used to rescue tracks about to be lost, which keeps IDs stable through motion blur and brief occlusions.
- Class-aware. Only detections of the same class can inherit a track.
- Track lifecycle. Lost tracks survive 30 frames before being dropped; new tentative tracks need a second-frame confirmation before being drawn.
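The two-stage association above can be sketched in a few lines. This is an illustrative Python sketch, not the app's `Tracker.swift`: it uses greedy best-IoU matching rather than ByteTrack's Hungarian assignment, and all names, thresholds, and the dict shapes are invented for the example.

```python
# Hypothetical sketch of ByteTrack-style two-stage association.
# Greedy best-IoU matching stands in for the Hungarian algorithm;
# all names and thresholds here are illustrative, not the app's.

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def associate(tracks, detections, high_thresh=0.5, iou_thresh=0.3):
    """Stage 1 matches high-confidence detections to tracks; stage 2 lets
    low-confidence detections rescue any still-unmatched tracks, which is
    what keeps IDs alive through motion blur and brief occlusions."""
    high = [d for d in detections if d["conf"] >= high_thresh]
    low = [d for d in detections if d["conf"] < high_thresh]
    matches, unmatched_tracks = [], list(tracks)
    for pool in (high, low):
        still_unmatched = []
        claimed = [m[1] for m in matches]
        for t in unmatched_tracks:
            # Class-aware: only same-class detections can inherit a track.
            cands = [d for d in pool if d["cls"] == t["cls"] and d not in claimed]
            best = max(cands, key=lambda d: iou(t["box"], d["box"]), default=None)
            if best is not None and iou(t["box"], best["box"]) >= iou_thresh:
                matches.append((t, best))
                claimed.append(best)
            else:
                still_unmatched.append(t)
        unmatched_tracks = still_unmatched
    return matches, unmatched_tracks
```

Tracks left unmatched after both stages would then enter the lost state described above (surviving 30 frames before removal), and unmatched high-confidence detections would spawn tentative tracks.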
Each tracked object gets a persistent color (indexed by track ID) and its box label shows `#<id> <class> <conf>%`. A short motion trail (~2 seconds of recent Kalman-smoothed centers, configurable via `Config.trailMaxLen`) is drawn behind each active track so movement direction and velocity are visible at a glance. Track IDs restart when tracking is toggled or when the Camera view re-appears.
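Persistent coloring is just deterministic indexing by track ID, so an object keeps its color for its whole lifetime. A minimal Python sketch (the palette, function names, and label helper are illustrative, not the app's actual values):

```python
# Illustrative palette; the app's actual colors may differ.
PALETTE = ["red", "green", "blue", "orange", "purple", "cyan"]

def color_for(track_id: int) -> str:
    # The same ID always maps to the same color; IDs beyond the
    # palette length wrap around.
    return PALETTE[track_id % len(PALETTE)]

def label_for(track_id: int, cls_name: str, conf: float) -> str:
    # Mirrors the "#<id> <class> <conf>%" label format described above.
    return f"#{track_id} {cls_name} {round(conf * 100)}%"
```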
- Download a model from the links above
- Unzip and drag the `.mlpackage` into the Xcode project
- Build and run on a physical device (iOS 16+)
Model output: `[1, 300, 6]`, where each row is `[x1, y1, x2, y2, confidence, class_id]` in pixel coordinates (0–640). The app decodes these directly — no NMS needed.
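Decoding that tensor is a single pass over the 300 rows: keep rows above a confidence threshold and scale the 640-pixel coordinates to the view size. A hedged Python sketch of that logic (the app does this in Swift; the threshold and view size are illustrative):

```python
def decode(rows, conf_thresh=0.25, input_size=640, view_w=640, view_h=640):
    """rows: iterable of [x1, y1, x2, y2, confidence, class_id] with
    coordinates in 0..input_size pixels. Returns detections scaled to the
    view size. No NMS pass: the model already emits de-duplicated boxes."""
    sx, sy = view_w / input_size, view_h / input_size
    dets = []
    for x1, y1, x2, y2, conf, cls in rows:
        if conf < conf_thresh:
            continue  # drop low-confidence rows
        dets.append({
            "box": (x1 * sx, y1 * sy, x2 * sx, y2 * sy),
            "conf": conf,
            "cls": int(cls),
        })
    return dets
```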