
Add 10 new CoreML models with sample apps #55

Open
john-rocky wants to merge 18 commits into master from add-new-models

Conversation

@john-rocky
Owner

Summary

  • 10 new CoreML models across 9 new categories: depth estimation (next-gen), object detection (next-gen), background removal, speech recognition, text-to-speech, vision-language model, open-vocabulary detection, pose estimation, multilingual OCR
  • 3 sample apps + 7 creative apps with full SwiftUI implementations
  • 10 Python conversion scripts in conversion_scripts/
  • README.md updated with all new model entries and TOC

New Models

| Model | Category | Year | Size |
| --- | --- | --- | --- |
| Depth Anything V2 Small | Monocular Depth | 2024 | 47MB |
| YOLOv10-N | Object Detection | 2024 | 4.6MB |
| BiRefNet | Background Removal | 2024 | 82MB |
| Whisper Tiny | Speech Recognition | 2023 | 16MB |
| Depth Pro | Metric Depth (Apple) | 2024 | 1.2GB |
| Kokoro-82M | Text-to-Speech | 2025 | 55MB |
| SmolVLM2-500M | Vision-Language Model | 2025 | 164MB |
| YOLOE-S | Open-Vocab Detection | 2025 | 20MB |
| DWPose / RTMPose | Pose Estimation | 2024 | ~50MB |
| PP-OCRv5 | Multilingual OCR | 2025 | ~20MB |

Model Distribution

CoreML models (.mlpackage) will be distributed via GitHub Releases to avoid repo size limits. Users download models and place them in the app directory to build.

Test plan

  • Open each .xcodeproj in Xcode and verify it builds
  • Download .mlpackage from Releases, place in app dir, run on device
  • Verify README links and model entries

🤖 Generated with Claude Code

john-rocky and others added 18 commits March 28, 2026 22:52
New model categories:
- Face Manipulation: LivePortrait, FOMM, Wav2Lip, SimSwap, 3DDFA_V2, DPR
- Image Harmonization: CDTNet
- Audio Source Separation: HTDemucs
- Video Motion Magnification: STB-VMM
- Image Deblurring: NAFNet
- Image Classifiers: MobileNetV3, ConvNeXt, FastViT, MobileOne, etc.
- Semantic Segmentation: DeepLabV3, LRASPP

Includes 20 SwiftUI sample apps (creative_apps/ and sample_apps/).
Model files (.mlpackage) are excluded - download from Google Drive.

New models: Depth Anything V2, YOLOv10-N, BiRefNet, Whisper Tiny,
Depth Pro, Kokoro-82M TTS, SmolVLM2-500M, YOLOE-S, DWPose, PP-OCRv5.
Covers new categories: speech recognition, TTS, VLM, open-vocab detection,
pose estimation, and multilingual OCR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Models are distributed via GitHub Releases (not in repo).
Download .mlpackage files and place in the app directory to build.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Depth Pro requires 1536x1536 fixed input (~1.2GB model).
Added RAM requirement warning (iPhone 15 Pro+ / 6GB RAM).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
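
A minimal sketch of such a RAM gate, assuming a plain physical-memory check (the 6GB threshold mirrors the iPhone 15 Pro guidance above; the property name is illustrative):

```swift
import Foundation

// Depth Pro needs ~1.2GB of weights plus activation memory at its fixed
// 1536x1536 input, so gate the feature on total device RAM.
var deviceHasEnoughRAM: Bool {
    ProcessInfo.processInfo.physicalMemory >= 6 * 1024 * 1024 * 1024
}
```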
ANE fails to compile the large DepthPro model. Switch to CPU+GPU compute units.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
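
A minimal sketch of that compute-units switch (the loader function and URL parameter are illustrative, not the app's actual code):

```swift
import CoreML

// The ANE compiler rejects the large Depth Pro model, so exclude the
// Neural Engine and run on CPU+GPU only.
func loadDepthProModel(at compiledModelURL: URL) throws -> MLModel {
    let config = MLModelConfiguration()
    config.computeUnits = .cpuAndGPU  // .all would route through the ANE and fail
    return try MLModel(contentsOf: compiledModelURL, configuration: config)
}
```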
- Model input is MLMultiArray (not pixelBuffer): convert BGRA→RGB Float16
- Output name is 'var_4563' (auto-generated), with fallback to first output
- Handle Float16 output with vImage conversion
- Add Accelerate import for vImage

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
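
The output-name fallback described above might look like this (the helper name is illustrative):

```swift
import CoreML

// Try the auto-generated output name first; fall back to whatever output
// the model declares so a reconverted model keeps working.
func outputArray(from prediction: MLFeatureProvider) -> MLMultiArray? {
    if let named = prediction.featureValue(for: "var_4563")?.multiArrayValue {
        return named
    }
    // featureNames is an unordered set; with a single output this is safe.
    guard let fallback = prediction.featureNames.first else { return nil }
    return prediction.featureValue(for: fallback)?.multiArrayValue
}
```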
Also handle Float16 output in fallback path with vImage conversion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Depth Pro (1.2GB, 1536x1536 fixed) crashes on all iPhones due to memory
- Removed DepthProDemo app, conversion script, and README entry
- BiRefNet: reduced input from 1024x1024 to 512x512 to fit iPhone memory
- BiRefNet: switched from ANE to cpuAndGPU (ANE compilation fails)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Model uses Float16, not Float32. Reading Float32 from Float16 buffer
produced garbage → NaN → UInt8 crash.
- Input: write as Float16 via vImage conversion
- Output: read as Float16 and convert to Float32 via vImage
- Add NaN guard in mask-to-image conversion
- Add Accelerate import for vImage

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
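
A sketch of the Float16-to-Float32 read with the NaN guard, assuming a contiguous Float16 output array (the helper name is illustrative):

```swift
import Accelerate
import CoreML

// Reading Float32 out of a Float16 buffer produced garbage, so widen
// explicitly with vImage, then zero any non-finite values before the
// UInt8 mask conversion.
func float32Values(fromFloat16 array: MLMultiArray) -> [Float] {
    let count = array.count
    var result = [Float](repeating: 0, count: count)
    array.withUnsafeMutableBytes { rawPtr, _ in
        var src = vImage_Buffer(data: rawPtr.baseAddress,
                                height: 1,
                                width: vImagePixelCount(count),
                                rowBytes: count * MemoryLayout<UInt16>.stride)
        result.withUnsafeMutableBytes { out in
            var dst = vImage_Buffer(data: out.baseAddress,
                                    height: 1,
                                    width: vImagePixelCount(count),
                                    rowBytes: count * MemoryLayout<Float>.stride)
            _ = vImageConvert_Planar16FtoPlanarF(&src, &dst, vImage_Flags(kvImageNoFlags))
        }
    }
    // NaN/inf guard from the commit above.
    return result.map { $0.isFinite ? $0 : 0 }
}
```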
UIImage from PhotosPicker can have rotation metadata (imageOrientation).
CGImage ignores this, causing a 90-degree mismatch between mask and cutout.
Normalize to .up orientation before extracting CGImage pixels.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
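
The orientation fix can be as simple as redrawing the image; a sketch (function name is illustrative):

```swift
import UIKit

// PhotosPicker images can carry rotation metadata in `imageOrientation`,
// which CGImage ignores. Redraw so the pixels are physically rotated and
// the orientation becomes .up before extracting CGImage pixels.
func normalizedToUp(_ image: UIImage) -> UIImage {
    guard image.imageOrientation != .up else { return image }
    let format = UIGraphicsImageRendererFormat.default()
    format.scale = image.scale
    return UIGraphicsImageRenderer(size: image.size, format: format).image { _ in
        image.draw(in: CGRect(origin: .zero, size: image.size))
    }
}
```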
…trait pipeline

- Remove DepthAnythingV2Demo (Apple official CoreML model available)
- Remove WhisperDemo (WhisperKit provides full implementation)
- Remove DWPoseDemo (Apple Vision API has built-in pose detection)
- Remove corresponding conversion scripts
- Update README to reference official implementations
- Fix YOLOv10Demo to parse raw MultiArray output [1,300,6] (sketched after this list)
- Implement full LivePortrait animation pipeline with 4-model inference
- Add AppIcon.appiconset to YOLOv10Demo
- Add Accelerate-based STFT/iSTFT signal processing (vDSP FFT)
- Implement stride-aware MLMultiArray extraction for Float16 outputs (sketched below)
- Use time-domain output only (freq branch overflows Float16 → ±inf)
- Audio loading with format conversion and resampling to 44.1kHz stereo
- Per-stem WAV export and playback
- Fix SwiftUI type-checker timeout in WaveformView
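
A sketch of the [1,300,6] parsing, assuming the usual YOLOv10 end-to-end row layout of [x1, y1, x2, y2, score, classId] (the Detection type and threshold are illustrative):

```swift
import CoreGraphics
import CoreML

struct Detection {
    let box: CGRect
    let score: Float
    let classIndex: Int
}

// Walk the 300 candidate rows and keep those above the score threshold.
func parseDetections(_ output: MLMultiArray, scoreThreshold: Float = 0.25) -> [Detection] {
    let rows = output.shape[1].intValue
    var detections: [Detection] = []
    for row in 0..<rows {
        let r = row as NSNumber
        let score = output[[0, r, 4]].floatValue
        guard score >= scoreThreshold else { continue }
        let x1 = CGFloat(output[[0, r, 0]].floatValue)
        let y1 = CGFloat(output[[0, r, 1]].floatValue)
        let x2 = CGFloat(output[[0, r, 2]].floatValue)
        let y2 = CGFloat(output[[0, r, 3]].floatValue)
        detections.append(Detection(
            box: CGRect(x: x1, y: y1, width: x2 - x1, height: y2 - y1),
            score: score,
            classIndex: Int(output[[0, r, 5]].floatValue)))
    }
    return detections
}
```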

Known issue: freq_output produces ±inf due to Float16 overflow in the
model's frequency branch. Reconverting the model with Float32 outputs
should enable freq+time reconstruction for better separation quality.
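
The stride-aware read mentioned in the list above might look like this, assuming a Float16 multiarray (helper name is illustrative):

```swift
import CoreML

// Index via the multiarray's strides instead of assuming a densely
// packed layout; some Float16 outputs come back non-contiguous.
func float16Value(in array: MLMultiArray, at indices: [Int]) -> Float {
    var offset = 0
    for (axis, index) in indices.enumerated() {
        offset += index * array.strides[axis].intValue
    }
    return array.withUnsafeBytes { raw in
        Float(raw.bindMemory(to: Float16.self)[offset])
    }
}
```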
Draft conversion script to reconvert HTDemucs with Float32 precision.
The current Float16 model causes overflow (±inf) in the frequency branch.
- freq_output overflows Float16 range (±inf) for real STFT data,
  even with Float32 internal computation (output tensor is Float16)
- Use time-domain output only for stem reconstruction
- Add F32 model to Xcode project (compute_precision=FLOAT32)
- Simplify conversion script: ONNX-based, end-to-end model
- Source order confirmed: drums, bass, other, vocals

Known issue: Full freq+time reconstruction requires Float32 output
tensors in the CoreML model. The current model spec forces Float16
output which cannot represent large STFT values (>65504).
Time-only provides decent separation quality.

- Add computeSpectralInput: STFT with Python _spec padding, CaC channel format
- Add inverseSpec: iSTFT matching Python _ispec (time padding + trim)
- Feed actual STFT data to spectral_magnitude (was zeros, disabling freq branch)
- Normalize STFT input (÷√N) to match Python torch.stft(normalized=True)
- Compensate iSTFT output (×√N) for correct freq+time addition (both scalings sketched after this list)
- Add stride-aware MLMultiArray fallback for non-contiguous layouts
- Fix stem order: vocals=3, other=2 (matching Python model.sources)
- Generalize forwardSTFT/inverseSTFT for variable frame counts
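
The two scalings in the list above might reduce to the following (names are illustrative; `spectrum` and `samples` are flat Float buffers):

```swift
import Accelerate

// torch.stft(normalized=True) divides by sqrt(N) for FFT size N; mirror
// that on the forward side and multiply back after the iSTFT so the
// freq and time branches sum on the same scale.
func stftNormalized(_ spectrum: [Float], fftSize: Int) -> [Float] {
    vDSP.multiply(1 / Float(fftSize).squareRoot(), spectrum)
}

func istftCompensated(_ samples: [Float], fftSize: Int) -> [Float] {
    vDSP.multiply(Float(fftSize).squareRoot(), samples)
}
```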