
Add 10 new CoreML models with sample apps #55

Open
john-rocky wants to merge 18 commits into master from add-new-models

Conversation

@john-rocky
Owner

Summary

  • 10 new CoreML models across 9 new categories: depth estimation (next-gen), object detection (next-gen), background removal, speech recognition, text-to-speech, vision-language model, open-vocabulary detection, pose estimation, multilingual OCR
  • 3 sample apps + 7 creative apps with full SwiftUI implementations
  • 10 Python conversion scripts in conversion_scripts/
  • README.md updated with all new model entries and TOC

New Models

| Model | Category | Year | Size |
| --- | --- | --- | --- |
| Depth Anything V2 Small | Monocular Depth | 2024 | 47MB |
| YOLOv10-N | Object Detection | 2024 | 4.6MB |
| BiRefNet | Background Removal | 2024 | 82MB |
| Whisper Tiny | Speech Recognition | 2023 | 16MB |
| Depth Pro | Metric Depth (Apple) | 2024 | 1.2GB |
| Kokoro-82M | Text-to-Speech | 2025 | 55MB |
| SmolVLM2-500M | Vision-Language Model | 2025 | 164MB |
| YOLOE-S | Open-Vocab Detection | 2025 | 20MB |
| DWPose / RTMPose | Pose Estimation | 2024 | ~50MB |
| PP-OCRv5 | Multilingual OCR | 2025 | ~20MB |

Model Distribution

CoreML models (.mlpackage) will be distributed via GitHub Releases to avoid repo size limits. Users download models and place them in the app directory to build.

Test plan

  • Open each .xcodeproj in Xcode and verify it builds
  • Download .mlpackage from Releases, place in app dir, run on device
  • Verify README links and model entries

🤖 Generated with Claude Code

john-rocky and others added 18 commits March 28, 2026 22:52
New model categories:
- Face Manipulation: LivePortrait, FOMM, Wav2Lip, SimSwap, 3DDFA_V2, DPR
- Image Harmonization: CDTNet
- Audio Source Separation: HTDemucs
- Video Motion Magnification: STB-VMM
- Image Deblurring: NAFNet
- Image Classifiers: MobileNetV3, ConvNeXt, FastViT, MobileOne, etc.
- Semantic Segmentation: DeepLabV3, LRASPP

Includes 20 SwiftUI sample apps (creative_apps/ and sample_apps/).
Model files (.mlpackage) are excluded - download from Google Drive.

New models: Depth Anything V2, YOLOv10-N, BiRefNet, Whisper Tiny,
Depth Pro, Kokoro-82M TTS, SmolVLM2-500M, YOLOE-S, DWPose, PP-OCRv5.
Covers new categories: speech recognition, TTS, VLM, open-vocab detection,
pose estimation, and multilingual OCR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Models are distributed via GitHub Releases (not in repo).
Download .mlpackage files and place in the app directory to build.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Depth Pro requires 1536x1536 fixed input (~1.2GB model).
Added RAM requirement warning (iPhone 15 Pro+ / 6GB RAM).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
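
A minimal sketch of such a RAM gate, assuming a plain physical-memory check (the 6GB threshold mirrors the iPhone 15 Pro guidance above; the property name is illustrative):

```swift
import Foundation

// Depth Pro needs ~1.2GB of weights plus activation memory at its fixed
// 1536x1536 input, so gate the feature on total device RAM.
var deviceHasEnoughRAM: Bool {
    ProcessInfo.processInfo.physicalMemory >= 6 * 1024 * 1024 * 1024
}
```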
ANE fails to compile the large DepthPro model. Switch to CPU+GPU compute units.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
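
A minimal sketch of that compute-units switch (the loader function and URL parameter are illustrative, not the app's actual code):

```swift
import CoreML

// The ANE compiler rejects the large Depth Pro model, so exclude the
// Neural Engine and run on CPU+GPU only.
func loadDepthProModel(at compiledModelURL: URL) throws -> MLModel {
    let config = MLModelConfiguration()
    config.computeUnits = .cpuAndGPU  // .all would route through the ANE and fail
    return try MLModel(contentsOf: compiledModelURL, configuration: config)
}
```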
- Model input is MLMultiArray (not pixelBuffer): convert BGRA→RGB Float16
- Output name is 'var_4563' (auto-generated), with fallback to first output
- Handle Float16 output with vImage conversion
- Add Accelerate import for vImage

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
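
The output-name fallback described above might look like this (the helper name is illustrative):

```swift
import CoreML

// Try the auto-generated output name first; fall back to whatever output
// the model declares so a reconverted model keeps working.
func outputArray(from prediction: MLFeatureProvider) -> MLMultiArray? {
    if let named = prediction.featureValue(for: "var_4563")?.multiArrayValue {
        return named
    }
    // featureNames is an unordered set; with a single output this is safe.
    guard let fallback = prediction.featureNames.first else { return nil }
    return prediction.featureValue(for: fallback)?.multiArrayValue
}
```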
Also handle Float16 output in fallback path with vImage conversion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Depth Pro (1.2GB, 1536x1536 fixed) crashes on all iPhones due to memory
- Removed DepthProDemo app, conversion script, and README entry
- BiRefNet: reduced input from 1024x1024 to 512x512 to fit iPhone memory
- BiRefNet: switched from ANE to cpuAndGPU (ANE compilation fails)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Model uses Float16, not Float32. Reading Float32 from Float16 buffer
produced garbage → NaN → UInt8 crash.
- Input: write as Float16 via vImage conversion
- Output: read as Float16 and convert to Float32 via vImage
- Add NaN guard in mask-to-image conversion
- Add Accelerate import for vImage

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
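
A sketch of the Float16-to-Float32 read with the NaN guard, assuming a contiguous Float16 output array (the helper name is illustrative):

```swift
import Accelerate
import CoreML

// Reading Float32 out of a Float16 buffer produced garbage, so widen
// explicitly with vImage, then zero any non-finite values before the
// UInt8 mask conversion.
func float32Values(fromFloat16 array: MLMultiArray) -> [Float] {
    let count = array.count
    var result = [Float](repeating: 0, count: count)
    array.withUnsafeMutableBytes { rawPtr, _ in
        var src = vImage_Buffer(data: rawPtr.baseAddress,
                                height: 1,
                                width: vImagePixelCount(count),
                                rowBytes: count * MemoryLayout<UInt16>.stride)
        result.withUnsafeMutableBytes { out in
            var dst = vImage_Buffer(data: out.baseAddress,
                                    height: 1,
                                    width: vImagePixelCount(count),
                                    rowBytes: count * MemoryLayout<Float>.stride)
            _ = vImageConvert_Planar16FtoPlanarF(&src, &dst, vImage_Flags(kvImageNoFlags))
        }
    }
    // NaN/inf guard from the commit above.
    return result.map { $0.isFinite ? $0 : 0 }
}
```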
UIImage from PhotosPicker can have rotation metadata (imageOrientation).
CGImage ignores this, causing a 90-degree mismatch between mask and cutout.
Normalize to .up orientation before extracting CGImage pixels.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
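
The orientation fix can be as simple as redrawing the image; a sketch (function name is illustrative):

```swift
import UIKit

// PhotosPicker images can carry rotation metadata in `imageOrientation`,
// which CGImage ignores. Redraw so the pixels are physically rotated and
// the orientation becomes .up before extracting CGImage pixels.
func normalizedToUp(_ image: UIImage) -> UIImage {
    guard image.imageOrientation != .up else { return image }
    let format = UIGraphicsImageRendererFormat.default()
    format.scale = image.scale
    return UIGraphicsImageRenderer(size: image.size, format: format).image { _ in
        image.draw(in: CGRect(origin: .zero, size: image.size))
    }
}
```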
…trait pipeline

- Remove DepthAnythingV2Demo (Apple official CoreML model available)
- Remove WhisperDemo (WhisperKit provides full implementation)
- Remove DWPoseDemo (Apple Vision API has built-in pose detection)
- Remove corresponding conversion scripts
- Update README to reference official implementations
- Fix YOLOv10Demo to parse raw MultiArray output [1,300,6] (sketched after this list)
- Implement full LivePortrait animation pipeline with 4-model inference
- Add AppIcon.appiconset to YOLOv10Demo
- Add Accelerate-based STFT/iSTFT signal processing (vDSP FFT)
- Implement stride-aware MLMultiArray extraction for Float16 outputs (sketched below)
- Use time-domain output only (freq branch overflows Float16 → ±inf)
- Audio loading with format conversion and resampling to 44.1kHz stereo
- Per-stem WAV export and playback
- Fix SwiftUI type-checker timeout in WaveformView
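
A sketch of the [1,300,6] parsing, assuming the usual YOLOv10 end-to-end row layout of [x1, y1, x2, y2, score, classId] (the Detection type and threshold are illustrative):

```swift
import CoreGraphics
import CoreML

struct Detection {
    let box: CGRect
    let score: Float
    let classIndex: Int
}

// Walk the 300 candidate rows and keep those above the score threshold.
func parseDetections(_ output: MLMultiArray, scoreThreshold: Float = 0.25) -> [Detection] {
    let rows = output.shape[1].intValue
    var detections: [Detection] = []
    for row in 0..<rows {
        let r = row as NSNumber
        let score = output[[0, r, 4]].floatValue
        guard score >= scoreThreshold else { continue }
        let x1 = CGFloat(output[[0, r, 0]].floatValue)
        let y1 = CGFloat(output[[0, r, 1]].floatValue)
        let x2 = CGFloat(output[[0, r, 2]].floatValue)
        let y2 = CGFloat(output[[0, r, 3]].floatValue)
        detections.append(Detection(
            box: CGRect(x: x1, y: y1, width: x2 - x1, height: y2 - y1),
            score: score,
            classIndex: Int(output[[0, r, 5]].floatValue)))
    }
    return detections
}
```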

Known issue: freq_output produces ±inf due to Float16 overflow in the
model's frequency branch. Reconverting the model with Float32 outputs
should enable freq+time reconstruction for better separation quality.
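
The stride-aware read mentioned in the list above might look like this, assuming a Float16 multiarray (helper name is illustrative):

```swift
import CoreML

// Index via the multiarray's strides instead of assuming a densely
// packed layout; some Float16 outputs come back non-contiguous.
func float16Value(in array: MLMultiArray, at indices: [Int]) -> Float {
    var offset = 0
    for (axis, index) in indices.enumerated() {
        offset += index * array.strides[axis].intValue
    }
    return array.withUnsafeBytes { raw in
        Float(raw.bindMemory(to: Float16.self)[offset])
    }
}
```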
Draft conversion script to reconvert HTDemucs with Float32 precision.
The current Float16 model causes overflow (±inf) in the frequency branch.
- freq_output overflows Float16 range (±inf) for real STFT data,
  even with Float32 internal computation (output tensor is Float16)
- Use time-domain output only for stem reconstruction
- Add F32 model to Xcode project (compute_precision=FLOAT32)
- Simplify conversion script: ONNX-based, end-to-end model
- Source order confirmed: drums, bass, other, vocals

Known issue: Full freq+time reconstruction requires Float32 output
tensors in the CoreML model. The current model spec forces Float16
output which cannot represent large STFT values (>65504).
Time-only provides decent separation quality.

- Add computeSpectralInput: STFT with Python _spec padding, CaC channel format
- Add inverseSpec: iSTFT matching Python _ispec (time padding + trim)
- Feed actual STFT data to spectral_magnitude (was zeros, disabling freq branch)
- Normalize STFT input (÷√N) to match Python torch.stft(normalized=True)
- Compensate iSTFT output (×√N) for correct freq+time addition (both scalings sketched after this list)
- Add stride-aware MLMultiArray fallback for non-contiguous layouts
- Fix stem order: vocals=3, other=2 (matching Python model.sources)
- Generalize forwardSTFT/inverseSTFT for variable frame counts
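
The two scalings in the list above might reduce to the following (names are illustrative; `spectrum` and `samples` are flat Float buffers):

```swift
import Accelerate

// torch.stft(normalized=True) divides by sqrt(N) for FFT size N; mirror
// that on the forward side and multiply back after the iSTFT so the
// freq and time branches sum on the same scale.
func stftNormalized(_ spectrum: [Float], fftSize: Int) -> [Float] {
    vDSP.multiply(1 / Float(fftSize).squareRoot(), spectrum)
}

func istftCompensated(_ samples: [Float], fftSize: Int) -> [Float] {
    vDSP.multiply(Float(fftSize).squareRoot(), samples)
}
```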