Sample iOS app for ByteDance/Hyper-SD — single-step text-to-image generation distilled from SD 1.5 via Trajectory Segmented Consistency Distillation. ByteDance reports 2× user preference vs. SD-Turbo at 1 step.
1-step generations on iPhone, 512×512. Prompts: cat with sunglasses, cyberpunk city, japanese garden, astronaut on horse.
4 CoreML models, ~947 MB total: a CLIP text encoder, a UNet split into two chunks (6-bit palettized), and a VAE decoder. A custom Swift implementation of the TCD scheduler drives single-step inference.
| Model | Size | Input | Output |
|---|---|---|---|
| HyperSDTextEncoder.mlpackage.zip | 235 MB | input_ids [1,77] | encoder_hidden_states [1,77,768] |
| HyperSDUnetChunk1.mlpackage.zip | 318 MB | latent + encoder_hs + timestep | first half intermediates |
| HyperSDUnetChunk2.mlpackage.zip | 299 MB | first half outputs + skip connections | noise_pred [2,4,64,64] |
| HyperSDVAEDecoder.mlpackage.zip | 95 MB | latent [1,4,64,64] | image [1,3,512,512] |
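
As a rough illustration of what single-step inference computes between the UNet output and the VAE decoder input, the sketch below applies the standard epsilon-prediction identity to recover a clean latent from a noisy one. It is not the app's Swift scheduler and leaves out TCD's multi-step handling. The function names and the timestep `t = 999` are assumptions made for illustration; the beta schedule values are SD 1.5's published scaled-linear defaults.

```python
import math

def alphas_cumprod(num_train_timesteps=1000, beta_start=0.00085, beta_end=0.012):
    """Cumulative product of (1 - beta_t) for the scaled-linear schedule."""
    s, e = math.sqrt(beta_start), math.sqrt(beta_end)
    out, prod = [], 1.0
    for i in range(num_train_timesteps):
        # Scaled-linear: interpolate in sqrt(beta) space, then square.
        beta = (s + (e - s) * i / (num_train_timesteps - 1)) ** 2
        prod *= 1.0 - beta
        out.append(prod)
    return out

def guided_noise(uncond, cond, guidance_scale):
    """Classifier-free guidance: uncond + g * (cond - uncond)."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]

def predict_x0(x_t, eps, alpha_bar_t):
    """Invert x_t = sqrt(a) * x0 + sqrt(1 - a) * eps to recover x0."""
    sqrt_a = math.sqrt(alpha_bar_t)
    sqrt_one_minus_a = math.sqrt(1.0 - alpha_bar_t)
    return [(x - sqrt_one_minus_a * n) / sqrt_a for x, n in zip(x_t, eps)]

# Toy 3-element "latent"; the real latent has shape [1, 4, 64, 64].
t = 999  # ASSUMPTION: illustrative timestep, not taken from the app
a_bar = alphas_cumprod()[t]
x_t = [0.3, -1.1, 0.8]          # noisy latent fed to the UNet
eps_uncond = [0.2, -0.9, 0.5]   # stand-in for the unconditional noise prediction
eps_cond = [0.25, -1.0, 0.6]    # stand-in for the text-conditioned prediction

# With guidance_scale = 1.0 the guided noise equals the conditional prediction.
eps = guided_noise(eps_uncond, eps_cond, guidance_scale=1.0)
x0 = predict_x0(x_t, eps, a_bar)  # this is what the VAE decoder would receive
print(x0)
```

With `guidance_scale=1.0` the guidance formula reduces to the conditional prediction, so the unconditional branch contributes nothing to the result.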
- Download the four `.mlpackage.zip` files above
- Unzip and drag them into the Xcode project
- Build and run on iPhone 15 or newer (the chunked UNet expects ANE)
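
If you would rather script the unzip step, here is a minimal sketch using only the Python standard library. The function name `unzip_models` and the directory arguments are hypothetical; the archive names come from the table above. Dragging the extracted packages into Xcode is still a manual step.

```python
import zipfile
from pathlib import Path

# Archive names as listed in the model table.
EXPECTED = [
    "HyperSDTextEncoder.mlpackage.zip",
    "HyperSDUnetChunk1.mlpackage.zip",
    "HyperSDUnetChunk2.mlpackage.zip",
    "HyperSDVAEDecoder.mlpackage.zip",
]

def unzip_models(download_dir, dest_dir):
    """Extract every *.mlpackage.zip found in download_dir into dest_dir.

    Returns (extracted, missing): the archive names that were unpacked and
    the expected archive names that were not found.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(download_dir).glob("*.mlpackage.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
        extracted.append(archive.name)
    missing = [name for name in EXPECTED if name not in extracted]
    return extracted, missing
```

Calling `unzip_models("Downloads", "Models")` (both paths are placeholders) returns the archives it unpacked and any of the four that were not present, which makes an incomplete download easy to spot before opening Xcode.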
- LoRA fusion before conversion. The Hyper-SD 1-step LoRA is fused into the SD 1.5 base model with `pipe.fuse_lora()` before handing the unified model to Apple's `ml-stable-diffusion`.
- Apple's `torch2coreml` toolchain is invoked with `--attention-implementation SPLIT_EINSUM` (Neural Engine path) and `--chunk-unet` (memory-efficient inference). The UNet is split across two mlpackages so each chunk fits ANE memory.
- 6-bit kmeans palettization on UNet only. The CLIP text encoder's FP16 weights contain `inf` values that break kmeans, so the text encoder ships at FP16 instead.
- Quantize after chunking, not before. Apple's tool palettizes the unchunked model; once chunks are emitted, each chunk has to be re-palettized separately.
- coremltools 9.0 patches required:
  - Custom `int` op converter for multi-dim tensor shape casts
  - `list(block.operations)` workaround in `chunk_mlprogram.py` for the new `CacheDoublyLinkedList` API
- Inference scheduler. A custom Swift TCD scheduler implementation drives single-step inference with `guidance_scale=1.0` (no CFG amplification at 1 step).
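
To make the palettization notes above concrete, here is a toy 1-D k-means palettizer in plain Python. It is a sketch of the general technique (a 2^6 = 64-entry lookup table plus one index per weight), not coremltools' implementation, and the names `palettize` and `has_non_finite` are hypothetical. The way this toy fails on `inf` is specific to it, but it shows why non-finite weights and centroid averaging do not mix.

```python
import math

def has_non_finite(weights):
    """True if any weight is inf or NaN, in which case palettization is skipped."""
    return any(not math.isfinite(w) for w in weights)

def palettize(weights, nbits=6, iters=10):
    """Toy 1-D k-means palettization; returns (lookup_table, indices)."""
    k = 2 ** nbits
    lo, hi = min(weights), max(weights)
    # Initialise centroids evenly across the weight range.
    lut = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    idx = [0] * len(weights)
    for _ in range(iters):
        # Assignment step: nearest centroid for every weight.
        idx = [min(range(k), key=lambda j: abs(w - lut[j])) for w in weights]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [w for w, a in zip(weights, idx) if a == j]
            if members:  # empty clusters keep their previous centroid
                lut[j] = sum(members) / len(members)
    return lut, idx

# Finite weights compress to 64 centroids with a small reconstruction error.
weights = [i / 100 for i in range(-100, 101)]
lut, idx = palettize(weights)
restored = [lut[a] for a in idx]
mse = sum((r - w) ** 2 for r, w in zip(restored, weights)) / len(weights)
print(len(lut), mse)

# A single inf poisons the lookup table, so such a tensor is left at FP16.
bad = weights + [math.inf]
if has_non_finite(bad):
    bad_lut, _ = palettize(bad)
    print(all(math.isfinite(c) for c in bad_lut))
```

A guard like `has_non_finite` is one way to decide per tensor, or per model as in this project, whether to palettize or keep FP16.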
