Wan2.1 text-to-video and image-to-video implementation in MLX. The model weights are downloaded directly from the Hugging Face Hub.
| Model | Task | HF Repo | RAM (unquantized), 81 frames | Single DiT step on M4 Max chip, 81 frames |
|---|---|---|---|---|
| 1.3B | T2V | Wan-AI/Wan2.1-T2V-1.3B | ~10GB | ~90 s/it |
| 14B | T2V | Wan-AI/Wan2.1-T2V-14B | ~36GB | ~230 s/it |
| 14B | I2V | Wan-AI/Wan2.1-I2V-14B-480P | ~39GB | ~250 s/it |

Install the dependencies:

    pip install -r requirements.txt

Saving videos requires ffmpeg on your PATH.

Generate a video with the default 1.3B model:

    python txt2video.py 'A cat playing piano' --output out.mp4

Use the 14B model with quantization:

    python txt2video.py 'A cat playing piano' \
      --model t2v-14B --quantize --output out_14B.mp4

Adjust resolution, frame count, and sampling parameters:

    python txt2video.py 'Ocean waves crashing on a rocky shore at sunset' \
      --size 832x480 --frames 81 --steps 50 --guidance 5.0 --seed 42 \
      --output waves.mp4

For more parameters, run `python txt2video.py --help`.

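
To get a feel for how `--size` and `--frames` drive memory use and step time, the sketch below estimates the latent tensor shape and the number of tokens the DiT processes. It is not taken from this repository. It assumes that `--size` is `WIDTHxHEIGHT`, that the Wan2.1 VAE compresses 4x in time and 8x in space into 16 latent channels, and that the DiT uses 1x2x2 patches, as in the upstream Wan2.1 reference configuration. The helper names `latent_shape` and `dit_tokens` are hypothetical.

```python
def latent_shape(size: str, frames: int,
                 t_stride: int = 4, s_stride: int = 8, channels: int = 16):
    """Return (channels, latent_frames, latent_height, latent_width).

    The stride and channel defaults are assumptions based on the upstream
    Wan2.1 configuration, not values read from this repository.
    """
    width, height = (int(v) for v in size.lower().split("x"))
    if (frames - 1) % t_stride != 0:
        raise ValueError(f"frames should have the form {t_stride}*n + 1")
    if width % s_stride or height % s_stride:
        raise ValueError(f"width and height should be multiples of {s_stride}")
    return (channels,
            (frames - 1) // t_stride + 1,
            height // s_stride,
            width // s_stride)


def dit_tokens(shape, patch=(1, 2, 2)):
    """Number of patch tokens for a latent of the given shape (assumed patch size)."""
    _, f, h, w = shape
    return (f // patch[0]) * (h // patch[1]) * (w // patch[2])


shape = latent_shape("832x480", 81)
print(shape)              # (16, 21, 60, 104)
print(dit_tokens(shape))  # 32760
```

Under these assumptions, doubling the frame count roughly doubles the token count, and attention cost grows faster than linearly with it.
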
Generate a video from an input image:

    python img2video.py 'Astronaut riding a horse' \
      --image ./inputs/astronaut-on-a-horse.png --quantize --output out_i2v.mp4

Adjust resolution and sampling parameters:

    python img2video.py 'Astronaut riding a horse' \
      --image ./inputs/astronaut-on-a-horse.png --size 832x480 --frames 81 --steps 40 \
      --guidance 5.0 --shift 3.0 --seed 42 --output out_i2v.mp4

For more parameters, run `python img2video.py --help`.

Pass `--quantize` (or `-q`) to the CLI to run with quantized weights:

    python txt2video.py 'A cat playing piano' --quantize --output out_quantized.mp4

For additional memory savings at the cost of some speed, pass `--no-cache`. It prevents MLX from using its buffer cache (it calls `mx.set_cache_limit(0)` under the hood). See the documentation for more information.

    python txt2video.py 'A cat playing piano' --output out.mp4 --no-cache

For the 1.3B model at 480p and 81 frames, a `--no-cache` run uses ~10GB of RAM, compared with ~14GB otherwise.

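
The same setting can be applied from your own script. The sketch below is a minimal example of what `--no-cache` is described as doing, not the repository's code. The wrapper name `disable_mlx_buffer_cache` is hypothetical, and the function returns `None` when MLX (or a version exposing `mx.set_cache_limit`) is not available.

```python
def disable_mlx_buffer_cache():
    """Set the MLX buffer cache limit to 0 bytes.

    Returns the previous limit in bytes, or None if MLX is not installed
    or does not expose set_cache_limit at the top level.
    """
    try:
        import mlx.core as mx
    except ImportError:
        return None
    setter = getattr(mx, "set_cache_limit", None)
    if setter is None:
        return None
    # With a limit of 0, freed buffers are released instead of being kept for reuse.
    return setter(0)


previous = disable_mlx_buffer_cache()
print("previous cache limit (bytes):", previous)
```

Keeping the returned value lets you restore the earlier limit after generation if other MLX work in the same process benefits from the cache.
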
Use `--checkpoint` to load custom DiT weights (e.g. step-distilled models). Pass `--sampler euler` to use Euler sampling for step-distilled models.

For the text-to-video pipeline, you can try this 4-step distilled model:

    wget https://huggingface.co/lightx2v/Wan2.1-Distill-Models/resolve/main/wan2.1_t2v_14b_lightx2v_4step.safetensors

    python txt2video.py 'A cat playing piano' \
      --model t2v-14B --checkpoint ./wan2.1_t2v_14b_lightx2v_4step.safetensors \
      --sampler euler --steps 4 --guidance 1.0 \
      --quantize --output out_t2v_distilled.mp4

For the image-to-video pipeline, we use the 4-step distilled I2V model:

    wget https://huggingface.co/lightx2v/Wan2.1-Distill-Models/resolve/main/wan2.1_i2v_480p_lightx2v_4step.safetensors

    python img2video.py 'Astronaut riding a horse' \
      --image ./inputs/astronaut-on-a-horse.png --checkpoint ./wan2.1_i2v_480p_lightx2v_4step.safetensors \
      --sampler euler --steps 4 --guidance 1.0 --shift 5.0 \
      --quantize --output out_i2v_distilled.mp4

- Negative prompts: `--n-prompt 'blurry, low quality, distorted'`
- Disable CFG: `--guidance 1.0` skips the unconditional pass, roughly halving compute per step.

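
The interaction between `--sampler euler`, `--steps`, `--shift`, and `--guidance` can be illustrated with a toy sampler. This is a sketch under stated assumptions, not the code in this repository: it assumes a flow-matching model that predicts a velocity, the commonly used schedule warp `shift * s / (1 + (shift - 1) * s)`, and the standard classifier-free guidance (CFG) combination. All function names, and the toy model itself, are hypothetical.

```python
def shifted_sigmas(steps: int, shift: float) -> list:
    """Noise levels from 1.0 down to 0.0, warped by the shift factor (assumed formula)."""
    sigmas = []
    for i in range(steps + 1):
        s = 1.0 - i / steps
        sigmas.append(shift * s / (1.0 + (shift - 1.0) * s))
    return sigmas


def predict(model, x, sigma, cond, uncond, guidance):
    """Velocity prediction with CFG; the unconditional pass is skipped at guidance 1.0."""
    v_cond = model(x, sigma, cond)
    if guidance == 1.0:
        return v_cond
    v_uncond = model(x, sigma, uncond)
    return [u + guidance * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(model, x, cond, uncond, steps=4, shift=5.0, guidance=1.0):
    """One first-order Euler update per step along the shifted schedule."""
    sigmas = shifted_sigmas(steps, shift)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v = predict(model, x, sigma, cond, uncond, guidance)
        x = [xi + (sigma_next - sigma) * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for the DiT: it ignores the prompt and always points at TARGET.
TARGET = [0.5, -1.0]
calls = {"n": 0}


def toy_model(x, sigma, prompt):
    calls["n"] += 1
    return [(xi - ti) / sigma for xi, ti in zip(x, TARGET)]


out = euler_sample(toy_model, [2.0, 3.0], "a cat", "", steps=4, shift=5.0, guidance=1.0)
print(out, "model calls:", calls["n"])  # 4 calls: one per step at guidance 1.0
```

With any guidance value other than 1.0 the same run makes two model calls per step, which is where the "roughly halving compute" figure comes from.
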
TeaCache skips redundant transformer computations when consecutive steps produce similar embeddings, eliminating 20-60% of forward passes. The TeaCache parameters are calibrated per resolution; consult the LightX2V configs for advanced tuning. Our defaults are defined in pipeline.py.

    python txt2video.py 'A cat playing piano' --teacache 0.05 --output out.mp4 --verbose

Recommended thresholds (1.3B):

| Threshold | Skip Rate | Quality |
|---|---|---|
| 0.05 | ~34% | Almost lossless |
| 0.1 | ~58% | Slightly corrupted |
| 0.25 | ~76% | Visible quality loss |

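
The skipping rule behind the threshold can be sketched as follows. This is a simplified illustration, not the logic in pipeline.py: it accumulates the relative L1 change between consecutive embeddings and reuses the last computed residual while the accumulated change stays below the threshold. The per-resolution rescaling that TeaCache applies to the distance is omitted, and the class and function names are hypothetical.

```python
def rel_l1(curr, prev):
    """Relative L1 distance between two embeddings."""
    num = sum(abs(c - p) for c, p in zip(curr, prev))
    den = sum(abs(p) for p in prev)
    return num / den if den else float("inf")


class SkipCache:
    def __init__(self, threshold):
        self.threshold = threshold
        self.accumulated = 0.0
        self.prev_embedding = None
        self.cached_residual = None
        self.skipped = 0

    def run_step(self, embedding, hidden, transformer):
        can_skip = False
        if self.prev_embedding is not None and self.cached_residual is not None:
            self.accumulated += rel_l1(embedding, self.prev_embedding)
            can_skip = self.accumulated < self.threshold
        self.prev_embedding = list(embedding)
        if can_skip:
            # Reuse the residual from the last full forward pass.
            self.skipped += 1
            return [h + r for h, r in zip(hidden, self.cached_residual)]
        out = transformer(hidden)
        self.cached_residual = [o - h for o, h in zip(out, hidden)]
        self.accumulated = 0.0  # reset after every full pass
        return out


transformer_calls = {"n": 0}


def toy_transformer(hidden):
    transformer_calls["n"] += 1
    return [2.0 * v for v in hidden]


cache = SkipCache(threshold=0.05)
embeddings = [[1.0, 1.0], [1.01, 1.0], [1.02, 1.0], [2.0, 2.0]]
for emb in embeddings:
    cache.run_step(emb, [1.0, 1.0], toy_transformer)
print("skipped:", cache.skipped, "full passes:", transformer_calls["n"])  # skipped: 2 full passes: 2
```

A larger threshold lets more accumulated change pass before a full forward pass is forced, which raises the skip rate and increases drift from the uncached result.
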
Comparison for the prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.":

| `--teacache 0.05` | `--teacache 0.1` | `--teacache 0.25` |
|---|---|---|
| 34% steps skipped (17/50) | 58% steps skipped (29/50) | 76% steps skipped (38/50) |