Skip to content

Latest commit

 

History

History

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 
 
 

README.md

Wan2.1

Wan2.1 text-to-video and image-to-video implementation in MLX. The model weights are downloaded directly from the Hugging Face Hub.

Model Task HF Repo RAM (unquantized), 81 frames Single DiT step on M4 Max chip, 81 frames
1.3B T2V Wan-AI/Wan2.1-T2V-1.3B ~10GB ~90 s/it
14B T2V Wan-AI/Wan2.1-T2V-14B ~36GB ~230 s/it
14B I2V Wan-AI/Wan2.1-I2V-14B-480P ~39GB ~250 s/it
T2V 1.3B T2V 14B I2V 14B
WAN t2v 1.3B WAN t2v 14B distilled WAN t2v 14B distilled
Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage. Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage. An astronaut riding a horse

Installation

Install the dependencies:

pip install -r requirements.txt

Saving videos requires ffmpeg on your PATH.

Usage

Text-to-Video

Generate a video with the default 1.3B model:

python txt2video.py 'A cat playing piano' --output out.mp4

Use the 14B model with quantization:

python txt2video.py 'A cat playing piano' \
    --model t2v-14B --quantize --output out_14B.mp4

Adjust resolution, frame count, and sampling parameters:

python txt2video.py 'Ocean waves crashing on a rocky shore at sunset' \
    --size 832x480 --frames 81 --steps 50 --guidance 5.0 --seed 42 \
    --output waves.mp4

For more parameters, use python txt2video.py --help.

Image-to-Video

Generate a video from an input image:

python img2video.py 'Astronaut riding a horse' \
    --image ./inputs/astronaut-on-a-horse.png --quantize --output out_i2v.mp4

Adjust resolution and sampling parameters:

python img2video.py 'Astronaut riding a horse' \
    --image ./inputs/astronaut-on-a-horse.png --size 832x480 --frames 81 --steps 40 \
    --guidance 5.0 --shift 3.0 --seed 42 --output out_i2v.mp4

For more parameters, use python img2video.py --help.

Quantization

Pass --quantize (or -q) to the CLI

python txt2video.py 'A cat playing piano' --quantize --output out_quantized.mp4

Disabling the cache

To get additional memory savings at the expense of a bit of speed use --no-cache argument. It will prevent MLX from utilizing the cache (sets mx.set_cache_limit(0) under the hood). See documentation for more info

python txt2video.py 'A cat playing piano' --output out.mp4 --no-cache

For 1.3B model 480p 81 frames --no-cache run utilizes ~10GB of RAM and ~14GB of RAM otherwise

Custom DiT Weights

Use --checkpoint to load custom DiT weights (e.g. step-distilled models). Pass --sampler euler to use Euler sampling for step-distilled models:

For text to video pipeline you can try this 4 steps distilled model

wget https://huggingface.co/lightx2v/Wan2.1-Distill-Models/resolve/main/wan2.1_t2v_14b_lightx2v_4step.safetensors
python txt2video.py 'A cat playing piano' \
    --model t2v-14B --checkpoint ./wan2.1_t2v_14b_lightx2v_4step.safetensors \
    --sampler euler --steps 4 --guidance 1.0 \
    --quantize --output out_t2v_distilled.mp4

For image to video pipeline we use 4 steps distilled i2v model

wget https://huggingface.co/lightx2v/Wan2.1-Distill-Models/resolve/main/wan2.1_i2v_480p_lightx2v_4step.safetensors
python img2video.py 'Astronaut riding a horse' \
    --image ./inputs/astronaut-on-a-horse.png --checkpoint ./wan2.1_i2v_480p_lightx2v_4step.safetensors \
    --sampler euler --steps 4 --guidance 1.0 --shift 5.0 \
    --quantize --output out_i2v_distilled.mp4

Options

  • Negative prompts: --n-prompt 'blurry, low quality, distorted'
  • Disable CFG: --guidance 1.0 skips the unconditional pass, roughly halving compute per step.

TeaCache

TeaCache skips redundant transformer computations when consecutive steps produce similar embeddings, eliminating 20-60% of forward passes. Note that the TeaCache parameters are calibrated for each resolution, consult with LightX2V configs for advanced tweaking. Our defaults are located at pipeline.py

python txt2video.py 'A cat playing piano' --teacache 0.05 --output out.mp4 --verbose

Recommended thresholds (1.3B):

Threshold Skip Rate Quality
0.05 ~34% Almost lossless
0.1 ~58% Slightly corrupted
0.25 ~76% Visible quality loss

Result with --teacache for 1.3B model

Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.

--teacache 0.05, 34% steps skipped (17/50) --teacache 0.1, 58% steps skipped (29/50) --teacache 0.25, 76% steps skipped (38/50)
WAN t2v 1.3B teacache=0.05 WAN t2v 1.3B teacache=0.05 WAN t2v 1.3B teacache=0.05

References

  1. Original WAN 2.1 implementation
  2. LightX2V