Commit 0a78bb9

Merge pull request #864 from modelscope/wans2v: Support Wan-S2V
2 parents 6663dca + 9cea10c

File tree

10 files changed: +1122 / -5 lines


README.md

Lines changed: 3 additions & 0 deletions
@@ -201,6 +201,7 @@ save_video(video, "video1.mp4", fps=15, quality=5)
 | Model ID | Extra Parameters | Inference | Full Training | Validate After Full Training | LoRA Training | Validate After LoRA Training |
 |-|-|-|-|-|-|-|
+|[Wan-AI/Wan2.2-S2V-14B](https://www.modelscope.cn/models/Wan-AI/Wan2.2-S2V-14B)|`input_image`, `input_audio`, `audio_sample_rate`, `s2v_pose_video`|[code](./examples/wanvideo/model_inference/Wan2.2-S2V-14B.py)|-|-|-|-|
 |[Wan-AI/Wan2.2-I2V-A14B](https://modelscope.cn/models/Wan-AI/Wan2.2-I2V-A14B)|`input_image`|[code](./examples/wanvideo/model_inference/Wan2.2-I2V-A14B.py)|[code](./examples/wanvideo/model_training/full/Wan2.2-I2V-A14B.sh)|[code](./examples/wanvideo/model_training/validate_full/Wan2.2-I2V-A14B.py)|[code](./examples/wanvideo/model_training/lora/Wan2.2-I2V-A14B.sh)|[code](./examples/wanvideo/model_training/validate_lora/Wan2.2-I2V-A14B.py)|
 |[Wan-AI/Wan2.2-T2V-A14B](https://modelscope.cn/models/Wan-AI/Wan2.2-T2V-A14B)||[code](./examples/wanvideo/model_inference/Wan2.2-T2V-A14B.py)|[code](./examples/wanvideo/model_training/full/Wan2.2-T2V-A14B.sh)|[code](./examples/wanvideo/model_training/validate_full/Wan2.2-T2V-A14B.py)|[code](./examples/wanvideo/model_training/lora/Wan2.2-T2V-A14B.sh)|[code](./examples/wanvideo/model_training/validate_lora/Wan2.2-T2V-A14B.py)|
 |[Wan-AI/Wan2.2-TI2V-5B](https://modelscope.cn/models/Wan-AI/Wan2.2-TI2V-5B)|`input_image`|[code](./examples/wanvideo/model_inference/Wan2.2-TI2V-5B.py)|[code](./examples/wanvideo/model_training/full/Wan2.2-TI2V-5B.sh)|[code](./examples/wanvideo/model_training/validate_full/Wan2.2-TI2V-5B.py)|[code](./examples/wanvideo/model_training/lora/Wan2.2-TI2V-5B.sh)|[code](./examples/wanvideo/model_training/validate_lora/Wan2.2-TI2V-5B.py)|

@@ -372,6 +373,8 @@ https://github.com/Artiprocher/DiffSynth-Studio/assets/35051019/59fb2f7b-8de0-44
 
 ## Update History
+- **August 28, 2025**: We support Wan2.2-S2V, an audio-driven cinematic video generation model open-sourced by Alibaba. See [./examples/wanvideo/](./examples/wanvideo/).
+
 - **August 21, 2025**: [DiffSynth-Studio/Qwen-Image-EliGen-V2](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-EliGen-V2) is released! Compared to the V1 version, the training dataset has been updated to the [Qwen-Image-Self-Generated-Dataset](https://www.modelscope.cn/datasets/DiffSynth-Studio/Qwen-Image-Self-Generated-Dataset), enabling generated images to better align with the inherent image distribution and style of Qwen-Image. Please refer to [our sample code](./examples/qwen_image/model_inference_low_vram/Qwen-Image-EliGen-V2.py).
 
 - **August 21, 2025**: We open-sourced the [DiffSynth-Studio/Qwen-Image-In-Context-Control-Union](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-In-Context-Control-Union) structure control LoRA model. Following the "In Context" routine, it supports various types of structural control conditions, including canny, depth, lineart, softedge, normal, and openpose. Please refer to [our sample code](./examples/qwen_image/model_inference/Qwen-Image-In-Context-Control-Union.py).
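The new table row lists the extra parameters Wan2.2-S2V accepts at inference time. A minimal, hypothetical sketch of assembling them (the file names and sample-rate value are placeholder assumptions, and the commented-out `pipe(...)` call is only an assumed call shape; the real usage is in the linked `Wan2.2-S2V-14B.py` example):

```python
# Hypothetical sketch: the extra parameters the table lists for Wan2.2-S2V.
# All values here are placeholders, not taken from the library.
s2v_kwargs = {
    "input_image": "reference.png",  # reference image for the subject
    "input_audio": "speech.wav",     # driving audio track
    "audio_sample_rate": 16000,      # sample rate of input_audio (assumed value)
    "s2v_pose_video": None,          # optional pose-guidance video
}
# video = pipe(prompt="...", **s2v_kwargs)  # assumed pipeline call shape
```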

README_zh.md

Lines changed: 3 additions & 0 deletions
@@ -201,6 +201,7 @@ save_video(video, "video1.mp4", fps=15, quality=5)
 |Model ID|Extra Parameters|Inference|Full Training|Validate After Full Training|LoRA Training|Validate After LoRA Training|
 |-|-|-|-|-|-|-|
+|[Wan-AI/Wan2.2-S2V-14B](https://www.modelscope.cn/models/Wan-AI/Wan2.2-S2V-14B)|`input_image`, `input_audio`, `audio_sample_rate`, `s2v_pose_video`|[code](./examples/wanvideo/model_inference/Wan2.2-S2V-14B.py)|-|-|-|-|
 |[Wan-AI/Wan2.2-I2V-A14B](https://modelscope.cn/models/Wan-AI/Wan2.2-I2V-A14B)|`input_image`|[code](./examples/wanvideo/model_inference/Wan2.2-I2V-A14B.py)|[code](./examples/wanvideo/model_training/full/Wan2.2-I2V-A14B.sh)|[code](./examples/wanvideo/model_training/validate_full/Wan2.2-I2V-A14B.py)|[code](./examples/wanvideo/model_training/lora/Wan2.2-I2V-A14B.sh)|[code](./examples/wanvideo/model_training/validate_lora/Wan2.2-I2V-A14B.py)|
 |[Wan-AI/Wan2.2-T2V-A14B](https://modelscope.cn/models/Wan-AI/Wan2.2-T2V-A14B)||[code](./examples/wanvideo/model_inference/Wan2.2-T2V-A14B.py)|[code](./examples/wanvideo/model_training/full/Wan2.2-T2V-A14B.sh)|[code](./examples/wanvideo/model_training/validate_full/Wan2.2-T2V-A14B.py)|[code](./examples/wanvideo/model_training/lora/Wan2.2-T2V-A14B.sh)|[code](./examples/wanvideo/model_training/validate_lora/Wan2.2-T2V-A14B.py)|
 |[Wan-AI/Wan2.2-TI2V-5B](https://modelscope.cn/models/Wan-AI/Wan2.2-TI2V-5B)|`input_image`|[code](./examples/wanvideo/model_inference/Wan2.2-TI2V-5B.py)|[code](./examples/wanvideo/model_training/full/Wan2.2-TI2V-5B.sh)|[code](./examples/wanvideo/model_training/validate_full/Wan2.2-TI2V-5B.py)|[code](./examples/wanvideo/model_training/lora/Wan2.2-TI2V-5B.sh)|[code](./examples/wanvideo/model_training/validate_lora/Wan2.2-TI2V-5B.py)|

@@ -388,6 +389,8 @@ https://github.com/Artiprocher/DiffSynth-Studio/assets/35051019/59fb2f7b-8de0-44
 
 ## Update History
+- **August 28, 2025**: We support Wan2.2-S2V, an audio-driven cinematic video generation model. See [./examples/wanvideo/](./examples/wanvideo/).
+
 - **August 21, 2025**: [DiffSynth-Studio/Qwen-Image-EliGen-V2](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-EliGen-V2) is released! Compared to the V1 version, the training dataset has been switched to the [Qwen-Image-Self-Generated-Dataset](https://www.modelscope.cn/datasets/DiffSynth-Studio/Qwen-Image-Self-Generated-Dataset), so the generated images better match Qwen-Image's own image distribution and style. Please refer to [our sample code](./examples/qwen_image/model_inference_low_vram/Qwen-Image-EliGen-V2.py).
 
 - **August 21, 2025**: We open-sourced the [DiffSynth-Studio/Qwen-Image-In-Context-Control-Union](https://www.modelscope.cn/models/DiffSynth-Studio/Qwen-Image-In-Context-Control-Union) structure control LoRA model. Following the "In Context" routine, it supports various types of structural control conditions, including canny, depth, lineart, softedge, normal, and openpose. Please refer to [our sample code](./examples/qwen_image/model_inference/Qwen-Image-In-Context-Control-Union.py).

diffsynth/configs/model_config.py

Lines changed: 4 additions & 0 deletions
@@ -56,11 +56,13 @@
 from ..models.stepvideo_dit import StepVideoModel
 
 from ..models.wan_video_dit import WanModel
+from ..models.wan_video_dit_s2v import WanS2VModel
 from ..models.wan_video_text_encoder import WanTextEncoder
 from ..models.wan_video_image_encoder import WanImageEncoder
 from ..models.wan_video_vae import WanVideoVAE, WanVideoVAE38
 from ..models.wan_video_motion_controller import WanMotionControllerModel
 from ..models.wan_video_vace import VaceWanModel
+from ..models.wav2vec import WanS2VAudioEncoder
 
 from ..models.step1x_connector import Qwen2Connector
 
@@ -155,6 +157,7 @@
     (None, "a61453409b67cd3246cf0c3bebad47ba", ["wan_video_dit", "wan_video_vace"], [WanModel, VaceWanModel], "civitai"),
     (None, "7a513e1f257a861512b1afd387a8ecd9", ["wan_video_dit", "wan_video_vace"], [WanModel, VaceWanModel], "civitai"),
     (None, "cb104773c6c2cb6df4f9529ad5c60d0b", ["wan_video_dit"], [WanModel], "diffusers"),
+    (None, "966cffdcc52f9c46c391768b27637614", ["wan_video_dit"], [WanS2VModel], "civitai"),
     (None, "9c8818c2cbea55eca56c7b447df170da", ["wan_video_text_encoder"], [WanTextEncoder], "civitai"),
     (None, "5941c53e207d62f20f9025686193c40b", ["wan_video_image_encoder"], [WanImageEncoder], "civitai"),
     (None, "1378ea763357eea97acdef78e65d6d96", ["wan_video_vae"], [WanVideoVAE], "civitai"),
@@ -172,6 +175,7 @@
     (None, "ed4ea5824d55ec3107b09815e318123a", ["qwen_image_vae"], [QwenImageVAE], "diffusers"),
     (None, "073bce9cf969e317e5662cd570c3e79c", ["qwen_image_blockwise_controlnet"], [QwenImageBlockWiseControlNet], "civitai"),
     (None, "a9e54e480a628f0b956a688a81c33bab", ["qwen_image_blockwise_controlnet"], [QwenImageBlockWiseControlNet], "civitai"),
+    (None, "06be60f3a4526586d8431cd038a71486", ["wans2v_audio_encoder"], [WanS2VAudioEncoder], "civitai"),
 ]
 huggingface_model_loader_configs = [
     # These configs are provided for detecting model type automatically.
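The tuples added above map a state-dict hash to the model names and classes the loader should instantiate, plus the checkpoint format. A minimal sketch of how such hash-based detection works, using the two hashes from this commit (the `find_model_config` helper and the string class names are illustrative assumptions, not DiffSynth's actual loader code):

```python
# Illustrative sketch of hash-based model detection, mirroring the config tuple
# layout above: (architecture, state-dict hash, model names, classes, format).
# Class names are strings here only to keep the sketch self-contained.
model_loader_configs = [
    (None, "966cffdcc52f9c46c391768b27637614", ["wan_video_dit"], ["WanS2VModel"], "civitai"),
    (None, "06be60f3a4526586d8431cd038a71486", ["wans2v_audio_encoder"], ["WanS2VAudioEncoder"], "civitai"),
]

def find_model_config(state_dict_hash):
    # Linear scan: return the first config whose hash matches, else None.
    for config in model_loader_configs:
        if config[1] == state_dict_hash:
            return config
    return None

# A checkpoint whose state-dict hash matches is routed to the S2V audio encoder.
match = find_model_config("06be60f3a4526586d8431cd038a71486")
```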

diffsynth/data/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-from .video import VideoData, save_video, save_frames
+from .video import VideoData, save_video, save_frames, merge_video_audio, save_video_with_audio

diffsynth/data/video.py

Lines changed: 69 additions & 0 deletions
@@ -2,6 +2,8 @@
 import numpy as np
 from PIL import Image
 from tqdm import tqdm
+import subprocess
+import shutil
 
 
 class LowMemoryVideo:
@@ -146,3 +148,70 @@ def save_frames(frames, save_path):
     os.makedirs(save_path, exist_ok=True)
     for i, frame in enumerate(tqdm(frames, desc="Saving images")):
         frame.save(os.path.join(save_path, f"{i}.png"))
+
+
+def merge_video_audio(video_path: str, audio_path: str):
+    """
+    Merge the video and audio into a new video, with the duration set to the shorter
+    of the two, and overwrite the original video file.
+
+    Parameters:
+        video_path (str): Path to the original video file
+        audio_path (str): Path to the audio file
+    """
+    # TODO: may need an in-Python implementation to avoid the subprocess dependency
+    if not os.path.exists(video_path):
+        raise FileNotFoundError(f"video file {video_path} does not exist")
+    if not os.path.exists(audio_path):
+        raise FileNotFoundError(f"audio file {audio_path} does not exist")
+
+    base, ext = os.path.splitext(video_path)
+    temp_output = f"{base}_temp{ext}"
+
+    try:
+        # build the ffmpeg command
+        command = [
+            "ffmpeg",
+            "-y",             # overwrite the output file if it exists
+            "-i", video_path,
+            "-i", audio_path,
+            "-c:v", "copy",   # copy the video stream without re-encoding
+            "-c:a", "aac",    # encode the audio stream with AAC
+            "-b:a", "192k",   # audio bitrate (optional)
+            "-map", "0:v:0",  # take the first video stream from the video input
+            "-map", "1:a:0",  # take the first audio stream from the audio input
+            "-shortest",      # cut the output to the shorter of the two streams
+            temp_output,
+        ]
+
+        # run ffmpeg and check the result
+        result = subprocess.run(
+            command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
+        if result.returncode != 0:
+            raise RuntimeError(f"FFmpeg execution failed: {result.stderr}")
+
+        shutil.move(temp_output, video_path)
+        print(f"Merge completed, saved to {video_path}")
+
+    except Exception as e:
+        # clean up the temporary file; the original video is left untouched
+        if os.path.exists(temp_output):
+            os.remove(temp_output)
+        print(f"merge_video_audio failed with error: {e}")
+
+
+def save_video_with_audio(frames, save_path, audio_path, fps=16, quality=9, ffmpeg_params=None):
+    save_video(frames, save_path, fps, quality, ffmpeg_params)
+    merge_video_audio(save_path, audio_path)
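The stream-mapping logic of the ffmpeg call above can be checked without invoking ffmpeg by rebuilding the command list it constructs; a minimal sketch (the `build_merge_command` helper and file names are illustrative, not part of the library):

```python
# Illustrative sketch: rebuild the ffmpeg command that merge_video_audio
# constructs, so its flags can be inspected without running ffmpeg.
def build_merge_command(video_path, audio_path, temp_output):
    return [
        "ffmpeg",
        "-y",             # overwrite the output file if it exists
        "-i", video_path,
        "-i", audio_path,
        "-c:v", "copy",   # keep the video stream as-is
        "-c:a", "aac",    # re-encode audio to AAC
        "-b:a", "192k",   # audio bitrate
        "-map", "0:v:0",  # video comes from the first input (index 0)
        "-map", "1:a:0",  # audio comes from the second input (index 1)
        "-shortest",      # output duration = shorter of the two streams
        temp_output,
    ]

cmd = build_merge_command("video1.mp4", "speech.wav", "video1_temp.mp4")
```

Writing to a temporary output and then moving it over the original (as the committed code does) keeps the source video intact if ffmpeg fails partway through.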
