Commit bcbfcd4

Authored by yzhautouskay, dg845, and github-actions[bot]
Add Cosmos3 video2video generation support (#13896)
* Init v2v cosmos3 commit
* Add user guide; prompt upsampling is TBD
* Apply style fixes

Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
1 parent 21e3338 commit bcbfcd4

4 files changed

Lines changed: 331 additions & 2 deletions

docs/source/en/api/pipelines/cosmos3.md

Lines changed: 196 additions & 0 deletions
@@ -77,6 +77,8 @@ python -m cosmos_framework.inference.prompt_upsampling \

 Switch `--mode` to match the workflow you are targeting (`text2image`, `text2video`, `image2video`). The command writes the upsampled prompt(s) to the `--output` file as a JSON array (one object per non-empty line in `--input`); pass a `.jsonl` path instead to get one JSON object per line. For `image2video`, you must also supply the conditioning image via `--image-url` (a URL or local path) or `--image-list` (one image per prompt).

+<!-- TODO: Add prompt upsampling support for video inputs (video-to-video) to the upsampler CLI. -->
+
 A pre-upsampled positive prompt (`assets/example_t2v_prompt.json`) and negative prompt (`assets/negative_prompt.json`) are provided for convenience, and are used by the generation examples below. The examples load these JSON files and pass them to the pipeline as JSON strings via `json.dumps(...)`.

 ## Text-to-video
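
The prompt-upsampling notes in this hunk describe two output shapes (a JSON array, or one JSON object per line for a `.jsonl` path) and say the examples pass prompts to the pipeline as JSON strings. A minimal sketch of reading either shape is shown here. The helper names `load_upsampled_prompts` and `to_pipeline_prompt` are hypothetical and are not part of diffusers or the upsampler CLI; the sketch also assumes each record is a plain JSON object.

```python
import json
from pathlib import Path


def load_upsampled_prompts(path):
    """Return a list of prompt objects from a .json array or a .jsonl file.

    Hypothetical helper for illustration only.
    """
    text = Path(path).read_text(encoding="utf-8")
    if str(path).endswith(".jsonl"):
        # JSONL: one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    data = json.loads(text)
    # A file holding a single object is wrapped so callers always get a list.
    return data if isinstance(data, list) else [data]


def to_pipeline_prompt(prompt_obj):
    """Serialize one prompt object to a JSON string, as the examples do with json.dumps."""
    return json.dumps(prompt_obj)
```

Each element returned by `load_upsampled_prompts` can then be serialized with `to_pipeline_prompt` before being passed as `prompt=`.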
@@ -276,6 +278,200 @@ export_to_video(result.video, "cosmos3_i2v.mp4", fps=24, macro_block_size=1)
 </hfoption>
 </hfoptions>

+## Video-to-video
+
+Pass a conditioning clip via `video=` (e.g. from `load_video`). The pipeline anchors the leading latent frames given by `condition_frame_indexes_vision` (default `[0, 1]`) to the clip and denoises the rest. Use `condition_video_keep` (`"first"` or `"last"`) to choose which end of a longer source clip the conditioning frames are taken from. As with the other modes, the prompt should follow the descriptive JSON structure described in [Prompt upsampling](#prompt-upsampling).
+
+<!-- TODO: Add prompt upsampling support for video inputs (video-to-video) to the upsampler CLI. -->
+
+<hfoptions id="model">
+<hfoption id="Nano">
+
+```python
+import json
+import torch
+from diffusers import Cosmos3OmniPipeline
+from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
+from diffusers.utils import export_to_video, load_video
+
+# JSON-upsampled positive and negative prompts (see "Prompt upsampling" above).
+json_prompt = json.load(open("assets/example_v2v_prompt.json"))
+negative_prompt = json.load(open("assets/negative_prompt_i2v.json"))
+
+pipe = Cosmos3OmniPipeline.from_pretrained(
+    "nvidia/Cosmos3-Nano", torch_dtype=torch.bfloat16, device_map="cuda"
+)
+pipe.scheduler = UniPCMultistepScheduler.from_config(
+    pipe.scheduler.config, flow_shift=10.0, use_karras_sigmas=False
+)
+
+video = load_video(
+    "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/vision/robot_pouring.mp4"
+)
+
+result = pipe(
+    prompt=json.dumps(json_prompt),
+    negative_prompt=json.dumps(negative_prompt),
+    video=video,
+    condition_frame_indexes_vision=[0, 1],
+    condition_video_keep="first",
+    num_frames=189,
+    height=720,
+    width=1280,
+    num_inference_steps=35,
+    guidance_scale=6.0,
+    fps=24.0,
+)
+# macro_block_size=1 allows arbitrary frame sizes (Cosmos3 outputs are not always divisible by 16).
+export_to_video(result.video, "cosmos3_v2v.mp4", fps=24, macro_block_size=1)
+```
+
+</hfoption>
+<hfoption id="Super">
+
+```python
+import json
+import torch
+from diffusers import Cosmos3OmniPipeline
+from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
+from diffusers.utils import export_to_video, load_video
+
+# JSON-upsampled positive and negative prompts (see "Prompt upsampling" above).
+json_prompt = json.load(open("assets/example_v2v_prompt.json"))
+negative_prompt = json.load(open("assets/negative_prompt_i2v.json"))
+
+pipe = Cosmos3OmniPipeline.from_pretrained(
+    "nvidia/Cosmos3-Super", torch_dtype=torch.bfloat16, device_map="cuda"
+)
+pipe.scheduler = UniPCMultistepScheduler.from_config(
+    pipe.scheduler.config, flow_shift=10.0, use_karras_sigmas=False
+)
+
+video = load_video(
+    "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/vision/robot_pouring.mp4"
+)
+
+result = pipe(
+    prompt=json.dumps(json_prompt),
+    negative_prompt=json.dumps(negative_prompt),
+    video=video,
+    condition_frame_indexes_vision=[0, 1],
+    condition_video_keep="first",
+    num_frames=189,
+    height=720,
+    width=1280,
+    num_inference_steps=35,
+    guidance_scale=6.0,
+    fps=24.0,
+)
+# macro_block_size=1 allows arbitrary frame sizes (Cosmos3 outputs are not always divisible by 16).
+export_to_video(result.video, "cosmos3_v2v.mp4", fps=24, macro_block_size=1)
+```
+
+</hfoption>
+</hfoptions>
+
+## Video-to-video with sound
+
+When the checkpoint carries a `sound_tokenizer`, add `enable_sound=True` to the video-to-video call to jointly generate a synchronized audio track. The waveform is returned alongside the video and can be muxed into the MP4 with [`~utils.encode_video`].
+
+<hfoptions id="model">
+<hfoption id="Nano">
+
+```python
+import json
+import torch
+from diffusers import Cosmos3OmniPipeline
+from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
+from diffusers.utils import encode_video, load_video
+
+# JSON-upsampled positive and negative prompts (see "Prompt upsampling" above).
+json_prompt = json.load(open("assets/example_v2v_prompt.json"))
+negative_prompt = json.load(open("assets/negative_prompt_i2v.json"))
+
+pipe = Cosmos3OmniPipeline.from_pretrained(
+    "nvidia/Cosmos3-Nano", torch_dtype=torch.bfloat16, device_map="cuda"
+)
+pipe.scheduler = UniPCMultistepScheduler.from_config(
+    pipe.scheduler.config, flow_shift=10.0, use_karras_sigmas=False
+)
+
+video = load_video(
+    "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/vision/robot_pouring.mp4"
+)
+
+result = pipe(
+    prompt=json.dumps(json_prompt),
+    negative_prompt=json.dumps(negative_prompt),
+    video=video,
+    condition_frame_indexes_vision=[0, 1],
+    condition_video_keep="first",
+    num_frames=189,
+    height=720,
+    width=1280,
+    fps=24.0,
+    enable_sound=True,
+)
+
+encode_video(
+    result.video,
+    fps=24,
+    audio=result.sound,
+    audio_sample_rate=pipe.sound_tokenizer.config.sampling_rate,
+    output_path="cosmos3_v2v_with_sound.mp4",
+)
+```
+
+</hfoption>
+<hfoption id="Super">
+
+```python
+import json
+import torch
+from diffusers import Cosmos3OmniPipeline
+from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
+from diffusers.utils import encode_video, load_video
+
+# JSON-upsampled positive and negative prompts (see "Prompt upsampling" above).
+json_prompt = json.load(open("assets/example_v2v_prompt.json"))
+negative_prompt = json.load(open("assets/negative_prompt_i2v.json"))
+
+pipe = Cosmos3OmniPipeline.from_pretrained(
+    "nvidia/Cosmos3-Super", torch_dtype=torch.bfloat16, device_map="cuda"
+)
+pipe.scheduler = UniPCMultistepScheduler.from_config(
+    pipe.scheduler.config, flow_shift=10.0, use_karras_sigmas=False
+)
+
+video = load_video(
+    "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/vision/robot_pouring.mp4"
+)
+
+result = pipe(
+    prompt=json.dumps(json_prompt),
+    negative_prompt=json.dumps(negative_prompt),
+    video=video,
+    condition_frame_indexes_vision=[0, 1],
+    condition_video_keep="first",
+    num_frames=189,
+    height=720,
+    width=1280,
+    fps=24.0,
+    enable_sound=True,
+)
+
+encode_video(
+    result.video,
+    fps=24,
+    audio=result.sound,
+    audio_sample_rate=pipe.sound_tokenizer.config.sampling_rate,
+    output_path="cosmos3_v2v_with_sound.mp4",
+)
+```
+
+</hfoption>
+</hfoptions>
+
 ## Text-to-video with sound

 When the checkpoint carries a `sound_tokenizer`, pass `enable_sound=True` to jointly generate a synchronized audio track. The waveform is returned alongside the video and can be muxed into the MP4 with [`~utils.encode_video`].
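
The sound sections make `enable_sound=True` conditional on the checkpoint carrying a `sound_tokenizer`. A small guard along the following lines could decide whether to request audio. Both helper names are hypothetical. Whether a sound-less checkpoint exposes the attribute as `None` or omits it entirely is an assumption, so the sketch uses `getattr` to cover both cases.

```python
def supports_sound(pipe):
    """True when the pipeline object exposes a non-None `sound_tokenizer`.

    Hypothetical helper; covers both a missing attribute and a None value.
    """
    return getattr(pipe, "sound_tokenizer", None) is not None


def sound_kwargs(pipe):
    """Extra call kwargs: request audio only when the checkpoint can produce it."""
    return {"enable_sound": True} if supports_sound(pipe) else {}
```

A call site would then spread the result into the pipeline call, for example `pipe(prompt=..., video=video, **sound_kwargs(pipe))`.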

examples/cosmos3/README.md

Lines changed: 10 additions & 0 deletions
@@ -40,6 +40,16 @@ python examples/cosmos3/inference_cosmos3.py \
     --vision-path https://github.com/nvidia-cosmos/cosmos-dependencies/releases/download/assets/robot_153.jpg
 ```

+Video-to-video (condition on the leading frames of a clip and continue it):
+
+```bash
+python examples/cosmos3/inference_cosmos3.py \
+    --prompt "A robotic arm finishes pouring liquid into the glass." \
+    --video-path "https://github.com/nvidia-cosmos/cosmos-dependencies/raw/refs/heads/assets/cosmos3/inputs/vision/robot_pouring.mp4" \
+    --condition-frame-indexes-vision 0,1 \
+    --condition-video-keep first
+```
+
 Text-to-video-with-sound (sound-capable checkpoint only):

 ```bash

examples/cosmos3/inference_cosmos3.py

Lines changed: 43 additions & 0 deletions
@@ -18,6 +18,9 @@
 Image-to-video:
     python inference_cosmos3.py --prompt "..." --vision-path /path/to/image.jpg

+Video-to-video:
+    python inference_cosmos3.py --prompt "..." --video-path /path/to/video.mp4
+
 Text-to-video-with-sound (requires a sound-capable checkpoint):
     python inference_cosmos3.py --prompt "..." --enable-sound
 """
@@ -70,6 +73,22 @@ def main():
         default=None,
         help="Optional URL or local path for an image-conditioning frame, or an action conditioning video.",
     )
+    parser.add_argument(
+        "--video-path",
+        default=None,
+        help="Optional URL or local path to a conditioning video for video-to-video generation.",
+    )
+    parser.add_argument(
+        "--condition-frame-indexes-vision",
+        default=None,
+        help="Comma-separated latent frame indexes kept clean for video-to-video (default: 0,1).",
+    )
+    parser.add_argument(
+        "--condition-video-keep",
+        choices=["first", "last"],
+        default="first",
+        help="Take the video-to-video conditioning frames from the first or last of the source clip (default: first).",
+    )
     parser.add_argument("--output", default=".", help="Directory to save generated video/image/audio files.")
     parser.add_argument(
         "--height",
@@ -206,6 +225,30 @@ def main():
             add_duration_template=args.add_duration_template,
             enable_safety_check=not args.no_safety_check,
         )
+    elif args.video_path is not None:
+        video = load_video(args.video_path)
+        condition_frame_indexes_vision = (
+            [int(i) for i in args.condition_frame_indexes_vision.split(",") if i.strip()]
+            if args.condition_frame_indexes_vision is not None
+            else [0, 1]
+        )
+        result = pipeline(
+            prompt=args.prompt,
+            video=video,
+            condition_frame_indexes_vision=condition_frame_indexes_vision,
+            condition_video_keep=args.condition_video_keep,
+            num_frames=args.num_frames,
+            height=args.height,
+            width=args.width,
+            fps=args.fps,
+            num_inference_steps=args.num_inference_steps,
+            enable_sound=args.enable_sound,
+            guidance_scale=args.guidance_scale,
+            generator=generator,
+            add_resolution_template=args.add_resolution_template,
+            add_duration_template=args.add_duration_template,
+            enable_safety_check=not args.no_safety_check,
+        )
     else:
         image = load_image(args.vision_path) if args.vision_path is not None else None
         result = pipeline(
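
The new `elif` branch turns the `--condition-frame-indexes-vision` string into a list of integers and falls back to `[0, 1]` when the flag is omitted. The same logic, pulled out into a standalone function so it can be exercised without loading a model, might look like the sketch below. The function name `parse_frame_indexes` is hypothetical and does not appear in the script.

```python
def parse_frame_indexes(value, default=(0, 1)):
    """Parse a comma-separated string such as "0,1" into a list of ints.

    Mirrors the inline expression in the video-to-video branch:
    None falls back to the default, and empty segments
    (for example from a trailing comma) are skipped.
    """
    if value is None:
        return list(default)
    return [int(part) for part in value.split(",") if part.strip()]
```

As in the script, a non-numeric segment raises `ValueError` from `int(...)`; the sketch adds no extra validation.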
