
Commit 7298f5b

Authored by dg845, sayakpaul, and stevhliu
Update LTX-2 Docs to Cover LTX-2.3 Models (#13337)
* Update LTX-2 docs to cover multimodal guidance and prompt enhancement
* Apply suggestions from code review
* Apply reviewer feedback

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
1 parent b757035 commit 7298f5b

2 files changed: +300 −15 lines

docs/source/en/api/pipelines/ltx2.md

Lines changed: 151 additions & 15 deletions
```diff
@@ -18,7 +18,7 @@
 <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
 </div>
 
-LTX-2 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.
+[LTX-2](https://hf.co/papers/2601.03233) is a DiT-based foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.
 
 You can find all the original LTX-Video checkpoints under the [Lightricks](https://huggingface.co/Lightricks) organization.
 
```
```diff
@@ -293,6 +293,7 @@ import torch
 from diffusers import LTX2ConditionPipeline
 from diffusers.pipelines.ltx2.pipeline_ltx2_condition import LTX2VideoCondition
 from diffusers.pipelines.ltx2.export_utils import encode_video
+from diffusers.pipelines.ltx2.utils import DEFAULT_NEGATIVE_PROMPT
 from diffusers.utils import load_image, load_video
 
 device = "cuda"
```
```diff
@@ -315,19 +316,6 @@ prompt = (
     "landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the "
     "solitude and beauty of a winter drive through a mountainous region."
 )
-negative_prompt = (
-    "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, "
-    "grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, "
-    "deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, "
-    "wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of "
-    "field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent "
-    "lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny "
-    "valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, "
-    "mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, "
-    "off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward "
-    "pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, "
-    "inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
-)
 
 cond_video = load_video(
     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input-vid.mp4"
```
```diff
@@ -343,7 +331,7 @@ frame_rate = 24.0
 video, audio = pipe(
     conditions=conditions,
     prompt=prompt,
-    negative_prompt=negative_prompt,
+    negative_prompt=DEFAULT_NEGATIVE_PROMPT,
     width=width,
     height=height,
     num_frames=121,
```
```diff
@@ -366,6 +354,154 @@ encode_video(
 
 Because the conditioning is done via latent frames, the 8 data space frames corresponding to the specified latent frame for an image condition will tend to be static.
```
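As a rough illustration of why those frames stay static: the causal video VAE compresses time by a factor of 8 (which is why the examples use `num_frames=121`, i.e. 8·15 + 1), so conditioning on a single latent frame effectively pins a run of 8 decoded frames. The index arithmetic below is an assumption for illustration, not taken from the pipeline source:

```python
# Illustrative mapping from a latent frame index to the data-space frames it
# decodes to, assuming 8x temporal compression and an 8k + 1 frame count
# (e.g. num_frames=121). Hypothetical helper, not the pipeline's API.

TEMPORAL_COMPRESSION = 8

def data_frames_for_latent(latent_idx: int) -> list[int]:
    if latent_idx == 0:
        # In a causal VAE the first latent frame decodes to a single frame.
        return [0]
    start = (latent_idx - 1) * TEMPORAL_COMPRESSION + 1
    return list(range(start, start + TEMPORAL_COMPRESSION))
```

With 121 frames (latent indices 0 through 15), latent frame 15 covers decoded frames 113 through 120, the last 8 frames of the clip.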

```diff
+## Multimodal Guidance
+
+LTX-2.X pipelines support multimodal guidance. It is composed of three terms, all using a CFG-style update rule:
+
+1. Classifier-Free Guidance (CFG): standard [CFG](https://huggingface.co/papers/2207.12598), where the perturbed ("weaker") output is generated using the negative prompt.
+2. Spatio-Temporal Guidance (STG): [STG](https://huggingface.co/papers/2411.18664) moves away from a perturbed output created by short-cutting self-attention operations and substituting in the attention values instead. The idea is that this creates sharper videos and better spatiotemporal consistency.
+3. Modality Isolation Guidance: moves away from a perturbed output created by disabling cross-modality (audio-to-video and video-to-audio) cross-attention. This guidance is specific to [LTX-2.X](https://huggingface.co/papers/2601.03233) models, with the idea that it produces better consistency between the generated audio and video.
+
+These are controlled by the `guidance_scale`, `stg_scale`, and `modality_scale` arguments and can be set separately for video and audio. Additionally, for STG the transformer block indices where self-attention is skipped need to be specified via the `spatio_temporal_guidance_blocks` argument. The LTX-2.X pipelines also support [guidance rescaling](https://huggingface.co/papers/2305.08891) to help reduce over-exposure, which can be a problem when the guidance scales are set to high values.
```
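The three terms can be pictured with a minimal scalar sketch. The function and argument names below are hypothetical, and the pipeline operates on video/audio latent tensors rather than floats; the sketch only shows the shape of the CFG-style update implied by the argument semantics (CFG and modality guidance neutral at a scale of 1.0, STG neutral at 0.0):

```python
# Minimal scalar sketch of a CFG-style multi-term guidance update.
# All names are illustrative, not the pipeline's internal API.

def guided_prediction(cond, uncond, perturbed_stg, perturbed_modality,
                      guidance_scale, stg_scale, modality_scale):
    pred = cond
    pred += (guidance_scale - 1.0) * (cond - uncond)              # CFG (off at 1.0)
    pred += stg_scale * (cond - perturbed_stg)                    # STG (off at 0.0)
    pred += (modality_scale - 1.0) * (cond - perturbed_modality)  # modality isolation (off at 1.0)
    return pred
```

`guidance_rescale` then renormalizes the combined prediction toward the statistics of the conditional output, per the linked guidance-rescaling paper, which is why high scales pair well with a nonzero rescale value.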
````diff
+```py
+import torch
+from diffusers import LTX2ImageToVideoPipeline
+from diffusers.pipelines.ltx2.export_utils import encode_video
+from diffusers.pipelines.ltx2.utils import DEFAULT_NEGATIVE_PROMPT
+from diffusers.utils import load_image
+
+device = "cuda"
+width = 768
+height = 512
+random_seed = 42
+frame_rate = 24.0
+generator = torch.Generator(device).manual_seed(random_seed)
+model_path = "dg845/LTX-2.3-Diffusers"
+
+pipe = LTX2ImageToVideoPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
+pipe.enable_sequential_cpu_offload(device=device)
+pipe.vae.enable_tiling()
+
+prompt = (
+    "An astronaut hatches from a fragile egg on the surface of the Moon, the shell cracking and peeling apart in "
+    "gentle low-gravity motion. Fine lunar dust lifts and drifts outward with each movement, floating in slow arcs "
+    "before settling back onto the ground. The astronaut pushes free in a deliberate, weightless motion, small "
+    "fragments of the egg tumbling and spinning through the air. In the background, the deep darkness of space subtly "
+    "shifts as stars glide with the camera's movement, emphasizing vast depth and scale. The camera performs a "
+    "smooth, cinematic slow push-in, with natural parallax between the foreground dust, the astronaut, and the "
+    "distant starfield. Ultra-realistic detail, physically accurate low-gravity motion, cinematic lighting, and a "
+    "breath-taking, movie-like shot."
+)
+
+image = load_image(
+    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg",
+)
+
+video, audio = pipe(
+    image=image,
+    prompt=prompt,
+    negative_prompt=DEFAULT_NEGATIVE_PROMPT,
+    width=width,
+    height=height,
+    num_frames=121,
+    frame_rate=frame_rate,
+    num_inference_steps=30,
+    guidance_scale=3.0,  # Recommended LTX-2.3 guidance parameters
+    stg_scale=1.0,  # Note that 0.0 (not 1.0) means that STG is disabled (all other guidance is disabled at 1.0)
+    modality_scale=3.0,
+    guidance_rescale=0.7,
+    audio_guidance_scale=7.0,  # Note that a higher CFG guidance scale is recommended for audio
+    audio_stg_scale=1.0,
+    audio_modality_scale=3.0,
+    audio_guidance_rescale=0.7,
+    spatio_temporal_guidance_blocks=[28],
+    use_cross_timestep=True,
+    generator=generator,
+    output_type="np",
+    return_dict=False,
+)
+
+encode_video(
+    video[0],
+    fps=frame_rate,
+    audio=audio[0].float().cpu(),
+    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
+    output_path="ltx2_3_i2v_stage_1.mp4",
+)
+```
+
````
```diff
+## Prompt Enhancement
+
+The LTX-2.X models are sensitive to prompting style. Refer to the [official prompting guide](https://ltx.io/model/model-blog/prompting-guide-for-ltx-2) for recommendations on how to write a good prompt. Prompt enhancement, where the supplied prompts are rewritten by the pipeline's text encoder (by default a [Gemma 3](https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-unquantized) model) according to a system prompt, can also improve sample quality. The optional `processor` pipeline component needs to be present to use prompt enhancement. Enable it by supplying a `system_prompt` argument:
```
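Conceptually, enhancement runs the text encoder as a chat model: the system prompt instructs it how to rewrite, and the user prompt is the text being rewritten. A rough sketch of the chat-style message structure typically consumed by a processor's chat template (the helper below is hypothetical, and the exact template the LTX-2 pipelines build internally may differ; `T2V_DEFAULT_SYSTEM_PROMPT` plays the role of the system message):

```python
# Sketch of the system/user message pair behind prompt enhancement.
# `build_enhancement_messages` is a hypothetical helper for illustration.

def build_enhancement_messages(system_prompt: str, user_prompt: str) -> list[dict]:
    return [
        {"role": "system", "content": system_prompt},  # how to rewrite
        {"role": "user", "content": user_prompt},      # what to rewrite
    ]

messages = build_enhancement_messages(
    "Rewrite the user's prompt as a detailed, cinematic video description.",
    "An astronaut hatches from an egg on the Moon.",
)
```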
````diff
+```py
+import torch
+from transformers import Gemma3Processor
+from diffusers import LTX2Pipeline
+from diffusers.pipelines.ltx2.export_utils import encode_video
+from diffusers.pipelines.ltx2.utils import DEFAULT_NEGATIVE_PROMPT, T2V_DEFAULT_SYSTEM_PROMPT
+
+device = "cuda"
+width = 768
+height = 512
+random_seed = 42
+frame_rate = 24.0
+generator = torch.Generator(device).manual_seed(random_seed)
+model_path = "dg845/LTX-2.3-Diffusers"
+
+pipe = LTX2Pipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
+pipe.enable_model_cpu_offload(device=device)
+pipe.vae.enable_tiling()
+if getattr(pipe, "processor", None) is None:
+    processor = Gemma3Processor.from_pretrained("google/gemma-3-12b-it-qat-q4_0-unquantized")
+    pipe.processor = processor
+
+prompt = (
+    "An astronaut hatches from a fragile egg on the surface of the Moon, the shell cracking and peeling apart in "
+    "gentle low-gravity motion. Fine lunar dust lifts and drifts outward with each movement, floating in slow arcs "
+    "before settling back onto the ground. The astronaut pushes free in a deliberate, weightless motion, small "
+    "fragments of the egg tumbling and spinning through the air. In the background, the deep darkness of space subtly "
+    "shifts as stars glide with the camera's movement, emphasizing vast depth and scale. The camera performs a "
+    "smooth, cinematic slow push-in, with natural parallax between the foreground dust, the astronaut, and the "
+    "distant starfield. Ultra-realistic detail, physically accurate low-gravity motion, cinematic lighting, and a "
+    "breath-taking, movie-like shot."
+)
+
+video, audio = pipe(
+    prompt=prompt,
+    negative_prompt=DEFAULT_NEGATIVE_PROMPT,
+    width=width,
+    height=height,
+    num_frames=121,
+    frame_rate=frame_rate,
+    num_inference_steps=30,
+    guidance_scale=3.0,
+    stg_scale=1.0,
+    modality_scale=3.0,
+    guidance_rescale=0.7,
+    audio_guidance_scale=7.0,
+    audio_stg_scale=1.0,
+    audio_modality_scale=3.0,
+    audio_guidance_rescale=0.7,
+    spatio_temporal_guidance_blocks=[28],
+    use_cross_timestep=True,
+    system_prompt=T2V_DEFAULT_SYSTEM_PROMPT,
+    generator=generator,
+    output_type="np",
+    return_dict=False,
+)
+
+encode_video(
+    video[0],
+    fps=frame_rate,
+    audio=audio[0].float().cpu(),
+    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
+    output_path="ltx2_3_t2v_stage_1.mp4",
+)
+```
+
 ## LTX2Pipeline
 
 [[autodoc]] LTX2Pipeline
````
