[modular] Add LTX Video modular pipeline #13378

akshan-main wants to merge 12 commits into huggingface:main
Conversation
cc @asomoza
Reran with the official example params.

T2V standard: ltx_t2v_standard.mp4
T2V modular: ltx_t2v_modular.mp4

T2V code:

```python
import torch
import numpy as np
from diffusers import LTXPipeline, LTXBlocks
from diffusers.utils import export_to_video

model_id = "Lightricks/LTX-Video"
prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The scene appears to be real-life footage"
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
height, width, num_frames = 480, 704, 161
steps, cfg, seed = 50, 3.0, 42

print("=== Standard T2V ===")
std_pipe = LTXPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")
gen = torch.Generator("cuda").manual_seed(seed)
std_result = std_pipe(
    prompt=prompt, negative_prompt=negative_prompt,
    height=height, width=width, num_frames=num_frames,
    num_inference_steps=steps, guidance_scale=cfg, generator=gen,
    output_type="np",
).frames
export_to_video(std_result[0], "/content/ltx_t2v_standard.mp4", fps=24)
print(f"Standard shape: {np.array(std_result).shape}")
del std_pipe
torch.cuda.empty_cache()

print("\n=== Modular T2V ===")
blocks = LTXBlocks()
mod_pipe = blocks.init_pipeline(model_id)
mod_pipe.load_components(torch_dtype=torch.bfloat16)
mod_pipe.to("cuda")
gen = torch.Generator("cuda").manual_seed(seed)
mod_result = mod_pipe(
    prompt=prompt, negative_prompt=negative_prompt,
    height=height, width=width, num_frames=num_frames,
    num_inference_steps=steps, guidance_scale=cfg, generator=gen,
    output="videos",
)
export_to_video(mod_result[0], "/content/ltx_t2v_modular.mp4", fps=24)
print(f"Modular shape: {np.array(mod_result).shape}")

diff = np.abs(np.array(std_result).astype(float) - np.array(mod_result).astype(float)).mean()
print(f"\nT2V MAD: {diff:.6f}")
print("T2V PARITY:", "PASS" if diff < 1.0 else "FAIL")
del mod_pipe, blocks
torch.cuda.empty_cache()
```

I2V standard: ltx_i2v_standard.mp4
I2V modular: ltx_i2v_modular.mp4

I2V code:

```python
# Continues from the T2V script above (model_id, height, width, num_frames,
# steps, cfg, seed, negative_prompt still in scope).
from diffusers import LTXImageToVideoPipeline, LTXImage2VideoBlocks
from diffusers.utils import load_image

image = load_image("https://cdn.pixabay.com/photo/2014/11/30/14/11/cat-551554_640.jpg").resize((704, 480))
i2v_prompt = "A cat slowly turns its head and looks around"

print("=== Standard I2V ===")
std_pipe = LTXImageToVideoPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")
gen = torch.Generator("cuda").manual_seed(seed)
std_result = std_pipe(
    image=image, prompt=i2v_prompt, negative_prompt=negative_prompt,
    height=height, width=width, num_frames=num_frames,
    num_inference_steps=steps, guidance_scale=cfg, generator=gen,
    output_type="np",
).frames
export_to_video(std_result[0], "/content/ltx_i2v_standard.mp4", fps=24)
print(f"Standard shape: {np.array(std_result).shape}")
del std_pipe
torch.cuda.empty_cache()

print("\n=== Modular I2V ===")
blocks = LTXImage2VideoBlocks()
pipe = blocks.init_pipeline(model_id)
pipe.load_components(torch_dtype=torch.bfloat16)
pipe.to("cuda")
gen = torch.Generator("cuda").manual_seed(seed)
mod_result = pipe(
    image=image, prompt=i2v_prompt, negative_prompt=negative_prompt,
    height=height, width=width, num_frames=num_frames,
    num_inference_steps=steps, guidance_scale=cfg, generator=gen,
    output="videos",
)
export_to_video(mod_result[0], "/content/ltx_i2v_modular.mp4", fps=24)
print(f"Modular shape: {np.array(mod_result).shape}")

diff = np.abs(np.array(std_result).astype(float) - np.array(mod_result).astype(float)).mean()
print(f"\nI2V MAD: {diff:.6f}")
print("I2V PARITY:", "PASS" if diff < 1.0 else "FAIL")
print("\n=== Done ===")
print("Videos saved: ltx_t2v_standard.mp4, ltx_t2v_modular.mp4, ltx_i2v_standard.mp4, ltx_i2v_modular.mp4")
```

Also verified that without CFG (guidance_scale=1.0), the MAD drops to 0.008. The small visual difference with CFG enabled comes from the guider running cond/uncond as separate batches versus the standard pipeline's single concatenated batch. This is the same behavior as the Wan modular pipeline.

No-CFG code:

```python
import torch
import numpy as np
from diffusers import LTXPipeline, LTXBlocks

model_id = "Lightricks/LTX-Video"
prompt = "A woman with long brown hair smiles"
height, width, num_frames = 480, 704, 41
steps, seed = 20, 42

# Standard - no CFG
std_pipe = LTXPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")
gen = torch.Generator("cuda").manual_seed(seed)
std_result = std_pipe(
    prompt=prompt, height=height, width=width, num_frames=num_frames,
    num_inference_steps=steps, guidance_scale=1.0, generator=gen,
    output_type="np",
).frames
del std_pipe; torch.cuda.empty_cache()

# Modular - no CFG
blocks = LTXBlocks()
pipe = blocks.init_pipeline(model_id)
pipe.load_components(torch_dtype=torch.bfloat16)
pipe.to("cuda")
gen = torch.Generator("cuda").manual_seed(seed)
mod_result = pipe(
    prompt=prompt, height=height, width=width, num_frames=num_frames,
    num_inference_steps=steps, guidance_scale=1.0, generator=gen,
    output="videos",
)

diff = np.abs(np.array(std_result).astype(float) - np.array(mod_result).astype(float)).mean()
print(f"No-CFG MAD: {diff:.6f}")
```
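For reference, the parity check used throughout these snippets reduces to a mean absolute difference over the decoded frames. A self-contained sketch, with toy arrays standing in for the two pipelines' outputs:

```python
import numpy as np

def mad(a, b):
    """Mean absolute difference between two frame arrays."""
    return np.abs(np.asarray(a, dtype=np.float64) - np.asarray(b, dtype=np.float64)).mean()

rng = np.random.default_rng(42)
ref = rng.random((4, 64, 64, 3))               # toy stand-in for standard-pipeline frames
near = ref + rng.normal(0.0, 1e-3, ref.shape)  # toy stand-in for modular-pipeline frames

print(f"MAD: {mad(ref, near):.6f}")
print("PARITY:", "PASS" if mad(ref, near) < 1.0 else "FAIL")
```

With frames in [0, 1], a MAD well below 1.0 means the two outputs are numerically close; identical arrays give exactly 0.0.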
yiyixuxu left a comment:

thanks, I left some feedback!
On:

```python
class LTXImage2VideoModularPipeline(LTXModularPipeline):
```
IMO we don't need a modular pipeline for I2V
for wan, we have T2V and I2V pipelines because they are different checkpoints/models
for LTX, it's the same checkpoint supporting multiple workflows, no?
Makes sense! Removed LTXImage2VideoModularPipeline. Both workflows use LTXModularPipeline now.
On:

```python
class LTXCoreDenoiseStep(SequentialPipelineBlocks):
```

@claude can you do a review here?
Claude finished @yiyixuxu's task in 4m 1s. Code Review: LTX Video Modular Pipeline
@yiyixuxu Would you like me to add the condition and latent upsampler workflows to this PR as well, or keep those for a follow-up? Also happy to do LTX2 modular as a next PR if that's useful. Condition is the same checkpoint; the upsampler is different.
|
@akshan-main let's merge this one first and follow up with conditional + upsampler
|
Sounds good! I'll work on the conditional + upsampler follow-up PR next, then. Meanwhile, I also opened a PR for HunyuanVideo 1.5 modular #13389. Both have MAD 0.000000 parity since the standard HunyuanVideo 1.5 pipeline already uses the guider abstraction, so there's no batching difference like with LTX's manual CFG concat. Would love your feedback on it when you get a chance!
|
Added docstrings to assembled blocks using `#auto_docstring`.
|
|
||
On:

```
Inputs:
    num_videos_per_prompt (`None`, *optional*, defaults to 1):
        TODO: Add description.
```
can you update the InputParam/OutputParam definitions? https://huggingface.co/docs/diffusers/main/en/modular_diffusers/auto_docstring#parameter-templates
|
@claude are you here? can you look at the PR and explain what steps need to be done for this to work:

```python
modular_ltx = ModularPipeline.from_pretrained("Lightricks/LTX-Video")
modular_ltx.load_components(torch_dtype=torch.bfloat16)
```
|
Claude finished @yiyixuxu's task in 2m 27s. What's needed for `ModularPipeline.from_pretrained("Lightricks/LTX-Video")`:
| Requirement | Status |
|---|---|
| `("ltx", _create_default_map_fn("LTXModularPipeline"))` in `MODULAR_PIPELINE_MAPPING` | ✅ Added by PR (`modular_pipeline.py`) |
| `("LTXModularPipeline", "LTXBlocks")` in `MODULAR_PIPELINE_BLOCKS_MAPPING` | ❌ Missing |
| Hub repo with `modular_model_index.json` + `auto_map` | `akshan-main/tiny-ltx-modular-pipe` (needs to move to `hf-internal-testing/`) |
So the usage would be:

```python
# Option A — using a dedicated modular pipeline repo (current pattern, works today)
pipe = ModularPipeline.from_pretrained("hf-internal-testing/ltx-modular-pipe")
pipe.load_components(torch_dtype=torch.bfloat16)

# Option B — directly from the model repo (requires Lightricks/LTX-Video
# to have modular_model_index.json + auto_map added)
pipe = ModularPipeline.from_pretrained("Lightricks/LTX-Video")
pipe.load_components(torch_dtype=torch.bfloat16)
```

The PR as written only supports the explicit `LTXBlocks().init_pipeline(model_id)` entry point. To support the cleaner `ModularPipeline.from_pretrained("Lightricks/LTX-Video")` pattern, you'd need `MODULAR_PIPELINE_BLOCKS_MAPPING` updated and either the official model repo or a sidecar modular-pipeline repo published with the right JSON files.
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
|
I think as long as MODULAR_PIPELINE_MAPPING is updated the pattern would work, no?
can you verify, and if you agree with me, maybe we should add an item to the skill file? https://github.com/huggingface/diffusers/blob/main/.ai/skills/model-integration/modular-conversion.md
|
Claude finished @yiyixuxu's task in 2m 47s.

Verification: is `auto_map` required?

| Path | Mechanism | When used |
|---|---|---|
| A | `ModularPipelineBlocks.from_pretrained(repo)` → reads `auto_map` from the repo's `config.json` | Only when the hub repo explicitly has `auto_map` |
| B | `MODULAR_PIPELINE_BLOCKS_MAPPING.get(pipeline_class.__name__)` inside `__init__` | When `blocks=None` (path A returned nothing) |
Path B is the MODULAR_PIPELINE_BLOCKS_MAPPING path. It activates when blocks=None, which is the normal case for a standard model repo like Lightricks/LTX-Video. So auto_map is not required.
The corrected requirements for `ModularPipeline.from_pretrained("Lightricks/LTX-Video")`:

| Requirement | Previous comment | Corrected |
|---|---|---|
| `("LTXModularPipeline", "LTXBlocks")` in `MODULAR_PIPELINE_BLOCKS_MAPPING` | ❌ Missing (noted) | ❌ Missing — needs to be added |
| Hub repo with `modular_model_index.json` (has `_class_name: "LTXModularPipeline"` + component specs) | | ✅ Required so `_get_pipeline_class` can resolve `LTXModularPipeline` and `load_components` knows where to fetch weights |
| `auto_map` in hub repo's `config.json` | | ✅ Not required — Path B handles this |
So the minimal change to make `ModularPipeline.from_pretrained("Lightricks/LTX-Video")` work:

- Add `("LTXModularPipeline", "LTXBlocks")` to `MODULAR_PIPELINE_BLOCKS_MAPPING` in `modular_pipeline.py` (this PR is missing it)
- The hub repo (e.g. `hf-internal-testing/ltx-modular-pipe`) needs `modular_model_index.json` with correct `_class_name` and component loading specs — `auto_map` is optional
Proposed skill file addition
For .ai/skills/model-integration/modular-conversion.md, I'd suggest adding to the conversion checklist:
- [ ] Add `("ltx", "LTXModularPipeline")` to `MODULAR_PIPELINE_MAPPING` in `modular_pipeline.py`
      (enables `blocks.init_pipeline(model_id)` to create the right pipeline class)
- [ ] Add `("LTXModularPipeline", "LTXBlocks")` to `MODULAR_PIPELINE_BLOCKS_MAPPING` in `modular_pipeline.py`
      (enables `ModularPipeline.from_pretrained(repo)` without needing `auto_map` in the hub repo —
      `auto_map` is only needed if you want `ModularPipelineBlocks.from_pretrained()` to work directly)
- [ ] Publish a sidecar modular repo (e.g. `hf-internal-testing/<model>-modular-pipe`) with
      `modular_model_index.json` pointing to the model weights; `auto_map` is optional

Since the `.ai/` directory doesn't exist on this branch (it lives on main only), and the skill file is outside `src/diffusers/`, I'll leave the actual edit to you or @akshan-main. But I'd be glad to implement the missing `MODULAR_PIPELINE_BLOCKS_MAPPING` entry in `modular_pipeline.py` if you want — just say the word.
|
@akshan-main thanks, I will do a review now
|
`MODULAR_PIPELINE_BLOCKS_MAPPING` doesn't exist in the codebase
|
ohh right |
On:

```python
return [
    InputParam.template("prompt"),
    InputParam.template("negative_prompt"),
    InputParam.template("prompt_embeds"),
```
we don't need to list prompt_embeds as an input. We use this pattern in our standard pipelines to let users skip encoding etc., but in modular it is not needed; you can just pop out the text encoder block and run it separately.
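A rough sketch of that workflow (the sub-block key `"text_encoder"` and the state field names are illustrative, not necessarily the keys used by `LTXBlocks`):

```python
import torch
from diffusers import LTXBlocks

# Hypothetical sketch: split the text encoder out of the assembled blocks
blocks = LTXBlocks()
text_blocks = blocks.sub_blocks.pop("text_encoder")  # key is illustrative

# Run the text-encoding step standalone to get embeddings...
text_pipe = text_blocks.init_pipeline("Lightricks/LTX-Video")
text_pipe.load_components(torch_dtype=torch.bfloat16)
text_state = text_pipe(prompt="A cat slowly turns its head")

# ...then feed the precomputed embeddings to the rest of the pipeline
pipe = blocks.init_pipeline("Lightricks/LTX-Video")
pipe.load_components(torch_dtype=torch.bfloat16)
out = pipe(prompt_embeds=text_state.values["prompt_embeds"], output="videos")
```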
On:

```python
@staticmethod
def _get_t5_prompt_embeds(
```
can we make this a regular function? so custom blocks can use it as well
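The refactor being asked for, as a toy sketch (the names here are stand-ins, not the real diffusers API): hoist the helper to module level so any block can reuse it, instead of hiding it as a `@staticmethod` on one class.

```python
def get_prompt_token_ids(vocab, prompt):
    """Module-level stand-in for a helper like _get_t5_prompt_embeds."""
    return [vocab.get(token, 0) for token in prompt.split()]

class LTXTextEncoderStep:
    # the built-in block just delegates to the shared function
    def encode(self, vocab, prompt):
        return get_prompt_token_ids(vocab, prompt)

class MyCustomBlock:
    # custom blocks can call the same function without subclassing anything
    def encode(self, vocab, prompt):
        return get_prompt_token_ids(vocab, prompt)

vocab = {"a": 1, "cat": 2}
print(LTXTextEncoderStep().encode(vocab, "a cat"))  # [1, 2]
```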
On:

```python
# Set guidance_scale on guider so CFG is configured correctly
guidance_scale = getattr(block_state, "guidance_scale", 3.0)
```
we don't need to accept guidance_scale in the modular pipeline. Users can configure the guider separately: https://huggingface.co/docs/diffusers/modular_diffusers/guiders#changing-guider-parameters
As we support more guider types, each will have its own set of parameters, and we won't be able to forward all of them through the pipeline inputs.
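For reference, the pattern from the guiders doc looks roughly like this (a sketch; the import path and component name `guider` are assumptions and may vary by diffusers version):

```python
from diffusers import LTXBlocks
from diffusers.guiders import ClassifierFreeGuidance  # import path may vary

pipe = LTXBlocks().init_pipeline("Lightricks/LTX-Video")
pipe.load_components()

# configure CFG on the guider component itself instead of a pipeline input
pipe.update_components(guider=ClassifierFreeGuidance(guidance_scale=5.0))
```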
On:

```python
@property
def intermediate_outputs(self) -> list[OutputParam]:
    return [
        OutputParam.template("latents"),
```

here we cannot use the template, because this is not the "denoise latent" as defined in the output param template
On:

```python
import torch

from ...models import LTXVideoTransformer3DModel
from ...pipelines.ltx.pipeline_ltx import LTXPipeline
```

let's not import the standard pipeline here; the modular and standard pipelines are meant to be parallel.
On:

```python
block_state.latents = randn_tensor(
    shape, generator=block_state.generator, device=device, dtype=torch.float32
)
block_state.latents = LTXPipeline._pack_latents(
```

you can redefine it as a regular function here, or maybe use `# Copied from`. See an example using `# Copied from`: https://github.com/huggingface/diffusers/blob/main/src/diffusers/modular_pipelines/wan/before_denoise.py#L495
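For intuition, here is what the packing does, as a simplified numpy sketch (the real `LTXPipeline._pack_latents` operates on torch tensors; the shapes and default patch sizes here are illustrative only):

```python
import numpy as np

# Simplified stand-in for latent packing: fold (B, C, F, H, W) latents
# into a (B, num_patches, patch_features) token sequence.
def pack_latents(latents, patch_size=1, patch_size_t=1):
    b, c, f, h, w = latents.shape
    p, pt = patch_size, patch_size_t
    x = latents.reshape(b, c, f // pt, pt, h // p, p, w // p, p)
    x = x.transpose(0, 2, 4, 6, 1, 3, 5, 7)  # group patch dims with channels
    return x.reshape(b, (f // pt) * (h // p) * (w // p), c * pt * p * p)

x = np.zeros((2, 128, 2, 4, 4), dtype=np.float32)
print(pack_latents(x, patch_size=2).shape)  # (2, 8, 512)
```

Redefining it as a module-level function like this (or copying it with a `# Copied from` marker) keeps the modular file free of imports from the standard pipeline.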
On:

```python
if not isinstance(image, torch.Tensor):
    from ...video_processor import VideoProcessor

    processor = VideoProcessor(vae_scale_factor=components.vae_spatial_compression_ratio)
```

this should be a component, no?
On:

```python
else:
    init_latents = [
        retrieve_latents(
            components.vae.encode(img.unsqueeze(0).unsqueeze(2).to(vae_dtype)), block_state.generator
```
we should extract the vae encoding into its own block in encoders.py (e.g. LTXVaeEncoderStep), and here this step should accept image_latents as input instead of raw image. This way users can run the VAE encoder standalone and pass pre-computed latents directly. See https://github.com/huggingface/diffusers/blob/main/src/diffusers/modular_pipelines/wan/encoders.py#L470
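A rough skeleton of what that could look like, modeled on `WanVaeEncoderStep` (names, param templates, and the `encode_image` helper are illustrative, not the final API):

```python
class LTXVaeEncoderStep(ModularPipelineBlocks):
    model_name = "ltx"

    @property
    def inputs(self) -> list[InputParam]:
        return [InputParam("image"), InputParam("generator")]

    @property
    def intermediate_outputs(self) -> list[OutputParam]:
        return [OutputParam("image_latents")]

    def __call__(self, components, state):
        block_state = self.get_block_state(state)
        # encode_image is a hypothetical helper: preprocess + VAE-encode the
        # conditioning image into latents once, up front
        block_state.image_latents = encode_image(
            components.vae, block_state.image, block_state.generator
        )
        self.set_block_state(state, block_state)
        return components, state
```

The denoise-side block would then accept `image_latents` as an input, so users can run the VAE encoder standalone and pass precomputed latents directly.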
|
|
||
On:

```python
from ...configuration_utils import FrozenDict
from ...models import AutoencoderKLLTXVideo
from ...pipelines.ltx.pipeline_ltx import LTXPipeline
```
same here
let's either redefine or copy the pipeline methods you need
|
|
||
On:

```python
latents = block_state.latents

if block_state.output_type == "latent":
```
we don't need to accept a latent output_type in modular
similar to encode_prompt, we can pop out the decoder step from the pipeline if we don't need it decoded
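Popping the decoder out could look roughly like this (a sketch; the sub-block key `"decode"` and the output names are illustrative):

```python
import torch
from diffusers import LTXBlocks

# Hypothetical sketch: drop the decode step instead of output_type="latent"
blocks = LTXBlocks()
decode_blocks = blocks.sub_blocks.pop("decode")  # key is illustrative

pipe = blocks.init_pipeline("Lightricks/LTX-Video")
pipe.load_components(torch_dtype=torch.bfloat16)
latents = pipe(prompt="A cat slowly turns its head", output="latents")

# decode later (or not at all) by running the popped step on its own
decoder = decode_blocks.init_pipeline("Lightricks/LTX-Video")
decoder.load_components(torch_dtype=torch.bfloat16)
videos = decoder(latents=latents, output="videos")
```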
|
Addressed everything @yiyixuxu
Create .ai/modular.md as a shared reference for modular pipeline conventions, patterns, and common mistakes — parallel to the existing models.md for model conventions. Consolidates content from the former modular-conversion.md skill file and adds gotchas identified from reviewing recent modular pipeline PRs (LTX #13378, SD3 #13324). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
What does this PR do?
Adds modular pipeline support for LTX Video, covering both text-to-video and image-to-video. The implementation follows the same structure as the existing Wan modular pipeline.
Text-to-video
Image-to-video
Verification
Parity tested against standard pipelines with identical parameters (H100, bfloat16, 297 frames, 30 steps, seed 42):
T2V - Standard vs Modular:
ltx_standard.mp4
ltx_modular.mp4
T2V reproduction code
I2V - Standard vs Modular:
ltx_i2v_standard.mp4
ltx_i2v_modular.mp4
I2V reproduction code
Files added
Files modified
- src/diffusers/__init__.py
- src/diffusers/modular_pipelines/__init__.py
- src/diffusers/modular_pipelines/modular_pipeline.py

Note: tiny test model at `akshan-main/tiny-ltx-modular-pipe` on hf, will have to be moved to `hf-internal-testing/` before merge if this is to be okayed.

Contribution to #13295
Before submitting
Who can review?
@sayakpaul @yiyixuxu @asomoza