
[modular] Add LTX Video modular pipeline#13378

Open
akshan-main wants to merge 12 commits intohuggingface:mainfrom
akshan-main:modular-ltx

Conversation

@akshan-main

@akshan-main akshan-main commented Apr 1, 2026

What does this PR do?

Adds modular pipeline support for LTX Video, covering both text-to-video and image-to-video. The implementation follows the same structure as the existing Wan modular pipeline.

Text-to-video

LTXBlocks (SequentialPipelineBlocks)
  text_encoder      LTXTextEncoderStep
  denoise           LTXCoreDenoiseStep
    input               LTXTextInputStep
    set_timesteps       LTXSetTimestepsStep
    prepare_latents     LTXPrepareLatentsStep
    denoise             LTXDenoiseStep (LoopSequentialPipelineBlocks)
      before_denoiser       LTXLoopBeforeDenoiser
      denoiser              LTXLoopDenoiser
      after_denoiser        LTXLoopAfterDenoiser
  decode            LTXVaeDecoderStep

Image-to-video

LTXImage2VideoBlocks (SequentialPipelineBlocks)
  text_encoder      LTXTextEncoderStep
  denoise           LTXImage2VideoCoreDenoiseStep
    input               LTXTextInputStep
    set_timesteps       LTXSetTimestepsStep
    prepare_latents     LTXImage2VideoPrepareLatentsStep
    denoise             LTXImage2VideoDenoiseStep (LoopSequentialPipelineBlocks)
      before_denoiser       LTXImage2VideoLoopBeforeDenoiser
      denoiser              LTXImage2VideoLoopDenoiser
      after_denoiser        LTXImage2VideoLoopAfterDenoiser
  decode            LTXVaeDecoderStep

Verification

Parity tested against standard pipelines with identical parameters (H100, bfloat16, 297 frames, 30 steps, seed 42):

|     | Standard shape        | Modular shape         | MAD      |
| --- | --------------------- | --------------------- | -------- |
| T2V | (1, 297, 512, 704, 3) | (1, 297, 512, 704, 3) | 0.021609 |
| I2V | (1, 297, 512, 704, 3) | (1, 297, 512, 704, 3) | 0.016330 |

T2V - Standard vs Modular:

ltx_standard.mp4
ltx_modular.mp4
T2V reproduction code
import torch
import numpy as np
from diffusers import LTXPipeline, LTXBlocks
from diffusers.utils import export_to_video

model_id = "Lightricks/LTX-Video-0.9.7-dev"
prompt = "A cat walking across a sunlit garden"
height, width, num_frames = 512, 704, 297
steps, cfg, seed = 30, 3.0, 42

# Standard pipeline
std_pipe = LTXPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")
gen = torch.Generator("cuda").manual_seed(seed)
std_result = std_pipe(
    prompt=prompt, height=height, width=width, num_frames=num_frames,
    num_inference_steps=steps, guidance_scale=cfg, generator=gen,
    output_type="np",
).frames
export_to_video(std_result[0], "ltx_standard.mp4", fps=25)

del std_pipe
torch.cuda.empty_cache()

# Modular pipeline
blocks = LTXBlocks()
mod_pipe = blocks.init_pipeline(model_id)
mod_pipe.load_components(torch_dtype=torch.bfloat16)
mod_pipe.to("cuda")
gen = torch.Generator("cuda").manual_seed(seed)
mod_result = mod_pipe(
    prompt=prompt, height=height, width=width, num_frames=num_frames,
    num_inference_steps=steps, guidance_scale=cfg, generator=gen,
    output="videos",
)
export_to_video(mod_result[0], "ltx_modular.mp4", fps=25)

diff = np.abs(np.array(std_result).astype(float) - np.array(mod_result).astype(float)).mean()
print(f"Mean absolute difference: {diff:.6f}")

I2V - Standard vs Modular:

ltx_i2v_standard.mp4
ltx_i2v_modular.mp4
I2V reproduction code
import torch
import numpy as np
from diffusers import LTXImageToVideoPipeline, LTXImage2VideoBlocks
from diffusers.utils import export_to_video, load_image

model_id = "Lightricks/LTX-Video-0.9.7-dev"
image = load_image("https://cdn.pixabay.com/photo/2014/11/30/14/11/cat-551554_640.jpg").resize((704, 512))
prompt = "A cat slowly turns its head"
height, width, num_frames = 512, 704, 297
steps, cfg, seed = 30, 3.0, 42

# Standard pipeline
std_pipe = LTXImageToVideoPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")
gen = torch.Generator("cuda").manual_seed(seed)
std_result = std_pipe(
    image=image, prompt=prompt, height=height, width=width, num_frames=num_frames,
    num_inference_steps=steps, guidance_scale=cfg, generator=gen, output_type="np",
).frames
export_to_video(std_result[0], "ltx_i2v_standard.mp4", fps=25)

del std_pipe
torch.cuda.empty_cache()

# Modular pipeline
blocks = LTXImage2VideoBlocks()
pipe = blocks.init_pipeline(model_id)
pipe.load_components(torch_dtype=torch.bfloat16)
pipe.to("cuda")
gen = torch.Generator("cuda").manual_seed(seed)
mod_result = pipe(
    image=image, prompt=prompt, height=height, width=width, num_frames=num_frames,
    num_inference_steps=steps, guidance_scale=cfg, generator=gen, output="videos",
)
export_to_video(mod_result[0], "ltx_i2v_modular.mp4", fps=25)

diff = np.abs(np.array(std_result).astype(float) - np.array(mod_result).astype(float)).mean()
print(f"Mean absolute difference: {diff:.6f}")

Files added

src/diffusers/modular_pipelines/ltx/
  __init__.py
  encoders.py              LTXTextEncoderStep
  before_denoise.py        LTXTextInputStep, LTXSetTimestepsStep, LTXPrepareLatentsStep, LTXImage2VideoPrepareLatentsStep
  denoise.py               T2V and I2V denoise loop blocks
  decoders.py              LTXVaeDecoderStep
  modular_blocks_ltx.py    LTXBlocks, LTXImage2VideoBlocks
  modular_pipeline.py      LTXModularPipeline, LTXImage2VideoModularPipeline

tests/modular_pipelines/ltx/
  test_modular_pipeline_ltx.py

Files modified

  • src/diffusers/__init__.py
  • src/diffusers/modular_pipelines/__init__.py
  • src/diffusers/modular_pipelines/modular_pipeline.py

Note: the tiny test model lives at akshan-main/tiny-ltx-modular-pipe on the Hub; it will have to be moved to hf-internal-testing/ before merge if this is approved.

Contribution to #13295

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Did you read our philosophy doc (important for complex PRs)?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. Modular Diffusers 🧨 #13295
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

@sayakpaul @yiyixuxu @asomoza

@akshan-main akshan-main marked this pull request as ready for review April 1, 2026 10:58
@yiyixuxu
Collaborator

yiyixuxu commented Apr 1, 2026

cc @asomoza
can you help check if our current LTX (0.9.7) is broken? the output does not seem right, especially the T2V one

@akshan-main
Author

akshan-main commented Apr 1, 2026

Reran with the official example params: Lightricks/LTX-Video instead of 0.9.7, 480x704, 161 frames, 50 steps, and a negative prompt. Updated videos:

T2V standard:

ltx_t2v_standard.mp4

T2V modular:

ltx_t2v_modular.mp4
T2V code
import torch
import numpy as np
from diffusers import LTXPipeline, LTXBlocks
from diffusers.utils import export_to_video

model_id = "Lightricks/LTX-Video"
prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The scene appears to be real-life footage"
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
height, width, num_frames = 480, 704, 161
steps, cfg, seed = 50, 3.0, 42

print("=== Standard T2V ===")
std_pipe = LTXPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")
gen = torch.Generator("cuda").manual_seed(seed)
std_result = std_pipe(
    prompt=prompt, negative_prompt=negative_prompt,
    height=height, width=width, num_frames=num_frames,
    num_inference_steps=steps, guidance_scale=cfg, generator=gen,
    output_type="np",
).frames
export_to_video(std_result[0], "/content/ltx_t2v_standard.mp4", fps=24)
print(f"Standard shape: {np.array(std_result).shape}")

del std_pipe
torch.cuda.empty_cache()

print("\n=== Modular T2V ===")
blocks = LTXBlocks()
mod_pipe = blocks.init_pipeline(model_id)
mod_pipe.load_components(torch_dtype=torch.bfloat16)
mod_pipe.to("cuda")
gen = torch.Generator("cuda").manual_seed(seed)
mod_result = mod_pipe(
    prompt=prompt, negative_prompt=negative_prompt,
    height=height, width=width, num_frames=num_frames,
    num_inference_steps=steps, guidance_scale=cfg, generator=gen,
    output="videos",
)
export_to_video(mod_result[0], "/content/ltx_t2v_modular.mp4", fps=24)
print(f"Modular shape: {np.array(mod_result).shape}")

diff = np.abs(np.array(std_result).astype(float) - np.array(mod_result).astype(float)).mean()
print(f"\nT2V MAD: {diff:.6f}")
print("T2V PARITY:", "PASS" if diff < 1.0 else "FAIL")

del mod_pipe, blocks
torch.cuda.empty_cache()

I2V standard:

ltx_i2v_standard.mp4

I2V modular:

ltx_i2v_modular.mp4
I2V code
from diffusers import LTXImageToVideoPipeline, LTXImage2VideoBlocks
from diffusers.utils import load_image

image = load_image("https://cdn.pixabay.com/photo/2014/11/30/14/11/cat-551554_640.jpg").resize((704, 480))
i2v_prompt = "A cat slowly turns its head and looks around"

print("=== Standard I2V ===")
std_pipe = LTXImageToVideoPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")
gen = torch.Generator("cuda").manual_seed(seed)
std_result = std_pipe(
    image=image, prompt=i2v_prompt, negative_prompt=negative_prompt,
    height=height, width=width, num_frames=num_frames,
    num_inference_steps=steps, guidance_scale=cfg, generator=gen,
    output_type="np",
).frames
export_to_video(std_result[0], "/content/ltx_i2v_standard.mp4", fps=24)
print(f"Standard shape: {np.array(std_result).shape}")

del std_pipe
torch.cuda.empty_cache()

print("\n=== Modular I2V ===")
blocks = LTXImage2VideoBlocks()
pipe = blocks.init_pipeline(model_id)
pipe.load_components(torch_dtype=torch.bfloat16)
pipe.to("cuda")
gen = torch.Generator("cuda").manual_seed(seed)
mod_result = pipe(
    image=image, prompt=i2v_prompt, negative_prompt=negative_prompt,
    height=height, width=width, num_frames=num_frames,
    num_inference_steps=steps, guidance_scale=cfg, generator=gen,
    output="videos",
)
export_to_video(mod_result[0], "/content/ltx_i2v_modular.mp4", fps=24)
print(f"Modular shape: {np.array(mod_result).shape}")

diff = np.abs(np.array(std_result).astype(float) - np.array(mod_result).astype(float)).mean()
print(f"\nI2V MAD: {diff:.6f}")
print("I2V PARITY:", "PASS" if diff < 1.0 else "FAIL")

print("\n=== Done ===")
print("Videos saved: ltx_t2v_standard.mp4, ltx_t2v_modular.mp4, ltx_i2v_standard.mp4, ltx_i2v_modular.mp4")

Also verified that without CFG (guidance_scale=1.0), MAD drops to 0.008. The small visual difference with CFG enabled comes from the guider running cond/uncond as separate batches vs the standard pipeline's single concatenated batch. This is the same behavior as the Wan modular pipeline.
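For reference, both paths apply the same classifier-free guidance combination; the only difference is whether the cond/uncond forward passes go through the transformer as one concatenated batch or two separate ones, which changes low-precision kernel accumulation slightly. A minimal numpy sketch of just the combination step (not the diffusers guider API):

```python
import numpy as np

def cfg_combine(noise_uncond, noise_cond, guidance_scale):
    # Standard CFG: push the prediction away from the unconditional
    # output along the (cond - uncond) direction.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

uncond = np.zeros(4)
cond = np.ones(4)
print(cfg_combine(uncond, cond, 3.0))  # [3. 3. 3. 3.]

# With guidance_scale=1.0 the combination reduces to the cond branch
# exactly, which is why the no-CFG runs show much tighter parity.
print(cfg_combine(uncond, cond, 1.0))  # [1. 1. 1. 1.]
```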

No CFG code
import torch
import numpy as np
from diffusers import LTXPipeline, LTXBlocks

model_id = "Lightricks/LTX-Video"
prompt = "A woman with long brown hair smiles"
height, width, num_frames = 480, 704, 41
steps, seed = 20, 42

# Standard - no CFG
std_pipe = LTXPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")
gen = torch.Generator("cuda").manual_seed(seed)
std_result = std_pipe(
    prompt=prompt, height=height, width=width, num_frames=num_frames,
    num_inference_steps=steps, guidance_scale=1.0, generator=gen,
    output_type="np",
).frames

del std_pipe; torch.cuda.empty_cache()

# Modular - no CFG
blocks = LTXBlocks()
pipe = blocks.init_pipeline(model_id)
pipe.load_components(torch_dtype=torch.bfloat16)
pipe.to("cuda")
gen = torch.Generator("cuda").manual_seed(seed)
mod_result = pipe(
    prompt=prompt, height=height, width=width, num_frames=num_frames,
    num_inference_steps=steps, guidance_scale=1.0, generator=gen,
    output="videos",
)

diff = np.abs(np.array(std_result).astype(float) - np.array(mod_result).astype(float)).mean()
print(f"No-CFG MAD: {diff:.6f}")

Collaborator

@yiyixuxu yiyixuxu left a comment


thanks, I left some feedback!

return False


class LTXImage2VideoModularPipeline(LTXModularPipeline):
Collaborator


IMO we don't need a modular pipeline for I2V
for wan, we have T2V and I2V pipelines because they are different checkpoints/models
for LTX, it's the same checkpoint supporting multiple workflows, no?

Author


Makes sense! Removed LTXImage2VideoModularPipeline. Both workflows use LTXModularPipeline now.

logger = logging.get_logger(__name__)


class LTXCoreDenoiseStep(SequentialPipelineBlocks):
Collaborator


can you add docstrings to assembled blocks using #auto_docstring? see this PR introduced it #12958
@claude are you there, can you help explain a bit more on #auto_docstring?

Author


done

@yiyixuxu
Collaborator

yiyixuxu commented Apr 1, 2026

@claude can you do a review here?

@github-actions
Contributor

github-actions bot commented Apr 1, 2026

Claude finished @yiyixuxu's task in 4m 1s


Code Review: LTX Video Modular Pipeline

  • Read PR diff and source files
  • Review correctness issues
  • Answer #auto_docstring question
  • Post final review

On #auto_docstring

@yiyixuxu, @akshan-main: #auto_docstring is a doc-generation marker introduced in PR #12958. When placed before an assembled block class (like LTXBlocks or LTXImage2VideoBlocks), the documentation tooling introspects that class's inputs, intermediate_outputs, outputs, and description properties and automatically generates a comprehensive API docstring. This means developers don't have to manually maintain the docstring as the pipeline evolves — it stays in sync with the declared parameter specs automatically.

Usage looks like:

# #auto_docstring
class LTXBlocks(SequentialPipelineBlocks):
    ...

Correctness Issues

src/diffusers/__init__.py

[src/diffusers/__init__.py:1214] LTXImage2VideoBlocks and LTXImage2VideoModularPipeline are missing from the TYPE_CHECKING block — they appear in the lazy _import_structure but not in the static import branch, so type-checkers (mypy/pyright) will not resolve them.

Also, the four LTX entries are inserted between HeliosPyramidDistilledModularPipeline and HeliosPyramidModularPipeline, splitting the Helios group. Since H < L alphabetically, all LTX entries should be placed after HeliosPyramidModularPipeline.


src/diffusers/modular_pipelines/ltx/before_denoise.py

[before_denoise.py:484] LTXImage2VideoPrepareLatentsStep.__call__ — when generator is a list, the loop iterates range(batch_size) where batch_size = block_state.batch_size * block_state.num_videos_per_prompt. But image only contains block_state.batch_size frames, so image[i] will raise IndexError when num_videos_per_prompt > 1. Fix: iterate over range(block_state.batch_size) and handle video repetition after encoding.
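The fix can be sketched in plain Python, with strings standing in for the real image and latent tensors — the bug is pure indexing:

```python
# Toy stand-ins: one image per prompt, several videos per prompt.
images = ["img0", "img1"]                # batch_size (prompts) = 2
num_videos_per_prompt = 3
generators = list(range(len(images) * num_videos_per_prompt))

# Buggy pattern: iterating the expanded batch indexes past the images.
# for i in range(len(generators)): encode(images[i])  # IndexError at i = 2

# Fixed pattern: one encode per prompt, then repeat per requested video.
encoded = [f"latents({img})" for img in images]
latents = [z for z in encoded for _ in range(num_videos_per_prompt)]
print(latents)  # 6 entries: three per prompt, in prompt order
```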


src/diffusers/modular_pipelines/ltx/encoders.py

[encoders.py:1330] LTXTextEncoderStep.__call__ — the early return when prompt_embeds is pre-provided skips generating negative_prompt_embeds, even if CFG is active. If a user passes prompt_embeds without negative_prompt_embeds, LTXLoopDenoiser will fail because it declares negative_prompt_embeds as required=True. At minimum, the guard should also check negative_prompt_embeds when requires_unconditional_embeds is true.
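The decision logic could be split per branch; a hypothetical helper (names invented for illustration, not the PR's code) showing why gating on prompt_embeds alone is wrong:

```python
def encoder_passes_needed(prompt_embeds, negative_prompt_embeds,
                          requires_unconditional_embeds):
    # Hypothetical helper: which encoder passes must still run when the
    # user pre-supplies some embeddings.
    need_cond = prompt_embeds is None
    need_uncond = requires_unconditional_embeds and negative_prompt_embeds is None
    return need_cond, need_uncond

# prompt_embeds supplied but CFG active: the uncond pass must still run,
# so an early return keyed on prompt_embeds alone is incorrect.
print(encoder_passes_needed("embeds", None, True))   # (False, True)
print(encoder_passes_needed("embeds", None, False))  # (False, False)
```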

[encoders.py:1300] The type mismatch check type(prompt) is not type(negative_prompt) is dead code: by this point prompt has already been converted to a list (line 1283) and negative_prompt has also been converted to a list (line 1298), so they will always have the same type.


src/diffusers/modular_pipelines/ltx/denoise.py

[denoise.py:739,965] LTXLoopDenoiser and LTXImage2VideoLoopDenoiser both use a mutable dict as a default argument for guider_input_fields. This is a Python anti-pattern — if the dict is mutated by any caller, it affects all future calls using the default. Use None as the default and set the dict inside the function body instead.
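A minimal demonstration of why the mutable default is dangerous, with the recommended None-default pattern (toy function names, not the PR's code):

```python
def bad_step(guider_input_fields={}):        # one dict shared by every call
    guider_input_fields.setdefault("prompt_embeds", "cond")
    return guider_input_fields

leaked = bad_step()
leaked["extra"] = "oops"
print("extra" in bad_step())  # True: the mutation persists across calls

def good_step(guider_input_fields=None):     # fresh dict per call
    if guider_input_fields is None:
        guider_input_fields = {"prompt_embeds": "cond"}
    return guider_input_fields

fresh = good_step()
fresh["extra"] = "oops"
print("extra" in good_step())  # False: each call starts clean
```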

[denoise.py:709,960] LTXLoopDenoiser and LTXImage2VideoLoopDenoiser are ~90% identical — the only runtime difference is that I2V uses block_state.timestep_adjusted instead of t.expand(...) for the timestep argument. Consider refactoring into a base class with a _get_timestep(block_state, t) hook to avoid duplicated maintenance surface.
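The suggested refactor could look roughly like this (hypothetical class and field names; the real blocks carry far more state and call the transformer):

```python
class LTXLoopDenoiserBase:
    """Shared denoise-loop body; subclasses override only the timestep."""

    def _get_timestep(self, block_state, t):
        # T2V default: broadcast the scalar step over the batch.
        return [t] * block_state["batch_size"]

    def denoise_step(self, block_state, t):
        timestep = self._get_timestep(block_state, t)
        # ... shared transformer call would go here ...
        return timestep


class LTXImage2VideoLoopDenoiserSketch(LTXLoopDenoiserBase):
    def _get_timestep(self, block_state, t):
        # I2V: use the per-token adjusted timesteps instead.
        return block_state["timestep_adjusted"]


state = {"batch_size": 2, "timestep_adjusted": [7, 0]}
print(LTXLoopDenoiserBase().denoise_step(state, 9))               # [9, 9]
print(LTXImage2VideoLoopDenoiserSketch().denoise_step(state, 9))  # [7, 0]
```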


src/diffusers/modular_pipelines/ltx/decoders.py

[decoders.py:637] `latents.to(block_state.dtype if hasattr(block_state, 'dtype') else torch.float32)`: dtype is set by LTXTextInputStep and should always be present by the time the decoder runs. The hasattr fallback silently hides a pipeline state management bug if dtype is somehow missing; better to let it fail clearly, or assert its presence.


src/diffusers/modular_pipelines/ltx/modular_blocks_ltx.py

[modular_blocks_ltx.py:32] Missing #auto_docstring on LTXBlocks and LTXImage2VideoBlocks as noted by @yiyixuxu.


Architecture Note

Echoing @yiyixuxu's comment: since LTX uses a single checkpoint for both T2V and I2V (unlike Wan where they're separate model families), LTXImage2VideoModularPipeline as a distinct pipeline class adds minimal value. The two workflows are differentiated entirely by the block set (LTXBlocks vs LTXImage2VideoBlocks) which is already the modular pattern's intended separation point. The ltx-i2v key in modular_pipeline.py's registry (and LTXImage2VideoModularPipeline class) could be dropped.

@akshan-main
Author

@yiyixuxu Would you like me to add the condition and latent upsampler workflows to this PR as well, or keep those for a follow-up? Also happy to do LTX2 modular as a next PR if that's useful.

condition is the same checkpoint, the upsampler is a different one.

@yiyixuxu
Collaborator

yiyixuxu commented Apr 2, 2026

@akshan-main let's merge this one first and follow up with conditional + upsample
LTX2.3 I'm working on it already:)

@akshan-main
Author

Sounds good! I'll work on the conditional + upsampler follow-up PR next, then.

Meanwhile, I also opened a PR for HunyuanVideo 1.5 modular #13389. Both have MAD 0.000000 parity since the standard HunyuanVideo 1.5 pipeline already uses the guider abstraction, so there's no batching difference like with LTX's manual CFG concat. Would love your feedback on it when you get a chance!

@akshan-main
Author

added docstrings to assembled blocks using #auto_docstring


Inputs:
num_videos_per_prompt (`None`, *optional*, defaults to 1):
TODO: Add description.

@yiyixuxu
Collaborator

yiyixuxu commented Apr 3, 2026

@claude are you here? can you look at the PR and explain what steps need to be done for this to work

modular_ltx = ModularPipeline.from_pretrained("Lightricks/LTX-Video")
modular_ltx.load_components(torch_dtype=torch.bfloat16)

@github-actions
Contributor

github-actions bot commented Apr 3, 2026

Claude finished @yiyixuxu's task in 2m 27s


What's needed for ModularPipeline.from_pretrained("Lightricks/LTX-Video")

  • Read PR code and modular pipeline registry
  • Understand current from_pretrained dispatch mechanism
  • Explain what's needed

Here's what ModularPipeline.from_pretrained(repo) actually does (from modular_pipeline.py:2062):

# Step 1: try to load blocks from the repo's auto_map
blocks = ModularPipelineBlocks.from_pretrained(repo, ...)  # needs auto_map in config.json
# Step 2: load config from modular_model_index.json
config_dict = cls.load_config(repo)                        # needs modular_model_index.json
pipeline_class = _get_pipeline_class(cls, config=config_dict)
# Step 3: construct
pipeline = pipeline_class(blocks=blocks, pretrained_model_name_or_path=repo)

Lightricks/LTX-Video is a standard model repo — it has a model_index.json but no modular_model_index.json and no auto_map for blocks. So both steps above fail gracefully and you end up with blocks=None and an empty ModularPipeline.

Three things need to be in place:

1. A hub repo with modular_model_index.json

This JSON tells the pipeline where to load each component from:

{
  "scheduler": ["diffusers", "FlowMatchEulerDiscreteScheduler", {"repo": "Lightricks/LTX-Video", "subfolder": "scheduler"}],
  "vae":        ["diffusers", "AutoencoderKLLTXVideo",          {"repo": "Lightricks/LTX-Video", "subfolder": "vae"}],
  "text_encoder": [...],
  "tokenizer":    [...],
  "transformer":  [...]
}

This can either live in the official Lightricks/LTX-Video repo, or in a separate modular pipeline repo (e.g. hf-internal-testing/ltx-modular-pipe) that points back to the weights. The test already uses the latter pattern (akshan-main/tiny-ltx-modular-pipe).

2. A config.json with auto_map pointing to LTXBlocks

{ "auto_map": { "ModularPipelineBlocks": "modular_blocks_ltx.LTXBlocks" } }

This lives in the same repo as the modular_model_index.json. Without it, blocks=None and no pipeline logic is wired up.

3. An LTXModularPipeline → LTXBlocks entry in MODULAR_PIPELINE_BLOCKS_MAPPING

In modular_pipeline.py:66, the current mapping is:

MODULAR_PIPELINE_BLOCKS_MAPPING = OrderedDict([
    ("StableDiffusionXLModularPipeline", "StableDiffusionXLAutoBlocks"),
])

This is used in ModularPipeline.__init__ (line 1873) as a fallback when blocks=None. Without ("LTXModularPipeline", "LTXBlocks") here, even LTXModularPipeline(pretrained_model_name_or_path=...) won't wire up the right blocks automatically. This entry is missing from the PR.


Summary — current PR state vs. what's needed:

| Requirement | Status |
| --- | --- |
| ("ltx", _create_default_map_fn("LTXModularPipeline")) in MODULAR_PIPELINE_MAPPING | ✅ Added by PR (modular_pipeline.py) |
| ("LTXModularPipeline", "LTXBlocks") in MODULAR_PIPELINE_BLOCKS_MAPPING | ❌ Missing |
| Hub repo with modular_model_index.json + auto_map | ⚠️ Only in akshan-main/tiny-ltx-modular-pipe (needs to move to hf-internal-testing/) |

So the usage would be:

# Option A — using a dedicated modular pipeline repo (current pattern, works today)
pipe = ModularPipeline.from_pretrained("hf-internal-testing/ltx-modular-pipe")
pipe.load_components(torch_dtype=torch.bfloat16)

# Option B — directly from the model repo (requires Lightricks/LTX-Video
#             to have modular_model_index.json + auto_map added)
pipe = ModularPipeline.from_pretrained("Lightricks/LTX-Video")
pipe.load_components(torch_dtype=torch.bfloat16)

The PR as written only supports the explicit LTXBlocks().init_pipeline(model_id) entry point. To support the cleaner ModularPipeline.from_pretrained("Lightricks/LTX-Video") pattern, you'd need MODULAR_PIPELINE_BLOCKS_MAPPING updated and either the official model repo or a sidecar modular-pipeline repo published with the right JSON files.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@yiyixuxu
Collaborator

yiyixuxu commented Apr 3, 2026

@claude

I think as long as MODULAR_PIPELINE_MAPPING is updated the pattern would work, no?

To support the cleaner ModularPipeline.from_pretrained("Lightricks/LTX-Video") pattern, you'd need MODULAR_PIPELINE_BLOCKS_MAPPING updated and either the official model repo or a sidecar modular-pipeline repo published with the right JSON files.

can you verify and if agree with me, maybe we should add an item to the skill file? https://github.com/huggingface/diffusers/blob/main/.ai/skills/model-integration/modular-conversion.md

@github-actions
Contributor

github-actions bot commented Apr 3, 2026

Claude finished @yiyixuxu's task in 2m 47s


  • Read modular_pipeline.py to trace the from_pretrained flow
  • Read the skill file
  • Verify the claim and propose skill file update

Verification: is MODULAR_PIPELINE_BLOCKS_MAPPING sufficient?

Yes, @yiyixuxu is correct — my previous comment was wrong to list auto_map as a requirement. Here's why:

ModularPipeline.from_pretrained(repo) has two block-wiring paths:

| Path | Mechanism | When used |
| --- | --- | --- |
| A | ModularPipelineBlocks.from_pretrained(repo) → reads auto_map from the repo's config.json | Only when the hub repo explicitly has auto_map |
| B | MODULAR_PIPELINE_BLOCKS_MAPPING.get(pipeline_class.__name__) inside __init__ | When blocks=None (Path A returned nothing) |

Path B is the MODULAR_PIPELINE_BLOCKS_MAPPING path. It activates when blocks=None, which is the normal case for a standard model repo like Lightricks/LTX-Video. So auto_map is not required.

The corrected requirements for ModularPipeline.from_pretrained("Lightricks/LTX-Video"):

| Requirement | Previous comment | Corrected |
| --- | --- | --- |
| ("LTXModularPipeline", "LTXBlocks") in MODULAR_PIPELINE_BLOCKS_MAPPING | ❌ Missing (noted) | ❌ Missing — needs to be added |
| Hub repo with modular_model_index.json (has _class_name: "LTXModularPipeline" + component specs) | ⚠️ Needed | ⚠️ Still needed — so _get_pipeline_class can resolve LTXModularPipeline and load_components knows where to fetch weights |
| auto_map in hub repo's config.json | ⚠️ Listed as needed | Not required — Path B handles this |

So the minimal change to make ModularPipeline.from_pretrained("Lightricks/LTX-Video") work:

  1. Add ("LTXModularPipeline", "LTXBlocks") to MODULAR_PIPELINE_BLOCKS_MAPPING in modular_pipeline.py (this PR is missing it)
  2. The hub repo (e.g. hf-internal-testing/ltx-modular-pipe) needs modular_model_index.json with correct _class_name and component loading specs — auto_map is optional

Proposed skill file addition

For .ai/skills/model-integration/modular-conversion.md, I'd suggest adding to the conversion checklist:

- [ ] Add `("ltx", "LTXModularPipeline")` to `MODULAR_PIPELINE_MAPPING` in `modular_pipeline.py`
      (enables `blocks.init_pipeline(model_id)` to create the right pipeline class)
- [ ] Add `("LTXModularPipeline", "LTXBlocks")` to `MODULAR_PIPELINE_BLOCKS_MAPPING` in `modular_pipeline.py`
      (enables `ModularPipeline.from_pretrained(repo)` without needing `auto_map` in the hub repo — 
       `auto_map` is only needed if you want `ModularPipelineBlocks.from_pretrained()` to work directly)
- [ ] Publish a sidecar modular repo (e.g. `hf-internal-testing/<model>-modular-pipe`) with 
      `modular_model_index.json` pointing to the model weights; `auto_map` is optional

Since the .ai/ directory doesn't exist on this branch (it lives on main only), and the skill file is outside src/diffusers/, I'll leave the actual edit to you or @akshan-main. But I'd be glad to implement the missing MODULAR_PIPELINE_BLOCKS_MAPPING entry in modular_pipeline.py if you want — just say the word.

@yiyixuxu
Collaborator

yiyixuxu commented Apr 3, 2026

@akshan-main thanks, i will do a review now
just a heads up, we are testing our Claude a little bit, so there will be some conversation back and forth with them, hope you don't mind :)

@akshan-main
Copy link
Copy Markdown
Author

MODULAR_PIPELINE_BLOCKS_MAPPING doesn't exist in the codebase

@yiyixuxu
Collaborator

yiyixuxu commented Apr 3, 2026

ohh right, MODULAR_PIPELINE_MAPPING (you already added it, so don't worry about it)

Collaborator

@yiyixuxu yiyixuxu left a comment


thanks!
I left some comments!

@claude, can you look through my comments and put together a summary on the proposed change in the skill file?

return [
InputParam.template("prompt"),
InputParam.template("negative_prompt"),
InputParam.template("prompt_embeds"),
Collaborator


we don't need to list prompt_embeds as an input. We use this pattern in our standard pipelines to let users skip encoding etc., but in modular it is not needed; you can just pop out the text encoder block and run it separately.
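The pattern the comment describes, with plain functions standing in for pipeline blocks (a toy sketch, not the modular diffusers API):

```python
from collections import OrderedDict

# Toy blocks: each consumes and extends a state dict, like pipeline steps.
def text_encoder(state):
    state["prompt_embeds"] = f"embeds({state.pop('prompt')})"
    return state

def denoise(state):
    state["latents"] = f"denoised({state['prompt_embeds']})"
    return state

blocks = OrderedDict(text_encoder=text_encoder, denoise=denoise)

# Pop the encoder block and run it on its own...
encoder = blocks.pop("text_encoder")
state = encoder({"prompt": "a cat"})

# ...then feed the precomputed embeds straight into the remaining blocks,
# so the pipeline never needs a prompt_embeds input parameter of its own.
for step in blocks.values():
    state = step(state)
print(state["latents"])  # denoised(embeds(a cat))
```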

raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(block_state.prompt)}")

@staticmethod
def _get_t5_prompt_embeds(
Collaborator


can we make this a regular function? so custom blocks can use it as well

block_state = self.get_block_state(state)

# Set guidance_scale on guider so CFG is configured correctly
guidance_scale = getattr(block_state, "guidance_scale", 3.0)
Collaborator


we don't need to accept guidance_scale in the modular pipeline. Users can configure the guider separately: https://huggingface.co/docs/diffusers/modular_diffusers/guiders#changing-guider-parameters

As we support more guider types, each will have its own set of parameters and we won't be able to forward all of them through the pipeline inputs.

@property
def intermediate_outputs(self) -> list[OutputParam]:
return [
OutputParam.template("latents"),
Collaborator


we cannot use the template here, because this is not the "denoise latent" as defined in the output param template

import torch

from ...models import LTXVideoTransformer3DModel
from ...pipelines.ltx.pipeline_ltx import LTXPipeline
Collaborator


let's not import the standard pipeline here
the modular and standard pipelines are meant to be parallel.

block_state.latents = randn_tensor(
shape, generator=block_state.generator, device=device, dtype=torch.float32
)
block_state.latents = LTXPipeline._pack_latents(
Collaborator


you can redefine it as a regular function here, or maybe use #Copied from

see an example using #Copied from: https://github.com/huggingface/diffusers/blob/main/src/diffusers/modular_pipelines/wan/before_denoise.py#L495
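If redefined locally, the packing helper is small. Below is a numpy paraphrase of the standard pipeline's _pack_latents as I read it (numpy stands in for torch, and the reshape/transpose order is my reconstruction, so treat it as a sketch rather than the canonical implementation):

```python
import numpy as np

def pack_latents(latents, patch_size=1, patch_size_t=1):
    # Fold each (patch_size_t, patch_size, patch_size) latent patch into
    # the feature dim, turning (B, C, F, H, W) into (B, seq_len, features).
    b, c, f, h, w = latents.shape
    x = latents.reshape(
        b, c, f // patch_size_t, patch_size_t,
        h // patch_size, patch_size, w // patch_size, patch_size,
    )
    # Bring the patch-grid dims forward and the per-patch dims into features.
    x = x.transpose(0, 2, 4, 6, 1, 3, 5, 7)
    seq_len = (f // patch_size_t) * (h // patch_size) * (w // patch_size)
    return x.reshape(b, seq_len, -1)

latents = np.zeros((1, 8, 2, 4, 4))  # (B, C, F, H, W)
print(pack_latents(latents).shape)   # (1, 32, 8): 2*4*4 tokens, C features
```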

if not isinstance(image, torch.Tensor):
from ...video_processor import VideoProcessor

processor = VideoProcessor(vae_scale_factor=components.vae_spatial_compression_ratio)
Collaborator


this should be a component, no?

else:
init_latents = [
retrieve_latents(
components.vae.encode(img.unsqueeze(0).unsqueeze(2).to(vae_dtype)), block_state.generator
Collaborator


we should extract the vae encoding into its own block in encoders.py (e.g. LTXVaeEncoderStep), and here this step should accept image_latents as input instead of raw image. This way users can run the VAE encoder standalone and pass pre-computed latents directly. See https://github.com/huggingface/diffusers/blob/main/src/diffusers/modular_pipelines/wan/encoders.py#L470


from ...configuration_utils import FrozenDict
from ...models import AutoencoderKLLTXVideo
from ...pipelines.ltx.pipeline_ltx import LTXPipeline
Collaborator


same here
let's either redefine or copy the pipeline methods you need


latents = block_state.latents

if block_state.output_type == "latent":
Collaborator


we don't need to accept the latent output_type in modular
similar to encode_prompt, we can pop out the decoder step from the pipeline if we don't need it decoded

@yiyixuxu yiyixuxu mentioned this pull request Apr 3, 2026
@akshan-main
Author

addressed everything @yiyixuxu

yiyixuxu pushed a commit that referenced this pull request Apr 4, 2026
Create .ai/modular.md as a shared reference for modular pipeline
conventions, patterns, and common mistakes — parallel to the existing
models.md for model conventions.

Consolidates content from the former modular-conversion.md skill file
and adds gotchas identified from reviewing recent modular pipeline PRs
(LTX #13378, SD3 #13324).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>