
lucy model/pipeline review #13632

@hlky

Description


Commit tested: 0f1abc4ae8b0eb2a3b40e82a310507281144c423

Review performed against the repository review rules.

Duplicate search status: searched GitHub issues and PRs in huggingface/diffusers for "lucy", "LucyEditPipeline", "pipeline_lucy_edit", "LucyPipelineOutput", "num_videos_per_prompt condition_latents", "ftfy basic_clean prompt_clean", and "Lucy tests". No duplicate issue or PR was found for the findings below. Related existing PRs: the original implementation PR #12340 and the typo fix PR #12705.

Issue 1: num_videos_per_prompt > 1 fails because condition latents are not expanded

Affected code:

latents, condition_latents = self.prepare_latents(
    video,
    batch_size * num_videos_per_prompt,
    num_channels_latents,
    height,
    width,
    torch.float32,
    device,
    generator,
    latents,
)

# Prepare condition latents
condition_latents = [
    retrieve_latents(self.vae.encode(vid.unsqueeze(0)), sample_mode="argmax") for vid in video
]
condition_latents = torch.cat(condition_latents, dim=0).to(dtype)
latents_mean = (
    torch.tensor(self.vae.config.latents_mean).view(1, self.vae.config.z_dim, 1, 1, 1).to(device, dtype)
)
latents_std = 1.0 / torch.tensor(self.vae.config.latents_std).view(1, self.vae.config.z_dim, 1, 1, 1).to(
    device, dtype
)
condition_latents = (condition_latents - latents_mean) * latents_std
# Check shapes
assert latents.shape == condition_latents.shape, (
    f"Latents shape {latents.shape} does not match expected shape {condition_latents.shape}. Please check the input."
)
return latents, condition_latents

Problem:
__call__ passes batch_size * num_videos_per_prompt to prepare_latents, so the random latents and prompt embeddings are expanded. The conditioning video latents, however, are encoded only once per input video and never repeated, so the subsequent assertion that they match the expanded latent batch fails.

Impact:
The public num_videos_per_prompt argument is broken for values greater than 1.

Reproduction:

from types import SimpleNamespace
import torch
from diffusers import LucyEditPipeline

class FakeVAE:
    config = SimpleNamespace(
        scale_factor_temporal=4,
        scale_factor_spatial=8,
        z_dim=16,
        latents_mean=[0.0] * 16,
        latents_std=[1.0] * 16,
    )
    def encode(self, x):
        b, c, f, h, w = x.shape
        latent_frames = (f - 1) // self.config.scale_factor_temporal + 1
        return SimpleNamespace(latents=torch.zeros(b, 16, latent_frames, h // 8, w // 8))

pipe = object.__new__(LucyEditPipeline)
pipe.vae = FakeVAE()
pipe.vae_scale_factor_temporal = 4
pipe.vae_scale_factor_spatial = 8

video = torch.zeros(1, 3, 17, 16, 16)
# Raises AssertionError: latents are expanded to batch size 2, condition latents stay at 1.
LucyEditPipeline.prepare_latents(
    pipe,
    video=video,
    batch_size=2,  # one prompt, num_videos_per_prompt=2
    num_channels_latents=16,
    height=16,
    width=16,
    dtype=torch.float32,
    device=torch.device("cpu"),
    generator=torch.Generator().manual_seed(0),
)

Relevant precedent:

if isinstance(generator, list):
    latent_condition = [
        retrieve_latents(self.vae.encode(video_condition), sample_mode="argmax") for _ in generator
    ]
    latent_condition = torch.cat(latent_condition)
else:
    latent_condition = retrieve_latents(self.vae.encode(video_condition), sample_mode="argmax")
    latent_condition = latent_condition.repeat(batch_size, 1, 1, 1, 1)
latent_condition = latent_condition.to(dtype)
latent_condition = (latent_condition - latents_mean) * latents_std

Suggested fix:

condition_latents = torch.cat(condition_latents, dim=0).to(device=device, dtype=dtype)

if batch_size > condition_latents.shape[0]:
    if batch_size % condition_latents.shape[0] != 0:
        raise ValueError(
            f"Cannot duplicate `video` batch size {condition_latents.shape[0]} to latent batch size {batch_size}."
        )
    condition_latents = condition_latents.repeat_interleave(batch_size // condition_latents.shape[0], dim=0)
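As a quick sanity check, the suggested expansion can be exercised on dummy tensors. The helper below is a standalone sketch (expand_condition_latents is a hypothetical name, not pipeline code) mirroring the repeat_interleave logic:

```python
import torch


def expand_condition_latents(condition_latents: torch.Tensor, batch_size: int) -> torch.Tensor:
    # Duplicate per-video condition latents so they match the expanded latent
    # batch (batch_size here is batch_size * num_videos_per_prompt).
    if batch_size > condition_latents.shape[0]:
        if batch_size % condition_latents.shape[0] != 0:
            raise ValueError(
                f"Cannot duplicate `video` batch size {condition_latents.shape[0]} "
                f"to latent batch size {batch_size}."
            )
        condition_latents = condition_latents.repeat_interleave(
            batch_size // condition_latents.shape[0], dim=0
        )
    return condition_latents


# One input video encoded once, expanded to num_videos_per_prompt=2.
cond = torch.zeros(1, 16, 5, 2, 2)
expanded = expand_condition_latents(cond, batch_size=2)
print(expanded.shape)  # torch.Size([2, 16, 5, 2, 2])
```

repeat_interleave keeps the copies of each video adjacent, which matches how diffusers pipelines typically expand prompt embeddings per prompt.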

Issue 2: Prompt cleaning crashes when optional ftfy is unavailable

Affected code:

def basic_clean(text):
    text = ftfy.fix_text(text)
    text = html.unescape(html.unescape(text))
    return text.strip()


def whitespace_clean(text):
    text = re.sub(r"\s+", " ", text)
    text = text.strip()
    return text


def prompt_clean(text):
    text = whitespace_clean(basic_clean(text))
    return text

Problem:
ftfy is optional, but basic_clean() calls ftfy.fix_text() unconditionally. If ftfy is not installed, Lucy prompt encoding raises NameError.

Impact:
A standard install without the optional text-cleaning dependency can import the pipeline but fails at runtime on normal prompt inputs.

Reproduction:

from diffusers.pipelines.lucy import pipeline_lucy_edit as lucy

# Simulate an environment where optional dependency ftfy is not installed.
if hasattr(lucy, "ftfy"):
    delattr(lucy, "ftfy")

lucy.prompt_clean("hello")

Relevant precedent:

def basic_clean(text):
    if is_ftfy_available():
        text = ftfy.fix_text(text)
    text = html.unescape(html.unescape(text))
    return text.strip()

Suggested fix:

def basic_clean(text):
    if is_ftfy_available():
        text = ftfy.fix_text(text)
    text = html.unescape(html.unescape(text))
    return text.strip()
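The guarded behavior can be sketched standalone; _ftfy_available below uses importlib.util.find_spec as a stand-in for diffusers' is_ftfy_available, so the cleaning chain degrades gracefully instead of raising NameError:

```python
import html
import importlib.util
import re


def _ftfy_available() -> bool:
    # Stand-in for diffusers.utils.is_ftfy_available().
    return importlib.util.find_spec("ftfy") is not None


def basic_clean(text: str) -> str:
    if _ftfy_available():
        import ftfy

        text = ftfy.fix_text(text)
    # HTML unescaping and stripping still run without ftfy.
    text = html.unescape(html.unescape(text))
    return text.strip()


def whitespace_clean(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()


def prompt_clean(text: str) -> str:
    return whitespace_clean(basic_clean(text))


print(prompt_clean("  hello\n&amp;amp; world  "))  # hello & world
```

With or without ftfy installed, entity unescaping and whitespace normalization produce the same result on plain ASCII input; ftfy only adds mojibake repair on top.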

Issue 3: num_frames is accepted but ignored for latent preparation

Affected code:

if num_frames % self.vae_scale_factor_temporal != 1:
    logger.warning(
        f"`num_frames - 1` has to be divisible by {self.vae_scale_factor_temporal}. Rounding to the nearest number."
    )
    num_frames = num_frames // self.vae_scale_factor_temporal * self.vae_scale_factor_temporal + 1
num_frames = max(num_frames, 1)

video = self.video_processor.preprocess_video(video, height=height, width=width).to(
    device, dtype=torch.float32
)
latents, condition_latents = self.prepare_latents(
    video,
    batch_size * num_videos_per_prompt,
    num_channels_latents,
    height,
    width,
    torch.float32,
    device,
    generator,
    latents,
)

num_latent_frames = (
    (video.size(2) - 1) // self.vae_scale_factor_temporal + 1 if latents is None else latents.size(1)
)

Problem:
__call__ validates and rounds num_frames, but prepare_latents() derives the latent frame count from video.size(2). The requested num_frames does not control generation length and is not validated against the conditioning video length.

Impact:
Users can pass num_frames expecting it to control the output, but the output length follows the input video instead. This is especially confusing because the docstring says num_frames is “The number of frames in the generated video.”

Reproduction:

from types import SimpleNamespace
import torch
from diffusers import LucyEditPipeline

class FakeVAE:
    config = SimpleNamespace(
        scale_factor_temporal=4,
        scale_factor_spatial=8,
        z_dim=16,
        latents_mean=[0.0] * 16,
        latents_std=[1.0] * 16,
    )
    def encode(self, x):
        b, c, f, h, w = x.shape
        latent_frames = (f - 1) // self.config.scale_factor_temporal + 1
        return SimpleNamespace(latents=torch.zeros(b, 16, latent_frames, h // 8, w // 8))

pipe = object.__new__(LucyEditPipeline)
pipe.vae = FakeVAE()
pipe.vae_scale_factor_temporal = 4
pipe.vae_scale_factor_spatial = 8

for conditioning_frames in (9, 17):
    video = torch.zeros(1, 3, conditioning_frames, 16, 16)
    latents, _ = LucyEditPipeline.prepare_latents(
        pipe,
        video=video,
        batch_size=1,
        num_channels_latents=16,
        height=16,
        width=16,
        dtype=torch.float32,
        device=torch.device("cpu"),
        generator=torch.Generator().manual_seed(0),
    )
    print(conditioning_frames, latents.shape[2])

Relevant precedent:

def prepare_latents(
    self,
    image: PipelineImageInput,
    batch_size: int,
    num_channels_latents: int = 16,
    height: int = 480,
    width: int = 832,
    num_frames: int = 81,
    dtype: torch.dtype | None = None,
    device: torch.device | None = None,
    generator: torch.Generator | list[torch.Generator] | None = None,
    latents: torch.Tensor | None = None,
    last_image: torch.Tensor | None = None,
) -> tuple[torch.Tensor, torch.Tensor]:
    num_latent_frames = (num_frames - 1) // self.vae_scale_factor_temporal + 1
    latent_height = height // self.vae_scale_factor_spatial
    latent_width = width // self.vae_scale_factor_spatial
    shape = (batch_size, num_channels_latents, num_latent_frames, latent_height, latent_width)

Suggested fix:
Either remove or reword num_frames for Lucy and validate that the conditioning video length matches the intended generation length, or pass num_frames into prepare_latents() and crop/validate the conditioning video before encoding.
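If the second option is chosen, the validate-and-crop step could look like the sketch below (crop_video_to_num_frames is a hypothetical helper, shown on dummy tensors; num_frames is assumed to be pre-rounded as in __call__):

```python
import torch


def crop_video_to_num_frames(video: torch.Tensor, num_frames: int) -> torch.Tensor:
    # video: (batch, channels, frames, height, width); num_frames already rounded
    # so that (num_frames - 1) is divisible by the temporal scale factor.
    if video.size(2) < num_frames:
        raise ValueError(
            f"`video` has {video.size(2)} frames but `num_frames={num_frames}` was requested."
        )
    # Crop so the latent frame count follows the requested num_frames instead of
    # silently following the full conditioning video length.
    return video[:, :, :num_frames]


video = torch.zeros(1, 3, 17, 16, 16)
print(crop_video_to_num_frames(video, 9).shape)  # torch.Size([1, 3, 9, 16, 16])
```

This keeps the documented contract ("the number of frames in the generated video") honest: a request longer than the conditioning video fails loudly rather than being ignored.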

Issue 4: No Lucy fast or slow tests exist

Affected code:

class LucyEditPipeline(DiffusionPipeline, WanLoraLoaderMixin):
    r"""
    Pipeline for video-to-video generation using Lucy Edit.

    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
    implemented for all pipelines (downloading, saving, running on a particular device, etc.).

    Args:
        tokenizer ([`T5Tokenizer`]):
            Tokenizer from [T5](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5Tokenizer),
            specifically the [google/umt5-xxl](https://huggingface.co/google/umt5-xxl) variant.
        text_encoder ([`T5EncoderModel`]):
            [T5](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5EncoderModel), specifically
            the [google/umt5-xxl](https://huggingface.co/google/umt5-xxl) variant.
        transformer ([`WanTransformer3DModel`]):
            Conditional Transformer to denoise the input latents.
        scheduler ([`UniPCMultistepScheduler`]):
            A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
        vae ([`AutoencoderKLWan`]):
            Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
        transformer_2 ([`WanTransformer3DModel`], *optional*):
            Conditional Transformer to denoise the input latents during the low-noise stage. If provided, enables
            two-stage denoising where `transformer` handles high-noise stages and `transformer_2` handles low-noise
            stages. If not provided, only `transformer` is used.
        boundary_ratio (`float`, *optional*, defaults to `None`):
            Ratio of total timesteps to use as the boundary for switching between transformers in two-stage denoising.
            The actual boundary timestep is calculated as `boundary_ratio * num_train_timesteps`. When provided,
            `transformer` handles timesteps >= boundary_timestep and `transformer_2` handles timesteps <
            boundary_timestep. If `None`, only `transformer` is used for the entire denoising process.
    """

    model_cpu_offload_seq = "text_encoder->transformer->transformer_2->vae"
    _callback_tensor_inputs = ["latents", "prompt_embeds", "negative_prompt_embeds"]
    _optional_components = ["transformer", "transformer_2"]

Problem:
There is no tests/pipelines/lucy/ coverage and no test file matching *lucy*. This leaves imports, save/load behavior, callbacks, dtype handling, batching, num_videos_per_prompt, and a slow smoke test untested.

Impact:
The two runtime bugs above are not covered by CI, and regressions in the newly added pipeline family can ship unnoticed. Slow tests are also missing.

Reproduction:

from pathlib import Path

lucy_tests = sorted(Path("tests").rglob("*lucy*"))
print(lucy_tests)
assert lucy_tests, "No Lucy fast or slow tests found"

Relevant precedent:

class WanVideoToVideoPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
    pipeline_class = WanVideoToVideoPipeline
    params = TEXT_TO_IMAGE_PARAMS - {"cross_attention_kwargs"}
    batch_params = frozenset(["video", "prompt", "negative_prompt"])
    image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS
    required_optional_params = frozenset(
        [
            "num_inference_steps",
            "generator",
            "latents",
            "return_dict",
            "callback_on_step_end",
            "callback_on_step_end_tensor_inputs",
        ]
    )
    test_xformers_attention = False


@slow
@require_torch_accelerator
class WanPipelineIntegrationTests(unittest.TestCase):
    prompt = "A painting of a squirrel eating a burger."

    def setUp(self):
        super().setUp()
        gc.collect()
        backend_empty_cache(torch_device)

    def tearDown(self):
        super().tearDown()
        gc.collect()
        backend_empty_cache(torch_device)

    @unittest.skip("TODO: test needs to be implemented")
    def test_Wanx(self):

Suggested fix:
Add tests/pipelines/lucy/test_lucy_edit.py with tiny Wan components using WanTransformer3DModel(in_channels=32, out_channels=16), a fast inference test, save/load coverage, callback coverage, num_videos_per_prompt=2, and a slow test for decart-ai/Lucy-Edit-Dev.
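A stdlib-only skeleton of the proposed fast-test file is sketched below; the real version would subclass PipelineTesterMixin and build tiny Wan components, so the test bodies here are placeholders only:

```python
import io
import unittest


class LucyEditPipelineFastTests(unittest.TestCase):
    # Skeleton mirroring WanVideoToVideoPipelineFastTests; the real class would
    # also subclass PipelineTesterMixin and set pipeline_class/params/batch_params.

    @unittest.skip("TODO: tiny components via WanTransformer3DModel(in_channels=32, out_channels=16)")
    def test_inference(self):
        pass

    @unittest.skip("TODO: cover num_videos_per_prompt=2 once Issue 1 is fixed")
    def test_num_videos_per_prompt(self):
        pass


suite = unittest.defaultTestLoader.loadTestsFromTestCase(LucyEditPipelineFastTests)
result = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)
print(result.testsRun, len(result.skipped))  # 2 2
```

The slow-test counterpart would follow the WanPipelineIntegrationTests precedent (slow and require_torch_accelerator decorators, cache cleanup in setUp/tearDown) against decart-ai/Lucy-Edit-Dev.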
