Lucy model/pipeline review
Commit tested: 0f1abc4ae8b0eb2a3b40e82a310507281144c423
Review performed against the repository review rules.
Duplicate search status: searched GitHub issues and PRs in huggingface/diffusers for lucy, LucyEditPipeline, pipeline_lucy_edit, LucyPipelineOutput, num_videos_per_prompt condition_latents, ftfy basic_clean prompt_clean, and Lucy tests. No duplicate issue/PR found for the findings below. Existing related PRs found: original implementation PR #12340 and typo PR #12705.
Issue 1: num_videos_per_prompt > 1 fails because condition latents are not expanded
Affected code:
src/diffusers/pipelines/lucy/pipeline_lucy_edit.py, lines 619-629 (in __call__):

latents, condition_latents = self.prepare_latents(
    video,
    batch_size * num_videos_per_prompt,
    num_channels_latents,
    height,
    width,
    torch.float32,
    device,
    generator,
    latents,
)

src/diffusers/pipelines/lucy/pipeline_lucy_edit.py, lines 403-424 (in prepare_latents):

# Prepare condition latents
condition_latents = [
    retrieve_latents(self.vae.encode(vid.unsqueeze(0)), sample_mode="argmax") for vid in video
]
condition_latents = torch.cat(condition_latents, dim=0).to(dtype)

latents_mean = (
    torch.tensor(self.vae.config.latents_mean).view(1, self.vae.config.z_dim, 1, 1, 1).to(device, dtype)
)
latents_std = 1.0 / torch.tensor(self.vae.config.latents_std).view(1, self.vae.config.z_dim, 1, 1, 1).to(
    device, dtype
)

condition_latents = (condition_latents - latents_mean) * latents_std

# Check shapes
assert latents.shape == condition_latents.shape, (
    f"Latents shape {latents.shape} does not match expected shape {condition_latents.shape}. Please check the input."
)

return latents, condition_latents
Problem:
__call__ passes batch_size * num_videos_per_prompt to prepare_latents, so random latents and prompt embeddings are expanded. The conditioning video latents are encoded only once per input video and are never repeated, then an assertion requires them to match the expanded latent batch.
Impact:
The public num_videos_per_prompt argument is broken for values greater than 1.
Reproduction:
from types import SimpleNamespace

import torch

from diffusers import LucyEditPipeline


class FakeVAE:
    config = SimpleNamespace(
        scale_factor_temporal=4,
        scale_factor_spatial=8,
        z_dim=16,
        latents_mean=[0.0] * 16,
        latents_std=[1.0] * 16,
    )

    def encode(self, x):
        b, c, f, h, w = x.shape
        latent_frames = (f - 1) // self.config.scale_factor_temporal + 1
        return SimpleNamespace(latents=torch.zeros(b, 16, latent_frames, h // 8, w // 8))


pipe = object.__new__(LucyEditPipeline)
pipe.vae = FakeVAE()
pipe.vae_scale_factor_temporal = 4
pipe.vae_scale_factor_spatial = 8

video = torch.zeros(1, 3, 17, 16, 16)
LucyEditPipeline.prepare_latents(
    pipe,
    video=video,
    batch_size=2,  # one prompt, num_videos_per_prompt=2
    num_channels_latents=16,
    height=16,
    width=16,
    dtype=torch.float32,
    device=torch.device("cpu"),
    generator=torch.Generator().manual_seed(0),
)
Relevant precedent:
src/diffusers/pipelines/wan/pipeline_wan_i2v.py, lines 449-459:

if isinstance(generator, list):
    latent_condition = [
        retrieve_latents(self.vae.encode(video_condition), sample_mode="argmax") for _ in generator
    ]
    latent_condition = torch.cat(latent_condition)
else:
    latent_condition = retrieve_latents(self.vae.encode(video_condition), sample_mode="argmax")
    latent_condition = latent_condition.repeat(batch_size, 1, 1, 1, 1)

latent_condition = latent_condition.to(dtype)
latent_condition = (latent_condition - latents_mean) * latents_std
Suggested fix:
condition_latents = torch.cat(condition_latents, dim=0).to(device=device, dtype=dtype)
if batch_size > condition_latents.shape[0]:
    if batch_size % condition_latents.shape[0] != 0:
        raise ValueError(
            f"Cannot duplicate `video` batch size {condition_latents.shape[0]} to latent batch size {batch_size}."
        )
    condition_latents = condition_latents.repeat_interleave(batch_size // condition_latents.shape[0], dim=0)
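This keeps the copies of each conditioning video adjacent in the batch, which lines up with how the Wan-style encode_prompt expands prompt embeddings to batch_size * num_videos_per_prompt, so the existing shape assertion should then pass. A tiny illustration of the ordering (standalone, not pipeline code):

import torch

cond = torch.tensor([[0], [1]])  # stand-ins for two conditioning videos
print(cond.repeat_interleave(2, dim=0))  # tensor([[0], [0], [1], [1]]): copies of each video stay adjacent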
Issue 2: Prompt cleaning crashes when optional ftfy is unavailable
Affected code:
src/diffusers/pipelines/lucy/pipeline_lucy_edit.py, lines 103-117:

def basic_clean(text):
    text = ftfy.fix_text(text)
    text = html.unescape(html.unescape(text))
    return text.strip()


def whitespace_clean(text):
    text = re.sub(r"\s+", " ", text)
    text = text.strip()
    return text


def prompt_clean(text):
    text = whitespace_clean(basic_clean(text))
    return text
Problem:
ftfy is optional, but basic_clean() calls ftfy.fix_text() unconditionally. If ftfy is not installed, Lucy prompt encoding raises NameError.
Impact:
A standard install without the optional text-cleaning dependency can import the pipeline but fails at runtime on normal prompt inputs.
Reproduction:
from diffusers.pipelines.lucy import pipeline_lucy_edit as lucy
# Simulate an environment where optional dependency ftfy is not installed.
if hasattr(lucy, "ftfy"):
    delattr(lucy, "ftfy")

lucy.prompt_clean("hello")
Relevant precedent:
src/diffusers/pipelines/wan/pipeline_wan.py, lines 78-82:

def basic_clean(text):
    if is_ftfy_available():
        text = ftfy.fix_text(text)
    text = html.unescape(html.unescape(text))
    return text.strip()
Suggested fix:
def basic_clean(text):
    if is_ftfy_available():
        text = ftfy.fix_text(text)
    text = html.unescape(html.unescape(text))
    return text.strip()
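For the guard to work, the module must also import the availability helper; a one-line sketch, assuming the same relative import the Wan pipelines use:

from ...utils import is_ftfy_available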
Issue 3: num_frames is accepted but ignored for latent preparation
Affected code:
src/diffusers/pipelines/lucy/pipeline_lucy_edit.py, lines 563-568 (in __call__):

if num_frames % self.vae_scale_factor_temporal != 1:
    logger.warning(
        f"`num_frames - 1` has to be divisible by {self.vae_scale_factor_temporal}. Rounding to the nearest number."
    )
    num_frames = num_frames // self.vae_scale_factor_temporal * self.vae_scale_factor_temporal + 1
num_frames = max(num_frames, 1)

src/diffusers/pipelines/lucy/pipeline_lucy_edit.py, lines 616-629 (in __call__):

video = self.video_processor.preprocess_video(video, height=height, width=width).to(
    device, dtype=torch.float32
)
latents, condition_latents = self.prepare_latents(
    video,
    batch_size * num_videos_per_prompt,
    num_channels_latents,
    height,
    width,
    torch.float32,
    device,
    generator,
    latents,
)

src/diffusers/pipelines/lucy/pipeline_lucy_edit.py, lines 387-389 (in prepare_latents):

num_latent_frames = (
    (video.size(2) - 1) // self.vae_scale_factor_temporal + 1 if latents is None else latents.size(1)
)
Problem:
__call__ validates and rounds num_frames, but prepare_latents() derives the latent frame count from video.size(2). The requested num_frames does not control generation length and is not validated against the conditioning video length.
Impact:
Users can pass num_frames expecting it to control the output, but the output length follows the input video instead. This is especially confusing because the docstring says num_frames is “The number of frames in the generated video.”
Reproduction:
from types import SimpleNamespace

import torch

from diffusers import LucyEditPipeline


class FakeVAE:
    config = SimpleNamespace(
        scale_factor_temporal=4,
        scale_factor_spatial=8,
        z_dim=16,
        latents_mean=[0.0] * 16,
        latents_std=[1.0] * 16,
    )

    def encode(self, x):
        b, c, f, h, w = x.shape
        latent_frames = (f - 1) // self.config.scale_factor_temporal + 1
        return SimpleNamespace(latents=torch.zeros(b, 16, latent_frames, h // 8, w // 8))


pipe = object.__new__(LucyEditPipeline)
pipe.vae = FakeVAE()
pipe.vae_scale_factor_temporal = 4
pipe.vae_scale_factor_spatial = 8

for conditioning_frames in (9, 17):
    video = torch.zeros(1, 3, conditioning_frames, 16, 16)
    latents, _ = LucyEditPipeline.prepare_latents(
        pipe,
        video=video,
        batch_size=1,
        num_channels_latents=16,
        height=16,
        width=16,
        dtype=torch.float32,
        device=torch.device("cpu"),
        generator=torch.Generator().manual_seed(0),
    )
    print(conditioning_frames, latents.shape[2])
Relevant precedent:
src/diffusers/pipelines/wan/pipeline_wan_i2v.py, lines 393-411:

def prepare_latents(
    self,
    image: PipelineImageInput,
    batch_size: int,
    num_channels_latents: int = 16,
    height: int = 480,
    width: int = 832,
    num_frames: int = 81,
    dtype: torch.dtype | None = None,
    device: torch.device | None = None,
    generator: torch.Generator | list[torch.Generator] | None = None,
    latents: torch.Tensor | None = None,
    last_image: torch.Tensor | None = None,
) -> tuple[torch.Tensor, torch.Tensor]:
    num_latent_frames = (num_frames - 1) // self.vae_scale_factor_temporal + 1
    latent_height = height // self.vae_scale_factor_spatial
    latent_width = width // self.vae_scale_factor_spatial

    shape = (batch_size, num_channels_latents, num_latent_frames, latent_height, latent_width)
Suggested fix:
Either remove/reword num_frames for Lucy and validate that the conditioning video length is the generation length, or pass num_frames into prepare_latents() and crop/validate the conditioning video before encoding.
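A minimal sketch of the second option as a standalone helper (the name align_condition_video is hypothetical and not part of the pipeline; it assumes the (batch, channels, frames, height, width) video layout and reuses the rounding rule shown above):

import torch


def align_condition_video(video: torch.Tensor, num_frames: int, vae_scale_factor_temporal: int) -> torch.Tensor:
    # Apply the same rounding __call__ already applies to num_frames.
    if num_frames % vae_scale_factor_temporal != 1:
        num_frames = num_frames // vae_scale_factor_temporal * vae_scale_factor_temporal + 1
    num_frames = max(num_frames, 1)
    if video.size(2) < num_frames:
        raise ValueError(
            f"`video` has {video.size(2)} frames, fewer than the requested `num_frames={num_frames}`."
        )
    # Crop extra conditioning frames so the encoded condition latents line up with the
    # (num_frames - 1) // vae_scale_factor_temporal + 1 latent frames that will be generated.
    return video[:, :, :num_frames]

prepare_latents() could then derive num_latent_frames from num_frames instead of video.size(2).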
Issue 4: No Lucy fast or slow tests exist
Affected code:
src/diffusers/pipelines/lucy/pipeline_lucy_edit.py, lines 134-168:

class LucyEditPipeline(DiffusionPipeline, WanLoraLoaderMixin):
    r"""
    Pipeline for video-to-video generation using Lucy Edit.

    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
    implemented for all pipelines (downloading, saving, running on a particular device, etc.).

    Args:
        tokenizer ([`T5Tokenizer`]):
            Tokenizer from [T5](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5Tokenizer),
            specifically the [google/umt5-xxl](https://huggingface.co/google/umt5-xxl) variant.
        text_encoder ([`T5EncoderModel`]):
            [T5](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5EncoderModel), specifically
            the [google/umt5-xxl](https://huggingface.co/google/umt5-xxl) variant.
        transformer ([`WanTransformer3DModel`]):
            Conditional Transformer to denoise the input latents.
        scheduler ([`UniPCMultistepScheduler`]):
            A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
        vae ([`AutoencoderKLWan`]):
            Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
        transformer_2 ([`WanTransformer3DModel`], *optional*):
            Conditional Transformer to denoise the input latents during the low-noise stage. If provided, enables
            two-stage denoising where `transformer` handles high-noise stages and `transformer_2` handles low-noise
            stages. If not provided, only `transformer` is used.
        boundary_ratio (`float`, *optional*, defaults to `None`):
            Ratio of total timesteps to use as the boundary for switching between transformers in two-stage denoising.
            The actual boundary timestep is calculated as `boundary_ratio * num_train_timesteps`. When provided,
            `transformer` handles timesteps >= boundary_timestep and `transformer_2` handles timesteps <
            boundary_timestep. If `None`, only `transformer` is used for the entire denoising process.
    """

    model_cpu_offload_seq = "text_encoder->transformer->transformer_2->vae"
    _callback_tensor_inputs = ["latents", "prompt_embeds", "negative_prompt_embeds"]
    _optional_components = ["transformer", "transformer_2"]
Problem:
There is no tests/pipelines/lucy/ directory and no test file matching *lucy*. This leaves imports, save/load behavior, callbacks, dtype handling, batching, and num_videos_per_prompt untested, and there is no slow smoke test.
Impact:
The two runtime bugs above are not covered by CI, and regressions in the newly added pipeline family can ship unnoticed. Slow tests are also missing.
Reproduction:
from pathlib import Path
lucy_tests = sorted(Path("tests").rglob("*lucy*"))
print(lucy_tests)
assert lucy_tests, "No Lucy fast or slow tests found"
Relevant precedent:
tests/pipelines/wan/test_wan_video_to_video.py, lines 35-50:

class WanVideoToVideoPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
    pipeline_class = WanVideoToVideoPipeline
    params = TEXT_TO_IMAGE_PARAMS - {"cross_attention_kwargs"}
    batch_params = frozenset(["video", "prompt", "negative_prompt"])
    image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS
    required_optional_params = frozenset(
        [
            "num_inference_steps",
            "generator",
            "latents",
            "return_dict",
            "callback_on_step_end",
            "callback_on_step_end_tensor_inputs",
        ]
    )
    test_xformers_attention = False

tests/pipelines/wan/test_wan.py, lines 185-201:

@slow
@require_torch_accelerator
class WanPipelineIntegrationTests(unittest.TestCase):
    prompt = "A painting of a squirrel eating a burger."

    def setUp(self):
        super().setUp()
        gc.collect()
        backend_empty_cache(torch_device)

    def tearDown(self):
        super().tearDown()
        gc.collect()
        backend_empty_cache(torch_device)

    @unittest.skip("TODO: test needs to be implemented")
    def test_Wanx(self):
Suggested fix:
Add tests/pipelines/lucy/test_lucy_edit.py with tiny Wan components using WanTransformer3DModel(in_channels=32, out_channels=16), a fast inference test, save/load coverage, callback coverage, num_videos_per_prompt=2, and a slow test for decart-ai/Lucy-Edit-Dev.
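A possible skeleton for the fast test class, modeled on the Wan video-to-video tests quoted above (import paths assume the same relative layout as tests/pipelines/wan/test_wan_video_to_video.py; the dummy-component and dummy-input bodies are placeholders to be filled in with tiny Lucy components):

import unittest

from diffusers import LucyEditPipeline

from ..pipeline_params import TEXT_TO_IMAGE_IMAGE_PARAMS, TEXT_TO_IMAGE_PARAMS
from ..test_pipelines_common import PipelineTesterMixin


class LucyEditPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
    pipeline_class = LucyEditPipeline
    params = TEXT_TO_IMAGE_PARAMS - {"cross_attention_kwargs"}
    batch_params = frozenset(["video", "prompt", "negative_prompt"])
    image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS
    required_optional_params = frozenset(
        [
            "num_inference_steps",
            "generator",
            "latents",
            "return_dict",
            "callback_on_step_end",
            "callback_on_step_end_tensor_inputs",
        ]
    )
    test_xformers_attention = False

    def get_dummy_components(self):
        # Placeholder: build tiny Wan components here, e.g. WanTransformer3DModel(in_channels=32, out_channels=16),
        # AutoencoderKLWan, UniPCMultistepScheduler, and a tiny UMT5 text encoder/tokenizer.
        raise NotImplementedError

    def get_dummy_inputs(self, device, seed=0):
        # Placeholder: return a short dummy video plus prompt/generator kwargs; include a case with
        # num_videos_per_prompt=2 to cover Issue 1.
        raise NotImplementedError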