
latent_consistency_models model/pipeline review #13637

@hlky

Description

Commit tested: 0f1abc4ae8b0eb2a3b40e82a310507281144c423

Review performed against the repository review rules.

Issue 1: Img2Img does not serialize requires_safety_checker

Affected code:

self.register_modules(
    vae=vae,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    unet=unet,
    scheduler=scheduler,
    safety_checker=safety_checker,
    feature_extractor=feature_extractor,
    image_encoder=image_encoder,
)
if safety_checker is None and requires_safety_checker:
    logger.warning(
        f"You have disabled the safety checker for {self.__class__} by passing `safety_checker=None`. Ensure"
        " that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered"
        " results in services or applications open to the public. Both the diffusers team and Hugging Face"
        " strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling"
        " it only for use-cases that involve analyzing network behavior or auditing its results. For more"
        " information, please have a look at https://github.com/huggingface/diffusers/pull/254 ."
    )
self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1) if getattr(self, "vae", None) else 8
self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor)

Problem:
LatentConsistencyModelImg2ImgPipeline.__init__ accepts requires_safety_checker, but never calls self.register_to_config(...). The text2img LCM pipeline does register it, so img2img saved configs lose this user choice.

Impact:
Pipelines constructed with requires_safety_checker=False do not persist that setting through config/save/load paths, causing inconsistent serialization behavior between the two LCM pipelines.
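For context on the mechanism, here is a toy illustration (not diffusers code; the class names are invented): only values routed through `register_to_config` land in the serialized config, so a flag the constructor accepts but never registers silently disappears.

```python
# Toy sketch of the config-registration mechanism (illustrative only).
class ToyConfigMixin:
    """Minimal stand-in for a ConfigMixin-style base class."""

    def __init__(self):
        self._config = {}

    def register_to_config(self, **kwargs):
        # Only explicitly registered values become part of the config.
        self._config.update(kwargs)

    @property
    def config(self):
        return dict(self._config)


class RegisteredPipeline(ToyConfigMixin):
    def __init__(self, requires_safety_checker=True):
        super().__init__()
        self.register_to_config(requires_safety_checker=requires_safety_checker)


class UnregisteredPipeline(ToyConfigMixin):
    def __init__(self, requires_safety_checker=True):
        super().__init__()  # flag accepted but never registered


print(RegisteredPipeline(requires_safety_checker=False).config.get("requires_safety_checker"))    # → False
print(UnregisteredPipeline(requires_safety_checker=False).config.get("requires_safety_checker"))  # → None
```

The `None` result for the unregistered variant mirrors what the reproduction below observes on the real img2img pipeline.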

Reproduction:

from diffusers import LatentConsistencyModelPipeline, LatentConsistencyModelImg2ImgPipeline

kwargs = dict(
    vae=None,
    text_encoder=None,
    tokenizer=None,
    unet=None,
    scheduler=None,
    safety_checker=None,
    feature_extractor=None,
    requires_safety_checker=False,
)

for cls in (LatentConsistencyModelPipeline, LatentConsistencyModelImg2ImgPipeline):
    pipe = cls(**kwargs)
    print(cls.__name__, pipe.config.get("requires_safety_checker"))
# Text2Img prints False; Img2Img prints None.

Relevant precedent:

self.register_modules(
    vae=vae,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    unet=unet,
    scheduler=scheduler,
    safety_checker=safety_checker,
    feature_extractor=feature_extractor,
    image_encoder=image_encoder,
)
self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1) if getattr(self, "vae", None) else 8
self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor)
self.register_to_config(requires_safety_checker=requires_safety_checker)


Suggested fix:

self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1) if getattr(self, "vae", None) else 8
self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor)
self.register_to_config(requires_safety_checker=requires_safety_checker)

Duplicate search:
No matching issue or PR found for LatentConsistencyModelImg2ImgPipeline requires_safety_checker.

Issue 2: Img2Img accepts a safety checker without a feature extractor

Affected code:

self.register_modules(
    vae=vae,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    unet=unet,
    scheduler=scheduler,
    safety_checker=safety_checker,
    feature_extractor=feature_extractor,
    image_encoder=image_encoder,
)
if safety_checker is None and requires_safety_checker:
    logger.warning(
        f"You have disabled the safety checker for {self.__class__} by passing `safety_checker=None`. Ensure"
        " that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered"
        " results in services or applications open to the public. Both the diffusers team and Hugging Face"
        " strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling"
        " it only for use-cases that involve analyzing network behavior or auditing its results. For more"
        " information, please have a look at https://github.com/huggingface/diffusers/pull/254 ."
    )
self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1) if getattr(self, "vae", None) else 8
self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor)

# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.run_safety_checker
def run_safety_checker(self, image, device, dtype):
    if self.safety_checker is None:
        has_nsfw_concept = None
    else:
        if torch.is_tensor(image):
            feature_extractor_input = self.image_processor.postprocess(image, output_type="pil")
        else:
            feature_extractor_input = self.image_processor.numpy_to_pil(image)
        safety_checker_input = self.feature_extractor(feature_extractor_input, return_tensors="pt").to(device)
        image, has_nsfw_concept = self.safety_checker(
            images=image, clip_input=safety_checker_input.pixel_values.to(dtype)
        )

Problem:
The img2img constructor does not reject a non-None safety_checker paired with feature_extractor=None. Later, run_safety_checker calls self.feature_extractor(...) whenever a safety checker is present, producing a late TypeError.

Impact:
Users can successfully construct an invalid pipeline and only fail during inference/safety checking with an unclear `'NoneType' object is not callable` TypeError.
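A constructor-time guard turns that late failure into an immediate, descriptive error. The sketch below uses an illustrative standalone function (not the actual pipeline code) to show the before/after behavior of such a check:

```python
# Hedged sketch: validating the safety_checker/feature_extractor pairing up
# front converts the late "'NoneType' object is not callable" TypeError into
# an early ValueError. The function name is illustrative, not part of diffusers.
def validate_safety_components(safety_checker, feature_extractor):
    if safety_checker is not None and feature_extractor is None:
        raise ValueError(
            "Make sure to define a feature extractor when you want to use the "
            "safety checker, or pass `safety_checker=None` to disable it."
        )


# A safety checker without a feature extractor is rejected immediately:
try:
    validate_safety_components(safety_checker=object(), feature_extractor=None)
except ValueError as e:
    print(type(e).__name__)  # → ValueError

# Both-None (checker disabled) and both-present are accepted:
validate_safety_components(safety_checker=None, feature_extractor=None)
validate_safety_components(safety_checker=object(), feature_extractor=object())
```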

Reproduction:

import torch
from diffusers import LatentConsistencyModelImg2ImgPipeline

class DummySafetyChecker:
    def __call__(self, images, clip_input):
        return images, [False] * images.shape[0]

pipe = LatentConsistencyModelImg2ImgPipeline(
    vae=None,
    text_encoder=None,
    tokenizer=None,
    unet=None,
    scheduler=None,
    safety_checker=DummySafetyChecker(),
    feature_extractor=None,
    requires_safety_checker=True,
)

try:
    pipe.run_safety_checker(torch.zeros(1, 3, 8, 8), "cpu", torch.float32)
except Exception as e:
    print(type(e).__name__, str(e))
# TypeError 'NoneType' object is not callable

Relevant precedent:

if safety_checker is not None and feature_extractor is None:
    raise ValueError(
        "Make sure to define a feature extractor when loading {self.__class__} if you want to use the safety"
        " checker. If you do not want to use the safety checker, you can pass `'safety_checker=None'` instead."
    )

Suggested fix:

if safety_checker is not None and feature_extractor is None:
    raise ValueError(
        "Make sure to define a feature extractor when loading {self.__class__} if you want to use the safety"
        " checker. If you do not want to use the safety checker, you can pass `'safety_checker=None'` instead."
    )

Duplicate search:
No matching issue or PR found for LatentConsistencyModelImg2ImgPipeline feature_extractor.

Issue 3: LCM pipeline __call__ docs have stale parameters and defaults

Affected code:

def __call__(
    self,
    prompt: str | list[str] = None,
    height: int | None = None,
    width: int | None = None,
    num_inference_steps: int = 4,
    original_inference_steps: int = None,
    timesteps: list[int] = None,
    guidance_scale: float = 8.5,
    num_images_per_prompt: int | None = 1,
    generator: torch.Generator | list[torch.Generator] | None = None,
    latents: torch.Tensor | None = None,
    prompt_embeds: torch.Tensor | None = None,
    ip_adapter_image: PipelineImageInput | None = None,
    ip_adapter_image_embeds: list[torch.Tensor] | None = None,
    output_type: str | None = "pil",
    return_dict: bool = True,
    cross_attention_kwargs: dict[str, Any] | None = None,
    clip_skip: int | None = None,
    callback_on_step_end: Callable[[int, int], None] | None = None,
    callback_on_step_end_tensor_inputs: list[str] = ["latents"],
    **kwargs,
):
    r"""
    The call function to the pipeline for generation.

    Args:
        prompt (`str` or `list[str]`, *optional*):
            The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
        height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
            The height in pixels of the generated image.
        width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
            The width in pixels of the generated image.
        num_inference_steps (`int`, *optional*, defaults to 50):
            The number of denoising steps. More denoising steps usually lead to a higher quality image at the
            expense of slower inference.
        original_inference_steps (`int`, *optional*):
            The original number of inference steps use to generate a linearly-spaced timestep schedule, from which
            we will draw `num_inference_steps` evenly spaced timesteps from as our final timestep schedule,
            following the Skipping-Step method in the paper (see Section 4.3). If not set this will default to the
            scheduler's `original_inference_steps` attribute.
        timesteps (`list[int]`, *optional*):
            Custom timesteps to use for the denoising process. If not defined, equal spaced `num_inference_steps`
            timesteps on the original LCM training/distillation timestep schedule are used. Must be in descending
            order.
        guidance_scale (`float`, *optional*, defaults to 7.5):
            A higher guidance scale value encourages the model to generate images closely linked to the text
            `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
            Note that the original latent consistency models paper uses a different CFG formulation where the
def __call__(
    self,
    prompt: str | list[str] = None,
    image: PipelineImageInput = None,
    num_inference_steps: int = 4,
    strength: float = 0.8,
    original_inference_steps: int = None,
    timesteps: list[int] = None,
    guidance_scale: float = 8.5,
    num_images_per_prompt: int | None = 1,
    generator: torch.Generator | list[torch.Generator] | None = None,
    latents: torch.Tensor | None = None,
    prompt_embeds: torch.Tensor | None = None,
    ip_adapter_image: PipelineImageInput | None = None,
    ip_adapter_image_embeds: list[torch.Tensor] | None = None,
    output_type: str | None = "pil",
    return_dict: bool = True,
    cross_attention_kwargs: dict[str, Any] | None = None,
    clip_skip: int | None = None,
    callback_on_step_end: Callable[[int, int], None] | None = None,
    callback_on_step_end_tensor_inputs: list[str] = ["latents"],
    **kwargs,
):
    r"""
    The call function to the pipeline for generation.

    Args:
        prompt (`str` or `list[str]`, *optional*):
            The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
        height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
            The height in pixels of the generated image.
        width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
            The width in pixels of the generated image.
        num_inference_steps (`int`, *optional*, defaults to 50):
            The number of denoising steps. More denoising steps usually lead to a higher quality image at the
            expense of slower inference.
        original_inference_steps (`int`, *optional*):
            The original number of inference steps use to generate a linearly-spaced timestep schedule, from which
            we will draw `num_inference_steps` evenly spaced timesteps from as our final timestep schedule,
            following the Skipping-Step method in the paper (see Section 4.3). If not set this will default to the
            scheduler's `original_inference_steps` attribute.
        timesteps (`list[int]`, *optional*):
            Custom timesteps to use for the denoising process. If not defined, equal spaced `num_inference_steps`
            timesteps on the original LCM training/distillation timestep schedule are used. Must be in descending
            order.
        guidance_scale (`float`, *optional*, defaults to 7.5):
            A higher guidance scale value encourages the model to generate images closely linked to the text
            `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
            Note that the original latent consistency models paper uses a different CFG formulation where the
            guidance scales are decreased by 1 (so in the paper formulation CFG is enabled when `guidance_scale >
Problem:
Both __call__ docstrings say num_inference_steps defaults to 50 and guidance_scale defaults to 7.5, while the signatures default to 4 and 8.5. The img2img docstring also documents nonexistent height and width parameters and omits its real image and strength parameters.

Impact:
The generated API docs mislead users about LCM's fast-step defaults and img2img inputs.

Reproduction:

import inspect
from diffusers import LatentConsistencyModelPipeline, LatentConsistencyModelImg2ImgPipeline

for cls in (LatentConsistencyModelPipeline, LatentConsistencyModelImg2ImgPipeline):
    sig = inspect.signature(cls.__call__)
    doc = inspect.getdoc(cls.__call__) or ""
    print(cls.__name__, sig.parameters["num_inference_steps"].default, sig.parameters["guidance_scale"].default)
    print("doc says steps default 50:", "defaults to 50" in doc)
    print("doc says guidance default 7.5:", "defaults to 7.5" in doc)

img_doc = inspect.getdoc(LatentConsistencyModelImg2ImgPipeline.__call__) or ""
print("img2img signature has image:", "image" in inspect.signature(LatentConsistencyModelImg2ImgPipeline.__call__).parameters)
print("img2img docs image:", "image (`" in img_doc)
print("img2img docs height:", "height (`int`" in img_doc)

Relevant precedent:

image (`torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, `list[torch.Tensor]`, `list[PIL.Image.Image]`, or `list[np.ndarray]`):
    `Image`, numpy array or tensor representing an image batch to be used as the starting point. For both
    numpy array and pytorch tensor, the expected value range is between `[0, 1]` If it's a tensor or a list
    or tensors, the expected shape should be `(B, C, H, W)` or `(C, H, W)`. If it is a numpy array or a
    list of arrays, the expected shape should be `(B, H, W, C)` or `(H, W, C)` It can also accept image
    latents as `image`, but if passing latents directly it is not encoded again.
strength (`float`, *optional*, defaults to 0.8):
    Indicates extent to transform the reference `image`. Must be between 0 and 1. `image` is used as a
    starting point and more noise is added the higher the `strength`. The number of denoising steps depends
    on the amount of noise initially added. When `strength` is 1, added noise is maximum and the denoising
    process runs for the full number of iterations specified in `num_inference_steps`. A value of 1
    essentially ignores `image`.

Suggested fix:
Update the LCM docstrings to match their signatures: num_inference_steps default 4, guidance_scale default 8.5, and for img2img replace the stale height/width entries with image and strength documentation.
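As a sketch of what the corrected img2img entries could look like (defaults read off the actual `__call__` signature, `image`/`strength` wording abbreviated from the precedent above; this is not proposed final text):

```python
# Hedged sketch of corrected Args entries for the img2img docstring.
# Defaults are taken from the __call__ signature; wording is abbreviated.
IMG2IMG_ARGS_DOC = """\
    image (`PipelineImageInput`):
        `Image`, numpy array or tensor representing an image batch to be used as the starting point.
    strength (`float`, *optional*, defaults to 0.8):
        Indicates extent to transform the reference `image`. Must be between 0 and 1.
    num_inference_steps (`int`, *optional*, defaults to 4):
        The number of denoising steps. More denoising steps usually lead to a
        better quality image at the expense of slower inference.
    guidance_scale (`float`, *optional*, defaults to 8.5):
        A larger guidance scale value encourages the model to generate images
        closely linked to the text `prompt` at the expense of lower image quality.
"""

# The stale text2img-only entries should no longer appear:
assert "height" not in IMG2IMG_ARGS_DOC and "width" not in IMG2IMG_ARGS_DOC
```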

Duplicate search:
No matching issue or PR found for LCM img2img docstring/default mismatches.

Coverage and duplicate-search status:
Fast and slow tests exist for both LCM pipelines under tests/pipelines/latent_consistency_models/, so slow coverage is not missing. Local pytest collection of the target tests was blocked because the venv's torch build lacks torch._C._distributed_c10d, which shared test utilities import. python utils/check_copies.py passed. Broad duplicate searches surfaced historical LCM issues/PRs, but no duplicates of the three findings above.
