Updated the `image` and `image_reference` docstrings with the patchified latent shape requirements
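
For reviewers, here is a minimal sketch of what "patchified form" in the new docstring text could look like. It assumes that `H` and `W` in `(B, latent_channels * 4, H // 2, W // 2)` are the height and width of the unfolded latents, and that the four values of each 2×2 patch are laid out row-major within the channel dimension. The helper name `fold_2x2_patches`, the channel ordering, and the example sizes are illustrative assumptions rather than the pipeline's own code, and NumPy stands in for torch to keep the sketch dependency-light.

```python
import numpy as np


def fold_2x2_patches(latents: np.ndarray) -> np.ndarray:
    """Fold each 2x2 spatial patch into channels (hypothetical helper).

    (B, C, H, W) -> (B, C * 4, H // 2, W // 2)
    """
    b, c, h, w = latents.shape
    if h % 2 or w % 2:
        raise ValueError("latent height and width must both be even")
    # Split H into (H // 2, 2) and W into (W // 2, 2).
    x = latents.reshape(b, c, h // 2, 2, w // 2, 2)
    # Move the two in-patch offsets next to the channel axis:
    # (B, C, 2, 2, H // 2, W // 2).
    x = x.transpose(0, 1, 3, 5, 2, 4)
    # Merge channel and in-patch offsets into a single channel axis.
    return x.reshape(b, c * 4, h // 2, w // 2)


# Illustrative sizes only; the real latent_channels depends on the VAE.
unfolded = np.zeros((1, 16, 64, 64), dtype=np.float32)
folded = fold_2x2_patches(unfolded)
print(folded.shape)  # (1, 64, 32, 32)
```

Latents shaped like `folded` are what the docstring describes as accepted without re-encoding. The exact channel order the pipeline expects should be checked against its own patchify helper before relying on this layout.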

adi776borate · adi776borate · commit 4429c94e06bf · 2026-04-14T13:04:01.000Z
diff --git a/src/diffusers/pipelines/flux2/pipeline_flux2_klein_inpaint.py b/src/diffusers/pipelines/flux2/pipeline_flux2_klein_inpaint.py
@@ -855,17 +855,19 @@ def __call__(
                 instead.
             image (`torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.Tensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`):
                 `Image`, numpy array or tensor representing an image batch to be used as the starting point. For both
-                numpy array and pytorch tensor, the expected value range is between `[0, 1]` If it's a tensor or a list
-                or tensors, the expected shape should be `(B, C, H, W)` or `(C, H, W)`. If it is a numpy array or a
-                list of arrays, the expected shape should be `(B, H, W, C)` or `(H, W, C)` It can also accept image
-                latents as `image`, but if passing latents directly it is not encoded again.
+                numpy array and pytorch tensor, the expected value range is between `[0, 1]`. If it's a tensor or a list
+                of tensors, the expected shape should be `(B, C, H, W)` or `(C, H, W)`. If it is a numpy array or a
+                list of arrays, the expected shape should be `(B, H, W, C)` or `(H, W, C)`. It can also accept image latents directly,
+                in which case encoding is skipped. Latents must be in patchified form of shape `(B, latent_channels * 4, H // 2, W // 2)`, where
+                each 2×2 spatial patch has been folded into the channel dimension.
             image_reference (`torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.Tensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`, *optional*):
                 `Image`, numpy array or tensor representing an image batch to be used as the reference for the masked
                 area. This allows conditioning the inpainted region on a specific reference image. For both numpy array
-                and pytorch tensor, the expected value range is between `[0, 1]` If it's a tensor or a list or tensors,
+                and pytorch tensor, the expected value range is between `[0, 1]`. If it's a tensor or a list of tensors,
                 the expected shape should be `(B, C, H, W)` or `(C, H, W)`. If it is a numpy array or a list of arrays,
-                the expected shape should be `(B, H, W, C)` or `(H, W, C)` It can also accept image latents as
-                `image_reference`, but if passing latents directly it is not encoded again.
+                the expected shape should be `(B, H, W, C)` or `(H, W, C)`. It can also accept image latents directly,
+                in which case encoding is skipped. Latents must be in patchified form of shape `(B, latent_channels * 4, H // 2, W // 2)`, where
+                each 2×2 spatial patch has been folded into the channel dimension.
             mask_image (`torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.Tensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`):
                 `Image`, numpy array or tensor representing an image batch to mask `image`. White pixels in the mask
                 are repainted while black pixels are preserved. If `mask_image` is a PIL image, it is converted to a