@@ -855,17 +855,19 @@ def __call__(
855855 instead.
856856 image (`torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.Tensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`):
857857 `Image`, numpy array or tensor representing an image batch to be used as the starting point. For both
858- numpy array and pytorch tensor, the expected value range is between `[0, 1]` If it's a tensor or a list
859- or tensors, the expected shape should be `(B, C, H, W)` or `(C, H, W)`. If it is a numpy array or a
860- list of arrays, the expected shape should be `(B, H, W, C)` or `(H, W, C)` It can also accept image
861- latents as `image`, but if passing latents directly it is not encoded again.
858+ numpy array and pytorch tensor, the expected value range is between `[0, 1]`. If it's a tensor or a list
859+ of tensors, the expected shape should be `(B, C, H, W)` or `(C, H, W)`. If it is a numpy array or a
860+ list of arrays, the expected shape should be `(B, H, W, C)` or `(H, W, C)`. It can also accept image latents directly,
861+ in which case encoding is skipped. Latents must be in patchified form of shape `(B, latent_channels * 4, H // 2, W // 2)`, where
862+ each 2×2 spatial patch has been folded into the channel dimension.
862863 image_reference (`torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.Tensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`, *optional*):
863864 `Image`, numpy array or tensor representing an image batch to be used as the reference for the masked
864865 area. This allows conditioning the inpainted region on a specific reference image. For both numpy array
865- and pytorch tensor, the expected value range is between `[0, 1]` If it's a tensor or a list or tensors,
866+ and pytorch tensor, the expected value range is between `[0, 1]`. If it's a tensor or a list of tensors,
866867 the expected shape should be `(B, C, H, W)` or `(C, H, W)`. If it is a numpy array or a list of arrays,
867- the expected shape should be `(B, H, W, C)` or `(H, W, C)` It can also accept image latents as
868- `image_reference`, but if passing latents directly it is not encoded again.
868+ the expected shape should be `(B, H, W, C)` or `(H, W, C)`. It can also accept image latents directly,
869+ in which case encoding is skipped. Latents must be in patchified form of shape `(B, latent_channels * 4, H // 2, W // 2)`, where
870+ each 2×2 spatial patch has been folded into the channel dimension.
869871 mask_image (`torch.Tensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.Tensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`):
870872 `Image`, numpy array or tensor representing an image batch to mask `image`. White pixels in the mask
871873 are repainted while black pixels are preserved. If `mask_image` is a PIL image, it is converted to a
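The patchified latent layout described in the new docstring text, `(B, latent_channels * 4, H // 2, W // 2)`, can be illustrated with a small sketch. This is not the pipeline's own code: the helper names `patchify_latents` and `unpatchify_latents` are hypothetical, NumPy stands in for PyTorch so the example stays dependency-light, and the ordering of the four patch elements within the channel dimension is an assumption that should be checked against the pipeline's actual packing routine before passing latents in.

```python
import numpy as np


def patchify_latents(latents: np.ndarray) -> np.ndarray:
    """Fold each 2x2 spatial patch into the channel dimension.

    (B, C, H, W) -> (B, C * 4, H // 2, W // 2)

    Assumption: output channel index is c * 4 + row_in_patch * 2 + col_in_patch.
    The real pipeline may order these differently.
    """
    b, c, h, w = latents.shape
    if h % 2 or w % 2:
        raise ValueError("latent height and width must both be even")
    # Split H and W into (coarse position, position inside the 2x2 patch).
    x = latents.reshape(b, c, h // 2, 2, w // 2, 2)
    # Move the two in-patch axes next to the channel axis:
    # (B, C, row_in_patch, col_in_patch, H // 2, W // 2)
    x = x.transpose(0, 1, 3, 5, 2, 4)
    return x.reshape(b, c * 4, h // 2, w // 2)


def unpatchify_latents(patched: np.ndarray) -> np.ndarray:
    """Inverse of patchify_latents: (B, C * 4, H // 2, W // 2) -> (B, C, H, W)."""
    b, c4, h2, w2 = patched.shape
    if c4 % 4:
        raise ValueError("channel dimension must be a multiple of 4")
    c = c4 // 4
    x = patched.reshape(b, c, 2, 2, h2, w2)
    # Restore (B, C, H // 2, row_in_patch, W // 2, col_in_patch).
    x = x.transpose(0, 1, 4, 2, 5, 3)
    return x.reshape(b, c, h2 * 2, w2 * 2)


# Illustrative sizes only: 16 latent channels on an 8x8 latent grid.
latents = np.arange(1 * 16 * 8 * 8, dtype=np.float32).reshape(1, 16, 8, 8)
patched = patchify_latents(latents)
restored = unpatchify_latents(patched)
```

With a PyTorch tensor the same steps would use `Tensor.reshape` and `Tensor.permute` with identical axis orders. The round trip through `unpatchify_latents` is a quick way to confirm that no values were lost or reordered unexpectedly.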