[MAX] Add Wan I2V diffusion pipeline #18
## Summary

Add the Wan image-to-video (I2V) diffusion pipeline, extending the T2V pipeline with image conditioning.

## Description

- Extends `WanPipeline` (from modular#6302) with image conditioning support
- Encodes the input image via VAE, zero-pads to full video length, and concatenates with noise latents (36-channel input: 16 noise + 4 mask + 16 condition)
- Compiles a GPU graph for the I2V channel concatenation
- Supports MoE dual-transformer with per-phase LoRA weight swapping
- Input images can be provided as file paths or URLs (downloaded at runtime)
- Architecture registration for `Wan-AI/Wan2.2-I2V-A14B-Diffusers`, `Wan-AI/Wan2.1-I2V-14B-720P-Diffusers`

## Dependencies

Depends on modular#6302 (T2V pipeline): this pipeline inherits from `WanPipeline`.

## Checklist

- [x] PR is small and focused
- [x] I ran `./bazelw run format` to format my changes

Assisted-by: Claude Code

stack-info: PR: #18, branch: jglee-sqbits/stack/6
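The 36-channel layout in the Description can be illustrated with a minimal NumPy sketch. It is not the PR's implementation: the function names, the `(batch, channels, frames, height, width)` shapes, and the choice to zero-pad in latent space (rather than before VAE encoding) are assumptions made for illustration. Only the channel counts and the mask-then-condition ordering come from the PR.

```python
import numpy as np


def build_i2v_condition(image_latent, num_latent_frames, mask_channels=4):
    """Build a 20-channel condition tensor (4 mask + 16 image latent).

    image_latent: hypothetical VAE latent of the input image with shape
    (batch, 16, 1, height, width).
    """
    b, c, _, h, w = image_latent.shape

    # Zero-pad the encoded image along the time axis to the full video length.
    latent_cond = np.zeros((b, c, num_latent_frames, h, w), dtype=np.float32)
    latent_cond[:, :, :1] = image_latent

    # Temporal mask: 1 on the conditioned first frame, 0 on all other frames.
    mask = np.zeros((b, mask_channels, num_latent_frames, h, w), dtype=np.float32)
    mask[:, :, :1] = 1.0

    # Mask first, then the padded latent, along the channel axis.
    return np.concatenate([mask, latent_cond], axis=1)


def concat_model_input(noise_latents, condition):
    """Concatenate 16 noise channels with the 20 condition channels."""
    return np.concatenate([noise_latents, condition], axis=1)


image_latent = np.ones((1, 16, 1, 4, 4), dtype=np.float32)
noise = np.zeros((1, 16, 5, 4, 4), dtype=np.float32)
condition = build_i2v_condition(image_latent, num_latent_frames=5)
model_input = concat_model_input(noise, condition)
print(condition.shape, model_input.shape)
```

In the actual pipeline this concatenation runs as a compiled GPU graph; the NumPy version only shows how the channel counts add up to 36.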
Code Review
This pull request introduces the WanI2VPipeline for image-to-video generation, extending the WanPipeline with image conditioning via VAE encoding and temporal masking. Feedback identifies a potential runtime error from a dtype mismatch in the condition buffer and recommends removing redundant pre-compilation logic in the execute method.
            [mask_expanded, latent_cond_np], axis=1
        ).astype(np.float32)

        return _numpy_f32_to_buffer(condition, self.vae.config.dtype, device)
The `i2v_condition` buffer should be created using the transformer's working dtype (`self.transformer.config.dtype`) rather than the VAE's dtype. In `AutoencoderKLWanModel`, the VAE dtype is typically hardcoded to `bfloat16`. If the pipeline is configured to run in `float32`, using `self.vae.config.dtype` here will lead to a dtype mismatch during the concatenation step in `_concat_i2v_condition`, causing a runtime error when the graph attempts to concatenate tensors of different types.
Suggested change:

    -        return _numpy_f32_to_buffer(condition, self.vae.config.dtype, device)
    +        return _numpy_f32_to_buffer(condition, self.transformer.config.dtype, device)
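The failure mode described in this comment can be sketched outside the pipeline. The helper below is hypothetical and uses NumPy, with `float16` standing in for `bfloat16` (which NumPy does not provide natively). Plain `np.concatenate` would silently promote mixed dtypes, so the sketch adds an explicit check to mimic a statically typed graph op that rejects them, as the review assumes the compiled concat graph does.

```python
import numpy as np


def strict_concat(tensors, axis=1):
    """Concatenate only if all inputs share one dtype (hypothetical helper)."""
    dtypes = {t.dtype for t in tensors}
    if len(dtypes) != 1:
        names = sorted(str(d) for d in dtypes)
        raise TypeError(f"dtype mismatch in concat: {names}")
    return np.concatenate(tensors, axis=axis)


# Latents in the transformer's working dtype (float32 in this scenario).
latents = np.zeros((1, 16, 5, 4, 4), dtype=np.float32)
# Condition created in the VAE dtype (float16 as a stand-in for bfloat16).
condition_vae_dtype = np.zeros((1, 20, 5, 4, 4), dtype=np.float16)

try:
    strict_concat([latents, condition_vae_dtype])
    mismatch_raised = False
except TypeError:
    mismatch_raised = True

# Creating the condition in the transformer's dtype avoids the mismatch.
condition_fixed = condition_vae_dtype.astype(latents.dtype)
model_input = strict_concat([latents, condition_fixed])
print(mismatch_raised, model_input.dtype, model_input.shape)
```

The suggested change has the same effect as the `astype` call here: the condition buffer is produced in the dtype the transformer-side tensors already use.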
            if self._i2v_concat_model is None:
                latent_model_input = self._cast_f32_to_model_dtype.execute(latents)[
                    0
                ]
                self._i2v_concat_model = self._compile_i2v_concat(
                    latent_model_input, i2v_condition
                )
This pre-compilation block for the I2V concatenation model is redundant. The `_concat_i2v_condition` method, which is called at every step within the denoising loop, already includes logic to lazily compile `_i2v_concat_model` upon its first use. Removing this block simplifies the `execute` method without affecting functionality or performance.
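The lazy-compilation argument can be sketched with a small stand-in class. Everything here is hypothetical: `LazyConcat`, its `compile_count` counter, and the NumPy-based "compiled" callable are illustrative and are not the pipeline's API. The sketch shows that a method which compiles on first use pays the compile cost exactly once, so a separate pre-compilation step before the loop adds nothing.

```python
import numpy as np


class LazyConcat:
    """Hypothetical stand-in for a pipeline that compiles a concat graph lazily."""

    def __init__(self):
        self._model = None
        self.compile_count = 0

    def _compile(self, latents, condition):
        # Stand-in for an expensive graph compilation keyed on input shapes.
        self.compile_count += 1
        return lambda x, y: np.concatenate([x, y], axis=1)

    def concat(self, latents, condition):
        # Compile on first use only; later calls reuse the cached callable.
        if self._model is None:
            self._model = self._compile(latents, condition)
        return self._model(latents, condition)


pipe = LazyConcat()
latents = np.zeros((1, 16, 5, 4, 4), dtype=np.float32)
condition = np.zeros((1, 20, 5, 4, 4), dtype=np.float32)

# Simulated denoising loop: compilation happens inside the first call.
for _ in range(3):
    out = pipe.concat(latents, condition)

print(pipe.compile_count, out.shape)
```

One trade-off is that compiling lazily moves the compile latency into the first denoising step instead of before the loop; the total work is the same either way.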
Stacked PRs:
[MAX] Add Wan I2V diffusion pipeline