[MAX] Add Wan I2V diffusion pipeline #18

Draft
jglee-sqbits wants to merge 1 commit into jglee-sqbits/stack/5 from jglee-sqbits/stack/6

jglee-sqbits commented Apr 1, 2026

Stacked PRs:


[MAX] Add Wan I2V diffusion pipeline

Summary

Add the Wan image-to-video (I2V) diffusion pipeline, extending the T2V pipeline with image conditioning.

Description

  • Extends WanPipeline (from modular/modular#6302, "[MAX] Add Wan T2V diffusion pipeline with MoE support") with image conditioning support
  • Encodes the input image via VAE, zero-pads to full video length, and concatenates with noise latents (36-channel input: 16 noise + 4 mask + 16 condition)
  • Compiles a GPU graph for the I2V channel concatenation
  • Supports MoE dual-transformer with per-phase LoRA weight swapping
  • Input images can be provided as file paths or URLs (downloaded at runtime)
  • Architecture registration for Wan-AI/Wan2.2-I2V-A14B-Diffusers, Wan-AI/Wan2.1-I2V-14B-720P-Diffusers
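
The 36-channel layout described above can be sketched in NumPy. This is a minimal illustration, not the PR's implementation: the function name `build_i2v_input`, the tensor shapes, and the choice to mark only the first latent frame in the mask are assumptions made for the example.

```python
import numpy as np

# Hypothetical sizes for illustration only; none are taken from the PR.
BATCH, LATENT_CH, MASK_CH = 1, 16, 4
FRAMES, HEIGHT, WIDTH = 5, 8, 8  # latent-space dimensions


def build_i2v_input(noise, first_frame_latent):
    """Concatenate noise, mask, and zero-padded image latents on the channel axis."""
    b, c, f, h, w = noise.shape
    # Zero-pad the encoded image to the full video length: only frame 0 is set.
    cond = np.zeros((b, c, f, h, w), dtype=noise.dtype)
    cond[:, :, 0] = first_frame_latent
    # Assumed mask convention: 1 where a frame carries conditioning, 0 elsewhere.
    mask = np.zeros((b, MASK_CH, f, h, w), dtype=noise.dtype)
    mask[:, :, 0] = 1.0
    # Channel order follows the description: 16 noise + 4 mask + 16 condition.
    return np.concatenate([noise, mask, cond], axis=1)


rng = np.random.default_rng(0)
noise = rng.standard_normal(
    (BATCH, LATENT_CH, FRAMES, HEIGHT, WIDTH)
).astype(np.float32)
image_latent = rng.standard_normal(
    (BATCH, LATENT_CH, HEIGHT, WIDTH)
).astype(np.float32)

model_input = build_i2v_input(noise, image_latent)
print(model_input.shape)  # (1, 36, 5, 8, 8)
```

Channels 0 to 15 hold the noise latents, 16 to 19 the mask, and 20 to 35 the padded image latents, which gives the 36-channel transformer input.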

Dependencies

Depends on modular#6302 (T2V pipeline) — inherits from WanPipeline.

Checklist

  • PR is small and focused
  • I ran ./bazelw run format to format my changes

Assisted-by: Claude Code

stack-info: PR: #18, branch: jglee-sqbits/stack/6

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces the WanI2VPipeline for image-to-video generation, extending the WanPipeline with image conditioning via VAE encoding and temporal masking. Feedback identifies a potential runtime error from a dtype mismatch in the condition buffer and recommends removing redundant pre-compilation logic in the execute method.

[mask_expanded, latent_cond_np], axis=1
).astype(np.float32)

return _numpy_f32_to_buffer(condition, self.vae.config.dtype, device)

high

The i2v_condition buffer should be created using the transformer's working dtype (self.transformer.config.dtype) rather than the VAE's dtype. In AutoencoderKLWanModel, the VAE dtype is typically hardcoded to bfloat16. If the pipeline is configured to run in float32, using self.vae.config.dtype here will lead to a dtype mismatch during the concatenation step in _concat_i2v_condition, causing a runtime error when the graph attempts to concatenate tensors of different types.

Suggested change
- return _numpy_f32_to_buffer(condition, self.vae.config.dtype, device)
+ return _numpy_f32_to_buffer(condition, self.transformer.config.dtype, device)
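
The failure mode described in this comment can be illustrated with a small guard. This is a hedged sketch rather than the pipeline's code: `concat_channels_strict` is a hypothetical helper, and `float16` stands in for `bfloat16` because NumPy has no native bfloat16 type. NumPy itself silently promotes mixed dtypes when concatenating, so the explicit check plays the role of the stricter typing the comment attributes to the compiled graph.

```python
import numpy as np


def concat_channels_strict(latents, condition):
    """Concatenate on the channel axis, refusing mixed dtypes (hypothetical helper)."""
    if latents.dtype != condition.dtype:
        raise TypeError(
            f"dtype mismatch: latents are {latents.dtype}, "
            f"condition is {condition.dtype}"
        )
    return np.concatenate([latents, condition], axis=1)


# Latents in the transformer's working dtype (float16 as a bfloat16 stand-in).
latents = np.zeros((1, 16, 2, 4, 4), dtype=np.float16)
# Condition buffer created with a different dtype, as in the flagged line.
condition = np.zeros((1, 20, 2, 4, 4), dtype=np.float32)

try:
    concat_channels_strict(latents, condition)
    mismatch_detected = False
except TypeError:
    mismatch_detected = True

# Creating the condition in the latents' dtype removes the mismatch.
fixed = concat_channels_strict(latents, condition.astype(latents.dtype))
print(mismatch_detected, fixed.shape, fixed.dtype)
```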

Comment on lines +223 to +230
if self._i2v_concat_model is None:
    latent_model_input = self._cast_f32_to_model_dtype.execute(latents)[
        0
    ]
    self._i2v_concat_model = self._compile_i2v_concat(
        latent_model_input, i2v_condition
    )

medium

This pre-compilation block for the I2V concatenation model is redundant. The _concat_i2v_condition method, which is called at every step within the denoising loop, already includes logic to lazily compile _i2v_concat_model upon its first use. Removing this block simplifies the execute method without affecting functionality or performance.
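
The lazy-compilation pattern this comment relies on can be sketched as follows. The class `LazyConcat`, its counter, and the list-based stand-in for the compiled graph are all hypothetical and exist only to show that compilation runs once, on first use, without any pre-compilation step in the caller.

```python
class LazyConcat:
    """Hypothetical stand-in for a pipeline that compiles a graph on first use."""

    def __init__(self):
        self._model = None
        self.compile_count = 0

    def _compile(self):
        # Placeholder for an expensive graph compilation.
        self.compile_count += 1
        return lambda latents, condition: latents + condition

    def concat(self, latents, condition):
        # Lazy initialization: compile only when the model is first needed.
        if self._model is None:
            self._model = self._compile()
        return self._model(latents, condition)


pipe = LazyConcat()
for _ in range(3):  # stands in for the denoising loop
    out = pipe.concat([0.0] * 16, [1.0] * 20)

print(len(out), pipe.compile_count)  # 36 1
```

Because the first call inside the loop triggers compilation, an additional `if self._model is None` block before the loop would do the same work slightly earlier and nothing more.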
