feat: Add Motif-Video model and pipelines #13551
waitingcheung wants to merge 20 commits into huggingface:main
Conversation
Quick ping for visibility. This PR adds Motif-Video (T2V/I2V plus a new transformer and pipelines). Would appreciate your feedback, especially on the dependency/version constraints. This is currently blocking some diffusers-side integration, so your input would help. A working branch for this integration is available here.
…dance support

Add complete Motif Video implementation to diffusers:

New Models:
- Add MotifVideoTransformer3DModel with T5Gemma2Encoder for multimodal conditioning
- Supports text-to-video and image-to-video generation with vision tower integration

New Pipelines:
- Add MotifVideoPipeline for text-to-video generation
  - Default resolution: 736x1280, 121 frames, 25 fps
  - Supports classifier-free guidance and AdaptiveProjectedGuidance
- Add MotifVideoImage2VideoPipeline for image-to-video generation
  - First frame conditioning with vision encoder
  - Same defaults as T2V pipeline

Enhanced Guidance:
- Update AdaptiveProjectedGuidance with normalization_dims parameter
- Support "spatial" normalization for 5D tensors (per-frame spatial normalization)
- Support custom dimension lists for flexible normalization
- Update AdaptiveProjectedMixGuidance with same parameter

Documentation & Tests:
- Add comprehensive API documentation for transformer and pipelines
- Add test suites for both T2V and I2V pipelines
- Register all new components in __init__ files
- Add dummy objects for torch and transformers backends

Total: 18 files changed, 3416 insertions(+), 2 deletions(-)
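To make the new normalization_dims behavior concrete, here is a minimal sketch of per-frame spatial normalization on a 5D video latent; the helper name and exact dim choice are illustrative, not the PR's actual code.

```python
import torch

# Illustrative helper (not from the PR): rescale a guidance update so each
# frame of a 5D latent (batch, channels, frames, height, width) is
# normalized independently over its channel and spatial dimensions.
def normalize_per_frame(diff: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    norm = torch.linalg.vector_norm(diff, ord=2, dim=(1, 3, 4), keepdim=True)
    return diff / norm.clamp_min(eps)
```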
Force-pushed from cd20ffc to 81cce23.
I think we can guard the transformers import in the pipeline with something like
We have something like this at the top of the pipeline code to guide users to upgrade:

```python
# Check transformers version before importing T5Gemma2Encoder
if not is_transformers_version(">=", "5.1.0"):
    import transformers

    raise ImportError(
        f"MotifVideoPipeline requires transformers>=5.1.0. "
        f"Found: {transformers.__version__}. "
        "Please upgrade transformers: pip install transformers --upgrade"
    )
```
Then that should cut it.
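For reference, a guarded import along the lines suggested above could look like the sketch below; is_transformers_version is the existing diffusers helper, while T5Gemma2Encoder is the transformers class this PR depends on (available only in transformers>=5.1.0 per this thread).

```python
from diffusers.utils import is_transformers_version

# Sketch of the guarded-import idea: only import the encoder when the
# installed transformers is new enough; otherwise keep a sentinel so the
# pipeline can raise a clear ImportError at construction time instead of
# failing at module import.
if is_transformers_version(">=", "5.1.0"):
    from transformers import T5Gemma2Encoder
else:
    T5Gemma2Encoder = None
```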
Co-authored-by: Beomgyu Kim <beomgyu.kim@motiftech.io>
- Use default Attention class with custom MotifVideoAttnProcessor2_0
- Inline cross-attention in transformer blocks
- Use dispatch_attention_fn for backend support
- Inherit AttentionMixin for attn_processors/set_attn_processor
- Move TransformerBlockRegistry to _helpers.py
- Add _repeated_blocks for regional compilation
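For context, a processor built around the first three bullets might look roughly like the sketch below; the class name comes from the PR, but the body is only an illustration of the default-Attention-plus-dispatch pattern, not the actual implementation.

```python
import torch
from diffusers.models.attention_dispatch import dispatch_attention_fn


class MotifVideoAttnProcessor2_0:  # name from the PR; body is a sketch
    def __call__(self, attn, hidden_states, encoder_hidden_states=None, attention_mask=None):
        # Cross-attention reads keys/values from the encoder states when given.
        context = hidden_states if encoder_hidden_states is None else encoder_hidden_states

        # Project with the default Attention module's layers, then split heads
        # into the (batch, seq_len, heads, head_dim) layout the dispatcher expects.
        query = attn.to_q(hidden_states).unflatten(2, (attn.heads, -1))
        key = attn.to_k(context).unflatten(2, (attn.heads, -1))
        value = attn.to_v(context).unflatten(2, (attn.heads, -1))

        # dispatch_attention_fn routes to whichever attention backend is configured.
        out = dispatch_attention_fn(query, key, value, attn_mask=attention_mask)
        out = out.flatten(2, 3).type_as(hidden_states)
        return attn.to_out[0](out)
```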
@dg845 Thank you again for the initial design review. I have addressed all your comments above. In order to hold all of the attention parameters in an Attention module, I switched to the default Attention class with a custom MotifVideoAttnProcessor2_0 (see the commit above). I would be glad to address any further comments you have. Thank you.
@claude can you do a review here? |
Claude finished @yiyixuxu's task in 3m 58s

Review of PR #13551: Motif-Video model and pipelines
Thanks for the contribution! The overall structure follows diffusers conventions well.

Correctness / Bugs

1. Hard-coded freqs_dtype=torch.float64 — Per the model conventions (models.md gotcha #6), MPS and NPU backends don't support float64. Every other transformer in the repo uses the device-gated pattern:

```python
is_mps = hidden_states.device.type == "mps"
is_npu = hidden_states.device.type == "npu"
freqs_dtype = torch.float32 if (is_mps or is_npu) else torch.float64
```

2. Docstring defaults don't match signature defaults — The docstrings and the actual signature defaults disagree; they should be kept in sync.

3. The compute dtype is derived from a stored weight:

```python
compute_dtype = next(
    (p.dtype for p in self.timestep_embedder.parameters() if p.is_floating_point()),
    torch.float32,
)
conditioning = self.timestep_embedder(timesteps_proj.to(compute_dtype))
```

Per models.md gotcha #5: "don't cast activations by reading a weight's dtype — the stored weight dtype isn't the compute dtype under gguf / quantized loading." Derive the cast target from the input tensor's dtype instead (see the sketch after this review).

4. The denoising loop contains a sigma hook that never fires:

```python
if hasattr(self.guider, "_current_sigma") and hasattr(self.scheduler, "sigmas"):
    self.guider._current_sigma = float(self.scheduler.sigmas[i])
```

No guider in the repo has a _current_sigma attribute, so this branch is dead code and can be dropped.

Architecture / Pattern Issues
5. Inline cross-attention bypasses the attention processor system — both transformer block variants run their cross-attention inline rather than through a processor. Consider either (a) incorporating the cross-attention into the processor, or (b) creating a separate cross-attention Attention module with its own processor.

6. Missing _available_processors — Per the attention pattern in models.md, the attention class should declare its processors, as e.g. Flux2 does:

```python
_available_processors = [Flux2AttnProcessor, Flux2KVAttnProcessor]
```

Add:

```python
_available_processors = [MotifVideoAttnProcessor2_0]
```

7. Missing # Copied from comments — Several methods are identical between T2V and I2V but lack the usual provenance markers. For the ones that are truly identical, add # Copied from comments so the copies stay in sync.

8. Redundant […] — The method is only called from […].

Minor / Nits

9. Both pipelines accept […].

10. The T2V pipeline has a […].

11. Single-file model loader mapping is a no-op lambda:

```python
"MotifVideoTransformer3DModel": {
    "checkpoint_mapping_fn": lambda checkpoint, **kwargs: checkpoint,
    "default_subfolder": "transformer",
},
```

A no-op mapping function means the checkpoint is expected to already be in diffusers format. If this is intentional (i.e., no original format exists to convert from), it's fine but somewhat unusual. If there's an original checkpoint format, a proper conversion function should be added.

12. Guider […] — The new […].

13. Formatting-only changes in […] — Several hunks in this file are purely reformatting (line-wrapping import statements and function arguments). These add noise to the diff and should ideally be separate or omitted.

Summary

The core model architecture and pipeline structure are solid and follow diffusers conventions. The main areas that need attention are the float64 freqs_dtype (1), the weight-dtype activation cast (3), the dead _current_sigma hook (4), and routing cross-attention through the processor system (5).
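Following up on point 3 above, a minimal sketch of the suggested direction; it reuses the names from the quoted snippet and assumes the forward pass has an input tensor (here hidden_states) in scope whose dtype reflects the actual compute dtype.

```python
# Cast based on the incoming activations rather than a stored weight's dtype,
# which can disagree with the compute dtype under gguf / quantized loading.
conditioning = self.timestep_embedder(timesteps_proj.to(hidden_states.dtype))
```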
What does this PR do?
This PR adds support for Motif-Video - a text-to-video (T2V) and image-to-video (I2V) diffusion model from Motif Technologies. The implementation includes the transformer architecture, both pipeline variants, guidance updates, and comprehensive documentation.
Changes
New Files
- src/diffusers/models/transformers/transformer_motif_video.py - MotifVideoTransformer3DModel
- src/diffusers/pipelines/motif_video/pipeline_motif_video.py - Text-to-Video
- src/diffusers/pipelines/motif_video/pipeline_motif_video_image2video.py - Image-to-Video
- src/diffusers/pipelines/motif_video/pipeline_output.py
- tests/pipelines/motif_video/test_motif_video.py
- tests/pipelines/motif_video/test_motif_video_image2video.py
- docs/source/en/api/models/motif_video_transformer_3d.md
- docs/source/en/api/pipelines/motif_video.md

Key Features
- Text-to-video and image-to-video generation (first-frame conditioning via a vision encoder for I2V)
- Default resolution 736x1280, 121 frames, 25 fps
- Classifier-free guidance and AdaptiveProjectedGuidance support (see the usage sketch below)
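A hypothetical usage sketch based on the defaults above; the checkpoint id is a placeholder and argument names may differ from the final pipeline.

```python
import torch
from diffusers import MotifVideoPipeline
from diffusers.utils import export_to_video

# Placeholder checkpoint id; the real repository name is not given in this thread.
pipe = MotifVideoPipeline.from_pretrained("<motif-video-checkpoint>", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Defaults from the PR description: 736x1280 resolution, 121 frames, 25 fps export.
video = pipe(prompt="a red panda rolling down a grassy hill", num_frames=121).frames[0]
export_to_video(video, "motif_video.mp4", fps=25)
```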
Version Requirements
- transformers>=5.1.0 (required for T5Gemma2Encoder)
Before submitting
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.