feat: Add diffusion finetuning CI pipeline for nightly runs #1728
Merged
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com> Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
…ncher Add HunyuanVideo-1.5 to the diffusion finetuning CI pipeline alongside Wan2.1. Parameterize the launcher script to derive model-specific settings (processor, generate config, model name, frame counts) from the recipe config name. Also fix a pre-existing T5 layer norm compatibility issue in finetune.py that affects Hunyuan training with incompatible apex builds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
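A minimal sketch of the parameterization this commit describes: deriving the processor, generate config, model name, and frame count from the recipe config name. The function name, the `wan*` and `hunyuan*` patterns, the variable names, and every value other than the two model names are hypothetical placeholders, not the contents of the actual launcher script.

```shell
# Hypothetical sketch, not the real launcher.
# Maps a recipe config name to the settings the commit message lists.
derive_model_settings() {
  config_name="${1##*/}"   # strip any leading directory from the path
  case "$config_name" in
    wan*)
      PROCESSOR="wan"                          # placeholder value
      GENERATE_CONFIG="generate_wan.yaml"      # placeholder file name
      MODEL_NAME="Wan2.1-T2V-1.3B"
      NUM_FRAMES=9                             # placeholder value
      ;;
    hunyuan*)
      PROCESSOR="hunyuan"                      # placeholder value
      GENERATE_CONFIG="generate_hunyuan.yaml"  # placeholder file name
      MODEL_NAME="HunyuanVideo-1.5"
      NUM_FRAMES=5                             # placeholder value
      ;;
    *)
      # Fail loudly so an unregistered recipe cannot run with stale settings.
      echo "unknown recipe config: $config_name" >&2
      return 1
      ;;
  esac
}
```

A caller would invoke `derive_model_settings "$RECIPE_CONFIG" || exit 1` (with `RECIPE_CONFIG` also a hypothetical name) before the preprocessing stage, so an unrecognized recipe stops the job early.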
cd1ed6a to ae6885f
/okay to test ae6885f
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
The patch was a workaround for an ABI-incompatible apex build on a specific compute node, not a code issue. CI Docker builds apex from source so it is not needed there. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
/ok to test 40f02e2
Extend the diffusion nightly CI pipeline to support text-to-image models (Flux and QwenImage) alongside the existing text-to-video models (Wan, HunyuanVideo). Uses the diffusers/tuxemon dataset for image CI smoke tests.

Changes:
- Add MEDIA_TYPE branching in launcher for image vs video stages
- Add tuxemon dataset download/extraction with JSONL captions
- Add image preprocessing and .png inference verification paths
- Add ci: sections to flux_t2i_flow.yaml and qwen_image_t2i_flow.yaml
- Register QwenImagePipeline in generate.py output type mapping

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
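The MEDIA_TYPE branching and the .png verification path could look roughly like the sketch below. The two T2I patterns follow the recipe file names in this commit. The function names, the default-to-video fallback, and the `.mp4` extension for video outputs are assumptions made for illustration.

```shell
# Hypothetical sketch of MEDIA_TYPE branching, not the real launcher code.
media_type_for_recipe() {
  case "${1##*/}" in
    flux_t2i_flow*|qwen_image_t2i_flow*) echo "image" ;;
    *) echo "video" ;;   # assumption: every other recipe is text-to-video
  esac
}

# Succeeds if out_dir holds at least one non-empty file with the
# extension expected for the given media type.
verify_outputs() {
  media_type="$1"
  out_dir="$2"
  case "$media_type" in
    image) ext="png" ;;
    video) ext="mp4" ;;   # assumption: video inference writes .mp4 files
    *) return 2 ;;
  esac
  for f in "$out_dir"/*."$ext"; do
    # An unmatched glob stays literal, so -s also covers the "no files" case.
    [ -s "$f" ] && return 0
  done
  return 1
}
```

In a launcher, the inference stage would end with something like `verify_outputs "$MEDIA_TYPE" "$OUTPUT_DIR" || exit 1`, where `OUTPUT_DIR` is again a hypothetical variable name.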
995d498 to 7f19d76
/ok to test 7f19d76
akoumpa approved these changes on Apr 23, 2026
pthombre added a commit that referenced this pull request on Apr 23, 2026

* feat: Add diffusion pipelines for nightly runs

  Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

* Reduce ci runtime to 30 minutes

  Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
  Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

* debug: Check if HF_TOKEN is set

  Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
  Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

* test: revert test variables

  Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
  Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

* feat: add HunyuanVideo nightly CI test and parameterize diffusion launcher

  Add HunyuanVideo-1.5 to the diffusion finetuning CI pipeline alongside Wan2.1. Parameterize the launcher script to derive model-specific settings (processor, generate config, model name, frame counts) from the recipe config name. Also fix a pre-existing T5 layer norm compatibility issue in finetune.py that affects Hunyuan training with incompatible apex builds.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
  Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

* style: ruff format on modified files

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
  Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

* revert: remove patch_t5_layer_norm from finetune.py

  The patch was a workaround for an ABI-incompatible apex build on a specific compute node, not a code issue. CI Docker builds apex from source so it is not needed there.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
  Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

* feat: add Flux and QwenImage T2I nightly CI tests

  Extend the diffusion nightly CI pipeline to support text-to-image models (Flux and QwenImage) alongside the existing text-to-video models (Wan, HunyuanVideo). Uses the diffusers/tuxemon dataset for image CI smoke tests.

  Changes:
  - Add MEDIA_TYPE branching in launcher for image vs video stages
  - Add tuxemon dataset download/extraction with JSONL captions
  - Add image preprocessing and .png inference verification paths
  - Add ci: sections to flux_t2i_flow.yaml and qwen_image_t2i_flow.yaml
  - Register QwenImagePipeline in generate.py output type mapping

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
  Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

---------

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Co-authored-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pthombre added a commit that referenced this pull request on Apr 24, 2026

Cherry-pick of #1728 to r0.4.0, with QwenImage-specific additions dropped because the underlying Qwen-Image support (#1704, #1976) is not on r0.4.0.

Concretely, this variant excludes:
- examples/diffusion/finetune/qwen_image_t2i_flow.yaml (not created)
- "QwenImagePipeline" entry in examples/diffusion/generate/generate.py
- qwen_image_t2i_flow.yaml entry in nightly_recipes.yml
- qwen_image_t2i_flow*) case block in diffusion_finetune_launcher.sh

The remaining CI infrastructure (Wan, Hunyuan, Flux) is unchanged from the original PR.

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
linnanwang pushed a commit that referenced this pull request on Apr 24, 2026
akoumpa pushed a commit that referenced this pull request on Apr 24, 2026

feat: Add diffusion finetuning CI pipeline for nightly runs (#1728)

Cherry-pick of #1728 to r0.4.0, with QwenImage-specific additions dropped because the underlying Qwen-Image support (#1704, #1976) is not on r0.4.0.

Concretely, this variant excludes:
- examples/diffusion/finetune/qwen_image_t2i_flow.yaml (not created)
- "QwenImagePipeline" entry in examples/diffusion/generate/generate.py
- qwen_image_t2i_flow.yaml entry in nightly_recipes.yml
- qwen_image_t2i_flow*) case block in diffusion_finetune_launcher.sh

The remaining CI infrastructure (Wan, Hunyuan, Flux) is unchanged from the original PR.

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
What does this PR do?
Adds a 4-stage CI pipeline (data download, preprocessing, finetuning, inference smoke test) for diffusion model nightly testing, starting with the Wan2.1-T2V-1.3B recipe.
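As a rough illustration of the four-stage flow described above, the sketch below runs the stages in order and stops at the first failure. The stage function names and their bodies are hypothetical stand-ins. The real stages would download data, preprocess it, launch finetuning, and run the inference smoke test.

```shell
# Hypothetical stand-ins for the four stages; each real stage would do
# actual work and return non-zero on failure.
stage_download()   { echo "downloading dataset"; }
stage_preprocess() { echo "preprocessing data"; }
stage_finetune()   { echo "running finetuning"; }
stage_inference()  { echo "running inference smoke test"; }

# Run the stages in order and stop at the first one that fails.
run_pipeline() {
  for stage in download preprocess finetune inference; do
    echo "=== stage: $stage ==="
    if ! "stage_$stage"; then
      echo "stage failed: $stage" >&2
      return 1
    fi
  done
}
```

Calling `run_pipeline || exit 1` from the CI job makes a failed stage fail the nightly run without starting the later, more expensive stages.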
Changelog