feat: Add diffusion finetuning CI pipeline for nightly runs #1728

Merged
pthombre merged 10 commits into main from pranav/diffusion_nightly_runs
Apr 23, 2026

@pthombre pthombre commented Apr 8, 2026

What does this PR do?

Adds a 4-stage CI pipeline (data download, preprocessing, finetuning, inference smoke test) for diffusion model nightly testing, starting with the Wan2.1-T2V-1.3B recipe.
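
For readers unfamiliar with the layout, the four stages run strictly in order, and a failure in one stage means the later ones are skipped. The following is only a minimal sketch of that control flow, not the launcher itself (which is a shell script): the stage names follow the description above, and `run_stages` and the stand-in callables are hypothetical.

```python
from typing import Callable, List, Tuple

# Each stage is a (name, callable) pair. The callable returns True on success.
# The names mirror the PR description; the callables are hypothetical stand-ins.
Stage = Tuple[str, Callable[[], bool]]


def run_stages(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first one that reports failure.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, step in stages:
        if not step():
            break
        completed.append(name)
    return completed


if __name__ == "__main__":
    stages = [
        ("data_download", lambda: True),
        ("preprocessing", lambda: True),
        ("finetuning", lambda: True),
        ("inference_smoke_test", lambda: True),
    ]
    print(run_stages(stages))
```

Stopping at the first failure keeps a broken download or preprocessing step from being reported as a finetuning or inference failure.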

Changelog

  • Add diffusion_finetune_launcher.sh script with data download, video preprocessing, distributed finetuning, and inference validation stages
  • Add nightly_recipes.yml and override_recipes.yml configs for wan2_1_t2v_flow
  • Add ci metadata (recipe_owner, time) to wan2_1_t2v_flow.yaml
  • Extend generate_ci_tests.py to support diffusion_sft stage and custom examples_dir
  • Add consolidated safetensors checkpoint loading support to generate.py
  • Bump diffusers>=0.37.0 to fix NameError in torchao_quantizer (logger used before definition in 0.36.0)
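
The last item describes a name being used before it is defined. The sketch below illustrates that general Python failure mode only; it does not reproduce the diffusers source, and the two module bodies and `import_outcome` are hypothetical.

```python
import logging

# Hypothetical module body: the logger is referenced one line before it is bound.
USED_BEFORE_DEFINED = (
    'logger.info("configuring")\n'
    'logger = logging.getLogger("demo")\n'
)

# The same two statements with the binding moved first.
DEFINED_FIRST = (
    'logger = logging.getLogger("demo")\n'
    'logger.info("configuring")\n'
)


def import_outcome(module_source: str) -> str:
    """Execute module-level source and report whether it raised NameError."""
    namespace = {"logging": logging}
    try:
        exec(module_source, namespace)
    except NameError:
        return "NameError"
    return "ok"


if __name__ == "__main__":
    print(import_outcome(USED_BEFORE_DEFINED))
    print(import_outcome(DEFINED_FIRST))
```

Because module-level statements execute top to bottom at import time, an error of this kind surfaces as soon as the offending line runs, which is why a dependency bump rather than a local code change is the fix here.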

copy-pr-bot Bot commented Apr 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

thomasdhc previously approved these changes Apr 21, 2026

@thomasdhc thomasdhc left a comment

LGTM

pthombre and others added 5 commits April 21, 2026 23:27
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
…ncher

Add HunyuanVideo-1.5 to the diffusion finetuning CI pipeline alongside
Wan2.1. Parameterize the launcher script to derive model-specific settings
(processor, generate config, model name, frame counts) from the recipe
config name. Also fix a pre-existing T5 layer norm compatibility issue
in finetune.py that affects Hunyuan training with incompatible apex builds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
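
The commit above derives model-specific settings from the recipe config name. A rough Python sketch of that kind of lookup follows, mirroring a shell `case` statement with glob patterns. It is an illustration under assumptions, not the launcher's code: `ModelSettings`, `RECIPE_SETTINGS`, and `resolve_settings` are hypothetical, the recipe names are the ones that appear in this PR, the HunyuanVideo recipe name is not shown in this conversation and is therefore omitted, and only the Wan model name comes from the PR description (the others are placeholders).

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Optional


@dataclass(frozen=True)
class ModelSettings:
    media_type: str  # "video" or "image"
    model_name: str  # placeholder unless stated in the PR


# (glob pattern, settings) pairs, checked in order like a shell `case` statement.
RECIPE_SETTINGS = [
    ("wan2_1_t2v_flow*", ModelSettings("video", "Wan2.1-T2V-1.3B")),
    ("flux_t2i_flow*", ModelSettings("image", "<flux-model-placeholder>")),
    ("qwen_image_t2i_flow*", ModelSettings("image", "<qwen-image-model-placeholder>")),
]


def resolve_settings(recipe_config: str) -> Optional[ModelSettings]:
    """Map a recipe config path or file name to its settings, or None if unknown."""
    stem = recipe_config.rsplit("/", 1)[-1]
    if stem.endswith(".yaml"):
        stem = stem[: -len(".yaml")]
    for pattern, settings in RECIPE_SETTINGS:
        if fnmatch(stem, pattern):
            return settings
    return None


if __name__ == "__main__":
    print(resolve_settings("wan2_1_t2v_flow.yaml"))
```

Returning `None` for an unknown recipe lets the caller fail loudly instead of silently running with another model's settings.
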
@pthombre pthombre force-pushed the pranav/diffusion_nightly_runs branch from cd1ed6a to ae6885f Compare April 22, 2026 06:27
@pthombre pthombre marked this pull request as ready for review April 22, 2026 06:28
@pthombre (author) commented

/okay to test ae6885f

pthombre and others added 2 commits April 21, 2026 23:34
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
The patch was a workaround for an ABI-incompatible apex build on a
specific compute node, not a code issue. CI Docker builds apex from
source so it is not needed there.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
@pthombre (author) commented

/ok to test 40f02e2

Extend the diffusion nightly CI pipeline to support text-to-image models
(Flux and QwenImage) alongside the existing text-to-video models (Wan,
HunyuanVideo). Uses the diffusers/tuxemon dataset for image CI smoke tests.

Changes:
- Add MEDIA_TYPE branching in launcher for image vs video stages
- Add tuxemon dataset download/extraction with JSONL captions
- Add image preprocessing and .png inference verification paths
- Add ci: sections to flux_t2i_flow.yaml and qwen_image_t2i_flow.yaml
- Register QwenImagePipeline in generate.py output type mapping

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
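
The image vs. video branching ends in an inference check that looks for generated files of the right type. Below is a minimal sketch of such a check, under stated assumptions: the `.png` suffix for images comes from the commit message, the `.mp4` suffix for video is an assumption, and `EXPECTED_SUFFIX`, `find_outputs`, and `smoke_test_passed` are hypothetical names rather than anything in the launcher.

```python
from pathlib import Path
from typing import List

# ".png" for images comes from the commit message; ".mp4" for video is an assumption.
EXPECTED_SUFFIX = {"image": ".png", "video": ".mp4"}


def find_outputs(output_dir: str, media_type: str) -> List[Path]:
    """Return non-empty generated files of the expected type, sorted by path."""
    suffix = EXPECTED_SUFFIX[media_type]
    return sorted(
        p
        for p in Path(output_dir).rglob(f"*{suffix}")
        if p.is_file() and p.stat().st_size > 0
    )


def smoke_test_passed(output_dir: str, media_type: str) -> bool:
    """The smoke test passes if at least one non-empty output file exists."""
    return len(find_outputs(output_dir, media_type)) > 0
```

Ignoring zero-byte files guards against a run that created its output path but crashed before writing any data.
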
@pthombre pthombre force-pushed the pranav/diffusion_nightly_runs branch from 995d498 to 7f19d76 Compare April 23, 2026 00:05
@pthombre (author) commented

/ok to test 7f19d76

@akoumpa akoumpa added the r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge. label Apr 23, 2026
@pthombre pthombre merged commit 3b21c62 into main Apr 23, 2026
57 checks passed
@pthombre pthombre deleted the pranav/diffusion_nightly_runs branch April 23, 2026 20:07
pthombre added a commit that referenced this pull request Apr 23, 2026
* feat: Add diffusion pipelines for nightly runs

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

* Reduce ci runtime to 30 minutes

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

* debug: Check if HF_TOKEN is set

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

* test: revert test variables

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

* feat: add HunyuanVideo nightly CI test and parameterize diffusion launcher

Add HunyuanVideo-1.5 to the diffusion finetuning CI pipeline alongside
Wan2.1. Parameterize the launcher script to derive model-specific settings
(processor, generate config, model name, frame counts) from the recipe
config name. Also fix a pre-existing T5 layer norm compatibility issue
in finetune.py that affects Hunyuan training with incompatible apex builds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

* style: ruff format on modified files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

* revert: remove patch_t5_layer_norm from finetune.py

The patch was a workaround for an ABI-incompatible apex build on a
specific compute node, not a code issue. CI Docker builds apex from
source so it is not needed there.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

* feat: add Flux and QwenImage T2I nightly CI tests

Extend the diffusion nightly CI pipeline to support text-to-image models
(Flux and QwenImage) alongside the existing text-to-video models (Wan,
HunyuanVideo). Uses the diffusers/tuxemon dataset for image CI smoke tests.

Changes:
- Add MEDIA_TYPE branching in launcher for image vs video stages
- Add tuxemon dataset download/extraction with JSONL captions
- Add image preprocessing and .png inference verification paths
- Add ci: sections to flux_t2i_flow.yaml and qwen_image_t2i_flow.yaml
- Register QwenImagePipeline in generate.py output type mapping

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

---------

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Co-authored-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pthombre added a commit that referenced this pull request Apr 24, 2026
Cherry-pick of #1728 to r0.4.0, with QwenImage-specific additions
dropped because the underlying Qwen-Image support (#1704, #1976) is
not on r0.4.0. Concretely, this variant excludes:

- examples/diffusion/finetune/qwen_image_t2i_flow.yaml (not created)
- "QwenImagePipeline" entry in examples/diffusion/generate/generate.py
- qwen_image_t2i_flow.yaml entry in nightly_recipes.yml
- qwen_image_t2i_flow*) case block in diffusion_finetune_launcher.sh

The remaining CI infrastructure (Wan, Hunyuan, Flux) is unchanged
from the original PR.

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
linnanwang pushed a commit that referenced this pull request Apr 24, 2026
akoumpa pushed a commit that referenced this pull request Apr 24, 2026
feat: Add diffusion finetuning CI pipeline for nightly runs (#1728)
