
ci: Add FLUX/diffusion support to scripts/performance/run_recipe.py#3176

Merged
suiyoubi merged 3 commits into main from ci/flux-performance-recipe-support
Apr 14, 2026

Conversation

@suiyoubi
Contributor

@suiyoubi suiyoubi commented Apr 6, 2026

Add DIFFUSION_FAMILIES set and _get_diffusion_step() helper so the performance CI path can train diffusion models (FLUX, WAN) without overriding their dataset config or using the wrong GPT step function. Also add megatron.bridge.recipes.flux subpackage so get_library_recipe() can locate FLUX recipes via the standard family-name import path.

What does this PR do?

Add a one line overview of what this PR aims to accomplish.

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • New Features

    • Added support for diffusion-based model families (Flux and Wan) with dynamic routing
  • Refactor

    • Enhanced model configuration logic to differentiate settings between diffusion and standard models
    • Improved dataset and sequence length configuration handling
    • Reorganized flux recipes module structure for better namespace organization


Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@suiyoubi suiyoubi requested review from a team, erhoo82 and malay-nagda as code owners April 6, 2026 16:06
@copy-pr-bot

copy-pr-bot Bot commented Apr 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 6, 2026

📝 Walkthrough

Walkthrough

Changes add diffusion model family support to the performance recipe runner by introducing routing logic to dynamically select appropriate diffusion forward steps (Flux or Wan) and conditionally skip GPT-style dataset configuration when diffusion models are detected. A new module re-exports flux recipes for convenient access.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Diffusion Model Routing<br>`scripts/performance/run_recipe.py` | Added `DIFFUSION_FAMILIES` constant and `_get_diffusion_step()` helper to route between Flux and Wan forward steps. Modified `set_user_overrides()` to skip dataset configuration for diffusion runs. Updated `main()` to dynamically select the forward step based on model family and to condition the `seq_length` override on non-diffusion status. |
| Flux Recipes Module<br>`src/megatron/bridge/recipes/flux/__init__.py` | New module that re-exports public symbols from the upstream diffusion recipes flux module via star import, providing a convenience namespace. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 passed | ❌ 1 failed

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Test Results For Major Changes | ⚠️ Warning | PR introduces major new diffusion model features but includes no test results or testing information; review comments indicate incomplete implementation and missing error handling. | Add test results demonstrating FLUX/WAN recipe execution, regression tests for existing GPT workflows, and error-case verification before merging. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The PR title clearly summarizes the main change: adding FLUX/diffusion support to the performance recipe script. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files; skipping docstring coverage check. |


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/performance/run_recipe.py`:
- Around line 235-239: The current routing sets forward_step to the diffusion
step only for pretrain but still uses
megatron.bridge.training.gpt_step.forward_step for SFT/LORA, causing diffusion
recipes to be run with the wrong step; update the logic in run_recipe.py to
detect when args.model_family_name is in DIFFUSION_FAMILIES and args.task is one
of "sft" or "lora" and either (a) set forward_step =
_get_diffusion_step(args.model_family_name) so diffusion finetunes use the
correct diffusion forward_step, or (b) raise a clear ValueError indicating
diffusion models do not support the requested task (include args.task and
args.model_family_name in the message) to fail fast; adjust the branching around
DIFFUSION_FAMILIES, _get_diffusion_step, and forward_step to implement one of
these behaviors.
- Line 55: When is_diffusion (computed as args.model_family_name in
DIFFUSION_FAMILIES) is true, detect explicit CLI overrides (args.data, any args
starting with dataset_ such as args.dataset_path/args.dataset_name, and
args.seq_length) and raise a clear error instead of silently ignoring them;
update the checks around the is_diffusion assignment and the other similar
blocks (the logic at/near lines handling dataset selection and seq_length) to
call parser.error(...) or raise a ValueError with a helpful message referencing
that diffusion models ignore these CLI dataset/sequence overrides, so the run
fails fast and shows which flag(s) were invalid for diffusion runs. Ensure you
reference/inspect args (e.g., args.data, args.seq_length, args.dataset_*) and
DIFFUSION_FAMILIES/is_diffusion when implementing the validation.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4dbefe89-297b-438e-828f-bea12ecccec5

📥 Commits

Reviewing files that changed from the base of the PR and between 2993a55 and 8c1f540.

📒 Files selected for processing (2)
  • scripts/performance/run_recipe.py
  • src/megatron/bridge/recipes/flux/__init__.py

Comment thread: scripts/performance/run_recipe.py, lines +235 to +239

```python
if args.model_family_name in DIFFUSION_FAMILIES:
    forward_step = _get_diffusion_step(args.model_family_name)
else:
    from megatron.bridge.training.gpt_step import forward_step
```
Contributor

@coderabbitai coderabbitai Bot Apr 6, 2026


⚠️ Potential issue | 🟠 Major

Diffusion routing still misses the sft path.

This new branch fixes diffusion pretrain, but the sft/lora block below still unconditionally uses megatron.bridge.training.gpt_step.forward_step. Since this PR also makes FLUX recipes discoverable under megatron.bridge.recipes.flux, --task sft --model_family_name flux can resolve a diffusion recipe and then run the GPT step against it. Please either route diffusion finetunes through the correct step or raise a clear error before entering that path.

Based on learnings: when a feature is not supported, raise an explicit error instead of silently ignoring the input to fail fast with a clear message.


Contributor


this comment seems relevant @suiyoubi .

Contributor


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

@yaoyu-33 added labels area:diffusion (DFM module), ci (CI, automation, test queue, or workflow infrastructure work), and needs-review (PR is ready for code review and waiting on a reviewer) on Apr 6, 2026
@cuichenx previously approved these changes Apr 9, 2026
Diffusion models (FLUX, WAN) do not use tokenizers, so the tokenizer
configuration block should be skipped for them alongside the dataset block.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@suiyoubi
Contributor Author

/ok to test 2174f86

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@suiyoubi
Contributor Author

/ok to test 7996380

@suiyoubi suiyoubi merged commit 98790a7 into main Apr 14, 2026
49 checks passed
@suiyoubi suiyoubi deleted the ci/flux-performance-recipe-support branch April 14, 2026 19:21

Labels

area:diffusion (DFM module) · ci (CI, automation, test queue, or workflow infrastructure work) · needs-review (PR is ready for code review and waiting on a reviewer)

Projects

None yet

4 participants