ci: Add FLUX/diffusion support to scripts/performance/run_recipe.py#3176
Conversation
Add DIFFUSION_FAMILIES set and _get_diffusion_step() helper so the performance CI path can train diffusion models (FLUX, WAN) without overriding their dataset config or using the wrong GPT step function. Also add megatron.bridge.recipes.flux subpackage so get_library_recipe() can locate FLUX recipes via the standard family-name import path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
📝 Walkthrough
Changes add diffusion model family support to the performance recipe runner by introducing routing logic to dynamically select the appropriate diffusion forward step (Flux or Wan) and conditionally skip GPT-style dataset configuration when diffusion models are detected. A new module re-exports flux recipes for convenient access.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/performance/run_recipe.py`:
- Around line 235-239: The current routing sets forward_step to the diffusion
step only for pretrain but still uses
megatron.bridge.training.gpt_step.forward_step for SFT/LORA, causing diffusion
recipes to be run with the wrong step; update the logic in run_recipe.py to
detect when args.model_family_name is in DIFFUSION_FAMILIES and args.task is one
of "sft" or "lora" and either (a) set forward_step =
_get_diffusion_step(args.model_family_name) so diffusion finetunes use the
correct diffusion forward_step, or (b) raise a clear ValueError indicating
diffusion models do not support the requested task (include args.task and
args.model_family_name in the message) to fail fast; adjust the branching around
DIFFUSION_FAMILIES, _get_diffusion_step, and forward_step to implement one of
these behaviors.
- Line 55: When is_diffusion (computed as args.model_family_name in
DIFFUSION_FAMILIES) is true, detect explicit CLI overrides (args.data, any args
starting with dataset_ such as args.dataset_path/args.dataset_name, and
args.seq_length) and raise a clear error instead of silently ignoring them;
update the checks around the is_diffusion assignment and the other similar
blocks (the logic at/near lines handling dataset selection and seq_length) to
call parser.error(...) or raise a ValueError with a helpful message referencing
that diffusion models ignore these CLI dataset/sequence overrides, so the run
fails fast and shows which flag(s) were invalid for diffusion runs. Ensure you
reference/inspect args (e.g., args.data, args.seq_length, args.dataset_*) and
DIFFUSION_FAMILIES/is_diffusion when implementing the validation.
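The validation the reviewer asks for might look like the following sketch. The flag names and Namespace attributes are assumptions for illustration; run_recipe.py's actual argument set may differ.

```python
import argparse

DIFFUSION_FAMILIES = {"flux", "wan"}

def validate_diffusion_overrides(args: argparse.Namespace) -> None:
    """Fail fast when GPT-style dataset/sequence flags are passed for a
    diffusion family, instead of silently ignoring them.

    The attribute names (data, seq_length) are illustrative assumptions.
    """
    if args.model_family_name not in DIFFUSION_FAMILIES:
        return
    offending = [
        flag
        for flag, value in (("--data", args.data), ("--seq_length", args.seq_length))
        if value is not None
    ]
    if offending:
        raise ValueError(
            f"Diffusion family {args.model_family_name!r} ignores these CLI "
            f"dataset/sequence overrides; remove them: {', '.join(offending)}"
        )
```

Calling this right after argument parsing surfaces the invalid flags immediately, rather than letting a long training run proceed with silently dropped options.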
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4dbefe89-297b-438e-828f-bea12ecccec5
📒 Files selected for processing (2)
scripts/performance/run_recipe.py
src/megatron/bridge/recipes/flux/__init__.py
```python
if args.model_family_name in DIFFUSION_FAMILIES:
    forward_step = _get_diffusion_step(args.model_family_name)
else:
    from megatron.bridge.training.gpt_step import forward_step
```
Diffusion routing still misses the sft path.
This new branch fixes diffusion pretrain, but the sft/lora block below still unconditionally uses megatron.bridge.training.gpt_step.forward_step. Since this PR also makes FLUX recipes discoverable under megatron.bridge.recipes.flux, --task sft --model_family_name flux can resolve a diffusion recipe and then run the GPT step against it. Please either route diffusion finetunes through the correct step or raise a clear error before entering that path.
Based on learnings: when a feature is not supported, raise an explicit error instead of silently ignoring the input to fail fast with a clear message.
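One way to implement the review's suggestion, following its option (b) of failing fast, is sketched below. The function signature and the fail-fast choice are assumptions for illustration; the PR could equally route sft/lora through the diffusion step (option a).

```python
DIFFUSION_FAMILIES = {"flux", "wan"}

def select_forward_step(model_family_name, task, diffusion_step, gpt_step):
    """Route to the correct forward step.

    Implements option (b) from the review: raise a clear error for
    diffusion finetune tasks instead of silently running the GPT step.
    diffusion_step / gpt_step are passed in as callables for illustration.
    """
    if model_family_name in DIFFUSION_FAMILIES:
        if task in ("sft", "lora"):
            raise ValueError(
                f"Task {task!r} is not supported for diffusion family "
                f"{model_family_name!r}; only 'pretrain' is wired up."
            )
        return diffusion_step
    return gpt_step
```

Centralizing the branch in one function keeps the pretrain and finetune paths from diverging again when a new task is added.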
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
Diffusion models (FLUX, WAN) do not use tokenizers, so the tokenizer configuration block should be skipped for them alongside the dataset block. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
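The commit above could be structured roughly as follows; the config shape and override values are placeholders, not the script's real configuration.

```python
DIFFUSION_FAMILIES = {"flux", "wan"}

def apply_gpt_overrides(cfg: dict, model_family_name: str) -> dict:
    """Apply GPT-style dataset/tokenizer overrides only for non-diffusion
    families; diffusion recipes (which use no tokenizer) keep their own
    config untouched. The cfg keys and values here are illustrative.
    """
    if model_family_name in DIFFUSION_FAMILIES:
        return cfg  # skip both the dataset and tokenizer blocks
    cfg["dataset"] = {"path": "mock"}     # placeholder dataset override
    cfg["tokenizer"] = {"type": "null"}   # placeholder tokenizer override
    return cfg
```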
/ok to test 2174f86
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
/ok to test 7996380
What does this PR do ?
Add DIFFUSION_FAMILIES set and _get_diffusion_step() helper so the performance CI path can train diffusion models (FLUX, WAN) without overriding their dataset config or using the wrong GPT step function. Also add megatron.bridge.recipes.flux subpackage so get_library_recipe() can locate FLUX recipes via the standard family-name import path.
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information
Summary by CodeRabbit
New Features
Refactor