Enable nemo-ci tests (short runs - perf and non-perf) for Wan + Updating recipes names #3179

Merged

huvunvidia merged 14 commits into main from huvu/wan_nemo_ci on Apr 14, 2026
Conversation

@huvunvidia (Contributor) commented Apr 6, 2026

What does this PR do ?

  • Adds the files needed to enable Wan NeMo-CI tests (short runs: perf and non-perf)
  • Updates recipe names (e.g., wan_1_3B_pretrain_config -> wan_1_3b_pretrain_config) for consistency with other models

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for diffusion model families (Flux, WAN) in the pretraining recipe system.
    • Added dry-run mode to preview and save configurations without executing training.
    • Added WAN model pretraining configuration recipes (1.3B and 14B variants).
  • Improvements

    • Optimized configuration handling for diffusion models, skipping unnecessary dataset setup steps.

@huvunvidia huvunvidia requested review from a team, erhoo82 and malay-nagda as code owners April 6, 2026 18:52
copy-pr-bot (Bot) commented Apr 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai (Bot, Contributor) commented Apr 6, 2026

📝 Walkthrough

Walkthrough

This PR introduces WAN diffusion model support to the performance/pretraining pipeline. The run_recipe.py script is enhanced to detect and handle diffusion model families (flux, wan) with specialized configuration and forward step selection. A new --dryrun mode validates configurations without execution. WAN pretrain configuration recipes are exposed through new wrapper modules.

Changes

Cohort / File(s) — Summary

Diffusion Model Support — scripts/performance/run_recipe.py
  Added DIFFUSION_FAMILIES constant and _get_diffusion_step() helper for dynamic diffusion forward step selection. Updated set_user_overrides() to skip dataset/tokenizer config for diffusion models. Enhanced main() with --dryrun flag to save and validate ConfigContainer without execution. Modified pretraining control flow to select appropriate forward step based on model family.

WAN Recipes Package — src/megatron/bridge/recipes/wan/__init__.py, src/megatron/bridge/recipes/wan/wan.py
  New package initializer and wrapper module exposing wan_1_3b_pretrain_config() and wan_14b_pretrain_config() functions, delegating to underlying diffusion recipe implementations and enabling direct imports from megatron.bridge.recipes.wan.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)

  • Docstring Coverage — Passed: Docstring coverage is 100.00%, which is sufficient (required threshold: 80.00%).
  • Test Results For Major Changes — Passed: Changes are minor integration modifications enabling existing tested functionality through new API endpoints without algorithmic or performance changes.
  • Title check — Passed: The title accurately describes the main changes: enabling WAN model support in nemo-ci tests and updating recipe naming conventions.
  • Description Check — Passed: Check skipped - CodeRabbit's high-level summary is enabled.


@coderabbitai (Bot) left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
scripts/performance/run_recipe.py (1)

91-145: Surface ignored diffusion-only CLI overrides.

For flux / wan, this branch now drops dataset, tokenizer, and seq_length inputs silently. That makes wrapper mistakes look like valid runs while the recipe defaults are still being used. At minimum, log that those options were ignored; ideally reject unsupported overrides explicitly.

Based on learnings: when a feature is not supported, raise an explicit error instead of silently ignoring the input to fail fast with a clear message.

Also applies to: 147-149
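
A fail-fast guard along the lines suggested here might look like the sketch below. The argument names (`data`, `seq_length`, `tokenizer_type`, `tokenizer_model`) are taken from the review prompt and are assumed, not verified against the actual script.

```python
import argparse


def check_diffusion_overrides(args: argparse.Namespace, is_diffusion: bool) -> None:
    """Reject dataset/tokenizer CLI overrides that diffusion recipes would ignore.

    Raises ValueError listing the offending options so callers fail fast
    instead of silently running with recipe defaults.
    """
    if not is_diffusion:
        return
    # Options the diffusion branch does not honor (names are assumptions).
    unsupported = {
        "data": getattr(args, "data", None),
        "seq_length": getattr(args, "seq_length", None),
        "tokenizer_type": getattr(args, "tokenizer_type", None),
        "tokenizer_model": getattr(args, "tokenizer_model", None),
    }
    provided = [name for name, value in unsupported.items() if value is not None]
    if provided:
        raise ValueError(
            "These overrides are not supported for diffusion models "
            f"and would be ignored: {', '.join(provided)}"
        )
```

Calling this before the dataset/tokenizer configuration branch turns a silent no-op into an explicit error.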

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/run_recipe.py` around lines 91 - 145, When is_diffusion
is true the code silently ignores dataset, tokenizer, and seq_length CLI
overrides; add a guard early (before configuring dataset/tokenizer) that checks
if is_diffusion and any of the relevant CLI options were provided (args.data,
args.seq_length, args.dataset_paths, args.index_mapping_dir, args.dataset_root,
args.tokenizer_type, args.tokenizer_model, args.vocab_size) and raise a clear
ValueError listing the unsupported overrides (or at minimum log a warning) so
callers fail fast instead of silently using recipe defaults; update the branch
that currently sets config.dataset and config.tokenizer (and the later similar
branch around the referenced 147-149 area) to rely on this guard.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 34a32fd6-c268-49ae-92be-13f962def847

📥 Commits

Reviewing files that changed from the base of the PR and between 337e2c2 and e90b834.

📒 Files selected for processing (3)
  • scripts/performance/run_recipe.py
  • src/megatron/bridge/recipes/wan/__init__.py
  • src/megatron/bridge/recipes/wan/wan.py

Comment on lines +218 to +224
    if args.dryrun:
        save_path = args.save_config_filepath or "ConfigContainer.yaml"
        os.makedirs(os.path.dirname(os.path.abspath(save_path)), exist_ok=True)
        recipe.to_yaml(save_path)
        logging.info(f"ConfigContainer saved to: {os.path.abspath(save_path)}")
        recipe.print_yaml()
        sys.exit(0)

⚠️ Potential issue | 🟠 Major

Make --dryrun single-writer.

Under torchrun, every rank reaches this branch and writes the same YAML path / prints the config. That can clobber the file and spam logs in multi-rank CI runs.

Proposed fix
     if args.dryrun:
         save_path = args.save_config_filepath or "ConfigContainer.yaml"
-        os.makedirs(os.path.dirname(os.path.abspath(save_path)), exist_ok=True)
-        recipe.to_yaml(save_path)
-        logging.info(f"ConfigContainer saved to: {os.path.abspath(save_path)}")
-        recipe.print_yaml()
+        if get_rank_safe() == 0:
+            os.makedirs(os.path.dirname(os.path.abspath(save_path)), exist_ok=True)
+            recipe.to_yaml(save_path)
+            logging.info(f"ConfigContainer saved to: {os.path.abspath(save_path)}")
+            recipe.print_yaml()
         sys.exit(0)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/run_recipe.py` around lines 218 - 224, Multiple processes
hit the dry-run branch and overwrite/log the same file; make the writer
single-writer by only performing the file write, directory creation,
recipe.to_yaml and recipe.print_yaml on one process (e.g., rank 0). Detect
distributed state via torch.distributed.is_initialized() and
torch.distributed.get_rank() (or fall back to env var RANK) and wrap the current
dryrun block so only rank == 0 executes the os.makedirs,
recipe.to_yaml(save_path) and recipe.print_yaml(), while other ranks simply skip
the write/print and exit.
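
A minimal rank guard of the kind the fix relies on could look like this. `get_rank_safe` is the helper name used in the proposed diff; its body here follows the prompt's suggestion (torch.distributed when initialized, otherwise the launcher-provided `RANK` env var) and is a sketch, not the project's actual helper.

```python
import os


def get_rank_safe() -> int:
    """Return the global rank, inside or outside a torch.distributed context."""
    try:
        import torch.distributed as dist

        if dist.is_available() and dist.is_initialized():
            return dist.get_rank()
    except ImportError:
        pass  # torch not installed; fall back to the env var set by torchrun
    return int(os.environ.get("RANK", "0"))
```

Wrapping the dry-run write in `if get_rank_safe() == 0:` makes the YAML a single-writer operation while all ranks still exit cleanly.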

Comment on lines +20 to +23
__all__ = [
"wan_1_3b_pretrain_config",
"wan_14b_pretrain_config",
]

⚠️ Potential issue | 🔴 Critical

Expose WAN wrappers through top-level megatron.bridge.recipes to prevent CI lookup failure.

These names are correctly exported in this subpackage, but CI lookup uses getattr(recipes, recipe_name) on the top-level recipes module. Since src/megatron/bridge/recipes/__init__.py currently does not import megatron.bridge.recipes.wan.*, wan_1_3b_pretrain_config / wan_14b_pretrain_config may remain undiscoverable there.

Proposed fix (outside this file)
--- a/src/megatron/bridge/recipes/__init__.py
+++ b/src/megatron/bridge/recipes/__init__.py
@@
 from megatron.bridge.recipes.qwen import *
 from megatron.bridge.recipes.qwen_vl import *
+from megatron.bridge.recipes.wan import *
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/wan/__init__.py` around lines 20 - 23, Top-level
recipes module doesn't import the WAN wrappers so getattr(recipes, recipe_name)
can't find wan_1_3b_pretrain_config and wan_14b_pretrain_config; fix by
importing the symbols from megatron.bridge.recipes.wan into the top-level
megatron.bridge.recipes __init__ (e.g., from .wan import
wan_1_3b_pretrain_config, wan_14b_pretrain_config) and add those names to the
top-level __all__ so CI discovery can locate them.
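
The failure mode described here can be illustrated with a toy stand-in: a `getattr` lookup on the top-level package only succeeds once the package's `__init__` re-exports the subpackage symbol. The module objects below are synthetic; only the lookup pattern mirrors the CI behavior described in the comment.

```python
import types

# Simulate the subpackage that defines the recipe function.
wan_subpackage = types.ModuleType("recipes.wan")


def wan_1_3b_pretrain_config():
    """Stand-in recipe function (illustrative only)."""
    return {"model": "wan-1.3b"}


wan_subpackage.wan_1_3b_pretrain_config = wan_1_3b_pretrain_config

# Top-level package before the fix: no re-export, so lookup fails.
recipes = types.ModuleType("recipes")
assert getattr(recipes, "wan_1_3b_pretrain_config", None) is None

# After the fix (the effect of `from megatron.bridge.recipes.wan import *`
# in the top-level __init__): the symbol becomes discoverable.
recipes.wan_1_3b_pretrain_config = wan_subpackage.wan_1_3b_pretrain_config
assert getattr(recipes, "wan_1_3b_pretrain_config")() == {"model": "wan-1.3b"}
```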

Comment thread src/megatron/bridge/recipes/wan/wan.py Outdated
Comment on lines +34 to +35
Delegates to megatron.bridge.diffusion.recipes.wan.wan_1_3B_pretrain_config.
"""

⚠️ Potential issue | 🟡 Minor

Fix delegation import path in docstrings for accuracy.

The documented target path omits the final module segment (...wan.wan...), which can mislead maintenance/debugging.

Suggested docstring correction
-    Delegates to megatron.bridge.diffusion.recipes.wan.wan_1_3B_pretrain_config.
+    Delegates to megatron.bridge.diffusion.recipes.wan.wan.wan_1_3B_pretrain_config.
@@
-    Delegates to megatron.bridge.diffusion.recipes.wan.wan_14B_pretrain_config.
+    Delegates to megatron.bridge.diffusion.recipes.wan.wan.wan_14B_pretrain_config.

Also applies to: 43-44

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/wan/wan.py` around lines 34 - 35, Update the
inaccurate docstring import path in this module: change the two occurrences that
currently point to
megatron.bridge.diffusion.recipes.wan.wan_1_3B_pretrain_config (missing the
final module segment) to include the extra "wan" segment so the full target
reads megatron.bridge.diffusion.recipes.wan.wan.wan_1_3B_pretrain_config; edit
the module/docstring text in src/megatron/bridge/recipes/wan/wan.py (both
locations referenced around the top and again later) to reflect the corrected
fully-qualified module path.

@huvunvidia
Contributor Author

/ok to test ba6a18b

@huvunvidia
Contributor Author

/ok to test 549a877

Comment thread scripts/performance/run_recipe.py
Comment thread scripts/training/run_recipe.py
@yaoyu-33
Contributor

yaoyu-33 commented Apr 6, 2026

seems not ready? pls ping Chen when ready.

@huvunvidia
Contributor Author

/ok to test fe1dba9

@huvunvidia huvunvidia marked this pull request as draft April 7, 2026 18:11
@huvunvidia huvunvidia changed the title Enable nemo-ci tests (non-perf) for Wan + Updating recipes names Enable nemo-ci tests (short runs - perf and non-perf) for Wan + Updating recipes names Apr 9, 2026
@huvunvidia
Contributor Author

/ok to test 06fbd4d

@huvunvidia
Contributor Author

/ok to test 8bd1b30

@huvunvidia huvunvidia marked this pull request as ready for review April 9, 2026 17:54
@cuichenx (Contributor) left a comment
LGTM thanks

@huvunvidia huvunvidia enabled auto-merge (squash) April 9, 2026 20:56
@yaoyu-33 yaoyu-33 added area:diffusion DFM module area:recipe Training recipes and launch configs ci CI, automation, test queue, or workflow infrastructure work labels Apr 13, 2026
@ko3n1g ko3n1g added the r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge. label Apr 14, 2026
@huvunvidia huvunvidia merged commit 5fbb559 into main Apr 14, 2026
55 checks passed
@huvunvidia huvunvidia deleted the huvu/wan_nemo_ci branch April 14, 2026 16:24
svcnvidia-nemo-ci pushed a commit that referenced this pull request Apr 14, 2026
…ing recipes names (#3179)

Co-authored-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>