Enable nemo-ci tests (short runs - perf and non-perf) for Wan + Updating recipes names #3179

huvunvidia merged 14 commits into main
Conversation
📝 Walkthrough

This PR introduces WAN diffusion model support to the performance/pretraining pipeline.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ Passed checks (4 passed)
Actionable comments posted: 3
🧹 Nitpick comments (1)
scripts/performance/run_recipe.py (1)
91-145: Surface ignored diffusion-only CLI overrides.

For flux/wan, this branch now drops dataset, tokenizer, and `seq_length` inputs silently. That makes wrapper mistakes look like valid runs while the recipe defaults are still being used. At minimum, log that those options were ignored; ideally, reject unsupported overrides explicitly. Based on learnings: when a feature is not supported, raise an explicit error instead of silently ignoring the input, to fail fast with a clear message.
Also applies to: 147-149
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/performance/run_recipe.py` around lines 91 - 145, When is_diffusion is true the code silently ignores dataset, tokenizer, and seq_length CLI overrides; add a guard early (before configuring dataset/tokenizer) that checks if is_diffusion and any of the relevant CLI options were provided (args.data, args.seq_length, args.dataset_paths, args.index_mapping_dir, args.dataset_root, args.tokenizer_type, args.tokenizer_model, args.vocab_size) and raise a clear ValueError listing the unsupported overrides (or at minimum log a warning) so callers fail fast instead of silently using recipe defaults; update the branch that currently sets config.dataset and config.tokenizer (and the later similar branch around the referenced 147-149 area) to rely on this guard.
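The guard the prompt describes could look like the sketch below. This is illustrative only: the argument names are copied from the review comment and may not match the script's actual CLI, and the function name is hypothetical.

```python
# Hypothetical guard sketch; argument names are taken from the review comment
# above and may differ from run_recipe.py's real CLI.
def reject_unsupported_diffusion_overrides(args) -> None:
    """Fail fast when a diffusion recipe (flux/wan) receives LLM-only overrides."""
    unsupported = {
        "data": args.data,
        "seq_length": args.seq_length,
        "dataset_paths": args.dataset_paths,
        "index_mapping_dir": args.index_mapping_dir,
        "dataset_root": args.dataset_root,
        "tokenizer_type": args.tokenizer_type,
        "tokenizer_model": args.tokenizer_model,
        "vocab_size": args.vocab_size,
    }
    provided = [name for name, value in unsupported.items() if value is not None]
    if provided:
        raise ValueError(
            "Diffusion recipes (flux/wan) ignore these CLI overrides: "
            f"{', '.join(provided)}. Remove them, or use a non-diffusion recipe."
        )
```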
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 34a32fd6-c268-49ae-92be-13f962def847
📒 Files selected for processing (3)

- scripts/performance/run_recipe.py
- src/megatron/bridge/recipes/wan/__init__.py
- src/megatron/bridge/recipes/wan/wan.py
```python
if args.dryrun:
    save_path = args.save_config_filepath or "ConfigContainer.yaml"
    os.makedirs(os.path.dirname(os.path.abspath(save_path)), exist_ok=True)
    recipe.to_yaml(save_path)
    logging.info(f"ConfigContainer saved to: {os.path.abspath(save_path)}")
    recipe.print_yaml()
    sys.exit(0)
```
Make --dryrun single-writer.
Under torchrun, every rank reaches this branch and writes the same YAML path / prints the config. That can clobber the file and spam logs in multi-rank CI runs.
Proposed fix
```diff
 if args.dryrun:
     save_path = args.save_config_filepath or "ConfigContainer.yaml"
-    os.makedirs(os.path.dirname(os.path.abspath(save_path)), exist_ok=True)
-    recipe.to_yaml(save_path)
-    logging.info(f"ConfigContainer saved to: {os.path.abspath(save_path)}")
-    recipe.print_yaml()
+    if get_rank_safe() == 0:
+        os.makedirs(os.path.dirname(os.path.abspath(save_path)), exist_ok=True)
+        recipe.to_yaml(save_path)
+        logging.info(f"ConfigContainer saved to: {os.path.abspath(save_path)}")
+        recipe.print_yaml()
     sys.exit(0)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/performance/run_recipe.py` around lines 218 - 224, Multiple processes
hit the dry-run branch and overwrite/log the same file; make the writer
single-writer by only performing the file write, directory creation,
recipe.to_yaml and recipe.print_yaml on one process (e.g., rank 0). Detect
distributed state via torch.distributed.is_initialized() and
torch.distributed.get_rank() (or fall back to env var RANK) and wrap the current
dryrun block so only rank == 0 executes the os.makedirs,
recipe.to_yaml(save_path) and recipe.print_yaml(), while other ranks simply skip
the write/print and exit.
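If run_recipe.py has no rank helper in scope, the detection the prompt describes might be implemented as below; `get_rank_safe` here is a sketch matching the name used in the proposed diff, not a confirmed utility in the codebase.

```python
import os

import torch.distributed as dist


def get_rank_safe() -> int:
    """Global rank that works before and after init_process_group.

    Falls back to the RANK environment variable, which torchrun exports
    for every worker even before the process group is initialized.
    """
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    return int(os.environ.get("RANK", "0"))
```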
```python
__all__ = [
    "wan_1_3b_pretrain_config",
    "wan_14b_pretrain_config",
]
```
Expose WAN wrappers through top-level megatron.bridge.recipes to prevent CI lookup failure.
These names are correctly exported in this subpackage, but CI lookup uses getattr(recipes, recipe_name) on the top-level recipes module. Since src/megatron/bridge/recipes/__init__.py currently does not import megatron.bridge.recipes.wan.*, wan_1_3b_pretrain_config / wan_14b_pretrain_config may remain undiscoverable there.
Proposed fix (outside this file)

```diff
--- a/src/megatron/bridge/recipes/__init__.py
+++ b/src/megatron/bridge/recipes/__init__.py
@@
 from megatron.bridge.recipes.qwen import *
 from megatron.bridge.recipes.qwen_vl import *
+from megatron.bridge.recipes.wan import *
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/recipes/wan/__init__.py` around lines 20 - 23, Top-level
recipes module doesn't import the WAN wrappers so getattr(recipes, recipe_name)
can't find wan_1_3b_pretrain_config and wan_14b_pretrain_config; fix by
importing the symbols from megatron.bridge.recipes.wan into the top-level
megatron.bridge.recipes __init__ (e.g., from .wan import
wan_1_3b_pretrain_config, wan_14b_pretrain_config) and add those names to the
top-level __all__ so CI discovery can locate them.
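For context, the CI discovery pattern referenced above boils down to a name lookup on the top-level package, so a symbol exported only from the subpackage is invisible to it. A minimal illustration (the recipe factory may take arguments in practice):

```python
import megatron.bridge.recipes as recipes

# CI resolves the factory by name on the *top-level* package; this raises
# AttributeError unless the wan subpackage is re-exported from
# megatron/bridge/recipes/__init__.py.
recipe_name = "wan_1_3b_pretrain_config"
recipe_fn = getattr(recipes, recipe_name)
config = recipe_fn()  # build the config with recipe defaults
```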
```python
    Delegates to megatron.bridge.diffusion.recipes.wan.wan_1_3B_pretrain_config.
    """
```
Fix delegation import path in docstrings for accuracy.
The documented target path omits the final module segment (...wan.wan...), which can mislead maintenance/debugging.
Suggested docstring correction

```diff
- Delegates to megatron.bridge.diffusion.recipes.wan.wan_1_3B_pretrain_config.
+ Delegates to megatron.bridge.diffusion.recipes.wan.wan.wan_1_3B_pretrain_config.
@@
- Delegates to megatron.bridge.diffusion.recipes.wan.wan_14B_pretrain_config.
+ Delegates to megatron.bridge.diffusion.recipes.wan.wan.wan_14B_pretrain_config.
```

Also applies to: 43-44
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/recipes/wan/wan.py` around lines 34 - 35, Update the
inaccurate docstring import path in this module: change the two occurrences that
currently point to
megatron.bridge.diffusion.recipes.wan.wan_1_3B_pretrain_config (missing the
final module segment) to include the extra "wan" segment so the full target
reads megatron.bridge.diffusion.recipes.wan.wan.wan_1_3B_pretrain_config; edit
the module/docstring text in src/megatron/bridge/recipes/wan/wan.py (both
locations referenced around the top and again later) to reflect the corrected
fully-qualified module path.
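With the corrected path, the delegating wrapper would read roughly as follows. This is a hypothetical shape, since the actual signature and docstring body live in src/megatron/bridge/recipes/wan/wan.py:

```python
# Hypothetical wrapper shape; the import path follows the review's corrected
# fully-qualified module (note the repeated "wan" segment).
from megatron.bridge.diffusion.recipes.wan.wan import (
    wan_1_3B_pretrain_config as _wan_1_3b_impl,
)


def wan_1_3b_pretrain_config(**kwargs):
    """WAN 1.3B pretraining recipe.

    Delegates to megatron.bridge.diffusion.recipes.wan.wan.wan_1_3B_pretrain_config.
    """
    return _wan_1_3b_impl(**kwargs)
```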
/ok to test ba6a18b

Force-pushed from ba6a18b to 549a877

/ok to test 549a877

Seems not ready? Please ping Chen when ready.

/ok to test fe1dba9

/ok to test 06fbd4d

/ok to test 8bd1b30
Enable nemo-ci tests (short runs - perf and non-perf) for Wan + Updating recipes names (#3179)

Co-authored-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
What does this PR do?
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information
Summary by CodeRabbit
Release Notes
New Features
Improvements