[bridge] feat: Add ERNIE 4.5 text-only MoE and VL bridges for Megatro…#3263
bo-ke wants to merge 5 commits into NVIDIA-NeMo:main from
Conversation
📝 Walkthrough
This pull request adds comprehensive support for ERNIE 4.5 models to the Megatron Bridge, including both text-only MoE causal language models and vision-language MoE models. It extends the core model conversion infrastructure with enhanced expert parameter handling, introduces new bridge implementations with specialized parameter mappings (including PP-safe skipping and expert index offsetting), implements dual-pool Mixture-of-Experts routing with modality awareness, adds vision transformer components with custom absolute 2D rotary positional embeddings, and provides end-to-end conversion and validation tests.
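The dual-pool, modality-aware MoE routing mentioned above can be pictured with a small, purely illustrative sketch (plain Python; the function and the two pool stand-ins are hypothetical, not the bridge's real API): tokens are split by a modality mask, dispatched to separate text and vision expert pools, and scattered back to their original positions.

```python
def route_by_modality(tokens, is_image):
    """Dispatch tokens to one of two expert pools based on a modality mask."""
    text_idx = [i for i, img in enumerate(is_image) if not img]
    vis_idx = [i for i, img in enumerate(is_image) if img]

    # Stand-ins for the two expert pools (real pools would be MoE layers).
    text_moe = lambda xs: [f"text({x})" for x in xs]
    vision_moe = lambda xs: [f"vis({x})" for x in xs]

    # Scatter each pool's outputs back to the original token positions.
    out = [None] * len(tokens)
    for i, y in zip(text_idx, text_moe([tokens[i] for i in text_idx])):
        out[i] = y
    for i, y in zip(vis_idx, vision_moe([tokens[i] for i in vis_idx])):
        out[i] = y
    return out

print(route_by_modality(["t0", "im1", "t2"], [False, True, False]))
# → ['text(t0)', 'vis(im1)', 'text(t2)']
```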
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (2 passed)
Actionable comments posted: 6
🧹 Nitpick comments (21)
tests/functional_tests/test_groups/models/ernie_vl/hf_loss_check.py (1)
107-108: Consider adding `strict=True` to `zip()` for defensive coding.
While the three lists are guaranteed to have equal length (all k=5 from topk), adding `strict=True` makes this assumption explicit and catches any future bugs if the data sources diverge.
♻️ Suggested change
```diff
 print(f"    Position {pos}: actual_next='{actual_decoded}'({actual_next}), "
-      f"top5={list(zip(decoded_predictions, topk_ids.tolist(), topk_vals.tolist()))}")
+      f"top5={list(zip(decoded_predictions, topk_ids.tolist(), topk_vals.tolist(), strict=True))}")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/test_groups/models/ernie_vl/hf_loss_check.py` around lines 107 - 108, The print line that zips decoded_predictions, topk_ids.tolist(), and topk_vals.tolist() should use zip(..., strict=True) to enforce equal lengths and catch mismatches; update the zip call in the code that prints predictions (the line referencing decoded_predictions, topk_ids, topk_vals) to zip(decoded_predictions, topk_ids.tolist(), topk_vals.tolist(), strict=True).
src/megatron/bridge/models/ernie/ernie_45_provider.py (1)
23-24: Consider using modern union syntax.
The import of `Optional` from `typing` can be replaced with the modern `X | None` syntax, which is the preferred style in this codebase for source files.
♻️ Suggested change
```diff
 from dataclasses import dataclass
-from typing import Callable, Optional
+from typing import Callable
```
And update line 31:
```diff
-def _ernie45_decoder_block_spec(config: "Ernie45ModelProvider", vp_stage: Optional[int] = None):
+def _ernie45_decoder_block_spec(config: "Ernie45ModelProvider", vp_stage: int | None = None):
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/ernie/ernie_45_provider.py` around lines 23 - 24, Replace the legacy typing.Optional import and usages with the modern union syntax: remove Optional from the import list (leaving dataclass and Callable) and change any type hints in this module that use Optional[...] to the X | None form (e.g., foo: Optional[Bar] -> foo: Bar | None); update the annotation at the location referenced (line with current Optional usage) to use the new union syntax.
src/megatron/bridge/models/ernie_vl/ernie_decoder_layer_spec.py (1)
182-184: `ColumnParallel` unpacked but unused in this function.
The variable `ColumnParallel` is unpacked from `_get_linear_modules()` but not used in `_get_ernie_decoder_layer_spec`. This is a minor style issue since the function relies on positional unpacking. Consider using `_` for unused variables.
Suggested fix
```diff
-    ColumnParallel, RowParallel, DotProductAttention, LayerNormColumnParallel, Norm = (
+    _, RowParallel, DotProductAttention, LayerNormColumnParallel, Norm = (
         _get_linear_modules()
     )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/ernie_vl/ernie_decoder_layer_spec.py` around lines 182 - 184, The function _get_ernie_decoder_layer_spec unpacks ColumnParallel from _get_linear_modules() but never uses it; update the unpacking to replace ColumnParallel with an underscore (e.g., _, RowParallel, DotProductAttention, LayerNormColumnParallel, Norm = _get_linear_modules()) so the unused value is explicitly ignored and the intent is clear, making sure to only change the unpack target and not other logic in _get_ernie_decoder_layer_spec.
src/megatron/bridge/models/ernie_vl/__init__.py (1)
22-29: Consider sorting `__all__` alphabetically for consistency.
Ruff flags that the `__all__` list is unsorted. While this doesn't affect functionality, sorting it improves readability and aligns with isort conventions.
Suggested fix
```diff
 __all__ = [
+    "Ernie45VLBridge",
     "Ernie45VLModel",
-    "Ernie45VLBridge",
     "Ernie45VLModelProvider",
     "ErnieMultiTypeMoE",
-    "MultiTypeMoeSubmodules",
     "get_ernie45_vl_decoder_block_spec",
+    "MultiTypeMoeSubmodules",
 ]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/ernie_vl/__init__.py` around lines 22 - 29, The __all__ list is unsorted; reorder the exported names alphabetically so linters (Ruff/isort) are satisfied — update the __all__ tuple containing "Ernie45VLModel", "Ernie45VLBridge", "Ernie45VLModelProvider", "ErnieMultiTypeMoE", "MultiTypeMoeSubmodules", and "get_ernie45_vl_decoder_block_spec" into alphabetical order (e.g., "Ernie45VLBridge", "Ernie45VLModel", "Ernie45VLModelProvider", "ErnieMultiTypeMoE", "MultiTypeMoeSubmodules", "get_ernie45_vl_decoder_block_spec").
tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_fwd_bwd.py (4)
415-422: Prefix unused loop variable with underscore.
The `name` variable is not used in the loop body.
🧹 Proposed fix
```diff
 for m in megatron_models:
-    for name, param in m.named_parameters():
+    for _name, param in m.named_parameters():
         if param.requires_grad:
             total_params += 1
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_fwd_bwd.py` around lines 415 - 422, The loop over model parameters uses an unused variable `name`; change the tuple unpacking in the loop that iterates `for name, param in m.named_parameters():` to prefix the unused variable (e.g., `for _, param in m.named_parameters():`) so only `param` is referenced, leaving the rest of the logic (`param.requires_grad`, `param.grad`, and the increments `total_params`, `params_with_grad`, `params_with_nonzero_grad`) unchanged.
83-83: Remove extraneous `f` prefix from string without placeholders.
🧹 Proposed fix
```diff
-print_rank_0(f"=== ERNIE 4.5 VL Forward/Backward Test ===")
+print_rank_0("=== ERNIE 4.5 VL Forward/Backward Test ===")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_fwd_bwd.py` at line 83, The print_rank_0 call uses an unnecessary f-string for a literal with no placeholders; in the call to print_rank_0 (string "=== ERNIE 4.5 VL Forward/Backward Test ===") remove the leading f so it becomes a plain string literal, i.e., update the argument passed to print_rank_0 accordingly.
69-69: Use explicit `Optional` or union syntax for optional parameters.
PEP 484 prohibits implicit `Optional` (using `= None` without a type annotation that includes `None`).
🧹 Proposed fix
```diff
+from typing import Optional
+
 def run_forward_backward(
     hf_model_path: str,
     tp: int = 1,
     pp: int = 1,
     ep: int = 1,
     seq_len: int = 16,
     backward: bool = True,
     with_vision: bool = False,
-    prompt: str = None,
+    prompt: Optional[str] = None,
 ):
```
Same applies to line 467 in `_run()`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_fwd_bwd.py` at line 69, The parameter annotation uses implicit Optional by assigning prompt: str = None; update the type to explicitly allow None (e.g., use Optional[str] or str | None) for the prompt parameter and likewise for the _run() signature mentioned at line 467; import typing.Optional if using Optional[] and ensure both function signatures (the one containing prompt and the _run method) have explicit union/Optional annotations instead of bare "= None".
36-37: Remove unused imports.
`sys` and `traceback` are imported but never used in this file.
🧹 Proposed fix
```diff
 import argparse
 import os
-import sys
-import traceback

 # Disable torch.compile to avoid triton compatibility issues in some environments.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_fwd_bwd.py` around lines 36 - 37, Remove the unused module imports 'sys' and 'traceback' from the top-level import list in the file (they are not referenced anywhere); update the import block to only include the modules actually used (e.g., keep other imports like when importing pytest or model helpers) so no unused symbols remain and linters stop flagging these imports.
tests/functional_tests/test_groups/models/ernie_vl/test_ernie45_vl_conversion.py (2)
317-319: Replace `assert False` with `pytest.fail()` for test failure reporting.
Using `assert False` can be removed by Python's `-O` flag. Use `pytest.fail()` for proper test failure reporting.
🧪 Proposed fix
```diff
 if result.returncode != 0:
     print(f"STDOUT: {result.stdout}")
     print(f"STDERR: {result.stderr}")
-    assert False, (
+    pytest.fail(
         f"ERNIE 4.5 VL {test_name} conversion failed with return code {result.returncode}"
     )
```
Same pattern applies to lines 427-429.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/test_groups/models/ernie_vl/test_ernie45_vl_conversion.py` around lines 317 - 319, Replace the brittle assert False usage in the failure branches with pytest.fail(...) so failures are reported reliably; specifically in test_ernie45_vl_conversion.py replace the `assert False, (f"ERNIE 4.5 VL {test_name} conversion failed with return code {result.returncode}")` at the block that checks result.returncode and the same pattern around lines 427-429 with `pytest.fail` called with the same formatted message (ensure pytest is imported if not present) so test_name and result.returncode are included in the failure message.
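The `-O` caveat is easy to demonstrate in isolation (a standalone sketch, independent of this test suite): the same script fails in a normal run but sails past the assert when assert statements are compiled out.

```python
import subprocess
import sys
import textwrap

script = textwrap.dedent("""
    assert False, "this should fail"
    print("reached past the assert")
""")

# Normal run: the assert raises AssertionError and the process exits non-zero.
normal = subprocess.run([sys.executable, "-c", script], capture_output=True, text=True)

# With -O, assert statements are stripped at compile time, so the script
# "passes" and the line after the assert executes.
optimized = subprocess.run([sys.executable, "-O", "-c", script], capture_output=True, text=True)

print(normal.returncode != 0)                          # → True
print("reached past the assert" in optimized.stdout)   # → True
```

`pytest.fail()`, by contrast, raises pytest's `Failed` exception unconditionally, so it reports the failure regardless of interpreter flags.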
150-181: Consider using more specific exception types in tokenizer fallback logic.
The nested try/except blocks catch broad `Exception` types. While this provides robust fallback behavior, catching more specific exceptions (e.g., `OSError`, `ImportError`) would make debugging easier if unexpected errors occur.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/test_groups/models/ernie_vl/test_ernie45_vl_conversion.py` around lines 150 - 181, Change the broad except Exception blocks in the tokenizer fallback to catch specific errors: for the AutoTokenizer.from_pretrained block (around _local_tokenizer_path / AutoTokenizer.from_pretrained) catch loading-related errors such as OSError, ValueError (e.g., except (OSError, ValueError) as e) so only model I/O/format issues trigger the fallback; for the inner fallback that imports transformers/tokenizers and builds PreTrainedTokenizerFast (around the imports, TkTokenizer, tk.add_tokens, PreTrainedTokenizerFast, fast_tokenizer.save_pretrained) catch import/runtime issues such as ImportError, ModuleNotFoundError, RuntimeError (e.g., except (ImportError, ModuleNotFoundError, RuntimeError) as e) and handle/log the exception before falling back to the final model_dir.mkdir; ensure each except binds the exception variable for debugging.
tests/functional_tests/test_groups/models/ernie/test_ernie45_moe_conversion.py (2)
109-112: Debug print statement should be removed before merge.
The loop that prints parameter dtype is debug code and should be removed or guarded behind a debug flag.
🧹 Proposed fix
```diff
 # Create model with random weights and convert to bfloat16
 model = Ernie4_5_MoeForCausalLM(config)
 model = model.bfloat16()

-# Debug: Check model dtype before saving
-for name, param in model.named_parameters():
-    print(f"Before save - {name}: {param.dtype}")
-    break  # Just check the first parameter
-
 # Copy tokenizer files from the local ERNIE VL model (works offline).
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/test_groups/models/ernie/test_ernie45_moe_conversion.py` around lines 109 - 112, Remove the debug print loop that iterates model.named_parameters() and prints the first parameter's dtype (the "for name, param in model.named_parameters(): print(...); break" block) from test_ernie45_moe_conversion.py; either delete those lines or wrap them behind a proper debug/logging flag so tests do not emit debug output (ensure no stray print remains in the test).
243-243: Replace `assert False` with `pytest.fail()` or `raise AssertionError()`.
`assert False` statements are stripped when Python runs with the `-O` flag. Use `pytest.fail()` for proper test failure in pytest context.
🧪 Proposed fix
```diff
 except Exception as e:
-    assert False, f"Failed to load created toy MoE model: {e}"
+    pytest.fail(f"Failed to load created toy MoE model: {e}")
```
Same applies to lines 319-322.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/test_groups/models/ernie/test_ernie45_moe_conversion.py` at line 243, Replace the use of "assert False, f'Failed to load created toy MoE model: {e}'" with a proper pytest failure so failures aren't optimized away; import pytest if missing and call pytest.fail(f"Failed to load created toy MoE model: {e}") (or raise AssertionError(f"...")) in the test function, and make the same replacement for the other occurrences mentioned (the block around the second failure message at lines 319-322) so both test failures are reported reliably under pytest and when Python is run with -O.
src/megatron/bridge/models/ernie_vl/vision_attention.py (1)
164-164: Prefix unused variable with underscore.
The `block_table` variable is unpacked but never used. Prefix it with an underscore to indicate it's intentionally unused.
🧹 Proposed fix
```diff
-query, key, value, rotary_pos_emb, attn_mask_type, block_table = self._adjust_key_value_for_inference(
+query, key, value, rotary_pos_emb, attn_mask_type, _block_table = self._adjust_key_value_for_inference(
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/ernie_vl/vision_attention.py` at line 164, The tuple returned by _adjust_key_value_for_inference is unpacked into query, key, value, rotary_pos_emb, attn_mask_type, block_table but block_table is unused; rename that target to _block_table or simply _ to mark it intentionally unused in the assignment where _adjust_key_value_for_inference(...) is called (e.g., change "..., attn_mask_type, block_table = self._adjust_key_value_for_inference(...)" to "..., attn_mask_type, _block_table = self._adjust_key_value_for_inference(...)" or "..., attn_mask_type, _ = ...") so linters/readers know it is intentionally ignored.
tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_logit_compare.py (3)
523-523: Remove extraneous `f` prefix from string without placeholders.
This f-string has no placeholders, so the `f` prefix is unnecessary.
🧹 Proposed fix
```diff
-print_rank_0(f"=== ERNIE 4.5 VL Logit Comparison ===")
+print_rank_0("=== ERNIE 4.5 VL Logit Comparison ===")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_logit_compare.py` at line 523, Remove the unnecessary f-string prefix on the print in print_rank_0: change the call print_rank_0(f"=== ERNIE 4.5 VL Logit Comparison ===") to use a plain string without the f prefix so it becomes print_rank_0("=== ERNIE 4.5 VL Logit Comparison ==="); no other changes needed.
269-274: Add `strict=True` to `zip()` for safer pairing.
When zipping `top5_tokens` with `top5_vals`, the lengths should always match. Using `strict=True` will catch unexpected mismatches early.
🧹 Proposed fix
```diff
-print_rank_0(f"  HF Top 5: {list(zip(top5_tokens, top5_vals.tolist()))}")
+print_rank_0(f"  HF Top 5: {list(zip(top5_tokens, top5_vals.tolist(), strict=True))}")
```
Same applies to line 452:
```diff
-print_rank_0(f"  Megatron Top 5: {list(zip(top5_tokens, top5_vals.tolist()))}")
+print_rank_0(f"  Megatron Top 5: {list(zip(top5_tokens, top5_vals.tolist(), strict=True))}")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_logit_compare.py` around lines 269 - 274, The zip between top5_tokens and top5_vals may mask length mismatches; update the two occurrences where they are paired (the expression using top5_tokens and top5_vals around the log print and the similar use at the later location referenced) to call zip(top5_tokens, top5_vals, strict=True) so any unexpected length mismatch raises immediately; ensure both occurrences (the one that builds the HF Top 5 log and the similar one at the later location) are changed to include strict=True and run under a Python version that supports this parameter.
91-91: Unused import: `PIL.Image`
The `Image` class from PIL is imported but never used in this file.
🧹 Proposed fix
```diff
 from transformers import AutoProcessor
-from PIL import Image
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_logit_compare.py` at line 91, Remove the unused import "Image" from the top-level import statement (from PIL import Image) in ernie45_vl_logit_compare.py; ensure no other code references the PIL.Image symbol (search for "Image" usages in that module) and run linters/tests to confirm the unused-import warning is resolved.
tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_vit_compare.py (1)
114-122: Consider logging warnings instead of printing them.
The HF ViT loading uses plain `print()` for warnings; using `logging.warning()` would allow better control over verbosity.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_vit_compare.py` around lines 114 - 122, Replace the plain print warnings when loading HF ViT with logging warnings: after calling hf_vit.load_state_dict(...) check the missing and unexpected variables and call logger.warning(...) instead of print(...); ensure you obtain a logger via logging.getLogger(__name__) (and add import logging if missing) so warnings go through the logging framework and respect configured verbosity.
src/megatron/bridge/models/ernie_vl/ernie45_vl_provider.py (1)
27-27: Remove unused import `Optional`.
`Optional` is imported but not used in this file.
🧹 Proposed fix
```diff
-from typing import Any, Callable, List, Optional, Tuple, Union
+from typing import Any, Callable, List, Tuple, Union
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/ernie_vl/ernie45_vl_provider.py` at line 27, Remove the unused typing import Optional from the import list in ernie45_vl_provider.py; update the from typing import ... line (which currently includes Optional) to exclude Optional so only used symbols like Any, Callable, List, Tuple, Union remain, ensuring no other references to Optional exist in functions or classes in this module.
src/megatron/bridge/models/ernie_vl/ernie_moe_layer.py (1)
239-259: Remove unused `text_used_dummy` and `vision_used_dummy` variables.
These variables are assigned but never read, as flagged by static analysis (F841). If they're intended for debugging, consider adding debug logging; otherwise, remove them.
🧹 Proposed fix to remove dead code
```diff
-        text_used_dummy = False
         if text_indices.numel() > 0:
             text_output, _ = self.text_moe_layer(text_hidden, padding_mask=None)
         elif ep_size > 1:
             # Inject a dummy token to participate in EP collectives
             dummy = torch.zeros(1, 1, hidden_size, dtype=flat_hidden.dtype, device=flat_hidden.device)
             text_output, _ = self.text_moe_layer(dummy, padding_mask=None)
-            text_used_dummy = True
         else:
             text_output = None

-        vision_used_dummy = False
         if vision_indices.numel() > 0:
             vision_output, _ = self.vision_moe_layer(vision_hidden, padding_mask=None)
         elif ep_size > 1:
             # Inject a dummy token to participate in EP collectives
             dummy = torch.zeros(1, 1, hidden_size, dtype=flat_hidden.dtype, device=flat_hidden.device)
             vision_output, _ = self.vision_moe_layer(dummy, padding_mask=None)
-            vision_used_dummy = True
         else:
             vision_output = None
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/ernie_vl/ernie_moe_layer.py` around lines 239 - 259, The variables text_used_dummy and vision_used_dummy are assigned but never used (F841); remove their declarations and assignments from the ernie_moe_layer logic that calls text_moe_layer and vision_moe_layer (the branches that create the dummy tensor and call these methods) so the code simply sets text_output/vision_output or calls the dummy path without creating these unused flags; if you actually need that state for debugging, replace the assignments with a debug log referencing text_used_dummy/vision_used_dummy instead of keeping unused variables.
src/megatron/bridge/models/ernie_vl/ernie45_vl_bridge.py (1)
87-87: Remove unused import `Optional`.
Static analysis (F401) flags `Optional` as imported but unused. The file uses `Optional` in type hints but these are all from the `typing` module which is already imported for `Dict` and `Tuple`.
🧹 Proposed fix
```diff
-from typing import Dict, Optional, Tuple
+from typing import Dict, Tuple
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/ernie_vl/ernie45_vl_bridge.py` at line 87, Remove the unused Optional import from the typing import list in ernie45_vl_bridge.py: update the import statement that currently reads "from typing import Dict, Optional, Tuple" to only import the actually used symbols (Dict and Tuple) so Optional is no longer imported; ensure any type hints referencing Optional remain valid (they already rely on typing.Optional via other imports or use of Union) and run linting to confirm F401 is resolved.
src/megatron/bridge/models/ernie_vl/modeling_ernie45_vl.py (1)
526-527: Prefix unused `rope_deltas` with underscore.
Static analysis (RUF059) flags `rope_deltas` as unpacked but never used.
🧹 Proposed fix
```diff
-        position_ids, rope_deltas = self.get_rope_index(
+        position_ids, _rope_deltas = self.get_rope_index(
             input_ids,
             rope_mm_token_type_ids,
             image_grid_thw,
             video_grid_thw,
             attention_mask=None,
         )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/ernie_vl/modeling_ernie45_vl.py` around lines 526 - 527, The unpacking from get_rope_index currently binds an unused variable rope_deltas; change the unpack to position_ids, _rope_deltas = self.get_rope_index(input_ids, ...) (or simply position_ids, _ = self.get_rope_index(...)) so the unused value is prefixed with an underscore to satisfy the RUF059 linter; update the unpack site near the call to self.get_rope_index in modeling_ernie45_vl.py.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/megatron/bridge/models/ernie_vl/ernie45_vl_bridge.py`:
- Around line 207-278: The class-level _export_buffer on _ConcatBiasMapping can
retain stale text biases between conversion runs; add a defensive classmethod
like clear_export_buffer(cls) that empties _export_buffer and call it at the
start of the export workflow (before any megatron_to_hf calls) so each
conversion run starts with a fresh buffer; update _ConcatBiasMapping to provide
clear_export_buffer and ensure the export orchestration invokes
_ConcatBiasMapping.clear_export_buffer() before exporting so megatron_to_hf (and
its pop usage) cannot leak state across runs.
In `@src/megatron/bridge/models/ernie_vl/vision_model.py`:
- Around line 117-131: The code uses torch.cuda.current_device() when creating
self.inv_freq which can fail before CUDA is initialized; update the
implementation to derive the device from the input tensor (or a passed-in
device) instead of calling torch.cuda.current_device(). Locate where inv_freq is
created (the block that registers "inv_freq" and the subsequent seq =
torch.arange(...)) and replace the device argument with something like device =
input.device (or a provided device parameter) and use that device for
torch.arange and torch.arange(..., device=device). Ensure
self.register_buffer("inv_freq", inv_freq, persistent=False) still stores the
buffer and that subsequent uses of self.inv_freq.device remain valid.
In `@tests/functional_tests/launch_scripts/active/L1_Launch_models_ernie_vl.sh`:
- Around line 1-16: Move the shebang (#!/bin/bash) to be the first line of the
script so the kernel uses the correct interpreter; specifically, place the
existing shebang above the copyright header and ensure the existing "set -xeuo
pipefail" and all other header comments follow it unchanged in the script
"L1_Launch_models_ernie_vl.sh".
In `@tests/functional_tests/launch_scripts/active/L1_Launch_models_ernie.sh`:
- Line 16: The CI workflow is missing the new L1 launcher
L1_Launch_models_ernie, so update .github/workflows/cicd-main.yml to include
this launcher in the job matrix used for L1 test launchers; locate the matrix
entry (e.g., matrix.launcher or matrix.include block used by the job that runs
launch scripts) and add an entry for "L1_Launch_models_ernie" referencing the
launcher name and optionally the test path
tests/functional_tests/test_groups/models/ernie/test_ernie45_moe_conversion.py
so the new script is executed in the CI matrix.
In `@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_vit_debug.py`:
- Around line 186-187: The variable mg_sin is computed but never used; either
remove the mg_sin calculation or use it by comparing to hf_sin like you do for
mg_cos/hf_cos: update the mg_sin line in the block where mg_emb is processed
(mg_cos, mg_sin, mg_emb) to either delete the mg_sin = mg_emb.sin().flatten()
line or replace it with a comparison/assertion against hf_sin (mirroring the
existing mg_cos vs hf_cos check) so mg_sin is no longer an unused variable.
- Around line 311-316: The snippet computing cos_sim_ln using
F.cosine_similarity against a placeholder torch.zeros(1) is dead/misleading and
unused; remove the entire cos_sim_ln calculation (the F.cosine_similarity call
and the placeholder torch.zeros(1) argument) and any references to cos_sim_ln,
or if you intend a real check, replace the placeholder with a proper tensor
extracted for comparison (e.g., the actual LN output) and use hf_normed and
F.cosine_similarity accordingly; locate the code around the cos_sim_ln variable
and hf_normed to apply the change.
---
Nitpick comments:
In `@src/megatron/bridge/models/ernie_vl/__init__.py`:
- Around line 22-29: The __all__ list is unsorted; reorder the exported names
alphabetically so linters (Ruff/isort) are satisfied — update the __all__ tuple
containing "Ernie45VLModel", "Ernie45VLBridge", "Ernie45VLModelProvider",
"ErnieMultiTypeMoE", "MultiTypeMoeSubmodules", and
"get_ernie45_vl_decoder_block_spec" into alphabetical order (e.g.,
"Ernie45VLBridge", "Ernie45VLModel", "Ernie45VLModelProvider",
"ErnieMultiTypeMoE", "MultiTypeMoeSubmodules",
"get_ernie45_vl_decoder_block_spec").
In `@src/megatron/bridge/models/ernie_vl/ernie_decoder_layer_spec.py`:
- Around line 182-184: The function _get_ernie_decoder_layer_spec unpacks
ColumnParallel from _get_linear_modules() but never uses it; update the
unpacking to replace ColumnParallel with an underscore (e.g., _, RowParallel,
DotProductAttention, LayerNormColumnParallel, Norm = _get_linear_modules()) so
the unused value is explicitly ignored and the intent is clear, making sure to
only change the unpack target and not other logic in
_get_ernie_decoder_layer_spec.
In `@src/megatron/bridge/models/ernie_vl/ernie_moe_layer.py`:
- Around line 239-259: The variables text_used_dummy and vision_used_dummy are
assigned but never used (F841); remove their declarations and assignments from
the ernie_moe_layer logic that calls text_moe_layer and vision_moe_layer (the
branches that create the dummy tensor and call these methods) so the code simply
sets text_output/vision_output or calls the dummy path without creating these
unused flags; if you actually need that state for debugging, replace the
assignments with a debug log referencing text_used_dummy/vision_used_dummy
instead of keeping unused variables.
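The dummy-token pattern those branches implement is worth keeping even after the unused flags are removed; a toy sketch (plain Python, with a callback standing in for the collective-bearing MoE call) shows why every expert-parallel rank must still invoke the layer:

```python
def moe_forward(hidden, moe_layer, ep_size):
    """Hypothetical stand-in for one modality branch of the dual-pool layer."""
    if len(hidden) > 0:
        return moe_layer(hidden)
    if ep_size > 1:
        # No real tokens on this rank, but the MoE layer's EP collectives
        # require all ranks to participate: feed one dummy token and discard.
        moe_layer([0.0])
        return None
    return None

calls = []
moe = lambda xs: (calls.append(len(xs)), [x * 2 for x in xs])[1]

print(moe_forward([1.0, 2.0], moe, ep_size=2))  # → [2.0, 4.0]
print(moe_forward([], moe, ep_size=2))          # → None (moe still called)
print(calls)                                    # → [2, 1]
```

The second call returns nothing useful, yet `calls` records that the layer ran — which is exactly what keeps the real EP all-to-all from deadlocking when one rank has zero tokens of a modality.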
In `@src/megatron/bridge/models/ernie_vl/ernie45_vl_bridge.py`:
- Line 87: Remove the unused Optional import from the typing import list in
ernie45_vl_bridge.py: update the import statement that currently reads "from
typing import Dict, Optional, Tuple" to only import the actually used symbols
(Dict and Tuple) so Optional is no longer imported; ensure any type hints
referencing Optional remain valid (they already rely on typing.Optional via
other imports or use of Union) and run linting to confirm F401 is resolved.
In `@src/megatron/bridge/models/ernie_vl/ernie45_vl_provider.py`:
- Line 27: Remove the unused typing import Optional from the import list in
ernie45_vl_provider.py; update the from typing import ... line (which currently
includes Optional) to exclude Optional so only used symbols like Any, Callable,
List, Tuple, Union remain, ensuring no other references to Optional exist in
functions or classes in this module.
In `@src/megatron/bridge/models/ernie_vl/modeling_ernie45_vl.py`:
- Around line 526-527: The unpacking from get_rope_index currently binds an
unused variable rope_deltas; change the unpack to position_ids, _rope_deltas =
self.get_rope_index(input_ids, ...) (or simply position_ids, _ =
self.get_rope_index(...)) so the unused value is prefixed with an underscore to
satisfy the RUF059 linter; update the unpack site near the call to
self.get_rope_index in modeling_ernie45_vl.py.
In `@src/megatron/bridge/models/ernie_vl/vision_attention.py`:
- Line 164: The tuple returned by _adjust_key_value_for_inference is unpacked
into query, key, value, rotary_pos_emb, attn_mask_type, block_table but
block_table is unused; rename that target to _block_table or simply _ to mark it
intentionally unused in the assignment where
_adjust_key_value_for_inference(...) is called (e.g., change "...,
attn_mask_type, block_table = self._adjust_key_value_for_inference(...)" to
"..., attn_mask_type, _block_table = self._adjust_key_value_for_inference(...)"
or "..., attn_mask_type, _ = ...") so linters/readers know it is intentionally
ignored.
In `@src/megatron/bridge/models/ernie/ernie_45_provider.py`:
- Around line 23-24: Replace the legacy typing.Optional import and usages with
the modern union syntax: remove Optional from the import list (leaving dataclass
and Callable) and change any type hints in this module that use Optional[...] to
the X | None form (e.g., foo: Optional[Bar] -> foo: Bar | None); update the
annotation at the location referenced (line with current Optional usage) to use
the new union syntax.
In `@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_fwd_bwd.py`:
- Around line 415-422: The loop over model parameters uses an unused variable
`name`; change the tuple unpacking in the loop that iterates `for name, param in
m.named_parameters():` to prefix the unused variable (e.g., `for _, param in
m.named_parameters():`) so only `param` is referenced, leaving the rest of the
logic (`param.requires_grad`, `param.grad`, and the increments `total_params`,
`params_with_grad`, `params_with_nonzero_grad`) unchanged.
- Line 83: The print_rank_0 call uses an unnecessary f-string for a literal with
no placeholders; in the call to print_rank_0 (string "=== ERNIE 4.5 VL
Forward/Backward Test ===") remove the leading f so it becomes a plain string
literal, i.e., update the argument passed to print_rank_0 accordingly.
- Line 69: The parameter annotation uses implicit Optional by assigning prompt:
str = None; update the type to explicitly allow None (e.g., use Optional[str] or
str | None) for the prompt parameter and likewise for the _run() signature
mentioned at line 467; import typing.Optional if using Optional[] and ensure
both function signatures (the one containing prompt and the _run method) have
explicit union/Optional annotations instead of bare "= None".
- Around line 36-37: Remove the unused module imports 'sys' and 'traceback' from
the top-level import list in the file (they are not referenced anywhere); update
the import block to only include the modules actually used (e.g., keep other
imports like when importing pytest or model helpers) so no unused symbols remain
and linters stop flagging these imports.
In
`@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_logit_compare.py`:
- Line 523: Remove the unnecessary f-string prefix on the print in print_rank_0:
change the call print_rank_0(f"=== ERNIE 4.5 VL Logit Comparison ===") to use a
plain string without the f prefix so it becomes print_rank_0("=== ERNIE 4.5 VL
Logit Comparison ==="); no other changes needed.
- Around line 269-274: The zip between top5_tokens and top5_vals may mask length
mismatches; update the two occurrences where they are paired (the expression
using top5_tokens and top5_vals around the log print and the similar use at the
later location referenced) to call zip(top5_tokens, top5_vals, strict=True) so
any unexpected length mismatch raises immediately; ensure both occurrences (the
one that builds the HF Top 5 log and the similar one at the later location) are
changed to include strict=True and run under a Python version that supports this
parameter.
- Line 91: Remove the unused import "Image" from the top-level import statement
(from PIL import Image) in ernie45_vl_logit_compare.py; ensure no other code
references the PIL.Image symbol (search for "Image" usages in that module) and
run linters/tests to confirm the unused-import warning is resolved.
In
`@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_vit_compare.py`:
- Around line 114-122: Replace the plain print warnings when loading HF ViT with
logging warnings: after calling hf_vit.load_state_dict(...) check the missing
and unexpected variables and call logger.warning(...) instead of print(...);
ensure you obtain a logger via logging.getLogger(__name__) (and add import
logging if missing) so warnings go through the logging framework and respect
configured verbosity.
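A sketch of the suggested logging pattern (the helper name is hypothetical; it assumes the torch convention where a non-strict `load_state_dict` returns `(missing_keys, unexpected_keys)`):

```python
import logging

logger = logging.getLogger(__name__)

def load_checked(module, state_dict):
    # torch-style non-strict load: returns (missing_keys, unexpected_keys).
    missing, unexpected = module.load_state_dict(state_dict, strict=False)
    # Route mismatches through the logging framework instead of print(),
    # so they respect the configured verbosity.
    if missing:
        logger.warning("Missing keys when loading HF ViT: %s", missing)
    if unexpected:
        logger.warning("Unexpected keys when loading HF ViT: %s", unexpected)
    return missing, unexpected
```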
In `@tests/functional_tests/test_groups/models/ernie_vl/hf_loss_check.py`:
- Around line 107-108: The print line that zips decoded_predictions,
topk_ids.tolist(), and topk_vals.tolist() should use zip(..., strict=True) to
enforce equal lengths and catch mismatches; update the zip call in the code that
prints predictions (the line referencing decoded_predictions, topk_ids,
topk_vals) to zip(decoded_predictions, topk_ids.tolist(), topk_vals.tolist(),
strict=True).
In
`@tests/functional_tests/test_groups/models/ernie_vl/test_ernie45_vl_conversion.py`:
- Around line 317-319: Replace the brittle assert False usage in the failure
branches with pytest.fail(...) so failures are reported reliably; specifically
in test_ernie45_vl_conversion.py replace the `assert False, (f"ERNIE 4.5 VL
{test_name} conversion failed with return code {result.returncode}")` at the
block that checks result.returncode and the same pattern around lines 427-429
with `pytest.fail` called with the same formatted message (ensure pytest is
imported if not present) so test_name and result.returncode are included in the
failure message.
- Around line 150-181: Change the broad except Exception blocks in the tokenizer
fallback to catch specific errors: for the AutoTokenizer.from_pretrained block
(around _local_tokenizer_path / AutoTokenizer.from_pretrained) catch
loading-related errors such as OSError, ValueError (e.g., except (OSError,
ValueError) as e) so only model I/O/format issues trigger the fallback; for the
inner fallback that imports transformers/tokenizers and builds
PreTrainedTokenizerFast (around the imports, TkTokenizer, tk.add_tokens,
PreTrainedTokenizerFast, fast_tokenizer.save_pretrained) catch import/runtime
issues such as ImportError, ModuleNotFoundError, RuntimeError (e.g., except
(ImportError, ModuleNotFoundError, RuntimeError) as e) and handle/log the
exception before falling back to the final model_dir.mkdir; ensure each except
binds the exception variable for debugging.
In
`@tests/functional_tests/test_groups/models/ernie/test_ernie45_moe_conversion.py`:
- Around line 109-112: Remove the debug print loop that iterates
model.named_parameters() and prints the first parameter's dtype (the "for name,
param in model.named_parameters(): print(...); break" block) from
test_ernie45_moe_conversion.py; either delete those lines or wrap them behind a
proper debug/logging flag so tests do not emit debug output (ensure no stray
print remains in the test).
- Line 243: Replace the use of "assert False, f'Failed to load created toy MoE
model: {e}'" with a proper pytest failure so failures aren't optimized away;
import pytest if missing and call pytest.fail(f"Failed to load created toy MoE
model: {e}") (or raise AssertionError(f"...")) in the test function, and make
the same replacement for the other occurrences mentioned (the block around the
second failure message at lines 319-322) so both test failures are reported
reliably under pytest and when Python is run with -O.
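The reason `assert False, msg` is brittle: assert statements are compiled out when Python runs with `-O`, so the failure branch silently passes. A failure helper that survives `-O` might look like this (illustrative sketch, not the test file's actual code; `pytest.fail(...)` achieves the same inside a test):

```python
def fail_conversion(test_name: str, returncode: int) -> None:
    # Raising AssertionError directly (or calling pytest.fail) cannot be
    # optimized away, unlike a bare `assert False, ...`.
    raise AssertionError(
        f"ERNIE 4.5 VL {test_name} conversion failed with return code {returncode}"
    )
```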
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4b5d32c6-ba49-4551-abc3-4d908e9209e4
📒 Files selected for processing (28)
examples/conversion/hf_megatron_roundtrip_multi_gpu.py
src/megatron/bridge/models/__init__.py
src/megatron/bridge/models/conversion/model_bridge.py
src/megatron/bridge/models/conversion/param_mapping.py
src/megatron/bridge/models/ernie/__init__.py
src/megatron/bridge/models/ernie/ernie_45_bridge.py
src/megatron/bridge/models/ernie/ernie_45_provider.py
src/megatron/bridge/models/ernie_vl/__init__.py
src/megatron/bridge/models/ernie_vl/ernie45_vl_bridge.py
src/megatron/bridge/models/ernie_vl/ernie45_vl_provider.py
src/megatron/bridge/models/ernie_vl/ernie_decoder_layer_spec.py
src/megatron/bridge/models/ernie_vl/ernie_moe_layer.py
src/megatron/bridge/models/ernie_vl/modeling_ernie45_vl.py
src/megatron/bridge/models/ernie_vl/vision_attention.py
src/megatron/bridge/models/ernie_vl/vision_layer_spec.py
src/megatron/bridge/models/ernie_vl/vision_model.py
src/megatron/bridge/models/ernie_vl/vision_transformer_config.py
tests/functional_tests/launch_scripts/active/L1_Launch_models_ernie.sh
tests/functional_tests/launch_scripts/active/L1_Launch_models_ernie_vl.sh
tests/functional_tests/test_groups/models/ernie/__init__.py
tests/functional_tests/test_groups/models/ernie/test_ernie45_moe_conversion.py
tests/functional_tests/test_groups/models/ernie_vl/__init__.py
tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_fwd_bwd.py
tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_logit_compare.py
tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_vit_compare.py
tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_vit_debug.py
tests/functional_tests/test_groups/models/ernie_vl/hf_loss_check.py
tests/functional_tests/test_groups/models/ernie_vl/test_ernie45_vl_conversion.py
```python
class _ConcatBiasMapping(AutoMapping):
    """Mapping for the concatenated text+vision expert bias tensor.

    The on-disk HF format stores a single ``moe_statics.e_score_correction_bias``
    tensor of shape ``[2, num_experts]`` where row 0 is the text pool bias and
    row 1 is the vision pool bias. This mapping extracts the appropriate row
    based on ``slice_name``.

    For export (megatron_to_hf), the text mapping buffers its bias in a
    class-level dict keyed by resolved HF param name. The vision mapping
    retrieves the buffered text bias, stacks them into ``[2, N]``, and
    returns the merged tensor. This ensures only one entry per HF key.
    """

    # Class-level buffer: {resolved_hf_param: text_bias_tensor}
    _export_buffer: Dict[str, torch.Tensor] = {}

    def __init__(self, megatron_param: str, hf_param: str, slice_name: str, num_experts: int):
        super().__init__(megatron_param=megatron_param, hf_param=hf_param)
        self._slice_name = slice_name  # "text" or "vision"
        self._num_experts = num_experts
        self.allow_hf_name_mismatch = True

    def resolve(self, captures: Tuple[str, ...]):
        resolved_megatron_param, resolved_hf_param = self._resolve_names(captures)
        result = _ConcatBiasMapping(
            megatron_param=resolved_megatron_param,
            hf_param=resolved_hf_param,
            slice_name=self._slice_name,
            num_experts=self._num_experts,
        )
        return result

    def hf_to_megatron(self, hf_weights, megatron_module):
        """Extract the text or vision slice from the concatenated bias.

        On-disk shape is [2, num_experts]: row 0 = text, row 1 = vision.
        """
        if self._slice_name == "text":
            sliced = hf_weights[0]
        else:
            sliced = hf_weights[1]
        return super().hf_to_megatron(sliced, megatron_module)

    def megatron_to_hf(self, megatron_weights, megatron_module):
        """Export text+vision expert bias as concatenated [2, N] tensor.

        The text mapping buffers its bias; the vision mapping retrieves it
        and stacks into [2, N]. If the text bias is not yet buffered
        (shouldn't happen in practice), falls back to exporting as-is.
        """
        result = super().megatron_to_hf(megatron_weights, megatron_module)
        if result is None:
            return result

        hf_key = str(self.hf_param)

        if self._slice_name == "text":
            # Buffer the text bias, return empty dict (don't emit yet)
            for _, tensor in result.items():
                _ConcatBiasMapping._export_buffer[hf_key] = tensor
            return {}
        else:
            # Vision: retrieve buffered text bias and merge
            text_bias = _ConcatBiasMapping._export_buffer.pop(hf_key, None)
            if text_bias is not None:
                for _, vision_bias in result.items():
                    merged = torch.stack([text_bias, vision_bias], dim=0)
                    return {hf_key: merged}
            # Fallback: no buffered text bias, just return vision as-is
            return result
```
Class-level _export_buffer may accumulate stale entries across conversions.
The _export_buffer dict is a class attribute (shared across all instances) used to coordinate text/vision bias export. Issues:
- Entries are only removed via pop() when the vision mapping retrieves them (line 271)
- If a text-only export runs or an export fails partway, text biases remain in the buffer
- Subsequent conversions in the same process may see stale data
Consider clearing the buffer at the start of a conversion run, or using instance-level state tied to the conversion context.
🛡️ Suggested defensive approach

```diff
 class _ConcatBiasMapping(AutoMapping):
     ...
-    # Class-level buffer: {resolved_hf_param: text_bias_tensor}
-    _export_buffer: Dict[str, torch.Tensor] = {}
+    # Class-level buffer: {resolved_hf_param: text_bias_tensor}
+    # Note: Should be cleared at the start of each conversion run to avoid stale state.
+    _export_buffer: Dict[str, torch.Tensor] = {}
+
+    @classmethod
+    def clear_export_buffer(cls):
+        """Clear the export buffer. Call before starting a new conversion."""
+        cls._export_buffer.clear()
```

Then call _ConcatBiasMapping.clear_export_buffer() at the start of the conversion process.
🧰 Tools
🪛 Ruff (0.15.9)
[warning] 222-222: Mutable default value for class attribute
(RUF012)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/models/ernie_vl/ernie45_vl_bridge.py` around lines 207 -
278, The class-level _export_buffer on _ConcatBiasMapping can retain stale text
biases between conversion runs; add a defensive classmethod like
clear_export_buffer(cls) that empties _export_buffer and call it at the start of
the export workflow (before any megatron_to_hf calls) so each conversion run
starts with a fresh buffer; update _ConcatBiasMapping to provide
clear_export_buffer and ensure the export orchestration invokes
_ConcatBiasMapping.clear_export_buffer() before exporting so megatron_to_hf (and
its pop usage) cannot leak state across runs.
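A minimal, torch-free sketch of the text/vision buffering handshake plus the clear step described above (class and method names here are illustrative; tensor stacking is stubbed out with plain lists):

```python
class BufferedBiasMapping:
    """Toy model of the dual-slice export coordination."""

    # Class-level buffer shared by the text and vision instances.
    _export_buffer: dict = {}

    @classmethod
    def clear_export_buffer(cls) -> None:
        """Call at the start of each conversion run to drop stale entries."""
        cls._export_buffer.clear()

    def __init__(self, key: str, slice_name: str):
        self.key = key
        self.slice_name = slice_name  # "text" or "vision"

    def export(self, bias):
        if self.slice_name == "text":
            # Park the text bias; emit nothing yet.
            BufferedBiasMapping._export_buffer[self.key] = bias
            return {}
        text_bias = BufferedBiasMapping._export_buffer.pop(self.key, None)
        if text_bias is not None:
            # Real mapping uses torch.stack([text, vision], dim=0) -> [2, N].
            return {self.key: [text_bias, bias]}
        return {self.key: bias}  # fallback: no buffered text bias
```

If a text-only export stops after `export`, the entry stays in `_export_buffer`; calling `clear_export_buffer()` before the next run removes the stale state.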
```python
        if not hasattr(self, "inv_freq"):
            inv_freq = 1.0 / (
                self.theta
                ** (
                    torch.arange(
                        0, self.dim, 2, dtype=torch.float,
                        device=torch.cuda.current_device(),
                    )
                    / self.dim
                )
            )
            self.register_buffer("inv_freq", inv_freq, persistent=False)
        seq = torch.arange(seqlen, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
        freqs = torch.outer(seq, self.inv_freq)
        return freqs
```
Potential issue: torch.cuda.current_device() before CUDA context.
Using torch.cuda.current_device() in forward() assumes CUDA is already initialized and a device is set. If forward() is called before the model is moved to CUDA (e.g., during CPU-based weight loading), this could fail. Consider deriving the device from the input tensor instead.
🛡️ Proposed fix

```diff
 def forward(self, seqlen: int) -> torch.Tensor:
     """Compute frequency table for positions 0..seqlen-1.
     ...
     """
     if not hasattr(self, "inv_freq"):
+        # Use the device of the module's parameters if available, fallback to CUDA
+        device = next(self.parameters(), torch.empty(0)).device
+        if device.type == "meta":
+            device = torch.cuda.current_device()
         inv_freq = 1.0 / (
             self.theta
             ** (
                 torch.arange(
                     0, self.dim, 2, dtype=torch.float,
-                    device=torch.cuda.current_device(),
+                    device=device,
                 )
                 / self.dim
             )
         )
         self.register_buffer("inv_freq", inv_freq, persistent=False)
```

Note: Since ErnieVisionRotaryEmbedding has no parameters, an alternative would be to accept device as an argument to forward().
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/models/ernie_vl/vision_model.py` around lines 117 - 131,
The code uses torch.cuda.current_device() when creating self.inv_freq which can
fail before CUDA is initialized; update the implementation to derive the device
from the input tensor (or a passed-in device) instead of calling
torch.cuda.current_device(). Locate where inv_freq is created (the block that
registers "inv_freq" and the subsequent seq = torch.arange(...)) and replace the
device argument with something like device = input.device (or a provided device
parameter) and use that device for torch.arange and torch.arange(...,
device=device). Ensure self.register_buffer("inv_freq", inv_freq,
persistent=False) still stores the buffer and that subsequent uses of
self.inv_freq.device remain valid.
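A CPU-safe variant of the idea, sketched as a standalone module (the signature differs from the real ErnieVisionRotaryEmbedding: here the device is an explicit `forward` argument defaulting to CPU, which is one of the alternatives the note above mentions):

```python
import torch

class LazyRotaryFreqs(torch.nn.Module):
    """Lazily builds inv_freq without assuming a CUDA context exists."""

    def __init__(self, dim: int, theta: float = 10000.0):
        super().__init__()
        self.dim = dim
        self.theta = theta

    def forward(self, seqlen: int, device=None) -> torch.Tensor:
        if not hasattr(self, "inv_freq"):
            # Derive the device from the caller instead of
            # torch.cuda.current_device(), so CPU-side weight loading works.
            device = device if device is not None else torch.device("cpu")
            inv_freq = 1.0 / (
                self.theta
                ** (torch.arange(0, self.dim, 2, dtype=torch.float, device=device) / self.dim)
            )
            self.register_buffer("inv_freq", inv_freq, persistent=False)
        seq = torch.arange(seqlen, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
        return torch.outer(seq, self.inv_freq)  # shape [seqlen, dim // 2]
```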
```sh
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#!/bin/bash
set -xeuo pipefail  # Exit immediately if a command exits with a non-zero status
```
Move shebang to line 1 - shell scripts require it as the first line.
The shebang #!/bin/bash is on line 15 after the copyright header. For shell scripts to execute properly, the shebang must be the very first line. The copyright header should follow the shebang.
Suggested fix

```diff
+#!/bin/bash
 # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 ...
 # limitations under the License.
-#!/bin/bash
 set -xeuo pipefail  # Exit immediately if a command exits with a non-zero status
```
🧰 Tools
🪛 Shellcheck (0.11.0)
[error] 15-15: The shebang must be on the first line. Delete blanks and move comments.
(SC1128)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional_tests/launch_scripts/active/L1_Launch_models_ernie_vl.sh`
around lines 1 - 16, Move the shebang (#!/bin/bash) to be the first line of the
script so the kernel uses the correct interpreter; specifically, place the
existing shebang above the copyright header and ensure the existing "set -xeuo
pipefail" and all other header comments follow it unchanged in the script
"L1_Launch_models_ernie_vl.sh".
```sh
  -v -s -x \
  -m "not pleasefixme" \
  --tb=short -rA \
  tests/functional_tests/test_groups/models/ernie/test_ernie45_moe_conversion.py
```
🧩 Analysis chain
🏁 Script executed:

```sh
#!/bin/bash
# Check if cicd-main.yml includes the new ERNIE launcher
rg -n "L1_Launch_models_ernie" .github/workflows/cicd-main.yml || echo "WARNING: Launcher not found in CI workflow"
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 113
Add this launcher to the CI workflow job matrix.
Per repository guidelines, new L1 test launcher scripts must be added to .github/workflows/cicd-main.yml. The launcher L1_Launch_models_ernie is not currently included in the workflow.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional_tests/launch_scripts/active/L1_Launch_models_ernie.sh` at
line 16, The CI workflow is missing the new L1 launcher L1_Launch_models_ernie,
so update .github/workflows/cicd-main.yml to include this launcher in the job
matrix used for L1 test launchers; locate the matrix entry (e.g.,
matrix.launcher or matrix.include block used by the job that runs launch
scripts) and add an entry for "L1_Launch_models_ernie" referencing the launcher
name and optionally the test path
tests/functional_tests/test_groups/models/ernie/test_ernie45_moe_conversion.py
so the new script is executed in the CI matrix.
```python
    mg_cos = mg_emb.cos().flatten()
    mg_sin = mg_emb.sin().flatten()
```
Remove or use unused variable mg_sin.
mg_sin is computed but never used. Either remove it or add a comparison with hf_sin for completeness.
Suggested fix (remove unused variable)

```diff
     mg_cos = mg_emb.cos().flatten()
-    mg_sin = mg_emb.sin().flatten()
```
🧰 Tools
🪛 Flake8 (7.3.0)
[error] 187-187: local variable 'mg_sin' is assigned to but never used
(F841)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_vit_debug.py`
around lines 186 - 187, The variable mg_sin is computed but never used; either
remove the mg_sin calculation or use it by comparing to hf_sin like you do for
mg_cos/hf_cos: update the mg_sin line in the block where mg_emb is processed
(mg_cos, mg_sin, mg_emb) to either delete the mg_sin = mg_emb.sin().flatten()
line or replace it with a comparison/assertion against hf_sin (mirroring the
existing mg_cos vs hf_cos check) so mg_sin is no longer an unused variable.
```python
    cos_sim_ln = F.cosine_similarity(
        hf_normed.float().flatten().unsqueeze(0),
        # We can't easily extract just the LN output from the fused module
        # so just compare QKV output
        torch.zeros(1).unsqueeze(0),  # placeholder
    ).item()
```
Remove incomplete/dead code for cos_sim_ln.
This code computes cosine similarity against a placeholder torch.zeros(1), which produces a meaningless result, and the variable is never used. The comment even acknowledges the limitation. Consider removing this entirely or implementing a proper comparison.
Suggested fix

```diff
-    cos_sim_ln = F.cosine_similarity(
-        hf_normed.float().flatten().unsqueeze(0),
-        # We can't easily extract just the LN output from the fused module
-        # so just compare QKV output
-        torch.zeros(1).unsqueeze(0),  # placeholder
-    ).item()
-
     cos_sim_qkv = F.cosine_similarity(
```
🧰 Tools
🪛 Flake8 (7.3.0)
[error] 311-311: local variable 'cos_sim_ln' is assigned to but never used
(F841)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_vit_debug.py`
around lines 311 - 316, The snippet computing cos_sim_ln using
F.cosine_similarity against a placeholder torch.zeros(1) is dead/misleading and
unused; remove the entire cos_sim_ln calculation (the F.cosine_similarity call
and the placeholder torch.zeros(1) argument) and any references to cos_sim_ln,
or if you intend a real check, replace the placeholder with a proper tensor
extracted for comparison (e.g., the actual LN output) and use hf_normed and
F.cosine_similarity accordingly; locate the code around the cos_sim_ln variable
and hf_normed to apply the change.
3c27fef to 590bb94
cuichenx left a comment
Thanks for the contribution! Some comments
```python
    hf_expert_number = int(hf_expert_match.group(1))
    pool_offset = hf_expert_number - local_expert_number
else:
    pool_offset = 0
```
can we create a custom mapping class for dual-pool MoE, instead of modifying param_mapping.py
```python
for task in self._with_progress_tracking(hf_to_megatron_tasks, description):
    # None task means no mapping exists for this param (e.g. MTP layers without bridge mappings)
    if task is None:
        continue
```
if a task is None, this usually mean a weight has no corresponding mapping, so a mapping should be added. Is the case different here?
```python
# MoE runtime settings
# NOTE: moe_grouped_gemm=False uses SequentialMLP (per-expert forward);
# True uses TEGroupedMLP which can produce NaN with certain TE versions.
```
TEGroupedMLP can produce NaN with certain TE versions.
Can you reproduce this issue with other models? TEGroupedMLP used across most models in megatron bridge so it is heavily tested. Please file an issue if you can reproduce this error with an existing model and we'd love to take a look
```python
@dataclass
class Ernie45ModelProvider(GPTModelProvider):
```
we can remove provider now, just implement overrides in the provider_bridge function
Thanks @cuichenx for the thorough review! We'll address all your comments in a follow-up commit:
ffc045c to 8da82d0
Hi @cuichenx, all comments have been addressed in the latest commit. Here's a summary:

Ready for re-review when you have time. Thanks!

/ok to test 7408a09

Hi @bo-ke,

Hi @huvunvidia,
…n-Core

Add Megatron-Bridge support for ERNIE 4.5 models:

Text-only MoE bridge (Ernie4_5_MoeForCausalLM / ERNIE-4.5-21B-A3B-PT):
- Single-pool MoE: 64 experts, top-6 routing, shared experts
- Mixed dense/MoE layers via moe_layer_freq (layer 0 dense, rest MoE)
- Custom decoder block spec respecting moe_layer_freq
- Expert bias for aux-free load balancing
- Sigmoid routing with expert bias for load balancing
- Interleaved RoPE (base=500000), RMSNorm, SiLU-gated MLP
- Full roundtrip conversion tests (TP/PP parallelism)

VL bridge (Ernie4_5_VLForConditionalGeneration / ERNIE-4.5-VL-28B-A3B):
- Dual-pool MoE with separate vision/text expert pools
- Custom ErnieVisionModel with pooled attention
- ErnieMoELayer with expert bias and sigmoid routing
- Custom decoder layer spec for mixed dense/MoE architectures
- Forward/backward pass and logit comparison test scripts
- Full roundtrip conversion tests (TP/PP parallelism)

Shared infrastructure improvements:
- SequentialMLP EP support in model_bridge.py
- gate.weight precision handling in roundtrip script
- is_expert pattern and EP=1 fast path in param_mapping.py
- PP-safe mapping classes for dense/MoE mixed architectures

Verified: Text cosine similarity = 0.999581, VL cosine similarity = 0.999563.

Signed-off-by: bo-ke <kebo01@baidu.com>
46e86b9 to 6dccad0
Hi @huvunvidia, I've pushed fixes addressing the CI failures and merge conflicts. Verified locally that the affected

Could you trigger CI again?

/ok to test 6dccad0
/ok to test bda0f89
…ridges

- Remove Ernie45ModelProvider class, fold config into provider_bridge()
- Fix NaN issue: enable moe_grouped_gemm + moe_permute_fusion (aligned with DeepSeek V3)
- Remove gradient_accumulation_fusion override, use auto-detected default
- Move pool_offset logic from base param_mapping.py to _DualPoolExpertMixin in ernie45_vl_bridge.py
- Move VL modeling files into modeling_ernie45_vl/ subpackage
- Move verification scripts from tests/functional_tests/ to examples/models/vlm/ernie_vl/
- Improve task=None comment in model_bridge.py to explain MTP layers scenario
- Clean up is_expert docstring in param_mapping.py (remove ERNIE-specific references)

Signed-off-by: kebo01 <kebo01@baidu.com>
…issues

Add copyright header to ernie45_vl_vit_debug.py, add missing docstrings for public functions/classes (D103/D101), fix import sorting (I001), remove unused import (F401), and apply ruff format to all ERNIE files.

Signed-off-by: kebo01 <kebo01@baidu.com>
Fix Expert Parallelism (EP>1) conversion bug for ERNIE 4.5 VL dual-pool MoE vision experts. The previous _DualPoolExpertMixin had a wrong offset formula, and _DualPoolAutoMapping was dead code due to AutoMapping's internal delegation bypassing the mixin override.

Simplify from 5 classes to 2 classes + 2 helper functions:
- _OffsetGatedMLPMapping: for vision gate/up_proj (GatedMLP)
- _OffsetRowParallelMapping: for vision down_proj (replaces AutoMapping to avoid delegation bypass)
- _offset_gather_from_ep_ranks(): shared EP gather with pool offset
- _resolve_with_offset(): shared wildcard resolution with HF-side shift

Also fix test tokenizer fallback to use publicly available Qwen/Qwen3-0.6B.

Signed-off-by: kebo01 <kebo01@baidu.com>
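The HF-side index shift behind the dual-pool layout can be pictured with a small helper (the function name and the text-pool size of 64 are illustrative; the actual offset logic lives in the _resolve_with_offset / _offset_gather_from_ep_ranks helpers mentioned above):

```python
def hf_expert_index(local_expert: int, pool: str, num_text_experts: int) -> int:
    """Map a pool-local Megatron expert index to the flat HF expert index.

    The HF checkpoint stores text experts 0..T-1 followed by vision experts
    T..T+V-1 in one flat list, while each Megatron pool numbers its experts
    locally from 0 - so vision experts need a +T offset on export.
    """
    if pool == "text":
        return local_expert
    return num_text_experts + local_expert

# With, say, 64 text experts, vision expert 3 lands at flat HF index 67.
print(hf_expert_index(3, "text", 64), hf_expert_index(3, "vision", 64))  # 3 67
```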
Update test_transpose_non_rank_zero_hf_to_megatron to expect permuted tensors on non-rank-0, matching the behavior change where permute_dims is applied on all ranks for correct ReplicatedMapping shape handling. Signed-off-by: kebo01 <kebo01@baidu.com>
Hi @huvunvidia, Thanks for triggering CI! There was a unit test failure (test_transpose_non_rank_zero_hf_to_megatron) caused by the permute_dims behavior change — I updated the test to match the new behavior (permute applied on all ranks). Please review

Thanks.
Add Megatron-Bridge support for ERNIE 4.5 models:
Text-only MoE bridge (Ernie4_5_MoeForCausalLM / ERNIE-4.5-21B-A3B-PT):
VL bridge (Ernie4_5_VLForConditionalGeneration / ERNIE-4.5-VL-28B-A3B):
Shared infrastructure improvements:
Verified: Text cosine similarity = 0.999581, VL cosine similarity = 0.999563.
What does this PR do?
Add ERNIE 4.5 model family support to Megatron-Bridge, including both the text-only MoE
model (ERNIE-4.5-21B-A3B-PT) and the vision-language MoE model
(ERNIE-4.5-VL-28B-A3B-Thinking). Both bridges enable full HF↔Megatron-Core roundtrip
conversion with TP/PP/EP parallelism support.
Changelog
dense/MoE layer support via moe_layer_freq
with dual-pool expert routing (text/vision)
architectures
GitHub Actions CI
See the https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/CONTRIBUTING.md#-running-github-ci section of the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/CONTRIBUTING.md
Additional Information
Summary by CodeRabbit
Release Notes
dual-pool expert routing.