
[bridge] feat: Add ERNIE 4.5 text-only MoE and VL bridges for Megatro… #3263

Open
bo-ke wants to merge 5 commits into NVIDIA-NeMo:main from bo-ke:ernie-upstream

Conversation

@bo-ke bo-ke commented Apr 10, 2026


Add Megatron-Bridge support for ERNIE 4.5 models:

Text-only MoE bridge (Ernie4_5_MoeForCausalLM / ERNIE-4.5-21B-A3B-PT):

  • Single-pool MoE: 64 experts, top-6 routing, shared experts
  • Mixed dense/MoE layers via moe_layer_freq (layer 0 dense, rest MoE)
  • Custom decoder block spec respecting moe_layer_freq
  • Sigmoid routing with expert bias for aux-free load balancing
  • Interleaved RoPE (base=500000), RMSNorm, SiLU-gated MLP
  • Full roundtrip conversion tests (TP/PP parallelism)
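The routing scheme described above can be sketched in plain Python. This is a minimal illustration, not the bridge's actual implementation: the function name is made up, and the 64-expert / top-6 defaults merely mirror the numbers quoted in this description.

```python
import math

def sigmoid_route(logits, expert_bias, top_k=6):
    # Per-expert affinity -> sigmoid score; the bias only steers *selection*
    # (nudging under-loaded experts up), while gate weights are renormalized
    # from the unbiased scores, as in aux-loss-free balancing schemes.
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    biased = [s + b for s, b in zip(scores, expert_bias)]
    top = sorted(range(len(logits)), key=lambda i: biased[i], reverse=True)[:top_k]
    denom = sum(scores[i] for i in top)
    return {i: scores[i] / denom for i in top}

# One token, 64 experts: a positive bias on expert 7 pulls it into the top-6,
# but its gate weight still comes from the unbiased sigmoid score.
bias = [0.0] * 64
bias[7] = 0.5
gates = sigmoid_route([0.1 * (i % 5) for i in range(64)], bias)
print(sorted(gates))  # expert 7 selected alongside the highest-scoring experts
```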

VL bridge (Ernie4_5_VLForConditionalGeneration / ERNIE-4.5-VL-28B-A3B):

  • Dual-pool MoE with separate vision/text expert pools
  • Custom ErnieVisionModel with pooled attention
  • ErnieMoELayer with expert bias and sigmoid routing
  • Custom decoder layer spec for mixed dense/MoE architectures
  • Forward/backward pass and logit comparison test scripts
  • Full roundtrip conversion tests (TP/PP parallelism)
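A rough sketch of the dual-pool idea in plain Python, rather than the Megatron modules this PR adds. The split-by-modality scatter and the dummy-token trick for expert-parallel collectives follow the description above; everything else (names, the callable pools) is illustrative.

```python
def dual_pool_forward(tokens, token_types, text_pool, vision_pool, ep_size=1):
    # Split the flattened token stream by modality, run each slice through
    # its own expert pool, then scatter results back to original positions.
    text_idx = [i for i, t in enumerate(token_types) if t == "text"]
    vis_idx = [i for i, t in enumerate(token_types) if t == "vision"]
    out = [None] * len(tokens)
    for pool, idx in ((text_pool, text_idx), (vision_pool, vis_idx)):
        if idx:
            for i, y in zip(idx, pool([tokens[i] for i in idx])):
                out[i] = y
        elif ep_size > 1:
            # No tokens of this modality on this rank: feed a dummy token so
            # expert-parallel collectives still see a matching participant.
            pool([0.0])
    return out

double = lambda xs: [2 * x for x in xs]   # stand-in for the text expert pool
triple = lambda xs: [3 * x for x in xs]   # stand-in for the vision expert pool
result = dual_pool_forward([1.0, 2.0, 3.0], ["text", "vision", "text"], double, triple)
print(result)  # [2.0, 6.0, 6.0] -- sequence order preserved
```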

Shared infrastructure improvements:

  • SequentialMLP EP support in model_bridge.py
  • gate.weight precision handling in roundtrip script
  • is_expert pattern and EP=1 fast path in param_mapping.py
  • PP-safe mapping classes for dense/MoE mixed architectures

Verified: Text cosine similarity = 0.999581, VL cosine similarity = 0.999563.

What does this PR do?

Add ERNIE 4.5 model family support to Megatron-Bridge, including both the text-only MoE
model (ERNIE-4.5-21B-A3B-PT) and the vision-language MoE model
(ERNIE-4.5-VL-28B-A3B-Thinking). Both bridges enable full HF↔Megatron-Core roundtrip
conversion with TP/PP/EP parallelism support.

Changelog

  • Add Ernie45Bridge and Ernie45ModelProvider for text-only MoE conversion with mixed
    dense/MoE layer support via moe_layer_freq
  • Add Ernie45VLBridge, Ernie45VLModel, and Ernie45VLModelProvider for VL MoE conversion
    with dual-pool expert routing (text/vision)
  • Add custom _ernie45_decoder_block_spec to correctly handle mixed dense/MoE layer
    architectures
  • Add custom ErnieVisionModel with pooled attention for the VL vision encoder
  • Add ErnieMoELayer with expert bias and sigmoid routing for aux-free load balancing
  • Improve param_mapping.py with is_expert pattern matching and EP=1 fast path
  • Improve model_bridge.py with SequentialMLP EP support and None task skip
  • Add gate.weight to IGNORE_PRECISION_PARAMS in roundtrip multi-GPU script
  • Add functional tests for both text and VL model conversion (toy model, TP=2, PP=2, EP=2)
  • Add CI launch scripts for both text and VL models
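For illustration, the mixed dense/MoE layout driven by moe_layer_freq can be thought of as a per-layer flag list. This sketch only assumes the "layer 0 dense, rest MoE" pattern stated above, not the provider's real API.

```python
def layer_pattern(num_layers, moe_layer_freq=None):
    # moe_layer_freq as an explicit 0/1 flag per layer (HF-config style);
    # default to the ERNIE 4.5 layout: layer 0 dense, remaining layers MoE.
    if moe_layer_freq is None:
        moe_layer_freq = [0] + [1] * (num_layers - 1)
    return ["moe" if f else "dense" for f in moe_layer_freq]

pattern = layer_pattern(4)
print(pattern)  # ['dense', 'moe', 'moe', 'moe']
```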

GitHub Actions CI

See the Running GitHub CI section
(https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/CONTRIBUTING.md#-running-github-ci)
of the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve
and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed
    https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/CONTRIBUTING.md
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

Release Notes

  • New Features
    • Added support for converting HuggingFace ERNIE 4.5 MoE text model to Megatron-Core.
    • Added support for converting HuggingFace ERNIE 4.5 Vision-Language MoE model with
      dual-pool expert routing.
    • Enhanced expert parameter mapping for mixture-of-experts architectures.
  • Bug Fixes
    • Improved precision casting during model verification.
    • Enhanced error handling for missing parameters across pipeline parallelism stages.
  • Tests
    • Added comprehensive functional tests for ERNIE 4.5 model conversion and validation.
    • Added vision transformer alignment tests.


copy-pr-bot Bot commented Apr 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai Bot commented Apr 10, 2026

📝 Walkthrough

This pull request adds comprehensive support for ERNIE 4.5 models to the Megatron Bridge, including both text-only MoE causal language models and vision-language MoE models. It extends the core model conversion infrastructure with enhanced expert parameter handling, introduces new bridge implementations with specialized parameter mappings (including PP-safe skipping and expert index offsetting), implements dual-pool Mixture-of-Experts routing with modality awareness, adds vision transformer components with custom absolute 2D rotary positional embeddings, and provides end-to-end conversion and validation tests.

Changes

Cohort / File(s) Summary
Core Bridge Infrastructure
src/megatron/bridge/models/conversion/param_mapping.py, src/megatron/bridge/models/conversion/model_bridge.py
Extended expert parameter pattern matching to support both .experts.* and dual-pool MoE paths. Added fast-path for EP size=1 in expert gathering. Implemented expert index remapping for SequentialMLP when EP > 1. Enhanced PP broadcast error reporting with additional context. Modified permutation behavior to apply on all ranks instead of only TP rank 0. Added graceful skipping of unmapped conversion tasks.
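The SequentialMLP expert index remapping mentioned above can be pictured as a local-to-global index shift per expert-parallel rank. This is an illustrative sketch with made-up numbers, not code from model_bridge.py.

```python
def global_expert_index(ep_rank, local_idx, num_experts, ep_size):
    # Each EP rank owns a contiguous slice of num_experts // ep_size experts;
    # a rank-local expert index maps to a checkpoint-global one by offset.
    per_rank = num_experts // ep_size
    return ep_rank * per_rank + local_idx

# With 64 experts split across 2 EP ranks, local expert 2 on rank 1 is
# global expert 34 (rank offset 32 + local index 2).
g = global_expert_index(ep_rank=1, local_idx=2, num_experts=64, ep_size=2)
print(g)  # 34
```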
ERNIE 4.5 Text-Only Support
src/megatron/bridge/models/ernie/__init__.py, src/megatron/bridge/models/ernie/ernie_45_bridge.py, src/megatron/bridge/models/ernie/ernie_45_provider.py
New ERNIE 4.5 text-only bridge converting Ernie4_5_MoeForCausalLM to Megatron GPTModel. Introduced PP-safe mapping wrappers to handle missing MoE tensors across pipeline stages. Implemented specialized bias squeezing mapping for router expert-bias format conversion. Defined provider with RMSNorm, SiLU activation, RoPE config, and dual-pool MoE parameters. Comprehensive parameter mapping registry for embeddings, attention, dense/expert MLPs, and routers.
ERNIE 4.5 VL Bridge & Providers
src/megatron/bridge/models/ernie_vl/__init__.py, src/megatron/bridge/models/ernie_vl/ernie45_vl_bridge.py, src/megatron/bridge/models/ernie_vl/ernie45_vl_provider.py
New VL bridge for Ernie4_5_VLMoeForConditionalGeneration with offset and concatenation-aware expert bias mappings. Implemented dual-pool expert index offsetting for vision experts stored after text experts. Added support for both nested "builtin" and flat "auto_map" HF config layouts. VL provider with M-RoPE config, dual-pool MoE sizing, vision/language freeze toggles, VL placeholder token IDs, and heterogeneous dense/MoE decoder layers.
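The dual-pool expert index offsetting reduces to a simple additive offset, since the description above says vision experts are stored after text experts in the HF checkpoint. A hedged sketch (the 64-expert text pool size and the function name are illustrative):

```python
def hf_expert_index(pool, local_idx, num_text_experts=64):
    # In the HF checkpoint the vision pool follows the text pool, so a
    # vision expert's global index is its local index plus the text pool size.
    if pool == "text":
        return local_idx
    return num_text_experts + local_idx

v = hf_expert_index("vision", 3)
print(v)  # 67
```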
ERNIE 4.5 VL Core Modeling
src/megatron/bridge/models/ernie_vl/modeling_ernie45_vl.py, src/megatron/bridge/models/ernie_vl/ernie_moe_layer.py
Implemented Ernie45VLModel integrating vision tower, resampler, and Megatron GPT language model. Added end-to-end multimodal forward with vision preprocessing, placeholder token scattering, and modality-aware RoPE. Introduced ErnieMultiTypeMoE layer with dual-pool MoE routing: splits tokens by modality, routes through separate text/vision pools, handles empty-modality edge cases with dummy tokens, and preserves expert-parallel collectives. Implemented module-level token-type context for MoE routing without signature changes.
ERNIE 4.5 VL Vision Infrastructure
src/megatron/bridge/models/ernie_vl/vision_model.py, src/megatron/bridge/models/ernie_vl/vision_transformer_config.py, src/megatron/bridge/models/ernie_vl/vision_attention.py, src/megatron/bridge/models/ernie_vl/vision_layer_spec.py, src/megatron/bridge/models/ernie_vl/ernie_decoder_layer_spec.py
Added Megatron-native ERNIE vision transformer with patch embedding, rotary embedding, and packed sequence support. Introduced ErnieVisionTransformerConfig extending TransformerConfig with vision-specific fields. Implemented ErnieVLSelfAttention with absolute 2D RoPE application. Created heterogeneous decoder block specs supporting mixed dense/MoE layer patterns via moe_layer_freq. Added per-layer spec builder selecting between TE and non-TE implementations.
Public API Exports
src/megatron/bridge/models/__init__.py
Updated module-level __all__ to export new ERNIE 4.5 and ERNIE 4.5 VL bridge/provider classes for public consumption.
Test Infrastructure & Launch Scripts
tests/functional_tests/launch_scripts/active/L1_Launch_models_ernie*.sh, tests/functional_tests/test_groups/models/ernie/__init__.py, tests/functional_tests/test_groups/models/ernie_vl/__init__.py
Added bash scripts for running ERNIE and ERNIE VL functional test suites with coverage collection. Created package initializers for test modules.
ERNIE 4.5 MoE Conversion Tests
tests/functional_tests/test_groups/models/ernie/test_ernie45_moe_conversion.py
Added toy model fixture building small Ernie4_5_MoeForCausalLM checkpoint with sanitized tokenizer. Implemented parametrized GPU test validating conversion across multiple TP/PP/EP configurations via hf_megatron_roundtrip_multi_gpu.py. Verification includes config preservation and expected model structure.
ERNIE 4.5 VL Conversion & Validation Tests
tests/functional_tests/test_groups/models/ernie_vl/test_ernie45_vl_conversion.py, tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_fwd_bwd.py, tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_logit_compare.py, tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_vit_compare.py, tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_vit_debug.py, tests/functional_tests/test_groups/models/ernie_vl/hf_loss_check.py
Comprehensive test suite: toy model creation fixture with dual-pool MoE and tokenizer handling; parametrized conversion tests across TP/PP/EP; forward/backward training validation with text+vision inputs and gradient checks; HF vs. Megatron logit equivalence checks with top-k diagnostics; vision transformer output alignment tests; debug script for layer-level ViT comparisons; HF reference loss computation.
Precision & Weight Handling
examples/conversion/hf_megatron_roundtrip_multi_gpu.py
Extended precision mismatch ignoring to include "gate.weight" (MoE router weights). Broadened float32 casting logic to apply whenever source and reference tensor dtypes differ, not just for known precision params.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 70.09%, below the required
    threshold of 80.00%. Resolution: write docstrings for the functions missing them to
    satisfy the coverage threshold.
  • Test Results For Major Changes (❓ Inconclusive): The PR claims specific numeric
    verification results (0.999581, 0.999563), but these values are not documented anywhere
    in the codebase; test infrastructure exists with defined thresholds, but the actually
    achieved results are not captured. Resolution: document the actual numeric test results
    in code comments, add assertions verifying the achieved values, or remove the claimed
    numbers from the PR objectives if they are based on external test runs.

✅ Passed checks (2 passed)

  • Title check (✅ Passed): The title clearly summarizes the main change: adding ERNIE 4.5
    model bridges (text-only MoE and vision-language) to Megatron.
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is
    enabled.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (21)
tests/functional_tests/test_groups/models/ernie_vl/hf_loss_check.py (1)

107-108: Consider adding strict=True to zip() for defensive coding.

While the three lists are guaranteed to have equal length (all k=5 from topk), adding strict=True makes this assumption explicit and catches any future bugs if the data sources diverge.

♻️ Suggested change
         print(f"  Position {pos}: actual_next='{actual_decoded}'({actual_next}), "
-              f"top5={list(zip(decoded_predictions, topk_ids.tolist(), topk_vals.tolist()))}")
+              f"top5={list(zip(decoded_predictions, topk_ids.tolist(), topk_vals.tolist(), strict=True))}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/test_groups/models/ernie_vl/hf_loss_check.py` around
lines 107 - 108, The print line that zips decoded_predictions,
topk_ids.tolist(), and topk_vals.tolist() should use zip(..., strict=True) to
enforce equal lengths and catch mismatches; update the zip call in the code that
prints predictions (the line referencing decoded_predictions, topk_ids,
topk_vals) to zip(decoded_predictions, topk_ids.tolist(), topk_vals.tolist(),
strict=True).
src/megatron/bridge/models/ernie/ernie_45_provider.py (1)

23-24: Consider using modern union syntax.

The import of Optional from typing can be replaced with the modern X | None syntax, which is the preferred style in this codebase for source files.

♻️ Suggested change
 from dataclasses import dataclass
-from typing import Callable, Optional
+from typing import Callable

And update line 31:

-def _ernie45_decoder_block_spec(config: "Ernie45ModelProvider", vp_stage: Optional[int] = None):
+def _ernie45_decoder_block_spec(config: "Ernie45ModelProvider", vp_stage: int | None = None):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/ernie/ernie_45_provider.py` around lines 23 - 24,
Replace the legacy typing.Optional import and usages with the modern union
syntax: remove Optional from the import list (leaving dataclass and Callable)
and change any type hints in this module that use Optional[...] to the X | None
form (e.g., foo: Optional[Bar] -> foo: Bar | None); update the annotation at the
location referenced (line with current Optional usage) to use the new union
syntax.
src/megatron/bridge/models/ernie_vl/ernie_decoder_layer_spec.py (1)

182-184: ColumnParallel unpacked but unused in this function.

The variable ColumnParallel is unpacked from _get_linear_modules() but not used in _get_ernie_decoder_layer_spec. This is a minor style issue since the function relies on positional unpacking. Consider using _ for unused variables.

Suggested fix
-    ColumnParallel, RowParallel, DotProductAttention, LayerNormColumnParallel, Norm = (
+    _, RowParallel, DotProductAttention, LayerNormColumnParallel, Norm = (
         _get_linear_modules()
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/ernie_vl/ernie_decoder_layer_spec.py` around lines
182 - 184, The function _get_ernie_decoder_layer_spec unpacks ColumnParallel
from _get_linear_modules() but never uses it; update the unpacking to replace
ColumnParallel with an underscore (e.g., _, RowParallel, DotProductAttention,
LayerNormColumnParallel, Norm = _get_linear_modules()) so the unused value is
explicitly ignored and the intent is clear, making sure to only change the
unpack target and not other logic in _get_ernie_decoder_layer_spec.
src/megatron/bridge/models/ernie_vl/__init__.py (1)

22-29: Consider sorting __all__ alphabetically for consistency.

Ruff flags that the __all__ list is unsorted. While this doesn't affect functionality, sorting it improves readability and aligns with isort conventions.

Suggested fix
 __all__ = [
+    "Ernie45VLBridge",
     "Ernie45VLModel",
-    "Ernie45VLBridge",
     "Ernie45VLModelProvider",
     "ErnieMultiTypeMoE",
-    "MultiTypeMoeSubmodules",
     "get_ernie45_vl_decoder_block_spec",
+    "MultiTypeMoeSubmodules",
 ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/ernie_vl/__init__.py` around lines 22 - 29, The
__all__ list is unsorted; reorder the exported names alphabetically so linters
(Ruff/isort) are satisfied — update the __all__ tuple containing
"Ernie45VLModel", "Ernie45VLBridge", "Ernie45VLModelProvider",
"ErnieMultiTypeMoE", "MultiTypeMoeSubmodules", and
"get_ernie45_vl_decoder_block_spec" into alphabetical order (e.g.,
"Ernie45VLBridge", "Ernie45VLModel", "Ernie45VLModelProvider",
"ErnieMultiTypeMoE", "MultiTypeMoeSubmodules",
"get_ernie45_vl_decoder_block_spec").
tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_fwd_bwd.py (4)

415-422: Prefix unused loop variable with underscore.

The name variable is not used in the loop body.

🧹 Proposed fix
         for m in megatron_models:
-            for name, param in m.named_parameters():
+            for _name, param in m.named_parameters():
                 if param.requires_grad:
                     total_params += 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_fwd_bwd.py`
around lines 415 - 422, The loop over model parameters uses an unused variable
`name`; change the tuple unpacking in the loop that iterates `for name, param in
m.named_parameters():` to prefix the unused variable (e.g., `for _, param in
m.named_parameters():`) so only `param` is referenced, leaving the rest of the
logic (`param.requires_grad`, `param.grad`, and the increments `total_params`,
`params_with_grad`, `params_with_nonzero_grad`) unchanged.

83-83: Remove extraneous f prefix from string without placeholders.

🧹 Proposed fix
-print_rank_0(f"=== ERNIE 4.5 VL Forward/Backward Test ===")
+print_rank_0("=== ERNIE 4.5 VL Forward/Backward Test ===")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_fwd_bwd.py` at
line 83, The print_rank_0 call uses an unnecessary f-string for a literal with
no placeholders; in the call to print_rank_0 (string "=== ERNIE 4.5 VL
Forward/Backward Test ===") remove the leading f so it becomes a plain string
literal, i.e., update the argument passed to print_rank_0 accordingly.

69-69: Use explicit Optional or union syntax for optional parameters.

PEP 484 prohibits implicit Optional (using = None without a type annotation that includes None).

🧹 Proposed fix
+from typing import Optional
+
 def run_forward_backward(
     hf_model_path: str,
     tp: int = 1,
     pp: int = 1,
     ep: int = 1,
     seq_len: int = 16,
     backward: bool = True,
     with_vision: bool = False,
-    prompt: str = None,
+    prompt: Optional[str] = None,
 ):

Same applies to line 467 in _run().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_fwd_bwd.py` at
line 69, The parameter annotation uses implicit Optional by assigning prompt:
str = None; update the type to explicitly allow None (e.g., use Optional[str] or
str | None) for the prompt parameter and likewise for the _run() signature
mentioned at line 467; import typing.Optional if using Optional[] and ensure
both function signatures (the one containing prompt and the _run method) have
explicit union/Optional annotations instead of bare "= None".

36-37: Remove unused imports.

sys and traceback are imported but never used in this file.

🧹 Proposed fix
 import argparse
 import os
-import sys
-import traceback

 # Disable torch.compile to avoid triton compatibility issues in some environments.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_fwd_bwd.py`
around lines 36 - 37, Remove the unused module imports 'sys' and 'traceback'
from the top-level import list in the file (they are not referenced anywhere);
update the import block to only include the modules actually used (e.g., keep
other imports like when importing pytest or model helpers) so no unused symbols
remain and linters stop flagging these imports.
tests/functional_tests/test_groups/models/ernie_vl/test_ernie45_vl_conversion.py (2)

317-319: Replace assert False with pytest.fail() for test failure reporting.

assert statements are stripped when Python runs with the -O flag, so an assert False check can silently pass. Use pytest.fail() for proper test failure reporting.

🧪 Proposed fix
             if result.returncode != 0:
                 print(f"STDOUT: {result.stdout}")
                 print(f"STDERR: {result.stderr}")
-                assert False, (
+                pytest.fail(
                     f"ERNIE 4.5 VL {test_name} conversion failed with return code {result.returncode}"
                 )

Same pattern applies to line 427-429.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/functional_tests/test_groups/models/ernie_vl/test_ernie45_vl_conversion.py`
around lines 317 - 319, Replace the brittle assert False usage in the failure
branches with pytest.fail(...) so failures are reported reliably; specifically
in test_ernie45_vl_conversion.py replace the `assert False, (f"ERNIE 4.5 VL
{test_name} conversion failed with return code {result.returncode}")` at the
block that checks result.returncode and the same pattern around lines 427-429
with `pytest.fail` called with the same formatted message (ensure pytest is
imported if not present) so test_name and result.returncode are included in the
failure message.

150-181: Consider using more specific exception types in tokenizer fallback logic.

The nested try/except blocks catch broad Exception types. While this provides robust fallback behavior, catching more specific exceptions (e.g., OSError, ImportError) would make debugging easier if unexpected errors occur.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/functional_tests/test_groups/models/ernie_vl/test_ernie45_vl_conversion.py`
around lines 150 - 181, Change the broad except Exception blocks in the
tokenizer fallback to catch specific errors: for the
AutoTokenizer.from_pretrained block (around _local_tokenizer_path /
AutoTokenizer.from_pretrained) catch loading-related errors such as OSError,
ValueError (e.g., except (OSError, ValueError) as e) so only model I/O/format
issues trigger the fallback; for the inner fallback that imports
transformers/tokenizers and builds PreTrainedTokenizerFast (around the imports,
TkTokenizer, tk.add_tokens, PreTrainedTokenizerFast,
fast_tokenizer.save_pretrained) catch import/runtime issues such as ImportError,
ModuleNotFoundError, RuntimeError (e.g., except (ImportError,
ModuleNotFoundError, RuntimeError) as e) and handle/log the exception before
falling back to the final model_dir.mkdir; ensure each except binds the
exception variable for debugging.
tests/functional_tests/test_groups/models/ernie/test_ernie45_moe_conversion.py (2)

109-112: Debug print statement should be removed before merge.

The loop that prints parameter dtype is debug code and should be removed or guarded behind a debug flag.

🧹 Proposed fix
         # Create model with random weights and convert to bfloat16
         model = Ernie4_5_MoeForCausalLM(config)
         model = model.bfloat16()

-        # Debug: Check model dtype before saving
-        for name, param in model.named_parameters():
-            print(f"Before save - {name}: {param.dtype}")
-            break  # Just check the first parameter
-
         # Copy tokenizer files from the local ERNIE VL model (works offline).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/functional_tests/test_groups/models/ernie/test_ernie45_moe_conversion.py`
around lines 109 - 112, Remove the debug print loop that iterates
model.named_parameters() and prints the first parameter's dtype (the "for name,
param in model.named_parameters(): print(...); break" block) from
test_ernie45_moe_conversion.py; either delete those lines or wrap them behind a
proper debug/logging flag so tests do not emit debug output (ensure no stray
print remains in the test).

243-243: Replace assert False with pytest.fail() or raise AssertionError().

assert False statements are stripped when Python runs with the -O flag. Use pytest.fail() for proper test failure reporting in a pytest context.

🧪 Proposed fix
         except Exception as e:
-            assert False, f"Failed to load created toy MoE model: {e}"
+            pytest.fail(f"Failed to load created toy MoE model: {e}")

Same applies to line 319-322.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/functional_tests/test_groups/models/ernie/test_ernie45_moe_conversion.py`
at line 243, Replace the use of "assert False, f'Failed to load created toy MoE
model: {e}'" with a proper pytest failure so failures aren't optimized away;
import pytest if missing and call pytest.fail(f"Failed to load created toy MoE
model: {e}") (or raise AssertionError(f"...")) in the test function, and make
the same replacement for the other occurrences mentioned (the block around the
second failure message at lines 319-322) so both test failures are reported
reliably under pytest and when Python is run with -O.
src/megatron/bridge/models/ernie_vl/vision_attention.py (1)

164-164: Prefix unused variable with underscore.

The block_table variable is unpacked but never used. Prefix it with an underscore to indicate it's intentionally unused.

🧹 Proposed fix
-query, key, value, rotary_pos_emb, attn_mask_type, block_table = self._adjust_key_value_for_inference(
+query, key, value, rotary_pos_emb, attn_mask_type, _block_table = self._adjust_key_value_for_inference(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/ernie_vl/vision_attention.py` at line 164, The
tuple returned by _adjust_key_value_for_inference is unpacked into query, key,
value, rotary_pos_emb, attn_mask_type, block_table but block_table is unused;
rename that target to _block_table or simply _ to mark it intentionally unused
in the assignment where _adjust_key_value_for_inference(...) is called (e.g.,
change "..., attn_mask_type, block_table =
self._adjust_key_value_for_inference(...)" to "..., attn_mask_type, _block_table
= self._adjust_key_value_for_inference(...)" or "..., attn_mask_type, _ = ...")
so linters/readers know it is intentionally ignored.
tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_logit_compare.py (3)

523-523: Remove extraneous f prefix from string without placeholders.

This f-string has no placeholders, so the f prefix is unnecessary.

🧹 Proposed fix
-print_rank_0(f"=== ERNIE 4.5 VL Logit Comparison ===")
+print_rank_0("=== ERNIE 4.5 VL Logit Comparison ===")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_logit_compare.py`
at line 523, Remove the unnecessary f-string prefix on the print in
print_rank_0: change the call print_rank_0(f"=== ERNIE 4.5 VL Logit Comparison
===") to use a plain string without the f prefix so it becomes print_rank_0("===
ERNIE 4.5 VL Logit Comparison ==="); no other changes needed.

269-274: Add strict=True to zip() for safer pairing.

When zipping top5_tokens with top5_vals, the lengths should always match. Using strict=True will catch unexpected mismatches early.

🧹 Proposed fix
-print_rank_0(f"  HF Top 5: {list(zip(top5_tokens, top5_vals.tolist()))}")
+print_rank_0(f"  HF Top 5: {list(zip(top5_tokens, top5_vals.tolist(), strict=True))}")

Same applies to line 452:

-print_rank_0(f"  Megatron Top 5: {list(zip(top5_tokens, top5_vals.tolist()))}")
+print_rank_0(f"  Megatron Top 5: {list(zip(top5_tokens, top5_vals.tolist(), strict=True))}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_logit_compare.py`
around lines 269 - 274, The zip between top5_tokens and top5_vals may mask
length mismatches; update the two occurrences where they are paired (the
expression using top5_tokens and top5_vals around the log print and the similar
use at the later location referenced) to call zip(top5_tokens, top5_vals,
strict=True) so any unexpected length mismatch raises immediately; ensure both
occurrences (the one that builds the HF Top 5 log and the similar one at the
later location) are changed to include strict=True and run under a Python
version that supports this parameter.

91-91: Unused import: PIL.Image

The Image class from PIL is imported but never used in this file.

🧹 Proposed fix
 from transformers import AutoProcessor
-from PIL import Image
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_logit_compare.py`
at line 91, Remove the unused import "Image" from the top-level import statement
(from PIL import Image) in ernie45_vl_logit_compare.py; ensure no other code
references the PIL.Image symbol (search for "Image" usages in that module) and
run linters/tests to confirm the unused-import warning is resolved.
tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_vit_compare.py (1)

114-122: Consider logging warnings instead of printing them.

The HF ViT loading uses print for warnings about missing/unexpected keys. While acceptable for a standalone script, using logging.warning() would allow better control over verbosity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_vit_compare.py`
around lines 114 - 122, Replace the plain print warnings when loading HF ViT
with logging warnings: after calling hf_vit.load_state_dict(...) check the
missing and unexpected variables and call logger.warning(...) instead of
print(...); ensure you obtain a logger via logging.getLogger(__name__) (and add
import logging if missing) so warnings go through the logging framework and
respect configured verbosity.
src/megatron/bridge/models/ernie_vl/ernie45_vl_provider.py (1)

27-27: Remove unused import Optional.

Optional is imported but not used in this file.

🧹 Proposed fix
-from typing import Any, Callable, List, Optional, Tuple, Union
+from typing import Any, Callable, List, Tuple, Union
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/ernie_vl/ernie45_vl_provider.py` at line 27,
Remove the unused typing import Optional from the import list in
ernie45_vl_provider.py; update the from typing import ... line (which currently
includes Optional) to exclude Optional so only used symbols like Any, Callable,
List, Tuple, Union remain, ensuring no other references to Optional exist in
functions or classes in this module.
src/megatron/bridge/models/ernie_vl/ernie_moe_layer.py (1)

239-259: Remove unused text_used_dummy and vision_used_dummy variables.

These variables are assigned but never read, as flagged by static analysis (F841). If they're intended for debugging, consider adding debug logging; otherwise, remove them.

🧹 Proposed fix to remove dead code
-        text_used_dummy = False
         if text_indices.numel() > 0:
             text_output, _ = self.text_moe_layer(text_hidden, padding_mask=None)
         elif ep_size > 1:
             # Inject a dummy token to participate in EP collectives
             dummy = torch.zeros(1, 1, hidden_size, dtype=flat_hidden.dtype, device=flat_hidden.device)
             text_output, _ = self.text_moe_layer(dummy, padding_mask=None)
-            text_used_dummy = True
         else:
             text_output = None

-        vision_used_dummy = False
         if vision_indices.numel() > 0:
             vision_output, _ = self.vision_moe_layer(vision_hidden, padding_mask=None)
         elif ep_size > 1:
             # Inject a dummy token to participate in EP collectives
             dummy = torch.zeros(1, 1, hidden_size, dtype=flat_hidden.dtype, device=flat_hidden.device)
             vision_output, _ = self.vision_moe_layer(dummy, padding_mask=None)
-            vision_used_dummy = True
         else:
             vision_output = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/ernie_vl/ernie_moe_layer.py` around lines 239 -
259, The variables text_used_dummy and vision_used_dummy are assigned but never
used (F841); remove their declarations and assignments from the ernie_moe_layer
logic that calls text_moe_layer and vision_moe_layer (the branches that create
the dummy tensor and call these methods) so the code simply sets
text_output/vision_output or calls the dummy path without creating these unused
flags; if you actually need that state for debugging, replace the assignments
with a debug log referencing text_used_dummy/vision_used_dummy instead of
keeping unused variables.
src/megatron/bridge/models/ernie_vl/ernie45_vl_bridge.py (1)

87-87: Remove unused import Optional.

Static analysis (F401) flags Optional as imported but unused; no type hints in this file actually reference it, so the import can be dropped.

🧹 Proposed fix
-from typing import Dict, Optional, Tuple
+from typing import Dict, Tuple
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/ernie_vl/ernie45_vl_bridge.py` at line 87, Remove
the unused Optional import from the typing import list in ernie45_vl_bridge.py:
update the import statement that currently reads "from typing import Dict,
Optional, Tuple" to only import the actually used symbols (Dict and Tuple) so
Optional is no longer imported; ensure any type hints referencing Optional
remain valid (they already rely on typing.Optional via other imports or use of
Union) and run linting to confirm F401 is resolved.
src/megatron/bridge/models/ernie_vl/modeling_ernie45_vl.py (1)

526-527: Prefix unused rope_deltas with underscore.

Static analysis (RUF059) flags rope_deltas as unpacked but never used.

🧹 Proposed fix
-        position_ids, rope_deltas = self.get_rope_index(
+        position_ids, _rope_deltas = self.get_rope_index(
             input_ids,
             rope_mm_token_type_ids,
             image_grid_thw,
             video_grid_thw,
             attention_mask=None,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/ernie_vl/modeling_ernie45_vl.py` around lines 526
- 527, The unpacking from get_rope_index currently binds an unused variable
rope_deltas; change the unpack to position_ids, _rope_deltas =
self.get_rope_index(input_ids, ...) (or simply position_ids, _ =
self.get_rope_index(...)) so the unused value is prefixed with an underscore to
satisfy the RUF059 linter; update the unpack site near the call to
self.get_rope_index in modeling_ernie45_vl.py.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/megatron/bridge/models/ernie_vl/ernie45_vl_bridge.py`:
- Around line 207-278: The class-level _export_buffer on _ConcatBiasMapping can
retain stale text biases between conversion runs; add a defensive classmethod
like clear_export_buffer(cls) that empties _export_buffer and call it at the
start of the export workflow (before any megatron_to_hf calls) so each
conversion run starts with a fresh buffer; update _ConcatBiasMapping to provide
clear_export_buffer and ensure the export orchestration invokes
_ConcatBiasMapping.clear_export_buffer() before exporting so megatron_to_hf (and
its pop usage) cannot leak state across runs.

In `@src/megatron/bridge/models/ernie_vl/vision_model.py`:
- Around line 117-131: The code uses torch.cuda.current_device() when creating
self.inv_freq which can fail before CUDA is initialized; update the
implementation to derive the device from the input tensor (or a passed-in
device) instead of calling torch.cuda.current_device(). Locate where inv_freq is
created (the block that registers "inv_freq" and the subsequent seq =
torch.arange(...)) and replace the device argument with something like device =
input.device (or a provided device parameter) and use that device for
torch.arange and torch.arange(..., device=device). Ensure
self.register_buffer("inv_freq", inv_freq, persistent=False) still stores the
buffer and that subsequent uses of self.inv_freq.device remain valid.

In `@tests/functional_tests/launch_scripts/active/L1_Launch_models_ernie_vl.sh`:
- Around line 1-16: Move the shebang (#!/bin/bash) to be the first line of the
script so the kernel uses the correct interpreter; specifically, place the
existing shebang above the copyright header and ensure the existing "set -xeuo
pipefail" and all other header comments follow it unchanged in the script
"L1_Launch_models_ernie_vl.sh".

In `@tests/functional_tests/launch_scripts/active/L1_Launch_models_ernie.sh`:
- Line 16: The CI workflow is missing the new L1 launcher
L1_Launch_models_ernie, so update .github/workflows/cicd-main.yml to include
this launcher in the job matrix used for L1 test launchers; locate the matrix
entry (e.g., matrix.launcher or matrix.include block used by the job that runs
launch scripts) and add an entry for "L1_Launch_models_ernie" referencing the
launcher name and optionally the test path
tests/functional_tests/test_groups/models/ernie/test_ernie45_moe_conversion.py
so the new script is executed in the CI matrix.

In `@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_vit_debug.py`:
- Around line 186-187: The variable mg_sin is computed but never used; either
remove the mg_sin calculation or use it by comparing to hf_sin like you do for
mg_cos/hf_cos: update the mg_sin line in the block where mg_emb is processed
(mg_cos, mg_sin, mg_emb) to either delete the mg_sin = mg_emb.sin().flatten()
line or replace it with a comparison/assertion against hf_sin (mirroring the
existing mg_cos vs hf_cos check) so mg_sin is no longer an unused variable.
- Around line 311-316: The snippet computing cos_sim_ln using
F.cosine_similarity against a placeholder torch.zeros(1) is dead/misleading and
unused; remove the entire cos_sim_ln calculation (the F.cosine_similarity call
and the placeholder torch.zeros(1) argument) and any references to cos_sim_ln,
or if you intend a real check, replace the placeholder with a proper tensor
extracted for comparison (e.g., the actual LN output) and use hf_normed and
F.cosine_similarity accordingly; locate the code around the cos_sim_ln variable
and hf_normed to apply the change.

---

Nitpick comments:
In `@src/megatron/bridge/models/ernie_vl/__init__.py`:
- Around line 22-29: The __all__ list is unsorted; reorder the exported names
alphabetically so linters (Ruff/isort) are satisfied — update the __all__ tuple
containing "Ernie45VLModel", "Ernie45VLBridge", "Ernie45VLModelProvider",
"ErnieMultiTypeMoE", "MultiTypeMoeSubmodules", and
"get_ernie45_vl_decoder_block_spec" into alphabetical order (e.g.,
"Ernie45VLBridge", "Ernie45VLModel", "Ernie45VLModelProvider",
"ErnieMultiTypeMoE", "MultiTypeMoeSubmodules",
"get_ernie45_vl_decoder_block_spec").

In `@src/megatron/bridge/models/ernie_vl/ernie_decoder_layer_spec.py`:
- Around line 182-184: The function _get_ernie_decoder_layer_spec unpacks
ColumnParallel from _get_linear_modules() but never uses it; update the
unpacking to replace ColumnParallel with an underscore (e.g., _, RowParallel,
DotProductAttention, LayerNormColumnParallel, Norm = _get_linear_modules()) so
the unused value is explicitly ignored and the intent is clear, making sure to
only change the unpack target and not other logic in
_get_ernie_decoder_layer_spec.

In `@src/megatron/bridge/models/ernie_vl/ernie_moe_layer.py`:
- Around line 239-259: The variables text_used_dummy and vision_used_dummy are
assigned but never used (F841); remove their declarations and assignments from
the ernie_moe_layer logic that calls text_moe_layer and vision_moe_layer (the
branches that create the dummy tensor and call these methods) so the code simply
sets text_output/vision_output or calls the dummy path without creating these
unused flags; if you actually need that state for debugging, replace the
assignments with a debug log referencing text_used_dummy/vision_used_dummy
instead of keeping unused variables.

In `@src/megatron/bridge/models/ernie_vl/ernie45_vl_bridge.py`:
- Line 87: Remove the unused Optional import from the typing import list in
ernie45_vl_bridge.py: update the import statement that currently reads "from
typing import Dict, Optional, Tuple" to only import the actually used symbols
(Dict and Tuple) so Optional is no longer imported; ensure any type hints
referencing Optional remain valid (they already rely on typing.Optional via
other imports or use of Union) and run linting to confirm F401 is resolved.

In `@src/megatron/bridge/models/ernie_vl/ernie45_vl_provider.py`:
- Line 27: Remove the unused typing import Optional from the import list in
ernie45_vl_provider.py; update the from typing import ... line (which currently
includes Optional) to exclude Optional so only used symbols like Any, Callable,
List, Tuple, Union remain, ensuring no other references to Optional exist in
functions or classes in this module.

In `@src/megatron/bridge/models/ernie_vl/modeling_ernie45_vl.py`:
- Around line 526-527: The unpacking from get_rope_index currently binds an
unused variable rope_deltas; change the unpack to position_ids, _rope_deltas =
self.get_rope_index(input_ids, ...) (or simply position_ids, _ =
self.get_rope_index(...)) so the unused value is prefixed with an underscore to
satisfy the RUF059 linter; update the unpack site near the call to
self.get_rope_index in modeling_ernie45_vl.py.

In `@src/megatron/bridge/models/ernie_vl/vision_attention.py`:
- Line 164: The tuple returned by _adjust_key_value_for_inference is unpacked
into query, key, value, rotary_pos_emb, attn_mask_type, block_table but
block_table is unused; rename that target to _block_table or simply _ to mark it
intentionally unused in the assignment where
_adjust_key_value_for_inference(...) is called (e.g., change "...,
attn_mask_type, block_table = self._adjust_key_value_for_inference(...)" to
"..., attn_mask_type, _block_table = self._adjust_key_value_for_inference(...)"
or "..., attn_mask_type, _ = ...") so linters/readers know it is intentionally
ignored.

In `@src/megatron/bridge/models/ernie/ernie_45_provider.py`:
- Around line 23-24: Replace the legacy typing.Optional import and usages with
the modern union syntax: remove Optional from the import list (leaving dataclass
and Callable) and change any type hints in this module that use Optional[...] to
the X | None form (e.g., foo: Optional[Bar] -> foo: Bar | None); update the
annotation at the location referenced (line with current Optional usage) to use
the new union syntax.

In `@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_fwd_bwd.py`:
- Around line 415-422: The loop over model parameters uses an unused variable
`name`; change the tuple unpacking in the loop that iterates `for name, param in
m.named_parameters():` to prefix the unused variable (e.g., `for _, param in
m.named_parameters():`) so only `param` is referenced, leaving the rest of the
logic (`param.requires_grad`, `param.grad`, and the increments `total_params`,
`params_with_grad`, `params_with_nonzero_grad`) unchanged.
- Line 83: The print_rank_0 call uses an unnecessary f-string for a literal with
no placeholders; in the call to print_rank_0 (string "=== ERNIE 4.5 VL
Forward/Backward Test ===") remove the leading f so it becomes a plain string
literal, i.e., update the argument passed to print_rank_0 accordingly.
- Line 69: The parameter annotation uses implicit Optional by assigning prompt:
str = None; update the type to explicitly allow None (e.g., use Optional[str] or
str | None) for the prompt parameter and likewise for the _run() signature
mentioned at line 467; import typing.Optional if using Optional[] and ensure
both function signatures (the one containing prompt and the _run method) have
explicit union/Optional annotations instead of bare "= None".
- Around line 36-37: Remove the unused module imports 'sys' and 'traceback' from
the top-level import list in the file (they are not referenced anywhere); update
the import block to only include the modules actually used (e.g., keep other
imports like when importing pytest or model helpers) so no unused symbols remain
and linters stop flagging these imports.

In
`@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_logit_compare.py`:
- Line 523: Remove the unnecessary f-string prefix on the print in print_rank_0:
change the call print_rank_0(f"=== ERNIE 4.5 VL Logit Comparison ===") to use a
plain string without the f prefix so it becomes print_rank_0("=== ERNIE 4.5 VL
Logit Comparison ==="); no other changes needed.
- Around line 269-274: The zip between top5_tokens and top5_vals may mask length
mismatches; update the two occurrences where they are paired (the expression
using top5_tokens and top5_vals around the log print and the similar use at the
later location referenced) to call zip(top5_tokens, top5_vals, strict=True) so
any unexpected length mismatch raises immediately; ensure both occurrences (the
one that builds the HF Top 5 log and the similar one at the later location) are
changed to include strict=True and run under a Python version that supports this
parameter.
- Line 91: Remove the unused import "Image" from the top-level import statement
(from PIL import Image) in ernie45_vl_logit_compare.py; ensure no other code
references the PIL.Image symbol (search for "Image" usages in that module) and
run linters/tests to confirm the unused-import warning is resolved.

In
`@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_vit_compare.py`:
- Around line 114-122: Replace the plain print warnings when loading HF ViT with
logging warnings: after calling hf_vit.load_state_dict(...) check the missing
and unexpected variables and call logger.warning(...) instead of print(...);
ensure you obtain a logger via logging.getLogger(__name__) (and add import
logging if missing) so warnings go through the logging framework and respect
configured verbosity.

In `@tests/functional_tests/test_groups/models/ernie_vl/hf_loss_check.py`:
- Around line 107-108: The print line that zips decoded_predictions,
topk_ids.tolist(), and topk_vals.tolist() should use zip(..., strict=True) to
enforce equal lengths and catch mismatches; update the zip call in the code that
prints predictions (the line referencing decoded_predictions, topk_ids,
topk_vals) to zip(decoded_predictions, topk_ids.tolist(), topk_vals.tolist(),
strict=True).

In
`@tests/functional_tests/test_groups/models/ernie_vl/test_ernie45_vl_conversion.py`:
- Around line 317-319: Replace the brittle assert False usage in the failure
branches with pytest.fail(...) so failures are reported reliably; specifically
in test_ernie45_vl_conversion.py replace the `assert False, (f"ERNIE 4.5 VL
{test_name} conversion failed with return code {result.returncode}")` at the
block that checks result.returncode and the same pattern around lines 427-429
with `pytest.fail` called with the same formatted message (ensure pytest is
imported if not present) so test_name and result.returncode are included in the
failure message.
- Around line 150-181: Change the broad except Exception blocks in the tokenizer
fallback to catch specific errors: for the AutoTokenizer.from_pretrained block
(around _local_tokenizer_path / AutoTokenizer.from_pretrained) catch
loading-related errors such as OSError, ValueError (e.g., except (OSError,
ValueError) as e) so only model I/O/format issues trigger the fallback; for the
inner fallback that imports transformers/tokenizers and builds
PreTrainedTokenizerFast (around the imports, TkTokenizer, tk.add_tokens,
PreTrainedTokenizerFast, fast_tokenizer.save_pretrained) catch import/runtime
issues such as ImportError, ModuleNotFoundError, RuntimeError (e.g., except
(ImportError, ModuleNotFoundError, RuntimeError) as e) and handle/log the
exception before falling back to the final model_dir.mkdir; ensure each except
binds the exception variable for debugging.

In
`@tests/functional_tests/test_groups/models/ernie/test_ernie45_moe_conversion.py`:
- Around line 109-112: Remove the debug print loop that iterates
model.named_parameters() and prints the first parameter's dtype (the "for name,
param in model.named_parameters(): print(...); break" block) from
test_ernie45_moe_conversion.py; either delete those lines or wrap them behind a
proper debug/logging flag so tests do not emit debug output (ensure no stray
print remains in the test).
- Line 243: Replace the use of "assert False, f'Failed to load created toy MoE
model: {e}'" with a proper pytest failure so failures aren't optimized away;
import pytest if missing and call pytest.fail(f"Failed to load created toy MoE
model: {e}") (or raise AssertionError(f"...")) in the test function, and make
the same replacement for the other occurrences mentioned (the block around the
second failure message at lines 319-322) so both test failures are reported
reliably under pytest and when Python is run with -O.
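The `-O` hazard behind the `assert False` findings above can be shown with a minimal sketch (`require_success` is a hypothetical helper; the real fix calls `pytest.fail(...)`, which raises internally and so behaves the same way):

```python
def require_success(returncode: int, test_name: str) -> None:
    """Fail loudly on a nonzero return code.

    `assert False, msg` is compiled away when Python runs with -O, so the
    test would silently pass. An explicit raise (or pytest.fail) survives
    optimization and always reports the failure message.
    """
    if returncode != 0:
        raise AssertionError(
            f"ERNIE 4.5 VL {test_name} conversion failed with return code {returncode}"
        )
```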

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4b5d32c6-ba49-4551-abc3-4d908e9209e4

📥 Commits

Reviewing files that changed from the base of the PR and between fad15ab and 0de9acf.

📒 Files selected for processing (28)
  • examples/conversion/hf_megatron_roundtrip_multi_gpu.py
  • src/megatron/bridge/models/__init__.py
  • src/megatron/bridge/models/conversion/model_bridge.py
  • src/megatron/bridge/models/conversion/param_mapping.py
  • src/megatron/bridge/models/ernie/__init__.py
  • src/megatron/bridge/models/ernie/ernie_45_bridge.py
  • src/megatron/bridge/models/ernie/ernie_45_provider.py
  • src/megatron/bridge/models/ernie_vl/__init__.py
  • src/megatron/bridge/models/ernie_vl/ernie45_vl_bridge.py
  • src/megatron/bridge/models/ernie_vl/ernie45_vl_provider.py
  • src/megatron/bridge/models/ernie_vl/ernie_decoder_layer_spec.py
  • src/megatron/bridge/models/ernie_vl/ernie_moe_layer.py
  • src/megatron/bridge/models/ernie_vl/modeling_ernie45_vl.py
  • src/megatron/bridge/models/ernie_vl/vision_attention.py
  • src/megatron/bridge/models/ernie_vl/vision_layer_spec.py
  • src/megatron/bridge/models/ernie_vl/vision_model.py
  • src/megatron/bridge/models/ernie_vl/vision_transformer_config.py
  • tests/functional_tests/launch_scripts/active/L1_Launch_models_ernie.sh
  • tests/functional_tests/launch_scripts/active/L1_Launch_models_ernie_vl.sh
  • tests/functional_tests/test_groups/models/ernie/__init__.py
  • tests/functional_tests/test_groups/models/ernie/test_ernie45_moe_conversion.py
  • tests/functional_tests/test_groups/models/ernie_vl/__init__.py
  • tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_fwd_bwd.py
  • tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_logit_compare.py
  • tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_vit_compare.py
  • tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_vit_debug.py
  • tests/functional_tests/test_groups/models/ernie_vl/hf_loss_check.py
  • tests/functional_tests/test_groups/models/ernie_vl/test_ernie45_vl_conversion.py

Comment on lines +207 to +278
class _ConcatBiasMapping(AutoMapping):
    """Mapping for the concatenated text+vision expert bias tensor.

    The on-disk HF format stores a single ``moe_statics.e_score_correction_bias``
    tensor of shape ``[2, num_experts]`` where row 0 is the text pool bias and
    row 1 is the vision pool bias. This mapping extracts the appropriate row
    based on ``slice_name``.

    For export (megatron_to_hf), the text mapping buffers its bias in a
    class-level dict keyed by resolved HF param name. The vision mapping
    retrieves the buffered text bias, stacks them into ``[2, N]``, and
    returns the merged tensor. This ensures only one entry per HF key.
    """

    # Class-level buffer: {resolved_hf_param: text_bias_tensor}
    _export_buffer: Dict[str, torch.Tensor] = {}

    def __init__(self, megatron_param: str, hf_param: str, slice_name: str, num_experts: int):
        super().__init__(megatron_param=megatron_param, hf_param=hf_param)
        self._slice_name = slice_name  # "text" or "vision"
        self._num_experts = num_experts
        self.allow_hf_name_mismatch = True

    def resolve(self, captures: Tuple[str, ...]):
        resolved_megatron_param, resolved_hf_param = self._resolve_names(captures)
        result = _ConcatBiasMapping(
            megatron_param=resolved_megatron_param,
            hf_param=resolved_hf_param,
            slice_name=self._slice_name,
            num_experts=self._num_experts,
        )
        return result

    def hf_to_megatron(self, hf_weights, megatron_module):
        """Extract the text or vision slice from the concatenated bias.

        On-disk shape is [2, num_experts]: row 0 = text, row 1 = vision.
        """
        if self._slice_name == "text":
            sliced = hf_weights[0]
        else:
            sliced = hf_weights[1]
        return super().hf_to_megatron(sliced, megatron_module)

    def megatron_to_hf(self, megatron_weights, megatron_module):
        """Export text+vision expert bias as concatenated [2, N] tensor.

        The text mapping buffers its bias; the vision mapping retrieves it
        and stacks into [2, N]. If the text bias is not yet buffered
        (shouldn't happen in practice), falls back to exporting as-is.
        """
        result = super().megatron_to_hf(megatron_weights, megatron_module)
        if result is None:
            return result

        hf_key = str(self.hf_param)

        if self._slice_name == "text":
            # Buffer the text bias, return empty dict (don't emit yet)
            for _, tensor in result.items():
                _ConcatBiasMapping._export_buffer[hf_key] = tensor
            return {}
        else:
            # Vision: retrieve buffered text bias and merge
            text_bias = _ConcatBiasMapping._export_buffer.pop(hf_key, None)
            if text_bias is not None:
                for _, vision_bias in result.items():
                    merged = torch.stack([text_bias, vision_bias], dim=0)
                    return {hf_key: merged}
            # Fallback: no buffered text bias, just return vision as-is
            return result


⚠️ Potential issue | 🟡 Minor

Class-level _export_buffer may accumulate stale entries across conversions.

The _export_buffer dict is a class attribute (shared across all instances) used to coordinate text/vision bias export. Issues:

  1. Entries are only removed via pop() when vision mapping retrieves them (line 271)
  2. If text-only export runs or export fails partway, text biases remain in the buffer
  3. Subsequent conversions in the same process may see stale data

Consider clearing the buffer at the start of a conversion run, or using instance-level state tied to the conversion context.

🛡️ Suggested defensive approach
 class _ConcatBiasMapping(AutoMapping):
     ...
-    # Class-level buffer: {resolved_hf_param: text_bias_tensor}
-    _export_buffer: Dict[str, torch.Tensor] = {}
+    # Class-level buffer: {resolved_hf_param: text_bias_tensor}
+    # Note: Should be cleared at the start of each conversion run to avoid stale state.
+    _export_buffer: Dict[str, torch.Tensor] = {}
+
+    @classmethod
+    def clear_export_buffer(cls):
+        """Clear the export buffer. Call before starting a new conversion."""
+        cls._export_buffer.clear()

Then call _ConcatBiasMapping.clear_export_buffer() at the start of the conversion process.
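A toy illustration of the leak and the defensive fix (class and method names are made up; the real class is `_ConcatBiasMapping` and the values are tensors, not ints):

```python
class ToyConcatMapping:
    """Toy stand-in for _ConcatBiasMapping's text/vision buffering scheme."""

    # Shared by ALL instances and ALL conversion runs in the process.
    _export_buffer: dict[str, int] = {}

    def export_text(self, key: str, bias: int) -> dict:
        # Buffer the text bias; emit nothing yet.
        ToyConcatMapping._export_buffer[key] = bias
        return {}

    def export_vision(self, key: str, bias: int) -> dict:
        # Pop the buffered text bias and merge; fall back if it is absent.
        text = ToyConcatMapping._export_buffer.pop(key, None)
        return {key: (text, bias)} if text is not None else {key: bias}

    @classmethod
    def clear_export_buffer(cls) -> None:
        """Call before each conversion run so a failed or text-only run
        cannot leak stale biases into the next run."""
        cls._export_buffer.clear()
```

If a run exports the text side and then aborts before the vision side, the entry stays in `_export_buffer`; clearing at the start of the next run restores a clean state.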

🧰 Tools
🪛 Ruff (0.15.9)

[warning] 222-222: Mutable default value for class attribute

(RUF012)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/ernie_vl/ernie45_vl_bridge.py` around lines 207 -
278, The class-level _export_buffer on _ConcatBiasMapping can retain stale text
biases between conversion runs; add a defensive classmethod like
clear_export_buffer(cls) that empties _export_buffer and call it at the start of
the export workflow (before any megatron_to_hf calls) so each conversion run
starts with a fresh buffer; update _ConcatBiasMapping to provide
clear_export_buffer and ensure the export orchestration invokes
_ConcatBiasMapping.clear_export_buffer() before exporting so megatron_to_hf (and
its pop usage) cannot leak state across runs.

Comment on lines +117 to +131
if not hasattr(self, "inv_freq"):
    inv_freq = 1.0 / (
        self.theta
        ** (
            torch.arange(
                0, self.dim, 2, dtype=torch.float,
                device=torch.cuda.current_device(),
            )
            / self.dim
        )
    )
    self.register_buffer("inv_freq", inv_freq, persistent=False)
seq = torch.arange(seqlen, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
freqs = torch.outer(seq, self.inv_freq)
return freqs

⚠️ Potential issue | 🟡 Minor

Potential issue: torch.cuda.current_device() before CUDA context.

Using torch.cuda.current_device() in forward() assumes CUDA is already initialized and a device is set. If forward() is called before the model is moved to CUDA (e.g., during CPU-based weight loading), this could fail. Consider deriving the device from the input tensor instead.

🛡️ Proposed fix
     def forward(self, seqlen: int) -> torch.Tensor:
         """Compute frequency table for positions 0..seqlen-1.
         ...
         """
         if not hasattr(self, "inv_freq"):
+            # Use the device of the module's parameters if available, fallback to CUDA
+            device = next(self.parameters(), torch.empty(0)).device
+            if device.type == "meta":
+                device = torch.cuda.current_device()
             inv_freq = 1.0 / (
                 self.theta
                 ** (
                     torch.arange(
                         0, self.dim, 2, dtype=torch.float,
-                        device=torch.cuda.current_device(),
+                        device=device,
                     )
                     / self.dim
                 )
             )
             self.register_buffer("inv_freq", inv_freq, persistent=False)

Note: Since ErnieVisionRotaryEmbedding has no parameters, an alternative would be to accept device as an argument to forward().

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if not hasattr(self, "inv_freq"):
    inv_freq = 1.0 / (
        self.theta
        ** (
            torch.arange(
                0, self.dim, 2, dtype=torch.float,
                device=torch.cuda.current_device(),
            )
            / self.dim
        )
    )
    self.register_buffer("inv_freq", inv_freq, persistent=False)
seq = torch.arange(seqlen, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
freqs = torch.outer(seq, self.inv_freq)
return freqs

if not hasattr(self, "inv_freq"):
    # Use the device of the module's parameters if available, fallback to CUDA
    device = next(self.parameters(), torch.empty(0)).device
    if device.type == "meta":
        device = torch.cuda.current_device()
    inv_freq = 1.0 / (
        self.theta
        ** (
            torch.arange(
                0, self.dim, 2, dtype=torch.float,
                device=device,
            )
            / self.dim
        )
    )
    self.register_buffer("inv_freq", inv_freq, persistent=False)
seq = torch.arange(seqlen, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
freqs = torch.outer(seq, self.inv_freq)
return freqs
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/ernie_vl/vision_model.py` around lines 117 - 131,
The code uses torch.cuda.current_device() when creating self.inv_freq which can
fail before CUDA is initialized; update the implementation to derive the device
from the input tensor (or a passed-in device) instead of calling
torch.cuda.current_device(). Locate where inv_freq is created (the block that
registers "inv_freq" and the subsequent seq = torch.arange(...)) and replace the
device argument with something like device = input.device (or a provided device
parameter) and use that device for torch.arange and torch.arange(...,
device=device). Ensure self.register_buffer("inv_freq", inv_freq,
persistent=False) still stores the buffer and that subsequent uses of
self.inv_freq.device remain valid.

Comment on lines +1 to +16
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#!/bin/bash
set -xeuo pipefail # Exit immediately if a command exits with a non-zero status

⚠️ Potential issue | 🔴 Critical

Move shebang to line 1 - shell scripts require it as the first line.

The shebang #!/bin/bash is on line 15 after the copyright header. For shell scripts to execute properly, the shebang must be the very first line. The copyright header should follow the shebang.

Suggested fix
+#!/bin/bash
 # Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 ...
 # limitations under the License.

-#!/bin/bash
 set -xeuo pipefail # Exit immediately if a command exits with a non-zero status
🧰 Tools
🪛 Shellcheck (0.11.0)

[error] 15-15: The shebang must be on the first line. Delete blanks and move comments.

(SC1128)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/launch_scripts/active/L1_Launch_models_ernie_vl.sh`
around lines 1 - 16, Move the shebang (#!/bin/bash) to be the first line of the
script so the kernel uses the correct interpreter; specifically, place the
existing shebang above the copyright header and ensure the existing "set -xeuo
pipefail" and all other header comments follow it unchanged in the script
"L1_Launch_models_ernie_vl.sh".

-v -s -x \
-m "not pleasefixme" \
--tb=short -rA \
tests/functional_tests/test_groups/models/ernie/test_ernie45_moe_conversion.py

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if cicd-main.yml includes the new ERNIE launcher
rg -n "L1_Launch_models_ernie" .github/workflows/cicd-main.yml || echo "WARNING: Launcher not found in CI workflow"

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 113


Add this launcher to the CI workflow job matrix.

Per repository guidelines, new L1 test launcher scripts must be added to .github/workflows/cicd-main.yml. The launcher L1_Launch_models_ernie is not currently included in the workflow.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/launch_scripts/active/L1_Launch_models_ernie.sh` at
line 16, The CI workflow is missing the new L1 launcher L1_Launch_models_ernie,
so update .github/workflows/cicd-main.yml to include this launcher in the job
matrix used for L1 test launchers; locate the matrix entry (e.g.,
matrix.launcher or matrix.include block used by the job that runs launch
scripts) and add an entry for "L1_Launch_models_ernie" referencing the launcher
name and optionally the test path
tests/functional_tests/test_groups/models/ernie/test_ernie45_moe_conversion.py
so the new script is executed in the CI matrix.

Comment on lines +186 to +187
mg_cos = mg_emb.cos().flatten()
mg_sin = mg_emb.sin().flatten()

⚠️ Potential issue | 🟡 Minor

Remove or use unused variable mg_sin.

mg_sin is computed but never used. Either remove it or add a comparison with hf_sin for completeness.

Suggested fix (remove unused variable)
         mg_cos = mg_emb.cos().flatten()
-        mg_sin = mg_emb.sin().flatten()
🧰 Tools
🪛 Flake8 (7.3.0)

[error] 187-187: local variable 'mg_sin' is assigned to but never used

(F841)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_vit_debug.py`
around lines 186 - 187, The variable mg_sin is computed but never used; either
remove the mg_sin calculation or use it by comparing to hf_sin like you do for
mg_cos/hf_cos: update the mg_sin line in the block where mg_emb is processed
(mg_cos, mg_sin, mg_emb) to either delete the mg_sin = mg_emb.sin().flatten()
line or replace it with a comparison/assertion against hf_sin (mirroring the
existing mg_cos vs hf_cos check) so mg_sin is no longer an unused variable.
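If the second option is taken, the completed check would mirror the existing cos comparison. A minimal dependency-free sketch of the pattern (plain-Python cosine similarity standing in for `F.cosine_similarity`; the `mg_sin`/`hf_sin` names come from the debug script, the sample values are illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two flat vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Compare both halves of the rotary embedding, mirroring the existing
# mg_cos vs hf_cos check (values here are a stand-in for real embeddings).
angles = (0.0, 0.5, 1.0, 1.5)
mg_cos = [math.cos(t) for t in angles]
hf_cos = [math.cos(t) for t in angles]
mg_sin = [math.sin(t) for t in angles]
hf_sin = [math.sin(t) for t in angles]

assert cosine_similarity(mg_cos, hf_cos) > 0.999
assert cosine_similarity(mg_sin, hf_sin) > 0.999  # the previously missing check
```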

Comment on lines +311 to +316
cos_sim_ln = F.cosine_similarity(
hf_normed.float().flatten().unsqueeze(0),
# We can't easily extract just the LN output from the fused module
# so just compare QKV output
torch.zeros(1).unsqueeze(0), # placeholder
).item()

⚠️ Potential issue | 🟡 Minor

Remove incomplete/dead code for cos_sim_ln.

This code computes cosine similarity against a placeholder torch.zeros(1), which produces a meaningless result, and the variable is never used. The comment even acknowledges the limitation. Consider removing this entirely or implementing a proper comparison.

Suggested fix
-    cos_sim_ln = F.cosine_similarity(
-        hf_normed.float().flatten().unsqueeze(0),
-        # We can't easily extract just the LN output from the fused module
-        # so just compare QKV output
-        torch.zeros(1).unsqueeze(0),  # placeholder
-    ).item()
-
     cos_sim_qkv = F.cosine_similarity(
🧰 Tools
🪛 Flake8 (7.3.0)

[error] 311-311: local variable 'cos_sim_ln' is assigned to but never used

(F841)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/test_groups/models/ernie_vl/ernie45_vl_vit_debug.py`
around lines 311 - 316, The snippet computing cos_sim_ln using
F.cosine_similarity against a placeholder torch.zeros(1) is dead/misleading and
unused; remove the entire cos_sim_ln calculation (the F.cosine_similarity call
and the placeholder torch.zeros(1) argument) and any references to cos_sim_ln,
or if you intend a real check, replace the placeholder with a proper tensor
extracted for comparison (e.g., the actual LN output) and use hf_normed and
F.cosine_similarity accordingly; locate the code around the cos_sim_ln variable
and hf_normed to apply the change.

@bo-ke bo-ke force-pushed the ernie-upstream branch 2 times, most recently from 3c27fef to 590bb94 Compare April 10, 2026 10:12
@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Apr 13, 2026
@yaoyu-33 yaoyu-33 added the area:model Model implementations and HF bridge logic label Apr 13, 2026

@cuichenx cuichenx left a comment


Thanks for the contribution! Some comments

Comment thread src/megatron/bridge/models/conversion/model_bridge.py
hf_expert_number = int(hf_expert_match.group(1))
pool_offset = hf_expert_number - local_expert_number
else:
pool_offset = 0

can we create a custom mapping class for dual-pool MoE, instead of modifying param_mapping.py
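For readers following the thread: the offset arithmetic in the quoted snippet maps a local (per-pool) expert index back to the HF checkpoint's flat expert numbering. A self-contained sketch of that logic (hypothetical helper name and regex; not the actual Megatron-Bridge API):

```python
import re

def pool_offset(hf_param_name, local_expert_number):
    """Difference between the HF global expert id and the local expert id.

    Mirrors the reviewed snippet: if the HF parameter name carries an
    expert number, the offset is global - local; otherwise it is 0.
    ERNIE VL stores text and vision experts in one flat HF list, so a
    vision-pool expert's global id is shifted past the text pool.
    """
    m = re.search(r"experts\.(\d+)\.", hf_param_name)
    if m is not None:
        hf_expert_number = int(m.group(1))
        return hf_expert_number - local_expert_number
    return 0

# Example: HF expert 70 is local expert 6 of a vision pool starting at 64.
assert pool_offset("mlp.experts.70.up_proj.weight", 6) == 64
# Non-expert params get no offset.
assert pool_offset("mlp.gate.weight", 0) == 0
```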

for task in self._with_progress_tracking(hf_to_megatron_tasks, description):
# None task means no mapping exists for this param (e.g. MTP layers without bridge mappings)
if task is None:
continue

If a task is None, this usually means a weight has no corresponding mapping, so a mapping should be added. Is the case different here?


# MoE runtime settings
# NOTE: moe_grouped_gemm=False uses SequentialMLP (per-expert forward);
# True uses TEGroupedMLP which can produce NaN with certain TE versions.

TEGroupedMLP can produce NaN with certain TE versions.

Can you reproduce this issue with other models? TEGroupedMLP is used across most models in Megatron Bridge, so it is heavily tested. Please file an issue if you can reproduce this error with an existing model and we'd love to take a look.



@dataclass
class Ernie45ModelProvider(GPTModelProvider):

we can remove provider now, just implement overrides in the provider_bridge function

Comment thread examples/models/vlm/ernie_vl/ernie45_vl_fwd_bwd.py
@chtruong814 chtruong814 removed the needs-follow-up Issue needs follow-up label Apr 13, 2026
@bo-ke

bo-ke commented Apr 14, 2026

Thanks for the contribution! Some comments


Thanks @cuichenx for the thorough review!

We'll address all your comments in a follow-up commit:

  1. SequentialMLP (model_bridge.py L225): ERNIE VL's dual-pool MoE uses SequentialMLP
    internally (ErnieMoELayer doesn't fit TEGroupedMLP's single-tensor interface). For
    text-only ERNIE, we observed NaN with TEGroupedMLP + sigmoid routing + expert bias
    — will investigate and file an issue if reproducible with existing models. Once
    resolved, we'll switch to GroupedMLP.

  2. Custom mapping class (param_mapping.py L709): Agreed — will move the
    pool_offset logic into a custom subclass in the ERNIE VL bridge instead of
    modifying param_mapping.py.

  3. task is None guard (model_bridge.py L930): This handles a specific case:
    ERNIE HF config stores num_nextn_predict_layers but ships no MTP weights, so
    tasks for those params resolve to None (no registered mapping). We've already set
    mtp_num_layers=None in provider_bridge to avoid creating MTP layers — will
    improve the comment to make this clearer.


  4. Remove Provider class: Will refactor to fold overrides directly into
    provider_bridge, following the Qwen3MoE pattern.

  5. Modeling folder: Will consolidate ernie_moe_layer.py and related files into
    a modeling_ernie45_vl.py following the qwen_vl pattern.

  6. Verification scripts: Will move ernie45_vl_fwd_bwd.py /
    ernie45_vl_logit_compare.py to examples/conversion/ and keep only the actual
    roundtrip tests in the test folder.


@bo-ke bo-ke force-pushed the ernie-upstream branch 2 times, most recently from ffc045c to 8da82d0 Compare April 14, 2026 14:30
@bo-ke

bo-ke commented Apr 14, 2026

Hi @cuichenx , all comments have been addressed in the latest commit. Here's a summary:

  1. TEGroupedMLP / NaN issue (ernie_45_bridge.py): Switched to moe_grouped_gemm=True +
    moe_permute_fusion=True, aligned with DeepSeek V3 pattern. The NaN was likely caused by the
    moe_permute_fusion=False + moe_grouped_gemm=True combination together with our specific TE
    version and environment. Could not reproduce in a toy model with random weights, but the fix
    aligns with the proven DeepSeek V3 configuration. Also removed the explicit
    gradient_accumulation_fusion override — now uses the auto-detected default from
    GPTModelProvider.

  2. Custom mapping class for dual-pool MoE (param_mapping.py): Moved all pool_offset logic out
    of base param_mapping.py into a new _DualPoolExpertMixin in ernie45_vl_bridge.py. The base
    param_mapping.py now has zero ERNIE-specific references.

  3. task=None comment (model_bridge.py): Improved the comment to explain this is expected for
    MTP layers where HF config declares num_nextn_predict_layers but ships no weights.

  4. NaN with TEGroupedMLP (ernie_45_bridge.py): Could not reproduce with other models. The
    issue was specific to ERNIE's sigmoid routing + expert bias combination when
    moe_permute_fusion=False. Fixed by enabling both flags.

  5. Remove provider class (ernie_45_provider.py): Deleted Ernie45ModelProvider, folded all
    config into provider_bridge().

  6. Modeling files in separate folder (ernie_vl): Moved all VL modeling files into
    modeling_ernie45_vl/ subpackage.

  7. Verification scripts (examples): Moved from tests/functional_tests/ to
    examples/models/vlm/ernie_vl/.

Ready for re-review when you have time. Thanks!
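To make the routing discussed in items 1 and 4 concrete, here is a toy sketch of sigmoid routing with an additive expert bias for aux-free load balancing (pure Python on scalar scores; the real router operates on tensors inside Megatron-Core, and the numbers below are illustrative):

```python
import math

def sigmoid_route(logits, expert_bias, top_k):
    """Select top_k experts by biased sigmoid score; weight by unbiased score.

    The bias only influences *selection* (steering load toward
    under-used experts); the combine weights use the raw sigmoid
    probabilities, as in aux-loss-free balancing schemes.
    """
    probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    biased = [p + b for p, b in zip(probs, expert_bias)]
    top = sorted(range(len(logits)), key=lambda i: biased[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in top)
    return {i: probs[i] / total for i in top}

# 4 experts, top-2 routing; the bias nudges selection toward expert 3.
w = sigmoid_route([2.0, 1.0, 0.5, 0.9], [0.0, 0.0, 0.0, 0.5], top_k=2)
assert set(w) == {0, 3}
assert abs(sum(w.values()) - 1.0) < 1e-9  # combine weights renormalized
```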

@bo-ke bo-ke requested a review from cuichenx April 14, 2026 14:42
@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Apr 17, 2026
@huvunvidia

/ok to test 7408a09

@chtruong814 chtruong814 removed the needs-follow-up Issue needs follow-up label Apr 20, 2026
@huvunvidia

Hi @bo-ke,
Can you help check these PR blockers?

  • The failed CI tests
  • Merge conflict

Thank you.

@chtruong814 chtruong814 added the waiting-on-customer Waiting on the original author to respond label Apr 21, 2026
@bo-ke

bo-ke commented Apr 22, 2026

Hi @bo-ke , Can you help check these PR blockers?

  • The failed CI tests
  • Merge conflict
    Thank you.

Hi @huvunvidia,

Thanks for the reminder! I've checked the PR status and will address the CI failures and the merge conflict, then update the PR shortly.

…n-Core

Add Megatron-Bridge support for ERNIE 4.5 models:

Text-only MoE bridge (Ernie4_5_MoeForCausalLM / ERNIE-4.5-21B-A3B-PT):
- Single-pool MoE: 64 experts, top-6 routing, shared experts
- Mixed dense/MoE layers via moe_layer_freq (layer 0 dense, rest MoE)
- Custom decoder block spec respecting moe_layer_freq
- Expert bias for aux-free load balancing
- Sigmoid routing with expert bias for load balancing
- Interleaved RoPE (base=500000), RMSNorm, SiLU-gated MLP
- Full roundtrip conversion tests (TP/PP parallelism)

VL bridge (Ernie4_5_VLForConditionalGeneration / ERNIE-4.5-VL-28B-A3B):
- Dual-pool MoE with separate vision/text expert pools
- Custom ErnieVisionModel with pooled attention
- ErnieMoELayer with expert bias and sigmoid routing
- Custom decoder layer spec for mixed dense/MoE architectures
- Forward/backward pass and logit comparison test scripts
- Full roundtrip conversion tests (TP/PP parallelism)

Shared infrastructure improvements:
- SequentialMLP EP support in model_bridge.py
- gate.weight precision handling in roundtrip script
- is_expert pattern and EP=1 fast path in param_mapping.py
- PP-safe mapping classes for dense/MoE mixed architectures

Verified: Text cosine similarity = 0.999581, VL cosine similarity = 0.999563.

Signed-off-by: bo-ke <kebo01@baidu.com>
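The "layer 0 dense, rest MoE" pattern in the commit message corresponds to a per-layer MoE mask. A hedged sketch of how such a `moe_layer_freq` list can be derived (illustrative only; the layer count and helper name are assumptions, not the actual bridge code):

```python
def build_moe_layer_pattern(num_layers, first_k_dense):
    """Return a 0/1 list: 0 = dense MLP layer, 1 = MoE layer.

    ERNIE 4.5 keeps the first layer(s) dense and uses MoE for the rest;
    Megatron-Core accepts such a per-layer list as moe_layer_freq.
    """
    return [0 if i < first_k_dense else 1 for i in range(num_layers)]

# Hypothetical 28-layer stack with only layer 0 dense.
pattern = build_moe_layer_pattern(num_layers=28, first_k_dense=1)
assert pattern[0] == 0     # layer 0 is dense
assert all(pattern[1:])    # every remaining layer is MoE
assert len(pattern) == 28
```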
@bo-ke bo-ke force-pushed the ernie-upstream branch 4 times, most recently from 46e86b9 to 6dccad0 Compare April 22, 2026 10:37
@bo-ke

bo-ke commented Apr 22, 2026

Hi @huvunvidia,

I've pushed fixes addressing the CI failures and merge conflicts. Verified locally that the affected
tests pass (toy model creation + EP conversion on 2 GPUs). Lint/format also clean.

Could you trigger CI again? /ok to test 6dccad0

@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the waiting-on-customer Waiting on the original author to respond label Apr 22, 2026
@huvunvidia

/ok to test bda0f89

@huvunvidia huvunvidia added the needs-review PR is ready for code review and waiting on a reviewer label Apr 22, 2026
bo-ke added 4 commits April 23, 2026 14:52
…ridges

- Remove Ernie45ModelProvider class, fold config into provider_bridge()
- Fix NaN issue: enable moe_grouped_gemm + moe_permute_fusion (aligned with DeepSeek V3)
- Remove gradient_accumulation_fusion override, use auto-detected default
- Move pool_offset logic from base param_mapping.py to _DualPoolExpertMixin in ernie45_vl_bridge.py
- Move VL modeling files into modeling_ernie45_vl/ subpackage
- Move verification scripts from tests/functional_tests/ to examples/models/vlm/ernie_vl/
- Improve task=None comment in model_bridge.py to explain MTP layers scenario
- Clean up is_expert docstring in param_mapping.py (remove ERNIE-specific references)

Signed-off-by: kebo01 <kebo01@baidu.com>
…issues

Add copyright header to ernie45_vl_vit_debug.py, add missing docstrings
for public functions/classes (D103/D101), fix import sorting (I001),
remove unused import (F401), and apply ruff format to all ERNIE files.

Signed-off-by: kebo01 <kebo01@baidu.com>
Fix Expert Parallelism (EP>1) conversion bug for ERNIE 4.5 VL dual-pool
MoE vision experts. The previous _DualPoolExpertMixin had a wrong offset
formula, and _DualPoolAutoMapping was dead code due to AutoMapping's
internal delegation bypassing the mixin override.

Simplify from 5 classes to 2 classes + 2 helper functions:
- _OffsetGatedMLPMapping: for vision gate/up_proj (GatedMLP)
- _OffsetRowParallelMapping: for vision down_proj (replaces AutoMapping
  to avoid delegation bypass)
- _offset_gather_from_ep_ranks(): shared EP gather with pool offset
- _resolve_with_offset(): shared wildcard resolution with HF-side shift

Also fix test tokenizer fallback to use publicly available Qwen/Qwen3-0.6B.

Signed-off-by: kebo01 <kebo01@baidu.com>
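A minimal sketch of the EP-gather idea this commit describes: mapping each expert-parallel rank's local experts to global HF expert ids, shifted by a pool offset so vision-pool experts land past the text pool (hypothetical helper; not the real `_offset_gather_from_ep_ranks` signature):

```python
def global_expert_ids(ep_rank, num_local_experts, pool_offset):
    """Global HF expert ids owned by one EP rank, shifted into its pool.

    With expert parallelism, each rank holds a contiguous slice of a
    pool's experts; pool_offset shifts a vision-pool slice past the
    text pool in the flat HF expert numbering.
    """
    start = ep_rank * num_local_experts
    return [pool_offset + start + i for i in range(num_local_experts)]

# Text pool (offset 0) and vision pool (offset 64) across EP=2,
# with 32 local experts per rank for each 64-expert pool.
assert global_expert_ids(0, 32, 0)[:3] == [0, 1, 2]
assert global_expert_ids(1, 32, 0)[0] == 32      # rank 1's slice starts at 32
assert global_expert_ids(0, 32, 64)[0] == 64     # vision pool starts past text pool
```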
Update test_transpose_non_rank_zero_hf_to_megatron to expect permuted
tensors on non-rank-0, matching the behavior change where permute_dims
is applied on all ranks for correct ReplicatedMapping shape handling.

Signed-off-by: kebo01 <kebo01@baidu.com>
@bo-ke

bo-ke commented Apr 23, 2026

/ok to test bda0f89

Hi @huvunvidia,

Thanks for triggering CI! There was a unit test failure (test_transpose_non_rank_zero_hf_to_megatron) caused by the permute_dims behavior change — I updated the test to match the new behavior (permute applied on all ranks).

Please review when you have a chance. Thanks!


Labels

area:model Model implementations and HF bridge logic community-request needs-review PR is ready for code review and waiting on a reviewer
