
feat(vlm): add Nemotron-Omni RADIO post-load patches#2311

Open
yuekaizhang wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
yuekaizhang:n3-omni-fix

@yuekaizhang
Contributor

This PR makes two improvements to nemotron-3-omni:

  • enable_radio_vit_fused_attn(): route RADIO timm ViT attention through F.scaled_dot_product_attention so the (B, H, seq, seq) attention tensor (~5 GiB per block at RADIO-v2-H with dynamic-resolution patch counts) is not materialized.
  • apply_parameter_freezing(): new freeze_video_embedder knob (default False). patch_generator.video_embedder is only exercised on video inputs; during image-only training it sits in the optimizer without state (no grad → no lazy state init), so dcp.load on resume raises a missing-key error. The knob is independent of freeze_vision_tower, so the image encoder can stay trainable while the video branch is frozen.
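The attention-score tensor grows with the square of the sequence length, which is why it dominates memory at dynamic-resolution patch counts. The sketch below is plain arithmetic under assumed values (batch size 1, 16 attention heads, 2-byte bf16 elements, and a few illustrative sequence lengths); none of these numbers come from the PR, and attn_scores_bytes is a hypothetical helper, not part of NeMo or timm. The eager path materializes at least one such tensor per attention block, whereas F.scaled_dot_product_attention can avoid storing it when PyTorch selects a fused (flash or memory-efficient) backend.

```python
# Hypothetical sizing helper (not part of NeMo or timm): bytes needed to hold
# one (B, H, seq, seq) attention-score tensor, which the eager attention path
# builds before softmax and a fused SDPA kernel avoids storing.
GIB = 1024 ** 3


def attn_scores_bytes(batch: int, heads: int, seq: int, bytes_per_elem: int = 2) -> int:
    """Size of one (batch, heads, seq, seq) tensor; 2 bytes/elem assumes bf16/fp16."""
    return batch * heads * seq * seq * bytes_per_elem


if __name__ == "__main__":
    # Assumed values for illustration only: batch 1, 16 heads, and a few
    # dynamic-resolution sequence lengths. None of these come from the PR.
    for seq in (4096, 8192, 12800):
        gib = attn_scores_bytes(batch=1, heads=16, seq=seq) / GIB
        print(f"seq={seq:>6}: {gib:.2f} GiB per score tensor")
```

Under these assumptions, doubling the sequence length quadruples the tensor, and 12,800 tokens already needs about 4.9 GiB for a single score tensor, the same order as the ~5 GiB figure quoted above.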

@copy-pr-bot

copy-pr-bot Bot commented May 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

HuiyingLi previously approved these changes May 25, 2026
Contributor

@HuiyingLi HuiyingLi left a comment

LGTM, thank you @yuekaizhang!

@HuiyingLi
Contributor

/ok to test 29bae71

@HuiyingLi
Contributor

Hi @yuekaizhang, could you please fix the CI errors? Thank you~

- enable_radio_vit_fused_attn(): route RADIO timm ViT attention through
  F.scaled_dot_product_attention so the (B, H, seq, seq) attention tensor
  (~5 GiB per block at RADIO-v2-H with dynamic-resolution patch counts) is
  not materialized. This mirrors the Megatron-Bridge path's
  vision_config.use_flash_attn=True. No-op on non-RADIO models; invoked
  unconditionally from apply_model_infrastructure().
- apply_parameter_freezing(): new freeze_video_embedder knob (default
  False). patch_generator.video_embedder is only exercised on video
  inputs; during image-only training it sits in the optimizer without
  state (no grad → no lazy state init), so dcp.load on resume raises a
  missing-key error. The knob is independent of freeze_vision_tower, so
  the image encoder can stay trainable while the video branch is frozen.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: root <zhangyuekai@foxmail.com>
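The optimizer-state gap behind the dcp.load error can be reproduced on a toy module. This is a hypothetical sketch: ToyPatchGenerator, freeze_by_prefix, optimizer_coverage, and run_image_only_step are illustrative names, the frozen prefix is video_embedder rather than patch_generator.video_embedder, and it does not call dcp.load. It only shows that torch.optim.AdamW creates per-parameter state lazily, so a parameter that never receives a gradient is registered with the optimizer but has no state to checkpoint.

```python
import torch
from torch import nn


class ToyPatchGenerator(nn.Module):
    """Hypothetical stand-in: an image embedder plus a video-only embedder."""

    def __init__(self) -> None:
        super().__init__()
        self.image_embedder = nn.Linear(8, 4)
        self.video_embedder = nn.Linear(8, 4)

    def forward(self, x: torch.Tensor, is_video: bool = False) -> torch.Tensor:
        # The video branch only runs on video inputs.
        return self.video_embedder(x) if is_video else self.image_embedder(x)


def freeze_by_prefix(model: nn.Module, prefix: str) -> None:
    """Hypothetical helper: stop gradients for every parameter under `prefix`."""
    for name, param in model.named_parameters():
        if name.startswith(prefix):
            param.requires_grad_(False)


def optimizer_coverage(
    model: nn.Module, opt: torch.optim.Optimizer
) -> tuple[set[str], set[str]]:
    """Return (names handed to the optimizer, names that have optimizer state)."""
    ids_in_opt = {id(p) for group in opt.param_groups for p in group["params"]}
    in_opt = {n for n, p in model.named_parameters() if id(p) in ids_in_opt}
    # opt.state is a defaultdict; .get() avoids creating empty entries.
    with_state = {n for n, p in model.named_parameters() if opt.state.get(p)}
    return in_opt, with_state


def run_image_only_step(freeze_video: bool) -> tuple[set[str], set[str]]:
    torch.manual_seed(0)
    model = ToyPatchGenerator()
    if freeze_video:
        freeze_by_prefix(model, "video_embedder")
    # Only trainable parameters are passed to the optimizer.
    trainable = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(trainable, lr=1e-3)
    model(torch.randn(2, 8)).sum().backward()  # image-only batch
    opt.step()  # AdamW initializes state only for params whose .grad is not None
    return optimizer_coverage(model, opt)


if __name__ == "__main__":
    for freeze in (False, True):
        in_opt, with_state = run_image_only_step(freeze)
        print(f"freeze_video={freeze}: no state for {sorted(in_opt - with_state)}")
```

Without freezing, the video embedder's weight and bias are in the optimizer but have no state, which is the mismatch a checkpoint loader can trip over on resume. Freezing the branch before building the optimizer, assuming the optimizer is built only from trainable parameters, keeps the two sets in agreement.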
@yuekaizhang
Contributor Author

/ok to test 60760f8

