fix(data): adjust image tokens per-row in nemotron_omni collate at mbs>1 by huvunvidia · Pull Request #4407 · NVIDIA-NeMo/Megatron-Bridge

huvunvidia · 2026-06-17T20:31:18Z

What

Fixes a batch-collapse bug in nemotron_omni_collate_fn that triggers at micro_batch_size > 1 with the dynamic-resolution image processor.

The core bug

adjust_image_tokens (in nemotron_vl_utils.py) is a single-sequence helper — it reads token positions from input_ids[0] and its removal branch ends with input_ids[keep_mask].unsqueeze(0). On a (B, S) tensor, boolean-mask indexing flattens row-major over all rows, so .unsqueeze(0) yields (1, ΣS). Calling it once on the whole batch silently fused all samples into one causal sequence — cross-sample attention and continuous position_ids (the (1, 696)-with-two-samples symptom). Correct at B=1, broken at B>1.

The fix (`collate_fn.py`)

Leave the helper alone; fix the caller. Gate a per-row path on exactly the broken case (is_dynamic_res_processor and not is_video and bsz > 1); everything else keeps the original single call. Per row: slice this row's tile counts off the flat num_tiles_for_adjust with a cursor, run adjust_image_tokens on a genuine (1, L) row (flatten is a no-op), then re-pad to (B, new_max) and re-stack so samples stay independent.

Second-order subtlety: each row is sliced to its true pre-pad length (ids_list[b].shape[0]) before adjusting — padded_ids was padded to the longest raw row, whose length is inflated by the biggest image's un-trimmed <image> run. Adjusting the padded row would trim <image> tokens but leave trailing pad, keeping new_max huge. Pad can't be stripped by value because pad == eos == <|im_end|> appears mid-sequence.

Downstream consequences handled

position_ids now uses .contiguous() — .expand() gives stride-0 rows (shared storage) that break pin_memory ("more than one element refers to a single memory location") once batch_size > 1.
The pack branch is now reachable at B>1, which exposes a pre-existing cu_seqlens-vs-vision-fanout NaN. That is a separate limitation — keep PACKED_SEQ=false for this model.

Scope

Only the dynamic-res, non-video, B>1 path changed. B==1, the single-example video path, and the static-tile processor are byte-for-byte unchanged (original single call).

🤖 Generated with Claude Code

`adjust_image_tokens` is a single-sequence helper: it reads token positions from `input_ids[0]` only, and its removal branch ends with `input_ids[keep_tokens_mask].unsqueeze(0)`. On a `(B, S)` tensor, boolean-mask indexing flattens row-major over all rows, so `.unsqueeze(0)` yields `(1, ΣS)`. Calling it once on the whole `(B, max_len)` batch therefore fused all samples into a single causal sequence with cross-sample attention and continuous position_ids. Correct at B=1, broken at B>1. Fix the caller, not the helper: gate a per-row path on exactly the broken case (`is_dynamic_res_processor and not is_video and bsz > 1`). Per row, slice that row's image count off the flat `num_tiles_for_adjust` with a cursor, run `adjust_image_tokens` on a genuine `(1, L)` row (where the flatten is a no-op), then re-pad to a rectangular `(B, new_max)` batch so samples stay independent. Every other path (B==1, video, static-tile) keeps the original single call. Each row is sliced to its true pre-pad length (`ids_list[b].shape[0]`) before adjusting: `padded_ids` was padded to the longest raw row, whose length is inflated by the biggest image's un-trimmed <image> run, so adjusting the padded row would trim <image> tokens but leave trailing pad, keeping new_max huge. Pad can't be stripped by value because pad == eos == <|im_end|> appears mid-sequence. Downstream consequences also handled: - position_ids now uses `.contiguous()`: `.expand()` gives stride-0 rows that break `pin_memory` ("more than one element refers to a single memory location") once batch_size > 1. - the pack branch is now reachable at B>1, which exposes a pre-existing cu_seqlens-vs-vision-fanout NaN; keep PACKED_SEQ=false for this model. Signed-off-by: Huy Vu2 <huvu@nvidia.com>

copy-pr-bot · 2026-06-17T20:31:22Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

claude · 2026-06-17T20:33:57Z

+                # row, that padding is carried through and inflates new_max (so every row
+                # ends up padded). `ids_list` holds the true per-example unpadded lengths
+                # from the per-example processor calls. (Can't strip by value: pad == eos.)
+                real_len = ids_list[b].shape[0] if b < len(ids_list) else batch["input_ids"].shape[1]


nit: The defensive guard b < len(ids_list) is always True here — ids_list is built from per_ex_batches which has exactly len(examples) == bsz entries. Not a bug, but worth a comment or just dropping the fallback to avoid suggesting ids_list could be shorter than bsz.

claude · 2026-06-17T20:35:05Z

Light Review

The fix is correct: adjust_image_tokens indexes input_ids[0] and flattens via boolean mask, so calling it on a (B, S) tensor with B>1 silently fuses all rows into one sequence. The per-row loop with cursor-based tile slicing and re-padding is the right approach.

The .contiguous() addition on position_ids is also necessary -- expand() produces a stride-0 view that breaks pin_memory at B>1.

Observations

The guard (b < len(ids_list)) on line 318 is always True since ids_list has exactly bsz entries (both are derived from examples). The fallback branch is dead code. Not a bug, but slightly misleading -- consider dropping the else branch or adding a comment.
No unit test covers the new B>1 dynamic-res path. The existing test_nemotron_omni_collate tests all use a single example. A unit test that passes 2+ examples with different image counts through the dynamic-res processor mock would catch future regressions in the per-row logic, cursor tracking, and re-padding.

Suggested test cases

No perf tests impacted.

huvunvidia · 2026-06-17T20:36:53Z

/ok to test 0ddf1d2

Signed-off-by: Huy Vu2 <huvu@nvidia.com>

huvunvidia · 2026-06-17T21:33:40Z

/ok to test 106edc3

claude Bot reviewed Jun 17, 2026

View reviewed changes

huvunvidia added area:data Dataset builders, preprocessing, and samplers bug Something isn't working needs-review PR is ready for code review and waiting on a reviewer labels Jun 17, 2026

copy-pr-bot Bot temporarily deployed to public June 17, 2026 20:37 Inactive

copy-pr-bot Bot temporarily deployed to public June 17, 2026 20:47 Inactive

copy-pr-bot Bot temporarily deployed to public June 17, 2026 21:13 Inactive

style(data): apply ruff-format to collate_fn position_ids block

106edc3

Signed-off-by: Huy Vu2 <huvu@nvidia.com>

yaoyu-33 added the needs-more-tests Requires additional L0 and L1 test coverage before merge label Jun 30, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(data): adjust image tokens per-row in nemotron_omni collate at mbs>1#4407

fix(data): adjust image tokens per-row in nemotron_omni collate at mbs>1#4407
huvunvidia wants to merge 2 commits into
mainfrom
huvu/nemotron_omni_hf_dataset_fix

huvunvidia commented Jun 17, 2026

Uh oh!

copy-pr-bot Bot commented Jun 17, 2026

Uh oh!

claude Bot Jun 17, 2026

Uh oh!

claude Bot commented Jun 17, 2026

Uh oh!

huvunvidia commented Jun 17, 2026

Uh oh!

huvunvidia commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

huvunvidia commented Jun 17, 2026

What

The core bug

The fix (collate_fn.py)

Downstream consequences handled

Scope

Uh oh!

copy-pr-bot Bot commented Jun 17, 2026

Uh oh!

claude Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot commented Jun 17, 2026

Light Review

Uh oh!

huvunvidia commented Jun 17, 2026

Uh oh!

huvunvidia commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

The fix (`collate_fn.py`)