Skip to content

fix(data): adjust image tokens per-row in nemotron_omni collate at mbs>1#4407

Open
huvunvidia wants to merge 2 commits into
mainfrom
huvu/nemotron_omni_hf_dataset_fix
Open

fix(data): adjust image tokens per-row in nemotron_omni collate at mbs>1#4407
huvunvidia wants to merge 2 commits into
mainfrom
huvu/nemotron_omni_hf_dataset_fix

Conversation

@huvunvidia

Copy link
Copy Markdown
Contributor

What

Fixes a batch-collapse bug in nemotron_omni_collate_fn that triggers at micro_batch_size > 1 with the dynamic-resolution image processor.

The core bug

adjust_image_tokens (in nemotron_vl_utils.py) is a single-sequence helper — it reads token positions from input_ids[0] and its removal branch ends with input_ids[keep_mask].unsqueeze(0). On a (B, S) tensor, boolean-mask indexing flattens row-major over all rows, so .unsqueeze(0) yields (1, ΣS). Calling it once on the whole batch silently fused all samples into one causal sequence — cross-sample attention and continuous position_ids (the (1, 696)-with-two-samples symptom). Correct at B=1, broken at B>1.

The fix (collate_fn.py)

Leave the helper alone; fix the caller. Gate a per-row path on exactly the broken case (is_dynamic_res_processor and not is_video and bsz > 1); everything else keeps the original single call. Per row: slice this row's tile counts off the flat num_tiles_for_adjust with a cursor, run adjust_image_tokens on a genuine (1, L) row (flatten is a no-op), then re-pad to (B, new_max) and re-stack so samples stay independent.

Second-order subtlety: each row is sliced to its true pre-pad length (ids_list[b].shape[0]) before adjusting — padded_ids was padded to the longest raw row, whose length is inflated by the biggest image's un-trimmed <image> run. Adjusting the padded row would trim <image> tokens but leave trailing pad, keeping new_max huge. Pad can't be stripped by value because pad == eos == <|im_end|> appears mid-sequence.

Downstream consequences handled

  1. position_ids now uses .contiguous().expand() gives stride-0 rows (shared storage) that break pin_memory ("more than one element refers to a single memory location") once batch_size > 1.
  2. The pack branch is now reachable at B>1, which exposes a pre-existing cu_seqlens-vs-vision-fanout NaN. That is a separate limitation — keep PACKED_SEQ=false for this model.

Scope

Only the dynamic-res, non-video, B>1 path changed. B==1, the single-example video path, and the static-tile processor are byte-for-byte unchanged (original single call).

🤖 Generated with Claude Code

`adjust_image_tokens` is a single-sequence helper: it reads token positions
from `input_ids[0]` only, and its removal branch ends with
`input_ids[keep_tokens_mask].unsqueeze(0)`. On a `(B, S)` tensor, boolean-mask
indexing flattens row-major over all rows, so `.unsqueeze(0)` yields
`(1, ΣS)`. Calling it once on the whole `(B, max_len)` batch therefore fused
all samples into a single causal sequence with cross-sample attention and
continuous position_ids. Correct at B=1, broken at B>1.

Fix the caller, not the helper: gate a per-row path on exactly the broken
case (`is_dynamic_res_processor and not is_video and bsz > 1`). Per row, slice
that row's image count off the flat `num_tiles_for_adjust` with a cursor, run
`adjust_image_tokens` on a genuine `(1, L)` row (where the flatten is a no-op),
then re-pad to a rectangular `(B, new_max)` batch so samples stay independent.
Every other path (B==1, video, static-tile) keeps the original single call.

Each row is sliced to its true pre-pad length (`ids_list[b].shape[0]`) before
adjusting: `padded_ids` was padded to the longest raw row, whose length is
inflated by the biggest image's un-trimmed <image> run, so adjusting the padded
row would trim <image> tokens but leave trailing pad, keeping new_max huge.
Pad can't be stripped by value because pad == eos == <|im_end|> appears
mid-sequence.

Downstream consequences also handled:
- position_ids now uses `.contiguous()`: `.expand()` gives stride-0 rows that
  break `pin_memory` ("more than one element refers to a single memory
  location") once batch_size > 1.
- the pack branch is now reachable at B>1, which exposes a pre-existing
  cu_seqlens-vs-vision-fanout NaN; keep PACKED_SEQ=false for this model.

Signed-off-by: Huy Vu2 <huvu@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 17, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

# row, that padding is carried through and inflates new_max (so every row
# ends up padded). `ids_list` holds the true per-example unpadded lengths
# from the per-example processor calls. (Can't strip by value: pad == eos.)
real_len = ids_list[b].shape[0] if b < len(ids_list) else batch["input_ids"].shape[1]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The defensive guard b < len(ids_list) is always True here — ids_list is built from per_ex_batches which has exactly len(examples) == bsz entries. Not a bug, but worth a comment or just dropping the fallback to avoid suggesting ids_list could be shorter than bsz.

@claude

claude Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Light Review

The fix is correct: adjust_image_tokens indexes input_ids[0] and flattens via boolean mask, so calling it on a (B, S) tensor with B>1 silently fuses all rows into one sequence. The per-row loop with cursor-based tile slicing and re-padding is the right approach.

The .contiguous() addition on position_ids is also necessary -- expand() produces a stride-0 view that breaks pin_memory at B>1.

Observations

  • The guard (b < len(ids_list)) on line 318 is always True since ids_list has exactly bsz entries (both are derived from examples). The fallback branch is dead code. Not a bug, but slightly misleading -- consider dropping the else branch or adding a comment.
  • No unit test covers the new B>1 dynamic-res path. The existing test_nemotron_omni_collate tests all use a single example. A unit test that passes 2+ examples with different image counts through the dynamic-res processor mock would catch future regressions in the per-row logic, cursor tracking, and re-padding.

Suggested test cases

No perf tests impacted.

@huvunvidia huvunvidia added area:data Dataset builders, preprocessing, and samplers bug Something isn't working needs-review PR is ready for code review and waiting on a reviewer labels Jun 17, 2026
@huvunvidia

Copy link
Copy Markdown
Contributor Author

/ok to test 0ddf1d2

@huvunvidia

Copy link
Copy Markdown
Contributor Author

/ok to test 106edc3

@yaoyu-33 yaoyu-33 added the needs-more-tests Requires additional L0 and L1 test coverage before merge label Jun 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:data Dataset builders, preprocessing, and samplers bug Something isn't working needs-more-tests Requires additional L0 and L1 test coverage before merge needs-review PR is ready for code review and waiting on a reviewer

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants