fix(data): adjust image tokens per-row in nemotron_omni collate at mbs>1#4407
fix(data): adjust image tokens per-row in nemotron_omni collate at mbs>1#4407huvunvidia wants to merge 2 commits into
Conversation
`adjust_image_tokens` is a single-sequence helper: it reads token positions
from `input_ids[0]` only, and its removal branch ends with
`input_ids[keep_tokens_mask].unsqueeze(0)`. On a `(B, S)` tensor, boolean-mask
indexing flattens row-major over all rows, so `.unsqueeze(0)` yields
`(1, ΣS)`. Calling it once on the whole `(B, max_len)` batch therefore fused
all samples into a single causal sequence with cross-sample attention and
continuous position_ids. Correct at B=1, broken at B>1.
Fix the caller, not the helper: gate a per-row path on exactly the broken
case (`is_dynamic_res_processor and not is_video and bsz > 1`). Per row, slice
that row's image count off the flat `num_tiles_for_adjust` with a cursor, run
`adjust_image_tokens` on a genuine `(1, L)` row (where the flatten is a no-op),
then re-pad to a rectangular `(B, new_max)` batch so samples stay independent.
Every other path (B==1, video, static-tile) keeps the original single call.
Each row is sliced to its true pre-pad length (`ids_list[b].shape[0]`) before
adjusting: `padded_ids` was padded to the longest raw row, whose length is
inflated by the biggest image's un-trimmed <image> run, so adjusting the padded
row would trim <image> tokens but leave trailing pad, keeping new_max huge.
Pad can't be stripped by value because pad == eos == <|im_end|> appears
mid-sequence.
Downstream consequences also handled:
- position_ids now uses `.contiguous()`: `.expand()` gives stride-0 rows that
break `pin_memory` ("more than one element refers to a single memory
location") once batch_size > 1.
- the pack branch is now reachable at B>1, which exposes a pre-existing
cu_seqlens-vs-vision-fanout NaN; keep PACKED_SEQ=false for this model.
Signed-off-by: Huy Vu2 <huvu@nvidia.com>
| # row, that padding is carried through and inflates new_max (so every row | ||
| # ends up padded). `ids_list` holds the true per-example unpadded lengths | ||
| # from the per-example processor calls. (Can't strip by value: pad == eos.) | ||
| real_len = ids_list[b].shape[0] if b < len(ids_list) else batch["input_ids"].shape[1] |
There was a problem hiding this comment.
nit: The defensive guard b < len(ids_list) is always True here — ids_list is built from per_ex_batches which has exactly len(examples) == bsz entries. Not a bug, but worth a comment or just dropping the fallback to avoid suggesting ids_list could be shorter than bsz.
Light ReviewThe fix is correct: adjust_image_tokens indexes input_ids[0] and flattens via boolean mask, so calling it on a (B, S) tensor with B>1 silently fuses all rows into one sequence. The per-row loop with cursor-based tile slicing and re-padding is the right approach. The .contiguous() addition on position_ids is also necessary -- expand() produces a stride-0 view that breaks pin_memory at B>1. Observations
Suggested test cases No perf tests impacted. |
|
/ok to test 0ddf1d2 |
Signed-off-by: Huy Vu2 <huvu@nvidia.com>
|
/ok to test 106edc3 |
What
Fixes a batch-collapse bug in
nemotron_omni_collate_fnthat triggers atmicro_batch_size > 1with the dynamic-resolution image processor.The core bug
adjust_image_tokens(innemotron_vl_utils.py) is a single-sequence helper — it reads token positions frominput_ids[0]and its removal branch ends withinput_ids[keep_mask].unsqueeze(0). On a(B, S)tensor, boolean-mask indexing flattens row-major over all rows, so.unsqueeze(0)yields(1, ΣS). Calling it once on the whole batch silently fused all samples into one causal sequence — cross-sample attention and continuous position_ids (the(1, 696)-with-two-samples symptom). Correct atB=1, broken atB>1.The fix (
collate_fn.py)Leave the helper alone; fix the caller. Gate a per-row path on exactly the broken case (
is_dynamic_res_processor and not is_video and bsz > 1); everything else keeps the original single call. Per row: slice this row's tile counts off the flatnum_tiles_for_adjustwith a cursor, runadjust_image_tokenson a genuine(1, L)row (flatten is a no-op), then re-pad to(B, new_max)and re-stack so samples stay independent.Second-order subtlety: each row is sliced to its true pre-pad length (
ids_list[b].shape[0]) before adjusting —padded_idswas padded to the longest raw row, whose length is inflated by the biggest image's un-trimmed<image>run. Adjusting the padded row would trim<image>tokens but leave trailing pad, keepingnew_maxhuge. Pad can't be stripped by value becausepad == eos == <|im_end|>appears mid-sequence.Downstream consequences handled
position_idsnow uses.contiguous()—.expand()gives stride-0 rows (shared storage) that breakpin_memory("more than one element refers to a single memory location") oncebatch_size > 1.B>1, which exposes a pre-existing cu_seqlens-vs-vision-fanout NaN. That is a separate limitation — keepPACKED_SEQ=falsefor this model.Scope
Only the dynamic-res, non-video,
B>1path changed.B==1, the single-example video path, and the static-tile processor are byte-for-byte unchanged (original single call).🤖 Generated with Claude Code