fix: handle internvl_hf video-only inputs and enable frame sampling#1279
Conversation
1835a30 to
24b7ea9
Compare
|
Hi @Luodian! Need for review and tests. |
There was a problem hiding this comment.
Clean and well-scoped fix. Two minor points to consider:
-
Implicit behavioral change: Since
num_framesdefaults to32in__init__, settingdo_sample_frames=Truemeans frame sampling will now actually take effect by default. Previouslynum_frames=32was silently ignored (sincedo_sample_frameswas never set). This is arguably the correct behavior, but worth noting in the PR description as it changes the default video processing pipeline for allinternvl_hfusers on video tasks. -
Validation question: Did you run an end-to-end video eval (e.g., a small subset of VideoMME or similar) to confirm the fix works beyond
py_compileandpre-commit? A quick--limit 4smoke test would be reassuring.
Otherwise LGTM — the visuals = None guard and image_sizes fix are correct and consistent with existing patterns in the file. cc @kcz358
@mwxely Hello, Yes, I did. Actually, I ran it with |
Summary
This PR fixes two
internvl_hfvideo handling bugs inlmms_eval/models/chat/internvl_hf.py:num_framesorfpsis configuredNonefor video-only examplesimage_sizesfrom an empty image listWhy
This PR is motivated by two related
internvl_hfbugs reported upstream:generate_until()passesimages=[]into the processor, and later also assumesvisualsis non-empty when buildingimage_sizesnum_framesis not actually applied, because InternVL video processing keepsdo_sample_frames=Falseby default unless it is explicitly enabledTogether, these two issues make
internvl_hfunreliable on video tasks:video-only samples can raise an
IndexError, and sampled-video runs can still process all frames and overflow the model context length.Closes #1241
Closes #1242
Validation
pre-commit run --files lmms_eval/models/chat/internvl_hf.pypython -m py_compile lmms_eval/models/chat/internvl_hf.py