feat(datasets): lmms_video_utils video backend for LLaVA-OneVision-2 codec stream (#183)
* chore(datasets): allow lmms_video_utils as video_backend literal
* feat(datasets): add lmms_video_utils video backend
The new load_video_lmms_video_utils returns (canvases, fps, CodecVideoOutput)
so downstream processors that understand codec metadata can use per-patch
positions and per-canvas timestamps directly. The decoder backend and device
are read from extra_kwargs (video_decode_backend / video_decode_device) and
default to pyav / cpu; 'cuda' resolves to cuda:LOCAL_RANK.
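
The option resolution described above could look roughly like the sketch below. It is not the code from this commit: the helper name resolve_decode_options, the env parameter, and the fallback to rank 0 when LOCAL_RANK is unset are assumptions made for illustration. Only the two extra_kwargs keys and the pyav/cpu defaults come from the message.

```python
import os


def resolve_decode_options(extra_kwargs=None, env=None):
    """Return (backend, device) for the video decoder.

    Hypothetical helper: only the key names and defaults are taken from
    the commit message; everything else is illustrative.
    """
    extra_kwargs = extra_kwargs or {}
    env = os.environ if env is None else env

    backend = extra_kwargs.get("video_decode_backend", "pyav")
    device = extra_kwargs.get("video_decode_device", "cpu")

    if device == "cuda":
        # Pin a bare 'cuda' to this process's local GPU.
        # Assumption: fall back to rank 0 when LOCAL_RANK is not set.
        device = f"cuda:{env.get('LOCAL_RANK', '0')}"

    return backend, device
```

Passing env explicitly keeps the helper easy to test without mutating the real process environment.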
* feat: LlavaOv2IterableDataset + codec metadata in OV2 processor
Adds llava_ov2_iterable dataset that pipes lmms_video_utils canvases
plus their CodecVideoOutput metadata into LlavaOnevision2DataProcessor.
The processor gains a video_metadata kwarg: when supplied, it bypasses
OV2's video_processor and constructs pixel_values_videos /
video_grid_thw / patch_positions / frame_timestamps directly from the
codec output, preserving real source-frame coordinates instead of
falling back to arange(T). The frame-sampling path is untouched.
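
A minimal sketch of the timestamp choice described above, assuming a simplified metadata object. FakeCodecMetadata and select_frame_timestamps are hypothetical names, and the single frame_timestamps field is an assumption rather than the real CodecVideoOutput layout. Plain lists stand in for tensors so the example has no dependencies.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeCodecMetadata:
    """Hypothetical stand-in for CodecVideoOutput (field name assumed)."""

    frame_timestamps: List[float]  # one source timestamp per canvas


def select_frame_timestamps(
    num_canvases: int,
    video_metadata: Optional[FakeCodecMetadata] = None,
) -> List[float]:
    """Prefer real source timestamps; otherwise use 0, 1, ..., T-1."""
    if video_metadata is None:
        # No codec metadata: synthetic indices, the arange(T) fallback.
        return [float(t) for t in range(num_canvases)]

    timestamps = list(video_metadata.frame_timestamps)
    if len(timestamps) != num_canvases:
        raise ValueError(
            f"expected {num_canvases} timestamps, got {len(timestamps)}"
        )
    return timestamps
```

The length check is a design choice of this sketch: silently truncating or padding would misalign canvases and timestamps, which is the failure mode the metadata path is meant to avoid.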
* chore(deps): add lmms-video-utils[all] as optional video extra
Pulled in by [video] and rolled up into [all] so existing 'install
.[all]' workflows get torchcodec-backed codec-stream support for free,
while base installs stay slim.
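
Because the dependency is optional, code that uses it typically guards the import. The sketch below shows one common pattern; require_optional is a hypothetical helper, not something this commit adds, and the import name lmms_video_utils in the usage note is assumed from the package name.

```python
import importlib


def require_optional(module_name: str, extra: str):
    """Import an optional dependency or fail with an install hint.

    Hypothetical helper for illustration only.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is not installed; "
            f"install the '{extra}' extra to enable it."
        ) from exc
```

A call such as require_optional("lmms_video_utils", "video") would then be made lazily inside the backend function, so base installs only fail when the codec-stream path is actually requested.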
* refactor: extract CodecVideoLoadingMixin from LlavaOv2IterableDataset
Move the dataset-level codec video orchestration (collecting canvases
plus their CodecVideoOutput metadata across a message list) into a
reusable CodecVideoLoadingMixin alongside multimodal_mixin. The backend
implementation (load_video_lmms_video_utils) stays in
MultiModalDataLoadingMixin; LlavaOv2IterableDataset now inherits the
mixin and only handles image collection plus processor dispatch.
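
The split described above might be sketched as follows. This is not the actual CodecVideoLoadingMixin: the class name, the method name, the injected load_video callable, and the message schema (a "content" list of items with "type" and "video" keys) are all assumptions. The only detail taken from the message is that the loader returns (canvases, fps, metadata) and that the mixin gathers canvases together with their metadata across a message list.

```python
from typing import Any, Callable, Dict, List, Tuple


class CodecVideoCollectorSketch:
    """Hypothetical mixin: gathers canvases and metadata across messages."""

    def collect_codec_videos(
        self,
        messages: List[Dict[str, Any]],
        load_video: Callable[[str], Tuple[List[Any], float, Any]],
    ) -> Tuple[List[List[Any]], List[Any]]:
        canvases_per_video: List[List[Any]] = []
        metadata_per_video: List[Any] = []

        for message in messages:
            for item in message.get("content", []):
                if item.get("type") != "video":
                    continue  # images and text are handled elsewhere
                # The loader returns (canvases, fps, metadata); fps is
                # not needed for collection, so it is discarded here.
                canvases, _fps, metadata = load_video(item["video"])
                canvases_per_video.append(canvases)
                metadata_per_video.append(metadata)

        return canvases_per_video, metadata_per_video
```

Keeping the two lists index-aligned lets a processor pair each video's canvases with its own metadata without any extra lookup.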