feat(datasets): lmms_video_utils video backend for LLaVA-OneVision-2 codec stream #183
New load_video_lmms_video_utils returns (canvases, fps, CodecVideoOutput) so downstream processors aware of codec metadata can use per-patch positions and per-canvas timestamps directly. Decoder backend/device read from extra_kwargs (video_decode_backend / video_decode_device), default pyav/cpu; 'cuda' resolves to cuda:LOCAL_RANK.
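As a rough illustration of the backend/device resolution described above, here is a minimal sketch. The helper name `resolve_decode_settings` and the fallback to rank 0 when `LOCAL_RANK` is unset are assumptions for illustration; only the two `extra_kwargs` keys, the `pyav`/`cpu` defaults, and the `cuda` to `cuda:LOCAL_RANK` rule come from the commit message.

```python
import os


def resolve_decode_settings(extra_kwargs=None, environ=None):
    """Return (backend, device) for video decoding.

    Hypothetical helper, not the PR's actual code: only the key names,
    the defaults, and the 'cuda' -> 'cuda:LOCAL_RANK' rule are taken
    from the commit message.
    """
    extra_kwargs = extra_kwargs or {}
    environ = os.environ if environ is None else environ

    backend = extra_kwargs.get("video_decode_backend", "pyav")
    device = extra_kwargs.get("video_decode_device", "cpu")

    # A bare "cuda" is pinned to this process's local GPU. LOCAL_RANK is
    # set by launchers such as torchrun; defaulting to 0 is an assumption.
    if device == "cuda":
        device = f"cuda:{environ.get('LOCAL_RANK', '0')}"
    return backend, device
```

An explicit device such as `cuda:1` passes through unchanged in this sketch; only the bare `cuda` string is rewritten.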
Adds llava_ov2_iterable dataset that pipes lmms_video_utils canvases plus their CodecVideoOutput metadata into LlavaOnevision2DataProcessor. The processor gains a video_metadata kwarg: when supplied, it bypasses OV2's video_processor and constructs pixel_values_videos / video_grid_thw / patch_positions / frame_timestamps directly from the codec output, preserving real source-frame coordinates instead of falling back to arange(T). The frame-sampling path is untouched.
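To make the difference between real source-frame coordinates and the `arange(T)` fallback concrete, here is a small sketch in plain Python. The function names and the idea that the metadata exposes one source-frame index per canvas are assumptions; the actual `CodecVideoOutput` fields may look different.

```python
def temporal_coordinates(num_canvases, source_frame_indices=None):
    """Return one temporal coordinate per canvas.

    Hypothetical sketch: with codec metadata we keep the real source-frame
    indices; without it we fall back to 0..T-1, i.e. arange(T).
    """
    if source_frame_indices is None:
        return list(range(num_canvases))
    if len(source_frame_indices) != num_canvases:
        raise ValueError("expected one source-frame index per canvas")
    return list(source_frame_indices)


def to_timestamps(frame_indices, source_fps):
    """Convert source-frame indices to seconds at the source frame rate."""
    if source_fps <= 0:
        raise ValueError("source_fps must be positive")
    return [index / source_fps for index in frame_indices]
```

With indices `[0, 15, 45]` from a 30 fps source, the timestamps are 0 s, 0.5 s, and 1.5 s, whereas the fallback would report positions 0, 1, 2 regardless of how far apart the canvases really are.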
Pulled in by [video] and rolled up into [all] so existing 'install .[all]' workflows get torchcodec-backed codec-stream support for free, while base installs stay slim.
import torch
from PIL import Image

from lmms_engine.datasets.iterable.vision_iterable_dataset import (
@kcz358 Can we extract an abstract class and name it something like "codec native input"?
Also, for the processor: OV2 is a specification in both the data and the processor.
if self.config.video_sampling_strategy == "fps":
    overrides["target_fps"] = float(fps)
elif self.config.video_sampling_strategy == "frame_num":
    overrides["max_frames"] = int(self.config.frame_num)
I don't think there should be a max_frames here. Shouldn't we set a visual-token budget instead?
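One way to read the token-budget suggestion, as a minimal sketch: derive the frame count from a budget rather than configuring it directly. The function name, the fixed `tokens_per_frame`, and the floor of one frame are all assumptions; a codec-stream frontend that packs patches into canvases would likely budget per patch rather than per frame.

```python
def frames_for_token_budget(token_budget, tokens_per_frame, total_frames):
    """Pick how many frames fit into a visual-token budget.

    Hypothetical helper: assumes every frame costs the same number of
    tokens and that at least one frame is always kept.
    """
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    affordable = token_budget // tokens_per_frame
    return max(1, min(total_frames, affordable))
```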
    overrides["max_frames"] = int(self.config.frame_num)

if video_kwargs:
    qwen_to_ours = {
Maybe we should drop qwen_to_ours altogether: we can convert all the data offline and prepare everything needed with ffprobe.
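For the offline-preparation idea, a sketch of pulling basic stream facts with ffprobe might look like the following. The function names and the choice of fields (`avg_frame_rate`, `nb_frames`, `duration`) are assumptions about what "all things needed" covers; the ffprobe options used are standard ones.

```python
import json
import subprocess


def ffprobe_video_stream(path):
    """Run ffprobe on the first video stream and return its parsed JSON."""
    cmd = [
        "ffprobe", "-v", "error",
        "-select_streams", "v:0",
        "-show_entries", "stream=avg_frame_rate,nb_frames,duration",
        "-of", "json",
        path,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def _optional(value, cast):
    """Cast a probe field, treating missing or 'N/A' values as None."""
    return None if value in (None, "N/A") else cast(value)


def parse_stream_info(probe):
    """Turn ffprobe's JSON into plain numbers for an offline manifest."""
    stream = probe["streams"][0]
    numerator, denominator = stream["avg_frame_rate"].split("/")
    # ffprobe reports rates as fractions such as "30000/1001"; "0/0" means unknown.
    fps = int(numerator) / int(denominator) if int(denominator) else 0.0
    return {
        "fps": fps,
        "duration": _optional(stream.get("duration"), float),
        "nb_frames": _optional(stream.get("nb_frames"), int),
    }
```

Running `parse_stream_info(ffprobe_video_stream(path))` once per file and storing the result next to the dataset would let the loader skip per-sample key remapping at training time.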
Move the dataset-level codec video orchestration (collecting canvases plus their CodecVideoOutput metadata across a message list) into a reusable CodecVideoLoadingMixin alongside multimodal_mixin. The backend implementation (load_video_lmms_video_utils) stays in MultiModalDataLoadingMixin; LlavaOv2IterableDataset now inherits the mixin and only handles image collection plus processor dispatch.
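The split described in this commit could be sketched roughly as below. This is not the PR's code: the class name, the `_load_codec_video` stand-in for the backend loader, and the message schema (`content` items with `type` and `video` keys) are hypothetical; only the `(canvases, fps, CodecVideoOutput)` return shape comes from the description.

```python
class CodecVideoLoadingMixinSketch:
    """Collect canvases and codec metadata across a message list."""

    def _load_codec_video(self, video_path, fps):
        # Stand-in for the backend loader, which the PR keeps in
        # MultiModalDataLoadingMixin. Expected to return a
        # (canvases, fps, codec_output) triple.
        raise NotImplementedError

    def collect_codec_videos(self, messages, fps):
        canvases_per_video = []
        metadata_per_video = []
        for message in messages:
            for item in message.get("content", []):
                if item.get("type") != "video":
                    continue  # images and text are handled by the dataset
                canvases, _fps, codec_output = self._load_codec_video(item["video"], fps)
                canvases_per_video.append(canvases)
                metadata_per_video.append(codec_output)
        return canvases_per_video, metadata_per_video
```

Keeping the two lists index-aligned means the dataset can hand each video's canvases and its metadata to the processor together.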
@Luodian Refactored: extracted the codec video loading logic into a reusable The backend implementation (
Summary
Plumbs the new lmms-video-utils codec-stream video frontend into lmms-engine so LLaVA-OneVision-2 can train on the same patch-positions / per-canvas-timestamp format the model expects, but without depending on the closed-source cv-preinfer binary.

Changes
- datasets.config: video_backend literal gains "lmms_video_utils".
- datasets.multimodal_mixin: adds load_video_lmms_video_utils, which returns (canvases, fps, CodecVideoOutput). Decoder backend / device read from extra_kwargs.video_decode_backend / extra_kwargs.video_decode_device (defaults pyav / cpu; "cuda" resolves to cuda:LOCAL_RANK).
- datasets.iterable.llava_ov2_iterable_dataset (new): registers llava_ov2_iterable dataset that propagates the CodecVideoOutput metadata into the processor.
- datasets.processor.llava_onevision2_processor: process() learns a video_metadata kwarg. When supplied, the processor bypasses OV2's bundled video_processor and constructs pixel_values_videos / video_grid_thw / patch_positions / frame_timestamps directly from the codec output, preserving real source-frame coordinates instead of falling back to arange(T). The frame-sampling path is unchanged.
- pyproject.toml: adds [video] extra (lmms-video-utils[all]>=0.1.0) and rolls it into [all].

Example config

Validation
- max_steps run on LLaVA-Video-178K/llava_video_0_30_s_cap_oe.parquet with the OV-2 8B checkpoint completed end-to-end.
- num_workers=0 when video_decode_device=cuda (forked workers can't share the parent's CUDA context). Multi-worker GPU decoding would need a multiprocessing_context='spawn' toggle, which is a separate change.

Backward compatibility
Existing video backends (decord, qwen_vl_utils, qwen_omni_utils) and all other dataset / processor paths are untouched. Base install is unchanged; users must opt into pip install .[video] (or .[all]) to pull in lmms-video-utils.
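Regarding the num_workers note under Validation, here is a sketch of what a spawn toggle might look like. The helper is hypothetical and not part of this PR; it only builds keyword arguments, relying on the fact that `torch.utils.data.DataLoader` accepts a `multiprocessing_context` argument.

```python
def dataloader_worker_kwargs(num_workers, video_decode_device):
    """Build DataLoader kwargs that avoid forking after CUDA is in use.

    Hypothetical helper: CUDA decoding with worker processes needs the
    'spawn' start method, because forked children cannot reuse the
    parent's CUDA context.
    """
    kwargs = {"num_workers": num_workers}
    uses_cuda = str(video_decode_device).startswith("cuda")
    if uses_cuda and num_workers > 0:
        kwargs["multiprocessing_context"] = "spawn"
    return kwargs
```

The result would be passed as `DataLoader(dataset, **dataloader_worker_kwargs(4, "cuda"))`; with CPU decoding or `num_workers=0` the default start method is left alone.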