feat(datasets): lmms_video_utils video backend for LLaVA-OneVision-2 codec stream by kcz358 · Pull Request #183 · EvolvingLMMs-Lab/lmms-engine

kcz358 · 2026-06-05T04:21:39Z

Summary

Plumbs the new lmms-video-utils codec-stream video frontend into lmms-engine so LLaVA-OneVision-2 can train on the same patch-positions / per-canvas-timestamp format the model expects, but without depending on the closed-source cv-preinfer binary.

Changes

datasets.config: video_backend literal gains "lmms_video_utils".
datasets.multimodal_mixin: adds load_video_lmms_video_utils, which returns (canvases, fps, CodecVideoOutput). The decoder backend and device are read from extra_kwargs.video_decode_backend and extra_kwargs.video_decode_device (defaults: pyav and cpu; "cuda" resolves to cuda:LOCAL_RANK).
datasets.iterable.llava_ov2_iterable_dataset (new): registers llava_ov2_iterable dataset that propagates the CodecVideoOutput metadata into the processor.
datasets.processor.llava_onevision2_processor: process() learns a video_metadata kwarg. When supplied, the processor bypasses OV2's bundled video_processor and constructs pixel_values_videos / video_grid_thw / patch_positions / frame_timestamps directly from the codec output, preserving real source-frame coordinates instead of falling back to arange(T). The frame-sampling path is unchanged.
pyproject.toml: adds [video] extra (lmms-video-utils[all]>=0.1.0) and rolls it into [all].
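
The decoder settings described in the list above could be resolved roughly as follows. This is a minimal sketch, not the code from the PR: the helper name resolve_decode_settings is hypothetical, and treating an unset LOCAL_RANK as 0 is an assumption. Only the extra_kwargs keys, the pyav / cpu defaults, and the cuda to cuda:LOCAL_RANK mapping come from the description.

```python
import os


def resolve_decode_settings(extra_kwargs=None, env=None):
    """Return (backend, device) for video decoding.

    Hypothetical helper: the name and the LOCAL_RANK fallback of 0 are
    assumptions made for illustration only.
    """
    extra_kwargs = extra_kwargs or {}
    env = os.environ if env is None else env

    # Defaults stated in the PR description: pyav on cpu.
    backend = extra_kwargs.get("video_decode_backend", "pyav")
    device = extra_kwargs.get("video_decode_device", "cpu")

    # A bare "cuda" is pinned to this process's local rank.
    if device == "cuda":
        device = f"cuda:{int(env.get('LOCAL_RANK', 0))}"
    return backend, device
```

An explicit device such as cuda:1 passes through unchanged in this sketch, so only the bare "cuda" spelling depends on the launcher's environment.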

Example config

dataset_config:
  dataset_type: llava_ov2_iterable
  video_backend: lmms_video_utils
  extra_kwargs:
    video_decode_backend: torchcodec
    video_decode_device: cuda

Validation

A 3-step run (capped with max_steps) on LLaVA-Video-178K/llava_video_0_30_s_cap_oe.parquet with the OV-2 8B checkpoint completed end-to-end.
The DataLoader must run with num_workers=0 when video_decode_device=cuda, because forked workers can't share the parent's CUDA context. Multi-worker GPU decoding would need a multiprocessing_context='spawn' toggle, which is a separate change.
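
To make the worker constraint concrete, here is a sketch of how the DataLoader keyword arguments might be chosen. The function name dataloader_worker_kwargs and the allow_spawn flag are hypothetical and do not exist in this PR. The only library facts relied on are that torch.utils.data.DataLoader accepts num_workers and multiprocessing_context.

```python
def dataloader_worker_kwargs(video_decode_device, requested_workers, allow_spawn=False):
    """Pick DataLoader kwargs that are safe for the chosen decode device.

    Hypothetical helper. allow_spawn stands in for the separate
    multiprocessing_context='spawn' toggle mentioned above.
    """
    uses_cuda = str(video_decode_device).startswith("cuda")

    # CPU decoding (or no workers at all) needs no special handling.
    if not uses_cuda or requested_workers == 0:
        return {"num_workers": requested_workers}

    # With spawn, each worker starts fresh and can create its own CUDA context.
    if allow_spawn:
        return {"num_workers": requested_workers, "multiprocessing_context": "spawn"}

    # Otherwise fall back to in-process loading, as validated in this PR.
    return {"num_workers": 0}
```

The result would be unpacked into the constructor, for example DataLoader(dataset, **dataloader_worker_kwargs("cuda:0", 4)).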

Backward compatibility

Existing video backends (decord, qwen_vl_utils, qwen_omni_utils) and all other dataset / processor paths are untouched. Base install is unchanged; users must opt into pip install .[video] (or .[all]) to pull in lmms-video-utils.

New load_video_lmms_video_utils returns (canvases, fps, CodecVideoOutput) so downstream processors aware of codec metadata can use per-patch positions and per-canvas timestamps directly. Decoder backend/device read from extra_kwargs (video_decode_backend / video_decode_device), default pyav/cpu; 'cuda' resolves to cuda:LOCAL_RANK.

Adds llava_ov2_iterable dataset that pipes lmms_video_utils canvases plus their CodecVideoOutput metadata into LlavaOnevision2DataProcessor. The processor gains a video_metadata kwarg: when supplied, it bypasses OV2's video_processor and constructs pixel_values_videos / video_grid_thw / patch_positions / frame_timestamps directly from the codec output, preserving real source-frame coordinates instead of falling back to arange(T). The frame-sampling path is untouched.
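
The difference between real source-frame coordinates and the arange(T) fallback can be illustrated with plain numbers. This is only a sketch: the PR does not show the fields of CodecVideoOutput, so the inputs here (a list of source frame indices and a source fps) and both function names are assumptions.

```python
def timestamps_from_source(source_frame_indices, source_fps):
    """Timestamps in seconds derived from where each canvas sits in the source video.

    Assumed inputs: the actual CodecVideoOutput layout is not shown in the PR.
    """
    return [index / source_fps for index in source_frame_indices]


def timestamps_fallback(num_canvases):
    """The arange(T) fallback: positions 0..T-1 with no link to the source timeline."""
    return [float(i) for i in range(num_canvases)]


# Four canvases taken from frames 0, 15, 45 and 90 of a 30 fps video.
real = timestamps_from_source([0, 15, 45, 90], 30.0)  # [0.0, 0.5, 1.5, 3.0]
fallback = timestamps_fallback(4)                      # [0.0, 1.0, 2.0, 3.0]
```

With uneven sampling the fallback implies evenly spaced frames, while the source-derived values keep the actual gaps between canvases.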

Luodian · 2026-06-15T03:34:41Z

+import torch
+from PIL import Image
+
+from lmms_engine.datasets.iterable.vision_iterable_dataset import (


@kcz358 Can we extract an abstract class and name it something like codec native input?

Also, for the processor: ov2 is a specification in both the data and the processor.

Luodian · 2026-06-15T03:36:09Z

+        if self.config.video_sampling_strategy == "fps":
+            overrides["target_fps"] = float(fps)
+        elif self.config.video_sampling_strategy == "frame_num":
+            overrides["max_frames"] = int(self.config.frame_num)


I don't think there should be a max_frames here. Shouldn't we set a visual-token budget instead?
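
One possible reading of this suggestion, sketched with made-up numbers: derive the frame count from a token budget instead of configuring it directly. The function name, the idea of a fixed tokens-per-frame cost, and the example values are all assumptions. The PR does not define a budget option.

```python
def frames_within_budget(token_budget, tokens_per_frame, available_frames):
    """Largest frame count whose visual tokens fit in the budget.

    Hypothetical helper: assumes every frame costs the same number of tokens.
    """
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    return min(available_frames, token_budget // tokens_per_frame)


# Illustrative values only: an 8192-token budget at 196 tokens per frame.
long_clip = frames_within_budget(8192, 196, available_frames=300)   # 41
short_clip = frames_within_budget(8192, 196, available_frames=16)   # 16
```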

Luodian · 2026-06-15T03:37:01Z

+            overrides["max_frames"] = int(self.config.frame_num)
+
+        if video_kwargs:
+            qwen_to_ours = {


Maybe we should not handle qwen_to_ours here. We can convert all the data offline and prepare everything needed using ffprobe.
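
A rough sketch of what such an offline ffprobe pass could look like. The function names and the choice of fields are assumptions, not part of this PR. The ffprobe options used (-select_streams, -show_entries, -of json) are standard, and ffprobe reports nb_frames and duration as strings that may be missing or "N/A" for some containers.

```python
import json
import subprocess


def ffprobe_command(path):
    """Build an ffprobe call that reports basic facts about the first video stream."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "v:0",
        "-show_entries", "stream=avg_frame_rate,nb_frames,duration",
        "-of", "json",
        path,
    ]


def _cast_or_none(value, cast):
    """ffprobe may omit a field or print 'N/A'; map both cases to None."""
    try:
        return cast(value)
    except (TypeError, ValueError):
        return None


def parse_probe(stdout):
    """Turn ffprobe's JSON output into a small metadata dict."""
    stream = json.loads(stdout)["streams"][0]
    numerator, denominator = stream["avg_frame_rate"].split("/")
    fps = float(numerator) / float(denominator) if float(denominator) else None
    return {
        "fps": fps,
        "nb_frames": _cast_or_none(stream.get("nb_frames"), int),
        "duration": _cast_or_none(stream.get("duration"), float),
    }


def probe(path):
    """Run ffprobe on one file (requires ffprobe on PATH)."""
    result = subprocess.run(ffprobe_command(path), capture_output=True, text=True, check=True)
    return parse_probe(result.stdout)
```

Running probe over the dataset once and storing the results next to each sample would let the loader skip per-sample key translation at training time.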

Move the dataset-level codec video orchestration (collecting canvases plus their CodecVideoOutput metadata across a message list) into a reusable CodecVideoLoadingMixin alongside multimodal_mixin. The backend implementation (load_video_lmms_video_utils) stays in MultiModalDataLoadingMixin; LlavaOv2IterableDataset now inherits the mixin and only handles image collection plus processor dispatch.

kcz358 · 2026-06-26T07:29:18Z

@Luodian Refactored: extracted the codec video loading logic into a reusable CodecVideoLoadingMixin (in datasets/codec_video_mixin.py, alongside multimodal_mixin). LlavaOv2IterableDataset now inherits this mixin instead of inlining the codec orchestration.

The backend implementation (load_video_lmms_video_utils) stays in MultiModalDataLoadingMixin; the new mixin only handles the dataset-level orchestration (collecting canvases + their CodecVideoOutput metadata across a message list). This way any future codec-aware dataset can reuse it. (714856f)
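
For readers who have not opened the diff, the split described above might look roughly like this. It is a sketch under assumptions: the class name, the method name collect_codec_videos, the message schema (a list of dicts with typed content items), and the single path argument to load_video_lmms_video_utils are hypothetical. Only the (canvases, fps, CodecVideoOutput) return shape comes from the PR description.

```python
class SketchCodecVideoMixin:
    """Dataset-level orchestration only; the backend call lives elsewhere."""

    def collect_codec_videos(self, messages):
        """Gather canvases and codec metadata for every video in a message list."""
        all_canvases, all_metadata = [], []
        for message in messages:
            content = message.get("content", [])
            if not isinstance(content, list):
                continue  # plain-text content carries no videos
            for item in content:
                if not isinstance(item, dict) or item.get("type") != "video":
                    continue
                # Expected to be provided by the backend mixin on the same class.
                canvases, _fps, codec_output = self.load_video_lmms_video_utils(item["video"])
                all_canvases.append(canvases)
                all_metadata.append(codec_output)
        return all_canvases, all_metadata
```

A dataset class would inherit both this mixin and the one providing load_video_lmms_video_utils, then pass the two returned lists on to the processor.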

kcz358 added 4 commits June 4, 2026 21:12

chore(datasets): allow lmms_video_utils as video_backend literal

a4ea857

chore(deps): add lmms-video-utils[all] as optional video extra

9ca9323

Pulled in by [video] and rolled up into [all] so existing 'install .[all]' workflows get torchcodec-backed codec-stream support for free, while base installs stay slim.

Luodian self-requested a review June 26, 2026 04:01

Luodian approved these changes Jun 26, 2026

kcz358 merged commit fea2e57 into main Jun 26, 2026
3 checks passed

kcz358 merged 5 commits into main from feat/lmms-video-utils-backend
