Add video understanding to the VLM path #123
Open
amazloumi wants to merge 4 commits into
Pull request overview
This PR extends the existing VLM wrapper to support video clips (as ordered frame batches) using the same registry-driven composition approach as the current image-VLM path, and adds token-reducing pooling connectors so multi-frame clips fit within the sequence budget.
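To make the sequence-budget argument concrete, here is a minimal sketch of the arithmetic. The function names, the square patch grid, the ceil-division pooling rule, and all numbers are illustrative assumptions, not the helper this PR adds.

```python
import math


def pooled_tokens_per_frame(grid_side: int, pool: int) -> int:
    """Tokens left after pooling a grid_side x grid_side patch grid with a
    pool x pool window. Ceil division (an assumption) keeps a partial border window."""
    pooled_side = math.ceil(grid_side / pool)
    return pooled_side * pooled_side


def visual_tokens(frames: int, grid_side: int, pool: int) -> int:
    """Static visual-token count for a clip: frames times pooled tokens per frame."""
    return frames * pooled_tokens_per_frame(grid_side, pool)


def text_budget(seq_len: int, frames: int, grid_side: int, pool: int) -> int:
    """Tokens left for text after the visual prefix; raises if the clip does not fit."""
    used = visual_tokens(frames, grid_side, pool)
    if used >= seq_len:
        raise ValueError(f"{used} visual tokens leave no room in seq_len={seq_len}")
    return seq_len - used


# Hypothetical numbers: 16 frames, a 16x16 patch grid, 2x2 pooling, seq_len 2048.
print(visual_tokens(16, 16, 2))
print(text_budget(2048, 16, 16, 2))
```

With these assumed numbers, skipping the pooling step (`pool=1`) gives 4096 visual tokens, which exceeds the 2048-token sequence on its own; this is the situation the pooling connectors are meant to avoid.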
Changes:
- Add pooling connectors (`avgpool`, `attentional_pool`) with a `VisionAdapter.output_num_tokens()` contract, and thread adapter-derived visual token counts through build/strategy/seq-len checks.
- Add a WebVid-style video data pipeline (timestamp sampling + PyAV decode, dataset + collator, `[video]` config and JobConfig wiring) and hook it into `scripts/train.py`.
- Generalize the VLM visual projection path to accept both `(B,3,H,W)` and `(B,F,3,H,W)` and add docs/configs/tests for video training across all four fusion archs.
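The timestamp sampling mentioned in the list above can be sketched as follows. This is a hedged illustration of the policy the PR description states (2 fps, uniform, first and last frame kept, capped by a maximum frame count); `sample_timestamps` and its defaults are hypothetical and are not the code in `kempnerforge/data/video_io.py`.

```python
def sample_timestamps(
    duration_s: float, fps: float = 2.0, max_frames: int = 16
) -> list[float]:
    """Return uniformly spaced timestamps (seconds) that always include
    the first (0.0) and last (duration_s) instants of the clip."""
    if duration_s <= 0 or fps <= 0 or max_frames < 1:
        raise ValueError("duration_s, fps and max_frames must be positive")
    if max_frames == 1:
        return [0.0]
    # Frame count at the target rate, counting both endpoints,
    # then clamped to at least two frames and at most max_frames.
    n = int(duration_s * fps) + 1
    n = max(2, min(n, max_frames))
    step = duration_s / (n - 1)
    return [i * step for i in range(n)]


# A 3-second clip at 2 fps yields 7 timestamps spaced 0.5 s apart.
print(sample_timestamps(3.0))
# A 60-second clip is capped at max_frames and spaced more widely.
print(len(sample_timestamps(60.0, max_frames=16)))
```

Sampling by timestamp rather than by frame index keeps the policy independent of each file's native frame rate; the decoder then only needs to return the frame nearest each requested time.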
Reviewed changes
Copilot reviewed 21 out of 22 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| uv.lock | Adds av to the locked dependency set. |
| pyproject.toml | Adds runtime dependency on av>=17.1.0. |
| README.md | Updates VLM docs to describe video + pooling connectors and adds a video training example command. |
| CHANGELOG.md | Documents the new video + pooling-connector features and associated components. |
| scripts/train.py | Wires [video] to build a video dataset/collator and passes frames_per_clip into model build. |
| kempnerforge/config/video.py | Introduces [video] VideoConfig with validation. |
| kempnerforge/config/job.py | Threads video-aware visual token budgeting into seq-len checks; adds is_video and [video]/[vlm] invariant. |
| kempnerforge/config/adapter.py | Adds pooling-related config fields and token-count prediction via output_num_tokens. |
| kempnerforge/model/vlm.py | Generalizes visual projection to 4D/5D inputs and uses adapter-derived token counts for prefix length/splits. |
| kempnerforge/model/adapter.py | Adds VisionAdapter base, pooling adapters, and shared pooled-token-count helper. |
| kempnerforge/distributed/parallel.py | Ensures parallel build sizes Transformer’s image-prefix split using adapter-derived visual_tokens and frames_per_clip. |
| kempnerforge/data/video_io.py | Adds timestamp sampling policy and PyAV-based frame decoding. |
| kempnerforge/data/video_dataset.py | Adds WebVidVideoDataset and VideoCollator producing fixed-shape (B,F,3,H,W) + frame_mask. |
| docs/how-to/train-on-video.md | New guide explaining token budget, config, and usage for video training. |
| docs/how-to/index.md | Links the new “Train on video” guide in the how-to index. |
| configs/train/vlm_video_webvid.toml | Adds a reference training config for video VLM on WebVid-10M using avgpool. |
| tests/unit/test_vlm.py | Adds unit tests for pooling token plumbing and video forward across all four archs. |
| tests/unit/test_adapter.py | Adds unit tests for pooled token counting, pooling adapters, registry wiring, and config integration. |
| tests/unit/test_video_io.py | Adds unit/integration tests for timestamp sampling and frame decoding (with skips when needed). |
| tests/unit/test_video_dataset.py | Adds unit tests for dataset path mapping, masking/padding behavior, and synthetic integration. |
| tests/unit/test_video_config.py | Adds tests for VideoConfig validation and JobConfig [video] wiring. |
| tests/unit/test_moma.py | Updates MoMa stubs to satisfy the new output_num_tokens/frames_per_clip expectations. |
Summary
Extends the existing image-VLM path to ingest video — a clip is an ordered set of frames — through the same registry-driven, composition-over-inheritance design. Trains from scratch end-to-end on WebVid-10M across all four fusion archs (joint-decoder, cross-attention, MoT, MoMa). The text-only and single-image paths are unchanged (bit-exact).
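One way to picture "a clip is an ordered set of frames" is as shape bookkeeping: a 5D video batch is folded into a larger 4D image batch for the encoder, and the per-frame tokens are then concatenated in frame order. The sketch below tracks shapes only; `fold_frame_axis`, `projected_shape`, and the example sizes are hypothetical and do not reproduce the PR's `_project_visual_features`.

```python
def fold_frame_axis(shape: tuple[int, ...]) -> tuple[tuple[int, int, int, int], int]:
    """Map an image batch (B, 3, H, W) or video batch (B, F, 3, H, W)
    to the 4D shape fed to the encoder, plus the frame count F."""
    if len(shape) == 4:
        b, c, h, w = shape
        return (b, c, h, w), 1  # a single image is the F == 1 case
    if len(shape) == 5:
        b, f, c, h, w = shape
        return (b * f, c, h, w), f  # frames become extra batch entries
    raise ValueError(f"expected a 4D or 5D shape, got {shape}")


def projected_shape(
    shape: tuple[int, ...], tokens_per_frame: int, dim: int
) -> tuple[int, int, int]:
    """Shape after encoding and pooling: (B, F * P', dim), where P' is
    tokens_per_frame (the pooled token count for one frame)."""
    (folded_batch, _, _, _), frames = fold_frame_axis(shape)
    batch = folded_batch // frames
    return (batch, frames * tokens_per_frame, dim)


# Assumed sizes: 224x224 inputs, 64 pooled tokens per frame, model dim 512.
print(projected_shape((2, 3, 224, 224), 64, 512))
print(projected_shape((2, 8, 3, 224, 224), 64, 512))
```

Because the image case falls out of the same function with `F == 1`, no separate image branch is needed in this sketch, which is consistent with the claim that the single-image path is unchanged.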
Pooling connector + token-count plumbing
- `avgpool` and `attentional_pool` (Molmo2-style, mean-query MHA) adapters via `@register_adapter`; introduce a typed `VisionAdapter` base with `output_num_tokens()` so the visual-token count is adapter-derived.
- Thread the adapter-derived token count through the build, strategy, and seq-len checks (`config/job.py`, `distributed/parallel.py`, `model/vlm.py`). Projection adapters stay identity, so the image path remains bit-exact.
Video data path
- `data/video_io.py`: timestamp-based frame sampling (2 fps, uniform, first and last frame kept; Molmo2 §3.1/§A) plus PyAV decode. torchcodec, the paper's decoder, can't load on the cluster (no system FFmpeg, plus a CUDA-lib mismatch), so we use PyAV, whose wheel bundles FFmpeg; it is imported lazily.
- `data/video_dataset.py`: `WebVidVideoDataset` (verified `id[:2]/id[:4]/id[:6]/id.mp4` mapping, CSV manifest, reuses the image preprocessing) plus `VideoCollator` → `(B, F, 3, H, W)` + frame mask.
- `config/video.py`: `[video]` `VideoConfig` (`data_root`, `split`, `fps`, `max_frames`, `frame_size`, `max_samples`) wired into `JobConfig` (+ `is_video`).
- Add `av` to dependencies.
Frame-aware model + training wiring
- `_project_image_features` → `_project_visual_features`: folds the frame axis through the encoder + pooler to `(B, F·P′, dim)` (image is the `F == 1` case).
- Pass `frames_per_clip` into the model build so the static visual-token count equals `F·P′` (drives the residual budget and MoT's positional split; static == runtime).
- `scripts/train.py` builds the video dataset/collator when `[video]` is set.
- `configs/train/vlm_video_webvid.toml` (SigLIP2 + avgpool + WebVid).
Design faithfulness
`ModalityStrategy` classes were generalized rather than duplicated; new components arrive via the registry with no edits to the text/image fast paths.
Deferred to follow-up PRs:
per-frame timestamp tokens + special tokens, grounding (`<points>`/`<tracks>` + point-F1/track-J&F eval), frame-mask-aware attention, bidirectional visual attention, VLM sequence packing, long-context (blocked on context-parallel being wired), and warm-start from a converted image-VLM checkpoint.
Testing
- `uv run ruff check` passes
- `uv run ruff format --check` passes
- `uv run pyright kempnerforge/` passes (0 errors)
- `uv run pytest tests/unit/ -v`: 1493 passed, 2 skipped
- `uv run torchrun --nproc_per_node=4 -m pytest tests/distributed`: 99 passed, 2 skipped, 0 failed (incl. VLM FSDP/MoT/MoMa/cross-attn suites)
- `uv run pytest tests/e2e/ --e2e`: 25 passed, 1 skipped (7B `--slow`). 5 failures (PP / checkpoint-resume / sigterm) are pre-existing on `main`: verified identical with this branch's changes stashed, and they're in code paths this PR doesn't touch.
- `configs/train/vlm_video_webvid.toml` (+ per-arch variants).