Add video understanding to the VLM path #123

Open
amazloumi wants to merge 4 commits into main from worktree-video-pipeline

@amazloumi

Member

Summary

Extends the existing image-VLM path to ingest video — a clip is an ordered set of frames — through the same registry-driven, composition-over-inheritance design. Trains from scratch end-to-end on WebVid-10M across all four fusion archs (joint-decoder, cross-attention, MoT, MoMa). The text-only and single-image paths are unchanged (bit-exact).

Pooling connector + token-count plumbing

  • Add avgpool and attentional_pool (Molmo2-style, mean-query MHA) adapters via @register_adapter; introduce a typed VisionAdapter base with output_num_tokens() so the visual-token count is adapter-derived.
  • Thread that count through the build path, the four modality strategies, and the three seq-len checks (config/job.py, distributed/parallel.py, model/vlm.py). Projection adapters stay identity → image path bit-exact.
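
As a rough illustration of what an adapter-derived token count could look like, here is a minimal sketch. The class names, the `output_num_tokens(num_patches)` signature, the square-grid assumption, and the ceil-division rule for grids that do not divide evenly are all assumptions made for this example; the PR's actual `VisionAdapter` contract may differ.

```python
import math


def pooled_num_tokens(num_patches: int, pool_size: int) -> int:
    """Token count after 2D pooling over a square patch grid.

    Assumptions of this sketch (not taken from the PR): the encoder emits a
    square grid, and a grid that is not divisible by pool_size is rounded up.
    """
    side = math.isqrt(num_patches)
    if side * side != num_patches:
        raise ValueError(f"expected a square patch grid, got {num_patches} patches")
    if pool_size < 1:
        raise ValueError("pool_size must be >= 1")
    pooled_side = math.ceil(side / pool_size)
    return pooled_side * pooled_side


class ProjectionAdapterSketch:
    """Hypothetical projection adapter: token count passes through unchanged."""

    def output_num_tokens(self, num_patches: int) -> int:
        return num_patches


class AvgPoolAdapterSketch:
    """Hypothetical pooling adapter: token count shrinks by pool_size per axis."""

    def __init__(self, pool_size: int = 2) -> None:
        self.pool_size = pool_size

    def output_num_tokens(self, num_patches: int) -> int:
        return pooled_num_tokens(num_patches, self.pool_size)
```

With a 16×16 patch grid (256 patches) and `pool_size=2`, the pooling sketch reports 64 tokens while the projection sketch still reports 256, which is the property that lets the image path keep its original sequence layout.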

Video data path

  • data/video_io.py: timestamp-based frame sampling (2 fps, uniform, first and last frames kept; Molmo2 §3.1/§A) plus PyAV decode. torchcodec, the paper's decoder, fails to load on the cluster (no system FFmpeg, plus a CUDA-library mismatch), so we use PyAV, whose wheel bundles FFmpeg. PyAV is imported lazily.
  • data/video_dataset.py: WebVidVideoDataset (verified id[:2]/id[:4]/id[:6]/id.mp4 path mapping, CSV manifest, reuses the image preprocessing) + VideoCollator → (B, F, 3, H, W) + frame mask.
  • config/video.py: [video] VideoConfig (data_root, split, fps, max_frames, frame_size, max_samples) wired into JobConfig (+ is_video).
  • Adds av to dependencies.
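
The path mapping and the sampling policy described above can be sketched in plain Python. Both function names are hypothetical. The sampler is one plausible reading of "2 fps, uniform, first and last frames kept": placing the last frame exactly at the clip duration and falling back to evenly spaced timestamps once `max_frames` is exceeded are assumptions, not the PR's or the paper's exact rule.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def webvid_relpath(video_id: str) -> PurePosixPath:
    """Relative path following the id[:2]/id[:4]/id[:6]/id.mp4 layout."""
    return PurePosixPath(video_id[:2], video_id[:4], video_id[:6], f"{video_id}.mp4")


def sample_timestamps(duration_s: float, fps: float = 2.0, max_frames: int = 16) -> list[float]:
    """Timestamps (seconds) to decode, always including the first and last.

    Assumption: the last frame is addressed by the clip duration itself.
    """
    if duration_s <= 0 or fps <= 0 or max_frames < 1:
        raise ValueError("duration_s and fps must be positive, max_frames >= 1")
    # Candidates at 1/fps spacing: 0, 1/fps, 2/fps, ... up to the duration.
    count = int(duration_s * fps) + 1
    stamps = [i / fps for i in range(count)]
    if stamps[-1] < duration_s:
        stamps.append(duration_s)  # keep the last frame
    if len(stamps) > max_frames:
        if max_frames == 1:
            return [0.0]
        # Too many candidates: fall back to uniform spacing with fixed endpoints.
        step = duration_s / (max_frames - 1)
        stamps = [i * step for i in range(max_frames)]
        stamps[-1] = duration_s
    return stamps
```

For a 10-second clip capped at 5 frames, this sketch yields timestamps at 0, 2.5, 5, 7.5, and 10 seconds; a 3-second clip under the cap keeps the plain 2 fps grid.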

Frame-aware model + training wiring

  • Generalize _project_image_features → _project_visual_features: folds the frame axis through the encoder + pooler to (B, F·P′, dim) (image is the F == 1 case).
  • Thread frames_per_clip so the static visual-token count equals F·P′ (drives the residual budget and MoT's positional split; static == runtime).
  • scripts/train.py builds the video dataset/collator when [video] is set.
  • Add configs/train/vlm_video_webvid.toml (SigLIP2 + avgpool + WebVid).
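
The shape bookkeeping behind the frame folding can be traced without any tensors. The sketch below is illustrative only: the function names are hypothetical, and reading "residual budget" as the sequence length left for text after the visual prefix is an assumption.

```python
from __future__ import annotations


def visual_prefix_shape(
    pixel_shape: tuple[int, ...], tokens_per_frame: int, dim: int
) -> tuple[int, int, int]:
    """Shape of the visual prefix for (B, 3, H, W) or (B, F, 3, H, W) input."""
    if len(pixel_shape) == 4:
        batch, frames = pixel_shape[0], 1  # a single image is the F == 1 case
    elif len(pixel_shape) == 5:
        batch, frames = pixel_shape[0], pixel_shape[1]
    else:
        raise ValueError(f"expected 4D or 5D pixel input, got {len(pixel_shape)}D")
    # Fold: (B, F, 3, H, W) -> (B*F, 3, H, W) so the encoder sees plain images.
    # Encode + pool: (B*F, P', dim). Unfold: (B, F*P', dim).
    return (batch, frames * tokens_per_frame, dim)


def text_token_budget(seq_len: int, frames_per_clip: int, tokens_per_frame: int) -> int:
    """Sequence positions left for text once F * P' visual tokens are reserved."""
    visual_tokens = frames_per_clip * tokens_per_frame
    if visual_tokens >= seq_len:
        raise ValueError(
            f"{visual_tokens} visual tokens leave no room in seq_len={seq_len}"
        )
    return seq_len - visual_tokens
```

For example, 8 frames at 64 pooled tokens each occupy 512 positions, so a 2048-token sequence would leave 1536 positions for text under these assumptions.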

Design faithfulness

  • Video is not a new arch, so the four existing ModalityStrategy classes were generalized rather than duplicated; new components arrive via the registry with no edits to the text/image fast paths.
  • FSDP2-only (PP still rejected for VLM); fixed-shape collation for DP-rank consistency; DCP-stable FQNs.
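
Fixed-shape collation can be pictured as padding every clip to the same frame count and recording which frames are real, presumably so that every data-parallel rank sees identically shaped batches. The sketch below uses plain lists in place of tensors; `collate_clips`, its truncation rule, and the `pad_frame` argument are hypothetical and not the PR's `VideoCollator`.

```python
from __future__ import annotations

from typing import Any


def collate_clips(
    clips: list[list[Any]], max_frames: int, pad_frame: Any
) -> tuple[list[list[Any]], list[list[bool]]]:
    """Pad (or truncate) each clip to max_frames and build a frame mask.

    Returns (batch, mask), where mask[i][j] is True for a real frame and
    False for padding. Every row has exactly max_frames entries.
    """
    if max_frames < 1:
        raise ValueError("max_frames must be >= 1")
    batch: list[list[Any]] = []
    mask: list[list[bool]] = []
    for frames in clips:
        kept = list(frames[:max_frames])  # assumption: extra frames are dropped
        n_pad = max_frames - len(kept)
        batch.append(kept + [pad_frame] * n_pad)
        mask.append([True] * len(kept) + [False] * n_pad)
    return batch, mask
```

Because the output shape depends only on `max_frames`, the visual-token count stays static regardless of how long each source clip was.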

Deferred to follow-up PRs:
per-frame timestamp tokens + special tokens, grounding (<points>/<tracks> + point-F1/track-J&F eval), frame-mask-aware attention, bidirectional visual attention, VLM sequence packing, long-context (blocked on context-parallel being wired), and warm-start from a converted image-VLM checkpoint.

Testing

  • uv run ruff check passes
  • uv run ruff format --check passes
  • uv run pyright kempnerforge/ passes (0 errors)
  • uv run pytest tests/unit/ -v — 1493 passed, 2 skipped
  • uv run torchrun --nproc_per_node=4 -m pytest tests/distributed — 99 passed, 2 skipped, 0 failed (incl. VLM FSDP/MoT/MoMa/cross-attn suites)
  • uv run pytest tests/e2e/ --e2e — 25 passed, 1 skipped (7B --slow). 5 failures (PP / checkpoint-resume / sigterm) are pre-existing on main — verified identical with this branch's changes stashed; they're in code paths this PR doesn't touch.
  • Tested on 4× H100 (FSDP) — from-scratch WebVid training runs end-to-end for all four archs via configs/train/vlm_video_webvid.toml (+ per-arch variants).

@codecov

codecov Bot commented Jun 25, 2026


Codecov Report

❌ Patch coverage is 93.25843% with 24 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| kempnerforge/data/video_io.py | 78.43% | 7 Missing and 4 partials ⚠️ |
| kempnerforge/data/video_dataset.py | 92.38% | 4 Missing and 4 partials ⚠️ |
| kempnerforge/model/adapter.py | 96.33% | 2 Missing and 2 partials ⚠️ |
| kempnerforge/model/vlm.py | 96.87% | 1 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
| --- | --- |
| kempnerforge/config/adapter.py | 100.00% <100.00%> (ø) |
| kempnerforge/config/job.py | 88.59% <100.00%> (+1.72%) ⬆️ |
| kempnerforge/config/video.py | 100.00% <100.00%> (ø) |
| kempnerforge/distributed/parallel.py | 59.06% <100.00%> (+0.24%) ⬆️ |
| kempnerforge/model/vlm.py | 99.07% <96.87%> (+0.12%) ⬆️ |
| kempnerforge/model/adapter.py | 97.27% <96.33%> (-2.73%) ⬇️ |
| kempnerforge/data/video_dataset.py | 92.38% <92.38%> (ø) |
| kempnerforge/data/video_io.py | 78.43% <78.43%> (ø) |

Copilot AI left a comment

Pull request overview

This PR extends the existing VLM wrapper to support video clips (as ordered frame batches) using the same registry-driven composition approach as the current image-VLM path, and adds token-reducing pooling connectors so multi-frame clips fit within the sequence budget.

Changes:

  • Add pooling connectors (avgpool, attentional_pool) with a VisionAdapter.output_num_tokens() contract, and thread adapter-derived visual token counts through build/strategy/seq-len checks.
  • Add a WebVid-style video data pipeline (timestamp sampling + PyAV decode, dataset + collator, [video] config and JobConfig wiring) and hook it into scripts/train.py.
  • Generalize the VLM visual projection path to accept both (B,3,H,W) and (B,F,3,H,W) and add docs/configs/tests for video training across all four fusion archs.
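
To make the `[video]` section concrete, here is a hedged sketch of a validated config object. The field names come from the PR description; the class name, every default value, and the specific validation rules are assumptions for illustration and may not match `kempnerforge/config/video.py`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class VideoConfigSketch:
    """Hypothetical stand-in for the [video] config section."""

    data_root: str
    split: str = "train"            # assumed default
    fps: float = 2.0                # assumed default, matching the 2 fps sampling
    max_frames: int = 16            # assumed default
    frame_size: int = 224           # assumed default
    max_samples: int | None = None  # None means "use every sample" (assumption)

    def __post_init__(self) -> None:
        if not self.data_root:
            raise ValueError("data_root must be set")
        if self.fps <= 0:
            raise ValueError("fps must be positive")
        if self.max_frames < 1:
            raise ValueError("max_frames must be >= 1")
        if self.frame_size < 1:
            raise ValueError("frame_size must be >= 1")
        if self.max_samples is not None and self.max_samples < 1:
            raise ValueError("max_samples must be >= 1 when set")
```

Validating at construction time means a bad `fps` or `max_frames` fails when the job config is parsed rather than partway through data loading.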

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| uv.lock | Adds av to the locked dependency set. |
| pyproject.toml | Adds runtime dependency on av>=17.1.0. |
| README.md | Updates VLM docs to describe video + pooling connectors and adds a video training example command. |
| CHANGELOG.md | Documents the new video + pooling-connector features and associated components. |
| scripts/train.py | Wires [video] to build a video dataset/collator and passes frames_per_clip into model build. |
| kempnerforge/config/video.py | Introduces [video] VideoConfig with validation. |
| kempnerforge/config/job.py | Threads video-aware visual token budgeting into seq-len checks; adds is_video and [video]/[vlm] invariant. |
| kempnerforge/config/adapter.py | Adds pooling-related config fields and token-count prediction via output_num_tokens. |
| kempnerforge/model/vlm.py | Generalizes visual projection to 4D/5D inputs and uses adapter-derived token counts for prefix length/splits. |
| kempnerforge/model/adapter.py | Adds VisionAdapter base, pooling adapters, and shared pooled-token-count helper. |
| kempnerforge/distributed/parallel.py | Ensures parallel build sizes Transformer’s image-prefix split using adapter-derived visual_tokens and frames_per_clip. |
| kempnerforge/data/video_io.py | Adds timestamp sampling policy and PyAV-based frame decoding. |
| kempnerforge/data/video_dataset.py | Adds WebVidVideoDataset and VideoCollator producing fixed-shape (B,F,3,H,W) + frame_mask. |
| docs/how-to/train-on-video.md | New guide explaining token budget, config, and usage for video training. |
| docs/how-to/index.md | Links the new “Train on video” guide in the how-to index. |
| configs/train/vlm_video_webvid.toml | Adds a reference training config for video VLM on WebVid-10M using avgpool. |
| tests/unit/test_vlm.py | Adds unit tests for pooling token plumbing and video forward across all four archs. |
| tests/unit/test_adapter.py | Adds unit tests for pooled token counting, pooling adapters, registry wiring, and config integration. |
| tests/unit/test_video_io.py | Adds unit/integration tests for timestamp sampling and frame decoding (with skips when needed). |
| tests/unit/test_video_dataset.py | Adds unit tests for dataset path mapping, masking/padding behavior, and synthetic integration. |
| tests/unit/test_video_config.py | Adds tests for VideoConfig validation and JobConfig [video] wiring. |
| tests/unit/test_moma.py | Updates MoMa stubs to satisfy the new output_num_tokens/frames_per_clip expectations. |

Comment thread kempnerforge/model/vlm.py
Comment thread kempnerforge/model/adapter.py
Comment thread kempnerforge/config/adapter.py Outdated