Add video understanding to the VLM path by amazloumi · Pull Request #123 · KempnerInstitute/KempnerForge

amazloumi · 2026-06-25T03:15:48Z

Summary

Extends the existing image-VLM path to ingest video — a clip is an ordered set of frames — through the same registry-driven, composition-over-inheritance design. Trains from scratch end-to-end on WebVid-10M across all four fusion archs (joint-decoder, cross-attention, MoT, MoMa). The text-only and single-image paths are unchanged (bit-exact).

Pooling connector + token-count plumbing

Add avgpool and attentional_pool (Molmo2-style, mean-query MHA) adapters via @register_adapter; introduce a typed VisionAdapter base with output_num_tokens() so the visual-token count is adapter-derived.
Thread that count through the build path, the four modality strategies, and the three seq-len checks (config/job.py, distributed/parallel.py, model/vlm.py). Projection adapters keep the token count unchanged (identity), so the image path stays bit-exact.
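
The token-count arithmetic described above can be sketched roughly as follows, assuming a square patch grid and a non-overlapping pooling window. The names `pooled_num_tokens`, `visual_token_count`, and `check_seq_len` are hypothetical illustrations, not the PR's helpers, and the real `output_num_tokens()` may treat grids differently.

```python
import math


def pooled_num_tokens(num_patches: int, pool: int) -> int:
    """Tokens left after pooling a square patch grid with a pool x pool window."""
    side = math.isqrt(num_patches)
    if side * side != num_patches:
        raise ValueError(f"{num_patches} patches do not form a square grid")
    if pool < 1 or side % pool != 0:
        # Assumption: ragged grids are rejected up front rather than padded.
        raise ValueError(f"grid side {side} is not divisible by pool size {pool}")
    return (side // pool) ** 2


def visual_token_count(num_patches: int, pool: int, frames_per_clip: int = 1) -> int:
    """Static visual-prefix length: frames times pooled tokens per frame."""
    if frames_per_clip < 1:
        raise ValueError("frames_per_clip must be >= 1")
    return frames_per_clip * pooled_num_tokens(num_patches, pool)


def check_seq_len(seq_len: int, visual_tokens: int, min_text_tokens: int = 1) -> int:
    """Return the text-token budget, or raise if the visual prefix does not fit."""
    budget = seq_len - visual_tokens
    if budget < min_text_tokens:
        raise ValueError(
            f"seq_len={seq_len} leaves {budget} text tokens after "
            f"{visual_tokens} visual tokens"
        )
    return budget
```

With illustrative numbers (256 patches per frame, a 2x2 pool, 8 frames), the prefix is 8 · 64 = 512 tokens, so a 2048-token sequence would leave 1536 tokens for text. A pool size of 1 reproduces the unpooled single-image count.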

Video data path

data/video_io.py: timestamp-based frame sampling (2 fps, uniform, first & last frame kept — Molmo2 §3.1/§A) + PyAV decode. (torchcodec, the paper's decoder, can't load on the cluster — no system FFmpeg + CUDA-lib mismatch — so we use PyAV, whose wheel bundles FFmpeg; lazily imported.)
data/video_dataset.py: WebVidVideoDataset (verified id[:2]/id[:4]/id[:6]/id.mp4 mapping, CSV manifest, reuses the image preprocessing) + VideoCollator → (B, F, 3, H, W) + frame mask.
config/video.py: [video] VideoConfig (data_root, split, fps, max_frames, frame_size, max_samples) wired into JobConfig (+ is_video).
Adds av to dependencies.
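
One plausible reading of the sampling policy and the path mapping above, as a minimal sketch. It assumes the clip duration is already known, that timestamps are spread uniformly between the first and last frame, and that `max_frames` caps the count; `sample_timestamps`, `webvid_path`, and the example root directory are hypothetical names, and the actual logic in `data/video_io.py` and `data/video_dataset.py` may differ.

```python
from pathlib import PurePosixPath


def sample_timestamps(duration_s: float, fps: float = 2.0, max_frames: int = 8) -> list[float]:
    """Uniform timestamps in seconds that always include 0 and the clip end."""
    if duration_s < 0 or fps <= 0 or max_frames < 1:
        raise ValueError("duration_s >= 0, fps > 0 and max_frames >= 1 are required")
    if duration_s == 0 or max_frames == 1:
        # Only one frame is available or allowed, so keep the first frame.
        return [0.0]
    # Frame count at the target rate (counting both endpoints), capped by max_frames.
    n = max(2, min(int(duration_s * fps) + 1, max_frames))
    stamps = [duration_s * i / (n - 1) for i in range(n)]
    stamps[-1] = duration_s  # pin the last frame exactly, avoiding float drift
    return stamps


def webvid_path(root: str, video_id: str) -> PurePosixPath:
    """Map a video id to root/id[:2]/id[:4]/id[:6]/id.mp4."""
    return PurePosixPath(root) / video_id[:2] / video_id[:4] / video_id[:6] / f"{video_id}.mp4"
```

For a 3-second clip at 2 fps this yields seven timestamps spaced 0.5 s apart. Longer clips fall back to `max_frames` evenly spaced timestamps, which keeps the frame count bounded.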

Frame-aware model + training wiring

Generalize _project_image_features → _project_visual_features: folds the frame axis through the encoder + pooler to (B, F·P′, dim) (image is the F == 1 case).
Thread frames_per_clip so the static visual-token count equals F·P′ (drives the residual budget and MoT's positional split; static == runtime).
scripts/train.py builds the video dataset/collator when [video] is set.
Add configs/train/vlm_video_webvid.toml (SigLIP2 + avgpool + WebVid).
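
The frame-axis fold can be illustrated with NumPy arrays standing in for tensors. This is a sketch under stated assumptions: `stub_encode` is a hypothetical placeholder for the encoder + pooler (it only averages pixel cells), and `project_visual` is not the PR's `_project_visual_features`; it shows only the reshape pattern, with the single image as the F == 1 case.

```python
import numpy as np


def stub_encode(images: np.ndarray, patch: int = 4) -> np.ndarray:
    """Hypothetical stand-in for encoder + pooler: (N, C, H, W) -> (N, P', C)."""
    n, c, h, w = images.shape
    gh, gw = h // patch, w // patch
    # Cut each image into patch x patch cells and average every cell per channel.
    cells = images[:, :, : gh * patch, : gw * patch].reshape(n, c, gh, patch, gw, patch)
    pooled = cells.mean(axis=(3, 5))  # (N, C, gh, gw)
    return pooled.reshape(n, c, gh * gw).transpose(0, 2, 1)  # (N, P', C)


def project_visual(pixels: np.ndarray) -> np.ndarray:
    """Accept (B, C, H, W) or (B, F, C, H, W) and return (B, F * P', dim)."""
    if pixels.ndim == 4:
        pixels = pixels[:, None]  # treat an image as a one-frame clip
    if pixels.ndim != 5:
        raise ValueError(f"expected 4D or 5D input, got {pixels.ndim}D")
    b, f = pixels.shape[:2]
    flat = pixels.reshape(b * f, *pixels.shape[2:])  # fold frames into the batch
    feats = stub_encode(flat)  # (B * F, P', dim)
    p, d = feats.shape[1:]
    return feats.reshape(b, f * p, d)  # unfold: frame-major token order
```

Because frames are folded into the batch axis and unfolded afterwards, tokens from frame 0 precede tokens from frame 1, and a 4D image produces the same output as the equivalent one-frame clip.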

Design faithfulness

Video is not a new arch, so the four existing ModalityStrategy classes were generalized rather than duplicated; new components arrive via the registry with no edits to the text/image fast paths.
FSDP2-only (PP still rejected for VLM); fixed-shape collation for DP-rank consistency; DCP-stable FQNs.
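
Fixed-shape collation can be sketched as padding every clip to `max_frames` and recording which frames are real, so each data-parallel rank sees the same batch shape. `collate_clips` is a hypothetical function using NumPy arrays in place of tensors; zero padding and truncating over-long clips to their first `max_frames` frames are assumptions, not confirmed details of `VideoCollator`.

```python
import numpy as np


def collate_clips(clips: list[np.ndarray], max_frames: int) -> tuple[np.ndarray, np.ndarray]:
    """Stack clips of shape (F_i, C, H, W) into (B, max_frames, C, H, W) plus a mask."""
    if not clips:
        raise ValueError("cannot collate an empty batch")
    frame_shape = clips[0].shape[1:]
    batch = np.zeros((len(clips), max_frames, *frame_shape), dtype=clips[0].dtype)
    frame_mask = np.zeros((len(clips), max_frames), dtype=bool)
    for i, clip in enumerate(clips):
        if clip.shape[1:] != frame_shape:
            raise ValueError("all clips must share the same (C, H, W)")
        n = min(clip.shape[0], max_frames)  # assumption: extra frames are dropped
        batch[i, :n] = clip[:n]
        frame_mask[i, :n] = True  # True marks real frames, False marks padding
    return batch, frame_mask
```

The output shape depends only on the batch size and `max_frames`, never on the clips themselves, which is the property the fixed-shape requirement relies on.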

Deferred to follow-up PRs:
per-frame timestamp tokens + special tokens, grounding (<points>/<tracks> + point-F1/track-J&F eval), frame-mask-aware attention, bidirectional visual attention, VLM sequence packing, long-context (blocked on context-parallel being wired), and warm-start from a converted image-VLM checkpoint.

Testing

uv run ruff check passes
uv run ruff format --check passes
uv run pyright kempnerforge/ passes (0 errors)
uv run pytest tests/unit/ -v — 1493 passed, 2 skipped
uv run torchrun --nproc_per_node=4 -m pytest tests/distributed — 99 passed, 2 skipped, 0 failed (incl. VLM FSDP/MoT/MoMa/cross-attn suites)
uv run pytest tests/e2e/ --e2e — 25 passed, 1 skipped (7B --slow). 5 failures (PP / checkpoint-resume / sigterm) are pre-existing on main — verified identical with this branch's changes stashed; they're in code paths this PR doesn't touch.
Tested on 4× H100 (FSDP) — from-scratch WebVid training runs end-to-end for all four archs via configs/train/vlm_video_webvid.toml (+ per-arch variants).

codecov · 2026-06-25T03:19:13Z

Codecov Report

❌ Patch coverage is 93.25843% with 24 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| kempnerforge/data/video_io.py | 78.43% | 7 Missing and 4 partials ⚠️ |
| kempnerforge/data/video_dataset.py | 92.38% | 4 Missing and 4 partials ⚠️ |
| kempnerforge/model/adapter.py | 96.33% | 2 Missing and 2 partials ⚠️ |
| kempnerforge/model/vlm.py | 96.87% | 1 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| kempnerforge/config/adapter.py | `100.00% <100.00%> (ø)` |
| kempnerforge/config/job.py | `88.59% <100.00%> (+1.72%)` ⬆️ |
| kempnerforge/config/video.py | `100.00% <100.00%> (ø)` |
| kempnerforge/distributed/parallel.py | `59.06% <100.00%> (+0.24%)` ⬆️ |
| kempnerforge/model/vlm.py | `99.07% <96.87%> (+0.12%)` ⬆️ |
| kempnerforge/model/adapter.py | `97.27% <96.33%> (-2.73%)` ⬇️ |
| kempnerforge/data/video_dataset.py | `92.38% <92.38%> (ø)` |
| kempnerforge/data/video_io.py | `78.43% <78.43%> (ø)` |

Copilot

Pull request overview

This PR extends the existing VLM wrapper to support video clips (as ordered frame batches) using the same registry-driven composition approach as the current image-VLM path, and adds token-reducing pooling connectors so multi-frame clips fit within the sequence budget.

Changes:

Add pooling connectors (avgpool, attentional_pool) with a VisionAdapter.output_num_tokens() contract, and thread adapter-derived visual token counts through build/strategy/seq-len checks.
Add a WebVid-style video data pipeline (timestamp sampling + PyAV decode, dataset + collator, [video] config and JobConfig wiring) and hook it into scripts/train.py.
Generalize the VLM visual projection path to accept both (B,3,H,W) and (B,F,3,H,W) and add docs/configs/tests for video training across all four fusion archs.

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| uv.lock | Adds `av` to the locked dependency set. |
| pyproject.toml | Adds runtime dependency on `av>=17.1.0`. |
| README.md | Updates VLM docs to describe video + pooling connectors and adds a video training example command. |
| CHANGELOG.md | Documents the new video + pooling-connector features and associated components. |
| scripts/train.py | Wires `[video]` to build a video dataset/collator and passes `frames_per_clip` into model build. |
| kempnerforge/config/video.py | Introduces `[video]` `VideoConfig` with validation. |
| kempnerforge/config/job.py | Threads video-aware visual token budgeting into seq-len checks; adds `is_video` and `[video]`/`[vlm]` invariant. |
| kempnerforge/config/adapter.py | Adds pooling-related config fields and token-count prediction via `output_num_tokens`. |
| kempnerforge/model/vlm.py | Generalizes visual projection to 4D/5D inputs and uses adapter-derived token counts for prefix length/splits. |
| kempnerforge/model/adapter.py | Adds `VisionAdapter` base, pooling adapters, and shared pooled-token-count helper. |
| kempnerforge/distributed/parallel.py | Ensures parallel build sizes Transformer’s image-prefix split using adapter-derived `visual_tokens` and `frames_per_clip`. |
| kempnerforge/data/video_io.py | Adds timestamp sampling policy and PyAV-based frame decoding. |
| kempnerforge/data/video_dataset.py | Adds `WebVidVideoDataset` and `VideoCollator` producing fixed-shape `(B,F,3,H,W)` + `frame_mask`. |
| docs/how-to/train-on-video.md | New guide explaining token budget, config, and usage for video training. |
| docs/how-to/index.md | Links the new “Train on video” guide in the how-to index. |
| configs/train/vlm_video_webvid.toml | Adds a reference training config for video VLM on WebVid-10M using `avgpool`. |
| tests/unit/test_vlm.py | Adds unit tests for pooling token plumbing and video forward across all four archs. |
| tests/unit/test_adapter.py | Adds unit tests for pooled token counting, pooling adapters, registry wiring, and config integration. |
| tests/unit/test_video_io.py | Adds unit/integration tests for timestamp sampling and frame decoding (with skips when needed). |
| tests/unit/test_video_dataset.py | Adds unit tests for dataset path mapping, masking/padding behavior, and synthetic integration. |
| tests/unit/test_video_config.py | Adds tests for `VideoConfig` validation and JobConfig `[video]` wiring. |
| tests/unit/test_moma.py | Updates MoMa stubs to satisfy the new `output_num_tokens`/`frames_per_clip` expectations. |

Add video understanding to the VLM path (all four archs)

3bbfa05

amazloumi added 2 commits June 24, 2026 23:21

docs: document video understanding (CHANGELOG, README, how-to)

95dd34a

Fix docs build warning and add video decode/dataset test coverage

63779f9

amazloumi requested review from Naeemkh, camilobrownpinilla, Copilot and mmshad and removed request for Naeemkh June 25, 2026 03:47

Copilot started reviewing on behalf of amazloumi June 25, 2026 03:48 View session

Copilot AI reviewed Jun 25, 2026

Comment thread kempnerforge/model/vlm.py

Comment thread kempnerforge/model/adapter.py

Comment thread kempnerforge/config/adapter.py Outdated

Validate frames-per-clip and reject ragged attentional_pool grids before runtime

ddd75bd

Copilot started work on behalf of amazloumi June 25, 2026 04:19 View session

Copilot finished work on behalf of amazloumi June 25, 2026 04:20

Copilot started work on behalf of amazloumi June 25, 2026 04:22 View session

Copilot finished work on behalf of amazloumi June 25, 2026 04:23

Copilot started work on behalf of amazloumi June 25, 2026 04:23 View session

Copilot finished work on behalf of amazloumi June 25, 2026 04:25

amazloumi wants to merge 4 commits into main from worktree-video-pipeline

3 participants
