Mask padded video frames from attention #132
Merged
Pull request overview
This PR wires a per-frame frame_mask through the VLM stack so padded video frames are masked out of attention (and, for MoMa, excluded from expert-choice routing), ensuring real-token outputs are invariant to padded-frame content across all four VLM architectures.
Changes:
- Add a generic per-token validity mask (`ModalityContext.key_padding_mask`) and thread it through `Transformer` blocks (and MoT/MoMa paths).
- Consume the mask in attention implementations (shared self-attention, MoT self-attention, cross-attention image K/V masking) with a NaN guard for fully-masked rows.
- Pass `frame_mask` from the training loop, add unit tests for masking invariance/NaN-guard behavior, and update docs/CHANGELOG.
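To make the second item concrete, here is a minimal, framework-free sketch of combining a causal mask with a per-key validity mask and guarding fully-masked rows. It is not the PR's code: the function name `masked_attention_weights` and the plain-list layout are assumptions for illustration, and the real implementation operates on PyTorch tensors. The guard matters because, in PyTorch, a softmax over a row that is entirely `-inf` returns NaN.

```python
import math


def masked_attention_weights(scores, key_valid):
    """Softmax over keys under a causal AND key-validity mask (hypothetical helper).

    scores: S x S list of raw attention scores for a single head.
    key_valid: length-S list of bools, False for padded positions.
    Rows with no allowed key return all zeros instead of NaN.
    """
    seq_len = len(scores)
    weights = []
    for q in range(seq_len):
        # A key is allowed only if it is causal (k <= q) and not padding.
        allowed = [k <= q and key_valid[k] for k in range(seq_len)]
        if not any(allowed):
            # NaN guard: nothing to normalize over, so emit a zero row.
            weights.append([0.0] * seq_len)
            continue
        row_max = max(s for s, ok in zip(scores[q], allowed) if ok)
        exps = [math.exp(s - row_max) if ok else 0.0
                for s, ok in zip(scores[q], allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Position 0 is padding; its scores must not influence any row.
scores = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
weights = masked_attention_weights(scores, [False, True, True])
# weights[0] is the guarded all-zero row; weights[1] puts all mass on key 1.
```

Because masked scores are never read, replacing the padded column with arbitrary values leaves every row unchanged, which is the invariance property the unit tests in this PR check.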
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `kempnerforge/model/vlm.py` | Expands `frame_mask` to per-visual-token masks and places masks into `ModalityContext` per-arch. |
| `kempnerforge/model/modality.py` | Adds `ModalityContext.key_padding_mask` field and invariant checks. |
| `kempnerforge/model/transformer.py` | Threads `key_padding_mask` through blocks and MoT/MoMa branches. |
| `kempnerforge/model/attention.py` | ANDs `key_padding_mask` with causal/doc masks + NaN guard in shared self-attention. |
| `kempnerforge/model/mot.py` | Builds explicit causal-and-valid attention mask when `key_padding_mask` is provided + NaN guard. |
| `kempnerforge/model/cross_attention.py` | Uses `image_mask` with a NaN guard to keep all-masked rows finite. |
| `kempnerforge/model/moma.py` | Excludes padded positions from expert-choice routing via `key_padding_mask`. |
| `scripts/train.py` | Threads `batch["frame_mask"]` (when present) into the model forward. |
| `tests/unit/test_vlm.py` | Adds masking invariance, mask expansion, F=1 no-op, and all-padded NaN-guard tests across archs. |
| `tests/unit/test_moma.py` | Adds unit test ensuring padded tokens are excluded from MoMa routing and don’t perturb real outputs. |
| `tests/unit/test_modality_context.py` | Adds invariant tests for `key_padding_mask`. |
| `docs/how-to/train-on-video.md` | Updates training-on-video docs to reflect frame-mask-aware behavior and trade-offs. |
| `CHANGELOG.md` | Documents the new frame padding masking behavior and related changes. |
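
The `frame_mask` expansion listed for `vlm.py` can be pictured with a short sketch. This is an illustration under assumptions, not the repository's `_visual_token_mask`: the name `expand_frame_mask`, the nested-list representation, and the frame-major token order (all tokens of frame 0, then frame 1, and so on) are hypothetical. In PyTorch the same shape change is commonly written with `torch.repeat_interleave`.

```python
def expand_frame_mask(frame_mask, tokens_per_frame):
    """Repeat each per-frame flag once per visual token of that frame.

    frame_mask: B lists of F bools (True = real frame, False = padded).
    tokens_per_frame: number of visual tokens each frame contributes.
    Returns B lists of F * tokens_per_frame bools, in frame-major order
    (an assumed layout for this sketch).
    """
    return [
        [valid for valid in row for _ in range(tokens_per_frame)]
        for row in frame_mask
    ]


# Two clips, two frames each, three tokens per frame.
# The second frame of the first clip is padding.
token_mask = expand_frame_mask([[True, False], [True, True]], 3)
```

With a single frame and nothing padded (`F=1`), the result is all `True`, which matches the no-op behavior described for the image path.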
Naeemkh approved these changes on Jun 30, 2026
Summary
- Wire `frame_mask` through the model so real tokens never attend to padded-frame visual tokens, across all four archs.
- One generic per-token validity mask, `ModalityContext.key_padding_mask (B, S)`, threads through the model; consumers are modality-agnostic.
- `Attention` ANDs it with the causal (and doc) mask, which covers Joint-Decoder + MoMa; `MoTAttention` builds an explicit causal-AND-valid mask; Cross-Attention masks the padded image K/V via its existing `image_mask`.
- `MoMaFFN` excludes padded positions from expert-choice routing; otherwise padded tokens consume expert capacity and perturb which real tokens get processed (caught by a per-arch invariance test).
- `vlm.py`: `_visual_token_mask` expands `frame_mask (B, F)` → `(B, F·P′)`; the four strategies place it. `scripts/train.py` threads `batch["frame_mask"]`.
- Image (`F=1`) and text paths are unchanged (no-op when nothing is padded). Foundation for variable-length / mixed image+video batches.
- `MoEMLP` is left for a separate "generic token-validity in MoE" change (also fixes padded text). MoT-dense (default) and MoMa are fully masked here.

Testing
- `uv run ruff check kempnerforge/ tests/` passes
- `uv run ruff format --check kempnerforge/ tests/ scripts/` passes
- `uv run pyright kempnerforge/` passes (0 errors)
- `uv run pytest tests/unit/ -v --timeout=60` passes; affected files + the doc_ids/packing regression: 234 passed (new: `test_vlm.py::TestFramePaddingMask` with per-arch masking invariance, image no-op, undecodable-clip NaN guard, mask expansion; `test_moma.py` FFN routing exclusion; `test_modality_context.py` invariant)
- `uv run torchrun --nproc_per_node=4 -m pytest tests/distributed/ -v` ← `parallel.py` unchanged, but the model forward (attention/mot/moma) changed; worth running on GPUs

Closes #131
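
To illustrate why padded tokens have to be excluded from expert-choice routing, here is a small plain-Python sketch. It is not the `MoMaFFN` implementation: `expert_choice_route`, its arguments, and the list-based affinity matrix are hypothetical names chosen for this example. The assumption is the usual expert-choice setup, where each expert selects its top `capacity` tokens by affinity.

```python
def expert_choice_route(affinity, capacity, token_valid=None):
    """For each expert, pick up to `capacity` tokens with the highest affinity.

    affinity: E lists of T floats (expert x token scores).
    token_valid: optional length-T bools; tokens marked False are never
        candidates, so they cannot take up expert capacity.
    Returns E lists of selected token indices, sorted ascending.
    """
    num_tokens = len(affinity[0])
    if token_valid is None:
        token_valid = [True] * num_tokens
    candidates = [t for t in range(num_tokens) if token_valid[t]]
    chosen = []
    for expert_scores in affinity:
        ranked = sorted(candidates, key=lambda t: expert_scores[t], reverse=True)
        chosen.append(sorted(ranked[:capacity]))
    return chosen


# One expert, four tokens; token 0 is padding that happens to score highly.
affinity = [[0.9, 0.1, 0.8, 0.5]]
unmasked = expert_choice_route(affinity, capacity=2)
masked = expert_choice_route(affinity, capacity=2,
                             token_valid=[False, True, True, True])
# unmasked selects tokens 0 and 2; masked selects tokens 2 and 3.
```

Without the mask, the padded token takes one of the two slots and real token 3 goes unprocessed; with the mask, the selection depends only on real tokens, whatever the padded token's score is.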