Skip to content

Mask padded video frames from attention#132

Merged
amazloumi merged 3 commits into
mainfrom
feat/mask-padded-frames
Jun 30, 2026
Merged

Mask padded video frames from attention#132
amazloumi merged 3 commits into
mainfrom
feat/mask-padded-frames

Conversation

@amazloumi

@amazloumi amazloumi commented Jun 29, 2026

Copy link
Copy Markdown
Member

Summary

  • Consume the per-frame frame_mask so real tokens never attend to padded-frame visual tokens, across all four archs. One generic per-token validity mask, ModalityContext.key_padding_mask (B, S), threads through the model — consumers are modality-agnostic.
  • Shared Attention ANDs it with the causal (and doc) mask — covers Joint-Decoder + MoMa; MoTAttention builds an explicit causal-AND-valid mask; Cross-Attention masks the padded image K/V via its existingimage_mask.
  • MoMaFFN excludes padded positions from expert-choice routing — otherwise padded tokens consume expert capacity and perturb which real tokens get processed (caught by a per-arch invariance test).
  • NaN guard unmasks fully-masked query rows (an all-padded / undecodable clip) so softmax stays finite.
  • vlm.py: _visual_token_mask expands frame_mask (B,F)(B, F·P′); the four strategies place it. scripts/train.py threads batch["frame_mask"].
  • Pure mask — no new state-dict keys (checkpoints stay compatible both ways); image (F=1) and text paths are unchanged (no-op when nothing is padded). Foundation for variable-length / mixed image+video batches.
  • Docs + CHANGELOG updated; "frame-mask-aware attention" removed from the video Deferred list.
  • Follow-up (out of scope): MoT with an MoE FFN still routes padded tokens through the shared MoEMLP — a "generic token-validity in MoE" change (also fixes padded text). MoT-dense (default) and MoMa are fully masked here.

Testing

  • uv run ruff check kempnerforge/ tests/ passes
  • uv run ruff format --check kempnerforge/ tests/ scripts/ passes
  • uv run pyright kempnerforge/ passes (0 errors)
  • uv run pytest tests/unit/ -v --timeout=60 passes — affected files + the doc_ids/packing regression: 234 passed (new: test_vlm.py::TestFramePaddingMask — per-arch masking invariance, image no-op, undecodable-clip NaN guard, mask expansion; test_moma.py FFN routing exclusion; test_modality_context.py invariant)
  • If distributed code changed: uv run torchrun --nproc_per_node=4 -m pytest tests/distributed/ -vparallel.py unchanged, but the model forward (attention/mot/moma) changed; worth running on GPUs

[x] e2e: 24 passed, 6 failed — all 6 failures are pre-existing on main (PP/checkpoint/HF/sigterm infra, confirmed via baseline run) and unrelated to this PR; tracked in #133. The basic FSDP/TP e2e tests, which exercise the changed model forward, pass.

Closes #131

@codecov

codecov Bot commented Jun 29, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 87.87879% with 8 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
kempnerforge/model/attention.py 55.55% 7 Missing and 1 partial ⚠️
Files with missing lines Coverage Δ
kempnerforge/model/cross_attention.py 100.00% <100.00%> (ø)
kempnerforge/model/modality.py 100.00% <100.00%> (ø)
kempnerforge/model/moma.py 98.07% <100.00%> (+0.09%) ⬆️
kempnerforge/model/mot.py 94.21% <100.00%> (+0.35%) ⬆️
kempnerforge/model/transformer.py 94.62% <100.00%> (+0.02%) ⬆️
kempnerforge/model/vlm.py 99.18% <100.00%> (+0.11%) ⬆️
kempnerforge/model/attention.py 87.87% <55.55%> (-3.14%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR wires a per-frame frame_mask through the VLM stack so padded video frames are masked out of attention (and, for MoMa, excluded from expert-choice routing), ensuring real-token outputs are invariant to padded-frame content across all four VLM architectures.

Changes:

  • Add a generic per-token validity mask (ModalityContext.key_padding_mask) and thread it through Transformer blocks (and MoT/MoMa paths).
  • Consume the mask in attention implementations (shared self-attention, MoT self-attention, cross-attention image K/V masking) with a NaN guard for fully-masked rows.
  • Pass frame_mask from the training loop, add unit tests for masking invariance/NaN-guard behavior, and update docs/CHANGELOG.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
kempnerforge/model/vlm.py Expands frame_mask to per-visual-token masks and places masks into ModalityContext per-arch.
kempnerforge/model/modality.py Adds ModalityContext.key_padding_mask field and invariant checks.
kempnerforge/model/transformer.py Threads key_padding_mask through blocks and MoT/MoMa branches.
kempnerforge/model/attention.py ANDs key_padding_mask with causal/doc masks + NaN guard in shared self-attention.
kempnerforge/model/mot.py Builds explicit causal-and-valid attention mask when key_padding_mask is provided + NaN guard.
kempnerforge/model/cross_attention.py Uses image_mask with a NaN guard to keep all-masked rows finite.
kempnerforge/model/moma.py Excludes padded positions from expert-choice routing via key_padding_mask.
scripts/train.py Threads batch["frame_mask"] (when present) into the model forward.
tests/unit/test_vlm.py Adds masking invariance, mask expansion, F=1 no-op, and all-padded NaN-guard tests across archs.
tests/unit/test_moma.py Adds unit test ensuring padded tokens are excluded from MoMa routing and don’t perturb real outputs.
tests/unit/test_modality_context.py Adds invariant tests for key_padding_mask.
docs/how-to/train-on-video.md Updates training-on-video docs to reflect frame-mask-aware behavior and trade-offs.
CHANGELOG.md Documents the new frame padding masking behavior and related changes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread kempnerforge/model/attention.py Outdated
Comment thread docs/how-to/train-on-video.md Outdated
Comment thread CHANGELOG.md Outdated
@amazloumi amazloumi merged commit 32accd1 into main Jun 30, 2026
9 checks passed
@amazloumi amazloumi deleted the feat/mask-padded-frames branch June 30, 2026 16:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Mask padded video frames from attention in the VLM video path

4 participants