Support ragged grids in attentional_pool #136
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Pull request overview
This PR updates the vision connector’s attentional_pool adapter to support ragged patch grids (where the grid side is not divisible by the pooling window) by padding to ceil(grid/window) * window and masking padded patches so partial edge windows only pool over real patches. This brings attentional_pool to parity with avgpool, unblocks Molmo2-style 3×3 video pooling on SigLIP’s 14×14 grid, and ensures config-time token-count estimation matches runtime behavior.
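As a rough illustration of the padding-and-masking idea described above, the sketch below builds, for each pooling window, a boolean list marking which slots hold real patches and which are bottom/right padding. The helper name `window_valid_mask` is hypothetical and is not taken from `kempnerforge/model/adapter.py`; the real adapter presumably works on tensors rather than Python lists.

```python
def window_valid_mask(grid: int, window: int) -> list[list[bool]]:
    """Return one mask per pooling window (row-major order).

    Each mask has window*window entries: True for a real patch,
    False for a slot that only exists because of bottom/right padding.
    Hypothetical helper for illustration only.
    """
    side = (grid + window - 1) // window  # windows per side, i.e. ceil(grid / window)
    masks = []
    for win_row in range(side):
        for win_col in range(side):
            mask = []
            for dr in range(window):
                for dc in range(window):
                    row = win_row * window + dr
                    col = win_col * window + dc
                    # A slot is real only if it lies inside the original grid.
                    mask.append(row < grid and col < grid)
            masks.append(mask)
    return masks


masks = window_valid_mask(grid=14, window=3)
print(len(masks))                    # 25 windows (5×5)
print(sum(masks[0]))                 # 9: interior window, all patches real
print(sum(masks[4]))                 # 6: right-edge window, one padded column
print(sum(masks[24]))                # 4: bottom-right corner window
print(sum(sum(m) for m in masks))    # 196: every real patch counted exactly once
```

For the 14×14 grid with a window of 3, the grid is padded to 15×15, so only the last row and last column of windows contain padded slots.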
Changes:
- Implement ragged-grid handling in `AttentionalPoolAdapter.forward` via bottom/right padding + per-window boolean attention mask + masked-mean query.
- Update token-count prediction to `ceil(grid/window)²` for `attentional_pool` (and remove the divisibility-only enforcement seam for current pooling adapters).
- Add/adjust unit tests and update documentation + changelog to reflect ragged support.
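The token-count rule above is simple enough to check by hand. The following is a minimal sketch, assuming a square patch grid; `pooled_token_count` and `padded_side` are hypothetical names used only for this illustration and are not the adapter's `output_num_tokens` API.

```python
def padded_side(grid: int, window: int) -> int:
    """Grid side after bottom/right padding: ceil(grid / window) * window."""
    return ((grid + window - 1) // window) * window


def pooled_token_count(grid: int, window: int) -> int:
    """Tokens produced by pooling a grid×grid patch grid: ceil(grid / window) ** 2."""
    side = (grid + window - 1) // window  # integer ceil, avoids float rounding
    return side * side


print(padded_side(14, 3))         # 15
print(pooled_token_count(14, 3))  # 25 (ragged: 14 % 3 != 0)
print(pooled_token_count(14, 2))  # 49 (divisible case is unchanged)
```

Because the same formula covers divisible and ragged grids, a config-time estimate computed this way should agree with the number of windows the forward pass actually produces.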
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| kempnerforge/model/adapter.py | Adds ragged-grid padding/masking to attentional_pool and aligns output_num_tokens with ceil(grid/window)². |
| tests/unit/test_adapter.py | Updates expectations and adds coverage for ragged token counts and masked edge-window correctness. |
| docs/how-to/train-on-video.md | Documents that both pooling connectors support ragged windows and gives the 14×14 @ window=3 example (25 tokens). |
| CHANGELOG.md | Records the new ragged attentional_pool capability and its practical impact on 3×3 video pooling. |
Summary
- `attentional_pool` now pools ragged patch grids (grid not divisible by the pool window) instead of rejecting them: each partial edge window pools only its real patches via masked attention. The adapter pads the grid to `ceil(grid/w)·w`, builds a per-window key-padding mask, uses the masked window-mean as the query, and masks padded patches out of the SDPA K/V. Mirrors `avgpool`'s ragged handling.
- Unblocks Molmo2-style 3×3 video pooling on SigLIP's 14×14 grid (`14 % 3 != 0` → 5×5 = 25 tokens), which previously had to fall back to `avgpool` or a divisible window.
- [...] `attn_mask=None` path.
- `AttentionalPoolAdapter.output_num_tokens` → `ceil(grid/w)**2` (drops `require_divisible`); `DIVISIBLE_ONLY_POOL_TYPES` is emptied (kept as a seam for a future divisible-only connector), so `config/adapter.py`'s `output_num_tokens` and the `max_seq_len` checks accept ragged.
- [...] `output_num_tokens`). The "v1 requires divisible / ragged is a follow-up" caveats are removed from the docstrings + how-to.
- `kempnerforge/model/adapter.py`, `tests/unit/test_adapter.py`, `docs/how-to/train-on-video.md`, `CHANGELOG.md`.

Testing
- `uv run ruff check kempnerforge/ tests/` passes
- `uv run ruff format --check kempnerforge/ tests/ scripts/` passes
- `uv run pyright kempnerforge/` passes (0 errors, 0 warnings, 0 informations)
- `uv run pytest tests/unit/ -v --timeout=60`: ran `tests/unit/test_adapter.py`, 91 passed (new: ragged token count `14×14 @ 3 → 25`; masked edge-window correctness, where a 1-real-patch window equals attention over that single patch; `output_num_tokens` matches forward for ragged; config accepts ragged; flipped the old ragged-rejection tests). Full unit suite runs in CI.
- `uv run torchrun --nproc_per_node=4 -m pytest tests/distributed/ -v`: n/a, no `distributed/` / parallelism code changed; the adapter runs unchanged under FSDP.
- `uv run pytest tests/e2e/ --e2e -v`: n/a, pure adapter-forward math + config gate.

Closes #135
refs: KEM-546
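
The masked edge-window property mentioned under Testing (a window with one real patch equals attention over that single patch) can be illustrated with a small pure-Python sketch. It assumes a single head and identity Q/K/V projections, which is a simplification and not the adapter's actual parameterisation; `masked_window_pool` is a hypothetical name.

```python
import math


def masked_window_pool(patches: list[list[float]], valid: list[bool]) -> list[float]:
    """Attention-pool one padded window, ignoring padded slots.

    Sketch only: single head, identity projections (an assumption).
    The query is the mean over real patches; padded keys get a score of
    -inf, so their softmax weight is exactly zero.
    """
    real = [p for p, ok in zip(patches, valid) if ok]
    if not real:
        raise ValueError("window must contain at least one real patch")
    dim = len(real[0])

    # Masked mean over real patches serves as the query.
    query = [sum(p[i] for p in real) / len(real) for i in range(dim)]

    # Scaled dot-product scores, with padded keys masked to -inf.
    scale = 1.0 / math.sqrt(dim)
    scores = [
        scale * sum(q * k for q, k in zip(query, p)) if ok else float("-inf")
        for p, ok in zip(patches, valid)
    ]

    # Numerically stable softmax; exp(-inf) evaluates to 0.0.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of values (values are the patches themselves here).
    return [sum(w * p[i] for w, p in zip(weights, patches)) for i in range(dim)]


# A 2×2 window where only the top-left slot is a real patch.
window = [[1.0, 2.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
valid = [True, False, False, False]
print(masked_window_pool(window, valid))  # [1.0, 2.0]
```

Under these assumptions the pooled output does not depend on what the padded slots contain, since their attention weights are zero and they are excluded from the query mean.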