[1/n] Add vision transformer, connector, and MultimodalTransformer#692
[1/n] Add vision transformer, connector, and MultimodalTransformer#692jason718 wants to merge 4 commits into
Conversation
f3361b9 to
34ba65d
Compare
CI failure analysisAll 3 failing CI jobs are due to the PR coming from an external fork ( Test (CPU)5 tests fail — all with credential errors, not logic errors:
These tests all pass on PRs from within the main repo (e.g. #681) where secrets are available. Test checkpointFails with S3 Test olmo3 ladderFails with empty Beaker token — same root cause. These failures are pre-existing on all external fork PRs and do not reflect on the correctness of the vision architecture code being introduced here. |
VLM/multimodal vision-language architecture stack: - VisionBackbone (OpenAI CLIP / SigLIP / SigLIP2): OpenAI-style ViT encoder with configurable image size, patch size, embedding dim, and attention heads. Supports CLIP (openai), SigLIP (siglip), and SigLIP2 (siglip2) initialisation. - VisionConnector: attention-pooling (2×2) + SwiGLU MLP projector that maps vision embeddings to the language-model hidden dimension. - MultimodalTransformer: composite model that fuses image patch tokens into the LM token stream at image-patch positions, then runs the full LM forward pass. - Removed DINOv2 backbone variants (not used in Molmo2). - HF parity tests for CLIP, SigLIP, and SigLIP2 vision encoders.
34ba65d to
c7b2ebb
Compare
There was a problem hiding this comment.
Pull request overview
Adds the first slice of multimodal/VLM support to olmo_core.nn by introducing a vision backbone (ViT variants), a vision→LM connector, and a composite multimodal model that splices projected image features into the LM embedding stream. This lays the groundwork for subsequent PRs to build full multimodal training/inference workflows.
Changes:
- Added new vision modules:
VisionTransformer/SiglipVisionTransformer,VisionConnector, andMultimodalTransformerwith corresponding config objects. - Extended the LM
Transformer.forward()API to optionally accept precomputedinput_embeddings(for multimodal fusion). - Added unit tests and HF parity tests for CLIP/SigLIP/SigLIP2 equivalence (skipping when checkpoints/deps aren’t available).
Reviewed changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
src/olmo_core/nn/vision/__init__.py |
Exposes vision and multimodal public API from the new vision package. |
src/olmo_core/nn/vision/config.py |
Adds VisionBackboneConfig / VisionBackboneType and factory methods for CLIP/SigLIP/SigLIP2 variants. |
src/olmo_core/nn/vision/image_vit.py |
Implements CLIP-style and SigLIP-style ViT encoders returning per-layer hidden states. |
src/olmo_core/nn/vision/connector.py |
Implements pooling + projection connector from vision features to LM d_model. |
src/olmo_core/nn/vision/multimodal.py |
Implements composite multimodal model that injects pooled image features into LM embeddings. |
src/olmo_core/nn/transformer/model.py |
Adds input_embeddings support to Transformer.forward() for multimodal embedding splice. |
src/olmo_core/nn/__init__.py |
Re-exports vision/multimodal modules at the olmo_core.nn package level. |
src/test/nn/vision/__init__.py |
Adds test package initializer for vision tests. |
src/test/nn/vision/config_test.py |
Tests vision config factories, build dispatch, and basic invariants. |
src/test/nn/vision/image_vit_test.py |
Tests ViT forward shapes, determinism, CLS presence, and pos-emb interpolation behavior. |
src/test/nn/vision/connector_test.py |
Tests connector pooling/projector variants, padding mask behavior, and multi-layer inputs. |
src/test/nn/vision/multimodal_test.py |
Tests multimodal forward in text-only and image-splice modes, plus multi-crop and meta device. |
src/test/nn/vision/parity_test.py |
HF numerical parity tests for CLIP/SigLIP/SigLIP2 vision encoders (skip when unavailable). |
CHANGELOG.md |
Documents the addition of the new multimodal/vision modules. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- image_vit.py: drop the `transformers` dependency. The needed activations (`quick_gelu`, `gelu_pytorch_tanh`, `gelu`) are implemented with plain PyTorch ops (verified bit-exact vs transformers); unknown names raise. - transformer/model.py: raise early when `input_embeddings` is used with context parallelism, which would misalign the (unsharded) embeddings against sharded input_ids/labels/RoPE. - multimodal.py: make the image-feature splice robust to a non-contiguous `h` by calling `.contiguous()` before the masked view-assignment. - config.py: use `Self` for factory return types; clarify the `image_num_layers`=23 docstring (full CLIP tower is 24; final block unused when reading from layer -2). - Convert relative imports to absolute across the vision modules and drop external-project references from docstrings. - parity_test.py: try `local_files_only=True` first so cached checkpoints don't trigger a network download.
- image_vit.py: drop the `transformers` dependency. The needed activations (`quick_gelu`, `gelu_pytorch_tanh`, `gelu`) are implemented with plain PyTorch ops (verified bit-exact vs transformers); unknown names raise. - transformer/model.py: raise early when `input_embeddings` is used with context parallelism, which would misalign the (unsharded) embeddings against sharded input_ids/labels/RoPE. - multimodal.py: make the image-feature splice robust to a non-contiguous `h` by calling `.contiguous()` before the masked view-assignment. - config.py: use `Self` for factory return types; clarify the `image_num_layers`=23 docstring (full CLIP tower is 24; final block unused when reading from layer -2). - Convert relative imports to absolute across the vision modules and drop external-project references from docstrings. - parity_test.py: try `local_files_only=True` first so cached checkpoints don't trigger a network download.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Summary
This is the first in a series of a few PRs adding VLM/multimodal support to OLMo-core.
What's in this PR
Three new modules under
src/olmo_core/nn/vision/:VisionBackbone(image_vit.py,config.py)OpenAI-style Vision Transformer encoder supporting two architecture families:
openai(CLIP)openai/clip-vit-large-patch14-336siglipgoogle/siglip-so400m-patch14-384siglip2google/siglip2-so400m-patch14-384Key config knobs:
image_default_input_size,image_patch_size,image_emb_dim,image_num_heads,image_num_layers,image_num_pos. Factory class-methods cover all standard Molmo2 variants.VisionConnector(connector.py)Maps variable-length vision embeddings to the LM hidden dimension:
output_dim = lm.d_modelConstructed via
VisionConnectorConfig.from_vision_backbone(vis_cfg, output_dim, mlp_hidden_size).MultimodalTransformer(multimodal.py)Composite model holding
lm: Transformer,vision: VisionBackbone, andconnector: VisionConnector. Forward pass:image_patch_token_idpositions.HF Parity Tests
src/test/nn/vision/parity_test.pyloads real checkpoints and asserts numerical equivalence between our implementation and HuggingFace's reference models. Tests skip automatically when checkpoints are not cached.Measured max absolute errors on CPU in float32 (24–27 transformer layers):
openai/clip-vit-large-patch14-336atol=3e-3, rtol=1e-3google/siglip-so400m-patch14-384atol=3e-3, rtol=1e-3google/siglip2-so400m-patch14-384atol=3e-3, rtol=1e-3These errors reflect genuine float32 accumulation across 24–27 attention/MLP layers from independent kernel implementations — not any precision loss. They are ~10–100× tighter than fp16 parity would require.
Files changed