
[1/n] Add vision transformer, connector, and MultimodalTransformer #692

Open
jason718 wants to merge 4 commits into allenai:main from jason718:pr/vlm-1-vision-architecture



jason718 commented May 27, 2026

Summary

This is the first in a series of a few PRs adding VLM/multimodal support to OLMo-core.


What's in this PR

Three new modules under src/olmo_core/nn/vision/:

VisionBackbone (image_vit.py, config.py)

OpenAI-style Vision Transformer encoder supporting two architecture families (CLIP-style and SigLIP-style) across three backbone types:

| Type | HF checkpoint |
| --- | --- |
| openai (CLIP) | openai/clip-vit-large-patch14-336 |
| siglip | google/siglip-so400m-patch14-384 |
| siglip2 | google/siglip2-so400m-patch14-384 |

Key config knobs: image_default_input_size, image_patch_size, image_emb_dim, image_num_heads, image_num_layers, image_num_pos. Factory class-methods cover all standard Molmo2 variants.
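As a rough illustration of how the size-related knobs interact, the sketch below derives the patch grid and position count from an input size and patch size. `PatchGridSketch` and its `has_cls_token` flag are hypothetical stand-ins, not the PR's `VisionBackboneConfig`; the assumptions that `image_num_pos` corresponds to the position count, that remainder pixels are dropped, and that only the CLIP-style tower carries a class token are mine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatchGridSketch:
    """Hypothetical stand-in for the size-related fields of the vision config."""

    image_default_input_size: tuple[int, int]  # (height, width) in pixels
    image_patch_size: int  # side length of one square patch, in pixels
    has_cls_token: bool  # assumption: whether a class token is prepended

    def grid(self) -> tuple[int, int]:
        # Assumes a strided patch embedding that drops remainder pixels.
        h, w = self.image_default_input_size
        return h // self.image_patch_size, w // self.image_patch_size

    def num_patches(self) -> int:
        rows, cols = self.grid()
        return rows * cols

    def num_positions(self) -> int:
        # One position per patch, plus one for the class token if present.
        return self.num_patches() + (1 if self.has_cls_token else 0)


clip = PatchGridSketch((336, 336), 14, has_cls_token=True)
siglip = PatchGridSketch((384, 384), 14, has_cls_token=False)

print(clip.grid(), clip.num_positions())      # (24, 24) 577
print(siglip.grid(), siglip.num_positions())  # (27, 27) 729
```

Note that 384 is not a multiple of 14, so under these assumptions the SigLIP grid is 27×27 rather than a round number.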

VisionConnector (connector.py)

Maps variable-length vision embeddings to the LM hidden dimension:

  1. 2×2 attention pooling over the patch grid
  2. SwiGLU MLP projector → output_dim = lm.d_model

Constructed via VisionConnectorConfig.from_vision_backbone(vis_cfg, output_dim, mlp_hidden_size).
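To make the 2×2 pooling step concrete, here is a framework-free sketch of how a row-major patch grid could be grouped into 2×2 windows before pooling. It shows only the grouping and the resulting token count; the attention pooling and SwiGLU projection are omitted. `pool_windows` is a hypothetical helper, and padding odd-sized grids with masked slots (marked `-1`) is an assumption, not something the PR description states.

```python
def pool_windows(rows: int, cols: int, pool: int = 2) -> list[list[int]]:
    """Group a row-major patch grid into pool x pool windows.

    Returns one list of flat patch indices per window; -1 marks a padded slot
    that a pooling layer would need to mask out.
    """
    windows = []
    for top in range(0, rows, pool):
        for left in range(0, cols, pool):
            window = []
            for r in range(top, top + pool):
                for c in range(left, left + pool):
                    inside = r < rows and c < cols
                    window.append(r * cols + c if inside else -1)
            windows.append(window)
    return windows


even = pool_windows(24, 24)  # e.g. a 24x24 patch grid
odd = pool_windows(27, 27)   # e.g. a 27x27 patch grid (odd side length)

print(len(even), even[0])  # 144 [0, 1, 24, 25]
print(len(odd), odd[-1])   # 196 [728, -1, -1, -1]
```

Each window becomes one token for the LM, so a 24×24 grid shrinks from 576 patches to 144 pooled tokens in this sketch.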

MultimodalTransformer (multimodal.py)

Composite model holding lm: Transformer, vision: VisionBackbone, and connector: VisionConnector. Forward pass:

  1. Run VisionBackbone on each crop's patch tensor.
  2. Project patch embeddings to LM width via VisionConnector.
  3. Splice patch tokens into the text embedding stream at image_patch_token_id positions.
  4. Run the standard LM forward pass on the fused sequence.
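Step 3 can be sketched without any tensor library as follows. This is a minimal illustration using plain lists; `splice_image_features` and the token id `9` are hypothetical, and whether the real implementation overwrites the text embedding or adds to it is not stated in this description (the sketch overwrites).

```python
def splice_image_features(input_ids, text_embeddings, image_features, image_patch_token_id):
    """Replace embeddings at image-patch positions with projected image features.

    input_ids:        list[int], one token id per sequence position
    text_embeddings:  list[list[float]], one vector per position
    image_features:   list[list[float]], one vector per image-patch token, in order
    """
    positions = [i for i, tok in enumerate(input_ids) if tok == image_patch_token_id]
    if len(positions) != len(image_features):
        raise ValueError(
            f"{len(positions)} image-patch tokens but {len(image_features)} image features"
        )
    fused = [list(vec) for vec in text_embeddings]  # copy; leave the input untouched
    for pos, feat in zip(positions, image_features):
        fused[pos] = list(feat)
    return fused


IMAGE_PATCH = 9  # hypothetical token id, for illustration only
ids = [5, IMAGE_PATCH, IMAGE_PATCH, 7]
text = [[0.1, 0.1], [0.0, 0.0], [0.0, 0.0], [0.4, 0.4]]
image = [[1.0, 2.0], [3.0, 4.0]]

print(splice_image_features(ids, text, image, IMAGE_PATCH))
# [[0.1, 0.1], [1.0, 2.0], [3.0, 4.0], [0.4, 0.4]]
```

The count check matters because a mismatch between placeholder tokens and image features would otherwise misalign every later patch silently.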

HF Parity Tests

src/test/nn/vision/parity_test.py loads real checkpoints and asserts numerical equivalence between our implementation and HuggingFace's reference models. Tests skip automatically when checkpoints are not cached.
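One way such a skip could work is sketched below. `load_cached_or_skip` and `fake_loader` are hypothetical; the sketch assumes the loader accepts a `local_files_only` keyword (as HuggingFace `from_pretrained` methods do) and that a cache miss surfaces as `OSError`, which may not match the actual test code.

```python
import unittest


def load_cached_or_skip(loader, name):
    """Call loader(name, local_files_only=True); skip the test if nothing is cached.

    `loader` is any callable with a from_pretrained-like signature.
    """
    try:
        return loader(name, local_files_only=True)
    except OSError as err:  # assumed failure mode for a cache miss
        raise unittest.SkipTest(f"{name} is not cached locally: {err}")


def fake_loader(name, local_files_only=False):
    # Hypothetical loader that only "has" one checkpoint in its cache.
    if name != "cached/model":
        raise OSError("not found in local cache")
    return {"name": name}


model = load_cached_or_skip(fake_loader, "cached/model")
print(model["name"])
```

Keeping the lookup offline means a missing checkpoint costs a skipped test rather than a multi-gigabyte download in CI.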

Measured max absolute errors on CPU in float32 (24–27 transformer layers):

| Checkpoint | Max abs error | Mean abs error | Tolerance used |
| --- | --- | --- | --- |
| openai/clip-vit-large-patch14-336 | 4.9e-04 | 3.1e-06 | atol=3e-3, rtol=1e-3 |
| google/siglip-so400m-patch14-384 | 2.8e-03 | 7.8e-06 | atol=3e-3, rtol=1e-3 |
| google/siglip2-so400m-patch14-384 | 3.1e-04 | 3.0e-06 | atol=3e-3, rtol=1e-3 |

These errors come from float32 rounding accumulated across 24–27 attention/MLP layers in independently implemented kernels, not from reduced precision in either implementation. They are ~10–100× tighter than fp16 parity would require.
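For readers unfamiliar with combined absolute/relative tolerances, the sketch below spells out the usual allclose-style rule, |actual - expected| <= atol + rtol * |expected|. Whether the parity tests use exactly this comparison is an assumption, and the sample values are invented for illustration (one element is off by 2.8e-03, matching the largest error reported above).

```python
def within_tolerance(actual, expected, atol, rtol):
    """Elementwise check: |actual - expected| <= atol + rtol * |expected|."""
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))


expected = [0.5, -1.25]     # made-up reference activations
actual = [0.5028, -1.2497]  # made-up outputs; first element is off by 2.8e-03

print(within_tolerance(actual, expected, atol=3e-3, rtol=1e-3))  # True
print(within_tolerance(actual, expected, atol=1e-5, rtol=1e-5))  # False
```

The relative term gives large-magnitude activations a little extra slack, which is why an error just under `atol` still passes comfortably here.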


Files changed

src/olmo_core/nn/vision/__init__.py        (new)
src/olmo_core/nn/vision/config.py          (new)  VisionBackboneConfig + factory methods
src/olmo_core/nn/vision/image_vit.py       (new)  VisionTransformer, SiglipVisionTransformer
src/olmo_core/nn/vision/connector.py       (new)  VisionConnector
src/olmo_core/nn/vision/multimodal.py      (new)  MultimodalTransformer, MultimodalTransformerConfig

src/test/nn/vision/__init__.py             (new)
src/test/nn/vision/config_test.py          (new)
src/test/nn/vision/image_vit_test.py       (new)
src/test/nn/vision/connector_test.py       (new)
src/test/nn/vision/multimodal_test.py      (new)
src/test/nn/vision/parity_test.py          (new)  CLIP + SigLIP + SigLIP2 HF parity

jason718 changed the title from "[1/7] Add vision transformer, connector, and MultimodalTransformer" to "[1/n] Add vision transformer, connector, and MultimodalTransformer" on May 27, 2026
jason718 force-pushed the pr/vlm-1-vision-architecture branch from f3361b9 to 34ba65d on May 27, 2026 06:20
@jason718
Author

CI failure analysis

All 3 failing CI jobs are due to the PR coming from an external fork (jason718/OLMo-core), which does not receive GitHub Actions secrets. None are caused by the code changes in this PR.

Test (CPU)

5 tests fail — all with credential errors, not logic errors:

| Test | Error |
| --- | --- |
| test/data/mixes_test.py::test_olmoe_mix | S3 403 Forbidden (no AWS creds) |
| test/data/mixes_test.py::test_dolma17_mix | S3 403 Forbidden (no AWS creds) |
| test/data/mixes_test.py::test_v3_small_ppl_validation_mix | S3 403 Forbidden (no AWS creds) |
| test/io_test.py::test_s3_functionality | S3 403 Forbidden (no AWS creds) |
| test/launch/beaker_test.py::test_get_beaker_client_caching | BeakerConfigurationError: token is empty |

These tests all pass on PRs from within the main repo (e.g. #681) where secrets are available.

Test checkpoint

Fails with S3 AccessDenied on checkpoint read — same root cause (no AWS creds on fork).

Test olmo3 ladder

Fails with empty Beaker token — same root cause.

These failures are pre-existing on all external fork PRs and do not reflect on the correctness of the vision architecture code being introduced here.

VLM/multimodal vision-language architecture stack:

- VisionBackbone (OpenAI CLIP / SigLIP / SigLIP2): OpenAI-style ViT encoder
  with configurable image size, patch size, embedding dim, and attention heads.
  Supports CLIP (openai), SigLIP (siglip), and SigLIP2 (siglip2) initialisation.
- VisionConnector: attention-pooling (2×2) + SwiGLU MLP projector that
  maps vision embeddings to the language-model hidden dimension.
- MultimodalTransformer: composite model that fuses image patch tokens into
  the LM token stream at image-patch positions, then runs the full LM
  forward pass.
- Removed DINOv2 backbone variants (not used in Molmo2).
- HF parity tests for CLIP, SigLIP, and SigLIP2 vision encoders.
jason718 force-pushed the pr/vlm-1-vision-architecture branch from 34ba65d to c7b2ebb on May 28, 2026 18:08
Contributor

Copilot AI left a comment


Pull request overview

Adds the first slice of multimodal/VLM support to olmo_core.nn by introducing a vision backbone (ViT variants), a vision→LM connector, and a composite multimodal model that splices projected image features into the LM embedding stream. This lays the groundwork for subsequent PRs to build full multimodal training/inference workflows.

Changes:

  • Added new vision modules: VisionTransformer / SiglipVisionTransformer, VisionConnector, and MultimodalTransformer with corresponding config objects.
  • Extended the LM Transformer.forward() API to optionally accept precomputed input_embeddings (for multimodal fusion).
  • Added unit tests and HF parity tests for CLIP/SigLIP/SigLIP2 equivalence (skipping when checkpoints/deps aren’t available).

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/olmo_core/nn/vision/__init__.py | Exposes vision and multimodal public API from the new vision package. |
| src/olmo_core/nn/vision/config.py | Adds VisionBackboneConfig / VisionBackboneType and factory methods for CLIP/SigLIP/SigLIP2 variants. |
| src/olmo_core/nn/vision/image_vit.py | Implements CLIP-style and SigLIP-style ViT encoders returning per-layer hidden states. |
| src/olmo_core/nn/vision/connector.py | Implements pooling + projection connector from vision features to LM d_model. |
| src/olmo_core/nn/vision/multimodal.py | Implements composite multimodal model that injects pooled image features into LM embeddings. |
| src/olmo_core/nn/transformer/model.py | Adds input_embeddings support to Transformer.forward() for multimodal embedding splice. |
| src/olmo_core/nn/__init__.py | Re-exports vision/multimodal modules at the olmo_core.nn package level. |
| src/test/nn/vision/__init__.py | Adds test package initializer for vision tests. |
| src/test/nn/vision/config_test.py | Tests vision config factories, build dispatch, and basic invariants. |
| src/test/nn/vision/image_vit_test.py | Tests ViT forward shapes, determinism, CLS presence, and pos-emb interpolation behavior. |
| src/test/nn/vision/connector_test.py | Tests connector pooling/projector variants, padding mask behavior, and multi-layer inputs. |
| src/test/nn/vision/multimodal_test.py | Tests multimodal forward in text-only and image-splice modes, plus multi-crop and meta device. |
| src/test/nn/vision/parity_test.py | HF numerical parity tests for CLIP/SigLIP/SigLIP2 vision encoders (skip when unavailable). |
| CHANGELOG.md | Documents the addition of the new multimodal/vision modules. |


Comment thread src/olmo_core/nn/vision/image_vit.py Outdated
Comment thread src/olmo_core/nn/transformer/model.py
Comment thread src/olmo_core/nn/vision/multimodal.py Outdated
Comment thread src/test/nn/vision/parity_test.py
Comment thread src/olmo_core/nn/vision/config.py
Collaborator

@undfined undfined left a comment


Looks solid! I like the layout and tests, thank you. A few suggestions from me + let's check the copilot suggestions (a couple overlap). @AkshitaB would love to get your input before we merge.

Comment thread src/olmo_core/nn/vision/image_vit.py Outdated
Comment thread src/olmo_core/nn/vision/image_vit.py Outdated
Comment thread src/olmo_core/nn/vision/multimodal.py Outdated
Comment thread src/olmo_core/nn/transformer/model.py
Comment thread src/olmo_core/nn/vision/config.py Outdated
jason718 added a commit to jason718/OLMo-core that referenced this pull request Jun 2, 2026
- image_vit.py: drop the `transformers` dependency. The needed activations
  (`quick_gelu`, `gelu_pytorch_tanh`, `gelu`) are implemented with plain
  PyTorch ops (verified bit-exact vs transformers); unknown names raise.
- transformer/model.py: raise early when `input_embeddings` is used with
  context parallelism, which would misalign the (unsharded) embeddings
  against sharded input_ids/labels/RoPE.
- multimodal.py: make the image-feature splice robust to a non-contiguous
  `h` by calling `.contiguous()` before the masked view-assignment.
- config.py: use `Self` for factory return types; clarify the
  `image_num_layers`=23 docstring (full CLIP tower is 24; final block
  unused when reading from layer -2).
- Convert relative imports to absolute across the vision modules and drop
  external-project references from docstrings.
- parity_test.py: try `local_files_only=True` first so cached checkpoints
  don't trigger a network download.
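The three activations named in the first bullet have standard closed forms, sketched here on scalars with only the standard library. This is an illustration of the formulas and of the "unknown names raise" behaviour, not the PR's PyTorch code; `get_activation` and the choice of `ValueError` are hypothetical.

```python
import math


def quick_gelu(x: float) -> float:
    # Sigmoid approximation of GELU: x * sigmoid(1.702 * x).
    return x / (1.0 + math.exp(-1.702 * x))


def gelu_pytorch_tanh(x: float) -> float:
    # Tanh approximation of GELU.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


_ACTIVATIONS = {
    "quick_gelu": quick_gelu,
    "gelu_pytorch_tanh": gelu_pytorch_tanh,
    "gelu": gelu,
}


def get_activation(name: str):
    """Look up an activation by name; unknown names raise instead of falling back."""
    try:
        return _ACTIVATIONS[name]
    except KeyError:
        raise ValueError(f"Unknown activation: {name!r}") from None


for name in _ACTIVATIONS:
    print(name, get_activation(name)(1.0))
```

Raising on unknown names, rather than defaulting to one activation, keeps a misspelled config value from silently changing model outputs.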
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 4 comments.

Comment thread src/olmo_core/nn/vision/multimodal.py
Comment thread src/olmo_core/nn/vision/config.py
Comment thread src/test/nn/vision/parity_test.py
Comment thread src/olmo_core/nn/vision/image_vit.py
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Labels: enhancement (New feature or request)
