[1/n] Add vision transformer, connector, and MultimodalTransformer by jason718 · Pull Request #692 · allenai/OLMo-core

jason718 · 2026-05-27T06:11:52Z

Summary

This is the first in a series of PRs adding VLM/multimodal support to OLMo-core.

What's in this PR

Three new modules under src/olmo_core/nn/vision/:

`VisionBackbone` (`image_vit.py`, `config.py`)

OpenAI-style Vision Transformer encoder supporting two architecture families (CLIP-style and SigLIP-style) across three backbone types:

| Type | HF checkpoint |
| --- | --- |
| `openai` (CLIP) | `openai/clip-vit-large-patch14-336` |
| `siglip` | `google/siglip-so400m-patch14-384` |
| `siglip2` | `google/siglip2-so400m-patch14-384` |

Key config knobs: image_default_input_size, image_patch_size, image_emb_dim, image_num_heads, image_num_layers, image_num_pos. Factory class-methods cover all standard Molmo2 variants.
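As a rough illustration of how these knobs relate, the sketch below derives the patch grid and position count from the input size and patch size. `VisionGeometry` and `has_cls_token` are hypothetical names, and treating `image_default_input_size` as a `(height, width)` tuple is an assumption; only the knob names themselves come from this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisionGeometry:
    """Hypothetical helper; field names mirror the config knobs listed above."""

    image_default_input_size: tuple[int, int]  # assumed (height, width) in pixels
    image_patch_size: int                      # side length of a square patch
    has_cls_token: bool                        # assumption: CLIP-style towers prepend a CLS token

    @property
    def patch_grid(self) -> tuple[int, int]:
        h, w = self.image_default_input_size
        # Floor division: a trailing strip narrower than one patch is dropped.
        return h // self.image_patch_size, w // self.image_patch_size

    @property
    def num_pos(self) -> int:
        rows, cols = self.patch_grid
        return rows * cols + (1 if self.has_cls_token else 0)


clip = VisionGeometry((336, 336), 14, has_cls_token=True)
siglip = VisionGeometry((384, 384), 14, has_cls_token=False)
print(clip.patch_grid, clip.num_pos)      # (24, 24) 577
print(siglip.patch_grid, siglip.num_pos)  # (27, 27) 729
```

The two results match the position-embedding sizes of the public CLIP ViT-L/14-336 and SigLIP so400m-patch14-384 checkpoints, which is what `image_num_pos` would be expected to hold for those variants.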

`VisionConnector` (`connector.py`)

Maps variable-length vision embeddings to the LM hidden dimension:

- 2×2 attention pooling over the patch grid
- SwiGLU MLP projector → output_dim = lm.d_model

Constructed via VisionConnectorConfig.from_vision_backbone(vis_cfg, output_dim, mlp_hidden_size).
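To make the two stages concrete, here is a minimal PyTorch sketch of 2×2 attention pooling followed by a SwiGLU projector. `TinyConnector` is illustrative and is not the PR's `VisionConnector`: using the window mean as the attention query, requiring an even grid, and the layer names `w1`/`w2`/`w3` are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyConnector(nn.Module):
    """Illustrative only; not the PR's VisionConnector."""

    def __init__(self, vision_dim: int, num_heads: int, mlp_hidden_size: int, output_dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        # SwiGLU projector: w2(silu(w1 x) * w3 x)
        self.w1 = nn.Linear(vision_dim, mlp_hidden_size, bias=False)
        self.w3 = nn.Linear(vision_dim, mlp_hidden_size, bias=False)
        self.w2 = nn.Linear(mlp_hidden_size, output_dim, bias=False)

    def forward(self, feats: torch.Tensor, grid: tuple[int, int]) -> torch.Tensor:
        # feats: (batch, rows * cols, vision_dim); rows and cols are assumed even here.
        b, _, d = feats.shape
        rows, cols = grid
        x = feats.view(b, rows // 2, 2, cols // 2, 2, d)
        # Gather each 2x2 window into its own length-4 sequence.
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, 4, d)
        # Assumption: the window mean acts as the single attention query.
        query = x.mean(dim=1, keepdim=True)
        pooled, _ = self.attn(query, x, x, need_weights=False)
        pooled = pooled.reshape(b, (rows // 2) * (cols // 2), d)
        return self.w2(F.silu(self.w1(pooled)) * self.w3(pooled))


connector = TinyConnector(vision_dim=64, num_heads=4, mlp_hidden_size=128, output_dim=96)
patch_feats = torch.randn(2, 24 * 24, 64)
out = connector(patch_feats, grid=(24, 24))
print(out.shape)  # torch.Size([2, 144, 96])
```

Pooling 2×2 windows cuts the token count by 4×, so a 24×24 grid yields 144 tokens per crop at the LM width. An odd grid such as 27×27 would need padding plus a mask, which this sketch leaves out.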

`MultimodalTransformer` (`multimodal.py`)

Composite model holding lm: Transformer, vision: VisionBackbone, and connector: VisionConnector. Forward pass:

1. Run VisionBackbone on each crop's patch tensor.
2. Project patch embeddings to LM width via VisionConnector.
3. Splice patch tokens into the text embedding stream at image_patch_token_id positions.
4. Run the standard LM forward pass on the fused sequence.
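The splice in step 3 can be sketched with a boolean mask over `input_ids`. This is a minimal illustration under stated assumptions, not the code in `multimodal.py`: `splice_image_features` and the id value `7` are hypothetical, and the sketch overwrites the placeholder embeddings, whereas the PR text here does not say whether features are written over or added to them.

```python
import torch

IMAGE_PATCH_TOKEN_ID = 7  # hypothetical id; the real value comes from the tokenizer


def splice_image_features(input_ids, text_embeds, image_feats,
                          image_patch_token_id=IMAGE_PATCH_TOKEN_ID):
    """Overwrite placeholder positions in text_embeds with image features.

    input_ids:   (batch, seq) integer token ids
    text_embeds: (batch, seq, d_model) embeddings looked up from input_ids
    image_feats: (num_image_tokens, d_model) connector output, in reading order
    """
    mask = input_ids == image_patch_token_id
    n_slots = int(mask.sum())
    if n_slots != image_feats.shape[0]:
        raise ValueError(f"{n_slots} placeholder tokens but {image_feats.shape[0]} image features")
    fused = text_embeds.clone()  # leave the caller's tensor untouched
    # Boolean-mask assignment fills True positions in row-major order.
    fused[mask] = image_feats.to(fused.dtype)
    return fused


input_ids = torch.tensor([[1, 7, 7, 2], [3, 4, 7, 5]])
text_embeds = torch.zeros(2, 4, 3)
image_feats = torch.arange(9, dtype=torch.float32).reshape(3, 3)
fused = splice_image_features(input_ids, text_embeds, image_feats)
print(fused[0, 1].tolist(), fused[1, 2].tolist())  # [0.0, 1.0, 2.0] [6.0, 7.0, 8.0]
```

Checking the placeholder count against the feature count up front turns a silent misalignment between the tokenized prompt and the image crops into an immediate error.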

HF Parity Tests

src/test/nn/vision/parity_test.py loads real checkpoints and asserts numerical equivalence between our implementation and HuggingFace's reference models. Tests skip automatically when checkpoints are not cached.
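One way to get the skip-when-uncached behaviour is to ask the loader for local files only and treat a cache miss as "unavailable". The sketch below is an assumption about how that could look, not the code in `parity_test.py`: `load_cached_or_none` and `fake_loader` are hypothetical, and the loader is injected so the example runs offline. `local_files_only` is a real `from_pretrained` argument in Hugging Face `transformers`.

```python
def load_cached_or_none(loader, checkpoint: str):
    """Return the loaded model, or None when the checkpoint is not in the local cache.

    `loader` is any callable shaped like a Hugging Face `from_pretrained`
    that accepts `local_files_only`.
    """
    try:
        return loader(checkpoint, local_files_only=True)
    except OSError:
        # Assumption: a local cache miss surfaces as OSError, as it typically does in transformers.
        return None


def fake_loader(name, local_files_only=False):
    # Stand-in for a real from_pretrained: only one checkpoint is "cached".
    if name != "openai/clip-vit-large-patch14-336":
        raise OSError(f"{name} not found in local cache")
    return {"name": name}


print(load_cached_or_none(fake_loader, "openai/clip-vit-large-patch14-336"))
print(load_cached_or_none(fake_loader, "google/siglip-so400m-patch14-384"))  # None
```

A parity test could then call `pytest.skip(...)` whenever the helper returns `None`, so machines without the cached weights never attempt a download.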

Measured max absolute errors on CPU in float32 (24–27 transformer layers):

| Checkpoint | Max abs error | Mean abs error | Tolerance used |
| --- | --- | --- | --- |
| `openai/clip-vit-large-patch14-336` | 4.9e-04 | 3.1e-06 | `atol=3e-3, rtol=1e-3` |
| `google/siglip-so400m-patch14-384` | 2.8e-03 | 7.8e-06 | `atol=3e-3, rtol=1e-3` |
| `google/siglip2-so400m-patch14-384` | 3.1e-04 | 3.0e-06 | `atol=3e-3, rtol=1e-3` |

These errors reflect ordinary float32 accumulation differences across 24–27 attention/MLP layers computed by independent kernel implementations, not precision loss in the port. They are roughly 10–100× tighter than fp16 parity would require.
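For reference, the tolerance rule behind `atol`/`rtol` is the same one `torch.allclose` applies: `|actual - expected| <= atol + rtol * |expected|`. The pure-Python sketch below restates it and compares the measured max errors from the table with the absolute bound; `within_tolerance` is a hypothetical helper, not a function from the test suite.

```python
ATOL, RTOL = 3e-3, 1e-3


def within_tolerance(actual, expected, atol=ATOL, rtol=RTOL):
    """Element-wise |actual - expected| <= atol + rtol * |expected|."""
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))


# Max absolute errors copied from the table above.
MEASURED_MAX_ABS = {
    "openai/clip-vit-large-patch14-336": 4.9e-04,
    "google/siglip-so400m-patch14-384": 2.8e-03,
    "google/siglip2-so400m-patch14-384": 3.1e-04,
}

for name, err in MEASURED_MAX_ABS.items():
    print(f"{name}: atol / max_abs_error = {ATOL / err:.1f}")
# openai/clip-vit-large-patch14-336: atol / max_abs_error = 6.1
# google/siglip-so400m-patch14-384: atol / max_abs_error = 1.1
# google/siglip2-so400m-patch14-384: atol / max_abs_error = 9.7
```

The SigLIP checkpoint sits closest to the absolute bound, so it is the case most likely to need attention if kernels or hardware change.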

Files changed

src/olmo_core/nn/vision/__init__.py        (new)
src/olmo_core/nn/vision/config.py          (new)  VisionBackboneConfig + factory methods
src/olmo_core/nn/vision/image_vit.py       (new)  VisionTransformer, SiglipVisionTransformer
src/olmo_core/nn/vision/connector.py       (new)  VisionConnector
src/olmo_core/nn/vision/multimodal.py      (new)  MultimodalTransformer, MultimodalTransformerConfig

src/test/nn/vision/__init__.py             (new)
src/test/nn/vision/config_test.py          (new)
src/test/nn/vision/image_vit_test.py       (new)
src/test/nn/vision/connector_test.py       (new)
src/test/nn/vision/multimodal_test.py      (new)
src/test/nn/vision/parity_test.py          (new)  CLIP + SigLIP + SigLIP2 HF parity

jason718 · 2026-05-27T06:36:58Z

CI failure analysis

All 3 failing CI jobs are due to the PR coming from an external fork (jason718/OLMo-core), which does not receive GitHub Actions secrets. None are caused by the code changes in this PR.

Test (CPU)

5 tests fail — all with credential errors, not logic errors:

| Test | Error |
| --- | --- |
| `test/data/mixes_test.py::test_olmoe_mix` | S3 `403 Forbidden` (no AWS creds) |
| `test/data/mixes_test.py::test_dolma17_mix` | S3 `403 Forbidden` (no AWS creds) |
| `test/data/mixes_test.py::test_v3_small_ppl_validation_mix` | S3 `403 Forbidden` (no AWS creds) |
| `test/io_test.py::test_s3_functionality` | S3 `403 Forbidden` (no AWS creds) |
| `test/launch/beaker_test.py::test_get_beaker_client_caching` | `BeakerConfigurationError: token is empty` |

These tests all pass on PRs from within the main repo (e.g. #681) where secrets are available.

Test checkpoint

Fails with S3 AccessDenied on checkpoint read — same root cause (no AWS creds on fork).

Test olmo3 ladder

Fails with empty Beaker token — same root cause.

These failures are pre-existing on all external fork PRs and do not reflect on the correctness of the vision architecture code being introduced here.

VLM/multimodal vision-language architecture stack:

- VisionBackbone (OpenAI CLIP / SigLIP / SigLIP2): OpenAI-style ViT encoder with configurable image size, patch size, embedding dim, and attention heads. Supports CLIP (openai), SigLIP (siglip), and SigLIP2 (siglip2) initialisation.
- VisionConnector: attention-pooling (2×2) + SwiGLU MLP projector that maps vision embeddings to the language-model hidden dimension.
- MultimodalTransformer: composite model that fuses image patch tokens into the LM token stream at image-patch positions, then runs the full LM forward pass.
- Removed DINOv2 backbone variants (not used in Molmo2).
- HF parity tests for CLIP, SigLIP, and SigLIP2 vision encoders.

Copilot

Pull request overview

Adds the first slice of multimodal/VLM support to olmo_core.nn by introducing a vision backbone (ViT variants), a vision→LM connector, and a composite multimodal model that splices projected image features into the LM embedding stream. This lays the groundwork for subsequent PRs to build full multimodal training/inference workflows.

Changes:

- Added new vision modules: VisionTransformer / SiglipVisionTransformer, VisionConnector, and MultimodalTransformer with corresponding config objects.
- Extended the LM Transformer.forward() API to optionally accept precomputed input_embeddings (for multimodal fusion).
- Added unit tests and HF parity tests for CLIP/SigLIP/SigLIP2 equivalence (skipping when checkpoints/deps aren’t available).
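The optional `input_embeddings` path can be pictured with a toy model that bypasses its own embedding lookup when embeddings are supplied. `TinyLM` below is a stand-in with no transformer blocks, not olmo_core's `Transformer`; the shape check and the rule that `input_ids` is still passed alongside the embeddings are assumptions made for the sketch.

```python
import torch
import torch.nn as nn


class TinyLM(nn.Module):
    """Stand-in for an LM (embedding + head only); not olmo_core's Transformer."""

    def __init__(self, vocab_size: int = 16, d_model: int = 8):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, input_ids, input_embeddings=None):
        if input_embeddings is None:
            # Text-only path: look up embeddings from token ids.
            h = self.embeddings(input_ids)
        else:
            # Multimodal path: the caller already fused image features into h.
            if input_embeddings.shape[:2] != input_ids.shape:
                raise ValueError("input_embeddings must match input_ids in (batch, seq)")
            h = input_embeddings
        return self.lm_head(h)


lm = TinyLM()
ids = torch.tensor([[1, 2, 3]])
logits_from_ids = lm(ids)
logits_from_embeds = lm(ids, input_embeddings=lm.embeddings(ids))
print(torch.equal(logits_from_ids, logits_from_embeds))  # True
```

Passing the model's own embeddings back in reproduces the text-only logits exactly, which is a convenient sanity check for any fused-embedding path.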

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 5 comments.


| File | Description |
| --- | --- |
| `src/olmo_core/nn/vision/__init__.py` | Exposes vision and multimodal public API from the new `vision` package. |
| `src/olmo_core/nn/vision/config.py` | Adds `VisionBackboneConfig` / `VisionBackboneType` and factory methods for CLIP/SigLIP/SigLIP2 variants. |
| `src/olmo_core/nn/vision/image_vit.py` | Implements CLIP-style and SigLIP-style ViT encoders returning per-layer hidden states. |
| `src/olmo_core/nn/vision/connector.py` | Implements pooling + projection connector from vision features to LM `d_model`. |
| `src/olmo_core/nn/vision/multimodal.py` | Implements composite multimodal model that injects pooled image features into LM embeddings. |
| `src/olmo_core/nn/transformer/model.py` | Adds `input_embeddings` support to `Transformer.forward()` for multimodal embedding splice. |
| `src/olmo_core/nn/__init__.py` | Re-exports vision/multimodal modules at the `olmo_core.nn` package level. |
| `src/test/nn/vision/__init__.py` | Adds test package initializer for vision tests. |
| `src/test/nn/vision/config_test.py` | Tests vision config factories, build dispatch, and basic invariants. |
| `src/test/nn/vision/image_vit_test.py` | Tests ViT forward shapes, determinism, CLS presence, and pos-emb interpolation behavior. |
| `src/test/nn/vision/connector_test.py` | Tests connector pooling/projector variants, padding mask behavior, and multi-layer inputs. |
| `src/test/nn/vision/multimodal_test.py` | Tests multimodal forward in text-only and image-splice modes, plus multi-crop and meta device. |
| `src/test/nn/vision/parity_test.py` | HF numerical parity tests for CLIP/SigLIP/SigLIP2 vision encoders (skip when unavailable). |
| `CHANGELOG.md` | Documents the addition of the new multimodal/vision modules. |


undfined

Looks solid! I like the layout and tests, thank you. A few suggestions from me + let's check the copilot suggestions (a couple overlap). @AkshitaB would love to get your input before we merge.

- image_vit.py: drop the `transformers` dependency. The needed activations (`quick_gelu`, `gelu_pytorch_tanh`, `gelu`) are implemented with plain PyTorch ops (verified bit-exact vs transformers); unknown names raise.
- transformer/model.py: raise early when `input_embeddings` is used with context parallelism, which would misalign the (unsharded) embeddings against sharded input_ids/labels/RoPE.
- multimodal.py: make the image-feature splice robust to a non-contiguous `h` by calling `.contiguous()` before the masked view-assignment.
- config.py: use `Self` for factory return types; clarify the `image_num_layers`=23 docstring (full CLIP tower is 24; final block unused when reading from layer -2).
- Convert relative imports to absolute across the vision modules and drop external-project references from docstrings.
- parity_test.py: try `local_files_only=True` first so cached checkpoints don't trigger a network download.
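The activation change in the first bullet might look roughly like the following. This is a sketch written from the standard formulas rather than a copy of `image_vit.py`: the function names, the `_ACTIVATIONS` table, and `get_activation` are hypothetical, and the sketch only checks closeness to PyTorch's built-in GELU, not bit-exactness.

```python
import math

import torch
import torch.nn.functional as F


def quick_gelu(x: torch.Tensor) -> torch.Tensor:
    # Sigmoid approximation of GELU used by the original CLIP models.
    return x * torch.sigmoid(1.702 * x)


def gelu_tanh(x: torch.Tensor) -> torch.Tensor:
    # Tanh approximation of GELU.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x.pow(3))
    return 0.5 * x * (1.0 + torch.tanh(inner))


def gelu(x: torch.Tensor) -> torch.Tensor:
    # Exact GELU via the error function.
    return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))


_ACTIVATIONS = {"quick_gelu": quick_gelu, "gelu_pytorch_tanh": gelu_tanh, "gelu": gelu}


def get_activation(name: str):
    try:
        return _ACTIVATIONS[name]
    except KeyError:
        raise ValueError(f"unknown activation {name!r}; expected one of {sorted(_ACTIVATIONS)}") from None


x = torch.linspace(-3.0, 3.0, steps=13)
print(torch.allclose(get_activation("gelu")(x), F.gelu(x), atol=1e-5))
print(torch.allclose(get_activation("gelu_pytorch_tanh")(x), F.gelu(x, approximate="tanh"), atol=1e-5))
```

Raising on unknown names keeps a misspelled config value from silently falling back to a different nonlinearity, which would break parity without any visible error.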

Copilot

Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 4 comments.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

jason718 changed the title ~~[1/7] Add vision transformer, connector, and MultimodalTransformer~~ [1/n] Add vision transformer, connector, and MultimodalTransformer May 27, 2026

jason718 force-pushed the pr/vlm-1-vision-architecture branch from f3361b9 to 34ba65d May 27, 2026 06:20

jason718 requested a review from Copilot May 27, 2026 06:37

Copilot started reviewing on behalf of jason718 May 27, 2026 06:37

jason718 requested review from AkshitaB, chrisc36, no0p, undfined and uwGZQ and removed request for Copilot May 27, 2026 06:38

jason718 self-assigned this May 27, 2026

jason718 added the enhancement New feature or request label May 27, 2026

This was referenced May 27, 2026

[2/n] HF Molmo2 loader and logit parity tests #693

Open

[3/n] Multimodal eval: image preprocessing, inference engine, and benchmark scripts #695

Draft

[4/n] Multimodal training data pipeline and train module #697

Draft

jason718 force-pushed the pr/vlm-1-vision-architecture branch from 34ba65d to c7b2ebb May 28, 2026 18:08

Merge branch 'main' into pr/vlm-1-vision-architecture

822741b

undfined requested a review from Copilot June 1, 2026 16:50

Copilot started reviewing on behalf of undfined June 1, 2026 16:50

Copilot AI reviewed Jun 1, 2026


Comment thread src/olmo_core/nn/vision/image_vit.py Outdated

Comment thread src/olmo_core/nn/transformer/model.py

Comment thread src/olmo_core/nn/vision/multimodal.py Outdated

Comment thread src/test/nn/vision/parity_test.py

Comment thread src/olmo_core/nn/vision/config.py

undfined requested changes Jun 1, 2026


Comment thread src/olmo_core/nn/vision/image_vit.py Outdated

Comment thread src/olmo_core/nn/vision/image_vit.py Outdated

Comment thread src/olmo_core/nn/vision/multimodal.py Outdated

Comment thread src/olmo_core/nn/transformer/model.py

Comment thread src/olmo_core/nn/vision/config.py Outdated

jason718 requested review from Copilot and undfined June 2, 2026 21:52

Copilot started reviewing on behalf of jason718 June 2, 2026 21:52

Copilot AI reviewed Jun 2, 2026


Comment thread src/olmo_core/nn/vision/multimodal.py

Comment thread src/olmo_core/nn/vision/config.py

Comment thread src/test/nn/vision/parity_test.py

Comment thread src/olmo_core/nn/vision/image_vit.py

Potential fix for pull request finding

f940592

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

jason718 wants to merge 4 commits into allenai:main from jason718:pr/vlm-1-vision-architecture


4 participants
