Commit c7b2ebb
Add vision transformer, connector, and MultimodalTransformer
Adds the vision-language model (VLM) architecture stack:
- VisionBackbone (OpenAI CLIP / SigLIP / SigLIP2): OpenAI-style ViT encoder
with configurable image size, patch size, embedding dim, and attention heads.
Supports CLIP (openai), SigLIP (siglip), and SigLIP2 (siglip2) initialisation.
- VisionConnector: attention-pooling (2×2) + SwiGLU MLP projector that
maps vision embeddings to the language-model hidden dimension.
- MultimodalTransformer: composite model that fuses image patch tokens into
the LM token stream at image-patch positions, then runs the full LM
forward pass.
- Removed DINOv2 backbone variants (not used in Molmo2).
- HF parity tests for CLIP, SigLIP, and SigLIP2 vision encoders.
1 parent 754d58d commit c7b2ebb
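The connector and fusion steps described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `ToyConnector`, `fuse_image_tokens`, `IMAGE_PATCH_ID`, and all dimensions are hypothetical names and values chosen for the example. It assumes the 2×2 attention pooling uses the mean of each window as the query, that every sequence contains exactly one placeholder token per pooled image feature, and that image features replace the placeholder embeddings (the actual implementation may add them instead).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyConnector(nn.Module):
    """Hypothetical connector: 2x2 attention pooling, then a SwiGLU projector."""

    def __init__(self, vision_dim: int, lm_dim: int, hidden_dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        self.w_gate = nn.Linear(vision_dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(vision_dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, lm_dim, bias=False)

    def forward(self, patches: torch.Tensor, grid: int) -> torch.Tensor:
        # patches: (batch, grid * grid, vision_dim), row-major over the patch grid.
        b, n, d = patches.shape
        assert n == grid * grid and grid % 2 == 0
        g = grid // 2
        # Group every 2x2 neighbourhood of patches: (b * g * g, 4, d).
        x = patches.reshape(b, g, 2, g, 2, d).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(b * g * g, 4, d)
        # Assumption: the window mean is the query; the four patches are keys/values.
        query = x.mean(dim=1, keepdim=True)
        pooled, _ = self.attn(query, x, x, need_weights=False)
        pooled = pooled.reshape(b, g * g, d)
        # SwiGLU: silu(gate) * up, then project to the LM hidden dimension.
        return self.w_down(F.silu(self.w_gate(pooled)) * self.w_up(pooled))


def fuse_image_tokens(
    token_embeds: torch.Tensor,  # (batch, seq, lm_dim)
    image_embeds: torch.Tensor,  # (batch, num_image_tokens, lm_dim)
    input_ids: torch.Tensor,     # (batch, seq)
    image_patch_id: int,
) -> torch.Tensor:
    """Write image features into the positions holding the placeholder token id."""
    mask = input_ids == image_patch_id
    assert int(mask.sum()) == image_embeds.shape[0] * image_embeds.shape[1]
    fused = token_embeds.clone()
    # Boolean-mask assignment fills positions in row-major (batch, then sequence) order.
    fused[mask] = image_embeds.reshape(-1, image_embeds.shape[-1]).to(fused.dtype)
    return fused


torch.manual_seed(0)
IMAGE_PATCH_ID = 7  # hypothetical placeholder token id
connector = ToyConnector(vision_dim=32, lm_dim=48, hidden_dim=64)
patches = torch.randn(2, 16, 32)           # two images, 4x4 patch grid each
image_embeds = connector(patches, grid=4)  # pooled to 2x2 = 4 tokens per image
input_ids = torch.tensor([[1, 7, 7, 7, 7, 2], [7, 7, 7, 7, 3, 4]])
token_embeds = torch.randn(2, 6, 48)       # stand-in for the LM embedding lookup
fused = fuse_image_tokens(token_embeds, image_embeds, input_ids, IMAGE_PATCH_ID)
```

The fused tensor would then be passed through the language model's transformer blocks in place of the plain token embeddings. Pooling 2×2 windows cuts the number of image tokens by a factor of four before they enter the LM sequence.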
14 files changed
Lines changed: 2428 additions & 5 deletions
File tree
- src
  - olmo_core/nn
    - transformer
    - vision
  - test/nn/vision