[4/n] Multimodal training data pipeline and train module by jason718 · Pull Request #697 · allenai/OLMo-core

jason718 · 2026-05-27T23:18:06Z

Context

Fourth PR in the VLM stack building Molmo2-style multimodal support.

[1/n] Vision architecture ([1/n] Add vision transformer, connector, and MultimodalTransformer #692)
[2/n] HF Molmo2 loader ([2/n] HF Molmo2 loader and logit parity tests #693)
[3/n] Eval infrastructure ([3/n] Multimodal eval: image preprocessing, inference engine, and benchmark scripts #695)
[4/n] This PR — training data pipeline + train module

What this adds

Training data pipeline (`olmo_core/data/multimodal/`)

| File | Purpose |
| --- | --- |
| `pixmo_cap.py` | `PixmoCapDataset`: PixMo-Cap (`allenai/pixmo-cap`) caption-pretraining source, selectable via `source=`. `hub` streams the dataset from the HuggingFace Hub and downloads each image from its `image_url` (skipping dead links); `local` reads a pre-downloaded on-disk copy (`$MOLMO_DATA_DIR/.../cap`), the path for large-scale AI2 runs. Yields `(prompt, response, image)` triples. |
| `preprocessor.py` | `MultimodalPreprocessor`: combines `MultiCropPreprocessor` + tokenizer into the per-example dict (`input_tokens`, `loss_masks`, `images`, `pooled_patches_idx`) |
| `collator.py` | `MultimodalCollator`: stacks per-example dicts into a batch; pads variable `n_crops`/`n_pooled` with `-1` rows + dummy `<im_patch>` tokens so the model's patch-count contract holds |
| `data_loader.py` | `MultimodalDataLoader`: wraps the source into a `DataLoaderBase` the `Trainer` consumes; rank + worker sharding |
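
To make the collator's padding concrete, here is a minimal pure-Python sketch; it is not the code in `collator.py`. It assumes one reading of the "patch-count contract": each example must carry as many `<im_patch>` tokens as it has rows in `pooled_patches_idx`. The token ids, the `collate` helper, and the nested-list layout are all hypothetical, and the real collator works on tensors and also pads `n_crops`.

```python
from typing import Dict, List

PAD_TOKEN_ID = 0       # hypothetical pad id, not taken from the PR
IM_PATCH_TOKEN_ID = 9  # hypothetical id standing in for <im_patch>


def collate(examples: List[Dict[str, list]]) -> Dict[str, list]:
    """Pad ragged examples to a common shape (assumes >= 1 pooled row each)."""
    max_pooled = max(len(ex["pooled_patches_idx"]) for ex in examples)
    pool_width = len(examples[0]["pooled_patches_idx"][0])

    padded = []
    for ex in examples:
        missing = max_pooled - len(ex["pooled_patches_idx"])
        padded.append({
            # Missing pooled rows become rows of -1.
            "pooled_patches_idx": ex["pooled_patches_idx"]
            + [[-1] * pool_width for _ in range(missing)],
            # One dummy <im_patch> token per padded row keeps the counts equal.
            "input_tokens": ex["input_tokens"] + [IM_PATCH_TOKEN_ID] * missing,
            # Dummy tokens never contribute to the loss.
            "loss_masks": ex["loss_masks"] + [0] * missing,
        })

    # Second pass: pad token sequences to the longest one in the batch.
    max_len = max(len(ex["input_tokens"]) for ex in padded)
    batch: Dict[str, list] = {"input_tokens": [], "loss_masks": [], "pooled_patches_idx": []}
    for ex in padded:
        gap = max_len - len(ex["input_tokens"])
        batch["input_tokens"].append(ex["input_tokens"] + [PAD_TOKEN_ID] * gap)
        batch["loss_masks"].append(ex["loss_masks"] + [0] * gap)
        batch["pooled_patches_idx"].append(ex["pooled_patches_idx"])
    return batch


example_a = {"input_tokens": [5, 9, 9, 7], "loss_masks": [0, 0, 0, 1],
             "pooled_patches_idx": [[0, 1], [2, 3]]}
example_b = {"input_tokens": [5, 9, 7, 8], "loss_masks": [0, 0, 1, 1],
             "pooled_patches_idx": [[0, 1]]}
batch = collate([example_a, example_b])
```

In this sketch the second example gains one `-1` row and one dummy patch token, so both examples end up with two pooled rows and two patch tokens.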

PixMo-Cap on the Hub is URL-based: it stores image links, not image bytes (columns `image_url` / `caption` / `transcripts`, 717K rows, train split only). `hub` mode therefore downloads each image from its URL at iteration time, while `local` mode is the network-free path for real training.
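
The download-at-iteration behaviour of `hub` mode can be sketched as a generator that skips rows whose image cannot be fetched. This is an illustration only, not `PixmoCapDataset` itself: `iter_hub_examples`, the injected `fetch_image` callable, and the default prompt string are hypothetical names, and only the `image_url` and `caption` columns come from the description above. The `limit` semantics (count yielded examples, not rows scanned) follow what the test table describes.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional, Tuple


def iter_hub_examples(
    rows: Iterable[Dict[str, str]],
    fetch_image: Callable[[str], bytes],    # hypothetical downloader, injected
    prompt: str = "Describe this image.",   # hypothetical prompt text
    limit: Optional[int] = None,
) -> Iterator[Tuple[str, str, bytes]]:
    """Yield (prompt, response, image) triples, skipping dead image links."""
    yielded = 0
    for row in rows:
        if limit is not None and yielded >= limit:
            return
        try:
            image = fetch_image(row["image_url"])
        except OSError:
            # Dead or unreachable link: drop this row and keep going.
            continue
        yield prompt, row["caption"], image
        yielded += 1


def fake_fetch(url: str) -> bytes:
    """Stand-in for a real HTTP download so the sketch needs no network."""
    if "dead" in url:
        raise OSError("link is dead")
    return b"image-bytes-for-" + url.encode()


rows = [
    {"image_url": "dead-1", "caption": "never seen"},
    {"image_url": "ok-1", "caption": "a cat"},
    {"image_url": "ok-2", "caption": "a dog"},
    {"image_url": "ok-3", "caption": "a bird"},
]
triples = list(iter_hub_examples(rows, fake_fetch, limit=2))
```

Injecting the downloader keeps the iteration logic testable with a mock, which matches the "hub tests are mocked, no network" approach in this PR.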

No synthetic dataset. A SyntheticMultimodalDataset previously stood in as the default source; it's been removed from the shipped library. Pipeline tests use a small test-only fixture (test/data/multimodal/synthetic_source.py); the FSDP test uses an inline tiny PixMo-Cap-shaped sample.

Train module (`olmo_core/train/train_module/multimodal/`)

`MultimodalTransformerTrainModule` reuses `TransformerTrainModule`'s machinery and adds three things: FSDP/DDP wrapping of the composite model, a response-only loss (`loss_masks` → `label_mask`), and threading of `images` / `pooled_patches_idx` into `MultimodalTransformer.forward`. TP/CP/PP/EP are out of scope.
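
As a rough illustration of response-only loss, the sketch below turns per-token `loss_masks` into shifted labels plus a `label_mask`. It assumes the usual next-token shift and uses `-100`, the default `ignore_index` of PyTorch's cross-entropy loss, to mark ignored positions. The `build_labels` helper is hypothetical, and the train module's actual conversion may differ.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_labels(
    input_tokens: List[int], loss_masks: List[int]
) -> Tuple[List[int], List[bool]]:
    """Return (labels, label_mask) for next-token prediction on response tokens only."""
    # Position t predicts token t+1; the last position has nothing to predict.
    shifted_tokens = input_tokens[1:] + [IGNORE_INDEX]
    label_mask = [bool(m) for m in loss_masks[1:]] + [False]
    # Prompt and image positions are replaced by the ignore index.
    labels = [tok if keep else IGNORE_INDEX
              for tok, keep in zip(shifted_tokens, label_mask)]
    return labels, label_mask


# Two prompt tokens followed by two response tokens.
labels, label_mask = build_labels([10, 11, 12, 13], [0, 0, 1, 1])
```

Only the positions whose target is a response token keep a real label, so prompt and image tokens contribute nothing to the loss.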

Model hooks (`nn/vision/multimodal.py`)

Adds the TrainModule interface MultimodalTransformer needs: apply_fsdp, apply_ddp, init_weights, num_flops_per_token, param-count properties, and post_batch/post_optim_step/aux-metrics delegation to the LM.

Tests — 54 pass (CPU + distributed)

| Test file | Tests | Coverage |
| --- | --- | --- |
| `data/multimodal/pixmo_cap_test.py` | 6 | local (skip-gated) + mocked hub (skips dead URLs, limit counts yielded) |
| `data/multimodal/preprocessor_test.py` | 11 | output keys, dtypes, loss masks, multicrop parity |
| `data/multimodal/collator_test.py` | 8 | uniform + variable `n_crops`/`n_pooled`, dummy-patch contract |
| `data/multimodal/data_loader_test.py` | 11 | rank/worker sharding, mock-batch contract, state round-trip |
| `data/multimodal/end_to_end_test.py` | 5 | source → loader → batch → forward |
| `train/multimodal/train_module_test.py` | 9 | loss-mask→label-mask, batch prep, sizing |
| `train/multimodal/fsdp_test.py` | 4 | FSDP/DDP wrapping, 2-rank gloo |

54 pass locally (no GPU required for the CPU set; hub tests are mocked, no network).
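
The rank/worker sharding that `data_loader_test.py` covers can be pictured as strided slicing over example indices. The sketch below is one common scheme and an assumption about the approach, not the logic in `data_loader.py`; `shard_indices` and its parameters are hypothetical.

```python
from typing import List


def shard_indices(
    num_examples: int,
    rank: int,
    world_size: int,
    worker_id: int = 0,
    num_workers: int = 1,
) -> List[int]:
    """Indices seen by one (rank, worker) pair under strided sharding."""
    # Flatten (rank, worker) into one shard id, then stride over the dataset.
    shard_id = rank * num_workers + worker_id
    num_shards = world_size * num_workers
    return list(range(shard_id, num_examples, num_shards))


# 2 ranks x 2 workers over 10 examples: every index lands in exactly one shard.
shards = [
    shard_indices(10, rank, 2, worker, 2)
    for rank in range(2)
    for worker in range(2)
]
```

A property worth checking for any such scheme is that the shards are disjoint and together cover the whole dataset.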

…odule

Training infrastructure for Molmo2 fine-tuning, stacked on the eval PR.

Training data pipeline (src/olmo_core/data/multimodal/):
- MultimodalPreprocessor: converts (text, images) → (input_ids, loss_masks, image_patches) using the Molmo2 multi-crop layout; wraps MultiCropPreprocessor
- MultimodalCollator: pads ragged token sequences and stacks image tensors into dense batches ready for MultimodalTransformer.forward()
- MultimodalDataLoader: rank-aware DataLoader with worker-safe image decoding
- SyntheticMultimodalDataset: synthetic dataset for integration tests

Train module (src/olmo_core/train/train_module/multimodal/):
- MultimodalTransformerTrainModule: wraps MultimodalTransformer with FSDP/DDP support, gradient checkpointing, and mixed-precision; integrates with the Trainer callback system

jason718 force-pushed the pr/vlm-3-eval-benchmarks branch from d6ed6a7 to cc0cc08 Compare May 27, 2026 23:42

jason718 force-pushed the pr/vlm-4-training branch from 32c2ffe to 65d65f0 Compare May 27, 2026 23:42

jason718 mentioned this pull request May 27, 2026

[3/n] Multimodal eval: image preprocessing, inference engine, and benchmark scripts #695

Draft

jason718 wants to merge 1 commit into pr/vlm-3-eval-benchmarks from pr/vlm-4-training
