
[4/n] Multimodal training data pipeline and train module #697

Draft
jason718 wants to merge 1 commit into pr/vlm-3-eval-benchmarks from pr/vlm-4-training


jason718 commented May 27, 2026

Context

Fourth PR in the VLM stack building Molmo2-style multimodal support.

What this adds

Training data pipeline (olmo_core/data/multimodal/)

| File | Purpose |
| --- | --- |
| pixmo_cap.py | PixmoCapDataset: PixMo-Cap (allenai/pixmo-cap) caption-pretraining source, selectable via source=. hub streams the dataset from the HuggingFace Hub and downloads each image from its image_url (skipping dead links); local reads a pre-downloaded on-disk copy ($MOLMO_DATA_DIR/.../cap), the path for large-scale AI2 runs. Yields (prompt, response, image) triples. |
| preprocessor.py | MultimodalPreprocessor: combines MultiCropPreprocessor + tokenizer into the per-example dict (input_tokens, loss_masks, images, pooled_patches_idx) |
| collator.py | MultimodalCollator: stacks per-example dicts into a batch; pads variable n_crops/n_pooled with -1 rows + dummy <im_patch> tokens so the model's patch-count contract holds |
| data_loader.py | MultimodalDataLoader: wraps the source into a DataLoaderBase the Trainer consumes; rank + worker sharding |
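
The collator's padding rule can be illustrated with a minimal sketch. This is not the shipped MultimodalCollator: plain lists stand in for tensors, image stacking along n_crops is omitted, and the token ids, the placement of dummy tokens at the end of the sequence, and the reading of the "patch-count contract" as "one <im_patch> token per pooled row" are all assumptions made for illustration.

```python
from typing import Dict, List

IM_PATCH_ID = 9  # hypothetical id of the <im_patch> token
PAD_ID = 0       # hypothetical text padding id


def collate(examples: List[Dict[str, list]]) -> Dict[str, list]:
    """Pad per-example dicts to common shapes (lists stand in for tensors)."""
    max_pooled = max(len(e["pooled_patches_idx"]) for e in examples)
    width = max(
        (len(row) for e in examples for row in e["pooled_patches_idx"]),
        default=1,
    )
    # Sequence length once every example has received its dummy tokens.
    max_len = max(
        len(e["input_tokens"]) + max_pooled - len(e["pooled_patches_idx"])
        for e in examples
    )
    batch: Dict[str, list] = {
        "input_tokens": [], "loss_masks": [], "pooled_patches_idx": []
    }
    for e in examples:
        n_missing = max_pooled - len(e["pooled_patches_idx"])
        # Padding rows are all -1, i.e. "nothing to pool here".
        pooled = [list(r) for r in e["pooled_patches_idx"]]
        pooled += [[-1] * width for _ in range(n_missing)]
        # One dummy <im_patch> token per padding row, excluded from the loss,
        # so the number of <im_patch> tokens still equals the number of rows.
        tokens = list(e["input_tokens"]) + [IM_PATCH_ID] * n_missing
        mask = list(e["loss_masks"]) + [0] * n_missing
        # Ordinary text padding up to the common sequence length.
        n_pad = max_len - len(tokens)
        batch["input_tokens"].append(tokens + [PAD_ID] * n_pad)
        batch["loss_masks"].append(mask + [0] * n_pad)
        batch["pooled_patches_idx"].append(pooled)
    return batch
```

With two examples that have two and one pooled rows respectively, the second example gains one -1 row and one dummy <im_patch> token, and both end up with matching shapes.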

PixMo-Cap on the Hub is URL-based: it stores image links, not bytes (columns image_url / caption / transcripts, 717K rows, train split only). hub mode therefore downloads each image from its URL at iteration time, while local mode is the network-free path for real training.
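
A rough sketch of the hub-mode iteration pattern, under stated assumptions: iter_caption_examples, the fetch_image callback, and the default prompt string are hypothetical names introduced here, not the PixmoCapDataset API. The downloader is injected so the loop can be exercised without network access, and limit counts yielded examples rather than rows scanned, matching the behaviour the tests describe.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional, Tuple


def iter_caption_examples(
    rows: Iterable[Dict[str, str]],
    fetch_image: Callable[[str], bytes],  # hypothetical downloader callback
    prompt: str = "Describe this image.",  # hypothetical prompt text
    limit: Optional[int] = None,
) -> Iterator[Tuple[str, str, bytes]]:
    """Yield (prompt, response, image) triples, skipping rows whose URL is dead."""
    yielded = 0
    for row in rows:
        if limit is not None and yielded >= limit:
            return
        try:
            image = fetch_image(row["image_url"])
        except OSError:
            # Dead link or network failure (urllib.error.URLError is an
            # OSError subclass): drop the row and move on.
            continue
        yield prompt, row["caption"], image
        yielded += 1
```

Because skipped rows do not count toward limit, a run with limit=N still yields N examples as long as enough live URLs remain.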

No synthetic dataset. A SyntheticMultimodalDataset previously stood in as the default source; it's been removed from the shipped library. Pipeline tests use a small test-only fixture (test/data/multimodal/synthetic_source.py); the FSDP test uses an inline tiny PixMo-Cap-shaped sample.

Train module (olmo_core/train/train_module/multimodal/)

MultimodalTransformerTrainModule reuses TransformerTrainModule's machinery and adds three things: FSDP/DDP wrapping of the composite model, a response-only loss (loss_masks → label_mask), and threading of images / pooled_patches_idx into MultimodalTransformer.forward. TP/CP/PP/EP are out of scope.
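
To make the response-only loss concrete, here is a small pure-Python sketch. It assumes the usual next-token setup, where labels are the inputs shifted left by one and the loss mask is shifted the same way. Whether the train module performs the shift exactly like this is an assumption, and shift_labels / masked_cross_entropy are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple


def shift_labels(
    input_tokens: Sequence[int], loss_masks: Sequence[float]
) -> Tuple[List[int], List[bool]]:
    """Position t predicts token t+1, so labels and mask both drop index 0."""
    labels = list(input_tokens[1:])
    label_mask = [m > 0 for m in loss_masks[1:]]
    return labels, label_mask


def masked_cross_entropy(
    logits: Sequence[Sequence[float]],
    labels: Sequence[int],
    label_mask: Sequence[bool],
) -> float:
    """Mean cross-entropy over positions where label_mask is True."""
    total, count = 0.0, 0
    for row, label, keep in zip(logits, labels, label_mask):
        if not keep:
            continue  # prompt / image positions contribute nothing
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else 0.0
```

Only response positions enter the average, so changing the logits at a prompt or image-token position leaves the loss unchanged.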

Model hooks (nn/vision/multimodal.py)

Adds the TrainModule interface MultimodalTransformer needs: apply_fsdp, apply_ddp, init_weights, num_flops_per_token, param-count properties, and post_batch/post_optim_step/aux-metrics delegation to the LM.

Tests — 54 pass (CPU + distributed)

| Test file | Tests | Coverage |
| --- | --- | --- |
| data/multimodal/pixmo_cap_test.py | 6 | local (skip-gated) + mocked hub (skips dead URLs, limit counts yielded) |
| data/multimodal/preprocessor_test.py | 11 | output keys, dtypes, loss masks, multicrop parity |
| data/multimodal/collator_test.py | 8 | uniform + variable n_crops/n_pooled, dummy-patch contract |
| data/multimodal/data_loader_test.py | 11 | rank/worker sharding, mock-batch contract, state round-trip |
| data/multimodal/end_to_end_test.py | 5 | source → loader → batch → forward |
| train/multimodal/train_module_test.py | 9 | loss-mask→label-mask, batch prep, sizing |
| train/multimodal/fsdp_test.py | 4 | FSDP/DDP wrapping, 2-rank gloo |

54 pass locally (no GPU required for the CPU set; hub tests are mocked, no network).

…odule

Training infrastructure for Molmo2 fine-tuning, stacked on the eval PR.

Training data pipeline (src/olmo_core/data/multimodal/):
- MultimodalPreprocessor: converts (text, images) → (input_ids, loss_masks,
  image_patches) using the Molmo2 multi-crop layout; wraps MultiCropPreprocessor
- MultimodalCollator: pads ragged token sequences and stacks image tensors
  into dense batches ready for MultimodalTransformer.forward()
- MultimodalDataLoader: rank-aware DataLoader with worker-safe image decoding
- SyntheticMultimodalDataset: synthetic dataset for integration tests

Train module (src/olmo_core/train/train_module/multimodal/):
- MultimodalTransformerTrainModule: wraps MultimodalTransformer with FSDP/DDP
  support, gradient checkpointing, and mixed-precision; integrates with the
  Trainer callback system