[4/n] Multimodal training data pipeline and train module #697
Draft
jason718 wants to merge 1 commit into
…odule

Training infrastructure for Molmo2 fine-tuning, stacked on the eval PR.

Training data pipeline (src/olmo_core/data/multimodal/):
- MultimodalPreprocessor: converts (text, images) → (input_ids, loss_masks, image_patches) using the Molmo2 multi-crop layout; wraps MultiCropPreprocessor
- MultimodalCollator: pads ragged token sequences and stacks image tensors into dense batches ready for MultimodalTransformer.forward()
- MultimodalDataLoader: rank-aware DataLoader with worker-safe image decoding
- SyntheticMultimodalDataset: synthetic dataset for integration tests

Train module (src/olmo_core/train/train_module/multimodal/):
- MultimodalTransformerTrainModule: wraps MultimodalTransformer with FSDP/DDP support, gradient checkpointing, and mixed precision; integrates with the Trainer callback system
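For reviewers, the "pads ragged token sequences" step of MultimodalCollator can be pictured roughly as below. This is a minimal pure-Python sketch, not the code in this PR: the function name pad_batch and the pad id 0 are hypothetical, and only the input_ids and loss_masks keys come from the commit message. Image stacking is left out.

```python
from typing import Dict, List, Sequence

# Hypothetical pad id, for illustration only; the real tokenizer's pad id may differ.
PAD_TOKEN_ID = 0


def pad_batch(
    examples: Sequence[Dict[str, List[int]]],
    pad_id: int = PAD_TOKEN_ID,
) -> Dict[str, List[List[int]]]:
    """Right-pad input_ids and loss_masks to the longest sequence in the batch."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "loss_masks": []}
    for ex in examples:
        if len(ex["input_ids"]) != len(ex["loss_masks"]):
            raise ValueError("input_ids and loss_masks must have the same length")
        n_pad = max_len - len(ex["input_ids"])
        batch["input_ids"].append(ex["input_ids"] + [pad_id] * n_pad)
        # Padding positions get mask 0 so they never contribute to the loss.
        batch["loss_masks"].append(ex["loss_masks"] + [0] * n_pad)
    return batch


if __name__ == "__main__":
    ragged = [
        {"input_ids": [5, 6, 7], "loss_masks": [0, 1, 1]},
        {"input_ids": [8], "loss_masks": [1]},
    ]
    print(pad_batch(ragged))
```

Every row in the result has the same length, so the batch can be converted to a dense tensor before it reaches the model.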
Context
Fourth PR in the VLM stack building Molmo2-style multimodal support.
What this adds
Training data pipeline (olmo_core/data/multimodal/)
- pixmo_cap.py: PixmoCapDataset, the PixMo-Cap (allenai/pixmo-cap) caption-pretraining source, selectable via source=. The hub source streams the dataset from the Hugging Face Hub and downloads each image from its image_url (skipping dead links); the local source reads a pre-downloaded on-disk copy ($MOLMO_DATA_DIR/.../cap), which is the path for large-scale AI2 runs. Yields (prompt, response, image) triples.
- preprocessor.py: MultimodalPreprocessor combines MultiCropPreprocessor + tokenizer into the per-example dict (input_tokens, loss_masks, images, pooled_patches_idx).
- collator.py: MultimodalCollator stacks per-example dicts into a batch; pads variable n_crops/n_pooled with -1 rows + dummy <im_patch> tokens so the model's patch-count contract holds.
- data_loader.py: MultimodalDataLoader wraps the source into a DataLoaderBase the Trainer consumes; rank + worker sharding.

Train module (olmo_core/train/train_module/multimodal/)
MultimodalTransformerTrainModule reuses TransformerTrainModule's machinery, adds FSDP/DDP wrapping of the composite model and response-only loss (loss_masks → label_mask), and threads images/pooled_patches_idx into MultimodalTransformer.forward. TP/CP/PP/EP are out of scope.

Model hooks (nn/vision/multimodal.py)
Adds the TrainModule interface MultimodalTransformer needs: apply_fsdp, apply_ddp, init_weights, num_flops_per_token, param-count properties, and post_batch/post_optim_step/aux-metrics delegation to the LM.

Tests: 54 pass (CPU + distributed)
- data/multimodal/pixmo_cap_test.py
- data/multimodal/preprocessor_test.py
- data/multimodal/collator_test.py: n_crops/n_pooled, dummy-patch contract
- data/multimodal/data_loader_test.py
- data/multimodal/end_to_end_test.py
- train/multimodal/train_module_test.py
- train/multimodal/fsdp_test.py

54 pass locally (no GPU required for the CPU set; hub tests are mocked, no network).
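To make the response-only loss (loss_masks → label_mask) concrete, here is a rough sketch of one way such a mask could turn into training labels. It rests on assumptions not stated in this PR: that loss_masks[i] is 1 when token i belongs to the response, that labels are the usual next-token shift, and that masked targets are replaced with -100 (the default ignore_index of PyTorch's cross-entropy loss). The helper name response_only_labels is hypothetical.

```python
from typing import List

# Default ignore_index of torch.nn.functional.cross_entropy.
IGNORE_INDEX = -100


def response_only_labels(input_ids: List[int], loss_masks: List[int]) -> List[int]:
    """Build next-token labels, keeping only targets that fall inside the response.

    Assumption for this sketch: loss_masks[i] == 1 means token i is a response token.
    """
    if len(input_ids) != len(loss_masks):
        raise ValueError("input_ids and loss_masks must have the same length")
    labels: List[int] = []
    for pos in range(len(input_ids)):
        nxt = pos + 1
        if nxt < len(input_ids) and loss_masks[nxt]:
            # Position `pos` is trained to predict the response token at `nxt`.
            labels.append(input_ids[nxt])
        else:
            # Prompt targets and the final position are excluded from the loss.
            labels.append(IGNORE_INDEX)
    return labels


if __name__ == "__main__":
    # Two prompt tokens followed by two response tokens.
    print(response_only_labels([10, 11, 12, 13], [0, 0, 1, 1]))
```

Under these assumptions the prompt tokens still condition the model but add nothing to the gradient, which is the usual motivation for masking them out in caption or instruction fine-tuning.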