# Tuning Vision Language Models
Our library also supports full fine-tuning and LoRA tuning for vision language models.

## Supported Dataset Format
We support tuning on an `image+text` dataset that includes:
- A single text field, formatted using the model’s chat template.
- A single image field, which can contain either a list of images or a single image.

The text must follow the OpenAI conversational data format, which is defined as a list of message objects. Each message object has two required fields:
- `role`: The speaker (e.g., "user" or "assistant").
- `content`: A list of dictionaries, each specifying:
  - `type`: `text` or `image`.
  - `text`: The text content (if applicable).

Example Format:
```json
[
  {
    "role": "user",
    "content": [
      {"type": "text", "text": "Who is this?"},
      {"type": "image"}
    ]
  },
  {
    "role": "assistant",
    "content": [
      {"type": "text", "text": "Barack Obama"}
    ]
  },
  {
    "role": "user",
    "content": [
      {"type": "text", "text": "What is he famous for?"}
    ]
  },
  {
    "role": "assistant",
    "content": [
      {"type": "text", "text": "He is the 44th President of the United States."}
    ]
  }
]
```
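
For reference, here is a minimal sketch of what a compatible sample looks like when loaded with the Hugging Face `datasets` library, using the dataset referenced in the sample configuration later in this document:

```python
from datasets import load_dataset

# Each row pairs a conversation column with an image column; these column
# names are what you later pass as dataset_text_field / dataset_image_field.
ds = load_dataset("HuggingFaceH4/llava-instruct-mix-vsft", split="train")
sample = ds[0]

print(sample["messages"])  # list of {"role": ..., "content": [...]} message objects
print(sample["images"])    # list of PIL.Image objects (a single image also works)
```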

## Dataset Processing

First, each dataset sample is processed by applying the [chat template](https://huggingface.co/docs/transformers/main/en/chat_templating) to the raw text, which formats the conversation as required. Then, the model’s [`processor`](https://huggingface.co/docs/transformers/main/en/processors) takes the formatted text and the corresponding image(s) and converts them into the final input representation (input_ids, attention masks, etc.) that the model uses for training.

**Note**: `Granite 3.2` and `Llava-1.5` Vision models expect a single image per dataset sample. If a list of images is provided, only the first image will be used.
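
To illustrate these two steps, here is a minimal sketch using the Hugging Face `transformers` `AutoProcessor` (illustrative only; the library performs this internally during training, and the image path is a placeholder):

```python
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("ibm-granite/granite-vision-3.2-2b")

messages = [
    {"role": "user", "content": [{"type": "text", "text": "Who is this?"}, {"type": "image"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "Barack Obama"}]},
]

# Step 1: apply the chat template to render the conversation as a single string.
text = processor.apply_chat_template(messages, tokenize=False)

# Step 2: the processor converts the formatted text and image(s) into model inputs.
image = Image.open("sample.jpg")  # placeholder path
inputs = processor(text=text, images=[image], return_tensors="pt")
print(inputs.keys())  # e.g., input_ids, attention_mask, pixel_values
```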

## Tuning Configurations

Two parameters must be passed to specify which dataset columns to use:
- `dataset_text_field`: The column name that contains the conversational text.
- `dataset_image_field`: The column name that contains the images.

Below is a sample configuration file:
```json
{
    "model_name_or_path": "ibm-granite/granite-vision-3.2-2b",
    "training_data_path": "HuggingFaceH4/llava-instruct-mix-vsft",
    "dataset_text_field": "messages",
    "dataset_image_field": "images",
    "output_dir": "/app/test",
    "num_train_epochs": 1.0,
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "learning_rate": 1e-4,
    "bf16": true,
    "torch_dtype": "bfloat16",
    "use_flash_attn": true,
    "remove_unused_columns": false,
    "dataset_kwargs": {"skip_prepare_dataset": true},
    "gradient_checkpointing": true,
    "gradient_checkpointing_kwargs": {"use_reentrant": false},
    "accelerate_launch_args": { "fsdp_transformer_layer_cls_to_wrap": "GraniteDecoderLayer"}
}
```

## Running the Trainer

You can also run training by calling our trainer module directly from the command line: use `python` for single-GPU training or the `accelerate launch` command for multi-GPU training.
For example:

Command for single GPU:

```sh
python tuning/sft_trainer.py \
--model_name_or_path $MODEL_PATH \
--training_data_path $TRAIN_DATA_PATH \
--output_dir $OUTPUT_PATH \
--num_train_epochs 5 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 1 \
--learning_rate 1e-5 \
--dataset_text_field "messages" \
--dataset_image_field "images"
```

Command for multi GPU:

```sh
accelerate launch \
--num_processes=$NUM_PROCESSORS \
--config_file fixtures/accelerate_fsdp_defaults.yaml \
tuning/sft_trainer.py \
--model_name_or_path $MODEL_PATH \
--training_data_path $TRAIN_DATA_PATH \
--output_dir $OUTPUT_PATH \
--num_train_epochs 5 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 1 \
--learning_rate 1e-5 \
--dataset_text_field "messages" \
--dataset_image_field "images"
```

## Tuning Considerations for Vision Models

Flash Attention 2.0 is not supported by `MllamaForConditionalGeneration` models, so when tuning the `Llama 3.2 Vision` models, set:

```json
"use_flash_attn": false
```

### Multi-GPU Tuning with FSDP:

When running multi-GPU tuning with FSDP, you need to wrap specific transformer layers. Use the following setting in the FSDP config, depending on your model:

Granite 3.2 Vision Models:
```json
"accelerate_launch_args": { "fsdp_transformer_layer_cls_to_wrap": "GraniteDecoderLayer" }
```

Llava-Next and Llava-1.5 Models:
```json
"accelerate_launch_args": { "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer" }
```

Llava-1.6-Mistral Model:
```json
"accelerate_launch_args": { "fsdp_transformer_layer_cls_to_wrap": "MistralDecoderLayer" }
```

Llama 3.2 Vision Models: No additional configuration is required.
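
If you launch with `accelerate launch` directly (as in the multi-GPU command above), the wrapping policy can also be supplied on the command line. This is a sketch assuming accelerate's `--fsdp_transformer_layer_cls_to_wrap` launch flag takes precedence over the values in the YAML config:

```sh
accelerate launch \
--config_file fixtures/accelerate_fsdp_defaults.yaml \
--fsdp_transformer_layer_cls_to_wrap GraniteDecoderLayer \
tuning/sft_trainer.py \
--model_name_or_path $MODEL_PATH \
--training_data_path $TRAIN_DATA_PATH \
--output_dir $OUTPUT_PATH \
--dataset_text_field "messages" \
--dataset_image_field "images"
```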

### Gradient Checkpointing:

We recommend running with `gradient_checkpointing=True`, as this greatly reduces the memory needed to train the model.

When running with gradient checkpointing for the `Llava` and `Granite` vision models, you will also need to set `gradient_checkpointing_kwargs` so that the activation-checkpointing variant requiring reentrant autograd is not used.

```json
"gradient_checkpointing_kwargs": {"use_reentrant": false}
```

Without this setting, tuning fails with errors such as:

```sh
RuntimeError: mat2 must be a matrix, got 1-D tensor
RuntimeError: Expected weight to be of same shape as normalized_shape, but got weight of shape [0] and normalized_shape = [1152]
```

### Other arguments:

To prevent default text-only processing and ensure proper handling of multimodal data, we recommend setting:

```json
"remove_unused_columns": false,
"dataset_kwargs": {"skip_prepare_dataset": true}
```

When performing LoRA tuning on vision models, you must specify the `target_modules` explicitly, as no defaults are provided.
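
For example, a minimal sketch of LoRA settings added to the configuration file shown earlier (the module names are illustrative: `q_proj` and `v_proj` are common choices for Llama-style language backbones, but the right targets depend on the model architecture):

```json
"peft_method": "lora",
"r": 8,
"lora_alpha": 32,
"lora_dropout": 0.05,
"target_modules": ["q_proj", "v_proj"]
```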