Merged
Commits
54 commits
30326cd
install trl=0.13, deepspeed, update transformers
anhuong Dec 23, 2024
e166e86
deps: install pillow, uninstall deepspeed
anhuong Dec 27, 2024
de15409
add multimodal flag, pass processor, add data collator
anhuong Dec 27, 2024
802405a
load dataset directly, pass processor, fix field
anhuong Dec 27, 2024
95f62ca
add generic data collator
anhuong Jan 26, 2025
79e9ecd
merge changes from main
anhuong Jan 26, 2025
c1f8e5f
remove load_dataset since HF support added
anhuong Jan 26, 2025
a69dd0a
add fsdp config needed for llava models
anhuong Feb 12, 2025
baa7fb0
Merge branch 'main' into vision-model
anhuong Feb 12, 2025
3b8b3a1
merge changes from main
anhuong Feb 21, 2025
0cd7d90
feat:Use of data handlers for Vision LM support (#4)
Abhishek-TAMU Feb 25, 2025
d33cc22
replace text_field_name for dataset_text_field and for image
anhuong Feb 28, 2025
42ae79d
remove multimodal flag
anhuong Feb 28, 2025
6f30f4c
fix formatting, remove unused fields
anhuong Feb 28, 2025
4aef3d8
remove irrelevant unit test
anhuong Feb 28, 2025
ef48df8
revert data loading back
anhuong Feb 28, 2025
a51140f
fix:Support loading for Granite-3.2 Vision Model
Abhishek-TAMU Mar 3, 2025
e1bec77
remove duplicate logger, fmt
anhuong Mar 3, 2025
5b7ea44
fix unbound var, refactor tokenizer
anhuong Mar 3, 2025
50cb0bd
changes from review comments
anhuong Mar 4, 2025
7822f37
fix embedding resize and errors
anhuong Mar 13, 2025
3a63cb8
add hack fix for vocab size for Mllama models
anhuong Mar 13, 2025
9581f19
add docs on vision model usage
anhuong Mar 14, 2025
b2295af
move llama vocab size, allow single image inputs
anhuong Mar 14, 2025
216ee32
linter fixes
anhuong Mar 14, 2025
372d967
merge changes from main
anhuong Mar 14, 2025
fb635de
fix merge, add lora note
anhuong Mar 14, 2025
8fa641e
docs: organize sections
anhuong Mar 18, 2025
af6e068
remove all dataset columns
anhuong Mar 18, 2025
2c104db
merge changes from main
anhuong Mar 20, 2025
0270148
Merge branch 'foundation-model-stack:main' into vision-model
Abhishek-TAMU Mar 20, 2025
0ace3b7
only take single image for granite models
anhuong Mar 21, 2025
4c4b67c
Merge remote-tracking branch 'upstream/main' into vision-model
Abhishek-TAMU Apr 1, 2025
b028080
Merge branch 'vision-model' of github.com:anhuong/fms-hf-tuning into …
Abhishek-TAMU Apr 1, 2025
5c41e0c
feat:Support Entire Vision dataset with Streaming (#6)
Abhishek-TAMU Apr 1, 2025
4f91fb3
PR change of adding vocab size
Abhishek-TAMU Apr 1, 2025
7d4fc12
Added llama vision model and unit test case
Abhishek-TAMU Apr 2, 2025
dd288fd
Make Jinja template work
Abhishek-TAMU Apr 5, 2025
8503acd
Merge remote-tracking branch 'upstream/main' into vision-model
Abhishek-TAMU Apr 5, 2025
168cde0
Merge remote-tracking branch 'upstream/main' into vision-model
Abhishek-TAMU Apr 8, 2025
2fb6e6a
Fix for preprocessor_config in checkpoint folder
Abhishek-TAMU Apr 9, 2025
ae7d30f
fmt fix
Abhishek-TAMU Apr 9, 2025
adf308b
Merge remote-tracking branch 'upstream/main' into vision-model
Abhishek-TAMU Apr 9, 2025
14afb72
Moving resizing out of if block
Abhishek-TAMU Apr 10, 2025
07ad9b3
Merge remote-tracking branch 'upstream/main' into vision-model
Abhishek-TAMU Apr 10, 2025
0339b93
Merge remote-tracking branch 'upstream/main' into vision-model
Abhishek-TAMU Apr 14, 2025
7f81292
Merge remote-tracking branch 'upstream/main' into vision-model
Abhishek-TAMU Apr 14, 2025
21de4a0
Test case fix and merging with main
Abhishek-TAMU Apr 14, 2025
1e78997
PR Change 1
Abhishek-TAMU Apr 15, 2025
5bdcd99
PR Change 2
Abhishek-TAMU Apr 15, 2025
db1fcf9
Added test_vision_data_collator
Abhishek-TAMU Apr 15, 2025
b43741d
Merge remote-tracking branch 'upstream/main' into vision-model
Abhishek-TAMU Apr 15, 2025
7c987f9
PR Changes
Abhishek-TAMU Apr 16, 2025
c7581dc
Comment change
Abhishek-TAMU Apr 16, 2025
4 changes: 2 additions & 2 deletions .pylintrc
@@ -281,7 +281,7 @@ ignored-parents=
max-args=5

# Maximum number of attributes for a class (custom).
max-attributes=10
max-attributes=15

# Maximum number of boolean expressions in an if statement (see R0916).
max-bool-expr=5
@@ -299,7 +299,7 @@ max-parents=7
max-public-methods=20

# Maximum number of return / yield for function / method body.
max-returns=6
max-returns=10

# Maximum number of statements in function / method body.
max-statements=50
33 changes: 31 additions & 2 deletions README.md
@@ -13,6 +13,7 @@
- [Fine Tuning](#fine-tuning)
- [FMS Acceleration](#fms-acceleration)
- [Extended Pre-Training](#extended-pre-training)
- [Tuning Vision Language Models](#tuning-vision-language-models)
- [Inference](#inference)
- [Running a single example](#running-a-single-example)
- [Running multiple examples](#running-multiple-examples)
@@ -39,15 +40,15 @@ pip install fms-hf-tuning
### Using FlashAttention

> Note: After installing, if you wish to use [FlashAttention](https://github.com/Dao-AILab/flash-attention), then you need to install these requirements:
```
```sh
pip install fms-hf-tuning[dev]
pip install fms-hf-tuning[flash-attn]
```
[FlashAttention](https://github.com/Dao-AILab/flash-attention) requires the [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit) to be pre-installed.

*Debug recommendation:* While training, if you encounter flash-attn errors such as `undefined symbol`, you can follow the steps below for a clean installation of the flash binaries. This may occur when multiple environments share the pip cache directory or when the torch version is updated.

```
```sh
pip uninstall flash-attn
pip cache purge
pip install fms-hf-tuning[flash-attn]
@@ -898,6 +899,34 @@ The `fms_acceleration.cli` can do more to search for all available configs, plug

We also support extended pre-training, where users may want to pretrain a model with a large number of samples. Please refer to our separate doc on [EPT Use Cases](./docs/ept.md).

## Tuning Vision Language Models

We also support full fine-tuning and LoRA tuning for vision language models - `Granite 3.2 Vision`, `Llama 3.2 Vision`, and `LLaVa-Next`.
For information on supported dataset formats and how to tune a vision-language model, please see [this document](./docs/vision-language-model-tuning.md).

### Supported vision models

Legend:

- ✅ Ready and available
- ✔️ Ready and available - compatible architecture
- 🚫 Not supported
- ? May be supported, but not tested

Model Name & Size | Model Architecture | Full Finetuning |
-------------------- | ---------------- | --------------- |
Llama 3.2-11B Vision | MllamaForConditionalGeneration | ✅* |
Llava 1.5-7B | LlavaForConditionalGeneration | ✅* |
Granite 3.1-2B Vision | LlavaNextForConditionalGeneration | ✅* |
Llava Mistral 1.6-7B | LlavaNextForConditionalGeneration | ✅* |

(*) - Supported with `fms-hf-tuning` v2.8.0 or later.

**Note**: vLLM currently does not support inference with LoRA-tuned vision models. To use a tuned LoRA adapter of a vision model, merge it with the base model before running vLLM inference.
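
As a rough illustration, the merge can be done with PEFT before serving. This is a minimal sketch, not a utility of this library; the model ID and paths are placeholders.

```python
# Minimal sketch: merge a LoRA adapter of a vision model into the base weights
# so the merged checkpoint can be served with vLLM. Paths are placeholders.
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import PeftModel

BASE_MODEL = "ibm-granite/granite-vision-3.2-2b"  # example base model
ADAPTER_PATH = "path/to/lora/adapter"             # output_dir of the LoRA tuning run
MERGED_PATH = "path/to/merged/model"

base = AutoModelForVision2Seq.from_pretrained(BASE_MODEL)
model = PeftModel.from_pretrained(base, ADAPTER_PATH)
merged = model.merge_and_unload()  # fold the LoRA weights into the base model

merged.save_pretrained(MERGED_PATH)
AutoProcessor.from_pretrained(BASE_MODEL).save_pretrained(MERGED_PATH)
```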

## Inference
Currently, we do *not* offer inference support as part of the library, but we provide a standalone script for running inference on tuned models for testing purposes. For a full list of options run `python scripts/run_inference.py --help`. Note that no data formatting / templating is applied at inference time.

174 changes: 174 additions & 0 deletions docs/vision-language-model-tuning.md
@@ -0,0 +1,174 @@
# Tuning Vision Language Models
Our library also supports full fine-tuning and LoRA tuning for vision language models.

## Supported Dataset Format
We support tuning an `image+text` dataset that includes:
- A single text field, formatted using the model’s chat template.
- A single image field, which can contain either a list of images or a single image.

The text must follow the OpenAI conversational data format, which is defined as a list of message objects. Each message object has two required fields, `role` and `content`:
- `role`: The speaker (e.g., "user" or "assistant").
- `content`: A list of dictionaries, each specifying:
- `type`: `text` or `image`.
- `text`: The text content (if applicable).

Example Format:
```json
[
{
"role": "user",
"content": [
{"type": "text", "text": "Who is this?"},
{"type": "image"}
]
},
{
"role": "assistant",
"content": [
{"type": "text", "text": "Barack Obama"}
]
},
{
"role": "user",
"content": [
{"type": "text", "text": "What is he famous for?"}
]
},
{
"role": "assistant",
"content": [
{"type": "text", "text": "He is the 44th President of the United States."}
]
}
]
```

## Processing of the dataset

First, each dataset sample is processed by applying the [chat template](https://huggingface.co/docs/transformers/main/en/chat_templating) to the raw text, which formats the conversation as required. Then, the model’s [`processor`](https://huggingface.co/docs/transformers/main/en/processors) takes the formatted text and the corresponding image(s) and converts them into the final input representation (e.g., input_ids, attention masks, etc.) that the model uses for training.

**Note**: `Granite 3.2` and `Llava-1.5` Vision models expect a single image for each dataset sample. If a list of images is provided, only the first image will be used.
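
For illustration, the sketch below shows roughly what this processing looks like with the Hugging Face `AutoProcessor`. It is a simplified approximation, not the library's exact code path, and the image path is a placeholder.

```python
# Rough sketch of the preprocessing described above (not the library's exact code path).
from transformers import AutoProcessor
from PIL import Image

processor = AutoProcessor.from_pretrained("ibm-granite/granite-vision-3.2-2b")

messages = [
    {"role": "user",
     "content": [{"type": "text", "text": "Who is this?"}, {"type": "image"}]},
    {"role": "assistant",
     "content": [{"type": "text", "text": "Barack Obama"}]},
]
image = Image.open("example.jpg")  # placeholder image

# 1) Render the conversation with the model's chat template.
text = processor.apply_chat_template(messages, tokenize=False)

# 2) Turn the formatted text and image(s) into model inputs
#    (input_ids, attention_mask, pixel_values, ...).
inputs = processor(text=text, images=image, return_tensors="pt")
print(inputs.keys())
```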

## Tuning configurations

Two parameters must be passed to specify which dataset columns to use:
- `dataset_text_field`: The column name that contains the conversational text.
- `dataset_image_field`: The column name that contains the images.

Below is a sample configuration file:
```json
{
"model_name_or_path": "ibm-granite/granite-vision-3.2-2b",
"training_data_path": "HuggingFaceH4/llava-instruct-mix-vsft",
"dataset_text_field": "messages",
"dataset_image_field": "images",
"output_dir": "/app/test",
"num_train_epochs": 1.0,
"per_device_train_batch_size": 8,
"gradient_accumulation_steps": 2,
"learning_rate": 1e-4,
"bf16": true,
"torch_dtype": "bfloat16",
"use_flash_attn": true,
"remove_unused_columns": false,
"dataset_kwargs": {"skip_prepare_dataset": true},
"gradient_checkpointing": true,
"gradient_checkpointing_kwargs": {"use_reentrant": false},
"accelerate_launch_args": { "fsdp_transformer_layer_cls_to_wrap": "GraniteDecoderLayer"}
}
```
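
As a quick sanity check (a sketch, not required for tuning), you can confirm which columns of the example dataset hold the conversation and the images; these column names are what `dataset_text_field` and `dataset_image_field` point to.

```python
# Sketch: inspect the example dataset to see which columns map to
# dataset_text_field ("messages") and dataset_image_field ("images").
from datasets import load_dataset

ds = load_dataset("HuggingFaceH4/llava-instruct-mix-vsft", split="train")
print(ds.column_names)        # expected to include "messages" and "images"
print(ds[0]["messages"][:1])  # first turn of the first conversation
```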

## Running the Trainer

You can also run training by calling our trainer module directly from the command line, using `python` for a single GPU or `accelerate launch` for multiple GPUs.
For example:

Command for single GPU:

```sh
python tuning/sft_trainer.py \
--model_name_or_path $MODEL_PATH \
--training_data_path $TRAIN_DATA_PATH \
--output_dir $OUTPUT_PATH \
--num_train_epochs 5 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 1 \
--learning_rate 1e-5 \
--dataset_text_field "messages" \
--dataset_image_field "images"
```

Command for multi GPU:

```sh
accelerate launch \
--num_processes=$NUM_PROCESSORS \
--config_file fixtures/accelerate_fsdp_defaults.yaml \
tuning/sft_trainer.py \
--model_name_or_path $MODEL_PATH \
--training_data_path $TRAIN_DATA_PATH \
--output_dir $OUTPUT_PATH \
--num_train_epochs 5 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 1 \
--learning_rate 1e-5 \
--dataset_text_field "messages" \
--dataset_image_field "images"
```

## Tuning Considerations for vision models

Flash Attention 2.0 is not supported by `MllamaForConditionalGeneration` models, so when tuning the `Llama 3.2 Vision` models, set:

```json
"use_flash_attn": false
```
### Multi-GPU Tuning with FSDP:

When running multi-GPU tuning with FSDP, specific transformer layers need to be wrapped. Use the following setting in the FSDP config, based on your model:

Granite 3.2 Vision Models:
```json
"accelerate_launch_args": { "fsdp_transformer_layer_cls_to_wrap": "GraniteDecoderLayer" }
```

Llava-Next and Llava-1.5 Models:
```json
"accelerate_launch_args": { "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer" }
```

Llava-1.6-Mistral Model:
```json
"accelerate_launch_args": { "fsdp_transformer_layer_cls_to_wrap": "MistralDecoderLayer" }
```

Llama 3.2 Vision Models: No additional configuration is required.

### Gradient Checkpointing:

We recommend running with `gradient_checkpointing=True`, as enabling it greatly reduces the memory needed to load and run the model.

When running with gradient checkpointing for the `Llava` and `Granite` vision models, you also need to set `gradient_checkpointing_kwargs` so that the activation-checkpointing variant requiring reentrant autograd is not used.

```json
"gradient_checkpointing_kwargs": {"use_reentrant": false}
```

Without this setting, tuning will fail with errors such as:

```sh
RuntimeError: mat2 must be a matrix, got 1-D tensor
RuntimeError: Expected weight to be of same shape as normalized_shape, but got weight of shape [0] and normalized_shape = [1152]
```

### Other arguments:

To prevent default text-only processing and ensure proper handling of multimodal data, we recommend setting:

```json
"remove_unused_columns": false
"dataset_kwargs": {"skip_prepare_dataset": true}
```

When performing LoRA tuning on vision models, you must specify the `target_modules` explicitly, as no defaults are provided.
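
One way to choose them is to inspect the model's linear-layer names and pass a subset (for example, the attention projection layers) via `target_modules`. The snippet below is a hedged sketch; the names it prints depend on the architecture, and the selection shown afterwards is illustrative rather than prescriptive.

```python
# Sketch: list linear-layer names in a vision LM to help choose LoRA target_modules
# explicitly (no defaults are provided for vision models).
import torch.nn as nn
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("ibm-granite/granite-vision-3.2-2b")
linear_names = sorted({name.split(".")[-1] for name, module in model.named_modules()
                       if isinstance(module, nn.Linear)})
print(linear_names)  # e.g. ["down_proj", "gate_proj", "k_proj", "o_proj", "q_proj", ...]
```

A common illustrative choice would then be to add something like `"target_modules": ["q_proj", "v_proj"]` to the tuning config.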

6 changes: 6 additions & 0 deletions fixtures/accelerate_fsdp_defaults.yaml
@@ -41,6 +41,12 @@ fsdp_config:
# not needed for HF models that have . _no_split_modules
# the example below is for GPTBigCode
# fsdp_transformer_layer_cls_to_wrap: "GPTBigCodeBlock”
# needed for llava-1.5-vision + llava-next-vision models
# fsdp_transformer_layer_cls_to_wrap: "LlamaDecoderLayer"
# needed for llava-1.6-mistral-vision model
# fsdp_transformer_layer_cls_to_wrap: "MistralDecoderLayer"
# needed for granite-3.2-vision model
# fsdp_transformer_layer_cls_to_wrap: "GraniteDecoderLayer"

# for "autocast" mixed precision training, where the weights of the model are kept at higher precision, but the
# learning products (e.g., gradients, model parameters) are kept at a lower precision. Default is 'no'. Other options
1 change: 1 addition & 0 deletions pyproject.toml
@@ -39,6 +39,7 @@ dependencies = [
"protobuf>=5.28.0,<6.0.0",
"datasets>=2.15.0,<4.0",
"simpleeval>=0.9.13,<2.0",
"pillow>=11.0.0,<12.0",
]

[project.optional-dependencies]
3 changes: 3 additions & 0 deletions tests/artifacts/tiny-llama-vision-model/chat_template.json
@@ -0,0 +1,3 @@
{
"chat_template": "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- if strftime_now is defined %}\n {%- set date_string = strftime_now(\"%d %b %Y\") %}\n {%- else %}\n {%- set date_string = \"26 Jul 2024\" %}\n {%- endif %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- set user_supplied_system_message = true %}\n{%- else %}\n {%- set system_message = \"\" %}\n {%- set user_supplied_system_message = false %}\n{%- endif %}\n\n{#- Find out if there are any images #}\n{% set image_ns = namespace(has_images=false) %} \n{%- for message in messages %}\n {%- for content in message['content'] %}\n {%- if content['type'] == 'image' %}\n {%- set image_ns.has_images = true %}\n {%- endif %}\n {%- endfor %}\n{%- endfor %}\n\n{#- System message if there are no images, or if the user supplied one #}\n{%- if user_supplied_system_message or not image_ns.has_images %}\n {{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n {%- if tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n {%- endif %}\n {{- \"Cutting Knowledge Date: December 2023\\n\" }}\n {{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n {%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {%- endif %}\n {{- system_message }}\n {{- \"<|eot_id|>\" }}\n{%- endif %}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n' }}\n {%- if message['content'] is string %}\n {{- message['content'] }}\n {%- else %}\n {%- for content in message['content'] %}\n {%- if content['type'] == 'image' %}\n {{- '<|image|>' }}\n {%- elif content['type'] == 'text' %}\n {{- content['text'] }}\n {%- endif %}\n {%- endfor %}\n {%- endif %}\n {{- '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {{- \"<|eot_id|>\" }}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n"
}
44 changes: 44 additions & 0 deletions tests/artifacts/tiny-llama-vision-model/config.json
@@ -0,0 +1,44 @@
{
"architectures": [
"MllamaForConditionalGeneration"
],
"image_token_index": 128256,
"model_type": "mllama",
"text_config": {
"cross_attention_layers": [
1
],
"eos_token_id": [
128001,
128008,
128009
],
"hidden_size": 128,
"intermediate_size": 768,
"max_position_embeddings": 1024,
"model_type": "mllama_text_model",
"num_attention_heads": 4,
"num_hidden_layers": 2,
"rope_scaling": {
"factor": 8.0,
"high_freq_factor": 4.0,
"low_freq_factor": 1.0,
"original_max_position_embeddings": 512,
"rope_type": "llama3"
},
"torch_dtype": "float16"
},
"torch_dtype": "float16",
"transformers_version": "4.49.0",
"vision_config": {
"attention_heads": 4,
"hidden_size": 128,
"image_size": 224,
"intermediate_size": 512,
"model_type": "mllama_vision_model",
"num_global_layers": 1,
"num_hidden_layers": 2,
"torch_dtype": "float16",
"vision_output_dim": 256
}
}
11 changes: 11 additions & 0 deletions tests/artifacts/tiny-llama-vision-model/generation_config.json
@@ -0,0 +1,11 @@
{
"_from_model_config": true,
"bos_token_id": 128000,
"eos_token_id": [
128001,
128008,
128009
],
"pad_token_id": 128004,
"transformers_version": "4.49.0"
}
Binary file not shown.
26 changes: 26 additions & 0 deletions tests/artifacts/tiny-llama-vision-model/preprocessor_config.json
@@ -0,0 +1,26 @@
{
"do_convert_rgb": true,
"do_normalize": true,
"do_pad": true,
"do_rescale": true,
"do_resize": true,
"image_mean": [
0.48145466,
0.4578275,
0.40821073
],
"image_processor_type": "MllamaImageProcessor",
"image_std": [
0.26862954,
0.26130258,
0.27577711
],
"max_image_tiles": 4,
"processor_class": "MllamaProcessor",
"resample": 2,
"rescale_factor": 0.00392156862745098,
"size": {
"height": 560,
"width": 560
}
}
23 changes: 23 additions & 0 deletions tests/artifacts/tiny-llama-vision-model/special_tokens_map.json
@@ -0,0 +1,23 @@
{
"bos_token": {
"content": "<|begin_of_text|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "<|eot_id|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": {
"content": "<|finetune_right_pad_id|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}