Merged
Commits
54 commits
30326cd
install trl=0.13, deepspeed, update transformers
anhuong Dec 23, 2024
e166e86
deps: install pillow, uninstall deepspeed
anhuong Dec 27, 2024
de15409
add multimodal flag, pass processor, add data collator
anhuong Dec 27, 2024
802405a
load dataset directly, pass processor, fix field
anhuong Dec 27, 2024
95f62ca
add generic data collator
anhuong Jan 26, 2025
79e9ecd
merge changes from main
anhuong Jan 26, 2025
c1f8e5f
remove load_dataset since HF support added
anhuong Jan 26, 2025
a69dd0a
add fsdp config needed for llava models
anhuong Feb 12, 2025
baa7fb0
Merge branch 'main' into vision-model
anhuong Feb 12, 2025
3b8b3a1
merge changes from main
anhuong Feb 21, 2025
0cd7d90
feat:Use of data handlers for Vision LM support (#4)
Abhishek-TAMU Feb 25, 2025
d33cc22
replace text_field_name for dataset_text_field and for image
anhuong Feb 28, 2025
42ae79d
remove multimodal flag
anhuong Feb 28, 2025
6f30f4c
fix formatting, remove unused fields
anhuong Feb 28, 2025
4aef3d8
remove irrelevant unit test
anhuong Feb 28, 2025
ef48df8
revert data loading back
anhuong Feb 28, 2025
a51140f
fix:Support loading for Granite-3.2 Vision Model
Abhishek-TAMU Mar 3, 2025
e1bec77
remove duplicate logger, fmt
anhuong Mar 3, 2025
5b7ea44
fix unbound var, refactor tokenizer
anhuong Mar 3, 2025
50cb0bd
changes from review comments
anhuong Mar 4, 2025
7822f37
fix embedding resize and errors
anhuong Mar 13, 2025
3a63cb8
add hack fix for vocab size for Mllama models
anhuong Mar 13, 2025
9581f19
add docs on vision model usage
anhuong Mar 14, 2025
b2295af
move llama vocab size, allow single image inputs
anhuong Mar 14, 2025
216ee32
linter fixes
anhuong Mar 14, 2025
372d967
merge changes from main
anhuong Mar 14, 2025
fb635de
fix merge, add lora note
anhuong Mar 14, 2025
8fa641e
docs: organize sections
anhuong Mar 18, 2025
af6e068
remove all dataset columns
anhuong Mar 18, 2025
2c104db
merge changes from main
anhuong Mar 20, 2025
0270148
Merge branch 'foundation-model-stack:main' into vision-model
Abhishek-TAMU Mar 20, 2025
0ace3b7
only take single image for granite models
anhuong Mar 21, 2025
4c4b67c
Merge remote-tracking branch 'upstream/main' into vision-model
Abhishek-TAMU Apr 1, 2025
b028080
Merge branch 'vision-model' of github.com:anhuong/fms-hf-tuning into …
Abhishek-TAMU Apr 1, 2025
5c41e0c
feat:Support Entire Vision dataset with Streaming (#6)
Abhishek-TAMU Apr 1, 2025
4f91fb3
PR change of adding vocab size
Abhishek-TAMU Apr 1, 2025
7d4fc12
Added llama vision model and unit test case
Abhishek-TAMU Apr 2, 2025
dd288fd
Make Jinja template work
Abhishek-TAMU Apr 5, 2025
8503acd
Merge remote-tracking branch 'upstream/main' into vision-model
Abhishek-TAMU Apr 5, 2025
168cde0
Merge remote-tracking branch 'upstream/main' into vision-model
Abhishek-TAMU Apr 8, 2025
2fb6e6a
Fix for preprocessor_config in checkpoint folder
Abhishek-TAMU Apr 9, 2025
ae7d30f
fmt fix
Abhishek-TAMU Apr 9, 2025
adf308b
Merge remote-tracking branch 'upstream/main' into vision-model
Abhishek-TAMU Apr 9, 2025
14afb72
Moving resizing out of if block
Abhishek-TAMU Apr 10, 2025
07ad9b3
Merge remote-tracking branch 'upstream/main' into vision-model
Abhishek-TAMU Apr 10, 2025
0339b93
Merge remote-tracking branch 'upstream/main' into vision-model
Abhishek-TAMU Apr 14, 2025
7f81292
Merge remote-tracking branch 'upstream/main' into vision-model
Abhishek-TAMU Apr 14, 2025
21de4a0
Test case fix and merging with main
Abhishek-TAMU Apr 14, 2025
1e78997
PR Change 1
Abhishek-TAMU Apr 15, 2025
5bdcd99
PR Change 2
Abhishek-TAMU Apr 15, 2025
db1fcf9
Added test_vision_data_collator
Abhishek-TAMU Apr 15, 2025
b43741d
Merge remote-tracking branch 'upstream/main' into vision-model
Abhishek-TAMU Apr 15, 2025
7c987f9
PR Changes
Abhishek-TAMU Apr 16, 2025
c7581dc
Comment change
Abhishek-TAMU Apr 16, 2025
4 changes: 2 additions & 2 deletions .pylintrc
@@ -281,7 +281,7 @@ ignored-parents=
max-args=5

# Maximum number of attributes for a class (custom).
max-attributes=10
max-attributes=15

# Maximum number of boolean expressions in an if statement (see R0916).
max-bool-expr=5
@@ -299,7 +299,7 @@ max-parents=7
max-public-methods=20

# Maximum number of return / yield for function / method body.
max-returns=6
max-returns=10

# Maximum number of statements in function / method body.
max-statements=50
33 changes: 31 additions & 2 deletions README.md
@@ -13,6 +13,7 @@
- [Fine Tuning](#fine-tuning)
- [FMS Acceleration](#fms-acceleration)
- [Extended Pre-Training](#extended-pre-training)
- [Tuning Vision Language Models](#tuning-vision-language-models)
- [Inference](#inference)
- [Running a single example](#running-a-single-example)
- [Running multiple examples](#running-multiple-examples)
@@ -39,15 +40,15 @@ pip install fms-hf-tuning
### Using FlashAttention

> Note: After installing, if you wish to use [FlashAttention](https://github.com/Dao-AILab/flash-attention), then you need to install these requirements:
```
```sh
pip install fms-hf-tuning[dev]
pip install fms-hf-tuning[flash-attn]
```
[FlashAttention](https://github.com/Dao-AILab/flash-attention) requires the [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit) to be pre-installed.

*Debug recommendation:* While training, if you encounter flash-attn errors such as `undefined symbol`, you can follow the steps below for a clean installation of the flash binaries. This may occur when multiple environments share the pip cache directory or when the torch version is updated.

```
```sh
pip uninstall flash-attn
pip cache purge
pip install fms-hf-tuning[flash-attn]
@@ -898,6 +899,34 @@ The `fms_acceleration.cli` can do more to search for all available configs, plug

We also support extended pre-training, where users may want to pretrain a model with a large number of samples. Please refer to our separate doc on [EPT Use Cases](./docs/ept.md).

## Tuning Vision Language Models

We also support full fine-tuning and LoRA tuning for vision language models - `Granite 3.2 Vision`, `Llama 3.2 Vision`, and `LLaVa-Next`.
For information on supported dataset formats and how to tune a vision-language model, please see [this document](./docs/vision-language-model-tuning.md).

### Supported vision models

Legend:

- ✅ Ready and available
- ✔️ Ready and available - compatible architecture
- 🚫 Not supported
- ? May be supported, but not tested

Model Name & Size | Model Architecture | Full Finetuning |
-------------------- | ---------------- | --------------- |
Llama 3.2-11B Vision | MllamaForConditionalGeneration | ✅* |
Llava 1.5-7B | LlavaForConditionalGeneration | ✅* |
Granite 3.1-2B Vision | LlavaNextForConditionalGeneration | ✅* |
Llava Mistral 1.6-7B | LlavaNextForConditionalGeneration | ✅* |

(*) - Supported with `fms-hf-tuning` v2.8.0 or later.

**Note**: vLLM currently does not support inference with LoRA-tuned vision models. To use a tuned LoRA adapter of a vision model, merge it with the base model before running vLLM inference.
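
As a rough illustration, the merge can be done with PEFT before serving. This is a minimal sketch, not a utility of this library; the model ID and paths are placeholders.

```python
# Minimal sketch: merge a LoRA adapter of a vision model into the base weights
# so the merged checkpoint can be served with vLLM. Paths are placeholders.
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import PeftModel

BASE_MODEL = "ibm-granite/granite-vision-3.2-2b"  # example base model
ADAPTER_PATH = "path/to/lora/adapter"             # output_dir of the LoRA tuning run
MERGED_PATH = "path/to/merged/model"

base = AutoModelForVision2Seq.from_pretrained(BASE_MODEL)
model = PeftModel.from_pretrained(base, ADAPTER_PATH)
merged = model.merge_and_unload()  # fold the LoRA weights into the base model

merged.save_pretrained(MERGED_PATH)
AutoProcessor.from_pretrained(BASE_MODEL).save_pretrained(MERGED_PATH)
```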

## Inference
Currently, we do *not* offer inference support as part of the library, but we provide a standalone script for running inference on tuned models for testing purposes. For a full list of options run `python scripts/run_inference.py --help`. Note that no data formatting / templating is applied at inference time.

174 changes: 174 additions & 0 deletions docs/vision-language-model-tuning.md
@@ -0,0 +1,174 @@
# Tuning Vision Language Models
Our library also supports full fine-tuning and LoRA tuning for vision language models.

## Supported Dataset Format
We support tuning an `image+text` dataset that includes:
- A single text field, formatted using the model’s chat template.
- A single image field, which can contain either a list of images or a single image.

The text must follow the OpenAI conversational data format, which is defined as a list of message objects. Each message object has two required fields, `role` and `content`:
- `role`: The speaker (e.g., "user" or "assistant").
- `content`: A list of dictionaries, each specifying:
- `type`: `text` or `image`.
- `text`: The text content (if applicable).

Example Format:
```json
[
{
"role": "user",
"content": [
{"type": "text", "text": "Who is this?"},
{"type": "image"}
]
},
{
"role": "assistant",
"content": [
{"type": "text", "text": "Barack Obama"}
]
},
{
"role": "user",
"content": [
{"type": "text", "text": "What is he famous for?"}
]
},
{
"role": "assistant",
"content": [
{"type": "text", "text": "He is the 44th President of the United States."}
]
}
]
```

## Processing of the dataset

First, each dataset sample is processed by applying the [chat template](https://huggingface.co/docs/transformers/main/en/chat_templating) to the raw text, which formats the conversation as required. Then, the model’s [`processor`](https://huggingface.co/docs/transformers/main/en/processors) takes the formatted text and the corresponding image(s) and converts them into the final input representation (e.g., input_ids, attention masks, etc.) that the model uses for training.

**Note**: `Granite 3.2` and `Llava-1.5` Vision models expect a single image for each dataset sample. If a list of images is provided, only the first image will be used.
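
For illustration, the sketch below shows roughly what this processing looks like with the Hugging Face `AutoProcessor`. It is a simplified approximation, not the library's exact code path, and the image path is a placeholder.

```python
# Rough sketch of the preprocessing described above (not the library's exact code path).
from transformers import AutoProcessor
from PIL import Image

processor = AutoProcessor.from_pretrained("ibm-granite/granite-vision-3.2-2b")

messages = [
    {"role": "user",
     "content": [{"type": "text", "text": "Who is this?"}, {"type": "image"}]},
    {"role": "assistant",
     "content": [{"type": "text", "text": "Barack Obama"}]},
]
image = Image.open("example.jpg")  # placeholder image

# 1) Render the conversation with the model's chat template.
text = processor.apply_chat_template(messages, tokenize=False)

# 2) Turn the formatted text and image(s) into model inputs
#    (input_ids, attention_mask, pixel_values, ...).
inputs = processor(text=text, images=image, return_tensors="pt")
print(inputs.keys())
```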

## Tuning configurations

Two parameters must be passed to specify which dataset columns to use:
- `dataset_text_field`: The column name that contains the conversational text.
- `dataset_image_field`: The column name that contains the images.

Below is a sample configuration file:
```json
{
"model_name_or_path": "ibm-granite/granite-vision-3.2-2b",
"training_data_path": "HuggingFaceH4/llava-instruct-mix-vsft",
"dataset_text_field": "messages",
"dataset_image_field": "images",
"output_dir": "/app/test",
"num_train_epochs": 1.0,
"per_device_train_batch_size": 8,
"gradient_accumulation_steps": 2,
"learning_rate": 1e-4,
"bf16": true,
"torch_dtype": "bfloat16",
"use_flash_attn": true,
"remove_unused_columns": false,
"dataset_kwargs": {"skip_prepare_dataset": true},
"gradient_checkpointing": true,
"gradient_checkpointing_kwargs": {"use_reentrant": false},
"accelerate_launch_args": { "fsdp_transformer_layer_cls_to_wrap": "GraniteDecoderLayer"}
}
```
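
As a quick sanity check (a sketch, not required for tuning), you can confirm which columns of the example dataset hold the conversation and the images; these column names are what `dataset_text_field` and `dataset_image_field` point to.

```python
# Sketch: inspect the example dataset to see which columns map to
# dataset_text_field ("messages") and dataset_image_field ("images").
from datasets import load_dataset

ds = load_dataset("HuggingFaceH4/llava-instruct-mix-vsft", split="train")
print(ds.column_names)        # expected to include "messages" and "images"
print(ds[0]["messages"][:1])  # first turn of the first conversation
```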

## Running the Trainer

You can also run training by calling our trainer module directly from the command line, using `python` for a single GPU or `accelerate launch` for multiple GPUs.
For example:

Command for single GPU:

```sh
python tuning/sft_trainer.py \
--model_name_or_path $MODEL_PATH \
--training_data_path $TRAIN_DATA_PATH \
--output_dir $OUTPUT_PATH \
--num_train_epochs 5 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 1 \
--learning_rate 1e-5 \
--dataset_text_field "messages" \
--dataset_image_field "images"
```

Command for multi GPU:

```sh
accelerate launch \
--num_processes=$NUM_PROCESSORS \
--config_file fixtures/accelerate_fsdp_defaults.yaml \
tuning/sft_trainer.py \
--model_name_or_path $MODEL_PATH \
--training_data_path $TRAIN_DATA_PATH \
--output_dir $OUTPUT_PATH \
--num_train_epochs 5 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 1 \
--learning_rate 1e-5 \
--dataset_text_field "messages" \
--dataset_image_field "images"
```

## Tuning Considerations for vision models

Flash Attention 2.0 is not supported by `MllamaForConditionalGeneration` models, so when tuning the `Llama 3.2 Vision` models, set:

```json
"use_flash_attn": false
```
### Multi-GPU Tuning with FSDP:

When running multi-GPU tuning with FSDP, specific transformer layers need to be wrapped. Use the following setting in the FSDP config, based on your model:

Granite 3.2 Vision Models:
```json
"accelerate_launch_args": { "fsdp_transformer_layer_cls_to_wrap": "GraniteDecoderLayer" }
```

Llava-Next and Llava-1.5 Models:
```json
"accelerate_launch_args": { "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer" }
```

Llava-1.6-Mistral Model:
```json
"accelerate_launch_args": { "fsdp_transformer_layer_cls_to_wrap": "MistralDecoderLayer" }
```

Llama 3.2 Vision Models: No additional configuration is required.

### Gradient Checkpointing:

We recommend running with `gradient_checkpointing=True`, as enabling it greatly reduces the memory needed to load and run the model.

When running with gradient checkpointing for the `Llava` and `Granite` vision models, you also need to set `gradient_checkpointing_kwargs` so that the activation-checkpointing variant requiring reentrant autograd is not used.

```json
"gradient_checkpointing_kwargs": {"use_reentrant": false}
```

Without this setting, tuning will fail with errors such as:

```sh
RuntimeError: mat2 must be a matrix, got 1-D tensor
RuntimeError: Expected weight to be of same shape as normalized_shape, but got weight of shape [0] and normalized_shape = [1152]
```

### Other arguments:

To prevent default text-only processing and ensure proper handling of multimodal data, we recommend setting:

```json
"remove_unused_columns": false
"dataset_kwargs": {"skip_prepare_dataset": true}
```

When performing LoRA tuning on vision models, you must specify the `target_modules` explicitly, as no defaults are provided.
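
One way to choose them is to inspect the model's linear-layer names and pass a subset (for example, the attention projection layers) via `target_modules`. The snippet below is a hedged sketch; the names it prints depend on the architecture, and the selection shown afterwards is illustrative rather than prescriptive.

```python
# Sketch: list linear-layer names in a vision LM to help choose LoRA target_modules
# explicitly (no defaults are provided for vision models).
import torch.nn as nn
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("ibm-granite/granite-vision-3.2-2b")
linear_names = sorted({name.split(".")[-1] for name, module in model.named_modules()
                       if isinstance(module, nn.Linear)})
print(linear_names)  # e.g. ["down_proj", "gate_proj", "k_proj", "o_proj", "q_proj", ...]
```

A common illustrative choice would then be to add something like `"target_modules": ["q_proj", "v_proj"]` to the tuning config.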

6 changes: 6 additions & 0 deletions fixtures/accelerate_fsdp_defaults.yaml
@@ -41,6 +41,12 @@ fsdp_config:
# not needed for HF models that have . _no_split_modules
# the example below is for GPTBigCode
# fsdp_transformer_layer_cls_to_wrap: "GPTBigCodeBlock”
# needed for llava-1.5-vision + llava-next-vision models
# fsdp_transformer_layer_cls_to_wrap: "LlamaDecoderLayer"
# needed for llava-1.6-mistral-vision model
# fsdp_transformer_layer_cls_to_wrap: "MistralDecoderLayer"
# needed for granite-3.2-vision model
# fsdp_transformer_layer_cls_to_wrap: "GraniteDecoderLayer"

# for "autocast" mixed precision training, where the weights of the model are kept at higher precision, but the
# learning products (e.g., gradients, model parameters) are kept at a lower precision. Default is 'no'. Other options
1 change: 1 addition & 0 deletions pyproject.toml
@@ -39,6 +39,7 @@ dependencies = [
"protobuf>=5.28.0,<6.0.0",
"datasets>=2.15.0,<4.0",
"simpleeval>=0.9.13,<2.0",
"pillow>=11.0.0,<12.0",
]

[project.optional-dependencies]
3 changes: 3 additions & 0 deletions tests/artifacts/tiny-llama-vision-model/chat_template.json
@@ -0,0 +1,3 @@
{
"chat_template": "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- if strftime_now is defined %}\n {%- set date_string = strftime_now(\"%d %b %Y\") %}\n {%- else %}\n {%- set date_string = \"26 Jul 2024\" %}\n {%- endif %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- set user_supplied_system_message = true %}\n{%- else %}\n {%- set system_message = \"\" %}\n {%- set user_supplied_system_message = false %}\n{%- endif %}\n\n{#- Find out if there are any images #}\n{% set image_ns = namespace(has_images=false) %} \n{%- for message in messages %}\n {%- for content in message['content'] %}\n {%- if content['type'] == 'image' %}\n {%- set image_ns.has_images = true %}\n {%- endif %}\n {%- endfor %}\n{%- endfor %}\n\n{#- System message if there are no images, or if the user supplied one #}\n{%- if user_supplied_system_message or not image_ns.has_images %}\n {{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n {%- if tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n {%- endif %}\n {{- \"Cutting Knowledge Date: December 2023\\n\" }}\n {{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n {%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {%- endif %}\n {{- system_message }}\n {{- \"<|eot_id|>\" }}\n{%- endif %}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n' }}\n {%- if message['content'] is string %}\n {{- message['content'] }}\n {%- else %}\n {%- for content in message['content'] %}\n {%- if content['type'] == 'image' %}\n {{- '<|image|>' }}\n {%- elif content['type'] == 'text' %}\n {{- content['text'] }}\n {%- endif %}\n {%- endfor %}\n {%- endif %}\n {{- '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {{- \"<|eot_id|>\" }}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n"
}
44 changes: 44 additions & 0 deletions tests/artifacts/tiny-llama-vision-model/config.json
@@ -0,0 +1,44 @@
{
"architectures": [
"MllamaForConditionalGeneration"
],
"image_token_index": 128256,
"model_type": "mllama",
"text_config": {
"cross_attention_layers": [
1
],
"eos_token_id": [
128001,
128008,
128009
],
"hidden_size": 128,
"intermediate_size": 768,
"max_position_embeddings": 1024,
"model_type": "mllama_text_model",
"num_attention_heads": 4,
"num_hidden_layers": 2,
"rope_scaling": {
"factor": 8.0,
"high_freq_factor": 4.0,
"low_freq_factor": 1.0,
"original_max_position_embeddings": 512,
"rope_type": "llama3"
},
"torch_dtype": "float16"
},
"torch_dtype": "float16",
"transformers_version": "4.49.0",
"vision_config": {
"attention_heads": 4,
"hidden_size": 128,
"image_size": 224,
"intermediate_size": 512,
"model_type": "mllama_vision_model",
"num_global_layers": 1,
"num_hidden_layers": 2,
"torch_dtype": "float16",
"vision_output_dim": 256
}
}
11 changes: 11 additions & 0 deletions tests/artifacts/tiny-llama-vision-model/generation_config.json
@@ -0,0 +1,11 @@
{
"_from_model_config": true,
"bos_token_id": 128000,
"eos_token_id": [
128001,
128008,
128009
],
"pad_token_id": 128004,
"transformers_version": "4.49.0"
}
Binary file not shown.
26 changes: 26 additions & 0 deletions tests/artifacts/tiny-llama-vision-model/preprocessor_config.json
@@ -0,0 +1,26 @@
{
"do_convert_rgb": true,
"do_normalize": true,
"do_pad": true,
"do_rescale": true,
"do_resize": true,
"image_mean": [
0.48145466,
0.4578275,
0.40821073
],
"image_processor_type": "MllamaImageProcessor",
"image_std": [
0.26862954,
0.26130258,
0.27577711
],
"max_image_tiles": 4,
"processor_class": "MllamaProcessor",
"resample": 2,
"rescale_factor": 0.00392156862745098,
"size": {
"height": 560,
"width": 560
}
}
23 changes: 23 additions & 0 deletions tests/artifacts/tiny-llama-vision-model/special_tokens_map.json
@@ -0,0 +1,23 @@
{
"bos_token": {
"content": "<|begin_of_text|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "<|eot_id|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": {
"content": "<|finetune_right_pad_id|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}