Commit fa070a8

feat: support loading vision model (#451)
* install trl=0.13, deepspeed, update transformers * deps: install pillow, uninstall deepspeed * add multimodal flag, pass processor, add data collator * load dataset directly, pass processor, fix field * add generic data collator Signed-off-by: Anh Uong <anh.uong@ibm.com> * remove load_dataset since HF support added Signed-off-by: Anh Uong <anh.uong@ibm.com> * add fsdp config needed for llava models Signed-off-by: Anh Uong <anh.uong@ibm.com> * feat:Use of data handlers for Vision LM support (#4) * Changes to support vlms Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Change in kwargs Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Restructure of VisionDataCollator Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Usage of 2 handlers and modifying chat_template handler Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * fix fmt+lint Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Minor Fix for unit test case Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Minor error handling Signed-off-by: Abhishek <maurya.abhishek@ibm.com> --------- Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * replace text_field_name for dataset_text_field and for image Signed-off-by: Anh Uong <anh.uong@ibm.com> * remove multimodal flag Signed-off-by: Anh Uong <anh.uong@ibm.com> * fix formatting, remove unused fields Signed-off-by: Anh Uong <anh.uong@ibm.com> * remove irrelevant unit test - in transformers v4.49 output_dir is no longer required Signed-off-by: Anh Uong <anh.uong@ibm.com> * revert data loading back Signed-off-by: Anh Uong <anh.uong@ibm.com> * fix:Support loading for Granite-3.2 Vision Model * Changes to support vlms Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Change in kwargs Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Restructure of VisionDataCollator Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Usage of 2 handlers and modifying chat_template handler Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * fix fmt+lint Signed-off-by: 
Abhishek <maurya.abhishek@ibm.com> * Minor Fix for unit test case Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Minor error handling Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Fix issues for granite vision preview model Signed-off-by: Abhishek <maurya.abhishek@ibm.com> --------- Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * remove duplicate logger, fmt Signed-off-by: Anh Uong <anh.uong@ibm.com> * fix unbound var, refactor tokenizer Signed-off-by: Anh Uong <anh.uong@ibm.com> * changes from review comments Signed-off-by: Anh Uong <anh.uong@ibm.com> * fix embedding resize and errors Signed-off-by: Anh Uong <anh.uong@ibm.com> * add hack fix for vocab size for Mllama models Signed-off-by: Anh Uong <anh.uong@ibm.com> * add docs on vision model usage Signed-off-by: Anh Uong <anh.uong@ibm.com> * move llama vocab size, allow single image inputs Signed-off-by: Anh Uong <anh.uong@ibm.com> * linter fixes Signed-off-by: Anh Uong <anh.uong@ibm.com> * fix merge, add lora note Signed-off-by: Anh Uong <anh.uong@ibm.com> * docs: organize sections Signed-off-by: Anh Uong <anh.uong@ibm.com> * remove all dataset columns Signed-off-by: Anh Uong <anh.uong@ibm.com> * only take single image for granite models Signed-off-by: Anh Uong <anh.uong@ibm.com> * feat:Support Entire Vision dataset with Streaming (#6) * Changes to support vlms Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Change in kwargs Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Restructure of VisionDataCollator Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Usage of 2 handlers and modifying chat_template handler Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * fix fmt+lint Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Minor Fix for unit test case Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Minor error handling Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Fix issues for granite vision preview model Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Transformers 
version for running Llama model successfully Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Changes when enabling streaming * Merge remote-tracking branch 'anh_vision_fms_hf_tuning/vision-model' into vision_support * Merge with main Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * modify apply_tokenizer_chat_template argument key Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * resolve features for iterable dataset Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Add applying processor in collator and PR changes Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Rename Handler Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Add config for dataset streaming via arguments Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Fix column removal Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Convert to RGB for LlavaProcessor and model LlavaForConditionalGeneration Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * PR CHANGES 1 Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * PR Changes 2 Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Collator documentation Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Minor fix Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Resize input and output embeddings seperately for LLama vision model Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * PR changes Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Documentation added Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Added processor to DataPreProcessor Signed-off-by: Abhishek <maurya.abhishek@ibm.com> --------- Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * PR change of adding vocab size Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Added llama vision model and unit test case Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Make Jinja template work Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Fix for preprocessor_config in checkpoint folder Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * fmt fix Signed-off-by: 
Abhishek <maurya.abhishek@ibm.com> * Moving resizing out of if block Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Test case fix and merging with main Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * PR Change 1 Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * PR Change 2 Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Added test_vision_data_collator Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * PR Changes Signed-off-by: Abhishek <maurya.abhishek@ibm.com> * Comment change Signed-off-by: Abhishek <maurya.abhishek@ibm.com> --------- Signed-off-by: Anh Uong <anh.uong@ibm.com> Signed-off-by: Abhishek <maurya.abhishek@ibm.com> Co-authored-by: Abhishek Maurya <124327945+Abhishek-TAMU@users.noreply.github.com> Co-authored-by: Abhishek <maurya.abhishek@ibm.com>
1 parent 2d47acf commit fa070a8

26 files changed

Lines changed: 1254027 additions & 52 deletions

.pylintrc

Lines changed: 2 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -281,7 +281,7 @@ ignored-parents=
281281
max-args=5
282282

283283
# Maximum number of attributes for a class (custom).
284-
max-attributes=10
284+
max-attributes=15
285285

286286
# Maximum number of boolean expressions in an if statement (see R0916).
287287
max-bool-expr=5
@@ -299,7 +299,7 @@ max-parents=7
299299
max-public-methods=20
300300

301301
# Maximum number of return / yield for function / method body.
302-
max-returns=6
302+
max-returns=10
303303

304304
# Maximum number of statements in function / method body.
305305
max-statements=50

README.md

Lines changed: 31 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,7 @@
1313
- [Fine Tuning](#fine-tuning)
1414
- [FMS Acceleration](#fms-acceleration)
1515
- [Extended Pre-Training](#extended-pre-training)
16+
- [Tuning Vision Language Models](#tuning-vision-language-models)
1617
- [Inference](#inference)
1718
- [Running a single example](#running-a-single-example)
1819
- [Running multiple examples](#running-multiple-examples)
@@ -39,15 +40,15 @@ pip install fms-hf-tuning
3940
### Using FlashAttention
4041

4142
> Note: After installing, if you wish to use [FlashAttention](https://github.com/Dao-AILab/flash-attention), then you need to install these requirements:
42-
```
43+
```sh
4344
pip install fms-hf-tuning[dev]
4445
pip install fms-hf-tuning[flash-attn]
4546
```
4647
[FlashAttention](https://github.com/Dao-AILab/flash-attention) requires the [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit) to be pre-installed.
4748

4849
*Debug recommendation:* While training, if you encounter flash-attn errors such as `undefined symbol`, you can follow the below steps for clean installation of flash binaries. This may occur when having multiple environments sharing the pip cache directory or torch version is updated.
4950

50-
```
51+
```sh
5152
pip uninstall flash-attn
5253
pip cache purge
5354
pip install fms-hf-tuning[flash-attn]
@@ -898,6 +899,34 @@ The `fms_acceleration.cli` can do more to search for all available configs, plug
898899

899900
We also support extended pre-training, where users may want to pretrain a model on a large number of samples. Please refer to our separate doc on [EPT Use Cases](./docs/ept.md).
900901

902+
## Tuning Vision Language Models
903+
904+
We also support full fine-tuning and LoRA tuning for vision language models - `Granite 3.2 Vision`, `Llama 3.2 Vision`, and `LLaVa-Next`.
905+
For information on supported dataset formats and how to tune a vision-language model, please see [this document](./docs/vision-language-model-tuning.md).
906+
907+
### Supported vision models
908+
909+
- Legend:
910+
911+
✅ Ready and available
912+
913+
✔️ Ready and available - compatible architecture
914+
915+
🚫 Not supported
916+
917+
? May be supported, but not tested
918+
919+
Model Name & Size | Model Architecture | Full Finetuning |
920+
-------------------- | ---------------- | --------------- |
921+
Llama 3.2-11B Vision | MllamaForConditionalGeneration | ✅* |
922+
Llava 1.5-7B | LlavaForConditionalGeneration | ✅* |
923+
Granite 3.1-2B Vision | LlavaNextForConditionalGeneration | ✅* |
924+
Llava Mistral 1.6-7B | LlavaNextForConditionalGeneration | ✅* |
925+
926+
(*) - Supported with `fms-hf-tuning` v2.8.0 or later.
927+
928+
**Note**: vLLM currently does not support inference with LoRA-tuned vision models. To use a tuned LoRA adapter for a vision model, merge it with the base model before running vLLM inference.
929+
901930
## Inference
902931
Currently, we do *not* offer inference support as part of the library, but we provide a standalone script for running inference on tuned models for testing purposes. For a full list of options run `python scripts/run_inference.py --help`. Note that no data formatting / templating is applied at inference time.
903932

Lines changed: 174 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,174 @@
1+
# Tuning Vision Language Models
2+
Our library also supports full fine tuning and LoRA tuning for vision language models.
3+
4+
## Supported Dataset Format
5+
We support tuning an `image+text` dataset that includes:
6+
- A single text field, formatted using the model’s chat template.
7+
- A single image field, which can contain either a list of images or a single image.
8+
9+
The text must follow the OpenAI conversational data format, which is defined as a list of message objects. Each message object must have two required fields: `role` and `content`:
10+
- `role`: The speaker (e.g., "user" or "assistant").
11+
- `content`: A list of dictionaries, each specifying:
12+
- `type`: `text` or `image`.
13+
- `text`: The text content (if applicable).
14+
15+
Example Format:
16+
```json
17+
[
18+
{
19+
"role": "user",
20+
"content": [
21+
{"type": "text", "text": "Who is this?"},
22+
{"type": "image"}
23+
]
24+
},
25+
{
26+
"role": "assistant",
27+
"content": [
28+
{"type": "text", "text": "Barack Obama"}
29+
]
30+
},
31+
{
32+
"role": "user",
33+
"content": [
34+
{"type": "text", "text": "What is he famous for?"}
35+
]
36+
},
37+
{
38+
"role": "assistant",
39+
"content": [
40+
{"type": "text", "text": "He is the 44th President of the United States."}
41+
]
42+
}
43+
]
44+
```
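
As a rough illustration of the format above, the following sketch checks that a conversation has the required `role` and `content` fields and counts its image placeholders. The helper `validate_messages` and the `sample` record are hypothetical and are not part of `fms-hf-tuning`; the column names `messages` and `images` are only assumed to match the example configuration.

```python
from typing import Any

# Content types allowed by the format described above.
ALLOWED_TYPES = {"text", "image"}


def validate_messages(messages: list[dict[str, Any]]) -> int:
    """Check the conversational structure and return the number of image placeholders.

    Hypothetical helper for illustration; not part of fms-hf-tuning.
    """
    image_count = 0
    for i, message in enumerate(messages):
        if "role" not in message or "content" not in message:
            raise ValueError(f"message {i} needs both 'role' and 'content'")
        for part in message["content"]:
            kind = part.get("type")
            if kind not in ALLOWED_TYPES:
                raise ValueError(f"message {i}: unsupported content type {kind!r}")
            if kind == "text" and "text" not in part:
                raise ValueError(f"message {i}: text entry without a 'text' value")
            if kind == "image":
                image_count += 1
    return image_count


# One dataset sample: a text column and an image column (image object stubbed out here).
sample = {
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": "Who is this?"}, {"type": "image"}]},
        {"role": "assistant", "content": [{"type": "text", "text": "Barack Obama"}]},
    ],
    "images": ["<image object goes here>"],
}

num_placeholders = validate_messages(sample["messages"])
```

A check like this can be useful before training to confirm that the number of `image` placeholders in the text lines up with the number of images in the sample.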
45+
46+
## Processing of dataset
47+
48+
First, each dataset sample is processed by applying the [chat template](https://huggingface.co/docs/transformers/main/en/chat_templating) to the raw text, which formats the conversation as required. Then, the model’s [`processor`](https://huggingface.co/docs/transformers/main/en/processors) takes the formatted text and the corresponding image(s) and converts them into the final input representation (e.g., input_ids, attention masks, etc.) that the model uses for training.
49+
50+
**Note**: `Granite 3.2` and `Llava-1.5` Vision models expect a single image for each dataset sample. If a list of images is provided, only the first image will be used.
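
The two processing steps described above can be sketched roughly as follows. This is a simplified illustration, not the collator shipped with the library: the function name `collate_vision_batch` and the `single_image_only` flag are hypothetical, and whether a processor expects a flat or nested list of images depends on the model, so the batching shown here is an assumption. Label creation for training is left out.

```python
from typing import Any


def collate_vision_batch(
    samples: list[dict[str, Any]],
    processor: Any,
    text_field: str = "messages",
    image_field: str = "images",
    single_image_only: bool = False,
) -> Any:
    """Hypothetical sketch: chat template first, then the processor."""
    texts, images = [], []
    for sample in samples:
        # Step 1: render the conversation as a string using the chat template.
        texts.append(processor.apply_chat_template(sample[text_field], tokenize=False))

        # The image field may contain a single image or a list of images.
        sample_images = sample[image_field]
        if not isinstance(sample_images, (list, tuple)):
            sample_images = [sample_images]

        if single_image_only:
            # Models that expect one image per sample: keep only the first image.
            images.append(sample_images[0])
        else:
            # Assumption: multi-image processors take one list of images per sample.
            images.append(list(sample_images))

    # Step 2: the processor turns text and images into model inputs
    # (input_ids, attention masks, pixel values, ...).
    return processor(text=texts, images=images, return_tensors="pt", padding=True)
```

The `processor` argument is expected to be a Hugging Face processor for the model being tuned, for example one loaded with `AutoProcessor.from_pretrained`.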
51+
52+
## Tuning configurations
53+
54+
Two parameters must be passed to specify which dataset columns to use:
55+
- `dataset_text_field`: The column name that contains the conversational text.
56+
- `dataset_image_field`: The column name that contains the images.
57+
58+
Below is a sample configuration file:
59+
```json
60+
{
61+
"model_name_or_path": "ibm-granite/granite-vision-3.2-2b",
62+
"training_data_path": "HuggingFaceH4/llava-instruct-mix-vsft",
63+
"dataset_text_field": "messages",
64+
"dataset_image_field": "images",
65+
"output_dir": "/app/test",
66+
"num_train_epochs": 1.0,
67+
"per_device_train_batch_size": 8,
68+
"gradient_accumulation_steps": 2,
69+
"learning_rate": 1e-4,
70+
"bf16": true,
71+
"torch_dtype": "bfloat16",
72+
"use_flash_attn": true,
73+
"remove_unused_columns": false,
74+
"dataset_kwargs": {"skip_prepare_dataset": true},
75+
"gradient_checkpointing": true,
76+
"gradient_checkpointing_kwargs": {"use_reentrant": false},
77+
"accelerate_launch_args": { "fsdp_transformer_layer_cls_to_wrap": "GraniteDecoderLayer"}
78+
}
79+
```
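
To make the arithmetic in the sample configuration concrete, here is a small sketch that checks for the two dataset column parameters and computes the effective batch size. The helper names are hypothetical and the dictionary only repeats a subset of the values shown above.

```python
# Subset of the sample configuration above.
config = {
    "dataset_text_field": "messages",
    "dataset_image_field": "images",
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,
}

# Both column parameters are needed for vision-language tuning.
REQUIRED_FIELDS = ("dataset_text_field", "dataset_image_field")


def missing_fields(cfg: dict) -> list[str]:
    """Return the required dataset column parameters that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not cfg.get(name)]


def effective_batch_size(cfg: dict, num_processes: int = 1) -> int:
    """Samples consumed per optimizer step across all processes."""
    return (
        cfg["per_device_train_batch_size"]
        * cfg.get("gradient_accumulation_steps", 1)
        * num_processes
    )


single_gpu = effective_batch_size(config)                  # 8 * 2 * 1 = 16
four_gpus = effective_batch_size(config, num_processes=4)  # 8 * 2 * 4 = 64
```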
80+
81+
## Running the Trainer
82+
83+
You can also run training by calling our trainer module directly from the command line. Use `python` for a single GPU or the `accelerate launch` command for multiple GPUs.
84+
For example:
85+
86+
Command for single GPU:
87+
88+
```sh
89+
python tuning/sft_trainer.py \
90+
--model_name_or_path $MODEL_PATH \
91+
--training_data_path $TRAIN_DATA_PATH \
92+
--output_dir $OUTPUT_PATH \
93+
--num_train_epochs 5 \
94+
--per_device_train_batch_size 4 \
95+
--gradient_accumulation_steps 1 \
96+
--learning_rate 1e-5 \
97+
--dataset_text_field "messages" \
98+
--dataset_image_field "images"
99+
```
100+
101+
Command for multi GPU:
102+
103+
```sh
104+
accelerate launch \
105+
--num_processes=$NUM_PROCESSORS \
106+
--config_file fixtures/accelerate_fsdp_defaults.yaml \
107+
tuning/sft_trainer.py \
108+
--model_name_or_path $MODEL_PATH \
109+
--training_data_path $TRAIN_DATA_PATH \
110+
--output_dir $OUTPUT_PATH \
111+
--num_train_epochs 5 \
112+
--per_device_train_batch_size 4 \
113+
--gradient_accumulation_steps 1 \
114+
--learning_rate 1e-5 \
115+
--dataset_text_field "messages" \
116+
--dataset_image_field "images"
117+
```
118+
119+
## Tuning Considerations for vision models
120+
121+
Flash Attention 2.0 is not supported by `MllamaForConditionalGeneration` models. When tuning `Llama 3.2 Vision` models, set:
122+
123+
```json
124+
"use_flash_attn": false
125+
```
126+
### Multi-GPU Tuning with FSDP:
127+
128+
When running `multi-GPU` tuning with `FSDP`, you need to wrap specific transformer layers. Use the following setting in FSDP config based on your model:
129+
130+
Granite 3.2 Vision Models:
131+
```json
132+
"accelerate_launch_args": { "fsdp_transformer_layer_cls_to_wrap": "GraniteDecoderLayer" }
133+
```
134+
135+
Llava-Next and Llava-1.5 Models:
136+
```json
137+
"accelerate_launch_args": { "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer" }
138+
```
139+
140+
Llava-1.6-Mistral Model:
141+
```json
142+
"accelerate_launch_args": { "fsdp_transformer_layer_cls_to_wrap": "MistralDecoderLayer" }
143+
```
144+
145+
Llama 3.2 Vision Models: No additional configuration is required.
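
The per-model settings listed above can be kept in one lookup, sketched here. The dictionary keys are informal labels chosen for this example, not identifiers used by the library; the layer class names are the ones listed above.

```python
from typing import Optional

# Model family (informal label) -> transformer layer class to wrap with FSDP.
# None means no additional configuration is required.
FSDP_LAYER_TO_WRAP: dict[str, Optional[str]] = {
    "granite-3.2-vision": "GraniteDecoderLayer",
    "llava-next": "LlamaDecoderLayer",
    "llava-1.5": "LlamaDecoderLayer",
    "llava-1.6-mistral": "MistralDecoderLayer",
    "llama-3.2-vision": None,
}


def accelerate_launch_args(model_family: str) -> dict[str, str]:
    """Build the `accelerate_launch_args` entry for a model family (hypothetical helper)."""
    layer = FSDP_LAYER_TO_WRAP[model_family]
    if layer is None:
        return {}
    return {"fsdp_transformer_layer_cls_to_wrap": layer}


granite_args = accelerate_launch_args("granite-3.2-vision")
mllama_args = accelerate_launch_args("llama-3.2-vision")
```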
146+
147+
### Gradient Checkpointing:
148+
149+
We recommend running with argument `gradient_checkpointing=True` as enabling this will greatly reduce the memory needed to load and run the model.
150+
151+
When running with gradient checkpointing for the `Llava` and `Granite` vision models, you must also set `gradient_checkpointing_kwargs` so that the activation checkpointing variant that requires reentrant autograd is not used.
152+
153+
```json
154+
"gradient_checkpointing_kwargs": {"use_reentrant": false}
155+
```
156+
157+
Without this setting, tuning fails with errors such as:
158+
159+
```sh
160+
RuntimeError: mat2 must be a matrix, got 1-D tensor
161+
RuntimeError: Expected weight to be of same shape as normalized_shape, but got weight of shape [0] and normalized_shape = [1152]
162+
```
163+
164+
### Other arguments:
165+
166+
To prevent default text-only processing and ensure proper handling of multimodal data, we recommend setting:
167+
168+
```json
169+
"remove_unused_columns": false,
170+
"dataset_kwargs": {"skip_prepare_dataset": true}
171+
```
172+
173+
When performing LoRA tuning on vision models, you must specify the `target_modules` explicitly, as no defaults are provided.
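
When choosing `target_modules`, it can help to preview which modules a list of names would select. PEFT matches each listed name against a module's full name either exactly or as a dot-separated suffix. The sketch below mimics that rule on plain strings; the module names are invented for illustration, and for a real model they would come from `model.named_modules()`.

```python
def matching_modules(module_names: list[str], target_modules: list[str]) -> list[str]:
    """Return the module names selected by a list of target names.

    A module matches if its full name equals a target or ends with "." + target.
    """
    return [
        name
        for name in module_names
        if any(name == target or name.endswith("." + target) for target in target_modules)
    ]


# Invented module names, loosely shaped like a vision-language model.
example_names = [
    "language_model.model.layers.0.self_attn.q_proj",
    "language_model.model.layers.0.self_attn.v_proj",
    "vision_tower.encoder.layers.0.self_attn.q_proj",
    "multi_modal_projector.linear_1",
]

selected = matching_modules(example_names, ["q_proj", "v_proj"])
```

Note that a short name such as `q_proj` selects matching layers in both the language model and the vision tower of this invented example, so it is worth checking the selection before training.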
174+

fixtures/accelerate_fsdp_defaults.yaml

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -41,6 +41,12 @@ fsdp_config:
4141
# not needed for HF models that have . _no_split_modules
4242
# the example below is for GPTBigCode
4343
# fsdp_transformer_layer_cls_to_wrap: "GPTBigCodeBlock"
44+
# needed for llava-1.5-vision + llava-next-vision models
45+
# fsdp_transformer_layer_cls_to_wrap: "LlamaDecoderLayer"
46+
# needed for llava-1.6-mistral-vision model
47+
# fsdp_transformer_layer_cls_to_wrap: "MistralDecoderLayer"
48+
# needed for granite-3.2-vision model
49+
# fsdp_transformer_layer_cls_to_wrap: "GraniteDecoderLayer"
4450

4551
# for "autocast" mixed precision training, where the weights of the model are kept at higher precision, but the
4652
# learning products (e.g., gradients, model parameters) are kept at a lower precision. Default is 'no'. Other options

pyproject.toml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -39,6 +39,7 @@ dependencies = [
3939
"protobuf>=5.28.0,<6.0.0",
4040
"datasets>=2.15.0,<4.0",
4141
"simpleeval>=0.9.13,<2.0",
42+
"pillow>=11.0.0,<12.0",
4243
]
4344

4445
[project.optional-dependencies]
Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
{
2+
"chat_template": "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- if strftime_now is defined %}\n {%- set date_string = strftime_now(\"%d %b %Y\") %}\n {%- else %}\n {%- set date_string = \"26 Jul 2024\" %}\n {%- endif %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- set user_supplied_system_message = true %}\n{%- else %}\n {%- set system_message = \"\" %}\n {%- set user_supplied_system_message = false %}\n{%- endif %}\n\n{#- Find out if there are any images #}\n{% set image_ns = namespace(has_images=false) %} \n{%- for message in messages %}\n {%- for content in message['content'] %}\n {%- if content['type'] == 'image' %}\n {%- set image_ns.has_images = true %}\n {%- endif %}\n {%- endfor %}\n{%- endfor %}\n\n{#- System message if there are no images, or if the user supplied one #}\n{%- if user_supplied_system_message or not image_ns.has_images %}\n {{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n {%- if tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n {%- endif %}\n {{- \"Cutting Knowledge Date: December 2023\\n\" }}\n {{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n {%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {%- endif %}\n {{- system_message }}\n {{- \"<|eot_id|>\" }}\n{%- endif %}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n' }}\n {%- if message['content'] is string %}\n {{- message['content'] }}\n {%- else %}\n {%- for content in message['content'] %}\n {%- if content['type'] == 'image' %}\n {{- '<|image|>' }}\n {%- elif content['type'] == 'text' %}\n {{- content['text'] }}\n {%- endif %}\n {%- endfor %}\n {%- endif %}\n {{- '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {{- \"<|eot_id|>\" }}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n"
3+
}
Lines changed: 44 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,44 @@
1+
{
2+
"architectures": [
3+
"MllamaForConditionalGeneration"
4+
],
5+
"image_token_index": 128256,
6+
"model_type": "mllama",
7+
"text_config": {
8+
"cross_attention_layers": [
9+
1
10+
],
11+
"eos_token_id": [
12+
128001,
13+
128008,
14+
128009
15+
],
16+
"hidden_size": 128,
17+
"intermediate_size": 768,
18+
"max_position_embeddings": 1024,
19+
"model_type": "mllama_text_model",
20+
"num_attention_heads": 4,
21+
"num_hidden_layers": 2,
22+
"rope_scaling": {
23+
"factor": 8.0,
24+
"high_freq_factor": 4.0,
25+
"low_freq_factor": 1.0,
26+
"original_max_position_embeddings": 512,
27+
"rope_type": "llama3"
28+
},
29+
"torch_dtype": "float16"
30+
},
31+
"torch_dtype": "float16",
32+
"transformers_version": "4.49.0",
33+
"vision_config": {
34+
"attention_heads": 4,
35+
"hidden_size": 128,
36+
"image_size": 224,
37+
"intermediate_size": 512,
38+
"model_type": "mllama_vision_model",
39+
"num_global_layers": 1,
40+
"num_hidden_layers": 2,
41+
"torch_dtype": "float16",
42+
"vision_output_dim": 256
43+
}
44+
}
Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,11 @@
1+
{
2+
"_from_model_config": true,
3+
"bos_token_id": 128000,
4+
"eos_token_id": [
5+
128001,
6+
128008,
7+
128009
8+
],
9+
"pad_token_id": 128004,
10+
"transformers_version": "4.49.0"
11+
}
67.9 MB
Binary file not shown.
Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,26 @@
1+
{
2+
"do_convert_rgb": true,
3+
"do_normalize": true,
4+
"do_pad": true,
5+
"do_rescale": true,
6+
"do_resize": true,
7+
"image_mean": [
8+
0.48145466,
9+
0.4578275,
10+
0.40821073
11+
],
12+
"image_processor_type": "MllamaImageProcessor",
13+
"image_std": [
14+
0.26862954,
15+
0.26130258,
16+
0.27577711
17+
],
18+
"max_image_tiles": 4,
19+
"processor_class": "MllamaProcessor",
20+
"resample": 2,
21+
"rescale_factor": 0.00392156862745098,
22+
"size": {
23+
"height": 560,
24+
"width": 560
25+
}
26+
}
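
As a rough guide to what the `do_rescale` and `do_normalize` settings above mean numerically, the sketch below applies them to a single RGB pixel. It covers only these two steps; resizing, padding, and tiling are not shown, and the function name `normalize_pixel` is hypothetical.

```python
# Values copied from the preprocessor configuration above.
IMAGE_MEAN = (0.48145466, 0.4578275, 0.40821073)
IMAGE_STD = (0.26862954, 0.26130258, 0.27577711)
RESCALE_FACTOR = 0.00392156862745098  # approximately 1 / 255


def normalize_pixel(rgb: tuple[int, int, int]) -> tuple[float, ...]:
    """Rescale 0-255 channel values to 0-1, then normalize per channel."""
    return tuple(
        (channel * RESCALE_FACTOR - mean) / std
        for channel, mean, std in zip(rgb, IMAGE_MEAN, IMAGE_STD)
    )


black = normalize_pixel((0, 0, 0))        # each channel becomes -mean / std
white = normalize_pixel((255, 255, 255))  # each channel becomes about (1 - mean) / std
```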
