feat: support loading vision model (#451)
Conversation
Signed-off-by: Anh Uong <anh.uong@ibm.com>
Thanks for making a pull request! 😃
anhuong
left a comment
Tested this code loading the Llama 3.2-11B vision model as well as the LLaVA 1.6-Mistral-7B vision model; both were LoRA tuned successfully with the dataset https://huggingface.co/datasets/HuggingFaceH4/llava-instruct-mix-vsft.
Note that when loading LLaVA models with FSDP, you need to provide the extra field fsdp_transformer_layer_cls_to_wrap: "LlamaDecoderLayer" for llava 1.5 and fsdp_transformer_layer_cls_to_wrap: "MistralDecoderLayer" for llava 1.6-mistral.
Ran with configuration:
{
"model_name_or_path": "llava-hf/llava-v1.6-mistral-7b-hf",
"training_data_path": "HuggingFaceH4/llava-instruct-mix-vsft",
"output_dir": "/fmaas-integration-tests/tuning/output/anhuong/llava1.6-mistral-7b-vision_llava-dataset_lora",
"num_train_epochs": 3.0,
"per_device_train_batch_size": 4,
"gradient_accumulation_steps": 1,
"learning_rate": 1e-4,
"response_template": "\n### Response:", <--- FIX: this field is not used
"dataset_text_field": "output", <--- FIX: this field is not used
"bf16": true,
"torch_dtype": "bfloat16",
"use_flash_attn": false,
"remove_unused_columns": false,
"dataset_kwargs": {"skip_prepare_dataset": true},
"multimodal": true,
"peft_method": "lora",
"r": 8,
"lora_dropout": 0.05,
"lora_alpha": 16,
"target_modules": ["all-linear"],
"lora_post_process_for_vllm": true,
"gradient_checkpointing": true,
"text_field_name": "messages",
"image_field_name": "images"
}
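For the FSDP note earlier in this thread, a minimal sketch of how the extra wrap field could be added to a config like the one above (illustrative only; the config dict here is a hypothetical subset, and which keys the tuning stack actually accepts depends on its FSDP plumbing):

```python
import json

# Hypothetical minimal config; the point is only the extra
# fsdp_transformer_layer_cls_to_wrap field needed for llava 1.6-mistral.
config = {
    "model_name_or_path": "llava-hf/llava-v1.6-mistral-7b-hf",
    "peft_method": "lora",
}
config["fsdp_transformer_layer_cls_to_wrap"] = "MistralDecoderLayer"
print(json.dumps(config, indent=2))
```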
* Changes to support vlms
* Change in kwargs
* Restructure of VisionDataCollator
* Usage of 2 handlers and modifying chat_template handler
* fix fmt+lint
* Minor fix for unit test case
* Minor error handling

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
- in transformers v4.49, output_dir is no longer required
@dushyantbehl PR for a fix: anhuong#5
* Changes to support vlms
* Change in kwargs
* Restructure of VisionDataCollator
* Usage of 2 handlers and modifying chat_template handler
* fix fmt+lint
* Minor fix for unit test case
* Minor error handling
* Fix issues for granite vision preview model

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
(Review comment on the embedding-resize code: `embedding_size - current_output_embeddings.weight.shape[0]` … `# Save current input embedding`)
A unit test for this function would be great.
I can definitely add a test case, something like the one below (it passes successfully). But I was wondering: do we use bigger models for tests? The Llama vision model is an 11B model (meta-llama/Llama-3.2-11B-Vision-Instruct), and downloading and loading it on every unit-test run would take a lot of time just for this one test case. Any thoughts?
import copy

from transformers import AutoModelForVision2Seq, AutoProcessor

# LLAMA_VISION_MODEL_NAME and tokenizer_and_embedding_resize are assumed to
# be imported from this repo's test constants and tuning utilities.

def test_resize_llama_vision_model():
    model = AutoModelForVision2Seq.from_pretrained(LLAMA_VISION_MODEL_NAME)
    processor = AutoProcessor.from_pretrained(LLAMA_VISION_MODEL_NAME)
    tokenizer = processor.tokenizer
    current_input_embeddings = copy.deepcopy(model.get_input_embeddings())
    current_output_embeddings = copy.deepcopy(model.get_output_embeddings())
    current_tokenizer_len = len(tokenizer.get_vocab())
    resize_result = tokenizer_and_embedding_resize(
        special_tokens_dict={"unk_token": "<unk>"},
        tokenizer=tokenizer,
        model=model,
        multiple_of=1,
    )
    resized_input_embeddings = model.get_input_embeddings()
    resized_output_embeddings = model.get_output_embeddings()
    resized_tokenizer_len = len(tokenizer.get_vocab())
    assert resized_tokenizer_len == current_tokenizer_len + 1
    assert "<unk>" in tokenizer.get_vocab()
    assert resize_result["num_new_tokens"] == 1
    # 2 new embedding rows are added: one for <unk> plus one for the
    # pre-existing <image> token not yet covered by the embeddings
    assert resized_output_embeddings.weight.shape[0] == current_output_embeddings.weight.shape[0] + 2
    assert resized_input_embeddings.weight.shape[0] == current_input_embeddings.weight.shape[0] + 2
No, we will have to find a dummy vision model for unit tests, or we'll have to create one ourselves with dummy files and a trimmed-down vocab and embeddings.
The unit test otherwise looks good to me. I didn't understand why 2 tokens are added though, if we add one <unk> ("# 2 new tokens were added: <unk> and <image>"). Does a new image token get added for every language token?
Let us know in a comment once you assess the feasibility of the unit test. We should ideally have unit tests for some vision model.
> The unit test otherwise looks good to me. I didn't understand why 2 tokens are added though, if we add one <unk> ("# 2 new tokens were added: <unk> and <image>"). Does a new image token get added for every language token?
The <image> token is present in the Llama vision tokenizer, and the token count is len(tokenizer) = 128257.
But model.get_output_embeddings().weight.shape is torch.Size([128256, 4096]).
Hence, when special_tokens_dict just has <unk>, the calculation num_new_tokens = num_new_tokens + embedding_size - len(tokenizer) makes the embedding size 128258, so the model resizes from 128256 to 128258 (an increase of 2), as it also takes the additional <image> token into account.
And hence I resize and increase the model input embeddings by 2 as well.
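The arithmetic above can be checked with plain numbers (a sketch using the figures quoted in this thread; the variable names are mine, not the library's):

```python
# Figures quoted above for meta-llama/Llama-3.2-11B-Vision-Instruct.
len_tokenizer_before = 128257    # tokenizer already contains <image>
output_embedding_rows = 128256   # embeddings lag the tokenizer by 1

# Adding <unk> via special_tokens_dict grows the tokenizer by 1.
len_tokenizer_after = len_tokenizer_before + 1

# Embeddings must be resized to cover every tokenizer id, so the growth
# is 2: one row for <unk> plus one for the pre-existing <image> gap.
num_new_rows = len_tokenizer_after - output_embedding_rows
print(len_tokenizer_after, num_new_rows)  # → 128258 2
```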
So basically, as mentioned above, for the Llama vision model from Hugging Face (before any resizing):
len(tokenizer) == model.get_output_embeddings().weight.shape[0] + 1
whereas for the Granite and Llava vision models (before any resizing):
len(tokenizer) == model.get_output_embeddings().weight.shape[0]
Regarding the tiny vision model, I have saved a tiny Llama vision model, which you can see in this commit along with the passing unit test.
To save the tiny Llama vision model, the same config as the original Llama vision model was used, with parameters such as hidden_size, num_hidden_layers, intermediate_size, and attention_heads reduced.
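As a sketch of that reduction (the tiny values below are hypothetical picks, not the committed ones; only the idea of reusing the original config's fields with shrunk dimensions comes from the comment above, and of the full-size values only hidden_size = 4096 is grounded in the shapes quoted earlier in this thread):

```python
# Full-model dimensions (hidden_size matches torch.Size([128256, 4096])
# quoted above; the other full-size values are illustrative).
full_config = {
    "hidden_size": 4096,
    "num_hidden_layers": 32,
    "intermediate_size": 14336,
    "num_attention_heads": 32,
}
tiny_config = {
    **full_config,           # keep the same set of fields
    "hidden_size": 64,
    "num_hidden_layers": 2,
    "intermediate_size": 128,
    "num_attention_heads": 4,
}
print(tiny_config["num_hidden_layers"])  # → 2
```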
dushyantbehl
left a comment
A few comments, and a request to go through the merge again due to inconsistency in the code after rebasing.
dushyantbehl
left a comment
Thanks @Abhishek-TAMU, just a last couple of minor changes requested; the rest looks good to me.
Please check the DCO before we can merge.
dushyantbehl
left a comment
Thanks a lot @Abhishek-TAMU for diligently fixing all review comments.
LGTM.
* install trl=0.13, deepspeed, update transformers
* deps: install pillow, uninstall deepspeed
* add multimodal flag, pass processor, add data collator
* load dataset directly, pass processor, fix field
* add generic data collator
* remove load_dataset since HF support added
* add fsdp config needed for llava models
* feat: Use of data handlers for Vision LM support (#4)
* replace text_field_name for dataset_text_field and for image
* remove multimodal flag
* fix formatting, remove unused fields
* remove irrelevant unit test (in transformers v4.49 output_dir is no longer required)
* revert data loading back
* fix: Support loading for Granite-3.2 Vision Model
* remove duplicate logger, fmt
* fix unbound var, refactor tokenizer
* changes from review comments
* fix embedding resize and errors
* add hack fix for vocab size for Mllama models
* add docs on vision model usage
* move llama vocab size, allow single image inputs
* linter fixes
* fix merge, add lora note
* docs: organize sections
* remove all dataset columns
* only take single image for granite models
* feat: Support Entire Vision dataset with Streaming (#6)
  * Transformers version for running Llama model successfully
  * Changes when enabling streaming
  * Merge remote-tracking branch 'anh_vision_fms_hf_tuning/vision-model' into vision_support
  * Merge with main
  * modify apply_tokenizer_chat_template argument key
  * resolve features for iterable dataset
  * Add applying processor in collator and PR changes
  * Rename Handler
  * Add config for dataset streaming via arguments
  * Fix column removal
  * Convert to RGB for LlavaProcessor and model LlavaForConditionalGeneration
  * Collator documentation
  * Resize input and output embeddings separately for Llama vision model
  * Documentation added
  * Added processor to DataPreProcessor
* PR change of adding vocab size
* Added llama vision model and unit test case
* Make Jinja template work
* Fix for preprocessor_config in checkpoint folder
* fmt fix
* Moving resizing out of if block
* Test case fix and merging with main
* Added test_vision_data_collator
* Comment change

Signed-off-by: Anh Uong <anh.uong@ibm.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Co-authored-by: Abhishek Maurya <124327945+Abhishek-TAMU@users.noreply.github.com>
Co-authored-by: Abhishek <maurya.abhishek@ibm.com>