
feat: support loading vision model #451

Merged
willmj merged 54 commits into foundation-model-stack:main from anhuong:vision-model
Apr 16, 2025

Conversation

Collaborator

@anhuong anhuong commented Jan 26, 2025

Description of the change

Related issue number

How to verify the PR

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

@github-actions

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@github-actions bot added the feat label Jan 26, 2025
Collaborator Author

@anhuong anhuong left a comment


Tested this code by loading the Llama 3.2-11B vision model as well as the LLaVA 1.6-Mistral-7B vision model; both were LoRA-tuned successfully with the dataset https://huggingface.co/datasets/HuggingFaceH4/llava-instruct-mix-vsft

Note that when loading a LLaVA model with FSDP, you need to provide the extra field fsdp_transformer_layer_cls_to_wrap: "LlamaDecoderLayer" for LLaVA 1.5 and fsdp_transformer_layer_cls_to_wrap: "MistralDecoderLayer" for LLaVA 1.6-Mistral.
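For reference, a minimal sketch of how that extra field could sit in the same flat JSON config format used below (the surrounding FSDP launch settings are omitted, and the exact placement may depend on your launch setup):

```json
{
  "model_name_or_path": "llava-hf/llava-v1.6-mistral-7b-hf",
  "fsdp_transformer_layer_cls_to_wrap": "MistralDecoderLayer"
}
```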

Ran with configuration:

{
  "model_name_or_path": "llava-hf/llava-v1.6-mistral-7b-hf", 
  "training_data_path": "HuggingFaceH4/llava-instruct-mix-vsft",
  "output_dir": "/fmaas-integration-tests/tuning/output/anhuong/llava1.6-mistral-7b-vision_llava-dataset_lora",
  "num_train_epochs": 3.0,
  "per_device_train_batch_size": 4,
  "gradient_accumulation_steps": 1,
  "learning_rate": 1e-4,
  "response_template": "\n### Response:",       <--- FIX: this field is not used
  "dataset_text_field": "output",       <--- FIX: this field is not used
  "bf16": true,
  "torch_dtype": "bfloat16",
  "use_flash_attn": false,
  "remove_unused_columns": false,
  "dataset_kwargs": {"skip_prepare_dataset": true},
  "multimodal": true,
  "peft_method": "lora",
  "r": 8,
  "lora_dropout": 0.05,
  "lora_alpha": 16,
  "target_modules": ["all-linear"],
  "lora_post_process_for_vllm": true,
  "gradient_checkpointing": true,
  "text_field_name": "messages",
  "image_field_name": "images"
}

Comment thread pyproject.toml Outdated
Comment thread tuning/config/configs.py Outdated
Comment thread tuning/config/configs.py Outdated
Comment thread tuning/data/data_preprocessing_utils.py Outdated
Comment thread tuning/data/setup_dataprocessor.py Outdated
Comment thread tuning/data/data_preprocessing_utils.py Outdated
Comment thread tuning/data/data_preprocessing_utils.py Outdated
Comment thread tuning/data/data_preprocessing_utils.py Outdated
@anhuong anhuong removed the request for review from alex-jw-brooks February 11, 2025 17:55
Signed-off-by: Anh Uong <anh.uong@ibm.com>
Signed-off-by: Anh Uong <anh.uong@ibm.com>
Signed-off-by: Anh Uong <anh.uong@ibm.com>
Abhishek-TAMU and others added 5 commits February 25, 2025 07:16
* Changes to support vlms

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Change in kwargs

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Restructure of VisionDataCollator

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Usage of 2 handlers and modifying chat_template handler

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* fix fmt+lint

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Minor Fix for unit test case

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Minor error handling

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

---------

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Signed-off-by: Anh Uong <anh.uong@ibm.com>
Signed-off-by: Anh Uong <anh.uong@ibm.com>
Signed-off-by: Anh Uong <anh.uong@ibm.com>
- in transformers v4.49 output_dir is no longer required

Signed-off-by: Anh Uong <anh.uong@ibm.com>
Comment thread tuning/data/setup_dataprocessor.py Outdated
Comment thread tuning/data/setup_dataprocessor.py Outdated
Signed-off-by: Anh Uong <anh.uong@ibm.com>
Collaborator

Abhishek-TAMU commented Mar 1, 2025

@dushyantbehl PR for a fix: anhuong#5
More info here

* Changes to support vlms

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Change in kwargs

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Restructure of VisionDataCollator

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Usage of 2 handlers and modifying chat_template handler

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* fix fmt+lint

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Minor Fix for unit test case

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Minor error handling

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Fix issues for granite vision preview model

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

---------

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Comment thread tuning/utils/tokenizer_data_utils.py Outdated
Comment thread tuning/utils/tokenizer_data_utils.py Outdated
Collaborator


A unit test for this function would be great.

Collaborator

@Abhishek-TAMU Abhishek-TAMU Apr 1, 2025


I can definitely add a test case, something like the one below (it passes successfully). But I was wondering: do we use larger models for tests? The Llama vision model is an 11B model (meta-llama/Llama-3.2-11B-Vision-Instruct), and downloading and loading it on every unit-test run would take a lot of time just for this one test case. Any thoughts?

import copy

from transformers import AutoModelForVision2Seq, AutoProcessor

# tokenizer_and_embedding_resize lives in tuning/utils/tokenizer_data_utils.py
from tuning.utils.tokenizer_data_utils import tokenizer_and_embedding_resize

LLAMA_VISION_MODEL_NAME = "meta-llama/Llama-3.2-11B-Vision-Instruct"


def test_resize_llama_vision_model():
    model = AutoModelForVision2Seq.from_pretrained(LLAMA_VISION_MODEL_NAME)
    processor = AutoProcessor.from_pretrained(LLAMA_VISION_MODEL_NAME)
    tokenizer = processor.tokenizer

    # Keep copies of the original embeddings for comparison after resizing
    current_input_embeddings = copy.deepcopy(model.get_input_embeddings())
    current_output_embeddings = copy.deepcopy(model.get_output_embeddings())

    current_tokenizer_len = len(tokenizer.get_vocab())

    resize_result = tokenizer_and_embedding_resize(
        special_tokens_dict={"unk_token": "<unk>"},
        tokenizer=tokenizer,
        model=model,
        multiple_of=1,
    )

    resized_input_embeddings = model.get_input_embeddings()
    resized_output_embeddings = model.get_output_embeddings()
    resized_tokenizer_len = len(tokenizer.get_vocab())

    assert resized_tokenizer_len == current_tokenizer_len + 1
    assert "<unk>" in tokenizer.get_vocab()
    assert resize_result["num_new_tokens"] == 1

    # 2 new tokens were added: <unk> and <image>
    assert (
        resized_output_embeddings.weight.shape[0]
        == current_output_embeddings.weight.shape[0] + 2
    )
    assert (
        resized_input_embeddings.weight.shape[0]
        == current_input_embeddings.weight.shape[0] + 2
    )

Collaborator


No, we will have to find a dummy vision model for unit tests, or we'll have to create one ourselves with dummy files and a trimmed-down vocab and embeddings.

Collaborator


The unit test otherwise looks good to me. I didn't understand why 2 tokens are added, though, if we add one <unk> (# 2 new tokens were added: <unk> and <image>). Does a new image token get added for every language token?

Let us know in a comment once you assess the feasibility of the unit test. We should ideally have unit tests for some vision model.

Collaborator

@Abhishek-TAMU Abhishek-TAMU Apr 2, 2025


unit test otherwise looks good to me. I didn't understand why 2 tokens are added though, if we add one <unk> (# 2 new tokens were added: <unk> and <image>). Does a new image token get added for every language token?

The <image> token is present in the Llama vision tokenizer, so len(tokenizer) = 128257.
But model.get_output_embeddings().weight.shape is torch.Size([128256, 4096]).

Hence, when special_tokens_dict just has <unk>, num_new_tokens = num_new_tokens + len(tokenizer) - embedding_size makes the embedding size 128258, and so the model resizes from 128256 to 128258 (an increase of 2), since the additional <image> token is also taken into account in the calculation.

Hence, I also resize and increase the model input embeddings by 2.
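The arithmetic can be sketched as follows (the numbers come from this comment; this is an illustration, not the PR's actual helper, which lives in tuning/utils/tokenizer_data_utils.py):

```python
# Illustration of the Llama vision resize arithmetic described above.
len_tokenizer = 128257   # includes <image>, which has no embedding row yet
embedding_size = 128256  # model.get_output_embeddings().weight.shape[0]

num_new_tokens = 1  # adding <unk> via special_tokens_dict
# account for tokens already in the tokenizer but missing from the embedding
num_new_tokens = num_new_tokens + len_tokenizer - embedding_size

new_embedding_size = embedding_size + num_new_tokens
print(num_new_tokens, new_embedding_size)  # 2 128258
```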

Collaborator

@Abhishek-TAMU Abhishek-TAMU Apr 2, 2025


So basically, as mentioned above, for the Llama vision model from Hugging Face (before any resizing):
len(tokenizer) == model.get_output_embeddings().weight.shape[0] + 1

AND

for the Granite and LLaVA vision models (before any resizing):
len(tokenizer) == model.get_output_embeddings().weight.shape[0]
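As a hedged sketch of that invariant (the Llama numbers are from this thread; the Granite/LLaVA vocab size below is hypothetical, for illustration only):

```python
# Llama vision (before resizing): the tokenizer is one token longer than the
# output embedding matrix, because <image> has no embedding row.
llama_len_tokenizer = 128257
llama_output_rows = 128256
assert llama_len_tokenizer == llama_output_rows + 1

# Granite / LLaVA vision (before resizing): the two sizes match.
# (49152 is a hypothetical vocab size used only for this illustration.)
granite_len_tokenizer = 49152
granite_output_rows = 49152
assert granite_len_tokenizer == granite_output_rows
print("invariants hold")
```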

Collaborator


Regarding the tiny vision model, I have saved a tiny Llama vision model, which you can see in this commit along with the passing unit test.

To save the Tiny Llama Vision model, the same config from the original Llama Vision model was used, with parameters such as hidden_size, num_hidden_layers, intermediate_size, and attention_heads reduced.
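A minimal sketch of that reduction (the shrunken values and the helper name below are illustrative, not the exact ones used in the commit):

```python
# Hypothetical sketch: derive a tiny test config from a full model config
# by shrinking the size-related fields, as described above.
def shrink_config(full_config: dict) -> dict:
    tiny = dict(full_config)
    tiny.update(
        {
            "hidden_size": 64,         # reduced from e.g. 4096
            "num_hidden_layers": 2,    # reduced from e.g. 32
            "intermediate_size": 128,  # reduced from e.g. 14336
            "num_attention_heads": 4,  # reduced from e.g. 32
        }
    )
    return tiny

full = {
    "hidden_size": 4096,
    "num_hidden_layers": 32,
    "intermediate_size": 14336,
    "num_attention_heads": 32,
    "vocab_size": 128256,  # vocab is kept so tokenizer tests still apply
}
tiny = shrink_config(full)
print(tiny["num_hidden_layers"], tiny["vocab_size"])  # 2 128256
```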

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Collaborator

@dushyantbehl dushyantbehl left a comment


A few comments, and a request to go through the merge again due to inconsistencies in the code after rebasing.

Comment thread tuning/utils/collators.py
Comment thread tuning/sft_trainer.py Outdated
Comment thread tuning/data/setup_dataprocessor.py Outdated
Comment thread tuning/data/data_preprocessing_utils.py
Comment thread tuning/utils/tokenizer_data_utils.py
Comment thread tuning/data/setup_dataprocessor.py Outdated
Comment thread tuning/data/setup_dataprocessor.py Outdated
Comment thread tuning/data/setup_dataprocessor.py Outdated
Comment thread tuning/data/setup_dataprocessor.py Outdated
Comment thread tuning/data/data_processors.py Outdated
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
dushyantbehl previously approved these changes Apr 16, 2025
Collaborator

@dushyantbehl dushyantbehl left a comment


Thanks @Abhishek-TAMU, just a last couple of minor changes requested; the rest looks good to me.

Please check the DCO before we can merge it.

Comment thread tests/utils/test_embedding_resize.py
Comment thread tuning/data/setup_dataprocessor.py Outdated
Comment thread tests/test_sft_trainer.py Outdated
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Collaborator

@dushyantbehl dushyantbehl left a comment


Thanks a lot @Abhishek-TAMU for diligently fixing all review comments.

LGTM.

@willmj willmj merged commit fa070a8 into foundation-model-stack:main Apr 16, 2025
9 checks passed
dushyantbehl pushed a commit to dushyantbehl/fms-hf-tuning that referenced this pull request Jun 23, 2025
* install trl=0.13, deepspeed, update transformers

* deps: install pillow, uninstall deepspeed

* add multimodal flag, pass processor, add data collator

* load dataset directly, pass processor, fix field

* add generic data collator

Signed-off-by: Anh Uong <anh.uong@ibm.com>

* remove load_dataset since HF support added

Signed-off-by: Anh Uong <anh.uong@ibm.com>

* add fsdp config needed for llava models

Signed-off-by: Anh Uong <anh.uong@ibm.com>

* feat:Use of data handlers for Vision LM support  (#4)

* Changes to support vlms

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Change in kwargs

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Restructure of VisionDataCollator

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Usage of 2 handlers and modifying chat_template handler

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* fix fmt+lint

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Minor Fix for unit test case

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Minor error handling

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

---------

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* replace text_field_name for dataset_text_field and for image

Signed-off-by: Anh Uong <anh.uong@ibm.com>

* remove multimodal flag

Signed-off-by: Anh Uong <anh.uong@ibm.com>

* fix formatting, remove unused fields

Signed-off-by: Anh Uong <anh.uong@ibm.com>

* remove irrelevant unit test

- in transformers v4.49 output_dir is no longer required

Signed-off-by: Anh Uong <anh.uong@ibm.com>

* revert data loading back

Signed-off-by: Anh Uong <anh.uong@ibm.com>

* fix:Support loading for Granite-3.2 Vision Model 

* Changes to support vlms

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Change in kwargs

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Restructure of VisionDataCollator

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Usage of 2 handlers and modifying chat_template handler

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* fix fmt+lint

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Minor Fix for unit test case

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Minor error handling

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Fix issues for granite vision preview model

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

---------

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* remove duplicate logger, fmt

Signed-off-by: Anh Uong <anh.uong@ibm.com>

* fix unbound var, refactor tokenizer

Signed-off-by: Anh Uong <anh.uong@ibm.com>

* changes from review comments

Signed-off-by: Anh Uong <anh.uong@ibm.com>

* fix embedding resize and errors

Signed-off-by: Anh Uong <anh.uong@ibm.com>

* add hack fix for vocab size for Mllama models

Signed-off-by: Anh Uong <anh.uong@ibm.com>

* add docs on vision model usage

Signed-off-by: Anh Uong <anh.uong@ibm.com>

* move llama vocab size, allow single image inputs

Signed-off-by: Anh Uong <anh.uong@ibm.com>

* linter fixes

Signed-off-by: Anh Uong <anh.uong@ibm.com>

* fix merge, add lora note

Signed-off-by: Anh Uong <anh.uong@ibm.com>

* docs: organize sections

Signed-off-by: Anh Uong <anh.uong@ibm.com>

* remove all dataset columns

Signed-off-by: Anh Uong <anh.uong@ibm.com>

* only take single image for granite models

Signed-off-by: Anh Uong <anh.uong@ibm.com>

* feat:Support Entire Vision dataset with Streaming (#6)

* Changes to support vlms

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Change in kwargs

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Restructure of VisionDataCollator

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Usage of 2 handlers and modifying chat_template handler

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* fix fmt+lint

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Minor Fix for unit test case

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Minor error handling

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Fix issues for granite vision preview model

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Transformers version for running Llama model successfully

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Changes when enabling streaming

* Merge remote-tracking branch 'anh_vision_fms_hf_tuning/vision-model' into vision_support

* Merge with main

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* modify apply_tokenizer_chat_template argument key

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* resolve features for iterable dataset

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Add applying processor in collator and PR changes

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Rename Handler

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Add config for dataset streaming via arguments

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Fix column removal

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Convert to RGB for LlavaProcessor and model LlavaForConditionalGeneration

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* PR CHANGES 1

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* PR Changes 2

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Collator documentation

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Minor fix

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Resize input and output embeddings seperately for LLama vision model

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* PR changes

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Documentation added

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Added processor to DataPreProcessor

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

---------

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* PR change of adding vocab size

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Added llama vision model and unit test case

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Make Jinja template work

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Fix for preprocessor_config in checkpoint folder

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* fmt fix

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Moving resizing out of if block

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Test case fix and merging with main

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* PR Change 1

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* PR Change 2

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Added test_vision_data_collator

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* PR Changes

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

* Comment change

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

---------

Signed-off-by: Anh Uong <anh.uong@ibm.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Co-authored-by: Abhishek Maurya <124327945+Abhishek-TAMU@users.noreply.github.com>
Co-authored-by: Abhishek <maurya.abhishek@ibm.com>