feat: Support Florence-2 Transformers-native architecture (transformers ≥ 4.56.1 / 5.x) #10
carlesonielfa wants to merge 11 commits into
Signed-off-by: Carles Onielfa <carlesonielfa@gmail.com>
Signed-off-by: Carles Onielfa <carlesonielfa@gmail.com>
Signed-off-by: Carles Onielfa <carlesonielfa@gmail.com>
Signed-off-by: Carles Onielfa <carlesonielfa@gmail.com>
Signed-off-by: Carles Onielfa <carlesonielfa@gmail.com>
Signed-off-by: Carles Onielfa <carlesonielfa@gmail.com>
|
Hi @NickLucche, would you mind taking a look when you get a chance? I didn't find any contribution guidelines, so I did my best to match the project's style. The PR updates the plugin to support the official HF […] Happy to make any adjustments! |
|
Thanks for contributing, @carlesonielfa! cc @Isotr0py for Florence-related PRs.
I think the tok_kwargs change was only necessary with more recent vLLM versions. Which one did you use to test the model? |
|
I observed the degraded generation with version 0.2.0 of the plugin and vLLM 0.14.1.
I tested my PR with vLLM 0.16.0 and that behavior was corrected, even without a set […] However, when testing more intensively, for very simple images (a black square, simple icons) the model still gets stuck in a generation loop, and increasing the repetition penalty just makes it generate hallucinated text. |
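The generation loop described above could be flagged automatically when checking outputs. The following is a minimal sketch, not code from this PR: `has_repetition_loop` and its `max_ngram` and `min_repeats` parameters are hypothetical names, and the check only looks at the tail of a token list.

```python
from typing import Sequence


def has_repetition_loop(
    tokens: Sequence[str], max_ngram: int = 4, min_repeats: int = 3
) -> bool:
    """Return True if the sequence ends with one n-gram repeated back to back.

    Hypothetical helper for illustration: it checks n-gram sizes from 1 up to
    max_ngram and requires at least min_repeats consecutive copies at the end.
    """
    tokens = list(tokens)
    for n in range(1, max_ngram + 1):
        span = n * min_repeats
        if len(tokens) < span:
            continue  # not enough tokens to hold min_repeats copies
        tail = tokens[-span:]
        chunk = tail[:n]
        # Every n-sized window in the tail must equal the first one.
        if all(tail[i : i + n] == chunk for i in range(0, span, n)):
            return True
    return False
```

A check like this only detects exact repeats at the end of the output, so it would miss loops that drift slightly between iterations.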
Isotr0py
left a comment
Nice! It will be great to use transformers' implementation, especially since the original one rather lacks maintenance. 😅
Leaving two initial comments. PTAL!
    def _remap(weights: Iterable[tuple[str, torch.Tensor]]):
        for name, param in weights:
            # HF checkpoint layout (Florence2ForConditionalGeneration):
            #   model.vision_tower.*           -> vision_tower.*
            #   model.multi_modal_projector.*  -> multi_modal_projector.*
            #   model.language_model.*         -> language_model.model.*
            #     (HF uses BartModel directly; our wrapper adds .model)
            #   lm_head.*                      -> language_model.lm_head.*
            if name.startswith("model.vision_tower."):
                name = name[len("model.") :]
            elif name.startswith("model.multi_modal_projector."):
                name = name[len("model.") :]
            elif name.startswith("model.language_model."):
                name = (
                    "language_model.model." + name[len("model.language_model.") :]
                )
            elif name.startswith("lm_head."):
                name = "language_model." + name
            yield name, param
We can use WeightsMapper here. You can refer to other VLMs in vLLM:
https://github.com/vllm-project/vllm/blob/a1257fd1ea93da6e27b31e4739ac2707781d8ba7/vllm/model_executor/models/qwen3_vl.py#L1372-L1379
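The idea behind the suggestion is to replace the chain of `startswith` branches with a declarative prefix table. The sketch below shows that idea in plain Python only; it is a stand-in and not vLLM's `WeightsMapper` class. `PREFIX_MAP` and `remap_names` are hypothetical names, and the table entries are copied from the comments in the quoted `_remap` function.

```python
from typing import Dict, Iterable, Iterator

# Prefix table taken from the comments in the quoted _remap function.
# Hypothetical stand-in for a declarative mapper; not vLLM's WeightsMapper.
PREFIX_MAP: Dict[str, str] = {
    "model.vision_tower.": "vision_tower.",
    "model.multi_modal_projector.": "multi_modal_projector.",
    "model.language_model.": "language_model.model.",
    "lm_head.": "language_model.lm_head.",
}


def remap_names(
    names: Iterable[str], prefix_map: Dict[str, str] = PREFIX_MAP
) -> Iterator[str]:
    """Rewrite the first matching prefix of each name; pass others through."""
    for name in names:
        for old, new in prefix_map.items():
            if name.startswith(old):
                name = new + name[len(old) :]
                break  # apply at most one rule per name
        yield name
```

Keeping the mapping as data makes it easier to compare against the checkpoint layout than reading through branch logic.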
            f"only Florence Vision is supported for now. "
            f"Received model type: {config.vision_config.model_type}"
        )
    self.vision_tower = Florence2VisionBackbone(config.vision_config)
I think we can import Florence2VisionBackbone directly from transformers now, so we no longer need to copy the ViT implementation here :)
perfect! Implemented, that simplifies it a lot
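Importing the backbone from transformers ties the plugin to the minimum version named in the PR title (4.56.1). A plugin could guard that import with a version check; the following is a minimal sketch under that assumption, with `version_tuple` and `has_native_florence2` as hypothetical helper names that are not part of this PR.

```python
import re
from typing import Tuple

# Minimum transformers version, taken from the PR title.
MIN_NATIVE_FLORENCE2: Tuple[int, int, int] = (4, 56, 1)


def version_tuple(version: str) -> Tuple[int, ...]:
    """Parse the leading 'X.Y' or 'X.Y.Z' digits of a version string.

    Suffixes such as '.dev0' or 'rc1' are ignored, so a pre-release compares
    equal to its final release in this simplified sketch.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def has_native_florence2(version: str) -> bool:
    """Return True if the given transformers version meets the PR's minimum."""
    return version_tuple(version) >= MIN_NATIVE_FLORENCE2
```

In practice the argument would be `transformers.__version__`, and the import itself would come from the `transformers.models.florence2.modeling_florence2` module linked in the review.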
|
@carlesonielfa thanks for elaborating, would you mind opening a separate PR for the bart |
Signed-off-by: Carles Onielfa <carlesonielfa@gmail.com>
Signed-off-by: Carles Onielfa <carlesonielfa@gmail.com>
Sure! I removed the fix from this PR and opened a new one for it here: #12. Worth noting that the standalone PR won't run on newer vLLM versions without the import refactors across bart.py and florence2.py. The florence2.py import changes in this PR are intertwined with other modifications, so isolating just the vLLM-compatible import refactor would require some extra work. Happy to do that if you'd like, but merging this PR first would be the simpler path. |
|
I think I had some changes related to updating version here #6 (to be added once we cut a release version here) |
|
@carlesonielfa just realized in #6 I was defaulting to False. |
|
@NickLucche I'm now realizing I misunderstood your first comment. I only tested the output quality for Florence, and I thought you were referring to that in your first comment because those were the bug fixes I mentioned in the PR description. Now I'm noticing I didn't explain the […] That change was prompted by a runtime error when running the […] It's totally possible that your changes in #6 are sufficient! My tweak in this case was just for compatibility |
Isotr0py
left a comment
The model implementation LGTM as long as we also clean up the multimodal projector copy.
    class ConvEmbed(nn.Module):
        """ Image to Patch Embedding
    class Florence2MultiModalProjector(nn.Module):
We can import from transformers as well.
https://github.com/huggingface/transformers/blob/3dd82faf3e887043db772d4c1191ec40271a1584/src/transformers/models/florence2/modeling_florence2.py#L573-L595
        """
        This module learns positional embeddings up to a fixed maximum size.
        """
    class Florence2VisionLearnedAbsolutePositionEmbedding2D(nn.Module):
            return pos.permute(2, 0, 1).unsqueeze(0).expand(x.shape[0], -1, -1, -1)

    class Florence2VisionPositionalEmbeddingCosine1D(nn.Module):
Signed-off-by: Carles Onielfa <carlesonielfa@gmail.com>
|
@Isotr0py Nice! Done, cleaned it up even more |
Can you also update the readme to point to HF Florence-2 models?
florence-community/Florence-2-base
florence-community/Florence-2-large
Sure! I also removed the trust_remote_code and alternative tokenizer mentions, since they are no longer needed.
Signed-off-by: Carles Onielfa <carlesonielfa@gmail.com>
|
Sure! I'm happy to do it. However, I'll be off until the start of April. If you'd like to move things forward in the meantime, feel free to go ahead. Otherwise, I'll pick this up when I'm back. |
Summary
Updates the Florence-2 vLLM implementation to support the new florence_vision model type, which is now the standard Transformers architecture and is also compatible with transformers 5. Tested against vLLM 0.16.0 and transformers 5.2.0.

The old DaViT-based backbone used in earlier Microsoft checkpoints is no longer required. As a result, trust_remote_code=True is no longer needed: the model is now fully supported by the native transformers library.

Setting a repetition_penalty for generation is also no longer required, since this PR fixes an existing bug in the decoder prompt generation.

Changes
Vision backbone (vllm_bart_plugin/florence2.py):
- […] Florence2VisionBackbone and Florence2MultiModalProjector, matching the HF-official implementation
- […] load_weights to match the new HF checkpoint layout
- […] Florence2Processor

Bug fixes (vllm_bart_plugin/florence2.py, vllm_bart_plugin/bart.py):
I'm not 100% sure if both were present before, but solving these fixed the degraded generation that was previously present in the plugin. Generation quality is now on par with transformers, and we no longer require repetition_penalty to be set to avoid the previously seen <s>...<s> loops.
- decoder_start_token_id is None at the top-level Florence2Config and must be read from text_config.

Tests (tests/test_florence2.py):
- […] (@pytest.mark.slow) for all tasks
- […] florence-community/Florence-2-base-ft

Formatting note
black and isort were only run on the files changed in this PR (vllm_bart_plugin/florence2.py and tests/test_florence2.py). The rest of the codebase was not uniformly formatted beforehand, so formatting it all here would produce a noisy diff. Happy to run it across the whole project if you'd prefer, either as part of this PR or in a follow-up.

Testing
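The decoder_start_token_id item under Bug fixes comes down to a lookup with a fallback. A minimal sketch of that lookup follows, assuming only what the description states (the top-level value is `None` and the real value lives on `text_config`). `resolve_decoder_start_token_id` is a hypothetical helper name, and the config objects are `SimpleNamespace` stand-ins with made-up values, not real `Florence2Config` instances.

```python
from types import SimpleNamespace
from typing import Any, Optional


def resolve_decoder_start_token_id(config: Any) -> Optional[int]:
    """Prefer the top-level value; fall back to text_config when it is None.

    Hypothetical helper for illustration. An explicit `is None` check is used
    so that a valid token id of 0 is not mistaken for a missing value.
    """
    token_id = getattr(config, "decoder_start_token_id", None)
    if token_id is None:
        text_config = getattr(config, "text_config", None)
        token_id = getattr(text_config, "decoder_start_token_id", None)
    return token_id


# Stand-in config mirroring the described layout; the id 2 is illustrative only.
example_config = SimpleNamespace(
    decoder_start_token_id=None,
    text_config=SimpleNamespace(decoder_start_token_id=2),
)
resolved = resolve_decoder_start_token_id(example_config)
```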