feat: Support Florence-2 Transformers-native architecture (transformers ≥ 4.56.1 / 5.x) #10

Open
carlesonielfa wants to merge 11 commits into vllm-project:master from carlesonielfa:master

@carlesonielfa carlesonielfa commented Mar 6, 2026

Summary

Updates the Florence-2 vLLM implementation to support the new florence_vision model type, which is now the standard Transformers architecture and is also compatible with transformers 5. Tested against vLLM 0.16.0 and transformers 5.2.0.

The old DaViT-based backbone used in earlier Microsoft checkpoints is no longer required.

As a result, trust_remote_code=True is no longer needed: the model is now fully supported by the native transformers library.

Setting a repetition_penalty for generation is also no longer required, since this PR fixes an existing bug in the decoder prompt generation.

Changes

Vision backbone (vllm_bart_plugin/florence2.py):

  • Replaced the old DaViT classes with the new Florence2VisionBackbone and Florence2MultiModalProjector, matching the HF-official implementation
  • Updated weight remapping in load_weights to match the new HF checkpoint layout
  • Fixed multi-modal token handling to correctly align vision embeddings with the placeholder tokens already inserted by Florence2Processor

Bug fixes (vllm_bart_plugin/florence2.py, vllm_bart_plugin/bart.py):
I'm not 100% sure whether both bugs were present before, but fixing them resolved the degraded generation previously seen in the plugin. Generation quality is now on par with transformers, and we no longer require repetition_penalty to be set to avoid the previously seen <s>...<s> loops.

  • Fixed decoder prompt generation: decoder_start_token_id is None at the top-level Florence2Config and must be read from text_config.
  • Fixed cross-attention KV cache coverage: the multimodal placeholder now spans the full encoder input (image + task-prompt tokens) so vLLM allocates and reads KV cache slots for all encoder tokens, not just the image tokens.
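
The two fixes above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the config object is a stand-in built with SimpleNamespace, the helper names resolve_decoder_start_token_id and encoder_placeholder_length are hypothetical, and the token id and token counts are made-up values.

```python
from types import SimpleNamespace


def resolve_decoder_start_token_id(config):
    """Return decoder_start_token_id, falling back to config.text_config.

    Assumption: the top-level config may hold None for this field while
    the nested text_config carries the real value.
    """
    token_id = getattr(config, "decoder_start_token_id", None)
    if token_id is None:
        text_config = getattr(config, "text_config", None)
        token_id = getattr(text_config, "decoder_start_token_id", None)
    if token_id is None:
        raise ValueError("decoder_start_token_id is not set on the config")
    return token_id


def encoder_placeholder_length(num_image_tokens, num_prompt_tokens):
    """Length of a placeholder that covers the whole encoder input
    (image tokens plus task-prompt tokens), not only the image part."""
    return num_image_tokens + num_prompt_tokens


# Stand-in for a config whose top-level field is None (values illustrative).
config = SimpleNamespace(
    decoder_start_token_id=None,
    text_config=SimpleNamespace(decoder_start_token_id=2),
)
start_id = resolve_decoder_start_token_id(config)
placeholder_len = encoder_placeholder_length(
    num_image_tokens=577, num_prompt_tokens=8
)
```

Raising when neither level defines the id is a design choice of this sketch; it makes a misconfigured checkpoint fail loudly instead of silently producing a wrong decoder prompt.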

Tests (tests/test_florence2.py):

  • CPU unit tests for all new vision components
  • GPU integration tests (@pytest.mark.slow) for all tasks
  • Uses florence-community/Florence-2-base-ft
  • Updated BART code and tests to pass with new vLLM version

Formatting note

black and isort were only run on the files changed in this PR (vllm_bart_plugin/florence2.py and tests/test_florence2.py). The rest of the codebase was not uniformly formatted beforehand, so formatting it all here would produce a noisy diff. Happy to run it across the whole project if you'd prefer to do it as part of this PR or a follow-up.

Testing

# Unit tests (CPU only)
pytest tests/test_florence2.py -m "not slow"

# Integration tests (requires GPU + HF Hub access)
pytest tests/test_florence2.py -m slow

Signed-off-by: Carles Onielfa <carlesonielfa@gmail.com>
@carlesonielfa
Author

Hi @NickLucche , would you mind taking a look when you get a chance?

I didn't find any contribution guidelines so I did my best to match the project's style.

The PR updates the plugin to support the official HF florence_vision architecture and makes it compatible with newer versions of vLLM and transformers.

Happy to make any adjustments!

@NickLucche
Collaborator

Thanks for contributing @carlesonielfa ! cc @Isotr0py for florence related PR.

I'm not 100% sure if both were present before, but solving these fixed the degraded generation of the model that was present previously in the plugin

I think the tok_kwargs change was necessary only with more recent vLLM versions. Which one did you use to test the model?

@carlesonielfa
Author

I observed the degraded generation with the plugin's 0.2.0 version and vLLM 0.14.1.

  • With <DETAILED_CAPTION>, the text generation pattern does not match what the model generates when running with transformers. For example, it starts the generation with a lowercase "a", which the model never does on transformers.
  • With <MORE_DETAILED_CAPTION>, it just loops generating <s>, even with the repetition penalty at 1.5.

I tested my PR with vLLM 0.16.0 and that behavior was corrected: even without repetition_penalty set, the generation closely matches that of transformers for both prompts.

That said, in more intensive testing with very simple images (a black square, simple icons), the model still gets stuck in a generation loop, and increasing the repetition penalty just makes it generate hallucinated text.

Member

@Isotr0py Isotr0py left a comment

Nice! It will be great to use the transformers implementation, especially since the original one quite lacks maintenance. 😅

Left two initial comments. PTAL!

Comment thread vllm_bart_plugin/florence2.py Outdated
Comment on lines +1049 to +1067
def _remap(weights: Iterable[tuple[str, torch.Tensor]]):
    for name, param in weights:
        # HF checkpoint layout (Florence2ForConditionalGeneration):
        #   model.vision_tower.*          -> vision_tower.*
        #   model.multi_modal_projector.* -> multi_modal_projector.*
        #   model.language_model.*        -> language_model.model.*
        #     (HF uses BartModel directly; our wrapper adds .model)
        #   lm_head.*                     -> language_model.lm_head.*
        if name.startswith("model.vision_tower."):
            name = name[len("model.") :]
        elif name.startswith("model.multi_modal_projector."):
            name = name[len("model.") :]
        elif name.startswith("model.language_model."):
            name = (
                "language_model.model." + name[len("model.language_model.") :]
            )
        elif name.startswith("lm_head."):
            name = "language_model." + name
        yield name, param
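
To make the effect of this remapping concrete, here is a small standalone sketch of the same prefix rules applied to a few checkpoint keys. The function name remap_names and the example key names are hypothetical, and integers stand in for tensors so the snippet runs without torch.

```python
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def remap_names(weights: Iterable[Tuple[str, T]]) -> Iterator[Tuple[str, T]]:
    """Apply the prefix rules from the snippet above to (name, value) pairs."""
    for name, param in weights:
        if name.startswith(("model.vision_tower.", "model.multi_modal_projector.")):
            # Drop the leading "model." only.
            name = name[len("model."):]
        elif name.startswith("model.language_model."):
            # The wrapper adds an extra ".model" level around the BART model.
            name = "language_model.model." + name[len("model.language_model."):]
        elif name.startswith("lm_head."):
            name = "language_model." + name
        yield name, param


# Hypothetical checkpoint keys; the values are placeholders, not tensors.
checkpoint = [
    ("model.vision_tower.blocks.0.weight", 0),
    ("model.multi_modal_projector.image_projection.weight", 1),
    ("model.language_model.encoder.embed_tokens.weight", 2),
    ("lm_head.weight", 3),
]
remapped = [name for name, _ in remap_names(checkpoint)]
```

Names that match none of the prefixes pass through unchanged, which mirrors the behavior of the original generator.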
Member

Author

nice, done!

f"only Florence Vision is supported for now. "
f"Received model type: {config.vision_config.model_type}"
)
self.vision_tower = Florence2VisionBackbone(config.vision_config)
Member

I think we can now import Florence2VisionBackbone directly from transformers, so we no longer need to copy the ViT implementation here :)

Author

perfect! Implemented, that simplifies it a lot

@NickLucche
Collaborator

NickLucche commented Mar 13, 2026

@carlesonielfa thanks for elaborating. Would you mind opening a separate PR for the bart add_special_tokens fix, so we can better refer to the fix in isolation?
I think I also came across something similar. I think it's related to the difference between how vLLM handles first-token generation in enc-dec models and how HF does it (first token is forced to bos).

Signed-off-by: Carles Onielfa <carlesonielfa@gmail.com>
@carlesonielfa
Author

@carlesonielfa thanks for elaborating. Would you mind opening a separate PR for the bart add_special_tokens fix, so we can better refer to the fix in isolation? I think I also came across something similar. I think it's related to the difference between how vLLM handles first-token generation in enc-dec models and how HF does it (first token is forced to bos).

Sure! Removed the fix from this PR and opened a new one for it here: #12

Worth noting that the standalone PR won't run on newer vLLM versions without the import refactors across bart.py and florence2.py. The florence2.py import changes in this PR are intertwined with other modifications, so isolating just the vLLM-compatible import refactor would require some extra work. Happy to do that if you'd like, but merging this PR first would be the simpler path.

@NickLucche
Collaborator

I think I had some changes related to updating the version in #6 (to be added once we cut a release version here).

@NickLucche
Collaborator

@carlesonielfa I just realized that in #6 I was defaulting to False.
Could you add a snippet/test to #12 highlighting the change in output for bart with that parameter forced to True?

@carlesonielfa
Author

@NickLucche I'm now realizing I misunderstood your first comment.

I only tested the output quality for Florence, and I thought you were referring to that in your first comment because those were the bug fixes I mentioned in the PR description. Now I'm noticing I didn't explain the bart.py change at all, which is what you were actually asking about.

That change was prompted by a runtime error when running the test_model_inference tests: without it, several of them fail with "got multiple values for keyword argument 'add_special_tokens'".

It's totally possible that your changes in #6 are sufficient! My tweak in this case was just for compatibility.
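
For context, this kind of error is a plain Python TypeError that arises when the same keyword reaches a call both explicitly and through an unpacked kwargs dict. The sketch below reproduces it with a stand-in function; encode and encode_with_default are hypothetical names, not the plugin's or vLLM's API, and which call site actually collides in the plugin is not shown here.

```python
def encode(text, add_special_tokens=True, **kwargs):
    """Stand-in for a tokenizer call; returns the flag it received."""
    return add_special_tokens


# Kwargs assembled elsewhere that already carry the flag.
tok_kwargs = {"add_special_tokens": False}

# Passing the flag explicitly *and* via **tok_kwargs collides.
try:
    encode("hello", add_special_tokens=True, **tok_kwargs)
    error_message = None
except TypeError as exc:
    error_message = str(exc)


def encode_with_default(text, tok_kwargs, default=True):
    """One possible fix: only fill in the flag when the caller did not set it."""
    merged = dict(tok_kwargs)  # copy so the caller's dict is not mutated
    merged.setdefault("add_special_tokens", default)
    return encode(text, **merged)


result = encode_with_default("hello", tok_kwargs)
```

Using setdefault keeps the caller's value when present, so the choice between True and False stays with whoever builds tok_kwargs.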

Member

@Isotr0py Isotr0py left a comment

The model implementation LGTM as long as we also clean up the multimodal projector copy.

Comment thread vllm_bart_plugin/florence2.py Outdated

class ConvEmbed(nn.Module):
""" Image to Patch Embedding
class Florence2MultiModalProjector(nn.Module):
Member

Comment thread vllm_bart_plugin/florence2.py Outdated
"""
This module learns positional embeddings up to a fixed maximum size.
"""
class Florence2VisionLearnedAbsolutePositionEmbedding2D(nn.Module):
Member

Ditto.

Comment thread vllm_bart_plugin/florence2.py Outdated
return pos.permute(2, 0, 1).unsqueeze(0).expand(x.shape[0], -1, -1, -1)


class Florence2VisionPositionalEmbeddingCosine1D(nn.Module):
Member

Ditto.

Signed-off-by: Carles Onielfa <carlesonielfa@gmail.com>
@carlesonielfa
Author

@Isotr0py Nice! Done, cleaned it up even more

Comment thread README.md Outdated
Comment on lines 136 to 137
Member

Can you also update the readme to point to HF Florence-2 models?

  • florence-community/Florence-2-base
  • florence-community/Florence-2-large

Author

@carlesonielfa carlesonielfa Mar 13, 2026

Sure! I also removed the trust_remote_code and alternative tokenizer mentions, since they are no longer needed.

Signed-off-by: Carles Onielfa <carlesonielfa@gmail.com>
@NickLucche
Collaborator

description, now I'm noticing I didn't explain the bart.py change at all which is what you were actually asking about.

Ok I got it, do you want to rebase on top of #6 and test whether that is working? In that case we could merge #6->bump vllm requirement->follow-up with this PR

@carlesonielfa
Author

Sure! I'm happy to do it. However, I'll be off until the start of April. If you'd like to move things forward in the meantime, feel free to go ahead. Otherwise, I'll pick this up when I'm back.

3 participants