feat: Support Florence-2 Transformers-native architecture (transformers ≥ 4.56.1 / 5.x) by carlesonielfa · Pull Request #10 · vllm-project/bart-plugin

carlesonielfa · 2026-03-06T11:27:22Z

Summary

Updates the Florence-2 vLLM implementation to support the new florence_vision model type,
which is now the standard Transformers architecture and is also compatible with transformers 5. Tested against vLLM 0.16.0 and transformers 5.2.0.

The old DaViT-based backbone used in earlier Microsoft checkpoints is no longer required.

As a result, trust_remote_code=True is no longer needed: the model is now fully supported by the native transformers library.

Setting a repetition_penalty for generation is also no longer required, since this PR fixes an existing bug in the decoder prompt generation.

Changes

Vision backbone (vllm_bart_plugin/florence2.py):

Replaced the old DaViT classes with the new Florence2VisionBackbone and Florence2MultiModalProjector, matching the HF-official implementation
Updated weight remapping in load_weights to match the new HF checkpoint layout
Fixed multi-modal token handling to correctly align vision embeddings with the placeholder tokens already inserted by Florence2Processor

Bug fixes (vllm_bart_plugin/florence2.py, vllm_bart_plugin/bart.py):
I'm not 100% sure whether both bugs were present before, but fixing them resolved the degraded generation previously seen in the plugin. Generation quality is now on par with transformers, and repetition_penalty no longer needs to be set to avoid the previously seen <s>...<s> loops.

Fixed decoder prompt generation: decoder_start_token_id is None at the top-level Florence2Config and must be read from text_config.
Fixed cross-attention KV cache coverage: the multimodal placeholder now spans the full encoder input (image + task-prompt tokens) so vLLM allocates and reads KV cache slots for all encoder tokens, not just the image tokens.
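
The first fix above can be sketched as a small fallback lookup. This is a minimal illustration, not the plugin's actual code: the helper name resolve_decoder_start_token_id and the token id 2 are assumptions made for the example, and SimpleNamespace objects stand in for Florence2Config and its text_config.

```python
from types import SimpleNamespace


def resolve_decoder_start_token_id(config):
    """Return decoder_start_token_id, falling back to config.text_config.

    Hypothetical helper: mirrors the described fix, where the top-level
    config carries None and the real value lives on text_config.
    """
    token_id = getattr(config, "decoder_start_token_id", None)
    if token_id is None:
        text_config = getattr(config, "text_config", None)
        token_id = getattr(text_config, "decoder_start_token_id", None)
    if token_id is None:
        raise ValueError("decoder_start_token_id not found in config or text_config")
    return token_id


# Stand-in for a Florence2Config; the id value 2 is illustrative only.
florence_like_config = SimpleNamespace(
    decoder_start_token_id=None,
    text_config=SimpleNamespace(decoder_start_token_id=2),
)

start_id = resolve_decoder_start_token_id(florence_like_config)
```

Raising when neither level defines the value is a design choice for the sketch: it surfaces a misconfigured checkpoint early instead of silently building a decoder prompt around None.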

Tests (tests/test_florence2.py):

CPU unit tests for all new vision components
GPU integration tests (@pytest.mark.slow) for all tasks
Uses florence-community/Florence-2-base-ft
Updated BART code and tests to pass with new vLLM version

Formatting note

black and isort were only run on the files changed in this PR (vllm_bart_plugin/florence2.py and tests/test_florence2.py). The rest of the codebase was not uniformly formatted beforehand, so formatting it all here would produce a noisy diff. Happy to run it across the whole project if you'd prefer to do it as part of this PR or a follow-up.

Testing

# Unit tests (CPU only)
pytest tests/test_florence2.py -m "not slow"

# Integration tests (requires GPU + HF Hub access)
pytest tests/test_florence2.py -m slow

Signed-off-by: Carles Onielfa <carlesonielfa@gmail.com>

carlesonielfa · 2026-03-10T15:28:10Z

Hi @NickLucche , would you mind taking a look when you get a chance?

I didn't find any contribution guidelines so I did my best to match the project's style.

The PR updates the plugin to support the official HF florence_vision architecture and makes it compatible with newer versions of vLLM and transformers.

Happy to make any adjustments!

NickLucche · 2026-03-12T12:54:53Z

Thanks for contributing @carlesonielfa ! cc @Isotr0py for florence related PR.

I'm not 100% sure if both were present before, but solving these fixed the degraded generation of the model that was present previously in the plugin

I think the tok_kwargs change was necessary only with more recent vLLM versions; which one did you use to test the model?

carlesonielfa · 2026-03-12T13:46:41Z

I observed the degraded generation with the plugin's 0.2.0 version and vLLM 0.14.1.

With <DETAILED_CAPTION>, the generated text does not match what the model produces when running with transformers. For example, it starts the generation with a lowercase "a", which the model never does on transformers.
With <MORE_DETAILED_CAPTION>, it just loops generating <s>, even with the repetition penalty at 1.5.

I tested my PR with vLLM 0.16.0 and that behavior was corrected: even without setting repetition_penalty, the generation closely matches that of transformers for both prompts.

However, when testing more intensively, for very simple images (a black square, simple icons) the model still gets stuck in a generation loop, and increasing the repetition penalty just makes it generate hallucinated text.

Isotr0py

Nice! It will be great to use transformers' implementation, especially since the original one quite lacks maintenance. 😅

Left two initial comments. PTAL!

Isotr0py · 2026-03-12T15:40:53Z

+        def _remap(weights: Iterable[tuple[str, torch.Tensor]]):
+            for name, param in weights:
+                # HF checkpoint layout (Florence2ForConditionalGeneration):
+                #   model.vision_tower.*           -> vision_tower.*
+                #   model.multi_modal_projector.*  -> multi_modal_projector.*
+                #   model.language_model.*         -> language_model.model.*
+                #       (HF uses BartModel directly; our wrapper adds .model)
+                #   lm_head.*                      -> language_model.lm_head.*
+                if name.startswith("model.vision_tower."):
+                    name = name[len("model.") :]
+                elif name.startswith("model.multi_modal_projector."):
+                    name = name[len("model.") :]
+                elif name.startswith("model.language_model."):
+                    name = (
+                        "language_model.model." + name[len("model.language_model.") :]
+                    )
+                elif name.startswith("lm_head."):
+                    name = "language_model." + name
+                yield name, param


We can use WeightsMapper here, you can refer to other VLMs in vLLM:
https://github.com/vllm-project/vllm/blob/a1257fd1ea93da6e27b31e4739ac2707781d8ba7/vllm/model_executor/models/qwen3_vl.py#L1372-L1379

nice, done!
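
For reference, the if/elif chain in the diff above amounts to a prefix-to-prefix table, which is the kind of declarative mapping the reviewer is suggesting. The sketch below is plain Python and does not use vLLM's WeightsMapper itself; PREFIX_MAP, remap_name, and remap_weights are hypothetical names, and the real change should follow the linked qwen3_vl.py example.

```python
from typing import Any, Iterable, Iterator

# Checkpoint prefix -> plugin module prefix, taken from the comments in the diff.
PREFIX_MAP = {
    "model.vision_tower.": "vision_tower.",
    "model.multi_modal_projector.": "multi_modal_projector.",
    "model.language_model.": "language_model.model.",
    "lm_head.": "language_model.lm_head.",
}


def remap_name(name: str, prefix_map: dict = PREFIX_MAP) -> str:
    """Rewrite the first matching prefix; leave other names unchanged."""
    for old_prefix, new_prefix in prefix_map.items():
        if name.startswith(old_prefix):
            return new_prefix + name[len(old_prefix):]
    return name


def remap_weights(weights: Iterable[tuple[str, Any]]) -> Iterator[tuple[str, Any]]:
    """Lazily yield (remapped_name, tensor) pairs, like _remap in the diff."""
    for name, param in weights:
        yield remap_name(name), param


# Placeholder values stand in for tensors in this example.
example = list(
    remap_weights(
        [
            ("model.vision_tower.conv.weight", 0),
            ("model.language_model.encoder.embed_tokens.weight", 1),
            ("lm_head.weight", 2),
        ]
    )
)
```

Keeping the table as data makes the mapping easy to review against the checkpoint layout, which is the main benefit of the declarative approach.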

Isotr0py · 2026-03-12T16:23:51Z

+            f"only Florence Vision is supported for now. "
+            f"Received model type: {config.vision_config.model_type}"
+        )
+        self.vision_tower = Florence2VisionBackbone(config.vision_config)


I think we can now import Florence2VisionBackbone directly from transformers, so we no longer need to copy the ViT implementation here :)

perfect! Implemented, that simplifies it a lot

NickLucche · 2026-03-13T08:55:42Z

@carlesonielfa thanks for elaborating, would you mind opening a separate PR for the bart add_special_tokens fix, so we can better refer to the fix in isolation?
I think I also came across something similar. I think that's related to the difference in how vLLM handles first token generation in enc-dec models wrt how hf does that (first token is forced to bos).

carlesonielfa · 2026-03-13T09:39:25Z

@carlesonielfa thanks for elaborating, would you mind opening a separate PR for the bart add_special_tokens fix, so we can better refer to the fix in isolation? I think I also came across something similar. I think that's related to the difference in how vLLM handles first token generation in enc-dec models wrt how hf does that (first token is forced to bos).

Sure! Removed the fix from this PR and opened a new one for it here: #12

Worth noting that the standalone PR won't run on newer vLLM versions without the import refactors across bart.py and florence2.py. The florence2.py import changes in this PR are intertwined with other modifications, so isolating just the vLLM-compatible import refactor would require some extra work. Happy to do that if you'd like, but merging this PR first would be the simpler path.

NickLucche · 2026-03-13T10:31:56Z

I think I had some changes related to updating the version in #6 (to be added once we cut a release version here).

NickLucche · 2026-03-13T10:33:40Z

@carlesonielfa just realized in #6 I was defaulting to False.
Could you add a snippet/test to #12 highlighting the change in output for bart with that parameter forced to True?

carlesonielfa · 2026-03-13T10:50:18Z

@NickLucche I'm now realizing I misunderstood your first comment.

I only tested the output quality for Florence, and I thought you were referring to that in your first comment because those were the bug fixes I mentioned in the PR description. I now notice I didn't explain the bart.py change at all, which is what you were actually asking about.

That change was prompted by a runtime error: without it, several test_model_inference tests fail with got multiple values for keyword argument 'add_special_tokens'.
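
This class of error generally appears when a keyword is passed explicitly and is also present in an unpacked kwargs dict. The sketch below reproduces it in isolation; tokenize and tok_kwargs here are hypothetical stand-ins, not the plugin's or vLLM's actual call sites.

```python
def tokenize(text, add_special_tokens=True, **kwargs):
    """Hypothetical stand-in for a tokenizer call."""
    return {"text": text, "add_special_tokens": add_special_tokens}


# Assume an upstream caller already put the flag into the kwargs dict.
tok_kwargs = {"add_special_tokens": False}


def call_with_collision():
    # The flag arrives twice: once explicitly, once via **tok_kwargs.
    # Python rejects this with a TypeError when the call is made.
    return tokenize("hello", add_special_tokens=False, **tok_kwargs)


def call_merged():
    # Merge into one dict first so the flag is passed exactly once;
    # the later key in the literal overrides the value from tok_kwargs.
    merged = {**tok_kwargs, "add_special_tokens": False}
    return tokenize("hello", **merged)


try:
    call_with_collision()
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

result = call_merged()
```

Which side should win when both supply the flag is a policy decision; the merged variant lets the explicit value take precedence purely for illustration.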

It's totally possible that your changes in #6 are sufficient! My tweak in this case was just for compatibility.

Isotr0py

The model implementation LGTM as long as we also clean up the multimodal projector copy.

Isotr0py · 2026-03-13T13:30:38Z

-
-class ConvEmbed(nn.Module):
-    """ Image to Patch Embedding
+class Florence2MultiModalProjector(nn.Module):


We can import from transformers as well.
https://github.com/huggingface/transformers/blob/3dd82faf3e887043db772d4c1191ec40271a1584/src/transformers/models/florence2/modeling_florence2.py#L573-L595

Isotr0py · 2026-03-13T13:35:50Z

-    """
-    This module learns positional embeddings up to a fixed maximum size.
-    """
+class Florence2VisionLearnedAbsolutePositionEmbedding2D(nn.Module):


Isotr0py · 2026-03-13T13:35:56Z

+        return pos.permute(2, 0, 1).unsqueeze(0).expand(x.shape[0], -1, -1, -1)
+
+
+class Florence2VisionPositionalEmbeddingCosine1D(nn.Module):


carlesonielfa · 2026-03-13T14:34:35Z

@Isotr0py Nice! Done, cleaned it up even more

Isotr0py · 2026-03-13T16:29:35Z

Can you also update the readme to point to HF Florence-2 models?

florence-community/Florence-2-base

florence-community/Florence-2-large

Sure! I also removed the trust_remote_code and alternative tokenizer mentions, since they are no longer needed.

NickLucche · 2026-03-13T19:02:48Z

description, now I'm noticing I didn't explain the bart.py change at all which is what you were actually asking about.

Ok I got it, do you want to rebase on top of #6 and test whether that is working? In that case we could merge #6->bump vllm requirement->follow-up with this PR

carlesonielfa · 2026-03-19T16:31:21Z

Sure! I'm happy to do it. However, I'll be off until the start of April. If you'd like to move things forward in the meantime, feel free to go ahead. Otherwise, I'll pick this up when I'm back.

carlesonielfa added 3 commits March 6, 2026 12:28

vllm 0.16.0 support

4bc4975

Signed-off-by: Carles Onielfa <carlesonielfa@gmail.com>

Support offical HF implementation and transformers > 5

b42b288

Signed-off-by: Carles Onielfa <carlesonielfa@gmail.com>

add tests

7cdec11

Signed-off-by: Carles Onielfa <carlesonielfa@gmail.com>

carlesonielfa force-pushed the master branch from f650ce0 to 7cdec11 Compare March 6, 2026 11:28

No longer require trust_remote_code

bb29009

Signed-off-by: Carles Onielfa <carlesonielfa@gmail.com>

carlesonielfa force-pushed the master branch from f876bdb to bb29009 Compare March 6, 2026 11:48

Fix generation and add better tests

b60cb01

Signed-off-by: Carles Onielfa <carlesonielfa@gmail.com>

carlesonielfa force-pushed the master branch from c68d685 to b60cb01 Compare March 10, 2026 11:17

carlesonielfa added 2 commits March 10, 2026 12:26

Fix florence model name

e740733

Signed-off-by: Carles Onielfa <carlesonielfa@gmail.com>

Fix BART tests

19d1c6e

Signed-off-by: Carles Onielfa <carlesonielfa@gmail.com>

Isotr0py reviewed Mar 12, 2026

Simplify by addressing comments

440d8e2

Signed-off-by: Carles Onielfa <carlesonielfa@gmail.com>

carlesonielfa force-pushed the master branch from b5334df to 440d8e2 Compare March 13, 2026 08:57

Remove BART add_special_tokens fix

0a00dbb

Signed-off-by: Carles Onielfa <carlesonielfa@gmail.com>

carlesonielfa force-pushed the master branch from 7e9e8c5 to 0a00dbb Compare March 13, 2026 09:31

Isotr0py approved these changes Mar 13, 2026

Even simpler

5e78020

Signed-off-by: Carles Onielfa <carlesonielfa@gmail.com>

Isotr0py reviewed Mar 13, 2026

Update README

07483dc

Signed-off-by: Carles Onielfa <carlesonielfa@gmail.com>
