Add support for exporting the Qwen3-VL-Embedding model #1686
xiangfuxwang wants to merge 1 commit into huggingface:main
Conversation
rkazants left a comment:

Please add tests, update the documentation, and provide a proper PR description.
Pull request overview
Adds OpenVINO exporter support for the Qwen3-VL-Embedding model by extending the OpenVINO model config registry and introducing a dedicated VLM export config / dummy input generation path for `qwen3_vl` under `feature-extraction`.
Changes:
- Register `qwen3_vl` custom class mappings for `feature-extraction` (and `image-text-to-text`).
- Add a Qwen3-VL-specific dummy vision input generator (`pixel_values`, `image_grid_thw`).
- Add a new OpenVINO config for `qwen3_vl` intended to enable `feature-extraction` export.
| "image-text-to-text", | ||
| ], | ||
| library_name="transformers", | ||
| ) | ||
| class Qwen3VLOpenVINOConfig(BaseVLMOpenVINOConfig): |
This file already defines and registers another `Qwen3VLOpenVINOConfig` for `qwen3_vl` later (used for `image-text-to-text`). Introducing a second class with the same name and overlapping `@register_in_tasks_manager("qwen3_vl", ...)` decorators makes the registry behavior order-dependent and the symbol name ambiguous. Consider renaming this new config (e.g., `Qwen3VLEmbeddingOpenVINOConfig`) and registering it only for the `feature-extraction` task to avoid accidental overrides.
| "image-text-to-text", | |
| ], | |
| library_name="transformers", | |
| ) | |
| class Qwen3VLOpenVINOConfig(BaseVLMOpenVINOConfig): | |
| ], | |
| library_name="transformers", | |
| ) | |
| class Qwen3VLEmbeddingOpenVINOConfig(BaseVLMOpenVINOConfig): |
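With that rename, both configs can be registered for `qwen3_vl` without colliding. A minimal sketch of how the two registrations would coexist (the exact decorator arguments are assumptions modeled on the diff above):

```python
@register_in_tasks_manager("qwen3_vl", *["feature-extraction"], library_name="transformers")
class Qwen3VLEmbeddingOpenVINOConfig(BaseVLMOpenVINOConfig):
    ...  # new feature-extraction (embedding) export config


@register_in_tasks_manager("qwen3_vl", *["image-text-to-text"], library_name="transformers")
class Qwen3VLOpenVINOConfig(BaseVLMOpenVINOConfig):
    ...  # existing image-text-to-text export config
```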
```python
            config=config,
            task=task,
            int_dtype=int_dtype,
            float_dtype=float_dtype,
```
`behavior` passed to `__init__` is currently ignored because it isn't forwarded to `BaseVLMOpenVINOConfig.__init__` (which sets `self._behavior`). This can break `with_behavior(...)` / multi-part VLM export because the instance will always behave as `VISION_EMBEDDINGS`. Pass `behavior=behavior` to `super().__init__(...)` or set `self._behavior = behavior` after calling `super()`.
Suggested change:
```diff
             float_dtype=float_dtype,
+            behavior=behavior,
```
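A minimal sketch of the corrected constructor under that suggestion; the parameter order, defaults, and the behavior enum name are assumptions modeled on the other multi-part VLM configs in this file:

```python
class Qwen3VLEmbeddingOpenVINOConfig(BaseVLMOpenVINOConfig):
    def __init__(
        self,
        config,
        task="feature-extraction",
        int_dtype="int64",
        float_dtype="fp32",
        behavior=Qwen3VLConfigBehavior.VISION_EMBEDDINGS,  # assumed enum name; the review only names VISION_EMBEDDINGS
    ):
        super().__init__(
            config=config,
            task=task,
            int_dtype=int_dtype,
            float_dtype=float_dtype,
            behavior=behavior,  # forwarded so with_behavior(...) keeps working
        )
```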
```python
            return {}
        return {
            "pixel_values": {0: "batch_size", 1: "channels", 2: "temporal_patch_size", 3: "patch_height", 4: "patch_width"},
            "image_grid_thw": {0: "num_images", 1: "3"}
```
In the inputs mapping, using the literal string `"3"` as a dimension name for `image_grid_thw` is inconsistent with the rest of the exporter configs (dimension names are descriptive identifiers). Rename this axis to something semantic (e.g. `grid_dims`/`thw`) or omit it if it is intended to be a fixed size.
| "image_grid_thw": {0: "num_images", 1: "3"} | |
| "image_grid_thw": {0: "num_images", 1: "thw"} |
```python
        # For feature-extraction task, we need to generate inputs for get_image_features method
        import torch
        # Only return the inputs that the model actually accepts
        # Use shape [batch_size, 3, 2, 16, 16] for pixel_values
        dummy_inputs = {
            "pixel_values": torch.randn(1, 3, 2, 16, 16, dtype=torch.float32),
            "image_grid_thw": torch.tensor([[1, 16, 16]], dtype=torch.int64)
        }
        return dummy_inputs
```
`generate_dummy_inputs` hard-codes Torch tensors with `float32`/`int64` dtypes and fixed shapes, ignoring `framework`, `float_dtype`, `int_dtype`, and user-provided `input_shapes`. This can cause dtype mismatches (e.g. fp16 export) and makes input-shape overrides ineffective. Prefer using the standard dummy input generator flow (`super().generate_dummy_inputs(...)`) with a dedicated `DummyInputGenerator` that respects the dtype/shape parameters.
Suggested change:
```diff
-        # For feature-extraction task, we need to generate inputs for get_image_features method
-        import torch
-        # Only return the inputs that the model actually accepts
-        # Use shape [batch_size, 3, 2, 16, 16] for pixel_values
-        dummy_inputs = {
-            "pixel_values": torch.randn(1, 3, 2, 16, 16, dtype=torch.float32),
-            "image_grid_thw": torch.tensor([[1, 16, 16]], dtype=torch.int64)
-        }
-        return dummy_inputs
+        # Reuse the standard dummy input generator flow so framework, dtype,
+        # and caller-provided shape overrides are all respected.
+        return super().generate_dummy_inputs(framework=framework, **kwargs)
```
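For that `super()` call to actually emit `pixel_values` / `image_grid_thw`, the dedicated generator has to be wired into the config. A minimal sketch, assuming the standard `DUMMY_INPUT_GENERATOR_CLASSES` attribute used by the optimum exporter configs:

```python
class Qwen3VLEmbeddingOpenVINOConfig(BaseVLMOpenVINOConfig):
    # The base generate_dummy_inputs() iterates over these generator classes and
    # calls generate(input_name, framework, int_dtype, float_dtype) per input,
    # so dtype and shape overrides are handled in one place.
    DUMMY_INPUT_GENERATOR_CLASSES = (DummyQwen3VLInputGenerator,)
```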
```python
    def generate(self, input_name: str, framework: str = "pt", int_dtype: str = "int64", float_dtype: str = "fp32"):
        if input_name == "pixel_values":
            # For Qwen3-VL-Embedding, the input shape is [batch_size, 3, 2, 16, 16]
            return self.random_float_tensor(
                [self.batch_size, 3, 2, 16, 16], framework=framework, dtype=float_dtype
            )
        if input_name == "image_grid_thw":
            # For Qwen3-VL-Embedding, the input shape is [num_images, 3]
            return self.random_int_tensor(
                [1, 3], min_value=1, max_value=16, framework=framework, dtype=int_dtype
            )
```
`DummyQwen3VLInputGenerator.generate` hard-codes `[batch, 3, 2, 16, 16]` / `[1, 3]` shapes and doesn't use `normalized_config` (or the provided `width`/`height`) to derive shapes. This risks producing invalid dummy inputs for other Qwen3-VL checkpoints or future config changes. Derive these dimensions from the model config (e.g., patch size / temporal patch size / image size), or at least thread through the dummy-shape kwargs so callers can override them.
Suggested change, replacing the hard-coded `generate` above (the class line and `__init__` header below are assumptions added to make the snippet self-contained, mirroring `DummyVisionInputGenerator`; the suggestion itself only showed the bodies):

```python
class DummyQwen3VLInputGenerator(DummyVisionInputGenerator):
    # NOTE: assumed signature, mirroring DummyVisionInputGenerator
    def __init__(self, task, normalized_config, batch_size=1, num_channels=3, width=224, height=224, **kwargs):
        super().__init__(task, normalized_config, batch_size=batch_size, num_channels=num_channels, width=width, height=height, **kwargs)
        self.patch_size = max(
            1,
            int(
                kwargs.get(
                    "patch_size",
                    getattr(normalized_config, "patch_size", getattr(normalized_config, "image_patch_size", 1)),
                )
            ),
        )
        self.temporal_patch_size = max(
            1,
            int(
                kwargs.get(
                    "temporal_patch_size",
                    getattr(normalized_config, "temporal_patch_size", getattr(normalized_config, "video_temporal_patch_size", 1)),
                )
            ),
        )
        self.num_channels = int(kwargs.get("num_channels", getattr(normalized_config, "num_channels", num_channels)))
        self.width = int(kwargs.get("width", getattr(normalized_config, "image_size", width)))
        self.height = int(kwargs.get("height", getattr(normalized_config, "image_size", height)))

    def _grid_size(self, size: int) -> int:
        return max(1, (int(size) + self.patch_size - 1) // self.patch_size)

    def generate(self, input_name: str, framework: str = "pt", int_dtype: str = "int64", float_dtype: str = "fp32"):
        grid_h = self._grid_size(self.height)
        grid_w = self._grid_size(self.width)
        padded_height = grid_h * self.patch_size
        padded_width = grid_w * self.patch_size
        if input_name == "pixel_values":
            return self.random_float_tensor(
                [self.batch_size, self.num_channels, self.temporal_patch_size, padded_height, padded_width],
                framework=framework,
                dtype=float_dtype,
            )
        if input_name == "image_grid_thw":
            max_grid_dim = max(self.temporal_patch_size, grid_h, grid_w)
            return self.random_int_tensor(
                [self.batch_size, 3],
                min_value=1,
                max_value=max_grid_dim,
                framework=framework,
                dtype=int_dtype,
            )
```
```python
# Create a simple patcher that doesn't rely on get_image_features or ModelPatcher
class Qwen3VLImageEmbeddingsModelPatcher:
    def __init__(self, config, model, model_kwargs=None):
        self.config = config
        self.model = model
        self.model_kwargs = model_kwargs
        # Patch the forward method directly
        self.orig_forward = model.forward
        model.forward = self.patched_forward

    def patched_forward(self, pixel_values, image_grid_thw, **kwargs):
        # Get the original output
        output = self.orig_forward(pixel_values, image_grid_thw, **kwargs)
        # Return only the last_hidden_state to avoid type inference issues
        return output.last_hidden_state

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Restore the original forward method
        self.model.forward = self.orig_forward
```
The custom patcher returned by `patch_model_for_export` doesn't follow the `ModelPatcher` contract used by `convert.py` (which wraps `patcher.patched_forward`, assumes a dict-like output with `.values()`, and may reassign `patcher.patched_forward`). Here, `model.forward` is patched in `__init__` and the patched forward returns a raw tensor, so future changes (or different export paths) can easily break. Prefer implementing this as a proper `ModelPatcher` subclass (or reuse `CommonImageEmbeddingsModelPatcher`) so `patched_forward` returns a dict keyed by `config.outputs` and patching is applied in `__enter__`/`__exit__`.
Suggested change, replacing the ad-hoc class above (the original forward is called via `self.orig_forward`, which `ModelPatcher.__init__` captures before patching, so the patched forward doesn't recurse into itself):

```python
class Qwen3VLImageEmbeddingsModelPatcher(ModelPatcher):
    def patched_forward(self, pixel_values, image_grid_thw, **kwargs):
        output = self.orig_forward(pixel_values, image_grid_thw, **kwargs)
        # Return a dict keyed by the config's declared outputs, as convert.py expects
        output_name = next(iter(self.config.outputs))
        return {output_name: output.last_hidden_state}
```
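The config would then hand this patcher to the export pipeline through the standard hook; a minimal sketch (placement inside the export config class is assumed):

```python
def patch_model_for_export(self, model, model_kwargs=None):
    # convert.py enters the returned patcher as a context manager during tracing,
    # so patching/unpatching happens in __enter__/__exit__ as expected.
    return Qwen3VLImageEmbeddingsModelPatcher(self, model, model_kwargs=model_kwargs)
```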
```python
    def patched_forward(self, pixel_values, image_grid_thw, **kwargs):
        # Get the original output
        output = self.orig_forward(pixel_values, image_grid_thw, **kwargs)
        # Return only the last_hidden_state to avoid type inference issues
        return output.last_hidden_state
```

A more defensive variant was also suggested, tolerating both tuple and object outputs:

```python
    def patched_forward(self, pixel_values, image_grid_thw, **kwargs):
        output = self.orig_forward(pixel_values, image_grid_thw, **kwargs)
        if isinstance(output, tuple):
            return output[0]
        return output.last_hidden_state if hasattr(output, "last_hidden_state") else output
```
Hi, when can this PR be reviewed and merged?
Export support for the Qwen3-VL-Embedding model has been added. The main modification is in the `exporters/openvino/model_configs.py` file. The export command line was tested: `optimum-cli export openvino --model Qwen3-VL-Embedding-2B Qwen3-VL-Embedding-2B-ov-fp16 --task feature-extraction --weight-format fp16`. Initial testing showed that the exported model is usable, and the export process completed without errors.
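For reference, a minimal smoke test of the exported IR with the OpenVINO runtime could look like the sketch below; the IR file name is the optimum-cli default, and the input shapes follow the dummy inputs used in this PR:

```python
import numpy as np
import openvino as ov

core = ov.Core()
# openvino_model.xml is the default file name produced by optimum-cli export
compiled = core.compile_model("Qwen3-VL-Embedding-2B-ov-fp16/openvino_model.xml", "CPU")
result = compiled({
    "pixel_values": np.random.rand(1, 3, 2, 16, 16).astype(np.float32),
    "image_grid_thw": np.array([[1, 16, 16]], dtype=np.int64),
})
```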