Add support for exporting the Qwen3-VL-Embedding model #1686
xiangfuxwang wants to merge 1 commit into huggingface:main
Conversation
rkazants left a comment:

Please add tests, update the documentation, and provide a proper PR description.
Pull request overview
Adds OpenVINO exporter support for the Qwen3-VL-Embedding model by extending the OpenVINO model config registry and introducing a dedicated VLM export config / dummy input generation path for `qwen3_vl` under `feature-extraction`.
Changes:
- Register `qwen3_vl` custom class mappings for `feature-extraction` (and `image-text-to-text`).
- Add a Qwen3-VL-specific dummy vision input generator (`pixel_values`, `image_grid_thw`).
- Add a new OpenVINO config for `qwen3_vl` intended to enable `feature-extraction` export.
| "image-text-to-text", | ||
| ], | ||
| library_name="transformers", | ||
| ) | ||
| class Qwen3VLOpenVINOConfig(BaseVLMOpenVINOConfig): |
This file already defines and registers another `Qwen3VLOpenVINOConfig` for `qwen3_vl` later (used for `image-text-to-text`). Introducing a second class with the same name and overlapping `@register_in_tasks_manager("qwen3_vl", ...)` decorators makes the registry behavior order-dependent and the symbol name ambiguous. Consider renaming this new config (e.g., `Qwen3VLEmbeddingOpenVINOConfig`) and registering it only for the `feature-extraction` task to avoid accidental overrides.
| "image-text-to-text", | |
| ], | |
| library_name="transformers", | |
| ) | |
| class Qwen3VLOpenVINOConfig(BaseVLMOpenVINOConfig): | |
| ], | |
| library_name="transformers", | |
| ) | |
| class Qwen3VLEmbeddingOpenVINOConfig(BaseVLMOpenVINOConfig): |
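With that rename, both configs can be registered for `qwen3_vl` without colliding. A minimal sketch of how the two registrations would coexist (the exact decorator arguments are assumptions modeled on the diff above):

```python
@register_in_tasks_manager("qwen3_vl", *["feature-extraction"], library_name="transformers")
class Qwen3VLEmbeddingOpenVINOConfig(BaseVLMOpenVINOConfig):
    ...  # new feature-extraction (embedding) export config


@register_in_tasks_manager("qwen3_vl", *["image-text-to-text"], library_name="transformers")
class Qwen3VLOpenVINOConfig(BaseVLMOpenVINOConfig):
    ...  # existing image-text-to-text export config
```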
```python
            config=config,
            task=task,
            int_dtype=int_dtype,
            float_dtype=float_dtype,
```
`behavior` passed to `__init__` is currently ignored because it isn't forwarded to `BaseVLMOpenVINOConfig.__init__` (which sets `self._behavior`). This can break `with_behavior(...)` / multi-part VLM export because the instance will always behave as `VISION_EMBEDDINGS`. Pass `behavior=behavior` to `super().__init__(...)` or set `self._behavior = behavior` after calling `super()`.
Suggested change:
```diff
             float_dtype=float_dtype,
+            behavior=behavior,
```
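A minimal sketch of the corrected constructor under that suggestion; the parameter order, defaults, and the behavior enum name are assumptions modeled on the other multi-part VLM configs in this file:

```python
class Qwen3VLEmbeddingOpenVINOConfig(BaseVLMOpenVINOConfig):
    def __init__(
        self,
        config,
        task="feature-extraction",
        int_dtype="int64",
        float_dtype="fp32",
        behavior=Qwen3VLConfigBehavior.VISION_EMBEDDINGS,  # assumed enum name; the review only names VISION_EMBEDDINGS
    ):
        super().__init__(
            config=config,
            task=task,
            int_dtype=int_dtype,
            float_dtype=float_dtype,
            behavior=behavior,  # forwarded so with_behavior(...) keeps working
        )
```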
```python
            return {}
        return {
            "pixel_values": {0: "batch_size", 1: "channels", 2: "temporal_patch_size", 3: "patch_height", 4: "patch_width"},
            "image_grid_thw": {0: "num_images", 1: "3"}
```
In the inputs mapping, using the literal string `"3"` as a dimension name for `image_grid_thw` is inconsistent with the rest of the exporter configs (dimension names are descriptive identifiers). Rename this axis to something semantic (e.g. `grid_dims`/`thw`) or omit it if it is intended to be a fixed size.
| "image_grid_thw": {0: "num_images", 1: "3"} | |
| "image_grid_thw": {0: "num_images", 1: "thw"} |
```python
        # For feature-extraction task, we need to generate inputs for get_image_features method
        import torch
        # Only return the inputs that the model actually accepts
        # Use shape [batch_size, 3, 2, 16, 16] for pixel_values
        dummy_inputs = {
            "pixel_values": torch.randn(1, 3, 2, 16, 16, dtype=torch.float32),
            "image_grid_thw": torch.tensor([[1, 16, 16]], dtype=torch.int64)
        }
        return dummy_inputs
```
`generate_dummy_inputs` hard-codes Torch tensors with `float32`/`int64` dtypes and fixed shapes, ignoring `framework`, `float_dtype`, `int_dtype`, and user-provided `input_shapes`. This can cause dtype mismatches (e.g. fp16 export) and makes input-shape overrides ineffective. Prefer using the standard dummy input generator flow (`super().generate_dummy_inputs(...)`) with a dedicated `DummyInputGenerator` that respects the dtype/shape parameters.
Suggested change:
```diff
-        # For feature-extraction task, we need to generate inputs for get_image_features method
-        import torch
-        # Only return the inputs that the model actually accepts
-        # Use shape [batch_size, 3, 2, 16, 16] for pixel_values
-        dummy_inputs = {
-            "pixel_values": torch.randn(1, 3, 2, 16, 16, dtype=torch.float32),
-            "image_grid_thw": torch.tensor([[1, 16, 16]], dtype=torch.int64)
-        }
-        return dummy_inputs
+        # Reuse the standard dummy input generator flow so framework, dtype,
+        # and caller-provided shape overrides are all respected.
+        return super().generate_dummy_inputs(framework=framework, **kwargs)
```
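For that `super()` call to actually emit `pixel_values` / `image_grid_thw`, the dedicated generator has to be wired into the config. A minimal sketch, assuming the standard `DUMMY_INPUT_GENERATOR_CLASSES` attribute used by the optimum exporter configs:

```python
class Qwen3VLEmbeddingOpenVINOConfig(BaseVLMOpenVINOConfig):
    # The base generate_dummy_inputs() iterates over these generator classes and
    # calls generate(input_name, framework, int_dtype, float_dtype) per input,
    # so dtype and shape overrides are handled in one place.
    DUMMY_INPUT_GENERATOR_CLASSES = (DummyQwen3VLInputGenerator,)
```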
```python
    def generate(self, input_name: str, framework: str = "pt", int_dtype: str = "int64", float_dtype: str = "fp32"):
        if input_name == "pixel_values":
            # For Qwen3-VL-Embedding, the input shape is [batch_size, 3, 2, 16, 16]
            return self.random_float_tensor(
                [self.batch_size, 3, 2, 16, 16], framework=framework, dtype=float_dtype
            )
        if input_name == "image_grid_thw":
            # For Qwen3-VL-Embedding, the input shape is [num_images, 3]
            return self.random_int_tensor(
                [1, 3], min_value=1, max_value=16, framework=framework, dtype=int_dtype
            )
```
`DummyQwen3VLInputGenerator.generate` hard-codes `[batch, 3, 2, 16, 16]` / `[1, 3]` shapes and doesn't use `normalized_config` (or the provided `width`/`height`) to derive shapes. This risks producing invalid dummy inputs for other Qwen3-VL checkpoints or future config changes. Derive these dimensions from the model config (e.g., patch size / temporal patch size / image size), or at least thread through the dummy-shape kwargs so callers can override them.
Suggested change, replacing the hard-coded `generate` above (the class line and `__init__` header below are assumptions added to make the snippet self-contained, mirroring `DummyVisionInputGenerator`; the suggestion itself only showed the bodies):

```python
class DummyQwen3VLInputGenerator(DummyVisionInputGenerator):
    # NOTE: assumed signature, mirroring DummyVisionInputGenerator
    def __init__(self, task, normalized_config, batch_size=1, num_channels=3, width=224, height=224, **kwargs):
        super().__init__(task, normalized_config, batch_size=batch_size, num_channels=num_channels, width=width, height=height, **kwargs)
        self.patch_size = max(
            1,
            int(
                kwargs.get(
                    "patch_size",
                    getattr(normalized_config, "patch_size", getattr(normalized_config, "image_patch_size", 1)),
                )
            ),
        )
        self.temporal_patch_size = max(
            1,
            int(
                kwargs.get(
                    "temporal_patch_size",
                    getattr(normalized_config, "temporal_patch_size", getattr(normalized_config, "video_temporal_patch_size", 1)),
                )
            ),
        )
        self.num_channels = int(kwargs.get("num_channels", getattr(normalized_config, "num_channels", num_channels)))
        self.width = int(kwargs.get("width", getattr(normalized_config, "image_size", width)))
        self.height = int(kwargs.get("height", getattr(normalized_config, "image_size", height)))

    def _grid_size(self, size: int) -> int:
        return max(1, (int(size) + self.patch_size - 1) // self.patch_size)

    def generate(self, input_name: str, framework: str = "pt", int_dtype: str = "int64", float_dtype: str = "fp32"):
        grid_h = self._grid_size(self.height)
        grid_w = self._grid_size(self.width)
        padded_height = grid_h * self.patch_size
        padded_width = grid_w * self.patch_size
        if input_name == "pixel_values":
            return self.random_float_tensor(
                [self.batch_size, self.num_channels, self.temporal_patch_size, padded_height, padded_width],
                framework=framework,
                dtype=float_dtype,
            )
        if input_name == "image_grid_thw":
            max_grid_dim = max(self.temporal_patch_size, grid_h, grid_w)
            return self.random_int_tensor(
                [self.batch_size, 3],
                min_value=1,
                max_value=max_grid_dim,
                framework=framework,
                dtype=int_dtype,
            )
```
```python
# Create a simple patcher that doesn't rely on get_image_features or ModelPatcher
class Qwen3VLImageEmbeddingsModelPatcher:
    def __init__(self, config, model, model_kwargs=None):
        self.config = config
        self.model = model
        self.model_kwargs = model_kwargs
        # Patch the forward method directly
        self.orig_forward = model.forward
        model.forward = self.patched_forward

    def patched_forward(self, pixel_values, image_grid_thw, **kwargs):
        # Get the original output
        output = self.orig_forward(pixel_values, image_grid_thw, **kwargs)
        # Return only the last_hidden_state to avoid type inference issues
        return output.last_hidden_state

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Restore the original forward method
        self.model.forward = self.orig_forward
```
The custom patcher returned by `patch_model_for_export` doesn't follow the `ModelPatcher` contract used by `convert.py` (which wraps `patcher.patched_forward`, assumes a dict-like output with `.values()`, and may reassign `patcher.patched_forward`). Here, `model.forward` is patched in `__init__` and the patched forward returns a raw tensor, so future changes (or different export paths) can easily break. Prefer implementing this as a proper `ModelPatcher` subclass (or reuse `CommonImageEmbeddingsModelPatcher`) so `patched_forward` returns a dict keyed by `config.outputs` and patching is applied in `__enter__`/`__exit__`.
Suggested change, replacing the ad-hoc class above (the original forward is called via `self.orig_forward`, which `ModelPatcher.__init__` captures before patching, so the patched forward doesn't recurse into itself):

```python
class Qwen3VLImageEmbeddingsModelPatcher(ModelPatcher):
    def patched_forward(self, pixel_values, image_grid_thw, **kwargs):
        output = self.orig_forward(pixel_values, image_grid_thw, **kwargs)
        # Return a dict keyed by the config's declared outputs, as convert.py expects
        output_name = next(iter(self.config.outputs))
        return {output_name: output.last_hidden_state}
```
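The config would then hand this patcher to the export pipeline through the standard hook; a minimal sketch (placement inside the export config class is assumed):

```python
def patch_model_for_export(self, model, model_kwargs=None):
    # convert.py enters the returned patcher as a context manager during tracing,
    # so patching/unpatching happens in __enter__/__exit__ as expected.
    return Qwen3VLImageEmbeddingsModelPatcher(self, model, model_kwargs=model_kwargs)
```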
```python
    def patched_forward(self, pixel_values, image_grid_thw, **kwargs):
        # Get the original output
        output = self.orig_forward(pixel_values, image_grid_thw, **kwargs)
        # Return only the last_hidden_state to avoid type inference issues
        return output.last_hidden_state
```

A more defensive variant was also suggested, tolerating both tuple and object outputs:

```python
    def patched_forward(self, pixel_values, image_grid_thw, **kwargs):
        output = self.orig_forward(pixel_values, image_grid_thw, **kwargs)
        if isinstance(output, tuple):
            return output[0]
        return output.last_hidden_state if hasattr(output, "last_hidden_state") else output
```
Hi, when can this PR be reviewed and merged?
Export support for the Qwen3-VL-Embedding model has been added. The main modification is in the `exporters/openvino/model_configs.py` file. The export command line was tested: `optimum-cli export openvino --model Qwen3-VL-Embedding-2B Qwen3-VL-Embedding-2B-ov-fp16 --task feature-extraction --weight-format fp16`. Initial testing showed that the exported model is usable, and the export process completed without errors.
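For reference, a minimal smoke test of the exported IR with the OpenVINO runtime could look like the sketch below; the IR file name is the optimum-cli default, and the input shapes follow the dummy inputs used in this PR:

```python
import numpy as np
import openvino as ov

core = ov.Core()
# openvino_model.xml is the default file name produced by optimum-cli export
compiled = core.compile_model("Qwen3-VL-Embedding-2B-ov-fp16/openvino_model.xml", "CPU")
result = compiled({
    "pixel_values": np.random.rand(1, 3, 2, 16, 16).astype(np.float32),
    "image_grid_thw": np.array([[1, 16, 16]], dtype=np.int64),
})
```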