
Add support for LFM2-VL (lfm2_vl) model for image-text-to-text task #1695

Closed
openvino-agent wants to merge 1 commit into huggingface:main from openvino-agent:add-lfm2-vl-support

Conversation

@openvino-agent

Summary

This PR adds OpenVINO export and inference support for the LFM2-VL vision-language model family (model_type=lfm2_vl) for the image-text-to-text task.

Model Architecture

LFM2-VL (LiquidAI/LFM2-VL-450M) is a vision-language model with:

  • Vision encoder: Siglip2 vision tower with NaFlex (variable-resolution) patches — flat patch format (batch, N_patches, 768) + spatial_shapes metadata (see the layout sketch after this list)
  • Language model: Hybrid Mamba-style architecture with both convolutional state-space layers (conv cache) and standard attention layers (key/value cache)
  • Multi-modal projector: Projects vision features to LM hidden size with pixel unshuffle
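
For orientation, here is a minimal sketch of the NaFlex input layout the vision sub-model consumes (tensor names follow the PR description; the concrete image size is an assumption for illustration):

import torch

# Assumed example: one image that tiles into a 16x16 patch grid at patch
# size 16 over 3 channels, so each flattened patch holds 16*16*3 = 768 values.
batch = 1
pixel_values = torch.randn(batch, 256, 768)  # flat NaFlex patches (batch, N_patches, 768)
spatial_shapes = torch.tensor([[16, 16]])    # per-image (height, width) in patches
# A different image resolution simply yields a different N_patches; the
# spatial_shapes metadata tells the vision tower how to unflatten the patches.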

Changes

Export (model_configs.py, model_patcher.py, utils.py)

  • Added DummyLfm2VlVisionInputGenerator — generates flat NaFlex-format dummy vision inputs (16×16 = 256 patches)
  • Added Lfm2VLOpenVINOConfig — registered for lfm2_vl / image-text-to-text, exports 3 sub-models: vision embeddings, language model, text embeddings
  • Added Lfm2VlImageEmbeddingsModelPatcher:
    • Patches Siglip2VisionEmbeddings.resize_positional_embeddings to use torch.linspace + F.grid_sample instead of F.interpolate — making it compatible with dynamic spatial shapes in OpenVINO (sketched after this list)
    • Trims pixel_values to valid patches before vision tower (pixel_values[:, :h*w, :])
  • Added Lfm2VlLMModelPatcher — standard LM export patcher
  • Added lfm2_vl to MULTI_MODAL_TEXT_GENERATION_MODELS
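
As a sketch of the dynamic-shape-friendly resize (illustrative only, assuming standard grid_sample semantics; the actual patcher overrides Siglip2VisionEmbeddings.resize_positional_embeddings):

import torch
import torch.nn.functional as F

def resize_positional_embeddings(pos_emb, height, width):
    # pos_emb: (H0, W0, D) learned positional-embedding grid.
    # Build a normalized sampling grid in [-1, 1] for the target patch grid;
    # grid_sample traces cleanly with dynamic height/width, where the
    # original F.interpolate path did not.
    ys = torch.linspace(-1.0, 1.0, height)
    xs = torch.linspace(-1.0, 1.0, width)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    grid = torch.stack((grid_x, grid_y), dim=-1).unsqueeze(0)  # (1, H, W, 2)
    src = pos_emb.permute(2, 0, 1).unsqueeze(0)                # (1, D, H0, W0)
    out = F.grid_sample(src, grid, mode="bilinear", align_corners=False)
    return out.squeeze(0).permute(1, 2, 0).reshape(height * width, -1)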

Inference (modeling_visual_language.py)

Since LFM2-VL uses a non-stateful Mamba-style cache (explicit conv states + KV tensors as model I/O), the standard OVModelWithEmbedForCausalLM class (designed for stateful transformer LMs) cannot be used. Instead:

  • Added _OVLfm2VLCache — simple dataclass holding conv_states, key_cache, value_cache numpy arrays (see the sketch after this list)
  • Added _OVLfm2VLLanguageModel — custom LM wrapper that:
    • Manages explicit Mamba cache across decoding steps
    • Uses InferRequest API for OV inference
    • Exposes a past_key_values sentinel interface (None on prefill, the cache object afterwards) compatible with the VLM generation loop
    • Implements the embed_tokens, clear_requests, compile, and to methods
  • Added _OVLfm2VLForCausalLM — full VLM class with:
    • forward — passes spatial_shapes to vision processing
    • prepare_inputs_for_generation — passes spatial_shapes in generation loop
    • get_vision_embeddings / merge_vision_text_embeddings / preprocess_inputs
  • Registered "lfm2_vl": _OVLfm2VLForCausalLM in MODEL_TYPE_TO_CLS_MAPPING
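
A hedged sketch of the cache container and the sentinel convention (field names follow the PR description; layouts and defaults are assumptions):

from dataclasses import dataclass, field
import numpy as np

@dataclass
class _OVLfm2VLCache:
    # One rolling conv state per state-space layer, plus per-layer key/value
    # arrays for the attention layers; plain numpy arrays, since they travel
    # through the OpenVINO InferRequest as explicit model inputs and outputs.
    conv_states: list[np.ndarray] = field(default_factory=list)
    key_cache: list[np.ndarray] = field(default_factory=list)
    value_cache: list[np.ndarray] = field(default_factory=list)

# Sentinel convention: the VLM generation loop only checks whether
# past_key_values is None (prefill) or not (decode), so the language-model
# wrapper can return this object in place of a transformer KV cache.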

Tests and Docs

  • Added lfm2_vl entries to test_export.py, test_quantization.py, utils_tests.py
  • Updated docs/source/openvino/models.mdx with LFM2-VL entry

Validation

Export:

optimum-cli export openvino --model LiquidAI/LFM2-VL-450M --task image-text-to-text output_dir

✅ Successfully exports 3 sub-models with dynamic shapes.

Inference:

from transformers import AutoProcessor
from optimum.intel.openvino import OVModelForVisualCausalLM

model = OVModelForVisualCausalLM.from_pretrained('output_dir')
processor = AutoProcessor.from_pretrained('LiquidAI/LFM2-VL-450M')
# ... generate with image input
# Output: 'The image shows a white cow standing on a sandy beach...'

✅ End-to-end generation produces correct output.

- Add Lfm2VLOpenVINOConfig and DummyLfm2VlVisionInputGenerator in model_configs.py
- Add Lfm2VlImageEmbeddingsModelPatcher and Lfm2VlLMModelPatcher in model_patcher.py
  with dynamic-shape-compatible positional embedding resize using grid_sample
- Add lfm2_vl to MULTI_MODAL_TEXT_GENERATION_MODELS in utils.py
- Add _OVLfm2VLCache, _OVLfm2VLLanguageModel, and _OVLfm2VLForCausalLM
  in modeling_visual_language.py for non-stateful Mamba-style cache handling
- Update docs to include LFM2-VL model
- Add test entries for lfm2_vl in test_export.py, test_quantization.py, utils_tests.py
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@openvino-agent openvino-agent deleted the add-lfm2-vl-support branch April 29, 2026 01:50