
Add support for LFM2-VL (lfm2_vl) model for image-text-to-text task #1695

Closed
openvino-agent wants to merge 1 commit into huggingface:main from openvino-agent:add-lfm2-vl-support

Conversation

@openvino-agent

Summary

This PR adds OpenVINO export and inference support for the LFM2-VL vision-language model family (model_type=lfm2_vl) for the image-text-to-text task.

Model Architecture

LFM2-VL (LiquidAI/LFM2-VL-450M) is a vision-language model with:

  • Vision encoder: Siglip2 vision tower with NaFlex (variable-resolution) patches — flat patch format (batch, N_patches, 768) + spatial_shapes metadata (see the layout sketch after this list)
  • Language model: Hybrid Mamba-style architecture with both convolutional state-space layers (conv cache) and standard attention layers (key/value cache)
  • Multi-modal projector: Projects vision features to LM hidden size with pixel unshuffle
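
For orientation, here is a minimal sketch of the NaFlex input layout the vision sub-model consumes (tensor names follow the PR description; the concrete image size is an assumption for illustration):

import torch

# Assumed example: one image that tiles into a 16x16 patch grid at patch
# size 16 over 3 channels, so each flattened patch holds 16*16*3 = 768 values.
batch = 1
pixel_values = torch.randn(batch, 256, 768)  # flat NaFlex patches (batch, N_patches, 768)
spatial_shapes = torch.tensor([[16, 16]])    # per-image (height, width) in patches
# A different image resolution simply yields a different N_patches; the
# spatial_shapes metadata tells the vision tower how to unflatten the patches.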

Changes

Export (model_configs.py, model_patcher.py, utils.py)

  • Added DummyLfm2VlVisionInputGenerator — generates flat NaFlex-format dummy vision inputs (16×16 = 256 patches)
  • Added Lfm2VLOpenVINOConfig — registered for lfm2_vl / image-text-to-text, exports 3 sub-models: vision embeddings, language model, text embeddings
  • Added Lfm2VlImageEmbeddingsModelPatcher:
    • Patches Siglip2VisionEmbeddings.resize_positional_embeddings to use torch.linspace + F.grid_sample instead of F.interpolate — making it compatible with dynamic spatial shapes in OpenVINO (sketched after this list)
    • Trims pixel_values to valid patches before vision tower (pixel_values[:, :h*w, :])
  • Added Lfm2VlLMModelPatcher — standard LM export patcher
  • Added lfm2_vl to MULTI_MODAL_TEXT_GENERATION_MODELS
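
As a sketch of the dynamic-shape-friendly resize (illustrative only, assuming standard grid_sample semantics; the actual patcher overrides Siglip2VisionEmbeddings.resize_positional_embeddings):

import torch
import torch.nn.functional as F

def resize_positional_embeddings(pos_emb, height, width):
    # pos_emb: (H0, W0, D) learned positional-embedding grid.
    # Build a normalized sampling grid in [-1, 1] for the target patch grid;
    # grid_sample traces cleanly with dynamic height/width, where the
    # original F.interpolate path did not.
    ys = torch.linspace(-1.0, 1.0, height)
    xs = torch.linspace(-1.0, 1.0, width)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    grid = torch.stack((grid_x, grid_y), dim=-1).unsqueeze(0)  # (1, H, W, 2)
    src = pos_emb.permute(2, 0, 1).unsqueeze(0)                # (1, D, H0, W0)
    out = F.grid_sample(src, grid, mode="bilinear", align_corners=False)
    return out.squeeze(0).permute(1, 2, 0).reshape(height * width, -1)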

Inference (modeling_visual_language.py)

Since LFM2-VL uses a non-stateful Mamba-style cache (explicit conv states + KV tensors as model I/O), the standard OVModelWithEmbedForCausalLM class (designed for stateful transformer LMs) cannot be used. Instead:

  • Added _OVLfm2VLCache — simple dataclass holding conv_states, key_cache, value_cache numpy arrays (see the sketch after this list)
  • Added _OVLfm2VLLanguageModel — custom LM wrapper that:
    • Manages explicit Mamba cache across decoding steps
    • Uses InferRequest API for OV inference
    • Exposes a past_key_values sentinel interface (None on prefill, the cache object afterwards) compatible with the VLM generation loop
    • Implements the embed_tokens, clear_requests, compile, and to methods
  • Added _OVLfm2VLForCausalLM — full VLM class with:
    • forward — passes spatial_shapes to vision processing
    • prepare_inputs_for_generation — passes spatial_shapes in generation loop
    • get_vision_embeddings / merge_vision_text_embeddings / preprocess_inputs
  • Registered "lfm2_vl": _OVLfm2VLForCausalLM in MODEL_TYPE_TO_CLS_MAPPING
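
A hedged sketch of the cache container and the sentinel convention (field names follow the PR description; layouts and defaults are assumptions):

from dataclasses import dataclass, field
import numpy as np

@dataclass
class _OVLfm2VLCache:
    # One rolling conv state per state-space layer, plus per-layer key/value
    # arrays for the attention layers; plain numpy arrays, since they travel
    # through the OpenVINO InferRequest as explicit model inputs and outputs.
    conv_states: list[np.ndarray] = field(default_factory=list)
    key_cache: list[np.ndarray] = field(default_factory=list)
    value_cache: list[np.ndarray] = field(default_factory=list)

# Sentinel convention: the VLM generation loop only checks whether
# past_key_values is None (prefill) or not (decode), so the language-model
# wrapper can return this object in place of a transformer KV cache.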

Tests and Docs

  • Added lfm2_vl entries to test_export.py, test_quantization.py, utils_tests.py
  • Updated docs/source/openvino/models.mdx with LFM2-VL entry

Validation

Export:

optimum-cli export openvino --model LiquidAI/LFM2-VL-450M --task image-text-to-text output_dir

✅ Successfully exports 3 sub-models with dynamic shapes.

Inference:

from transformers import AutoProcessor
from optimum.intel.openvino import OVModelForVisualCausalLM

model = OVModelForVisualCausalLM.from_pretrained('output_dir')
processor = AutoProcessor.from_pretrained('LiquidAI/LFM2-VL-450M')
# ... generate with image input
# Output: 'The image shows a white cow standing on a sandy beach...'

✅ End-to-end generation produces correct output.

- Add Lfm2VLOpenVINOConfig and DummyLfm2VlVisionInputGenerator in model_configs.py
- Add Lfm2VlImageEmbeddingsModelPatcher and Lfm2VlLMModelPatcher in model_patcher.py
  with dynamic-shape-compatible positional embedding resize using grid_sample
- Add lfm2_vl to MULTI_MODAL_TEXT_GENERATION_MODELS in utils.py
- Add _OVLfm2VLCache, _OVLfm2VLLanguageModel, and _OVLfm2VLForCausalLM
  in modeling_visual_language.py for non-stateful Mamba-style cache handling
- Update docs to include LFM2-VL model
- Add test entries for lfm2_vl in test_export.py, test_quantization.py, utils_tests.py
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@openvino-agent openvino-agent deleted the add-lfm2-vl-support branch April 29, 2026 01:50