Feature Request: Native OpenVINO export and NNCF quantization support for SeamlessM4T-v2 (seamless_m4t_v2)

### **Description:**

**Is your feature request related to a problem? Please describe.**
The `seamless_m4t_v2` architecture (specifically `facebook/seamless-m4t-v2-large`) is not currently supported for OpenVINO export via `optimum-intel`.

Exporting the model with `optimum-cli` fails with the following error:
```text
ValueError: Trying to export a seamless_m4t_v2 model, that is a custom or unsupported architecture, but no custom export configuration was passed as `custom_export_configs`.
```

**Describe the solution you'd like**
I would like native support for the `seamless_m4t_v2` architecture in `optimum-intel`. Specifically, the ability to:
1. **Export:** Export the model to OpenVINO format (`.xml`, `.bin`).
2. **Quantization/Precision:** Apply INT8 (and optionally INT4) weight compression via NNCF, and support the general FP16/bfloat16 precision options.
3. **Inference Pipeline:** Run inference with parity to Transformers' `SeamlessM4Tv2ForSpeechToSpeech`, through appropriate `OVModel*` wrappers or pipelines that handle the S2ST-specific graph (units, vocoder, multiple heads) rather than assuming a standard encoder-decoder seq2seq model.
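
To give a rough sense of why weight compression matters here, the sketch below estimates the weight footprint at each precision. It assumes roughly 2.3 billion parameters for `facebook/seamless-m4t-v2-large` (my recollection of the model card, not a measured value), and it ignores quantization scales, zero points, and any layers NNCF would keep at higher precision, so real exported sizes will differ. The helper name `weight_size_gb` is illustrative only.

```python
# Back-of-the-envelope weight footprint per precision.
# ASSUMED_PARAMS is an assumption (about 2.3B parameters), not a measured value.
ASSUMED_PARAMS = 2.3e9

BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "bf16": 16, "int8": 8, "int4": 4}


def weight_size_gb(num_params, weight_format):
    """Approximate weight storage in GB (10^9 bytes), ignoring quantization metadata."""
    bits = BITS_PER_WEIGHT[weight_format]
    return num_params * bits / 8 / 1e9


if __name__ == "__main__":
    for fmt in BITS_PER_WEIGHT:
        print(f"{fmt}: ~{weight_size_gb(ASSUMED_PARAMS, fmt):.2f} GB")
```

Under these assumptions, INT8 halves the FP16 footprint and INT4 halves it again, which is the difference that matters on memory-constrained edge devices.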

**Describe alternatives you've considered**
The only way to run this model today is standard PyTorch, which gets no OpenVINO acceleration on Intel CPUs and integrated graphics for this multimodal architecture. Cascaded pipelines (Whisper + MT + SpeechT5) are an alternative, but they do not match the performance and quality of SeamlessM4T's unified, "latent-based" S2ST approach.

**Environment & Reproduction Details:**
- **OS:** Ubuntu 24.04 Server
- **Python version:** Python 3.12.3
- **`optimum` version:** 2.1.0
- **`optimum-intel` version:** 1.27.0
- **`transformers` version:** 4.57.6
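
For anyone reproducing this, the versions above can be gathered with a short standard-library script. This is a minimal sketch: the helper name `collect_versions` is illustrative, and `openvino` and `nncf` are added to the list on the assumption that they are also relevant, even though they are not listed above.

```python
import platform
from importlib import metadata

# Distributions to report; "openvino" and "nncf" are additions not listed above.
PACKAGES = ("optimum", "optimum-intel", "transformers", "openvino", "nncf")


def collect_versions(packages=PACKAGES):
    """Return a dict mapping each name to its installed version or 'not installed'."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```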

**Command run:**
```bash
optimum-cli export openvino --model facebook/seamless-m4t-v2-large --weight-format int8 seamless_s2st_int8/
```
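
The same command can be driven from Python to capture the failure, for example as a regression check once support lands. This is a sketch only: it uses exactly the flags from the command above, and the helper names `build_export_command` and `run_export` are hypothetical, not part of `optimum`.

```python
import shutil
import subprocess


def build_export_command(model_id, output_dir, weight_format="int8"):
    """Build the argument list for the optimum-cli export shown above."""
    return [
        "optimum-cli", "export", "openvino",
        "--model", model_id,
        "--weight-format", weight_format,
        output_dir,
    ]


def run_export(model_id, output_dir, weight_format="int8"):
    """Run the export and return the CompletedProcess with captured output."""
    cmd = build_export_command(model_id, output_dir, weight_format)
    if shutil.which(cmd[0]) is None:
        raise RuntimeError("optimum-cli not found on PATH")
    return subprocess.run(cmd, capture_output=True, text=True)


if __name__ == "__main__":
    result = run_export("facebook/seamless-m4t-v2-large", "seamless_s2st_int8/")
    print("exit code:", result.returncode)
    # Print only the tail of stderr, where the ValueError is expected to appear.
    print(result.stderr[-2000:])
```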

**Additional context**
SeamlessM4T-v2 is a state-of-the-art open-source model capable of direct speech-to-speech translation. OpenVINO export support would substantially improve its inference performance for edge computing and local AI applications running on Intel hardware.

Feature Request: Native OpenVINO export and NNCF quantization support for SeamlessM4T-v2 (seamless_m4t_v2) #1667
