
[OpenVINO] Add export and inference support for ministral reasoning model 2510 #1669

Open
dhandhalyabhavik wants to merge 1 commit into huggingface:main from dhandhalyabhavik:ministral-reasoning-support

Conversation

@dhandhalyabhavik

@dhandhalyabhavik dhandhalyabhavik commented Mar 31, 2026

Enable mistral3 model type for OpenVINO export and inference in optimum-intel. This adds support for the Ministral-3-3B-Reasoning VLM model with Pixtral vision encoder and Ministral3 language model.

Changes:

  • exporters/openvino/utils.py: Register mistral3 in MULTI_MODAL_TEXT_GENERATION_MODELS
  • exporters/openvino/model_configs.py: Add Mistral3OpenVINOConfig with task registration, text_config model_type fallback (ministral3 -> mistral), and VLM behavior routing
  • exporters/openvino/model_patcher.py: Add inlined vision pipeline patcher (Mistral3ImageEmbeddingModelPatcher), language model patcher fixing sliding_window=None and cache_position injection (Mistral3LanguageModelPatcher), and eager_mask_without_vmap compatibility fix for transformers >= 5.4 q_length API change
  • intel/openvino/modeling_visual_language.py: Add _OVMistral3ForCausalLM inference class with vision/text embedding merge via masked_scatter

Tested with FP16/INT8/INT4 exports, text generation, image description, and 5-turn multi-turn conversation. No repetition issues detected.
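
For reference, a minimal export sketch using the optimum-intel Python API is below; the checkpoint id is a placeholder (substitute the actual Ministral-3-3B-Reasoning repository), and the INT4 weight config mirrors the int4 directory used by the chat.py script that follows.

```python
# Minimal export sketch. MODEL_ID is a placeholder, not the verified
# checkpoint id; adjust the output path to match your layout.
from optimum.intel import OVWeightQuantizationConfig
from optimum.intel.openvino import OVModelForVisualCausalLM

MODEL_ID = "mistralai/Ministral-3-3B-Reasoning"  # placeholder id

# export=True converts the PyTorch checkpoint to OpenVINO IR on load;
# the quantization config produces the INT4 weight variant.
model = OVModelForVisualCausalLM.from_pretrained(
    MODEL_ID,
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=4),
)
model.save_pretrained("/workspace/instance-3/exported_models/ministral-3-3b-reasoning-int4")
```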

Sample chat.py script for experimenting with reasoning on/off:

#!/usr/bin/env python3
"""
Interactive chat with Ministral-3-3B-Reasoning via OpenVINO.
Supports up to 5 conversation turns. Optionally attach an image on the first turn.

Usage:
    # Text-only chat (INT4, fastest, direct mode)
    python3 chat.py

    # Reasoning mode — model shows [THINK]...[/THINK] before answering
    python3 chat.py --reasoning

    # Specify format and max tokens
    python3 chat.py --format fp16 --max-tokens 1024

    # Chat with an image on first turn
    python3 chat.py --image /workspace/instance-3/images/1.png --reasoning
"""
import argparse
import re
import time

from optimum.intel.openvino import OVModelForVisualCausalLM
from transformers import AutoProcessor
from PIL import Image

MAX_TURNS = 5
REPETITION_PENALTY = 1.2
DIRECT_SYSTEM_PROMPT = "You are a helpful AI assistant. Answer questions clearly and concisely."

MODEL_DIRS = {
    "fp16": "/workspace/instance-3/exported_models/ministral-3-3b-reasoning-fp16",
    "int8": "/workspace/instance-3/exported_models/ministral-3-3b-reasoning-int8",
    "int4": "/workspace/instance-3/exported_models/ministral-3-3b-reasoning-int4",
}


def format_reasoning_reply(reply):
    """Pretty-print a reply that contains [THINK]...[/THINK] tags."""
    match = re.search(r"\[THINK\](.*?)\[/THINK\](.*)", reply, re.DOTALL)
    if match:
        thinking = match.group(1).strip()
        answer = match.group(2).strip()
        lines = []
        lines.append("\033[2m💭 Thinking:\033[0m")
        for line in thinking.splitlines():
            lines.append(f"\033[2m   {line}\033[0m")
        lines.append("")
        lines.append(f"✅ Answer:\n{answer}")
        return "\n".join(lines)
    return reply


def main():
    parser = argparse.ArgumentParser(description="Chat with Ministral-3-3B-Reasoning (OpenVINO)")
    parser.add_argument("--format", choices=["fp16", "int8", "int4"], default="int4",
                        help="Model quantization format (default: int4)")
    parser.add_argument("--image", type=str, default=None,
                        help="Path to an image to include in the first turn")
    parser.add_argument("--max-tokens", type=int, default=512,
                        help="Max new tokens per response (default: 512)")
    parser.add_argument("--reasoning", action="store_true",
                        help="Enable reasoning mode: model shows [THINK]...[/THINK] chain-of-thought before answering")
    args = parser.parse_args()

    model_dir = MODEL_DIRS[args.format]
    mode = "reasoning" if args.reasoning else "direct"
    print(f"Loading {args.format.upper()} model from {model_dir}...")
    model = OVModelForVisualCausalLM.from_pretrained(model_dir)
    processor = AutoProcessor.from_pretrained(model_dir)
    print(f"Model loaded. Mode: {mode}, max {MAX_TURNS} turns, {args.max_tokens} tokens/response.\n")

    if args.reasoning:
        # No system prompt override — let the model's built-in reasoning template take effect
        history = []
        print("🧠 Reasoning mode ON — model will show its thinking process.\n")
    else:
        history = [{"role": "system", "content": DIRECT_SYSTEM_PROMPT}]

    image = None
    if args.image:
        image = Image.open(args.image)
        print(f"Image loaded: {args.image} ({image.size[0]}x{image.size[1]})")
        print("The image will be included in your first message.\n")

    for turn in range(1, MAX_TURNS + 1):
        try:
            user_input = input(f"[Turn {turn}/{MAX_TURNS}] You: ").strip()
        except (EOFError, KeyboardInterrupt):
            print("\nExiting.")
            break

        if not user_input:
            print("Empty input, skipping.")
            continue

        # Build message content
        if turn == 1 and image is not None:
            content = [{"type": "image"}, {"type": "text", "text": user_input}]
        else:
            content = user_input

        history.append({"role": "user", "content": content})

        # Apply chat template
        text = processor.apply_chat_template(history, tokenize=False, add_generation_prompt=True)

        # Tokenize (include image only on first turn)
        if turn == 1 and image is not None:
            inputs = processor(images=image, text=text, return_tensors="pt")
        else:
            inputs = processor(text=text, return_tensors="pt")

        n_input = inputs["input_ids"].shape[1]

        # Generate
        t0 = time.time()
        output = model.generate(
            **inputs,
            max_new_tokens=args.max_tokens,
            repetition_penalty=REPETITION_PENALTY,
        )
        elapsed = time.time() - t0

        n_gen = output.shape[1] - n_input
        if args.reasoning:
            reply = processor.decode(output[0][n_input:], skip_special_tokens=False)
            # Strip only the EOS token, keep [THINK]/[/THINK]
            reply = reply.replace("</s>", "").strip()
        else:
            reply = processor.decode(output[0][n_input:], skip_special_tokens=True)
        tps = n_gen / elapsed if elapsed > 0 else 0

        # Store the decoded reply (including any [THINK] tags) for context
        history.append({"role": "assistant", "content": reply})

        print(f"\nAssistant [{n_gen} tok, {elapsed:.1f}s, {tps:.1f} tok/s]:")
        if args.reasoning:
            print(format_reasoning_reply(reply))
        else:
            print(reply)
        print()

    print(f"\n--- Chat ended after {min(turn, MAX_TURNS)} turns ---")


if __name__ == "__main__":
    main()

Sample output:

Multi-turn output:
(venv) root@106f05962c95:/workspace# python3 /workspace/instance-3/chat.py --format int4 --reasoning --max-tokens 1024
Loading INT4 model from /workspace/instance-3/exported_models/ministral-3-3b-reasoning-int4...
Model loaded. Mode: reasoning, max 5 turns, 1024 tokens/response.

🧠 Reasoning mode ON — model will show its thinking process.

[Turn 1/5] You: Hi
Both `max_new_tokens` (=1024) and `max_length`(=262144) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)

Assistant [252 tok, 8.1s, 31.2 tok/s]:
💭 Thinking:
   Okay, the user has just started a conversation with me. They've said "Hi" which is a common greeting. I need to respond in a friendly manner while being aware that this might be the start of a longer interaction. Since there's no specific question yet, perhaps it would be good to acknowledge their greeting and offer some general information or assistance.
   
   But before jumping into anything too detailed, maybe I should keep it simple and open-ended to see where they go next.
   
   Possible responses could include:
   - Acknowledging the greeting
   - Asking if they have a specific request or question
   - Offering to help with something general
   
   Let me try to craft a polite and welcoming response.
   
   I'll say something like:
   
   "I'm here to help! How can I assist you today?"
   
   Alternatively, since the initial phrase was very short, maybe even simpler:
   
   "Hello! What can I do for you today?"
   
   Both options seem appropriate. Maybe the second one is slightly more direct but still warm.
   
   Wait, perhaps the first option is better because it leaves room for them to elaborate without pressure.
   
   Final decision: "I'm here to help! How can I assist you today?"

✅ Answer:
I'm here to help! How can I assist you today?

[Turn 2/5] You: write dfs in python, keep the code shortest.
Both `max_new_tokens` (=1024) and `max_length`(=262144) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)

Assistant [954 tok, 41.4s, 23.0 tok/s]:
💭 Thinking:
   Alright, the task is to write a depth-first search (DFS) algorithm in Python, keeping the code as short as possible. Let me think about how to implement DFS concisely.
   
   In graph traversal, DFS typically uses recursion or a stack. For minimalism, recursion might be shorter, but we also need to consider whether to represent the graph as an adjacency list or matrix. Adjacency lists are often used for brevity when graphs aren't fixed-size.
   
   A basic recursive DFS function that takes a node and a visited set:
   
   1. Start from a given node.
   2. Mark the current node as visited.
   3. Print or store the node.
   4. Recursively visit all unvisited neighbors.
   
   For the shortest code, let's assume the graph is represented by an adjacency list and we're printing nodes.
   
   Here’s a concise version:
   
   ```python
   def dfs(graph, v):
       if v not in graph: return []
       result = [v]
       for neighbor in graph[v]:
           if neighbor not in result:
               result += dfs(graph, neighbor)
        return result
    ```

This is quite compact. It assumes graph is a dictionary where keys are nodes and values are lists of adjacent nodes. The function returns the order of visitation.

If we want to make it even shorter, perhaps by avoiding the check against 'result' each time (which may lead to duplicates), but then we'd lose the uniqueness guarantee. Alternatively, we could pass a visited set, but that might complicate things.

Another approach is to use a stack iteratively, which might be shorter due to lack of recursion overhead in terms of lines, though not necessarily character count.

Iterative DFS:

def dfs(graph, v):
    s, r = {v}, []
    while s:
        n = s.pop()
        if n not in r: r.append(n); s.update(graph[n])
    return r

Hmm, this seems similar in length. But the problem asks for the shortest code, so perhaps counting characters matters.

The first version is 97 characters (without whitespace for readability):

def d(g,v):r=[v];[n for i in g.get(v,[])if n not in r and(r.extend([n])or 0)]+d(g,n)for n in g.get(v,[])]

Wait, that looks messy. Maybe stick to the clear iterative version.

Actually, the most concise might be the recursive one, assuming the graph is passed correctly.

Perhaps the following is the shortest:

def f(g,v,s=[]):
    if v not in s:s+=[v]
    for w in g.get(v,[]):
        if w not in s:f(g,w,s)

But this doesn’t return the path; it modifies the list in place. If we need to capture the path, the previous versions are needed.

Assuming the goal is to get the traversal order, the first recursive version is probably the way to go, and it's relatively short.

So, final answer:

def dfs(graph, v):
    if v not in graph: return []
    res = [v]
    for u in graph[v]:
        if u not in res: res += dfs(graph, u)
    return res

Character count: 85 (excluding spaces for indentation, but usually we count actual characters including spaces).

Alternative, using a default parameter for visited set to avoid global state:

def dfs(graph, v, s=None):
    if s is None: s=[]
    if v not in s: s+=[v]
    for u in graph.get(v, []):
        if u not in s: dfs(graph, u, s)
    return s

This is 106 characters, which is longer than the previous version. So the first one is shorter.

Thus, the shortest is likely the first one.

Now, to present the answer clearly.

✅ Answer:
Here's a concise implementation of DFS in Python, returning the order of visited nodes starting from v:

def dfs(graph, v):
    if v not in graph: return []
    res = [v]
    for u in graph[v]:
        if u not in res: res += dfs(graph, u)
    return res

This works with a graph represented as a dictionary where keys are nodes and values are lists of adjacent nodes. The function recursively visits unvisited neighbors, appending them to the result list. The base case checks if the node exists in the graph. This is approximately 85 characters long.

[Turn 3/5] You: what is its time complexity?
Both max_new_tokens (=1024) and max_length(=262144) seem to have been set. max_new_tokens will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)

Assistant [1024 tok, 67.0s, 15.3 tok/s]:
[THINK]Okay, the user asked about the time complexity of the provided DFS implementation. Let me recall how to analyze the time complexity for this recursive DFS function.

Given the function:

def dfs(graph, v):
    if v not in graph: return []
    res = [v]
    for u in graph[v]:
        if u not in res: res += dfs(graph, u)
    return res

Assumptions:

  • Graph is represented as an adjacency list (a dictionary where each key is a node and value is a list of its neighbors).
  • Each call processes the current node and makes recursive calls for each unvisited neighbor.

Time Complexity Analysis:

  1. Visiting a single node takes O(1) time (assuming hash table lookups are constant time).
  2. In the worst case, every edge is traversed exactly once during the entire execution.
    • Suppose the graph has V vertices and E edges.
    • At each step, we either add a new vertex to the result or move to a neighbor.
    • Therefore, total operations are proportional to the number of vertices plus the number of edges.

Since we traverse each node and edge once, the time complexity is O(V + E). However, note that checking membership in the result list (u not in res) is O(k) where k is the size of the result list, which grows up to V. Thus, the full analysis shows that the inner loop adds an extra factor of O(V) per level, leading to O((V + E)V) in the worst case for dense graphs. Wait, actually, in practice, the res list contains only the current path, so the membership test is O(L) where L is the length of the current path, which is at most V. And since the sum over all paths is linear in V+E, the overall complexity remains O(V + E) * average path length, which simplifies to O(VE) in the worst case for sparse graphs? Hmm, perhaps it's better to think that the work done at each node is proportional to its degree, and the sum of degrees is 2E. Then the total time is O(E) for processing edges, and O(V) for building the result list. So overall O(V + E).

Wait, perhaps the standard statement is that DFS runs in O(V + E) time regardless of the data structure, assuming constant-time access to neighbors.

To confirm, let me double-check:

Each node is added to the result list exactly once, and each edge is considered twice (once in each direction, except for undirected graphs where it's once). The main cost comes from iterating over the adjacency list of each node. Assuming that the adjacency list representation allows us to iterate over neighbors in O(degree(node)) time, and summing these over all nodes gives O(E) for directed graphs (since each edge is processed once) and O(2E) for undirected graphs (each edge counted twice). Additionally, the list append operation is amortized O(1), and the membership tests in the result list—though technically, the res list is growing, and the in operator is O(len(res)), which in the worst case is O(V) per node. However, since each node is appended only once, the total cost of all in checks is O(V^2) in the worst case (e.g., for a complete graph). But this is less important compared to the edge processing.

However, many sources simplify this to O(V + E) by considering that the dominant term is the edge traversal, and the list operations are negligible or assumed to be constant time for practical purposes. To be precise, the exact time complexity is O(V + E) for the iteration part, and O(V^2) for the list membership checks, making it O(V + E + V^2) = O(V^2) in the worst case for connected graphs. But often, the latter is ignored, focusing on the edge-based complexity.

For simplicity, the commonly accepted time complexity for DFS is O(V + E), especially when the graph is stored in an adjacency list and the list membership is considered constant time (perhaps by maintaining a separate visited set instead of checking in the result list).

Ah! Perhaps the original code can be improved to use a visited set for true O(V + E) time. Here's how:

def dfs(graph, v, visited=None):
    if visited is None: visited = set()
    if v not in visited:
        print(v)  # or add to result
        visited.add(v)
        for u in graph.get(v, []):
            dfs(graph, u, visited)

With this, the time complexity is strictly O(V + E) because adding to a set and checking membership are O(1) on average, and we process each node and edge once.

But the original question refers to the first implementation, which

[Turn 4/5] You: ^C
Exiting.

--- Chat ended after 4 turns ---
(venv) root@106f05962c95:/workspace#
(venv) root@106f05962c95:/workspace# python3 /workspace/instance-3/chat.py --format int4 --reasoning --max-tokens 1024
Loading INT4 model from /workspace/instance-3/exported_models/ministral-3-3b-reasoning-int4...
Model loaded. Mode: reasoning, max 5 turns, 1024 tokens/response.

🧠 Reasoning mode ON — model will show its thinking process.

[Turn 1/5] You: Hi
Both max_new_tokens (=1024) and max_length(=262144) seem to have been set. max_new_tokens will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)

Assistant [252 tok, 8.1s, 31.3 tok/s]:
💭 Thinking:
Okay, the user has just started a conversation with me. They've said "Hi" which is a common greeting. I need to respond in a friendly manner while being aware that this might be the start of a longer interaction. Since there's no specific question yet, perhaps it would be good to acknowledge their greeting and offer some general information or assistance.

But before jumping into anything too detailed, maybe I should keep it simple and open-ended to see where they go next.

Possible responses could include:

  • Acknowledging the greeting
  • Asking if they have a specific request or question
  • Offering to help with something general

Let me try to craft a polite and welcoming response.

I'll say something like:

"I'm here to help! How can I assist you today?"

Alternatively, since the initial phrase was very short, maybe even simpler:

"Hello! What can I do for you today?"

Both options seem appropriate. Maybe the second one is slightly more direct but still warm.

Wait, perhaps the first option is better because it leaves room for them to elaborate without pressure.

Final decision: "I'm here to help! How can I assist you today?"

✅ Answer:
I'm here to help! How can I assist you today?

[Turn 2/5] You: what are the graph traversal algos? list 3 names.
Both max_new_tokens (=1024) and max_length(=262144) seem to have been set. max_new_tokens will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)

Assistant [952 tok, 41.5s, 22.9 tok/s]:
💭 Thinking:
Alright, the user is asking about graph traversal algorithms and wants three examples listed. Let me recall what graph traversal algorithms exist.

Graph traversal typically refers to methods used to visit all nodes in a graph starting from a given node. Common ones include Depth-First Search (DFS), Breadth-First Search (BFS), and maybe others like Prim's algorithm or Kruskal's, but those are more for MSTs. For pure graph traversals, DFS and BFS are fundamental.

Perhaps also mention two more, maybe Union-Find or something else? Wait, maybe stick to basic unweighted graph traversal algorithms.

So, three main ones are probably:

  1. Depth First Search (DFS)
  2. Breadth First Search (BFS)
  3. Maybe another like Prim’s or Kruskal’s? Hmm, not sure if these are strictly traversal.

Actually, let me double-check my memory. In many contexts, the primary graph traversal algorithms are DFS and BFS. There might be others like the "jump search," but that seems less likely.

Maybe the third one could be "Prim's algorithm" for minimum spanning tree, but that's not exactly traversal. Alternatively, maybe "Kahn's algorithm" for topological sorting?

Hmm, perhaps sticking to classic traversal methods:

  1. Depth-First Search (DFS)
  2. Breadth-First Search (BFS)
  3. Maybe "Depth-Limited Search"? But that's a variation.

Alternatively, perhaps "Prim's algorithm" is sometimes considered a type of graph traversal when building a MST.

To be safe, I’ll list DFS, BFS, and then perhaps "Prim’s algorithm" as it involves visiting vertices in order.

Now, how to present this clearly.

Response could be:

"The following are three common graph traversal algorithms:

  1. Depth-First Search (DFS): Explores as deep as possible along each branch before backtracking.
  2. Breadth-First Search (BFS): Explores all neighbors at the current level before moving to the next level.
  3. Prim’s Algorithm: Used to find a Minimum Spanning Tree by adding the cheapest edge connecting the growing tree to the rest of the graph."

Wait, is Prim’s really a traversal? It does involve visiting edges and nodes, so yes, it can be called a graph traversal method.

Alternatively, if we're looking for purely traversal methods without considering MST, maybe "Kahn's algorithm" for topological sort is acceptable.

But to avoid confusion, perhaps better to stick to clear definitions.

Another thought: maybe "Union-Find" is not a traversal, so perhaps skip that.

Okay, final selection:

Three types of graph traversal algorithms often taught together are:

  1. Depth-First Search (DFS)
  2. Breadth-Fist Search (BFS)
  3. Maybe "Prim’s" or "Kruskal’s", but perhaps better to pick "Topological Sorting" as well, though it's different.

Wait, perhaps the most straightforward answer is to list DFS, BFS, and then "Prim’s" or "Kruskal’s". To be precise, maybe:

The three most commonly referred-to graph traversal algorithms are:

  1. Depth-First Search (DFS)
    • Starts at a source vertex, explores as far as possible along each branch before backtracking.
  2. Breadth-First Search (BFS)
    • Uses a queue; starts at the source, explores all adjacent vertices at the present depth before moving inward.
  3. Prim’s Algorithm
    • Builds a minimum spanning tree by selecting the smallest weight edge that connects a new vertex to the existing tree.

This covers three distinct concepts related to graphs, focusing on traversal-like operations.

Now, format the answer properly.

✅ Answer:
Here are three common graph traversal algorithms:

  1. Depth-First Search (DFS):

    • Starts at a source vertex, explores as far as possible along each branch before backtracking.
    • Implemented using recursion or a stack.
  2. Breadth-First Search (BFS):

    • Uses a queue; starts at the source, explores all adjacent vertices at the current level before moving to the next level.
    • More efficient for finding shortest paths in unweighted graphs.
  3. Prim’s Algorithm:

    • Used to construct a minimum spanning tree by iteratively adding the least-weight edge that connects a new vertex to the existing tree. While primarily for MST construction, it involves graph traversal steps.

[Turn 3/5] You: Write shortest python code for DFS. No explanation, just give code.
Both max_new_tokens (=1024) and max_length(=262144) seem to have been set. max_new_tokens will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)

Assistant [990 tok, 64.3s, 15.4 tok/s]:
[THINK]Okay, the task is to write the shortest Python code for Depth-First Search (DFS). The requirement is minimal—just the code without explanations.

For a simple undirected graph represented as an adjacency list, DFS can be implemented recursively. Here's a concise version:

def dfs(graph, v):
    visited = set()
    def _dfs(v):
        if v not in visited:
            print(v, end=' ')
            visited.add(v)
            for neighbor in graph[v]:
                _dfs(neighbor)
    _dfs(v)

graph = {
    'A': ['B', 'C'],
    'B': ['A', 'D', 'E'],
    'C': ['A', 'F'],
    'D': ['B'],
    'E': ['B', 'G'],
    'F': ['C'],
    'G': ['E']
}
dfs(graph, 'A')

But wait, the instruction says "shortest" code. So, perhaps we can make it even shorter by avoiding nested functions and using iterative approach instead, which might save characters.

Iterative DFS:

def dfs(graph, v):
    s = [v]
    visited = set([v])
    while s:
        u = s.pop()
        print(u, end=' ')
        for w in graph[u] - visited:
            s.append(w)
            visited.add(w)

This is shorter in terms of lines, but perhaps not character-wise. However, the original recursive version is quite standard and clear, but the problem asks for the "shortest" code, possibly meaning fewest lines or characters.

Alternative attempt to minimize characters:

Assuming the graph is stored in a dictionary g and we call dfs(g, 'start'):

def d(s,v):
    if v not in g: return
    print(v,end=' ')
    for x in g.get(v,''):
        d(s,x)
d({'A':'BC','B':'ACDE','C':'ABF','D':'B','E':'BDG','F':'C','G':'EF'},'A')

This is a bit more compact, but perhaps the first one is clearer. Given the ambiguity of "shortest," I think providing the most concise functional implementation is best.

Final choice: the iterative version with minimal syntax.

Here's the code:

def dfs(g,s):
    v=set()
    q=[s]
    while q:
        u=q.pop()
        if u not in v:
            print(u,end=' ')
            v.add(u)
            q+=[x for x in g.get(u,'')if x not in v]

This uses a set for visited and a list as a stack. It avoids defining separate variables and tries to be as concise as possible.

But to ensure it works for the example, let's test it mentally:

Input: {'A':['B','C'], ...}, start 'A'

It prints 'A', adds B,C to queue, etc. Seems correct.

However, the line q+=[x for x in g.get(u,'')if x not in v] may not work as intended because v is added after processing, so initially, v only contains 'A'. When u is popped, it checks neighbors against v, which hasn't been updated yet. This will cause infinite loops unless the check is done differently.

Ah, mistake! Need to add to v before appending to the queue.

Corrected version:

def dfs(g,s):
    v={s}
    q=[s]
    while q:
        u=q.pop()
        print(u,end=' ')
        for x in g.get(u,''):
            if x not in v:
                v.add(x)
                q.append(x)

This is now correct and somewhat concise.

Character count: perhaps we can remove spaces where possible, but readability is important. Assuming the above is acceptable.

Thus, the final code is:

def dfs(g,s):
 v={s};q=[s]
 while q:
  u=q.pop();print(u,end=' ')
  for x in g.get(u,''):
   if x not in v:v.add(x);q+=[x]

This is under 80 characters (excluding whitespace for indentation, assuming single space indents).

If we consider that the function needs to take parameters and handle empty cases, this is a valid minimal implementation.

So, the answer is:

def dfs(g,s):
 v={s};q=[s]
 while q:
  u=q.pop();print(u,end=' ')
  for x in g.get(u,''):
   if x not in v:v.add(x);q+=[x]

(Generation stopped at the 1024-token limit, so the final answer was not printed.)

[Turn 4/5] You: what is the time complexity? just give ans in 1 line.
Both max_new_tokens (=1024) and max_length(=262144) seem to have been set. max_new_tokens will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)

Assistant [160 tok, 14.4s, 11.1 tok/s]:
💭 Thinking:
Okay, the user is asking for the time complexity of DFS in one line. From what I remember, for an undirected graph with V vertices and E edges, the time complexity of DFS is O(V + E).

Since the question is very brief, I'll provide the answer directly.

Time complexity of DFS: O(V + E)

But to confirm, is this always true? Yes, for connected components, and generally for the entire graph.

In the context of the previous code provided, which is an iterative DFS, the time complexity remains O(V + E) as it visits each vertex once and each edge twice (for undirected graphs).

So, the answer is simply:

Time complexity of DFS: O(V + E)

✅ Answer:
Time complexity of DFS: O(V + E)

[Turn 5/5] You: what about Dijkstra? ans in 1 line.
Both max_new_tokens (=1024) and max_length(=262144) seem to have been set. max_new_tokens will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)

Assistant [140 tok, 13.7s, 10.2 tok/s]:
[THINK]Okay, the user is asking for the time complexity of Dijkstra's algorithm in one line.

From what I recall, Dijkstra's algorithm has a time complexity of O(E log V) for a graph with V vertices and E edges, assuming the use of a priority queue (like a min-heap).

Is this accurate? Yes, for weighted graphs, and it depends on the data structure used for the priority queue. If using binary heap, it's O((V + E) log V). Sometimes simplified to O(E log V) for sparse graphs.

Given the simplicity required, this should suffice.

Answer in one line:

Time complexity of Dijkstra: O(E log V)

--- Chat ended after 5 turns ---
(venv) root@106f05962c95:/workspace#


@dhandhalyabhavik dhandhalyabhavik changed the title from "Add OpenVINO export and inference support for ministral reasoning model 2510" to "[OpenVINO] Add export and inference support for ministral reasoning model 2510" on Apr 4, 2026
@rkazants rkazants requested a review from Copilot April 20, 2026 18:12
Contributor

Copilot AI left a comment


Pull request overview

Adds OpenVINO export + inference support in optimum-intel for the new mistral3 multimodal model type (Ministral-3-3B-Reasoning / Pixtral vision + Ministral3 LM), wiring it into the exporter task routing, patching for export compatibility, and adding an OpenVINO inference implementation.

Changes:

  • Register mistral3 as a supported multimodal text-generation model type for OpenVINO export.
  • Add Mistral3OpenVINOConfig to route VLM behaviors (vision embeddings / text embeddings / language) and apply model-specific patchers.
  • Add Mistral3-specific patchers (inline vision path + LM cache_position/sliding_window fixes) and a new inference class for embedding merge.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

  • optimum/intel/openvino/modeling_visual_language.py: Adds _OVMistral3ForCausalLM and registers it in the model-type mapping for runtime inference.
  • optimum/exporters/openvino/utils.py: Registers mistral3 in MULTI_MODAL_TEXT_GENERATION_MODELS so export selects the multimodal submodel path.
  • optimum/exporters/openvino/model_patcher.py: Updates the masking helper for the newer Transformers API and introduces Mistral3 vision/LM patchers for export tracing.
  • optimum/exporters/openvino/model_configs.py: Adds Mistral3OpenVINOConfig and TasksManager registration to enable export task/config selection.


Comment on lines +4180 to +4183
@register_in_tasks_manager("mistral3", *["image-text-to-text"], library_name="transformers")
class Mistral3OpenVINOConfig(BaseVLMOpenVINOConfig):
MIN_TRANSFORMERS_VERSION = "5.4.0"


Copilot AI Apr 20, 2026


This PR introduces a new model type (mistral3) with custom export/inference behavior, but there are no corresponding OpenVINO tests added. Given the existing coverage for other VLMs in tests/openvino/*, please add at least a smoke test that exports and runs a short generate pass for mistral3 (or a tiny/random checkpoint) to prevent regressions in the patchers and behavior routing.
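
For illustration, a smoke test along these lines might look like the sketch below; the tiny checkpoint id is a placeholder and not something this PR provides.

```python
# Hypothetical smoke test sketch for tests/openvino; the checkpoint id
# below is a placeholder for a tiny/random mistral3 model.
from transformers import AutoProcessor

from optimum.intel.openvino import OVModelForVisualCausalLM

TINY_MISTRAL3 = "hf-internal-testing/tiny-random-mistral3"  # placeholder


def test_mistral3_export_and_generate(tmp_path):
    # Export to OpenVINO IR on load, save, then reload to cover both paths.
    model = OVModelForVisualCausalLM.from_pretrained(TINY_MISTRAL3, export=True)
    model.save_pretrained(tmp_path)
    model = OVModelForVisualCausalLM.from_pretrained(tmp_path)

    processor = AutoProcessor.from_pretrained(TINY_MISTRAL3)
    inputs = processor(text="Hello", return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=5)
    assert output.shape[1] > inputs["input_ids"].shape[1]
```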

Comment on lines +8350 to +8352
# Step 2: Flatten and normalize (single image, batch=1)
patch_embeds = patch_embeds[0].flatten(1).T.unsqueeze(0) # (1, h*w, d)
patch_embeds = vision_tower.ln_pre(patch_embeds)

Copilot AI Apr 20, 2026


_mistral3_vision_embed_forward hard-codes batch size 1 by indexing patch_embeds[0] (and later selected.squeeze(0)), which will silently drop all but the first image if pixel_values has batch>1. Either make the implementation batch-safe (operate on the full batch) or add an explicit check that pixel_values.shape[0] == 1 and raise a clear error so callers don't get incorrect results.
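
For illustration, the suggested guard could be as simple as the sketch below (pixel_values is assumed to be the patcher's batched image tensor):

```python
# Illustrative guard only; assumes pixel_values has shape
# (batch, channels, height, width) as in the quoted snippet.
if pixel_values.shape[0] != 1:
    raise ValueError(
        "The Mistral3 vision patcher currently supports a single image "
        f"(batch size 1), got batch size {pixel_values.shape[0]}."
    )
```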


# Patch merger - unfold + merge
spatial_merge = config.spatial_merge_size
patch_size = config.vision_config.patch_size

Copilot AI Apr 20, 2026


patch_size = config.vision_config.patch_size is assigned but never used. Please remove it to avoid dead code and keep this patcher easier to maintain.

Suggested change
patch_size = config.vision_config.patch_size

Comment on lines +4820 to +4823
special_image_mask = (input_ids == image_token_id).unsqueeze(-1)
special_image_mask = special_image_mask.expand_as(inputs_embeds)

image_features = image_features.view(-1, image_features.shape[-1]).to(inputs_embeds.device, inputs_embeds.dtype)

Copilot AI Apr 20, 2026


merge_vision_text_embeddings relies on masked_scatter to fail if the number of image-token positions doesn't match the number of vision embeddings, which can produce a fairly opaque runtime error. Consider adding an explicit count check (like the _OVLlama4ForCausalLM implementation just above) and raising a clear ValueError when there is a mismatch; also ensure the boolean mask is on inputs_embeds.device before scattering.

Suggested change
special_image_mask = (input_ids == image_token_id).unsqueeze(-1)
special_image_mask = special_image_mask.expand_as(inputs_embeds)
image_features = image_features.view(-1, image_features.shape[-1]).to(inputs_embeds.device, inputs_embeds.dtype)
special_image_mask = (input_ids == image_token_id).unsqueeze(-1).to(inputs_embeds.device)
image_features = image_features.view(-1, image_features.shape[-1]).to(inputs_embeds.device, inputs_embeds.dtype)
num_image_tokens = special_image_mask[..., 0].sum().item()
num_image_features = image_features.shape[0]
if num_image_tokens != num_image_features:
    raise ValueError(
        f"Image features and image tokens do not match: tokens: {num_image_tokens}, features {num_image_features}"
    )
special_image_mask = special_image_mask.expand_as(inputs_embeds)

Collaborator

@rkazants rkazants left a comment


Please add proper PR description and tests

@dhandhalyabhavik
Author

Hi @rkazants

Thank you for the review. I am working on both Ministral PRs and will update them soon with proper descriptions and changes.

