Commit 7a914be

titaiwangms and Copilot committed
Add Ministral-3-3B VLM recipe: hybrid Olive + Mobius export
Complete Olive recipe for Ministral-3-3B-Instruct-2512 VLM using:
- Text decoder: Olive/ModelBuilder (INT4 for both CPU and CUDA)
- Vision encoder + embedding: Mobius (dynamo-free ONNX construction)
- Vision INT4 quantization: Olive post-export (CPU only)
- Permute3D transform in processor_config for NCHW layout

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 11d132b commit 7a914be

10 files changed

Lines changed: 1325 additions & 0 deletions

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
# Generated model artifacts
models/

# Python bytecode
__pycache__/
*.pyc

# Olive cache
.olive-cache/
Lines changed: 182 additions & 0 deletions
@@ -0,0 +1,182 @@
# Ministral-3-3B ONNX Runtime GenAI Example

This example demonstrates how to convert the [Ministral-3-3B-Instruct-2512](https://huggingface.co/mistralai/Ministral-3-3B-Instruct-2512) vision-language model to ONNX format using Olive and run inference with ONNX Runtime GenAI.

Ministral-3-3B is a vision-language model (VLM) that combines a Pixtral vision encoder with a Mistral text decoder using YaRN RoPE for extended context. The pipeline exports three sub-models:
- **Vision encoder** and **embedding** via [mobius](https://github.com/onnxruntime/mobius) (declarative ONNX graph construction); the vision encoder is then INT4-quantized via Olive for the CPU target
- **Text decoder** via Olive/ModelBuilder (GQA + INT4 quantization)

## Prerequisites

```bash
pip install -r requirements.txt
```

Install ONNX Runtime GenAI:

| Device | Install Command |
|--------|-----------------|
| CPU | `pip install onnxruntime-genai --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple` |
| GPU (CUDA) | `pip install onnxruntime-genai-cuda --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple` |

## Steps

### 1. Export & Optimize Models

**CPU (INT4 text decoder and vision encoder, FP16 embedding):**

```bash
python optimize.py --config-dir cpu_and_mobile --device cpu
```

**CUDA (INT4 text decoder, FP16 vision encoder and embedding):**

```bash
python optimize.py --config-dir cuda --device gpu
```

**With a local dequantized checkpoint (skips FP8 dequantization):**

```bash
python optimize.py --config-dir cpu_and_mobile --device cpu --model-path /path/to/Ministral-3-3B-dequantized
```

This runs:
- **Olive/ModelBuilder** for the text decoder (GQA attention, YaRN RoPE, INT4)
- **Mobius** for the vision encoder (Pixtral, dynamic H×W, 2D RoPE) and embedding (token + image fusion)
- **Olive INT4 quantization** on the vision encoder (`cpu_and_mobile` only; embedding stays FP16)

The script then generates `genai_config.json` and `processor_config.json` for the ORT GenAI runtime.
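
The commit message mentions a `Permute3D` transform in `processor_config.json` that produces the NCHW layout. As a rough illustration of what such a permutation does (not the recipe's or the runtime's actual code), the sketch below reorders a height × width × channel image into channel × height × width using plain Python lists. The helper name `permute_hwc_to_chw` and the axis order `(2, 0, 1)` are assumptions made for this example.

```python
def permute_hwc_to_chw(image):
    """Reorder a nested-list image from [H][W][C] to [C][H][W].

    Hypothetical helper for illustration; equivalent to a 3-D permute
    with axis order (2, 0, 1).
    """
    height = len(image)
    width = len(image[0])
    channels = len(image[0][0])
    return [
        [[image[y][x][c] for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]


# A 2x2 RGB image: each pixel is [R, G, B].
hwc = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]
chw = permute_hwc_to_chw(hwc)
# chw[0] is the red plane, chw[1] the green plane, chw[2] the blue plane.
```

Adding a leading batch dimension to the result would give the N×C×H×W shape that NCHW refers to.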

### 2. Output Structure

```
cpu_and_mobile/models/          # or cuda/models/
├── decoder/
│   ├── model.onnx              # Text decoder (Mistral + YaRN)
│   └── model.onnx.data
├── vision/
│   ├── model.onnx              # Pixtral vision encoder (INT4 for cpu_and_mobile, FP16 for cuda)
│   └── model.onnx.data
├── embedding/
│   ├── model.onnx              # Embedding fusion model (FP16)
│   └── model.onnx.data
├── genai_config.json           # Runtime configuration
├── processor_config.json       # Pixtral image preprocessing
├── tokenizer.json
└── tokenizer_config.json
```
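
A quick way to confirm that an export finished is to check the output directory against the tree above. The following is a minimal sketch; `missing_outputs` is a hypothetical helper and not part of the recipe's scripts.

```python
from pathlib import Path

# File list copied from the output tree above.
EXPECTED_FILES = [
    "decoder/model.onnx",
    "decoder/model.onnx.data",
    "vision/model.onnx",
    "vision/model.onnx.data",
    "embedding/model.onnx",
    "embedding/model.onnx.data",
    "genai_config.json",
    "processor_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
]


def missing_outputs(models_dir):
    """Return the expected files that are absent under models_dir."""
    root = Path(models_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_outputs("cpu_and_mobile/models")
    print("all outputs present" if not missing else f"missing: {missing}")
```

Pass `cuda/models` instead to check the CUDA export.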
69+
70+
### 3. Run Inference
71+
72+
```bash
73+
# Text-only
74+
python inference.py --prompt "What is the capital of France?"
75+
76+
# Image + text
77+
python inference.py --image photo.jpg --prompt "Describe this image"
78+
79+
# Interactive mode
80+
python inference.py --interactive
81+
82+
# CUDA model
83+
python inference.py --model_path cuda/models --prompt "Hello"
84+
```
85+
86+
Alternatively, use the built-in GenAI multimodal demo:
87+
88+
```bash
89+
python -m onnxruntime_genai.models.model_mm -m cpu_and_mobile/models --max_length 4096
90+
```
91+
92+
### 4. Evaluate
93+
94+
Run the AI2D science diagram QA benchmark:
95+
96+
```bash
97+
# ONNX only (CPU INT4)
98+
python eval.py --device cpu --model_path cpu_and_mobile/models
99+
100+
# ONNX only (CUDA FP16)
101+
python eval.py --device cuda --model_path cuda/models
102+
103+
# Compare ONNX vs PyTorch reference
104+
python eval.py --pytorch_model mistralai/Ministral-3-3B-Instruct-2512 --num_samples 100
105+
```
106+
107+
Expected precision gaps (ONNX vs PyTorch):
108+
- **FP32**: ~0 pp (exact parity)
109+
- **FP16**: <2 pp (precision loss)
110+
- **INT4**: <5 pp (quantization loss)
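
To make the thresholds above concrete, here is a small sketch that turns two accuracies into a percentage-point gap and compares it with the expected bound. The function names are hypothetical, the 0.5 pp tolerance standing in for "~0 pp" is an assumption, and the accuracies in the usage line are made up for illustration rather than measured.

```python
# Thresholds in percentage points, copied from the list above.
# "~0 pp" for FP32 is modelled with a small tolerance (assumption).
MAX_GAP_PP = {"fp32": 0.5, "fp16": 2.0, "int4": 5.0}


def accuracy_gap_pp(pytorch_acc, onnx_acc):
    """Gap in percentage points between two accuracies given as fractions in [0, 1]."""
    return (pytorch_acc - onnx_acc) * 100.0


def within_expected_gap(precision, pytorch_acc, onnx_acc):
    """True if the ONNX model trails PyTorch by less than the expected gap."""
    return accuracy_gap_pp(pytorch_acc, onnx_acc) < MAX_GAP_PP[precision]


# Made-up accuracies, for illustration only (not measured results).
print(within_expected_gap("int4", pytorch_acc=0.80, onnx_acc=0.77))
```

The check is one-sided on purpose: an ONNX model that scores higher than the PyTorch reference yields a negative gap and passes.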

## Directory Structure

```
mistralai-Ministral-3-3B-Instruct-2512/builtin/
├── cpu_and_mobile/
│   ├── text.json               # INT4 text decoder config (Olive/ModelBuilder)
│   └── vision.json             # INT4 vision quantization (Olive, post-mobius)
├── cuda/
│   └── text.json               # INT4 text decoder config (Olive/ModelBuilder, CUDA EP)
├── optimize.py                 # Export orchestrator (Olive + Mobius)
├── inference.py                # ORT GenAI inference (text + VLM)
├── eval.py                     # AI2D benchmark evaluation
├── requirements.txt
├── info.yml
└── README.md
```

> **Note:** Unlike Qwen VLM recipes (which use Olive for all 3 sub-models end-to-end),
> Ministral uses **mobius** for vision and embedding ONNX export, then **Olive** for
> INT4 quantization of the vision encoder (cpu_and_mobile only). The CUDA target uses
> the FP16 vision and embedding models from mobius directly.

## Differences from Qwen VLM Recipes

Qwen VLM recipes export all three sub-models through Olive using JSON configs
(`text.json`, `vision.json`, `embedding.json`). Each JSON defines a multi-pass
pipeline: PyTorch export → graph surgery → ORT fusion → quantization/FP16.

This recipe takes a different approach for **vision and embedding**:

| Component | Qwen | Ministral | Why |
|-----------|------|-----------|-----|
| Text decoder | Olive/ModelBuilder (`text.json`) | Olive/ModelBuilder (`text.json`) | Same: ModelBuilder handles GQA + quantization |
| Vision encoder | Olive: PyTorch export + 5-6 passes | **Mobius** export + Olive INT4 (`vision.json`, cpu_and_mobile only) | Pixtral's dynamic image dims break `torch.onnx.export` |
| Embedding | Olive: PyTorch export + 5 passes | **Mobius** export (FP16, no INT4) | INT4 breaks embedding's Equal/Gather logic |

**Why does Ministral use mobius instead of Olive for export?** Mobius constructs
the ONNX graph declaratively rather than tracing through PyTorch. The resulting
models already contain the graph optimizations that Qwen's Olive passes take
5-6 steps to produce:

- **Fused operators:** `MultiHeadAttention`, `SkipSimplifiedLayerNormalization`,
  `RotaryEmbedding` are already present in mobius output (Qwen achieves these via
  `OrtTransformersOptimization`)
- **FP16 weights:** all 840M vision params exported as FP16 directly (Qwen
  converts from FP32 via `OnnxFloatToFloat16`)
- **Clean graph:** 0 Gemm nodes, 0 redundant Cast chains (Qwen cleans these
  via `GemmToMatMulAdd` and `OnnxPeepholeOptimizer`)
- **No PyTorch export artifacts:** no `PackedAttentionToLoopMHA` surgery needed
  since mobius doesn't go through dynamo

**What Olive still handles:** For `cpu_and_mobile`, `vision.json` applies
`OnnxBlockWiseRtnQuantization` (INT4) to the mobius-exported FP16 vision model.
For `cuda`, no additional Olive passes are applied to the vision model; the FP16
output from mobius is used as-is.

**Why `optimize.py` is longer (~400 lines) than Qwen's (~170 lines):**

| Code section | Lines | Why it can't be JSON-driven |
|---|---|---|
| `export_vision_and_embedding()` | ~55 | Olive has no mobius integration; Pixtral's dynamic dims cause dynamo failures |
| `update_genai_config()` | ~150 | Olive generates decoder config only; VLM 3-model config + transforms-based processor_config has no Olive pass |
| `quantize_vision_and_embedding()` | ~25 | Post-export INT4 on pre-built ONNX (Olive JSON-driven, but needs orchestration) |
| `fix_tokenizer()` | ~15 | No Olive tokenizer patching pass |

The text decoder export (`text.json`) and the vision INT4 quantization (`vision.json`) are still Olive JSON-driven, as in the Qwen recipes.
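
The `fix_tokenizer()` row in the table above refers to patching the tokenizer class after export. A minimal sketch of such a patch is shown here; it is not the recipe's implementation. The helper name `fix_tokenizer_class` is hypothetical, and the sketch assumes the class name is stored under a `tokenizer_class` key in `tokenizer_config.json`.

```python
import json
from pathlib import Path


def fix_tokenizer_class(models_dir, old="TokenizersBackend", new="LlamaTokenizer"):
    """Rewrite the tokenizer class in tokenizer_config.json; return True if changed.

    Assumes the class name lives under the "tokenizer_class" key.
    """
    config_path = Path(models_dir) / "tokenizer_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    if config.get("tokenizer_class") != old:
        return False  # nothing to patch
    config["tokenizer_class"] = new
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return True
```

Returning a boolean makes the patch safe to re-run: a second call on an already-fixed directory leaves the file untouched.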

## Notes

- The Hugging Face checkpoint uses FP8 quantized weights. The export pipeline dequantizes these automatically (`weight * weight_scale_inv`).
- The tokenizer config specifies the `TokenizersBackend` class, which ORT GenAI doesn't support. The optimize script rewrites it to `LlamaTokenizer`.
- The Pixtral vision encoder supports dynamic image sizes (multiples of 28, up to 1540×1540).
- The text decoder includes `llama_4_attn_scale` for long-context attention (>16K tokens).
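
The FP8 dequantization mentioned in the first note is the expression `weight * weight_scale_inv`. The sketch below illustrates that arithmetic only, under two assumptions: there is one scalar scale per tensor, and ordinary Python floats stand in for decoded FP8 values. The `dequantize` helper is hypothetical and is not the export pipeline's code.

```python
def dequantize(weight, weight_scale_inv):
    """Multiply every element of a 2-D weight by a per-tensor scalar scale.

    Assumption: a single scalar scale per tensor. Values are plain Python
    floats standing in for decoded FP8 numbers.
    """
    return [[value * weight_scale_inv for value in row] for row in weight]


quantized = [[1.0, -2.0], [0.5, 4.0]]   # stand-ins for decoded FP8 values
restored = dequantize(quantized, weight_scale_inv=0.25)
```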
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
{
    "input_model": {
        "type": "HfModel",
        "model_path": "mistralai/Ministral-3-3B-Instruct-2512"
    },
    "passes": {
        "convert": {
            "type": "ModelBuilder",
            "precision": "int4",
            "int4_accuracy_level": 4,
            "extra_options": {
                "filename": "model.onnx"
            }
        }
    },
    "no_artifacts": true,
    "output_dir": "cpu_and_mobile/models/decoder"
}
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
{
    "input_model": {
        "type": "ONNXModel",
        "model_path": "cpu_and_mobile/models/vision/model.onnx"
    },
    "passes": {
        "int4": {
            "type": "OnnxBlockWiseRtnQuantization",
            "block_size": 128,
            "is_symmetric": true,
            "accuracy_level": 4,
            "save_as_external_data": true,
            "external_data_name": "model.onnx.data"
        }
    },
    "no_artifacts": true,
    "output_dir": "cpu_and_mobile/models/vision"
}
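
The `int4` pass above requests symmetric block-wise round-to-nearest quantization with a block size of 128. As a rough illustration of the idea (not ONNX Runtime's actual implementation, which packs two 4-bit values per byte and may use a different range or zero-point convention), the sketch below quantizes a flat list of weights block by block. All function names are hypothetical, and the scale choice `max(|w|) / 7` is an assumption.

```python
def quantize_block_symmetric(block):
    """Round-to-nearest symmetric 4-bit quantization of one block.

    Returns (codes, scale) with integer codes in [-7, 7]. The scale choice
    max(|w|) / 7 is an assumption for this sketch. Python's round() sends
    exact ties to the nearest even integer.
    """
    max_abs = max(abs(w) for w in block)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(w / scale))) for w in block]
    return codes, scale


def quantize_blockwise(weights, block_size=128):
    """Split a flat weight list into blocks and quantize each independently."""
    return [
        quantize_block_symmetric(weights[i:i + block_size])
        for i in range(0, len(weights), block_size)
    ]


def dequantize_blockwise(blocks):
    """Reconstruct approximate weights from (codes, scale) pairs."""
    return [code * scale for codes, scale in blocks for code in codes]


weights = [7.0, -3.0, 1.2, 0.0, 14.0, -2.0, 6.0, 4.0]
blocks = quantize_blockwise(weights, block_size=4)   # small block for the demo
approx = dequantize_blockwise(blocks)
```

Because each block carries its own scale, a large weight in one block does not reduce the resolution available to the others. In this demo the only value that changes after the round trip is `1.2`, which becomes `1.0`.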
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
{
    "input_model": {
        "type": "HfModel",
        "model_path": "mistralai/Ministral-3-3B-Instruct-2512"
    },
    "passes": {
        "convert": {
            "type": "ModelBuilder",
            "precision": "int4",
            "int4_accuracy_level": 4,
            "extra_options": {
                "filename": "model.onnx"
            }
        }
    },
    "engine": {
        "target": {
            "type": "LocalSystem",
            "accelerators": [
                {
                    "device": "gpu",
                    "execution_providers": [
                        "CUDAExecutionProvider"
                    ]
                }
            ]
        }
    },
    "no_artifacts": true,
    "output_dir": "cuda/models/decoder"
}
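
Comparing the two `ModelBuilder` configs in this commit, the CUDA one differs from the CPU one only in the added `engine` block and the `output_dir`. The sketch below derives one from the other to make that difference explicit. `make_cuda_config` is a hypothetical helper; the recipe itself keeps two separate JSON files.

```python
import copy
import json

# Contents of the CPU text decoder config, as shown in this commit.
CPU_TEXT_CONFIG = {
    "input_model": {
        "type": "HfModel",
        "model_path": "mistralai/Ministral-3-3B-Instruct-2512",
    },
    "passes": {
        "convert": {
            "type": "ModelBuilder",
            "precision": "int4",
            "int4_accuracy_level": 4,
            "extra_options": {"filename": "model.onnx"},
        }
    },
    "no_artifacts": True,
    "output_dir": "cpu_and_mobile/models/decoder",
}


def make_cuda_config(cpu_config):
    """Return a copy of the CPU config retargeted at the CUDA execution provider."""
    config = copy.deepcopy(cpu_config)
    config["engine"] = {
        "target": {
            "type": "LocalSystem",
            "accelerators": [
                {"device": "gpu", "execution_providers": ["CUDAExecutionProvider"]}
            ],
        }
    }
    config["output_dir"] = "cuda/models/decoder"
    return config


print(json.dumps(make_cuda_config(CPU_TEXT_CONFIG), indent=4))
```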
