|
| 1 | +# Ministral-3-3B ONNX Runtime GenAI Example |
| 2 | + |
| 3 | +This example demonstrates how to convert the [Ministral-3-3B-Instruct-2512](https://huggingface.co/mistralai/Ministral-3-3B-Instruct-2512) vision-language model to ONNX format using Olive and run inference with ONNX Runtime GenAI. |
| 4 | + |
| 5 | +Ministral-3-3B is a vision-language model (VLM) that combines a Pixtral vision encoder with a Mistral text decoder, which uses YaRN RoPE for extended context. The pipeline exports three sub-models: |
| 6 | +- **Vision encoder** and **embedding** via [mobius](https://github.com/onnxruntime/mobius) (declarative ONNX graph construction); vision optionally INT4-quantized via Olive for CPU |
| 7 | +- **Text decoder** via Olive/ModelBuilder (GQA + INT4/FP16 quantization) |
| 8 | + |
| 9 | +## Prerequisites |
| 10 | + |
| 11 | +```bash |
| 12 | +pip install -r requirements.txt |
| 13 | +``` |
| 14 | + |
| 15 | +Install ONNX Runtime GenAI: |
| 16 | + |
| 17 | +| Device | Install Command | |
| 18 | +|--------|-----------------| |
| 19 | +| CPU | `pip install onnxruntime-genai --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple` | |
| 20 | +| GPU (CUDA) | `pip install onnxruntime-genai-cuda --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple` | |
| 21 | + |
| 22 | +## Steps |
| 23 | + |
| 24 | +### 1. Export & Optimize Models |
| 25 | + |
| 26 | +**CPU (INT4 decoder and vision; embedding stays FP16):** |
| 27 | + |
| 28 | +```bash |
| 29 | +python optimize.py --config-dir cpu_and_mobile --device cpu |
| 30 | +``` |
| 31 | + |
| 32 | +**CUDA (FP16 all models):** |
| 33 | + |
| 34 | +```bash |
| 35 | +python optimize.py --config-dir cuda --device gpu |
| 36 | +``` |
| 37 | + |
| 38 | +**With a local dequantized checkpoint (skips FP8 dequantization):** |
| 39 | + |
| 40 | +```bash |
| 41 | +python optimize.py --config-dir cpu_and_mobile --device cpu --model-path /path/to/Ministral-3-3B-dequantized |
| 42 | +``` |
| 43 | + |
| 44 | +This runs: |
| 45 | +- **Olive/ModelBuilder** for text decoder (GQA attention, YaRN RoPE, INT4/FP16) |
| 46 | +- **Mobius** for vision encoder (Pixtral, dynamic H×W, 2D RoPE) and embedding (token + image fusion) |
| 47 | +- **Olive INT4 quantization** on vision (cpu_and_mobile only; embedding stays FP16) |
| 48 | + |
| 49 | +The script then generates `genai_config.json` and `processor_config.json` for the ORT GenAI runtime. |
| 50 | + |
| 51 | +### 2. Output Structure |
| 52 | + |
| 53 | +``` |
| 54 | +cpu_and_mobile/models/ # or cuda/models/ |
| 55 | +├── decoder/ |
| 56 | +│ ├── model.onnx # Text decoder (Mistral + YaRN) |
| 57 | +│ └── model.onnx.data |
| 58 | +├── vision/ |
| 59 | +│ ├── model.onnx # Pixtral vision encoder (INT4 for cpu_and_mobile, FP16 for cuda) |
| 60 | +│ └── model.onnx.data |
| 61 | +├── embedding/ |
| 62 | +│ ├── model.onnx # Embedding fusion model (FP16) |
| 63 | +│ └── model.onnx.data |
| 64 | +├── genai_config.json # Runtime configuration |
| 65 | +├── processor_config.json # Pixtral image preprocessing |
| 66 | +├── tokenizer.json |
| 67 | +└── tokenizer_config.json |
| 68 | +``` |
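After an export, the layout above can be checked mechanically. The following is a minimal sketch rather than part of the recipe: `EXPECTED_FILES` and `missing_outputs` are hypothetical names, and the file list is copied from the tree above.

```python
from pathlib import Path

# Files listed in the output tree above, relative to the models directory.
EXPECTED_FILES = [
    "decoder/model.onnx",
    "decoder/model.onnx.data",
    "vision/model.onnx",
    "vision/model.onnx.data",
    "embedding/model.onnx",
    "embedding/model.onnx.data",
    "genai_config.json",
    "processor_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
]


def missing_outputs(models_dir):
    """Return the expected files that are absent from models_dir."""
    root = Path(models_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_outputs("cpu_and_mobile/models")` returns an empty list when every file in the tree is present.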
| 69 | + |
| 70 | +### 3. Run Inference |
| 71 | + |
| 72 | +```bash |
| 73 | +# Text-only |
| 74 | +python inference.py --prompt "What is the capital of France?" |
| 75 | + |
| 76 | +# Image + text |
| 77 | +python inference.py --image photo.jpg --prompt "Describe this image" |
| 78 | + |
| 79 | +# Interactive mode |
| 80 | +python inference.py --interactive |
| 81 | + |
| 82 | +# CUDA model |
| 83 | +python inference.py --model_path cuda/models --prompt "Hello" |
| 84 | +``` |
| 85 | + |
| 86 | +Alternatively, use the built-in GenAI multimodal demo: |
| 87 | + |
| 88 | +```bash |
| 89 | +python -m onnxruntime_genai.models.model_mm -m cpu_and_mobile/models --max_length 4096 |
| 90 | +``` |
| 91 | + |
| 92 | +### 4. Evaluate |
| 93 | + |
| 94 | +Run the AI2D science diagram QA benchmark: |
| 95 | + |
| 96 | +```bash |
| 97 | +# ONNX only (CPU INT4) |
| 98 | +python eval.py --device cpu --model_path cpu_and_mobile/models |
| 99 | + |
| 100 | +# ONNX only (CUDA FP16) |
| 101 | +python eval.py --device cuda --model_path cuda/models |
| 102 | + |
| 103 | +# Compare ONNX vs PyTorch reference |
| 104 | +python eval.py --pytorch_model mistralai/Ministral-3-3B-Instruct-2512 --num_samples 100 |
| 105 | +``` |
| 106 | + |
| 107 | +Expected accuracy gaps between ONNX and the PyTorch reference, in percentage points (pp): |
| 108 | +- **FP32**: ~0 pp (near-exact parity) |
| 109 | +- **FP16**: <2 pp (precision loss) |
| 110 | +- **INT4**: <5 pp (quantization loss) |
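To make the percentage-point comparison concrete, here is a small sketch of how a gap could be computed and checked against the thresholds listed above. The function names and the `MAX_GAP_PP` table are illustrative assumptions, not the interface of `eval.py`.

```python
# Thresholds from the list above, in percentage points (pp).
MAX_GAP_PP = {"fp16": 2.0, "int4": 5.0}


def accuracy_gap_pp(pytorch_correct, onnx_correct, num_samples):
    """Accuracy drop of ONNX relative to PyTorch, in percentage points."""
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    return 100.0 * (pytorch_correct - onnx_correct) / num_samples


def within_expected_gap(gap_pp, precision):
    """True if the absolute gap is below the threshold for this precision."""
    return abs(gap_pp) < MAX_GAP_PP[precision]
```

For example, 80 correct PyTorch answers against 77 correct ONNX answers on 100 samples is a 3 pp gap, which is within the INT4 expectation but outside the FP16 one.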
| 111 | + |
| 112 | +## Directory Structure |
| 113 | + |
| 114 | +``` |
| 115 | +mistralai-Ministral-3-3B-Instruct-2512/builtin/ |
| 116 | +├── cpu_and_mobile/ |
| 117 | +│ ├── text.json # INT4 text decoder config (Olive/ModelBuilder) |
| 118 | +│ └── vision.json # INT4 vision quantization (Olive, post-mobius) |
| 119 | +├── cuda/ |
| 120 | +│ └── text.json # FP16 text decoder config (Olive/ModelBuilder) |
| 121 | +├── optimize.py # Export orchestrator (Olive + Mobius) |
| 122 | +├── inference.py # ORT GenAI inference (text + VLM) |
| 123 | +├── eval.py # AI2D benchmark evaluation |
| 124 | +├── requirements.txt |
| 125 | +├── info.yml |
| 126 | +└── README.md |
| 127 | +``` |
| 128 | + |
| 129 | +> **Note:** Unlike Qwen VLM recipes (which use Olive for all 3 sub-models end-to-end), |
| 130 | +> Ministral uses **mobius** for vision and embedding ONNX export, then **Olive** for |
| 131 | +> INT4 quantization of the vision encoder (cpu_and_mobile only). The CUDA target uses FP16 from mobius directly. |
| 132 | + |
| 133 | +## Differences from Qwen VLM Recipes |
| 134 | + |
| 135 | +Qwen VLM recipes export all three sub-models through Olive using JSON configs |
| 136 | +(`text.json`, `vision.json`, `embedding.json`). Each JSON defines a multi-pass |
| 137 | +pipeline: PyTorch export → graph surgery → ORT fusion → quantization/FP16. |
| 138 | + |
| 139 | +This recipe takes a different approach for **vision and embedding**: |
| 140 | + |
| 141 | +| Component | Qwen | Ministral | Why | |
| 142 | +|-----------|------|-----------|-----| |
| 143 | +| Text decoder | Olive/ModelBuilder (`text.json`) | Olive/ModelBuilder (`text.json`) | Same — ModelBuilder handles GQA + quantization | |
| 144 | +| Vision encoder | Olive: PyTorch export + 5-6 passes | **Mobius** export + Olive INT4 (`vision.json`) | Pixtral's dynamic image dims break `torch.onnx.export` | |
| 145 | +| Embedding | Olive: PyTorch export + 5 passes | **Mobius** export (FP16, no INT4) | INT4 breaks embedding's Equal/Gather logic | |
| 146 | + |
| 147 | +**Why does Ministral use mobius instead of Olive for export?** Mobius constructs |
| 148 | +the ONNX graph declaratively rather than tracing through PyTorch. The resulting |
| 149 | +models already contain the graph optimizations that Qwen's Olive passes spend |
| 150 | +5-6 steps creating: |
| 151 | + |
| 152 | +- **Fused operators:** `MultiHeadAttention`, `SkipSimplifiedLayerNormalization`, |
| 153 | + `RotaryEmbedding` — already present in mobius output (Qwen achieves these via |
| 154 | + `OrtTransformersOptimization`) |
| 155 | +- **FP16 weights:** all 840M vision params exported as FP16 directly (Qwen |
| 156 | + converts from FP32 via `OnnxFloatToFloat16`) |
| 157 | +- **Clean graph:** 0 Gemm nodes, 0 redundant Cast chains (Qwen cleans these |
| 158 | + via `GemmToMatMulAdd` and `OnnxPeepholeOptimizer`) |
| 159 | +- **No PyTorch export artifacts:** no `PackedAttentionToLoopMHA` surgery needed |
| 160 | + since mobius doesn't go through dynamo |
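The graph properties listed above can be spot-checked by counting operator types in an exported model. This is a hedged sketch: `count_op_types` and `summarize_model` are hypothetical helpers, and only top-level graph nodes are counted (nodes nested in subgraphs or model-local functions are not).

```python
from collections import Counter


def count_op_types(nodes):
    """Count top-level graph nodes by op_type."""
    return Counter(node.op_type for node in nodes)


def summarize_model(onnx_path):
    """Load an ONNX file (graph only) and count its operators."""
    # Imported lazily; requires the `onnx` package.
    import onnx

    # load_external_data=False skips the large model.onnx.data weights file.
    model = onnx.load(onnx_path, load_external_data=False)
    return count_op_types(model.graph.node)
```

If the claims above hold, `summarize_model("cuda/models/vision/model.onnx")["Gemm"]` should be 0, and the fused operators should appear with non-zero counts.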
| 161 | + |
| 162 | +**What Olive still handles:** For `cpu_and_mobile`, `vision.json` applies |
| 163 | +`OnnxBlockWiseRtnQuantization` (INT4) to the mobius-exported FP16 vision model. |
| 164 | +For `cuda`, no additional Olive passes are needed — FP16 is optimal for GPU. |
| 165 | + |
| 166 | +**Why `optimize.py` is longer (~400 lines) than the Qwen equivalent (~170 lines):** |
| 167 | + |
| 168 | +| Code section | Lines | Why it can't be JSON-driven | |
| 169 | +|---|---|---| |
| 170 | +| `export_vision_and_embedding()` | ~55 | Olive has no mobius integration; Pixtral's dynamic dims cause dynamo failures | |
| 171 | +| `update_genai_config()` | ~150 | Olive generates decoder config only; VLM 3-model config + transforms-based processor_config has no Olive pass | |
| 172 | +| `quantize_vision_and_embedding()` | ~25 | Post-export INT4 on pre-built ONNX (Olive JSON-driven, but needs orchestration) | |
| 173 | +| `fix_tokenizer()` | ~15 | No Olive tokenizer patching pass | |
| 174 | + |
| 175 | +The text decoder export (`text.json`) and INT4 quantization (`vision.json`) are still Olive JSON-driven, as in the Qwen recipes. |
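As an illustration of the kind of step listed for `fix_tokenizer()` in the table above, the sketch below rewrites the tokenizer class name in `tokenizer_config.json`. It is not the code from `optimize.py`: the helper name is hypothetical, and it assumes the class name is stored under the `tokenizer_class` key.

```python
import json
from pathlib import Path


def patch_tokenizer_class(models_dir, old="TokenizersBackend", new="LlamaTokenizer"):
    """Rewrite tokenizer_class in tokenizer_config.json; return True if changed."""
    config_path = Path(models_dir) / "tokenizer_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # ASSUMPTION: the class name is stored under the "tokenizer_class" key.
    if config.get("tokenizer_class") != old:
        return False
    config["tokenizer_class"] = new
    config_path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return True
```

The function is idempotent: a second call finds nothing to change and returns `False`.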
| 176 | + |
| 177 | +## Notes |
| 178 | + |
| 179 | +- The Hugging Face checkpoint stores FP8-quantized weights. The export pipeline dequantizes these automatically (`weight * weight_scale_inv`). |
| 180 | +- The tokenizer uses the `TokenizersBackend` class, which ORT GenAI doesn't support. `optimize.py` changes it to `LlamaTokenizer`. |
| 181 | +- Pixtral vision supports dynamic image sizes (multiples of 28, up to 1540×1540). |
| 182 | +- The text decoder includes `llama_4_attn_scale` for long-context attention (>16K tokens). |
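The FP8 dequantization formula in the first note, `weight * weight_scale_inv`, can be worked through in plain Python. This is a sketch under stated assumptions, not the pipeline's code: it assumes the weights use the E4M3FN encoding and a single scalar `weight_scale_inv` per tensor, and both helper names are hypothetical.

```python
import math


def decode_e4m3fn(byte):
    """Decode one FP8 E4M3FN byte (1 sign, 4 exponent, 3 mantissa bits)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        return math.nan  # E4M3FN has no infinities; this pattern is NaN
    if exponent == 0:
        # Subnormal: no implicit leading 1, fixed exponent of -6
        return sign * (mantissa / 8.0) * 2.0 ** -6
    # Normal value: implicit leading 1, exponent bias of 7
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def dequantize_fp8(raw_bytes, weight_scale_inv):
    """Apply weight * weight_scale_inv to a flat sequence of FP8 bytes."""
    return [decode_e4m3fn(b) * weight_scale_inv for b in raw_bytes]
```

With a scale of 0.5, the bytes `0x38` (1.0), `0xB8` (-1.0) and `0x7E` (448.0, the largest finite E4M3FN value) dequantize to 0.5, -0.5 and 224.0.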