Commit 45a501c
qwen3_vl: match HF reference by fixing two upstream mlx-vlm bugs
On the 6-query × 6-image retrieval benchmark, the mlx-embeddings output
had max|cosine diff| = 0.087 vs HF transformers reference and only 83%
top-1 agreement. Three fixes close the gap to max 0.006 diff and 100%
top-1/top-3 agreement:
1. Forward the embedder's MIN_PIXELS/MAX_PIXELS (4096..1,843,200) onto
the inner image_processor. The Qwen3-VL preprocessor_config.json
lists the full-context size bounds (16 MP), so without this override
the image_processor resized to a different grid than the HF reference
and the comparison ran on different visual tokens.
2. Work around an mlx-vlm bug in Qwen3-VL get_input_embeddings:
upstream assigns the result of `mx.eval(deepstack_image_embeds)` to
`deepstack_visual_embeds`, but mx.eval returns None, so the
multi-scale deepstack features were silently dropped at every LM
layer the model was supposed to inject them into. When we detect
this, re-run the vision tower in our Model.get_input_embeddings.
3. Patch mlx-vlm's `_deepstack_process` on the language-model instance:
upstream indexes the full concatenated visual_embeds at each batch
sample's image positions, which only works for batch_size=1. Our
patched version slices visual_embeds per sample using a running
offset so multi-image batches work.
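
For fix 1, a rough sketch of why the pixel bounds change the visual
tokens. The resize rule below is a simplified stand-in, not the actual
image_processor code: `resize_to_bounds`, `visual_tokens`, the 32-pixel
grid factor, and reading "16 MP" as 16 * 2**20 pixels are all
assumptions made for illustration.

```python
import math

MIN_PIXELS = 4096                    # lower bound quoted in the commit message
MAX_PIXELS = 1_843_200               # upper bound quoted in the commit message
FULL_CONTEXT_MAX = 16 * 1024 * 1024  # "16 MP", assumed to mean 16 * 2**20 pixels


def resize_to_bounds(height, width, min_pixels, max_pixels, factor=32):
    """Snap (height, width) to multiples of `factor`, then rescale so the
    pixel count lands inside [min_pixels, max_pixels]. Hypothetical helper."""
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        scale = math.sqrt((height * width) / max_pixels)
        h = max(factor, math.floor(height / scale / factor) * factor)
        w = max(factor, math.floor(width / scale / factor) * factor)
    elif h * w < min_pixels:
        # Too small: grow both sides by the same ratio, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w


def visual_tokens(h, w, factor=32):
    """Number of grid cells, assuming one visual token per factor x factor cell."""
    return (h // factor) * (w // factor)


# Same image, two sets of bounds: the resized grids differ.
embedder_grid = resize_to_bounds(3000, 4000, MIN_PIXELS, MAX_PIXELS)
full_context_grid = resize_to_bounds(3000, 4000, MIN_PIXELS, FULL_CONTEXT_MAX)
```

In this toy model a 3000x4000 image is resized to 1152x1536 under the
embedder's bounds but stays at 3008x4000 under the full-context bounds,
so the two pipelines would be compared on different token grids.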
Once (2) is fixed upstream, (3) surfaces immediately: the bugs are
stacked, and (3) is harmless only for single-image batches.
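
A toy illustration of how the two bugs stack, in plain Python rather
than MLX. Everything here is hypothetical (`inject_naive`,
`inject_with_offsets`, scalar "features", and the choice to model the
upstream failure as a length mismatch); it is not the mlx-vlm code.

```python
def image_positions(mask):
    """Indices of the image-token slots in one sample's boolean mask."""
    return [t for t, is_image in enumerate(mask) if is_image]


def inject_naive(hidden, masks, visual_embeds):
    """Models bug (3): use the full concatenated embeds for every sample."""
    if visual_embeds is None:
        # Models bug (2): features were dropped, so nothing is injected
        # and the indexing below is never reached.
        return [row[:] for row in hidden]
    out = [row[:] for row in hidden]
    for b, mask in enumerate(masks):
        positions = image_positions(mask)
        if len(positions) != len(visual_embeds):
            raise ValueError("embeds do not match this sample's image slots")
        for t, v in zip(positions, visual_embeds):
            out[b][t] += v
    return out


def inject_with_offsets(hidden, masks, visual_embeds):
    """Patched idea: slice the concatenated embeds per sample with a running offset."""
    out = [row[:] for row in hidden]
    offset = 0
    for b, mask in enumerate(masks):
        positions = image_positions(mask)
        chunk = visual_embeds[offset:offset + len(positions)]
        for t, v in zip(positions, chunk):
            out[b][t] += v
        offset += len(positions)  # advance past this sample's features
    return out


hidden = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
masks = [[False, True, True], [True, False, False]]  # 2 + 1 image slots
embeds = [1.0, 2.0, 3.0]                             # concatenated in batch order

dropped = inject_naive(hidden, masks, None)          # bug (2) hides bug (3)
patched = inject_with_offsets(hidden, masks, embeds)
```

With the features dropped, `inject_naive` returns the hidden states
unchanged and never fails; once real features arrive, the same call
raises on this two-sample batch while `inject_with_offsets` places each
sample's slice correctly.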
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 1bd3299 commit 45a501c
2 files changed
Lines changed: 74 additions & 0 deletions
| Changed file | Added lines (new numbering) | Additions |
|---|---|---|
| First file | 5, 15-48, 197-204, 224-246 | 66 |
| Second file | 714-721 | 8 |