Qwen3-VL: torch-free numpy processor by Blaizzy · Pull Request #61 · Blaizzy/mlx-embeddings

Blaizzy · 2026-04-23T20:05:33Z

Summary

HF's Qwen3-VL image/video processors hard-require torch/torchvision. This PR inlines a numpy + PIL port of both processors into mlx_embeddings/models/qwen3_vl/processor.py so mlx-embeddings can run real Qwen3-VL checkpoints without torch installed.

Adapted from mlx-vlm's unreleased torch-free port (commit 1bf7742 on the local dev tree; not yet in any mlx-vlm PyPI release, so we inline rather than bump the dep).

Changes

Adds numpy Qwen3VLImageProcessor + Qwen3VLVideoProcessor (subclasses of HF's ImageProcessingMixin / BaseVideoProcessor, duck-typed to match).
Adds a torch-free Qwen3VLProcessor subclass of HF ProcessorMixin that overrides check_argument_for_proper_class to bypass HF's isinstance check against transformers.utils.dummy_torchvision_objects.
Processor.from_pretrained now delegates to the local Qwen3VLProcessor.from_pretrained, which reads processor_config.json / preprocessor_config.json / video_preprocessor_config.json directly.
Drops the AutoImageProcessor.from_pretrained(use_fast=False) fallback path, the _UnsupportedVideoProcessor stub, and the object.__new__(Qwen3VLProcessor) workaround. Video inputs are now supported, not stubbed.
Updates test_qwen3_vl_processor_from_pretrained_uses_custom_loader to mock at the new Qwen3VLProcessor.from_pretrained boundary.
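
The direct config read described above could look roughly like the sketch below. It is an illustration only, not the code in this PR: `read_processor_configs` is a hypothetical helper, it handles only a local directory (no Hub fallback), and it assumes a missing file should yield an empty config.

```python
import json
from pathlib import Path

# File names taken from the PR description.
CONFIG_FILES = (
    "processor_config.json",
    "preprocessor_config.json",
    "video_preprocessor_config.json",
)


def read_processor_configs(model_dir):
    """Hypothetical helper: load each config file found in a local model dir."""
    configs = {}
    for name in CONFIG_FILES:
        path = Path(model_dir) / name
        # Assumption: a missing file simply yields an empty config.
        configs[name] = json.loads(path.read_text()) if path.is_file() else {}
    return configs
```

Reading the JSON files directly avoids going through any transformers auto-class, which is the point of the local from_pretrained path.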

Fixes on top of the mlx-vlm source

Flatten list-of-list image/video batches — HF's apply_chat_template nests inputs that way and the upstream __call__ crashed on them.
Treat explicit None in preprocessor_config.json (min_pixels / max_pixels) the same as missing. The 2B Instruct checkpoint ships nulls alongside valid size.shortest_edge / size.longest_edge; the previous logic let the nulls clobber the valid sizes.
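
A minimal sketch of the two fixes, under stated assumptions: `flatten_batch` and `resolve_pixel_bounds` are hypothetical names, and the fallback from min_pixels / max_pixels to size.shortest_edge / size.longest_edge is assumed from the description above rather than copied from the PR.

```python
def flatten_batch(items):
    """Flatten one level of nesting: [[a, b], [c]] -> [a, b, c]."""
    if items is None:
        return None
    flat = []
    for item in items:
        if isinstance(item, (list, tuple)):
            flat.extend(item)
        else:
            flat.append(item)
    return flat


def resolve_pixel_bounds(config):
    """Treat an explicit None for min_pixels/max_pixels like a missing key."""
    size = config.get("size") or {}
    min_pixels = config.get("min_pixels")
    max_pixels = config.get("max_pixels")
    # Assumption: fall back to the size.* values when the pixel keys are null.
    if min_pixels is None:
        min_pixels = size.get("shortest_edge")
    if max_pixels is None:
        max_pixels = size.get("longest_edge")
    return min_pixels, max_pixels
```

Checking `is None` after `dict.get` covers both the missing-key case and the explicit-null case, so a null can no longer overwrite a valid size.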

Test plan

pytest mlx_embeddings/tests/test_models.py — 16/16 pass
End-to-end embedding from mlx-community/Qwen3-VL-2B-Instruct-4bit: model.embed({text, image=PIL}) returns (1, 2048) bf16
End-to-end reranker with image: model.rerank({query, documents=[{text, image}, ...]}) returns sensible scores
'torch' in sys.modules stays False across load + embed + rerank (venv also has no torch installed, so this is a hard guarantee)
Qwen3VLImageProcessor and Qwen3VLVideoProcessor are confirmed to be the local numpy classes (not transformers' or mlx-vlm's)
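
The sys.modules check from the test plan can be wrapped in a small helper. This is a sketch: `loaded_forbidden_modules` is a hypothetical name, and including torchvision in the default list is an assumption (the test plan above only mentions torch).

```python
import sys


def loaded_forbidden_modules(forbidden=("torch", "torchvision")):
    """Return the forbidden top-level packages currently present in sys.modules."""
    return [name for name in forbidden if name in sys.modules]
```

Calling it after load + embed + rerank and asserting that the result is an empty list would reproduce the check described above.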

🤖 Generated with Claude Code

HF's Qwen3-VL image/video processors hard-require torch/torchvision. Inline the numpy port adapted from mlx-vlm (commit 1bf7742, unreleased) so mlx-embeddings can run without torch installed, including real checkpoints like mlx-community/Qwen3-VL-2B-Instruct-4bit.

Drops the AutoImageProcessor.from_pretrained(use_fast=False) path, the _UnsupportedVideoProcessor stub, and the object.__new__(Qwen3VLProcessor) trick. Processor.from_pretrained now delegates to the local torch-free Qwen3VLProcessor.from_pretrained, which reads processor_config.json / preprocessor_config.json / video_preprocessor_config.json directly and builds numpy Qwen3VLImageProcessor / Qwen3VLVideoProcessor.

Small fixes on top of the mlx-vlm source:
- Flatten list-of-list image/video batches (HF's apply_chat_template nests them that way).
- Treat explicit None in preprocessor_config.json (min_pixels/max_pixels) the same as missing: the 2B Instruct checkpoint ships nulls alongside valid size.shortest_edge/longest_edge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

README examples surfaced two gaps in the port:
- Image inputs passed as https:// URLs (the embedding/reranker README examples) hit `FileNotFoundError` because `_to_numpy_image` treated every string as a local path. Detect URLs and fetch via requests.
- `Qwen/Qwen3-VL-Reranker-2B` ships its chat template in chat_template.jinja, not in tokenizer_config.json. Add a `_load_qwen_vl_text` helper (local-then-Hub) and fall back to it when neither processor_config.json nor the tokenizer carries a template.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
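
The URL handling from the first gap might be sketched like this. `is_url` and `open_image` are hypothetical names (the PR's own function is `_to_numpy_image`, whose body is not shown here), and the scheme check and 10-second timeout are assumptions.

```python
from io import BytesIO
from pathlib import Path
from urllib.parse import urlparse


def is_url(source):
    """True for http(s) URLs, False for anything else (e.g. local paths)."""
    return isinstance(source, str) and urlparse(source).scheme in ("http", "https")


def open_image(source, timeout=10.0):
    """Hypothetical loader: fetch URLs with requests, open other strings as paths."""
    # Imported lazily so URL detection works without Pillow or requests installed.
    from PIL import Image

    if is_url(source):
        import requests

        response = requests.get(source, timeout=timeout)
        response.raise_for_status()
        return Image.open(BytesIO(response.content)).convert("RGB")
    return Image.open(Path(source)).convert("RGB")
```

Deciding on the parsed scheme rather than a plain prefix test keeps local paths that merely contain "http" from being treated as URLs.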

Both helpers did the same local-then-Hub read; only the parsing differed. Unify as _load_qwen_vl_file that dispatches on the .json suffix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
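
A sketch of a loader that dispatches on the .json suffix, under stated assumptions: `load_file` and its `fetch_remote` callback are hypothetical stand-ins for the local-then-Hub read, and returning None when nothing is found is an assumed convention.

```python
import json
from pathlib import Path


def load_file(model_dir, filename, fetch_remote=None):
    """Hypothetical loader: read locally, else ask fetch_remote(filename) for a path.

    Returns parsed JSON for *.json files, raw text otherwise, None if not found.
    """
    path = Path(model_dir) / filename
    if not path.is_file():
        if fetch_remote is None:
            return None
        # Assumption: fetch_remote returns a local path or None (e.g. a Hub download).
        fetched = fetch_remote(filename)
        if fetched is None:
            return None
        path = Path(fetched)
    text = path.read_text(encoding="utf-8")
    return json.loads(text) if path.suffix == ".json" else text
```

With a single helper, a chat_template.jinja comes back as a string and a processor_config.json comes back as a dict, through the same lookup path.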

The folder name already scopes these, so the prefix is noise:
- _load_qwen_vl_file -> _load_file
- _qwen_vl_image_kwargs -> _image_kwargs
- _qwen_vl_video_kwargs -> _video_kwargs

Classes keep their qualified Qwen3VL* names since they're the module's public surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

On the 6-query × 6-image retrieval benchmark, the mlx-embeddings output had max|cosine diff| = 0.087 vs HF transformers reference and only 83% top-1 agreement. Three fixes close the gap to max 0.006 diff and 100% top-1/top-3 agreement:

1. Forward the embedder's MIN_PIXELS/MAX_PIXELS (4096..1,843,200) onto the inner image_processor. The Qwen3-VL preprocessor_config.json lists the full-context size bounds (16 MP), so without this override the image_processor resized to a different grid than the HF reference and the comparison ran on different visual tokens.
2. Work around mlx-vlm bug in Qwen3-VL get_input_embeddings: the upstream assigns `mx.eval(deepstack_image_embeds)` to `deepstack_visual_embeds`, but mx.eval returns None, so multi-scale deepstack features were silently dropped at every LM layer the model was supposed to inject them into. Re-run the vision tower in our Model.get_input_embeddings when we detect this.
3. Patch mlx-vlm's `_deepstack_process` on the language-model instance: upstream indexes the full concatenated visual_embeds at each batch sample's image positions, which only works for batch_size=1. Our patched version slices visual_embeds per sample using a running offset so multi-image batches work.

Once (2) is fixed upstream, (3) surfaces immediately: they're stacked bugs that cancel for single-image batches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
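
The running-offset slicing from fix 3 can be illustrated in plain Python. This is a sketch, not the patched `_deepstack_process`: `split_by_sample` is a hypothetical name, lists stand in for MLX arrays, and the per-sample token counts are assumed to be known in advance.

```python
def split_by_sample(visual_embeds, tokens_per_sample):
    """Slice concatenated per-token embeddings into one chunk per batch sample."""
    if sum(tokens_per_sample) != len(visual_embeds):
        raise ValueError("token counts do not add up to the number of embeddings")
    chunks, offset = [], 0
    for count in tokens_per_sample:
        chunks.append(visual_embeds[offset : offset + count])
        offset += count  # the running offset moves past this sample's tokens
    return chunks
```

With a single sample the one chunk equals the full input, which is consistent with the note above that the upstream indexing only works for batch_size=1.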

…ugs" This reverts commit 45a501c.

Replaces the toy "embed 4 mixed inputs and print a 4x4 similarity" snippet with a real retrieval workflow: embed an image gallery once, score multiple text queries, and rank top-K per query.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
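
The ranking step of such a workflow might look like the sketch below. It is not the README snippet: `cosine` and `rank_gallery` are hypothetical helpers, embeddings are plain lists of floats instead of model outputs, and all vectors are assumed to be non-zero.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def rank_gallery(query_embeddings, gallery_embeddings, top_k=3):
    """For each query, return the top_k (gallery_index, score) pairs, best first."""
    results = []
    for query in query_embeddings:
        scores = [(i, cosine(query, item)) for i, item in enumerate(gallery_embeddings)]
        scores.sort(key=lambda pair: pair[1], reverse=True)
        results.append(scores[:top_k])
    return results
```

Because the gallery embeddings are passed in precomputed, they only need to be produced once and can be reused for every query.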

Convert examples/qwen3_vl_retrieval.py into a notebook so the plot renders inline on GitHub (no separate PNG to keep in sync). README now links to the .ipynb.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Blaizzy and others added 8 commits April 23, 2026 22:05

qwen3_vl: collapse _load_qwen_vl_{json,text} into one helper

66d7d21


format

1bd3299

Revert "qwen3_vl: match HF reference by fixing two upstream mlx-vlm b…

15208cb


Blaizzy force-pushed the pc/qwen3-vl-torch-free-processor branch from 88af1f0 to 967fbaf Compare April 24, 2026 00:15

Blaizzy and others added 2 commits April 24, 2026 02:18

qwen3_vl: ship retrieval demo as a notebook

af0700b


format

2474cd4

penumbrazz mentioned this pull request May 2, 2026

[codex] fix Qwen3-VL embedding processor compat jundot/omlx#1039

Merged

Blaizzy wants to merge 10 commits into main from pc/qwen3-vl-torch-free-processor
