Qwen3-VL: torch-free numpy processor #61

Open
Blaizzy wants to merge 10 commits into main from pc/qwen3-vl-torch-free-processor
Blaizzy (Owner) commented Apr 23, 2026

Summary

HF's Qwen3-VL image/video processors hard-require torch/torchvision. This PR inlines a numpy + PIL port of both processors into mlx_embeddings/models/qwen3_vl/processor.py so mlx-embeddings can run real Qwen3-VL checkpoints without torch installed.

Adapted from mlx-vlm's unreleased torch-free port (commit 1bf7742 on the local dev tree; not yet in any mlx-vlm PyPI release, so we inline rather than bump the dep).

Changes

  • Adds numpy Qwen3VLImageProcessor + Qwen3VLVideoProcessor (subclasses of HF's ImageProcessingMixin / BaseVideoProcessor, duck-typed to match).
  • Adds a torch-free Qwen3VLProcessor subclass of HF ProcessorMixin that overrides check_argument_for_proper_class to bypass HF's isinstance check against transformers.utils.dummy_torchvision_objects.
  • Processor.from_pretrained now delegates to the local Qwen3VLProcessor.from_pretrained, which reads processor_config.json / preprocessor_config.json / video_preprocessor_config.json directly.
  • Drops the AutoImageProcessor.from_pretrained(use_fast=False) fallback path, the _UnsupportedVideoProcessor stub, and the object.__new__(Qwen3VLProcessor) workaround. Video inputs are now supported, not stubbed.
  • Updates test_qwen3_vl_processor_from_pretrained_uses_custom_loader to mock at the new Qwen3VLProcessor.from_pretrained boundary.

Fixes on top of the mlx-vlm source

  • Flatten list-of-list image/video batches — HF's apply_chat_template nests inputs that way and the upstream __call__ crashed on them.
  • Treat explicit None in preprocessor_config.json (min_pixels / max_pixels) the same as missing. The 2B Instruct checkpoint ships nulls alongside valid size.shortest_edge / size.longest_edge; the previous logic let the nulls clobber the valid sizes.
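The two fixes above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `flatten_image_batch` and `resolve_pixel_bounds` are hypothetical names, the precedence order (explicit `min_pixels` / `max_pixels` first, then `size.*`, then a default) is an assumption, and the numbers in the usage note are made up.

```python
def flatten_image_batch(images):
    """Flatten one level of nesting: [[a, b], [c]] -> [a, b, c].

    Hypothetical helper; assumes `images` is None, a single item,
    or a list whose elements are items or lists of items.
    """
    if images is None:
        return None
    if not isinstance(images, (list, tuple)):
        return [images]
    flat = []
    for item in images:
        if isinstance(item, (list, tuple)):
            flat.extend(item)
        else:
            flat.append(item)
    return flat


def resolve_pixel_bounds(config, default_min, default_max):
    """Pick pixel bounds, treating an explicit null like a missing key.

    Assumed precedence: min_pixels/max_pixels, then size.shortest_edge /
    size.longest_edge, then the supplied defaults.
    """
    size = config.get("size") or {}

    def first_not_none(*candidates):
        for value in candidates:
            if value is not None:
                return value
        return None

    min_pixels = first_not_none(
        config.get("min_pixels"), size.get("shortest_edge"), default_min
    )
    max_pixels = first_not_none(
        config.get("max_pixels"), size.get("longest_edge"), default_max
    )
    return min_pixels, max_pixels
```

With a config such as `{"min_pixels": None, "max_pixels": None, "size": {"shortest_edge": 1000, "longest_edge": 2000}}`, the nulls are skipped and the `size` values are returned instead of being overwritten.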

Test plan

  • pytest mlx_embeddings/tests/test_models.py — 16/16 pass
  • End-to-end embedding from mlx-community/Qwen3-VL-2B-Instruct-4bit: model.embed({text, image=PIL}) returns (1, 2048) bf16
  • End-to-end reranker with image: model.rerank({query, documents=[{text, image}, ...]}) returns sensible scores
  • 'torch' in sys.modules stays False across load + embed + rerank (venv also has no torch installed, so this is a hard guarantee)
  • Qwen3VLImageProcessor and Qwen3VLVideoProcessor are confirmed to be the local numpy classes (not transformers' or mlx-vlm's)
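The `sys.modules` check in the test plan could be scripted along these lines. This is a sketch only: `imported_during` is a hypothetical helper, and the load/embed/rerank calls are represented by an arbitrary callable rather than the real mlx-embeddings API.

```python
import sys


def imported_during(module_name, action):
    """Run action() and report whether it caused module_name to be imported.

    Returns False if the module was already loaded before the call, since
    in that case the action cannot be blamed for the import.
    """
    already_loaded = module_name in sys.modules
    action()
    return (module_name in sys.modules) and not already_loaded
```

For example, `imported_during("torch", run_pipeline)` should return `False` for this PR, where `run_pipeline` is a placeholder for a function that loads the model, embeds, and reranks.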

🤖 Generated with Claude Code

Blaizzy and others added 8 commits April 23, 2026 22:05
HF's Qwen3-VL image/video processors hard-require torch/torchvision.
Inline the numpy port adapted from mlx-vlm (commit 1bf7742, unreleased)
so mlx-embeddings can run without torch installed — including real
checkpoints like mlx-community/Qwen3-VL-2B-Instruct-4bit.

Drops the AutoImageProcessor.from_pretrained(use_fast=False) path, the
_UnsupportedVideoProcessor stub, and the object.__new__(Qwen3VLProcessor)
trick. Processor.from_pretrained now delegates to the local torch-free
Qwen3VLProcessor.from_pretrained, which reads processor_config.json /
preprocessor_config.json / video_preprocessor_config.json directly and
builds numpy Qwen3VLImageProcessor / Qwen3VLVideoProcessor.

Small fixes on top of the mlx-vlm source:
- Flatten list-of-list image/video batches (HF's apply_chat_template
  nests them that way).
- Treat explicit None in preprocessor_config.json (min_pixels/max_pixels)
  the same as missing — the 2B Instruct checkpoint ships nulls alongside
  valid size.shortest_edge/longest_edge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

README examples surfaced two gaps in the port:

- Image inputs passed as https:// URLs (the embedding/reranker README
  examples) hit `FileNotFoundError` because `_to_numpy_image` treated
  every string as a local path. Detect URLs and fetch via requests.

- `Qwen/Qwen3-VL-Reranker-2B` ships its chat template in
  chat_template.jinja, not in tokenizer_config.json. Add a
  `_load_qwen_vl_text` helper (local-then-Hub) and fall back to it when
  neither processor_config.json nor the tokenizer carries a template.
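
A rough sketch of the URL-vs-path distinction from the first item. The names `is_remote_url` and `read_image_bytes` are hypothetical, and this version swaps in the standard library's `urllib.request` where the commit uses requests, purely to keep the example dependency-free.

```python
from urllib.parse import urlparse
from urllib.request import urlopen


def is_remote_url(source):
    """True only for http(s) strings with a host; local paths return False."""
    if not isinstance(source, str):
        return False
    parsed = urlparse(source)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def read_image_bytes(source, timeout=10.0):
    """Fetch raw bytes from a URL, or read them from a local file path."""
    if is_remote_url(source):
        with urlopen(source, timeout=timeout) as response:
            return response.read()
    with open(source, "rb") as handle:
        return handle.read()
```

The returned bytes can then be handed to PIL (for example via `io.BytesIO`) before conversion to a numpy array.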

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Both helpers did the same local-then-Hub read; only the parsing differed.
Unify as _load_qwen_vl_file that dispatches on the .json suffix.
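
A minimal sketch of what such a unified loader might look like. `load_file` and its `fetch_remote` parameter are hypothetical: `fetch_remote` stands in for whatever Hub download call the real helper uses and is expected to return a local file path.

```python
import json
from pathlib import Path


def load_file(model_path, filename, fetch_remote=None):
    """Read `filename` from a local directory, else via `fetch_remote`.

    Files ending in .json are parsed; anything else is returned as text.
    Returns None when the file is found neither locally nor remotely.
    """
    local = Path(model_path) / filename
    if local.is_file():
        path = local
    elif fetch_remote is not None:
        fetched = fetch_remote(model_path, filename)
        if fetched is None:
            return None
        path = Path(fetched)
    else:
        return None

    text = path.read_text(encoding="utf-8")
    # Dispatch on the requested filename's suffix, not on the resolved path.
    if Path(filename).suffix == ".json":
        return json.loads(text)
    return text
```

The same call then serves both `preprocessor_config.json` (parsed into a dict) and `chat_template.jinja` (returned as a string).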

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The folder name already scopes these, so the prefix is noise:
- _load_qwen_vl_file  -> _load_file
- _qwen_vl_image_kwargs -> _image_kwargs
- _qwen_vl_video_kwargs -> _video_kwargs

Classes keep their qualified Qwen3VL* names since they're the module's
public surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

On the 6-query × 6-image retrieval benchmark, the mlx-embeddings output
had max|cosine diff| = 0.087 vs HF transformers reference and only 83%
top-1 agreement. Three fixes close the gap to max 0.006 diff and 100%
top-1/top-3 agreement:
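
For reference, the two numbers quoted above could be computed from a pair of query-by-image similarity matrices roughly like this. It is a sketch under assumptions: `compare_scores` is a hypothetical name, and it assumes the diff is taken elementwise over the full cosine-similarity matrix, which the benchmark script may or may not do.

```python
import numpy as np


def compare_scores(ours, reference):
    """Compare two (num_queries, num_images) similarity matrices.

    Returns (max absolute elementwise difference, fraction of queries
    whose top-ranked image is the same in both matrices).
    """
    ours = np.asarray(ours, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    max_abs_diff = float(np.max(np.abs(ours - reference)))
    same_top1 = ours.argmax(axis=1) == reference.argmax(axis=1)
    return max_abs_diff, float(np.mean(same_top1))
```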

1. Forward the embedder's MIN_PIXELS/MAX_PIXELS (4096..1,843,200) onto
   the inner image_processor. The Qwen3-VL preprocessor_config.json
   lists the full-context size bounds (16 MP), so without this override
   the image_processor resized to a different grid than the HF reference
   and the comparison ran on different visual tokens.

2. Work around an mlx-vlm bug in Qwen3-VL get_input_embeddings: upstream
   assigns the result of `mx.eval(deepstack_image_embeds)` to
   `deepstack_visual_embeds`, but mx.eval returns None, so the multi-scale
   deepstack features were silently dropped at every LM layer the
   model was supposed to inject them into. Re-run the vision tower in
   our Model.get_input_embeddings when we detect this.

3. Patch mlx-vlm's `_deepstack_process` on the language-model instance:
   upstream indexes the full concatenated visual_embeds at each batch
   sample's image positions, which only works for batch_size=1. Our
   patched version slices visual_embeds per sample using a running
   offset so multi-image batches work.
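
To illustrate why the pixel bounds in (1) matter, here is a sketch of a factor-aligned, pixel-bounded resize of the kind Qwen-VL style processors use. The exact rule in this port may differ: `bounded_resize` is a hypothetical name, `factor=32` is an assumption, and 16,777,216 is used only as a stand-in for the "16 MP" bound.

```python
import math


def bounded_resize(height, width, factor, min_pixels, max_pixels):
    """Return (h, w) rounded to multiples of `factor`, with the total
    pixel count scaled into [min_pixels, max_pixels] while roughly
    preserving the aspect ratio."""
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        # Too large: shrink both sides by the same factor, rounding down.
        scale = math.sqrt((height * width) / max_pixels)
        h_bar = max(factor, math.floor(height / scale / factor) * factor)
        w_bar = max(factor, math.floor(width / scale / factor) * factor)
    elif h_bar * w_bar < min_pixels:
        # Too small: grow both sides by the same factor, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * scale / factor) * factor
        w_bar = math.ceil(width * scale / factor) * factor
    return h_bar, w_bar


# Same 1080x1920 input, two different upper bounds.
tight = bounded_resize(1080, 1920, 32, 4096, 1_843_200)
loose = bounded_resize(1080, 1920, 32, 4096, 16_777_216)
```

Under these assumptions the tighter bound yields a smaller grid than the looser one, so the two pipelines would encode different numbers of visual tokens for the same image.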

Once (2) is fixed upstream, (3) surfaces immediately — they're stacked
bugs that cancel for single-image batches.
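
The running-offset slicing described in (3) can be sketched in numpy as follows. This is not the mlx-vlm or mlx-embeddings code: `add_visual_per_sample` is a hypothetical name, and it assumes the deepstack features are added to the hidden states at the visual-token positions, with all samples' features concatenated along the first axis in batch order.

```python
import numpy as np


def add_visual_per_sample(hidden, visual_mask, visual_embeds):
    """Add concatenated visual features to the right rows of each sample.

    hidden:        (batch, seq_len, dim) hidden states
    visual_mask:   (batch, seq_len) bool, True at visual-token positions
    visual_embeds: (total_visual_tokens, dim), all samples concatenated
    """
    out = hidden.copy()
    offset = 0
    for b in range(hidden.shape[0]):
        positions = np.flatnonzero(visual_mask[b])
        count = positions.size
        # Take only this sample's slice instead of the full concatenation.
        out[b, positions] += visual_embeds[offset:offset + count]
        offset += count
    return out
```

With a single sample the slice spans the whole of `visual_embeds`, which is why indexing the full array only breaks once the batch contains more than one sample.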

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replaces the toy "embed 4 mixed inputs and print a 4x4 similarity"
snippet with a real retrieval workflow: embed an image gallery once,
score multiple text queries, and rank top-K per query.
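
The shape of that workflow, independent of the model, might look like the sketch below. `rank_top_k` is a hypothetical helper that operates on embeddings already computed elsewhere; it is not the README example itself.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products become cosines."""
    x = np.asarray(x, dtype=np.float64)
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def rank_top_k(query_embeds, gallery_embeds, k):
    """Return (indices, scores) of the k best gallery items per query.

    The gallery embeddings are computed once and reused for every query.
    """
    scores = l2_normalize(query_embeds) @ l2_normalize(gallery_embeds).T
    order = np.argsort(-scores, axis=1)[:, :k]
    return order, np.take_along_axis(scores, order, axis=1)
```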

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Blaizzy force-pushed the pc/qwen3-vl-torch-free-processor branch from 88af1f0 to 967fbaf on April 24, 2026 00:15
Blaizzy and others added 2 commits April 24, 2026 02:18
Convert examples/qwen3_vl_retrieval.py into a notebook so the plot
renders inline on GitHub (no separate PNG to keep in sync). README
now links to the .ipynb.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>