Add Qwen3-VL multimodal support#132

Open
86MaxCao wants to merge 1 commit into
GeeeekExplorer:mainfrom
86MaxCao:feature/qwen3-vl-support

@86MaxCao

Summary

  • add the Qwen3-VL multimodal model and loader entry so nano-vllm can run vision-language workloads
  • extend engine components (placeholder expansion, vision-cache slicing, KV guard) to mirror vLLM’s multimodal behavior
  • provide bench_multimodal.py and example_multimodal.py for benchmarking and quick testing
  • document how to download Qwen3-VL-2B-Instruct and where to find the multimodal example
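As an illustration of the placeholder-expansion step mentioned above, here is a minimal sketch, not the code from this PR. It assumes the Qwen-VL convention that an image with grid `(t, h, w)` contributes `t * h * w // merge_size**2` tokens to the prompt; `IMAGE_PAD_ID`, `MERGE_SIZE`, and `expand_placeholders` are hypothetical names chosen for the example.

```python
IMAGE_PAD_ID = 9999  # hypothetical placeholder token ID, not taken from the PR
MERGE_SIZE = 2       # assumed spatial merge factor of the vision encoder


def expand_placeholders(token_ids, grids, pad_id=IMAGE_PAD_ID, merge_size=MERGE_SIZE):
    """Replace each single image placeholder with one token per merged patch."""
    grid_iter = iter(grids)
    expanded = []
    for tok in token_ids:
        if tok == pad_id:
            # Each placeholder consumes the next (t, h, w) patch grid.
            t, h, w = next(grid_iter)
            expanded.extend([pad_id] * (t * h * w // merge_size**2))
        else:
            expanded.append(tok)
    return expanded


prompt = [1, 2, IMAGE_PAD_ID, 3]
expanded = expand_placeholders(prompt, [(1, 4, 6)])
print(len(expanded))  # 3 text tokens plus 1*4*6 // 4 = 6 image tokens
```

The expanded sequence is what the scheduler and KV cache have to budget for, which is why the expansion needs to happen before block allocation.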

Benchmark

  • GPU: NVIDIA H20 (96GB)
  • Command: CUDA_VISIBLE_DEVICES=0 python3 bench_multimodal.py --model ~/huggingface/Qwen3-VL-2B-Instruct
  • Result: 10 requests · 2958 prompt tokens · 2629 generated tokens · 12.49 s latency · 210.55 tok/s throughput
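For reference, the throughput figure can be reproduced approximately from the other reported numbers. This sketch assumes that throughput counts generated tokens only and that latency is the wall-clock time for the whole batch; neither is stated explicitly above.

```python
# Figures copied from the benchmark result above.
prompt_tokens = 2958
generated_tokens = 2629
latency_s = 12.49

# Assumption: reported throughput = generated tokens / total latency.
decode_tps = generated_tokens / latency_s
# For comparison: throughput if prompt tokens were counted as well.
total_tps = (prompt_tokens + generated_tokens) / latency_s

print(f"generated tok/s: {decode_tps:.2f}")
print(f"prompt + generated tok/s: {total_tps:.2f}")
```

The first value lands within about 0.1 tok/s of the reported 210.55; the small gap is consistent with the latency having been rounded to two decimals.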

Testing

  • python3 example_multimodal.py

Notes

  • large diff because the feature touches model loading, scheduler, and caching; happy to walk through the details if needed
  • if maintainers feel multimodal support shouldn’t land in core yet, I’m open to discussing an extension repo instead

@tuanhe

tuanhe commented Dec 2, 2025

I just tried this commit on an RTX 4090 and an L40, and it fails with the following error:

  File "/work/example/example_multimodal.py", line 67, in main
[rank0]:     outputs = llm.generate_multimodal(
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/work/liveformer/engine/engine.py", line 421, in generate_multimodal
[rank0]:     output, num_tokens = self.step()
[rank0]:                          ^^^^^^^^^^^
[rank0]:   File "/work/liveformer/engine/engine.py", line 227, in step
[rank0]:     token_ids = self.model_runner.call("run", seqs, is_prefill)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/work/liveformer/core/model_runner.py", line 262, in call
[rank0]:     return method(*args)
[rank0]:            ^^^^^^^^^^^^^
[rank0]:   File "/work/liveformer/core/model_runner.py", line 658, in run
[rank0]:     self._ensure_vision_cache(seq)
[rank0]:   File "/work/liveformer/core/model_runner.py", line 887, in _ensure_vision_cache
[rank0]:     image_embeds, deepstack_features = self.model.visual(pixel, grid)
[rank0]:                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/work/liveformer/models/qwen3vl.py", line 746, in forward
[rank0]:     return self._run_vision_from_tokens(token_list, grids)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/work/liveformer/models/qwen3vl.py", line 664, in _run_vision_from_tokens
[rank0]:     hidden_states = block(hidden_states, seq_lengths, position_embeddings)
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/work/liveformer/models/qwen3vl.py", line 462, in forward
[rank0]:     hidden_states = residual + self.attn(hidden_states, seq_lengths, position_embeddings)
[rank0]:                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/work/liveformer/models/qwen3vl.py", line 411, in forward
[rank0]:     attn_scores = torch.matmul(q, k.transpose(-1, -2)) * self.scale
[rank0]:                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_device.py", line 103, in __torch_function__
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 0 has a total capacity of 23.63 GiB of which 16.69 MiB is free. Process 112388 has 23.54 GiB memory in use. Of the allocated memory 22.84 GiB is allocated by PyTorch, and 252.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

@linzm1007

Please don't reformat other people's existing code.

@86MaxCao
Author

Hi @tuanhe
Sorry for the very late reply; I haven’t been checking this inbox / GitHub notifications much over the past ~6 months.

Thanks for the detailed report and stack trace. I don’t have an RTX 4090 or L40 on hand, so I can’t reproduce the OOM locally on the same hardware.
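Even without the hardware, the figures in the reported error message can be read off directly. The sketch below parses the message text quoted in the report and compares the allocated and reserved-but-unallocated amounts with the total capacity; the helper `mib` and the regular expressions are written for this example only, and the interpretation is a rough reading rather than a diagnosis.

```python
import re

# Message text copied from the OOM report in this thread.
msg = (
    "GPU 0 has a total capacity of 23.63 GiB of which 16.69 MiB is free. "
    "Of the allocated memory 22.84 GiB is allocated by PyTorch, and 252.35 MiB "
    "is reserved by PyTorch but unallocated."
)

MIB_PER_UNIT = {"MiB": 1, "GiB": 1024}
NUM = r"([\d.]+) (MiB|GiB)"


def mib(pattern):
    """Return the first quantity matching `pattern` in the message, in MiB."""
    value, unit = re.search(pattern, msg).groups()
    return float(value) * MIB_PER_UNIT[unit]


total = mib(rf"total capacity of {NUM}")
free = mib(rf"of which {NUM} is free")
allocated = mib(rf"{NUM} is allocated by PyTorch")
reserved_unused = mib(rf"{NUM} is reserved by PyTorch")

print(f"allocated by PyTorch: {allocated / total:.1%} of capacity")
print(f"reserved but unallocated: {reserved_unused / total:.1%} of capacity")
print(f"free: {free:.2f} MiB")
```

Live tensors account for nearly all of the device and the reserved-but-unallocated share is small, which suggests the card was simply full when the vision attention ran. Under that reading, `expandable_segments:True` may help only marginally.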

If you’re mainly constrained by VRAM, you could try a smaller vision–language checkpoint (e.g. Qwen3.5-0.8B), which tends to use noticeably less memory than larger VL models. I’ve also opened a newer PR that focuses on Qwen3.5 support: #232 — feedback there is welcome too.

@86MaxCao
Author

@linzm1007
Thanks for pointing that out — you’re right. I’ll avoid reformatting unrelated code in future PRs.
Appreciate the review.

3 participants