Add Qwen3-VL multimodal support by 86MaxCao · Pull Request #132 · GeeeekExplorer/nano-vllm

86MaxCao · 2025-11-11T21:22:18Z

Summary

add the Qwen3-VL multimodal model and loader entry so nano-vllm can run vision-language workloads
extend engine components (placeholder expansion, vision-cache slicing, KV guard) to mirror vLLM’s multimodal behavior
provide bench_multimodal.py and example_multimodal.py for benchmarking and quick testing
document how to download Qwen3-VL-2B-Instruct and where to find the multimodal example
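
To make "placeholder expansion" concrete, here is a minimal sketch of the idea, not the code from this PR. It assumes the Qwen-VL-style convention where each image contributes `t * h * w / merge_size**2` tokens for a patch grid `(t, h, w)`. The function name, the `image_token_id` value, and the default `merge_size=2` are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def expand_image_placeholders(
    token_ids: Sequence[int],
    grids: Sequence[Tuple[int, int, int]],
    image_token_id: int,
    merge_size: int = 2,  # assumed spatial merge factor, not read from the PR
) -> List[int]:
    """Replace each single image placeholder with one token per merged patch."""
    expanded: List[int] = []
    grid_iter = iter(grids)
    for tok in token_ids:
        if tok != image_token_id:
            expanded.append(tok)
            continue
        try:
            t, h, w = next(grid_iter)
        except StopIteration:
            raise ValueError("more image placeholders than image grids") from None
        # One output token per merge_size x merge_size group of patches.
        num_vision_tokens = (t * h * w) // (merge_size ** 2)
        expanded.extend([image_token_id] * num_vision_tokens)
    if next(grid_iter, None) is not None:
        raise ValueError("more image grids than image placeholders")
    return expanded


# Hypothetical IDs: 99 stands in for the image placeholder token.
prompt = expand_image_placeholders([1, 99, 2], [(1, 4, 4)], image_token_id=99)
print(prompt)
```

After expansion the prompt length matches the number of vision embeddings, which is what lets the engine slice cached vision features by position and check KV-cache capacity against the true sequence length.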

Benchmark

GPU: NVIDIA H20 (96GB)
Command: CUDA_VISIBLE_DEVICES=0 python3 bench_multimodal.py --model ~/huggingface/Qwen3-VL-2B-Instruct
Result: 10 requests · 2958 prompt tokens · 2629 generated tokens · 12.49 s latency · 210.55 tok/s throughput
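
As a quick sanity check on the figures above, the reported throughput can be recomputed from the token counts and latency. This is only arithmetic on the numbers in the result line. That the benchmark divides generated tokens (not prompt plus generated) by latency is an inference from the numbers, not something the PR states.

```python
prompt_tokens = 2958
generated_tokens = 2629
latency_s = 12.49  # rounded to two decimals in the report

# Throughput counting only generated tokens.
decode_throughput = generated_tokens / latency_s
# Throughput counting prompt and generated tokens together, for comparison.
total_throughput = (prompt_tokens + generated_tokens) / latency_s

print(f"{decode_throughput:.2f} generated tok/s")
print(f"{total_throughput:.2f} total tok/s")
```

The generated-only figure comes out at roughly 210.5 tok/s, which agrees with the reported 210.55 tok/s once the rounding of the latency is taken into account.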

Testing

python3 example_multimodal.py

Notes

large diff because the feature touches model loading, scheduler, and caching; happy to walk through the details if needed
if maintainers feel multimodal support shouldn’t land in core yet, I’m open to discussing an extension repo instead

tuanhe · 2025-12-02T07:21:50Z

Just tried this commit on an RTX 4090 and an L40; it fails with the following error:
File "/work/example/example_multimodal.py", line 67, in main
[rank0]:     outputs = llm.generate_multimodal(
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/work/liveformer/engine/engine.py", line 421, in generate_multimodal
[rank0]:     output, num_tokens = self.step()
[rank0]:                          ^^^^^^^^^^^
[rank0]:   File "/work/liveformer/engine/engine.py", line 227, in step
[rank0]:     token_ids = self.model_runner.call("run", seqs, is_prefill)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/work/liveformer/core/model_runner.py", line 262, in call
[rank0]:     return method(*args)
[rank0]:            ^^^^^^^^^^^^^
[rank0]:   File "/work/liveformer/core/model_runner.py", line 658, in run
[rank0]:     self._ensure_vision_cache(seq)
[rank0]:   File "/work/liveformer/core/model_runner.py", line 887, in _ensure_vision_cache
[rank0]:     image_embeds, deepstack_features = self.model.visual(pixel, grid)
[rank0]:                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/work/liveformer/models/qwen3vl.py", line 746, in forward
[rank0]:     return self._run_vision_from_tokens(token_list, grids)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/work/liveformer/models/qwen3vl.py", line 664, in _run_vision_from_tokens
[rank0]:     hidden_states = block(hidden_states, seq_lengths, position_embeddings)
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/work/liveformer/models/qwen3vl.py", line 462, in forward
[rank0]:     hidden_states = residual + self.attn(hidden_states, seq_lengths, position_embeddings)
[rank0]:                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/work/liveformer/models/qwen3vl.py", line 411, in forward
[rank0]:     attn_scores = torch.matmul(q, k.transpose(-1, -2)) * self.scale
[rank0]:                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_device.py", line 103, in __torch_function__
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 0 has a total capacity of 23.63 GiB of which 16.69 MiB is free. Process 112388 has 23.54 GiB memory in use. Of the allocated memory 22.84 GiB is allocated by PyTorch, and 252.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

linzm1007 · 2026-01-11T08:50:16Z

Please don't reformat other people's original code.

86MaxCao · 2026-05-13T14:12:29Z

Hi @tuanhe
Sorry for the very late reply; I haven’t been checking this inbox / GitHub notifications much over the past ~6 months.

Thanks for the detailed report and stack trace. I don’t have an RTX 4090 or L40 on hand, so I can’t reproduce the OOM on the same hardware.

If you’re mainly constrained by VRAM, you could try a smaller vision–language checkpoint (e.g. Qwen3.5-0.8B), which tends to use noticeably less memory than larger VL models. I’ve also opened a newer PR that focuses on Qwen3.5 support: #232 — feedback there is welcome too.
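
For context on where the memory goes: the traceback earlier in the thread ends at an explicit `torch.matmul(q, k.transpose(-1, -2))` in the vision attention, which materializes a full score matrix per head. The sketch below estimates the size of that one tensor. The head count of 16 and the 2-byte element size are illustrative assumptions, not values read from the checkpoint, and the helper name is hypothetical.

```python
def attention_scores_bytes(seq_len: int, num_heads: int, bytes_per_element: int = 2) -> int:
    """Size of a dense (num_heads, seq_len, seq_len) attention score tensor."""
    return num_heads * seq_len * seq_len * bytes_per_element


# Assumed values for illustration only: 16 heads, 2 bytes per element (fp16/bf16).
for seq_len in (1024, 4096, 16384):
    mib = attention_scores_bytes(seq_len, num_heads=16) / 2**20
    print(f"{seq_len:>6} vision tokens -> {mib:,.0f} MiB of attention scores")
```

Because the cost grows with the square of the number of vision tokens, lowering the input image resolution may help as much as switching checkpoints. Fused kernels such as `torch.nn.functional.scaled_dot_product_attention` can avoid materializing this matrix when a memory-efficient backend is selected, though whether that fits this code path is untested here.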

86MaxCao · 2026-05-13T14:14:15Z

@linzm1007
Thanks for pointing that out — you’re right. I’ll avoid reformatting unrelated code in future PRs.
Appreciate the review.

Add Qwen3-VL multimodal support

2f352ae

86MaxCao wants to merge 1 commit into GeeeekExplorer:main from 86MaxCao:feature/qwen3-vl-support
