feat: add Intel XPU transformers support #4801

Open
zy6p wants to merge 1 commit into opendatalab:master from zy6p:feat/intel-xpu

@zy6p zy6p commented Apr 16, 2026

Summary

This PR adds a minimal Intel XPU path for the transformers backend on Linux.

Changes:

  • detect xpu in get_device() when torch.xpu.is_available()
  • prefer transformers over vllm on Linux when the selected device is xpu
  • load Qwen2VLForConditionalGeneration on CPU first and then move it to xpu
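
The first two changes can be sketched roughly as follows. This is an illustration only, not the diff in this PR: `detect_device` and `pick_engine` are hypothetical names, checking CUDA before XPU is an assumption, and the torch module is passed in as an argument so the logic can be exercised on a machine without Intel hardware.

```python
from types import SimpleNamespace


def detect_device(torch_module) -> str:
    """Return "cuda", "xpu", or "cpu" based on what the torch module reports."""
    cuda = getattr(torch_module, "cuda", None)
    if cuda is not None and cuda.is_available():
        return "cuda"
    # Older torch builds have no torch.xpu attribute, so look it up defensively.
    xpu = getattr(torch_module, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"
    return "cpu"


def pick_engine(device: str, platform: str, default_engine: str = "vllm") -> str:
    """Hypothetical auto-engine rule: on Linux, xpu falls back to transformers."""
    if platform.startswith("linux") and device == "xpu":
        return "transformers"
    return default_engine


# Stand-in for a torch build where only the XPU backend is available.
fake_torch = SimpleNamespace(
    cuda=SimpleNamespace(is_available=lambda: False),
    xpu=SimpleNamespace(is_available=lambda: True),
)
device = detect_device(fake_torch)
print(device, pick_engine(device, "linux"))  # prints: xpu transformers
```

With a real installation, `detect_device(torch)` and `pick_engine(device, sys.platform)` would be called instead of using the stand-in.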

Why

On Intel Arc A750 with the current oneAPI / torch xpu stack, the default Linux auto-engine path selects vllm, but the end-to-end MinerU VLM service is not stable through that route. The transformers backend can work, but it needs two adjustments:

  1. MinerU must recognize xpu as a device type.
  2. Qwen2VL should avoid the standard device_map={"": device} loading path on XPU and instead load on CPU first, then call .to("xpu").
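
A minimal sketch of the second adjustment, assuming a Hugging Face style `from_pretrained` / `.to()` interface. `load_model` and `_StubModel` are hypothetical names introduced here for illustration; the stub stands in for `Qwen2VLForConditionalGeneration` so the example runs without downloading weights.

```python
def load_model(model_cls, model_path: str, device: str, **kwargs):
    """Load a model, using CPU-first placement when the target device is xpu."""
    if device == "xpu":
        # No device_map is passed, so the weights are materialized on CPU first.
        model = model_cls.from_pretrained(model_path, **kwargs)
        return model.to("xpu")
    # Other devices keep the standard single-device device_map path.
    return model_cls.from_pretrained(model_path, device_map={"": device}, **kwargs)


class _StubModel:
    """Stand-in model class that only records where it was placed."""

    def __init__(self, device: str, used_device_map: bool):
        self.device = device
        self.used_device_map = used_device_map

    @classmethod
    def from_pretrained(cls, model_path, device_map=None, **kwargs):
        device = device_map[""] if device_map else "cpu"
        return cls(device, device_map is not None)

    def to(self, device: str):
        self.device = device
        return self


xpu_model = load_model(_StubModel, "/path/to/model", "xpu")
cuda_model = load_model(_StubModel, "/path/to/model", "cuda")
print(xpu_model.device, xpu_model.used_device_map)    # prints: xpu False
print(cuda_model.device, cuda_model.used_device_map)  # prints: cuda True
```

The design choice is to branch only on the device string, so the non-XPU loading path stays unchanged.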

Validation

Validated on a private Intel Arc A750 deployment:

  • torch.xpu.is_available() == True
  • MinerU API service can parse PDF -> Markdown through the transformers backend on XPU
  • basic syntax check: python3 -m py_compile on the touched files

Scope

This PR intentionally stays small and does not add new deployment docs or packaging changes.

@dosubot dosubot Bot added the size:M label (This PR changes 30-99 lines, ignoring generated files.) Apr 16, 2026
@github-actions
Contributor

github-actions Bot commented Apr 16, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@dosubot dosubot Bot added the enhancement label (New feature or request) Apr 16, 2026
@zy6p
Author

zy6p commented Apr 16, 2026

I have read the CLA Document and I hereby sign the CLA

@zy6p zy6p mentioned this pull request Apr 16, 2026
@myhloli
Collaborator

myhloli commented Apr 16, 2026

vLLM provides Docker images that support XPU. Why is it still necessary to explicitly select the transformers backend?

@zy6p
Author

zy6p commented Apr 17, 2026

Thanks. Here is the concrete reason I switched to the conservative settings, with the actual logs inline.

The short version is:

  • I did not switch away from vLLM on Intel XPU based on preference.
  • I switched because the actual MinerU + Qwen2VL multimodal startup path on Intel Arc A750 was not stable in my testing, even after isolating the failure surface with more conservative settings.
  • transformers + xpu did work on the same machine for the same MinerU model path.

Why I changed those parameters

I changed them as a failure-isolation sequence, not as random tuning:

  • mm_encoder_attn_backend=TORCH_SDPA
    • to bypass the default XPU FLASH_ATTN path for the visual encoder
  • mm_encoder_attn_backend=TRITON_ATTN
    • to test another supported non-default visual-attention backend
  • enforce_eager=True
    • to remove compile / graph-capture variables
  • skip_mm_profiling=True
    • to remove multimodal profiling variables

This was based on the XPU ViT attention backend logic in the vLLM image, which supports:

@classmethod
def get_supported_vit_attn_backends(cls):
    return [
        AttentionBackendEnum.FLASH_ATTN,
        AttentionBackendEnum.TRITON_ATTN,
        AttentionBackendEnum.TORCH_SDPA,
    ]

and defaults to FLASH_ATTN if nothing is specified.

What I observed on the real machine

  1. The Intel GPU was actually visible and usable from the container. This was not a “device not passed through” problem:
[mineru-vllm] visible SYCL devices:
INFO: Output filtered by ONEAPI_DEVICE_SELECTOR environment variable, which is set to level_zero:gpu.

[level_zero:gpu] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Arc(TM) A750 Graphics 12.55.8 [1.14.36300+8]
[mineru-vllm] torch.xpu.is_available = True
[mineru-vllm] torch.xpu.device_count = 1
[mineru-vllm] device 0: name='Intel(R) Arc(TM) A750 Graphics' total_memory=8096681984
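
A visibility check along these lines can be sketched with the `torch.xpu` calls shown below. This is not the script that produced the log above: `describe_xpu` is a hypothetical helper, the `[mineru-vllm]` prefix is omitted, and the stand-in module only mirrors the values reported in that log so the example runs without an Intel GPU.

```python
from types import SimpleNamespace


def describe_xpu(torch_module) -> list[str]:
    """Return human-readable lines describing the XPU devices torch can see."""
    xpu = getattr(torch_module, "xpu", None)
    available = bool(xpu is not None and xpu.is_available())
    lines = [f"torch.xpu.is_available = {available}"]
    if not available:
        return lines
    count = xpu.device_count()
    lines.append(f"torch.xpu.device_count = {count}")
    for index in range(count):
        props = xpu.get_device_properties(index)
        lines.append(
            f"device {index}: name={props.name!r} total_memory={props.total_memory}"
        )
    return lines


# Stand-in module mirroring the values from the log above.
fake_props = SimpleNamespace(
    name="Intel(R) Arc(TM) A750 Graphics", total_memory=8096681984
)
fake_torch = SimpleNamespace(
    xpu=SimpleNamespace(
        is_available=lambda: True,
        device_count=lambda: 1,
        get_device_properties=lambda index: fake_props,
    )
)
report = describe_xpu(fake_torch)
print("\n".join(report))
```

On a real machine, `describe_xpu(torch)` after `import torch` would be used in place of the stand-in.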
  2. On the default run, vLLM selected FLASH_ATTN for the multimodal visual encoder path, loaded the model, and then failed during MM encoder startup:
(APIServer pid=1) INFO 04-15 15:54:13 [__init__.py:254] Automatically detected platform xpu.
(APIServer pid=1) INFO 04-15 15:54:14 [api_server.py:962] vLLM API server version 0.1.dev14456+gde3f7fe65
...
(EngineCore_DP0 pid=121) INFO 04-15 15:54:20 [loader.py:489] Starting to load model /data/models/mineru-vl...
(EngineCore_DP0 pid=121) INFO 04-15 15:54:20 [xpu.py:114] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore_DP0 pid=121) INFO 04-15 15:54:20 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
...
(EngineCore_DP0 pid=121) INFO 04-15 15:54:24 [loader.py:542] Model loading took 2.16 GiB memory and 3.400012 seconds
(EngineCore_DP0 pid=121) INFO 04-15 15:54:24 [cache_utils.py:513] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 1 video items of the maximum feature size.
(EngineCore_DP0 pid=121) ERROR 04-15 15:54:25 [core.py:494] EngineCore failed to start.
...
(EngineCore_DP0 pid=121) ERROR 04-15 15:54:25 [core.py:494]   File "/opt/venv/lib/python3.12/site-packages/vllm/model_executor/layers/attention/mm_encoder_attention.py", line 423, in forward_xpu
(EngineCore_DP0 pid=121) ERROR 04-15 15:54:25 [core.py:494]     return self._forward_fa(query, key, value, cu_seqlens, max_seqlen)

In the same run, my diagnosis note recorded that this corresponded to the XPU FlashAttention kernel failure:

Only XE2 cutlass kernel is supported currently.

To be explicit: this line came from the same failing run, but the stderr capture in the session log was truncated, so I no longer have the complete raw block around it.

  3. I then switched to TORCH_SDPA specifically to get off the default flash path. That did change the selected backend, but the engine still failed:
(APIServer pid=1) INFO 04-15 16:02:27 [api_server.py:969] non-default args: {'host': '0.0.0.0', 'port': 30000, 'model': '/data/models/mineru-vl', 'served_model_name': ['OpenDataLab/MinerU2.5-Pro-2604-1.2B'], 'logits_processors': ['mineru_vl_utils:MinerULogitsProcessor'], 'gpu_memory_utilization': 0.7, 'mm_encoder_attn_backend': 'TORCH_SDPA'}
...
(EngineCore_DP0 pid=121) INFO 04-15 16:02:34 [xpu.py:111] Using backend AttentionBackendEnum.TORCH_SDPA for vit attention
(EngineCore_DP0 pid=121) INFO 04-15 16:02:34 [mm_encoder_attention.py:215] Using AttentionBackendEnum.TORCH_SDPA for MMEncoderAttention.
...
terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  No device of requested type available. Please check https://software.intel.com/content/www/us/en/develop/articles/intel-oneapi-dpcpp-system-requirements.html
...
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

So this was already evidence that the issue was not just “default FLASH_ATTN is too aggressive”.

  4. I then went even more conservative:
  • mm_encoder_attn_backend=TRITON_ATTN
  • enforce_eager=True
  • skip_mm_profiling=True

That run still failed:

(APIServer pid=1) INFO ... non-default args: {'model_tag': '/data/models/mineru-vl', 'host': '0.0.0.0', 'port': 30000, 'model': '/data/models/mineru-vl', 'enforce_eager': True, 'served_model_name': ['OpenDataLab/MinerU2.5-Pro-2604-1.2B'], 'logits_processors': ['mineru_vl_utils:MinerULogitsProcessor'], 'gpu_memory_utilization': 0.7, 'mm_encoder_attn_backend': 'TRITON_ATTN', 'skip_mm_profiling': True}
...
(EngineCore_DP0 pid=121) INFO ... [xpu.py:111] Using backend AttentionBackendEnum.TRITON_ATTN for vit attention
(EngineCore_DP0 pid=121) INFO ... [mm_encoder_attention.py:215] Using AttentionBackendEnum.TRITON_ATTN for MMEncoderAttention.
(EngineCore_DP0 pid=121) WARNING ... Enforce eager set, disabling torch.compile and CUDAGraphs.
...
(EngineCore_DP0 pid=121) INFO ... Model loading took 2.16 GiB memory and 2.215206 seconds
(EngineCore_DP0 pid=121) INFO ... Skipping memory profiling for multimodal encoder and encoder cache.
terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  No device of requested type available. Please check https://software.intel.com/content/www/us/en/develop/articles/intel-oneapi-dpcpp-system-requirements.html
...
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Why that matters

At that point the experiment had already shown:

  • the Arc A750 was visible to SYCL and torch.xpu
  • the model could start loading
  • the default MM encoder flash path failed
  • switching to TORCH_SDPA did not make the engine stable
  • switching to TRITON_ATTN and also removing compile/profiling variables still did not make the engine stable
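
The three runs can be summarized as a small trial matrix. The sketch below only restates the argument overrides visible in the logs above; the trial labels and the `changed_keys` helper are hypothetical, and other arguments shared by the runs (such as `gpu_memory_utilization`) are left out on purpose.

```python
# Overrides per run, taken from the "non-default args" lines in the logs.
TRIALS = [
    ("default", {}),
    ("torch_sdpa", {"mm_encoder_attn_backend": "TORCH_SDPA"}),
    (
        "triton_eager_no_profiling",
        {
            "mm_encoder_attn_backend": "TRITON_ATTN",
            "enforce_eager": True,
            "skip_mm_profiling": True,
        },
    ),
]


def changed_keys(previous: dict, current: dict) -> list[str]:
    """List the argument names whose values differ between two trials."""
    keys = set(previous) | set(current)
    return sorted(key for key in keys if previous.get(key) != current.get(key))


# Show which variables each run changed relative to the one before it.
steps = []
previous_overrides: dict = {}
for label, overrides in TRIALS:
    steps.append((label, changed_keys(previous_overrides, overrides)))
    previous_overrides = overrides

for label, changed in steps:
    print(f"{label}: changed {changed}")
```

Listing the changed keys per step makes it easy to see that the last run altered three variables at once, which is worth keeping in mind when attributing its failure to any single setting.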

So my conclusion was not “vLLM XPU never works”.

My conclusion was:

  • on Intel Arc A750
  • for the current MinerU + Qwen2VL multimodal startup path
  • the vLLM + XPU route was not stable enough in my testing to be used as the default backend

Why transformers was preferred in the PR

Because transformers + xpu was the path that actually worked for the same hardware and model family.

I also found that the model loading path had to be conservative there as well. Direct XPU device_map loading was not reliable on Arc A750, while loading on CPU first and then moving the model to XPU did work:

loading processor
loading model on cpu
moving to xpu
model device xpu:0
done

So the current PR behavior is intended as a compatibility fallback:

  • prefer transformers on Intel XPU today
  • because that path was validated on real hardware
  • while the current vLLM multimodal path was not stable enough in the same testing

If the vLLM + XPU + Qwen2VL multimodal path becomes stable on Arc-class Intel GPUs later, I agree that this preference can be revisited.

@zy6p
Author

zy6p commented Apr 17, 2026

recheck

github-actions Bot added a commit that referenced this pull request Apr 17, 2026
@255doesnotexist

Any progress?
