feat: add Intel XPU transformers support #4801
vLLM provides Docker images that support XPU, so why is it still necessary to explicitly select the transformers backend? |
|
Thanks. Here is the concrete reason I switched to the conservative settings, with the actual logs inline. The short version is:
Why I changed those parameters
I changed them as a failure-isolation sequence, not as random tuning:
This was based on the XPU ViT attention backend logic in the

    @classmethod
    def get_supported_vit_attn_backends(cls):
        return [
            AttentionBackendEnum.FLASH_ATTN,
            AttentionBackendEnum.TRITON_ATTN,
            AttentionBackendEnum.TORCH_SDPA,
        ]

and defaults to
What I observed on the real machine
In the same run, my diagnosis note recorded that this corresponded to the XPU FlashAttention kernel failure:
To be explicit: that exact line came from the same failing run, but the full stderr capture in the session log was truncated, so I no longer have the complete raw block for it.
So this was already evidence that the issue was not just “default FLASH_ATTN is too aggressive”.
That run still failed:
Why that matters
At that point the experiment had already shown:
So my conclusion was not “vLLM XPU never works”. My conclusion was:
Why
Because I also found that the model loading path had to be conservative there as well. Direct XPU
So the current PR behavior is intended as a compatibility fallback:
If the |
|
recheck |
|
Any progress? |
Summary
This PR adds a minimal Intel XPU path for the transformers backend on Linux.
Changes:
- Detect xpu in get_device() when torch.xpu.is_available()
- Prefer transformers over vllm on Linux when the selected device is xpu
- Load Qwen2VLForConditionalGeneration on CPU first and then move it to xpu
Why
On Intel Arc A750 with the current oneAPI / torch xpu stack, the default Linux auto-engine path selects vllm, but the end-to-end MinerU VLM service is not stable through that route. The transformers backend can work, but it needs two adjustments:
- Recognize xpu as a device type.
- Avoid the device_map={"": device} loading path on XPU and instead load on CPU first, then call .to("xpu").
Validation
Validated on a private Intel Arc A750 deployment:
- torch.xpu.is_available() == True
- python3 -m py_compile on the touched files
Scope
This PR intentionally stays small and does not add new deployment docs or packaging changes.
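For reviewers who want the shape of the three changes at a glance, here is a minimal sketch and not the actual diff. The function names pick_engine and load_model, their parameters, and the order in which devices are probed are assumptions made for illustration. Only torch.xpu.is_available(), device_map={"": device}, and .to("xpu") come from the description above.

```python
import platform
from typing import Optional


def get_device() -> str:
    """Return "cuda", "xpu", or "cpu" (probe order is an assumption)."""
    try:
        import torch  # imported lazily so the sketch also runs without torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.xpu only exists in builds with XPU support, so probe defensively.
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"
    return "cpu"


def pick_engine(device: str, system: Optional[str] = None,
                vllm_installed: bool = False) -> str:
    """Hypothetical auto-engine rule: prefer transformers on Linux + xpu."""
    system = system or platform.system()
    if system == "Linux" and device == "xpu":
        return "transformers"
    if system == "Linux" and vllm_installed:
        return "vllm"
    return "transformers"


def load_model(model_cls, model_path: str, device: str, **kwargs):
    """Load model_cls, avoiding the device_map path when targeting xpu."""
    if str(device).startswith("xpu"):
        # Load on CPU first, then move the whole model to the XPU.
        model = model_cls.from_pretrained(model_path, **kwargs)
        return model.to(device)
    # Other devices keep the direct device_map loading path.
    return model_cls.from_pretrained(
        model_path, device_map={"": device}, **kwargs
    )
```

With this shape, a call such as load_model(Qwen2VLForConditionalGeneration, path, get_device()) would take the CPU-first branch only when the detected device is xpu, so other devices are unaffected.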