[XPU][NIXL] Add GPUDirect RDMA support for XPU#35270

Merged
jikunshang merged 9 commits into vllm-project:main from zhenwei-intel:xpu_pd_2026
Mar 3, 2026

@zhenwei-intel
Contributor

@zhenwei-intel zhenwei-intel commented Feb 25, 2026

Purpose

Add GPUDirect RDMA support for XPU in NIXL connector.

Requirements

Limitations:

Test Plan

Performance data for the Llama 3.3 70B INT4 model with an FP8 KV cache on 8x B60, ISL=1500, OSL=150.
2P1D (two prefill instances, one decode instance) vs. non-PD under the SLO TTFT < 5 s, ITL < 100 ms:

  • Serve more requests: under the SLO, 2P1D achieved a request throughput of 1.06, compared to 0.64 for non-PD, a 1.65x improvement.
image
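
As a quick check of the arithmetic behind the reported improvement, the sketch below divides the two throughput figures quoted above. The variable names are illustrative, and the inputs are the rounded values from the description, so the result can differ in the last digit from a figure computed on unrounded measurements.

```python
# Rounded request-throughput figures quoted in the PR description.
throughput_2p1d = 1.06    # 2P1D: two prefill instances, one decode instance
throughput_non_pd = 0.64  # non-PD baseline

# Relative improvement of 2P1D over the non-PD baseline.
speedup = throughput_2p1d / throughput_non_pd
print(f"speedup: {speedup:.3f}x")
```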

PD commands

prefill

export UCX_TLS=ib,rc,ze_copy

export ZE_AFFINITY_MASK=2,3
export model_name=ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4
export tp_size=2


VLLM_USE_V1=1 VLLM_NIXL_SIDE_CHANNEL_HOST=localhost VLLM_NIXL_SIDE_CHANNEL_PORT=5577 VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_ENABLE_V1_MULTIPROCESSING=1 vllm serve $model_name -tp $tp_size --host localhost --port 7101 --seed 42 --enforce-eager --dtype float16 --gpu-memory-utilization 0.9 --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_buffer_device":"xpu"}' --max-model-len 8192 --block-size 64 --no-enable-prefix-caching --kv-cache-dtype fp8

prefill2

export UCX_TLS=ib,rc,ze_copy

export ZE_AFFINITY_MASK=4,5
export model_name=ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4
export tp_size=2


VLLM_USE_V1=1 VLLM_NIXL_SIDE_CHANNEL_HOST=localhost VLLM_NIXL_SIDE_CHANNEL_PORT=5377 VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_ENABLE_V1_MULTIPROCESSING=1 vllm serve $model_name -tp $tp_size --host localhost --port 7102 --seed 42 --enforce-eager --dtype float16 --gpu-memory-utilization 0.9 --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_buffer_device":"xpu"}' --max-model-len 8192 --block-size 64 --no-enable-prefix-caching --kv-cache-dtype fp8

decode

export UCX_TLS=ib,rc,ze_copy

export ZE_AFFINITY_MASK=0,1
export model_name=ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4
export tp_size=2


VLLM_USE_V1=1 VLLM_NIXL_SIDE_CHANNEL_HOST=localhost VLLM_NIXL_SIDE_CHANNEL_PORT=5177 VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_ENABLE_V1_MULTIPROCESSING=1 vllm serve $model_name -tp $tp_size --host localhost --port 7201 --seed 42 --enforce-eager --dtype float16 --gpu-memory-utilization 0.9 --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_buffer_device":"xpu"}' --max-model-len 8192 --block-size 64 --no-enable-prefix-caching --kv-cache-dtype fp8

proxy

python3 tests/v1/kv_connector/nixl_integration/toy_proxy_server.py --prefiller-hosts localhost localhost --prefiller-ports 7101 7102 --decoder-host localhost --decoder-port 7201 --host localhost --port 7300
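
The three server commands above pass the same --kv-transfer-config value. As a minimal sketch, that JSON can be generated rather than typed by hand, which helps avoid quoting mistakes. The helper name build_kv_transfer_config is hypothetical and not part of vLLM; the keys and values are copied from the commands in this description.

```python
import json
import shlex


def build_kv_transfer_config(buffer_device: str = "xpu") -> str:
    """Return the JSON string passed to --kv-transfer-config (hypothetical helper)."""
    config = {
        "kv_connector": "NixlConnector",
        "kv_role": "kv_both",
        "kv_buffer_device": buffer_device,
    }
    # Compact separators match the single-line form used in the commands.
    return json.dumps(config, separators=(",", ":"))


kv_config = build_kv_transfer_config()
# shlex.quote wraps the JSON so a POSIX shell passes it as a single argument.
print("--kv-transfer-config", shlex.quote(kv_config))
```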

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@zhenwei-intel
Contributor Author

cc. @xuechendi, @rogerxfeng8, @yma11

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request adds GPUDirect RDMA support for XPU in the NIXL connector. The changes involve updating the UCX build script to a specific commit that includes the necessary fixes, enabling Level Zero support, and modifying the NIXL connector and XPU platform code to support XPU device memory for KV transfer. The changes are generally correct and well-targeted. I have one suggestion to improve the precision of a workaround to avoid potential side effects.

Comment thread vllm/platforms/xpu.py
Comment thread tools/install_nixl_from_source_ubuntu.py Outdated
Comment thread vllm/platforms/xpu.py
Comment thread tools/install_nixl_from_source_ubuntu.py Outdated
@mergify mergify Bot added the ci/build label Feb 25, 2026
Comment thread tools/install_nixl_from_source_ubuntu.py Outdated
@xuechendi
Collaborator

@zhenwei-intel, regarding --kv-cache-dtype fp8: we can't transfer the scale at the moment, so accuracy is impacted. The impact might not be significant on simple text.

@zhenwei-intel
Contributor Author

The FP8 KV cache PR hasn't been upstreamed yet. On the upstream branch, a BF16 KV cache can be used.

The current performance testing is based on FP8 KV Cache (cherry-picked https://github.com/intel-innersource/applications.ai.gpu.vllm-xpu/pull/57), with a scale of 1.0.

Comment thread vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py
Comment thread vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py Outdated
@hshen14

hshen14 commented Feb 27, 2026

There is no need to transfer the scale, since the models on both the P and D instances should already have that information.

Collaborator

@xuechendi xuechendi left a comment

LGTM. @NickLucche, could you help review?

Member

@NickLucche NickLucche left a comment

this looks ok on my side, nit on buffer naming

Comment thread vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py Outdated
@NickLucche NickLucche added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Feb 27, 2026
Member

@NickLucche NickLucche left a comment

synced with @xuechendi

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
@zhenwei-intel
Contributor Author

@jikunshang @1643661061leo I updated docker/Dockerfile.xpu; please take another look.

Comment thread docker/Dockerfile.xpu
Comment thread docker/Dockerfile.xpu Outdated
@jikunshang jikunshang requested a review from Copilot February 28, 2026 08:26
Contributor

Copilot AI left a comment

Pull request overview

Adds GPUDirect RDMA support for XPU in the NIXL connector by enabling XPU VRAM KV-buffer paths, applying a UCX workaround for memtype misdetection, and updating the XPU Docker image to build UCX+NIXL from source with RDMA dependencies.

Changes:

  • Enable "xpu" as a KV buffer device for the NIXL connector and map it to VRAM memory type.
  • Apply a UCX environment workaround on XPU to avoid memtype-cache misdetection.
  • Update Dockerfile.xpu to build UCX (pinned commit) and NIXL from source and install RDMA tooling/libs.
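
The first change in the list above can be pictured as a lookup from the configured KV buffer device to the memory type registered with NIXL. This is an illustrative sketch only: the table and function names are hypothetical, the "cpu" to "DRAM" entry is an assumption added for contrast, and the actual connector code may be organized differently.

```python
# Hypothetical lookup table; only the "xpu" -> "VRAM" pairing comes from the
# review summary above.
BUFFER_DEVICE_TO_MEMORY_TYPE = {
    "xpu": "VRAM",  # XPU device memory
    "cpu": "DRAM",  # assumption: host buffers map to host memory
}


def memory_type_for(kv_buffer_device: str) -> str:
    """Map a kv_buffer_device value to a memory type, rejecting unknown devices."""
    try:
        return BUFFER_DEVICE_TO_MEMORY_TYPE[kv_buffer_device]
    except KeyError:
        raise ValueError(
            f"unsupported kv_buffer_device: {kv_buffer_device!r}"
        ) from None


print(memory_type_for("xpu"))
```

Rejecting unknown devices up front, rather than falling back silently, keeps a mistyped kv_buffer_device from quietly selecting host memory.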

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| vllm/platforms/xpu.py | Sets UCX env var to avoid UCX memtype-cache misdetection on XPU. |
| vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py | Allows XPU KV buffers and maps XPU device to VRAM in NIXL. |
| docker/Dockerfile.xpu | Builds UCX/NIXL from source and installs RDMA dependencies for XPU images. |
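
A sketch of how an environment workaround like the one described for vllm/platforms/xpu.py might be applied before UCX initializes. The summary does not name the variable: UCX_MEMTYPE_CACHE with the value "n" is an assumption here (it is the UCX setting that disables the memory-type cache), and the function name is hypothetical.

```python
import os


def apply_ucx_memtype_workaround(env=os.environ) -> None:
    """Disable the UCX memtype cache unless the user already chose a value."""
    # Assumption: UCX_MEMTYPE_CACHE=n is the relevant setting; the PR may differ.
    # setdefault leaves an explicit user override untouched.
    env.setdefault("UCX_MEMTYPE_CACHE", "n")


# Demonstrate on a plain dict so the real process environment is not modified.
demo_env = {}
apply_ucx_memtype_workaround(demo_env)
print(demo_env)
```

Using setdefault rather than plain assignment limits side effects: a value the user exported deliberately is preserved.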

Comment thread vllm/platforms/xpu.py
Comment thread docker/Dockerfile.xpu
Comment thread docker/Dockerfile.xpu Outdated
Comment thread docker/Dockerfile.xpu
Comment thread docker/Dockerfile.xpu Outdated
Comment thread docker/Dockerfile.xpu Outdated
Comment thread docker/Dockerfile.xpu Outdated
Comment thread docker/Dockerfile.xpu
Comment thread vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py
zhenwei-intel and others added 2 commits February 28, 2026 18:52
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
@jikunshang
Member

Merging as CI passed. Thanks for your contribution!

@jikunshang jikunshang merged commit 9dd656f into vllm-project:main Mar 3, 2026
115 checks passed
Labels

ci/build, kv-connector, ready (ONLY add when PR is ready to merge/full CI is needed)
