feat: kvbm-v2 xpu-sycl enablement #87

Open
zxue2 wants to merge 1 commit into ai-dynamo:main from zxue2:kvbm_v2_xpu_sycl_enablement


zxue2 commented May 6, 2026

Please align with DEP ai-dynamo/dynamo#9313.
This document describes the design and implementation details of how Intel XPU (SYCL/oneAPI) was integrated into KVBM v2 alongside the existing NVIDIA CUDA backend: the trait surfaces that were extracted, the SYCL implementations that were added, and the crate-level wiring that keeps KVBM v2 engine-agnostic and framework-agnostic. The document covers the state of the branch, the evolution from the CUDA-only baseline, and the relationships between the crates under lib/ that make up KVBM v2.
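For readers unfamiliar with the pattern, here is a minimal sketch of what an extracted trait surface with interchangeable backends can look like. It is not taken from the branch: `DeviceKind`, `DeviceBackend`, `HostStandIn`, `select_backend`, and `transfer_block` are hypothetical names, and both backends are host-memory stand-ins rather than real CUDA or SYCL calls.

```rust
/// Hypothetical device kinds; illustrative only, not the names used in KVBM v2.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeviceKind {
    Cuda,
    Xpu,
}

/// Hypothetical backend-agnostic copy surface. A real trait would deal with
/// device pointers, streams/queues, and events instead of host slices.
pub trait DeviceBackend {
    fn kind(&self) -> DeviceKind;

    /// Copy `src` into `dst` and return the number of bytes copied.
    fn copy(&self, src: &[u8], dst: &mut [u8]) -> Result<usize, String>;
}

/// Stand-in backend that copies host memory; it only mimics the shape of a
/// CUDA or SYCL implementation.
pub struct HostStandIn {
    kind: DeviceKind,
}

impl DeviceBackend for HostStandIn {
    fn kind(&self) -> DeviceKind {
        self.kind
    }

    fn copy(&self, src: &[u8], dst: &mut [u8]) -> Result<usize, String> {
        if src.len() != dst.len() {
            return Err(format!(
                "length mismatch: src={} dst={}",
                src.len(),
                dst.len()
            ));
        }
        dst.copy_from_slice(src);
        Ok(src.len())
    }
}

/// Pick a backend at runtime; callers only ever see the trait object.
pub fn select_backend(kind: DeviceKind) -> Box<dyn DeviceBackend> {
    Box::new(HostStandIn { kind })
}

/// Engine-side code written once against the trait, independent of vendor.
pub fn transfer_block(
    backend: &dyn DeviceBackend,
    src: &[u8],
    dst: &mut [u8],
) -> Result<usize, String> {
    backend.copy(src, dst)
}
```

The point of the sketch is only that code such as `transfer_block` depends on the trait, so adding a SYCL backend next to a CUDA one does not change the callers.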

Implementation PR:
ai-dynamo/dynamo#7946

Question:
We need to know what will and will not change for KVBM v2, e.g., in kvbm-common, kvbm-config, kvbm-engine, kvbm-kernels, kvbm-logical, kvbm-physical, memory, bindings, etc.

  • Which crates will serve KVBM v2?
  • Which crates are new for KVBM v2? Please share the schedule for merging new crates into main, e.g., kvbm-connector.
  • Will the basic operations (e.g., those in device_executor_flow.md, collectives.md, sycl_kernels.md, etc.) change?
  • Will KVBM v2 be decoupled from llm/block_manager?

Concern:
Measure XPU perf (see kvbm_v2_xpu_sycl_enablement.md) at three API layers: the raw API (kvbench in kvbm-kernels, using the Intel SYCL Rust binding), the transfer layer (bench_transfer, using the transfer manager API in kvbm-physical), and the engine layer (bench_engine in kvbm-engine). All three should use unified, abstracted procedures for batch copy, vectorized copy (GPU kernel), oneCCL (broadcast for MLA), NUMA pinned memory allocation, etc.
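As a rough illustration of measuring the same copy workload through different layers, the sketch below times any copy routine passed in as a closure and reports throughput. `BenchResult` and `bench_copy` are hypothetical helpers, not the kvbench, bench_transfer, or bench_engine code, and the copies here are plain host copies.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Result of one benchmark run (hypothetical type, for illustration only).
pub struct BenchResult {
    pub label: String,
    pub bytes: usize,
    pub elapsed: Duration,
}

impl BenchResult {
    /// Throughput in GiB/s; returns infinity if the timer resolution was too
    /// coarse to observe any elapsed time.
    pub fn gib_per_sec(&self) -> f64 {
        let secs = self.elapsed.as_secs_f64();
        if secs == 0.0 {
            return f64::INFINITY;
        }
        self.bytes as f64 / secs / (1u64 << 30) as f64
    }
}

/// Time `iters` copies of a `block_bytes` buffer through the given routine.
/// The same harness can wrap a raw-API call, a transfer-manager call, or an
/// engine-level call, so the layers are compared on identical workloads.
pub fn bench_copy<F>(label: &str, block_bytes: usize, iters: usize, mut copy: F) -> BenchResult
where
    F: FnMut(&[u8], &mut [u8]),
{
    let src = vec![1u8; block_bytes];
    let mut dst = vec![0u8; block_bytes];

    let start = Instant::now();
    for _ in 0..iters {
        copy(&src, &mut dst);
    }
    let elapsed = start.elapsed();

    // Keep the destination observable so the copies are not optimized away.
    black_box(&dst);

    BenchResult {
        label: label.to_string(),
        bytes: block_bytes * iters,
        elapsed,
    }
}
```

In a real comparison each closure would call into one layer (raw kernel binding, transfer manager, engine), and device-side completion would need to be synchronized before the timer stops.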


zxue2 commented May 6, 2026

@statiraju, could you please help review? The XPU-SYCL enablement above is based on kvbm-v2 on the main branch and is for reference only. We would like to follow your multi-device/backend design, and your team is the proper maintainer for the device abstraction. It would be great to work closely with you so that we can sync on this feature at each important milestone without the extra overhead of refactoring it again.

XPU/system perf tuning may be more critical, as it is the foundation of KVBM v2. Our next step is to measure XPU perf as outlined in the description above, so it is important to know which crates and operations will continue to serve KVBM v2. The XPU bench work can proceed in parallel. Thanks.

Please add other kvbm-v2 reviewers/designers if needed.
cc: @dzier @ziqifan617

zxue2 force-pushed the kvbm_v2_xpu_sycl_enablement branch 2 times, most recently from b00d6fd to 9efd6a2, May 7, 2026 03:13
zxue2 force-pushed the kvbm_v2_xpu_sycl_enablement branch 4 times, most recently from 632db11 to 302a32b, May 19, 2026 05:57
Describe the design and implementation details about how Intel XPU
(SYCL/oneAPI) was integrated into KVBM v2 alongside the existing NVIDIA
CUDA backend: the trait surfaces that were extracted, the SYCL implementations
that were added, and the crate-level wiring that keeps KVBM v2 engine-agnostic
and framework-agnostic. The document covers the state of the branch, the
evolution from the CUDA-only baseline, and the relationships between the crates
under lib/ that make up KVBM v2.

Signed-off-by: Zhan Xue <zhan.xue@intel.com>