UCT/ZE/ZE_IPC: enable level zero ipc support for Intel GPUs #11218

Open
zhangxiaoli73 wants to merge 14 commits into openucx:master from zhangxiaoli73:cherry/add-ze-ipc-support

@zhangxiaoli73
Contributor

@zhangxiaoli73 zhangxiaoli73 commented Feb 27, 2026

What?

Enable Level Zero IPC support for Intel GPUs. This PR adds:

  • A Level Zero IPC component
  • A Level Zero IPC iface
  • Level Zero async copy for the IPC EP
  • An IPC cache to reduce IPC exchange overhead in certain cases

Why?

We want to provide an IPC transport for users running within a single node.

@j-xiong
Contributor

j-xiong commented Mar 16, 2026

@zhangxiaoli73 Please fix the commit titles and do a rebase to resolve the conflicts. Maybe consolidate the two commits at the same time since a forced push is inevitable.

@yosefe I am assuming a forced push is allowed in this case. Is that right? If not, what is the recommended way to fix the commit titles?

@zhangxiaoli73
Contributor Author

> @zhangxiaoli73 Please fix the commit titles and do a rebase to resolve the conflicts. Maybe consolidate the two commits at the same time since a forced push is inevitable.
>
> @yosefe I am assuming a forced push is allowed in this case. Is that right? If not, what is the recommended way to fix the commit titles?

Got it. Let me know whether I can force-push to update the commits.

@zhangxiaoli73 zhangxiaoli73 force-pushed the cherry/add-ze-ipc-support branch from 118b1f6 to 8b1525f Compare April 3, 2026 10:52
@openucx openucx deleted a comment from svc-nixl May 4, 2026
yuanwu2017 and others added 6 commits May 7, 2026 08:12
# Conflicts:
#	src/uct/ze/base/ze_base.c
#	src/uct/ze/base/ze_base.h
- Remove iface_mem_element_pack from uct_iface_internal_ops (field
  removed in upstream)
- Add uct_ze_base_get_device(int ordinal) helper that returns the
  root device handle by ordinal, replacing the legacy direct lookup
  via uct_ze_base_info.devices[i] which was renamed to
  uct_ze_base.devices[i].root_device

Signed-off-by: yuanwu <yuan.wu@intel.com>
- test_ze_base: validates uct_ze_base_get_device / get_num_devices /
  get_device_ordinal helpers added during the upstream-master merge.
- test_ze_ipc_md: smoke-tests component registration, MD resource
  query, MD open/close (verifies ze_device + ze_context populated
  per sub-device after the merge), and md_attr.reg_mem_types.
- Wire HAVE_ZE block in test/gtest/Makefile.am: appends sources,
  ZE_CPPFLAGS, ZE_LDFLAGS, ZE_LIBS and libuct_ze.la.

Signed-off-by: yuanwu <yuan.wu@intel.com>
Add end-to-end mem_reg/mkey_pack/mem_dereg test for ze_ipc_md
(NIXL KV path), ze_ipc_cache lifecycle tests, and full ze_copy_md
coverage (open/close, alloc/free, detect_memory_type, mem_query).

21 tests total under HAVE_ZE; all pass on Intel PVC.

Signed-off-by: yuanwu <yuan.wu@intel.com>
Signed-off-by: yuanwu <yuan.wu@intel.com>
@yuanwu2017

gtest results:
[image: screenshot of gtest results]

Performance benchmark:
[image: screenshot of performance benchmark results]

@yuanwu2017

@yosefe Could you please review this PR?

@yuanwu2017

Hi @yosefe @brminich,

Just a friendly ping on this one 🙂 CI is green now and the branch is up to date with master. Whenever you have a few minutes, could you take a look? Would really love to hear your feedback — happy to tweak anything based on your comments.

Comment thread src/uct/ze/copy/ze_copy_ep.c Outdated
Comment thread src/uct/ze/copy/ze_copy_ep.c Outdated
Comment thread src/uct/ze/copy/ze_copy_ep.c Outdated
zhangxiaoli73 and others added 3 commits May 21, 2026 10:02
Co-authored-by: Yaser Afshar <yaser.afshar@intel.com>
Co-authored-by: Yaser Afshar <yaser.afshar@intel.com>
Co-authored-by: Yaser Afshar <yaser.afshar@intel.com>
@yosefe
Member

yosefe commented May 22, 2026

@zhangxiaoli73 @yafshar Can we add a build with ZE to builds.sh so that it runs in CI with at least a basic compilation check?
Also, is there a "mock" library for ZE hardware that we can run in CI?

@yafshar
Contributor

yafshar commented May 22, 2026

> Can we add a build with ZE to builds.sh so that it runs in CI with at least a basic compilation check?
> Also, is there a "mock" library for ZE hardware that we can run in CI?

Yes, we can add a ZE compile check to builds.sh and run it in CI.

For this to be a meaningful ZE check, the selected Linux job must have Level Zero headers and the ze_loader development package available (or an equivalent module/path setup). We should also force configuration with --with-ze so the job fails when ZE dependencies are missing. If we rely on the default auto-detect mode, ZE may be silently skipped and the build could pass without actually validating ZE compilation.
I plan to send a separate CI-focused PR that:

  • Adds a ZE compile-only check in one x86_64 Linux lane
  • Adds the required container/job setup for Level Zero dependencies
  • Keeps ZE runtime coverage out of scope

In parallel, I will investigate a production-grade ZE mock/simulation option and follow up with another separate PR, as that work is broader in scope.
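
The concern about auto-detection silently skipping ZE can be illustrated with a small post-configure check. The following is a minimal sketch, not code from this PR: the helper name `ze_enabled_in_config` is hypothetical, and it assumes the usual Autoconf convention that an enabled feature appears in config.h as `#define HAVE_ZE 1`, while a skipped one appears as a commented-out `#undef`.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not part of the PR):
# succeeds only if the given config.h defines HAVE_ZE as 1.
ze_enabled_in_config() {
    config_h=$1

    # A missing config.h means configure did not run or failed.
    [ -f "$config_h" ] || return 1

    # Match an active "#define HAVE_ZE 1" line. A skipped feature is
    # normally written as "/* #undef HAVE_ZE */", which does not match.
    grep -Eq '^[[:space:]]*#[[:space:]]*define[[:space:]]+HAVE_ZE[[:space:]]+1([[:space:]]|$)' "$config_h"
}
```

A CI step could run `ze_enabled_in_config config.h || exit 1` after configure, so that a silently skipped ZE build becomes a visible failure instead of a passing job.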

@yosefe
Member

yosefe commented May 28, 2026

Hi @yafshar,
I think we can start with the low-hanging fruit: add a container URL just to compile the ZE code. It would help catch potential regressions caused by infrastructure refactoring.

yuanwu2017 and others added 3 commits May 29, 2026 09:41
Add a dedicated 'ze' build_mode that runs configure-devel --with-ze and
verifies HAVE_ZE=1 in config.h. Mirrors the compile-only pattern used by
build_cuda / build_rocm: PR-CI only checks that ZE code keeps compiling
and linking; device gtests run on hardware lanes.

Changes:
- buildlib/tools/builds.sh:
    * new build_ze(): strict path when require_ze=yes (used by the
      dedicated lane), otherwise auto-skip when level_zero/ze_api.h
      is missing so short/long flows on non-ZE containers stay green
    * register 'build_ze' in base_tests and add a 'ze' build_mode
    * thread require_ze through the Azure env-var unset guard
- buildlib/pr/main.yml: new container alias ubuntu2404_ze reusing the
  existing doca-2.9.0 ubuntu24.04 image
- buildlib/pr/build_job.yml:
    * new x86_64 matrix row ubuntu2404_ze (build_mode=ze,
      require_ze=yes, install_ze_deps=yes)
    * pre-build step that apt-installs libze-dev (falls back to
      level-zero-dev) only when install_ze_deps=yes
    * pass require_ze into the builds.sh env block

Locally validated: configure-devel --with-ze on Ubuntu 24.04 (libze-dev
1.27.0) produces UCT/UCM/Perf 'ze' modules, HAVE_ZE=1, and a full
make -j succeeds, linking libuct_ze.so / libucm_ze.so / libucx_perftest_ze
against -lze_loader.

Signed-off-by: yuanwu <yuan.wu@intel.com>
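
The strict versus auto-skip behavior described for build_ze() can be sketched as a small decision function. This is an illustration under stated assumptions, not the PR's implementation: the function name `ze_build_action`, its arguments, and the returned keywords are hypothetical; only `require_ze` and the `level_zero/ze_api.h` header path come from the commit message. It is also an assumption that a non-strict lane builds with ZE whenever the header is present.

```shell
#!/bin/sh
# Hypothetical decision helper (names are assumptions, not the PR's code).
#   $1: require_ze  - "yes" on the dedicated ZE lane
#   $2: include_dir - directory expected to contain level_zero/ze_api.h
# Prints one of: strict, build, skip.
ze_build_action() {
    require_ze=$1
    include_dir=$2

    if [ "$require_ze" = "yes" ]; then
        # Dedicated lane: always configure with --with-ze and let
        # configure fail loudly if the dependencies are missing.
        echo strict
    elif [ -f "$include_dir/level_zero/ze_api.h" ]; then
        # Header found on a regular lane: ZE can be built.
        echo build
    else
        # No header on a non-ZE container: skip so the lane stays green.
        echo skip
    fi
}
```

For example, `ze_build_action "${require_ze:-no}" /usr/include` would print `skip` on a container without the Level Zero headers, while the dedicated lane with `require_ze=yes` would always take the strict path.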
5 participants