Skip to content

Commit 9ed6e19

Browse files
PMZFXclaudearthw
authored
SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations (#21597)
* SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations Replace sycl::malloc_device with zeMemAllocDevice for GPU memory allocation in the SYCL backend. sycl::malloc_device triggers the xe kernel driver's DMA-buf/TTM path which mirrors every VRAM allocation 1:1 in system RAM. zeMemAllocDevice uses the SVM/P2P path with no host staging. On a dual Intel Arc Pro B70 system (64GB VRAM, 64GB RAM), a 15.6 GiB model consumed 60 GiB of system RAM via sycl::malloc_device, causing OOM crashes. With zeMemAllocDevice, the same workload uses ~6.7 GiB of system RAM with no performance regression. All Level Zero calls include automatic fallback to the original SYCL allocation path if Level Zero interop is unavailable. * SYCL: address review feedback - remove try/catch, check device types, deduplicate - Remove try/catch from malloc/free/memcpy helpers, check backend and device type upfront instead (ggml_sycl_is_level_zero, ggml_sycl_is_dgpu) - Move shared helpers (is_level_zero, is_dgpu, free_device) to common.cpp and declare in common.hpp to eliminate code duplication - Use SYCL_CHECK(CHECK_TRY_ERROR()) for fallback sycl::free calls - Guard dev2dev_memcpy L0 path to dGPU-to-dGPU only, preserving the host-staged path for iGPU-to-dGPU transfers - Add Windows Level Zero SDK path detection (LEVEL_ZERO_V1_SDK_PATH) in CMakeLists.txt (co-authored with @arthw) * SYCL: add build/runtime flags for Level Zero, address review feedback Implements the architecture suggested by @arthw: compile-time and runtime flags to cleanly separate Level Zero and SYCL memory API paths. - Add GGML_SYCL_SUPPORT_LEVEL_ZERO cmake option (default ON). All Level Zero code is wrapped in #ifdef so the build works on systems without the Level Zero SDK installed (e.g. CPU-only CI servers). Both the loader library and headers are checked before enabling. - Add GGML_SYCL_ENABLE_LEVEL_ZERO runtime env var (default 1). Controls whether Level Zero or SYCL memory APIs are used. Only one API style is used per session, no mixing. If Level Zero is enabled but the devices don't support the Level Zero backend, it auto-disables with a warning. - Remove Level Zero code from dpct_malloc. It was unused (dpct::device_memory is not called anywhere in the backend) and used try/catch for flow control. - Update SYCL.md with documentation for both new parameters. Tested on Intel Arc Pro B70 (32GB), single-GPU and dual-GPU, with both GGML_SYCL_SUPPORT_LEVEL_ZERO=ON and OFF builds. AI-assisted development (Claude). Code reviewed and tested on my hardware. * SYCL: unify Level Zero malloc/free call sites, address review feedback Move ggml_sycl_malloc_device to common.cpp alongside ggml_sycl_free_device. Both functions are now unconditionally available — Level Zero code is #ifdef'd inside the functions, not at call sites. All call sites use uniform SYCL_CHECK(CHECK_TRY_ERROR()) wrapping with no #ifdef blocks. Addresses arthw's review: wrap all malloc/free in SYCL_CHECK for stack traces on failure, eliminate duplicated #ifdef/else patterns at 6 call sites (-29 lines net). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * SYCL: add Level Zero SDK to CI, fix device check and missed alloc paths Add Level Zero SDK installation to Ubuntu and Windows SYCL CI jobs so the Level Zero code path is compiled and tested in CI. Fix two bugs found during extended dual-GPU testing (no ONEAPI_DEVICE_SELECTOR set): - The Level Zero backend check was iterating all SYCL devices including CPU. The OpenCL CPU device caused Level Zero to be disabled for the GPUs, defeating the fix on multi-GPU systems. Added is_gpu() filter so only GPU devices are checked. - sycl_ext_malloc_device/sycl_ext_free (tensor reorder temp buffers) were still calling sycl::malloc/sycl::free directly, bypassing the Level Zero path. Routed through ggml_sycl_malloc_device/free_device for consistency with the other device memory call sites. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * SYCL: address arthw review feedback on Level Zero memory API structure - Move ggml_sycl_malloc_device to static function in ggml-sycl.cpp; only ggml_sycl_free_device (used by common.cpp) stays in common.cpp - Switch both helpers to use g_ggml_sycl_enable_level_zero global instead of per-call queue backend checks - Remove #ifdef wrapper from global definition; always declare at 0, add #else branch in init block so it stays 0 when L0 not compiled in - Update init loop comment to explain GPU-only device check - CMakeLists: message(STATUS) before the if block; align option wording AI-assisted implementation. Reviewed and tested on dual Intel Arc Pro B70 (32 GB each): test-backend-ops OK on both GPUs, single/dual-GPU Q4_K_M and Q8_0 bench correct, zeMemAllocDevice GTT delta confirmed <5 MiB per 4 GiB allocation (vs ~4 GiB shadow with sycl::malloc_device). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * SYCL: remove unused cstdio/cstdlib includes from common.cpp Leftover from the deleted ggml_sycl_queue_supports_level_zero helper. Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * Apply suggestions from code review Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com> * SYCL: preserve Level Zero allocation path during early malloc * ci: fix Level Zero package conflict in Intel Docker build * ci: find Level Zero loader in oneAPI package step * ci: allow Windows SYCL package without Level Zero DLL --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>
1 parent 4c1c3ac commit 9ed6e19

9 files changed

Lines changed: 242 additions & 25 deletions

File tree

.devops/intel.Dockerfile

Lines changed: 8 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -5,8 +5,15 @@ ARG ONEAPI_VERSION=2025.3.3-0-devel-ubuntu24.04
55
FROM intel/deep-learning-essentials:$ONEAPI_VERSION AS build
66

77
ARG GGML_SYCL_F16=OFF
8+
ARG LEVEL_ZERO_VERSION=1.28.2
9+
ARG LEVEL_ZERO_UBUNTU_VERSION=u24.04
810
RUN apt-get update && \
9-
apt-get install -y git libssl-dev
11+
apt-get install -y git libssl-dev wget ca-certificates && \
12+
cd /tmp && \
13+
wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero.deb && \
14+
wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero-devel_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero-devel.deb && \
15+
apt-get -o Dpkg::Options::="--force-overwrite" install -y ./level-zero.deb ./level-zero-devel.deb && \
16+
rm -f /tmp/level-zero.deb /tmp/level-zero-devel.deb
1017

1118
WORKDIR /app
1219

@@ -109,4 +116,3 @@ WORKDIR /app
109116
HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]
110117

111118
ENTRYPOINT [ "/app/llama-server" ]
112-

.github/workflows/build-sycl.yml

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -50,6 +50,8 @@ jobs:
5050
env:
5151
ONEAPI_ROOT: /opt/intel/oneapi/
5252
ONEAPI_INSTALLER_VERSION: "2025.3.3"
53+
LEVEL_ZERO_VERSION: "1.28.2"
54+
LEVEL_ZERO_UBUNTU_VERSION: "u24.04"
5355

5456
continue-on-error: true
5557

@@ -71,6 +73,14 @@ jobs:
7173
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/56f7923a-adb8-43f3-8b02-2b60fcac8cab/intel-deep-learning-essentials-2025.3.3.16_offline.sh -O intel-deep-learning-essentials_offline.sh
7274
sudo bash intel-deep-learning-essentials_offline.sh -s -a --silent --eula accept
7375
76+
- name: Install Level Zero SDK
77+
shell: bash
78+
run: |
79+
cd /tmp
80+
wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero.deb
81+
wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero-devel_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero-devel.deb
82+
sudo apt-get install -y ./level-zero.deb ./level-zero-devel.deb
83+
7484
- name: Clone
7585
id: checkout
7686
uses: actions/checkout@v6
@@ -107,6 +117,7 @@ jobs:
107117
env:
108118
WINDOWS_BASEKIT_URL: https://registrationcenter-download.intel.com/akdlm/IRC_NAS/b60765d1-2b85-4e85-86b6-cb0e9563a699/intel-deep-learning-essentials-2025.3.3.18_offline.exe
109119
WINDOWS_DPCPP_MKL: intel.oneapi.win.cpp-dpcpp-common:intel.oneapi.win.mkl.devel:intel.oneapi.win.dnnl:intel.oneapi.win.tbb.devel
120+
LEVEL_ZERO_SDK_URL: https://github.com/oneapi-src/level-zero/releases/download/v1.28.2/level-zero-win-sdk-1.28.2.zip
110121
ONEAPI_ROOT: "C:/Program Files (x86)/Intel/oneAPI"
111122
ONEAPI_INSTALLER_VERSION: "2025.3.3"
112123
steps:
@@ -127,6 +138,13 @@ jobs:
127138
run: |
128139
scripts/install-oneapi.bat $WINDOWS_BASEKIT_URL $WINDOWS_DPCPP_MKL
129140
141+
- name: Install Level Zero SDK
142+
shell: pwsh
143+
run: |
144+
Invoke-WebRequest -Uri "${{ env.LEVEL_ZERO_SDK_URL }}" -OutFile "level-zero-win-sdk.zip"
145+
Expand-Archive -Path "level-zero-win-sdk.zip" -DestinationPath "C:/level-zero-sdk" -Force
146+
"LEVEL_ZERO_V1_SDK_PATH=C:/level-zero-sdk" | Out-File -FilePath $env:GITHUB_ENV -Append
147+
130148
- name: ccache
131149
uses: ggml-org/ccache-action@v1.2.21
132150
with:

.github/workflows/release.yml

Lines changed: 25 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -600,6 +600,7 @@ jobs:
600600
env:
601601
WINDOWS_BASEKIT_URL: https://registrationcenter-download.intel.com/akdlm/IRC_NAS/b60765d1-2b85-4e85-86b6-cb0e9563a699/intel-deep-learning-essentials-2025.3.3.18_offline.exe
602602
WINDOWS_DPCPP_MKL: intel.oneapi.win.cpp-dpcpp-common:intel.oneapi.win.mkl.devel:intel.oneapi.win.dnnl:intel.oneapi.win.tbb.devel
603+
LEVEL_ZERO_SDK_URL: https://github.com/oneapi-src/level-zero/releases/download/v1.28.2/level-zero-win-sdk-1.28.2.zip
603604
ONEAPI_ROOT: "C:/Program Files (x86)/Intel/oneAPI"
604605
ONEAPI_INSTALLER_VERSION: "2025.3.3"
605606

@@ -621,6 +622,13 @@ jobs:
621622
run: |
622623
scripts/install-oneapi.bat $WINDOWS_BASEKIT_URL $WINDOWS_DPCPP_MKL
623624
625+
- name: Install Level Zero SDK
626+
shell: pwsh
627+
run: |
628+
Invoke-WebRequest -Uri "${{ env.LEVEL_ZERO_SDK_URL }}" -OutFile "level-zero-win-sdk.zip"
629+
Expand-Archive -Path "level-zero-win-sdk.zip" -DestinationPath "C:/level-zero-sdk" -Force
630+
"LEVEL_ZERO_V1_SDK_PATH=C:/level-zero-sdk" | Out-File -FilePath $env:GITHUB_ENV -Append
631+
624632
- name: ccache
625633
uses: ggml-org/ccache-action@v1.2.21
626634
with:
@@ -655,6 +663,13 @@ jobs:
655663
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/ur_adapter_opencl.dll" ./build/bin
656664
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/ur_loader.dll" ./build/bin
657665
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/ur_win_proxy_loader.dll" ./build/bin
666+
ZE_LOADER_DLL=$(find "${{ env.ONEAPI_ROOT }}" "$LEVEL_ZERO_V1_SDK_PATH" -iname ze_loader.dll -print -quit 2>/dev/null || true)
667+
if [ -n "$ZE_LOADER_DLL" ]; then
668+
echo "Using Level Zero loader: $ZE_LOADER_DLL"
669+
cp "$ZE_LOADER_DLL" ./build/bin
670+
else
671+
echo "Level Zero loader DLL not found in oneAPI or SDK; relying on system driver/runtime"
672+
fi
658673
659674
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/sycl8.dll" ./build/bin
660675
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/svml_dispmd.dll" ./build/bin
@@ -695,6 +710,8 @@ jobs:
695710
env:
696711
ONEAPI_ROOT: /opt/intel/oneapi/
697712
ONEAPI_INSTALLER_VERSION: "2025.3.3"
713+
LEVEL_ZERO_VERSION: "1.28.2"
714+
LEVEL_ZERO_UBUNTU_VERSION: "u24.04"
698715

699716
steps:
700717
- name: Clone
@@ -718,6 +735,14 @@ jobs:
718735
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/56f7923a-adb8-43f3-8b02-2b60fcac8cab/intel-deep-learning-essentials-2025.3.3.16_offline.sh -O intel-deep-learning-essentials_offline.sh
719736
sudo bash intel-deep-learning-essentials_offline.sh -s -a --silent --eula accept
720737
738+
- name: Install Level Zero SDK
739+
shell: bash
740+
run: |
741+
cd /tmp
742+
wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero.deb
743+
wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero-devel_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero-devel.deb
744+
sudo apt-get install -y ./level-zero.deb ./level-zero-devel.deb
745+
721746
- name: ccache
722747
uses: ggml-org/ccache-action@v1.2.21
723748
with:

docs/backend/SYCL.md

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -720,6 +720,7 @@ use 1 SYCL GPUs: [0] with Max compute units:512
720720
| GGML_SYCL_GRAPH | OFF *(default)* \|ON *(Optional)* | Enable build with [SYCL Graph extension](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc). |
721721
| GGML_SYCL_DNN | ON *(default)* \|OFF *(Optional)* | Enable build with oneDNN. |
722722
| GGML_SYCL_HOST_MEM_FALLBACK | ON *(default)* \|OFF *(Optional)* | Allow host memory fallback when device memory is full during quantized weight reorder. Enables inference to continue at reduced speed (reading over PCIe) instead of failing. Requires Linux kernel 6.8+. |
723+
| GGML_SYCL_SUPPORT_LEVEL_ZERO | ON *(default)* \|OFF *(Optional)* | Enable Level Zero API for device memory allocation. Requires Level Zero headers/library at build time and Intel GPU driver (Level Zero runtime) at run time. Reduces system RAM usage during multi-GPU inference. |
723724
| CMAKE_C_COMPILER | `icx` *(Linux)*, `icx/cl` *(Windows)* | Set `icx` compiler for SYCL code path. |
724725
| CMAKE_CXX_COMPILER | `icpx` *(Linux)*, `icx` *(Windows)* | Set `icpx/icx` compiler for SYCL code path. |
725726

@@ -733,9 +734,10 @@ use 1 SYCL GPUs: [0] with Max compute units:512
733734
| GGML_SYCL_ENABLE_FLASH_ATTN | 1 (default) or 0| Enable Flash-Attention. It can reduce memory usage. The performance impact depends on the LLM.|
734735
| GGML_SYCL_DISABLE_OPT | 0 (default) or 1 | Disable optimize features for Intel GPUs. (Recommended to 1 for intel devices older than Gen 10) |
735736
| GGML_SYCL_DISABLE_GRAPH | 0 or 1 (default) | Disable running computations through SYCL Graphs feature. Disabled by default because SYCL Graph is still on development, no better performance. |
737+
| GGML_SYCL_ENABLE_LEVEL_ZERO | 1 (default) or 0 | Use Level Zero API for device memory allocation instead of SYCL. Reduces system RAM usage on Intel dGPUs by avoiding DMA-buf/TTM host memory staging. Requires GGML_SYCL_SUPPORT_LEVEL_ZERO=ON at build time. |
736738
| GGML_SYCL_DISABLE_DNN | 0 (default) or 1 | Disable running computations through oneDNN and always use oneMKL. |
737739
| ZES_ENABLE_SYSMAN | 0 (default) or 1 | Support to get free memory of GPU by sycl::aspect::ext_intel_free_memory.<br>Recommended to use when --split-mode = layer |
738-
| UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS | 0 (default) or 1 | Support malloc device memory more than 4GB.|
740+
| UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS | 0 (default) or 1 | Allow SYCL/Unified Runtime Level Zero device allocations larger than 4 GiB. llama.cpp's direct Level Zero allocation path requests the relaxed maximum-size limit itself when GGML_SYCL_ENABLE_LEVEL_ZERO=1. |
739741

740742
## Compile-time Flags
741743

@@ -819,7 +821,7 @@ Pass these via `CXXFLAGS` or add a one-off `#define` to enable a flag on the spo
819821

820822
- `ggml_backend_sycl_buffer_type_alloc_buffer: can't allocate 5000000000 Bytes of memory on device`
821823

822-
You need to enable to support 4GB memory malloc by:
824+
With the default `GGML_SYCL_ENABLE_LEVEL_ZERO=1`, llama.cpp requests Level Zero's relaxed maximum-size allocation limit directly. If Level Zero support is disabled at build time or runtime and the allocation goes through SYCL/Unified Runtime instead, enable support for allocations larger than 4 GiB by:
823825
```
824826
export UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1
825827
set UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1

ggml/CMakeLists.txt

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -249,6 +249,7 @@ option(GGML_SYCL "ggml: use SYCL"
249249
option(GGML_SYCL_F16 "ggml: use 16 bit floats for sycl calculations" OFF)
250250
option(GGML_SYCL_GRAPH "ggml: enable graphs in the SYCL backend" ON)
251251
option(GGML_SYCL_HOST_MEM_FALLBACK "ggml: allow host memory fallback in SYCL reorder (requires kernel 6.8+)" ON)
252+
option(GGML_SYCL_SUPPORT_LEVEL_ZERO "ggml: use Level Zero API in SYCL backend" ON)
252253
option(GGML_SYCL_DNN "ggml: enable oneDNN in the SYCL backend" ON)
253254
set (GGML_SYCL_TARGET "INTEL" CACHE STRING
254255
"ggml: sycl target device")

ggml/src/ggml-sycl/CMakeLists.txt

Lines changed: 29 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -39,6 +39,18 @@ if (WIN32)
3939
set(CMAKE_CXX_COMPILER "icx")
4040
set(CMAKE_CXX_COMPILER_ID "IntelLLVM")
4141
endif()
42+
# Level Zero SDK path for Windows (only when GGML_SYCL_SUPPORT_LEVEL_ZERO is enabled)
43+
if(GGML_SYCL_SUPPORT_LEVEL_ZERO)
44+
if(DEFINED ENV{LEVEL_ZERO_V1_SDK_PATH})
45+
set(LEVEL_ZERO_V1_SDK_PATH $ENV{LEVEL_ZERO_V1_SDK_PATH})
46+
if(EXISTS "${LEVEL_ZERO_V1_SDK_PATH}")
47+
target_include_directories(ggml-sycl PRIVATE "${LEVEL_ZERO_V1_SDK_PATH}/include")
48+
set(LEVEL_ZERO_V1_SDK_LIB_PATH "${LEVEL_ZERO_V1_SDK_PATH}/lib")
49+
else()
50+
message(WARNING "LEVEL_ZERO_V1_SDK_PATH set but folder not found: ${LEVEL_ZERO_V1_SDK_PATH}")
51+
endif()
52+
endif()
53+
endif()
4254
endif()
4355

4456
macro(detect_and_find_package package_name)
@@ -93,6 +105,23 @@ endif()
93105

94106
target_compile_options(ggml-sycl PRIVATE "-Wno-narrowing")
95107

108+
message(STATUS "GGML_SYCL_SUPPORT_LEVEL_ZERO ${GGML_SYCL_SUPPORT_LEVEL_ZERO}")
109+
if (GGML_SYCL_SUPPORT_LEVEL_ZERO)
110+
# Link against Level Zero loader for direct device memory allocation.
111+
# Avoids sycl::malloc_device triggering DMA-buf/TTM system RAM staging
112+
# in the xe kernel driver during multi-GPU inference.
113+
find_path(LEVEL_ZERO_INCLUDE_DIR level_zero/ze_api.h HINTS ${ONEAPI_ROOT}/include ${LEVEL_ZERO_V1_SDK_PATH}/include)
114+
find_library(ZE_LOADER_LIB ze_loader HINTS ${ONEAPI_ROOT}/lib ${LEVEL_ZERO_V1_SDK_LIB_PATH} ENV LD_LIBRARY_PATH)
115+
if(ZE_LOADER_LIB AND LEVEL_ZERO_INCLUDE_DIR)
116+
target_link_libraries(ggml-sycl PRIVATE ${ZE_LOADER_LIB})
117+
target_compile_definitions(ggml-sycl PRIVATE GGML_SYCL_SUPPORT_LEVEL_ZERO)
118+
message(STATUS "Level Zero loader found: ${ZE_LOADER_LIB}")
119+
message(STATUS "Level Zero headers found: ${LEVEL_ZERO_INCLUDE_DIR}")
120+
else()
121+
message(WARNING "Level Zero loader or headers not found, Level Zero support disabled")
122+
endif()
123+
endif()
124+
96125
# Link against oneDNN
97126
set(GGML_SYCL_DNNL 0)
98127
if(GGML_SYCL_DNN)

ggml/src/ggml-sycl/common.cpp

Lines changed: 74 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,10 @@
1111
//
1212

1313
#include "common.hpp"
14+
#include <sycl/backend.hpp>
15+
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
16+
#include <level_zero/ze_api.h>
17+
#endif
1418

1519
#include "ggml-backend-impl.h"
1620
#include "ggml-impl.h"
@@ -55,6 +59,20 @@ bool gpu_has_xmx(sycl::device &dev) {
5559
return dev.has(sycl::aspect::ext_intel_matrix);
5660
}
5761

62+
static int ggml_sycl_get_env(const char *env_name, int default_val) {
63+
char *user_device_string = getenv(env_name);
64+
int user_number = default_val;
65+
66+
unsigned n;
67+
if (user_device_string != NULL &&
68+
sscanf(user_device_string, " %u", &n) == 1) {
69+
user_number = (int)n;
70+
} else {
71+
user_number = default_val;
72+
}
73+
return user_number;
74+
}
75+
5876
int64_t downsample_sycl_global_range(int64_t accumulate_block_num, int64_t block_size) {
5977
const int64_t max_range = std::numeric_limits<int>::max();
6078
int64_t sycl_down_blk_size = block_size;
@@ -66,6 +84,61 @@ int64_t downsample_sycl_global_range(int64_t accumulate_block_num, int64_t block
6684
return sycl_down_blk_size;
6785
}
6886

87+
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
88+
static bool ggml_sycl_use_level_zero_device_alloc(sycl::queue &q) {
89+
return ggml_sycl_get_env("GGML_SYCL_ENABLE_LEVEL_ZERO", 1) &&
90+
q.get_device().is_gpu() &&
91+
q.get_backend() == sycl::backend::ext_oneapi_level_zero;
92+
}
93+
#endif
94+
95+
// Use Level Zero zeMemAllocDevice to avoid sycl::malloc_device triggering
96+
// DMA-buf/TTM system RAM staging in the xe kernel driver during multi-GPU inference.
97+
// The decision is made from the queue and runtime env because large buffers can be
98+
// allocated before ggml_check_sycl() initializes g_ggml_sycl_enable_level_zero.
99+
void * ggml_sycl_malloc_device(size_t size, sycl::queue &q) {
100+
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
101+
if (ggml_sycl_use_level_zero_device_alloc(q)) {
102+
void *ptr = nullptr;
103+
auto ze_ctx = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q.get_context());
104+
auto ze_dev = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q.get_device());
105+
#ifdef ZE_RELAXED_ALLOCATION_LIMITS_EXP_NAME
106+
ze_relaxed_allocation_limits_exp_desc_t relaxed_desc = {
107+
ZE_STRUCTURE_TYPE_RELAXED_ALLOCATION_LIMITS_EXP_DESC,
108+
nullptr,
109+
ZE_RELAXED_ALLOCATION_LIMITS_EXP_FLAG_MAX_SIZE,
110+
};
111+
ze_device_mem_alloc_desc_t alloc_desc = {
112+
ZE_STRUCTURE_TYPE_DEVICE_MEM_ALLOC_DESC,
113+
&relaxed_desc,
114+
0,
115+
0,
116+
};
117+
#else
118+
ze_device_mem_alloc_desc_t alloc_desc = {ZE_STRUCTURE_TYPE_DEVICE_MEM_ALLOC_DESC, nullptr, 0, 0};
119+
#endif
120+
ze_result_t r = zeMemAllocDevice(ze_ctx, &alloc_desc, size, 64, ze_dev, &ptr);
121+
if (r == ZE_RESULT_SUCCESS && ptr) {
122+
return ptr;
123+
}
124+
return nullptr;
125+
}
126+
#endif
127+
return sycl::malloc_device(size, q);
128+
}
129+
130+
void ggml_sycl_free_device(void *ptr, sycl::queue &q) {
131+
if (!ptr) return;
132+
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
133+
if (ggml_sycl_use_level_zero_device_alloc(q)) {
134+
auto ze_ctx = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q.get_context());
135+
zeMemFree(ze_ctx, ptr);
136+
return;
137+
}
138+
#endif
139+
SYCL_CHECK(CHECK_TRY_ERROR(sycl::free(ptr, q)));
140+
}
141+
69142
void release_extra_gpu(ggml_tensor_extra_gpu * extra, std::vector<queue_ptr> streams) {
70143
for (int i = 0; i < ggml_sycl_info().device_count; ++i) {
71144
for (int64_t is = 0; is < GGML_SYCL_MAX_STREAMS; ++is) {
@@ -75,8 +148,7 @@ void release_extra_gpu(ggml_tensor_extra_gpu * extra, std::vector<queue_ptr> str
75148
}
76149
if (extra->data_device[i] != nullptr && streams.size()>0) {
77150
ggml_sycl_set_device(i);
78-
SYCL_CHECK(
79-
CHECK_TRY_ERROR(sycl::free(extra->data_device[i], *(streams[i]))));
151+
SYCL_CHECK(CHECK_TRY_ERROR(ggml_sycl_free_device(extra->data_device[i], *(streams[i]))));
80152
}
81153
}
82154
delete extra;

ggml/src/ggml-sycl/common.hpp

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -310,6 +310,10 @@ struct ggml_tensor_extra_gpu {
310310
optimize_feature optimized_feature;
311311
};
312312

313+
extern int g_ggml_sycl_enable_level_zero;
314+
void * ggml_sycl_malloc_device(size_t size, sycl::queue &q);
315+
void ggml_sycl_free_device(void *ptr, sycl::queue &q);
316+
313317
void release_extra_gpu(ggml_tensor_extra_gpu * extra, std::vector<queue_ptr> streams={});
314318

315319
namespace sycl_ex = sycl::ext::oneapi::experimental;

0 commit comments

Comments
 (0)