Commit 4fc4ec5

opencl: allow loading precompiled binary kernels from library (ggml-org#23042)
* opencl: allow loading binary kernel
* opencl: add libdl.h
* ggml-backend-dl is in ggml, which depends backend libs, thus ggml-opencl cannot depend on ggml-backend-dl
* add libdl.h to break cyclic dep
* opencl: allow loading bin kernel lib
* opencl: load `gemm_moe_mxfp4_f32_ns` from kernel lib if available
* opencl: load q8_0 gemm from kernel lib
* opencl: load q4_0 moe gemm from kernel lib
* opencl: load q4_1 moe gemm from kernel lib
* opencl: load q4_k moe gemm from kernel lib
* opencl: always declare `get_adreno_bin_kernel_func_t`
* opencl: rephrase message
* opencl: fix for rebase
* opencl: update doc
1 parent a6647b1 commit 4fc4ec5

4 files changed

Lines changed: 448 additions & 47 deletions

docs/backend/OPENCL.md

Lines changed: 51 additions & 39 deletions
@@ -1,16 +1,26 @@
 # llama.cpp for OpenCL
 
-- [Background](#background)
-- [OS](#os)
-- [Hardware](#hardware)
-- [DataType Supports](#datatype-supports)
-- [Model Preparation](#model-preparation)
-- [CMake Options](#cmake-options)
-- [Android](#android)
-- [Windows 11 Arm64](#windows-11-arm64)
-- [Linux](#Linux)
-- [Known Issue](#known-issues)
-- [TODO](#todo)
+- [llama.cpp for OpenCL](#llamacpp-for-opencl)
+- [Background](#background)
+- [Llama.cpp + OpenCL](#llamacpp--opencl)
+- [OS](#os)
+- [Hardware](#hardware)
+- [Adreno GPU](#adreno-gpu)
+- [DataType Supports](#datatype-supports)
+- [Model Preparation](#model-preparation)
+- [Binary Kernel Library](#binary-kernel-library)
+- [CMake Options](#cmake-options)
+- [Android](#android)
+- [I. Setup Environment](#i-setup-environment)
+- [II. Build llama.cpp](#ii-build-llamacpp)
+- [Windows 11 Arm64](#windows-11-arm64)
+- [I. Setup Environment](#i-setup-environment-1)
+- [II. Build llama.cpp](#ii-build-llamacpp-1)
+- [Linux](#linux)
+- [I. Setup Environment](#i-setup-environment-2)
+- [II. Build llama.cpp](#ii-build-llamacpp-2)
+- [Known Issues](#known-issues)
+- [TODO](#todo)
 
 ## Background
 
@@ -34,11 +44,13 @@ The llama.cpp OpenCL backend is designed to enable llama.cpp on **Qualcomm Adren
 
 **Verified devices**
 
-| Adreno GPU | Status |
-|:------------------------------------:|:-------:|
-| Adreno 750 (Snapdragon 8 Gen 3) | Support |
-| Adreno 830 (Snapdragon 8 Elite) | Support |
-| Adreno X85 (Snapdragon X Elite) | Support |
+| Adreno GPU | Status |
+|:-------------------------------------:|:-------:|
+| Adreno 750 (Snapdragon 8 Gen 3) | Support |
+| Adreno 830 (Snapdragon 8 Elite) | Support |
+| Adreno 840 (Snapdragon 8 Elite Gen 5) | Support |
+| Adreno X1-85 (Snapdragon X Elite) | Support |
+| Adreno X2-90 (Snapdragon X2 Elite) | Support |
 
 > A6x GPUs with a recent driver and compiler are supported; they are usually found in IoT platforms.
 However, A6x GPUs in phones are likely not supported due to the outdated driver and compiler.
@@ -47,42 +59,43 @@ However, A6x GPUs in phones are likely not supported due to the outdated driver
 
 | DataType | Status |
 |:----------------------:|:--------------------------:|
+| Q1_0 | Support |
 | Q4_0 | Support |
-| Q6_K | Support, but not optimized |
+| Q4_1 | Support |
+| Q5_0 | Support |
+| Q5_1 | Support |
 | Q8_0 | Support |
+| Q4_K | Support |
+| Q5_K | Support |
+| Q6_K | Support |
 | MXFP4 | Support |
+| IQ4_NL | Support |
 
 ## Model Preparation
 
-You can refer to the general [llama-quantize tool](/tools/quantize/README.md) for steps to convert a model in Hugging Face safetensor format to GGUF with quantization.
+Since common quantizations are supported now, it is recommended to download GGUF models directly from Hugging Face.
 
-Currently we support `Q4_0` quantization and have optimized for it. To achieve best performance on Adreno GPU, add `--pure` to `llama-quantize` (i.e., make all weights in `Q4_0`). For example,
+## Binary Kernel Library
 
-```sh
-./llama-quantize --pure ggml-model-qwen2.5-3b-f16.gguf ggml-model-qwen-3b-Q4_0.gguf Q4_0
-```
-
-Since `Q6_K` is also supported, `Q4_0` quantization without `--pure` will also work. However, the performance will be worse compared to pure `Q4_0` quantization.
-
-### `MXFP4` MoE Models
-
-OpenAI gpt-oss models are MoE models in `MXFP4`. The quantized model will be in `MXFP4_MOE`, a mixture of `MXFP4` and `Q8_0`.
-For this quantization, there is no need to specify `--pure`.
-For gpt-oss-20b model, you can directly [download](https://huggingface.co/ggml-org/gpt-oss-20b-GGUF) the quantized GGUF file in `MXFP4_MOE` from Hugging Face.
+A prebuilt binary kernel library has been introduced for Adreno GPUs.
+It currently targets X2 GPUs (X2-90, X2-85 and X2-45) found in Snapdragon X2 SoC.
+The library currently contains kernels for MUL_MAT_ID with Q4_0, Q4_1, Q4_K, MXFP4.
+The library must be manually downloaded from https://softwarecenter.qualcomm.com/catalog/item/Adreno_Kernel_Library_GGML.
 
-Although it is possible to quantize gpt-oss-20b model in pure `Q4_0` (all weights in `Q4_0`), it is not recommended since `MXFP4` has been optimized for MoE while `Q4_0` is not. In addition, accuracy should degrade with such pure `Q4_0` quantization.
-Hence, using the default `MXFP4_MOE` quantization (see the link above) is recommended for this model.
+To allow using the kernel library, add `-DGGML_OPENCL_USE_ADRENO_BIN_KERNELS=ON` when configuring with CMake.
+Then, extract `adreno-opencl-kernels.dll` from the zip file downloaded from the above URL and put it alongside the executables.
+If kernels compatible with the current GPU are found in the library, they will be loaded and used.
 
-> Note that the `Q4_0` model found [here](https://huggingface.co/unsloth/gpt-oss-20b-GGUF/blob/main/gpt-oss-20b-Q4_0.gguf) is a mixture of `Q4_0`, `Q8_0` and `MXFP4` and gives better performance than `MXFP4_MOE` quantization.
 
 ## CMake Options
 
 The OpenCL backend has the following CMake options that control the behavior of the backend.
 
-| CMake options | Default value | Description |
-|:---------------------------------:|:--------------:|:------------------------------------------|
-| `GGML_OPENCL_EMBED_KERNELS` | `ON` | Embed OpenCL kernels into the executable. |
-| `GGML_OPENCL_USE_ADRENO_KERNELS` | `ON` | Use kernels optimized for Adreno. |
+| CMake options | Default value | Description |
+|:------------------------------------:|:--------------:|:------------------------------------------|
+| `GGML_OPENCL_EMBED_KERNELS` | `ON` | Embed OpenCL kernels into the executable. |
+| `GGML_OPENCL_USE_ADRENO_KERNELS` | `ON` | Use kernels optimized for Adreno. |
+| `GGML_OPENCL_USE_ADRENO_BIN_KERNELS` | `OFF` | Allow using binary kernel lib for Adreno. |
 
 ## Android
 
@@ -277,6 +290,5 @@ ninja
 
 ## TODO
 
-- Optimization for Q6_K
-- Support and optimization for Q4_K
 - Improve flash attention
+- Improve OpenCL C kernels performance

ggml/src/ggml-opencl/CMakeLists.txt

Lines changed: 5 additions & 0 deletions
@@ -31,6 +31,11 @@ if (GGML_OPENCL_EMBED_KERNELS)
     target_include_directories(${TARGET_NAME} PRIVATE "${CMAKE_CURRENT_BINARY_DIR}/autogenerated")
 endif ()
 
+if (GGML_OPENCL_USE_ADRENO_BIN_KERNELS)
+    message(STATUS "OpenCL will use precompiled binary kernels for Adreno (improved performance on some platforms)")
+    add_compile_definitions(GGML_OPENCL_USE_ADRENO_BIN_KERNELS)
+endif ()
+
 function(ggml_opencl_add_kernel KNAME)
     set(KERN_HDR ${CMAKE_CURRENT_BINARY_DIR}/autogenerated/${KNAME}.cl.h)
     set(KERN_SRC ${CMAKE_CURRENT_SOURCE_DIR}/kernels/${KNAME}.cl)
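
The CMake change only adds the `GGML_OPENCL_USE_ADRENO_BIN_KERNELS` compile definition. The runtime behavior is described in the docs as "if kernels compatible with the current GPU are found in the library, they will be loaded and used". A minimal C++ sketch of that load-if-available pattern is given here for illustration. It is not the code from this commit: `pick_kernel_source`, `kernel_source`, the small wrapper helpers, and whatever symbol name the caller passes are hypothetical, and the real lookup in ggml-opencl (which the commit message associates with `get_adreno_bin_kernel_func_t`) may be structured differently. Only the platform calls (`LoadLibraryA`/`GetProcAddress` on Windows, `dlopen`/`dlsym` elsewhere) are existing APIs.

```cpp
// Illustrative sketch only. All names defined here are hypothetical and are
// not taken from the llama.cpp sources.
#include <cassert>
#include <string>

#ifdef _WIN32
#  include <windows.h>
using lib_handle = HMODULE;
static lib_handle open_lib(const char * path) { return LoadLibraryA(path); }
static void * find_symbol(lib_handle h, const char * name) {
    return reinterpret_cast<void *>(GetProcAddress(h, name));
}
static void close_lib(lib_handle h) { FreeLibrary(h); }
#else
#  include <dlfcn.h>
using lib_handle = void *;
static lib_handle open_lib(const char * path) { return dlopen(path, RTLD_NOW | RTLD_LOCAL); }
static void * find_symbol(lib_handle h, const char * name) { return dlsym(h, name); }
static void close_lib(lib_handle h) { dlclose(h); }
#endif

// Where a kernel would come from in this sketch.
enum class kernel_source { binary_library, opencl_c };

// Try the optional binary kernel library first. If the library cannot be
// opened, or it does not export `symbol`, fall back to the OpenCL C kernels.
kernel_source pick_kernel_source(const std::string & lib_path, const char * symbol) {
    assert(symbol != nullptr);

    lib_handle h = open_lib(lib_path.c_str());
    if (!h) {
        return kernel_source::opencl_c; // library not present next to the executable
    }

    const bool found = find_symbol(h, symbol) != nullptr;

    // Real code would keep the handle open for as long as the looked-up
    // function is in use; this sketch only reports the decision.
    close_lib(h);

    return found ? kernel_source::binary_library : kernel_source::opencl_c;
}
```

With this sketch, a call such as `pick_kernel_source("adreno-opencl-kernels.dll", "some_exported_symbol")` on a machine without the library reports `kernel_source::opencl_c`, which corresponds to the documented fallback to the regular kernels. On Linux systems with an older glibc, linking may require `-ldl`.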
