Commit e6123e2

truecoder34 and plotnikov.v10 authored
docs : update ZenDNN docs for Q8 support (ggml-org#23791)

* docs zendnn added information about Q8 support
* docs zendnn rm unnecessary data
* docs update, links to ZenDNN docs provided
* docs zenDNN update: clarified explanation
* docs zenDNN update: one more explanation clarified

---------

Co-authored-by: plotnikov.v10 <plotnikov.v10@wb.ru>
1 parent 22cadc1 commit e6123e2

2 files changed

Lines changed: 19 additions & 1 deletion


docs/backend/ZenDNN.md

Lines changed: 18 additions & 1 deletion
````diff
@@ -72,10 +72,13 @@ The ZenDNN backend accelerates **matrix multiplication (MUL_MAT)** and **expert-
 |:----------------------:|:-------:|:---------------------------------------------:|
 | FP32 | Support | Full precision floating point |
 | BF16 | Support | BFloat16 (best performance on Zen 4/Zen 5) |
+| Q8_0 | Support | 8-bit quantized weights via [dynamic quantization](https://github.com/amd/ZenDNN/blob/main/docs/operator/lowoha_matmul_operator.md) |

 *Notes:*

 - **BF16** provides best performance on Zen 4 and Zen 5 EPYC™ processors (Genoa, Turin).
+- **Q8_0** is available for quantized model weights since ZenDNN supports dynamic quantization [LowOHA MatMul operator](https://github.com/amd/ZenDNN/blob/main/docs/operator/lowoha_matmul_operator.md).
+- Other quantization formats fall back to the standard CPU backend unless explicitly supported by the ZenDNN backend.

 ## Linux

````
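The support table in the hunk above reduces to one rule: FP32, BF16, and Q8_0 matrix multiplications can go through ZenDNN, and other formats stay on the standard CPU backend. A minimal shell sketch of that rule follows. The function name `zendnn_path_for_type` is hypothetical (it is not part of llama.cpp or ZenDNN), `F32` is accepted as the ggml spelling of FP32, and the helper only classifies a type label; it does not inspect a GGUF file or query the backend.

```sh
#!/bin/sh
# Hypothetical helper: classify a weight type label according to the
# support table in docs/backend/ZenDNN.md as of this commit.
# It only compares strings; it does not read any model file.
zendnn_path_for_type() {
    # Normalize to upper case so "q8_0" and "Q8_0" are treated alike.
    type_label=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    case "$type_label" in
        F32|FP32|BF16|Q8_0) echo "zendnn" ;;        # listed as "Support" in the table
        *)                  echo "cpu-fallback" ;;  # anything not listed in the table
    esac
}

zendnn_path_for_type Q8_0     # prints "zendnn"
zendnn_path_for_type Q4_K_M   # prints "cpu-fallback"
```

Treating every unlisted label as a fallback mirrors the note that other formats use the CPU backend "unless explicitly supported"; if the backend gains more formats, the `case` list would need to be extended.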

````diff
@@ -140,6 +143,15 @@ Download LLaMA 3.1 8B Instruct BF16 model:
 huggingface-cli download meta-llama/Llama-3.1-8B-Instruct-GGUF --local-dir models/
 ```

+You can also use a Q8_0 GGUF model:
+
+```sh
+# Download a Q8_0 GGUF model from Hugging Face
+huggingface-cli download meta-llama/Llama-3.1-8B-Instruct-GGUF \
+Llama-3.1-8B-Instruct-Q8_0.gguf \
+--local-dir models/
+```
+
 #### 2. Start Server

 Run llama.cpp server with ZenDNN acceleration:
````
````diff
@@ -176,6 +188,10 @@ export ZENDNNL_MATMUL_ALGO=1 # Blocked AOCL DLP algo (recommended)

 For more details on available algorithms, see the [ZenDNN MatMul Algorithm Documentation](https://github.com/amd/ZenDNN/blob/a18adf8c605fb5f5e52cefd7eda08a7b18febbaf/docs/runtime_env.md#algorithm-details).

+### Q8_0 Performance Notes
+
+Q8_0 support is mainly beneficial for prompt processing / prefill workloads where large matrix multiplications dominate execution. Token generation performance may remain close to the standard CPU backend depending on the model, batch size, number of threads, and CPU topology.
+
 ### Profiling and Debugging

 For detailed profiling and logging options, refer to the [ZenDNN Logging Documentation](https://github.com/amd/ZenDNN/blob/a18adf8c605fb5f5e52cefd7eda08a7b18febbaf/docs/logging.md).
````
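The performance note added in the hunk above separates prefill from token generation, and llama.cpp's `llama-bench` tool can report the two separately. The sketch below is one way to check the claim on a given machine. The binary path, the 512/128 token sizes, and the thread default are assumptions, not values from this commit; the model path follows the Q8_0 download example, and `ZENDNNL_MATMUL_ALGO=1` is the setting the documentation recommends. The wrapper function `run_bench` is hypothetical and prints the command instead of running it when the binary or model is missing.

```sh
#!/bin/sh
# Sketch: measure prompt processing (prefill) and token generation separately.
# Assumed values: binary path, thread count, and the 512/128 test sizes.
BENCH=${BENCH:-./build/bin/llama-bench}
MODEL=${MODEL:-models/Llama-3.1-8B-Instruct-Q8_0.gguf}
THREADS=${THREADS:-$(nproc 2>/dev/null || echo 4)}

# Algorithm selection recommended by the ZenDNN backend documentation.
export ZENDNNL_MATMUL_ALGO=1

run_bench() {
    if [ -x "$BENCH" ] && [ -f "$MODEL" ]; then
        # -p sets the prefill test size, -n the generation test size;
        # llama-bench reports them as separate pp512 and tg128 rows.
        "$BENCH" -m "$MODEL" -p 512 -n 128 -t "$THREADS"
    else
        # Dry run: show the command when the binary or model is not present.
        echo "would run: $BENCH -m $MODEL -p 512 -n 128 -t $THREADS"
    fi
}

run_bench
```

Running the same command against a llama.cpp build without ZenDNN gives the baseline; per the note, the gap is expected mainly in the prefill row, while the generation row may stay close to the CPU backend.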
````diff
@@ -184,6 +200,7 @@ For detailed profiling and logging options, refer to the [ZenDNN Logging Documen

 - **Limited operation support**: Currently matrix multiplication (MUL_MAT) and expert-based matrix multiplication (MUL_MAT_ID) are accelerated via ZenDNN. Other operations fall back to the standard CPU backend. Future updates may expand supported operations.
 - **BF16 support**: BF16 operations require AMD Zen 4 or Zen 5 architecture (EPYC 9004/9005 series). On older CPUs, operations will use FP32.
+- **Q8_0 support scope**: Q8_0 acceleration is available for supported matrix multiplication paths. Other quantization formats still fall back to the standard CPU backend.
 - **NUMA awareness**: For multi-socket systems, manual NUMA binding may be required for optimal performance.

 ## Q&A
````
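The NUMA limitation listed in the hunk above mentions manual binding without showing one. A minimal sketch using the standard `numactl` utility follows. The node number, the `llama-server` binary path, and the model path are assumptions rather than values from this commit, and the wrapper function `run_pinned` is hypothetical; it prints the command instead of running it when `numactl`, the binary, or the model is missing.

```sh
#!/bin/sh
# Sketch: pin llama-server to a single NUMA node on a multi-socket system.
# Assumed values: node 0, binary path, and model path.
NODE=${NODE:-0}
SERVER=${SERVER:-./build/bin/llama-server}
MODEL=${MODEL:-models/Llama-3.1-8B-Instruct-Q8_0.gguf}

run_pinned() {
    if command -v numactl >/dev/null 2>&1 && [ -x "$SERVER" ] && [ -f "$MODEL" ]; then
        # --cpunodebind keeps threads on the node's CPUs,
        # --membind keeps allocations in the node's local memory.
        numactl --cpunodebind="$NODE" --membind="$NODE" "$SERVER" -m "$MODEL"
    else
        # Dry run: show the command when a prerequisite is missing.
        echo "would run: numactl --cpunodebind=$NODE --membind=$NODE $SERVER -m $MODEL"
    fi
}

run_pinned
```

Binding both CPUs and memory to the same node avoids cross-socket memory traffic; whether one node has enough cores and memory for the model depends on the machine.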
````diff
@@ -202,7 +219,7 @@ A: ZenDNN is optimized specifically for AMD processors. While it may work on oth

 **Q: Does ZenDNN support quantized models?**

-A: Currently, ZenDNN primarily supports FP32 and BF16 data types. Quantized model support is not available at this time.
+A: Yes. The ZenDNN backend supports Q8_0 quantized models for supported matrix multiplication operations. FP32 and BF16 are also supported. Other quantization formats may fall back to the standard CPU backend unless explicitly supported by the ZenDNN backend.

 **Q: Why is my inference not faster with ZenDNN?**

````

docs/build.md

Lines changed: 1 addition & 0 deletions
````diff
@@ -22,6 +22,7 @@ The following sections describe how to build with different backends and options
 * [HIP](#hip)
 * [Vulkan](#vulkan)
 * [CANN](#cann)
+* [ZenDNN](#zendnn)
 * [Arm® KleidiAI™](#arm-kleidiai)
 * [OpenCL](#opencl)
 * [Android](#android-1)
````
