| FP32 | Support | Full precision floating point |
| BF16 | Support | BFloat16 (best performance on Zen 4/Zen 5) |
+| Q8_0 | Support | 8-bit quantized weights via [dynamic quantization](https://github.com/amd/ZenDNN/blob/main/docs/operator/lowoha_matmul_operator.md) |
*Notes:*
- **BF16** provides the best performance on Zen 4 and Zen 5 EPYC™ processors (Genoa, Turin).
+- **Q8_0** is available for quantized model weights because ZenDNN supports dynamic quantization through the [LowOHA MatMul operator](https://github.com/amd/ZenDNN/blob/main/docs/operator/lowoha_matmul_operator.md).
+- Other quantization formats fall back to the standard CPU backend unless the ZenDNN backend explicitly supports them.
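
For readers unfamiliar with the format, the following is a minimal Python sketch of block-wise 8-bit quantization in the style of ggml's Q8_0, assuming blocks of 32 values that share a single scale. It is illustrative only: the function names are hypothetical, the scale is kept as a Python float rather than a 16-bit float, and it does not represent how ZenDNN or the LowOHA MatMul operator implements dynamic quantization.

```python
BLOCK_SIZE = 32  # assumption: Q8_0 groups values into blocks of 32, as in ggml


def quantize_q8_0_block(values):
    """Quantize one block of floats to (scale, list of int8-range integers).

    Hypothetical helper for illustration; not part of any ZenDNN API.
    """
    if len(values) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} values, got {len(values)}")

    # The largest magnitude in the block maps to +/-127.
    amax = max(abs(v) for v in values)
    scale = amax / 127.0
    if scale == 0.0:
        # An all-zero block needs no scaling.
        return 0.0, [0] * BLOCK_SIZE

    # Python's round() rounds exact halves to the nearest even integer,
    # which can differ from a C roundf() on exact .5 values only.
    quants = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, quants


def dequantize_q8_0_block(scale, quants):
    """Reconstruct approximate floats from a quantized block."""
    return [scale * q for q in quants]
```

Each reconstructed value differs from the original by at most half a quantization step (`scale / 2`), so blocks with a small dynamic range are represented more accurately than blocks containing one large outlier.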
For more details on available algorithms, see the [ZenDNN MatMul Algorithm Documentation](https://github.com/amd/ZenDNN/blob/a18adf8c605fb5f5e52cefd7eda08a7b18febbaf/docs/runtime_env.md#algorithm-details).
+### Q8_0 Performance Notes
+
+Q8_0 support mainly benefits prompt processing (prefill) workloads, where large matrix multiplications dominate execution. Token generation performance may remain close to that of the standard CPU backend, depending on the model, batch size, thread count, and CPU topology.
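
One way to see why prefill and token generation behave differently is to compare how much arithmetic each performs per byte of weights read. The sketch below is a rough back-of-the-envelope model, not a measurement: it assumes a single `n_in` x `n_out` weight matrix stored in ggml's Q8_0 layout (34 bytes per 32 weights), counts only weight traffic, and uses hypothetical function names.

```python
Q8_0_BLOCK_VALUES = 32  # weights per Q8_0 block (ggml layout)
Q8_0_BLOCK_BYTES = 34   # 32 int8 values plus one 16-bit scale


def matmul_flops(n_tokens, n_in, n_out):
    """Multiply-add count (as 2 FLOPs each) for an (n_tokens x n_in) @ (n_in x n_out) product."""
    return 2 * n_tokens * n_in * n_out


def q8_0_weight_bytes(n_in, n_out):
    """Bytes occupied by the weight matrix; assumes n_in * n_out is a multiple of 32."""
    return n_in * n_out * Q8_0_BLOCK_BYTES / Q8_0_BLOCK_VALUES


def flops_per_weight_byte(n_tokens, n_in, n_out):
    """Arithmetic intensity with respect to weight traffic only (activations ignored)."""
    return matmul_flops(n_tokens, n_in, n_out) / q8_0_weight_bytes(n_in, n_out)


# Example: a 4096 x 4096 layer with a 512-token prompt versus one generated token.
prefill = flops_per_weight_byte(512, 4096, 4096)
decode = flops_per_weight_byte(1, 4096, 4096)
print(f"prefill: {prefill:.1f} FLOPs/byte, decode: {decode:.2f} FLOPs/byte")
```

Under these assumptions, the 512-token prompt performs 512 times as many FLOPs per byte of weights as single-token generation. That is consistent with faster matrix multiplication helping prefill the most, while generation is more likely to be limited by how quickly weights can be read from memory.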
+
### Profiling and Debugging
For detailed profiling and logging options, refer to the [ZenDNN Logging Documentation](https://github.com/amd/ZenDNN/blob/a18adf8c605fb5f5e52cefd7eda08a7b18febbaf/docs/logging.md).
@@ -184,6 +200,7 @@ For detailed profiling and logging options, refer to the [ZenDNN Logging Documen
- **Limited operation support**: Currently, matrix multiplication (MUL_MAT) and expert-based matrix multiplication (MUL_MAT_ID) are accelerated via ZenDNN. Other operations fall back to the standard CPU backend. Future updates may expand the set of supported operations.
- **BF16 support**: BF16 operations require the AMD Zen 4 or Zen 5 architecture (EPYC 9004/9005 series). On older CPUs, operations use FP32.
+- **Q8_0 support scope**: Q8_0 acceleration is available for supported matrix multiplication paths. Other quantization formats still fall back to the standard CPU backend.
- **NUMA awareness**: On multi-socket systems, manual NUMA binding may be required for optimal performance.
## Q&A
@@ -202,7 +219,7 @@ A: ZenDNN is optimized specifically for AMD processors. While it may work on oth
**Q: Does ZenDNN support quantized models?**
-A: Currently, ZenDNN primarily supports FP32 and BF16 data types. Quantized model support is not available at this time.
+A: Yes. The ZenDNN backend accelerates Q8_0 quantized models on supported matrix multiplication operations, in addition to FP32 and BF16. Other quantization formats may fall back to the standard CPU backend unless the ZenDNN backend explicitly supports them.
**Q: Why is my inference not faster with ZenDNN?**