Skip to content

Commit 42f8358

Browse files
committed
support flash-attention for fp32/fp16/Q4/Q5/Q8
1 parent a8b192b commit 42f8358

65 files changed

Lines changed: 20090 additions & 8593 deletions

File tree

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

docs/backend/SYCL.md

Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,7 @@
99
- [Linux](#linux)
1010
- [Windows](#windows)
1111
- [Environment Variable](#environment-variable)
12+
- [Design Rule](#design-rule)
1213
- [Known Issue](#known-issues)
1314
- [Q&A](#qa)
1415
- [TODO](#todo)
@@ -41,6 +42,9 @@ The following releases are verified and recommended:
4142

4243
## News
4344

45+
- 2026.03
46+
- Support Flash-Attention: less memory usage, performance impact depends on LLM.
47+
4448
- 2026.02
4549
- Remove support for Nvidia & AMD GPU, because the oneAPI plugin for Nvidia & AMD GPU is unavailable: download/installation channels are out of work. User can't build up the software for Nvidia & AMD GPU.
4650

@@ -685,13 +689,36 @@ use 1 SYCL GPUs: [0] with Max compute units:512
685689
| Name | Value | Function |
686690
|-------------------|------------------|---------------------------------------------------------------------------------------------------------------------------|
687691
| GGML_SYCL_DEBUG | 0 (default) or 1 | Enable log function by macro: GGML_SYCL_DEBUG |
692+
| GGML_SYCL_ENABLE_FLASH_ATTN | 1 (default) or 0| Enable Flash-Attention. It can reduce memory usage. The performance impact depends on the LLM.|
688693
| GGML_SYCL_DISABLE_OPT | 0 (default) or 1 | Disable optimize features for Intel GPUs. (Recommended to 1 for intel devices older than Gen 10) |
689694
| GGML_SYCL_DISABLE_GRAPH | 0 or 1 (default) | Disable running computations through SYCL Graphs feature. Disabled by default because SYCL Graph is still on development, no better performance. |
690695
| GGML_SYCL_DISABLE_DNN | 0 (default) or 1 | Disable running computations through oneDNN and always use oneMKL. |
691696
| ZES_ENABLE_SYSMAN | 0 (default) or 1 | Support to get free memory of GPU by sycl::aspect::ext_intel_free_memory.<br>Recommended to use when --split-mode = layer |
692697
| UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS | 0 (default) or 1 | Support malloc device memory more than 4GB.|
693698

699+
## Design Rule
700+
701+
- Open to all contributors.
702+
703+
- All code change should be useful to user:
704+
- Fix bug.
705+
- Add new function.
706+
- Improve the performance/usage.
707+
- Make code be easy to maintain.
708+
- ...
709+
710+
- Don't accept the codes of following cases:
711+
- Break legacy function.
712+
- Reduce the performance of legacy case in default.
713+
- Not completed work/the functionality cannot be demonstrated.
714+
715+
- Encourage to use environment variable to control features to be opened/closed.
716+
- User can evaluate the feature without rebuild the code.
717+
- Recommend the best features to user by setting them be opened as default.
718+
719+
- Design the code based on the published official releases of oneAPI packages: compiler, library, driver, OS kernel.
694720

721+
- Developers need to maintain the code they submit.
695722

696723
## Known Issues
697724

docs/ops.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -45,7 +45,7 @@ Legend:
4545
| EXP |||| 🟡 | 🟡 ||| 🟡 ||||
4646
| EXPM1 |||| 🟡 | 🟡 |||||||
4747
| FILL ||||||||||||
48-
| FLASH_ATTN_EXT || 🟡 || 🟡 | 🟡 | 🟡 | | 🟡 | 🟡 |||
48+
| FLASH_ATTN_EXT || 🟡 || 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 |||
4949
| FLOOR |||| 🟡 ||| 🟡 | 🟡 ||||
5050
| GATED_LINEAR_ATTN ||||||||||||
5151
| GEGLU ||||| 🟡 ||| 🟡 ||||

docs/ops/SYCL.csv

Lines changed: 15123 additions & 8565 deletions
Large diffs are not rendered by default.

ggml/src/ggml-sycl/CMakeLists.txt

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -25,6 +25,11 @@ ggml_add_backend_library(ggml-sycl
2525

2626
file(GLOB GGML_HEADERS_SYCL "*.hpp")
2727
file(GLOB GGML_SOURCES_SYCL "*.cpp")
28+
file(GLOB SRCS "template-instances/fattn-tile*.cpp")
29+
list(APPEND GGML_SOURCES_SYCL ${SRCS})
30+
file(GLOB SRCS "template-instances/fattn-vec*.cpp")
31+
list(APPEND GGML_SOURCES_SYCL ${SRCS})
32+
2833
target_sources(ggml-sycl PRIVATE ${GGML_HEADERS_SYCL} ${GGML_SOURCES_SYCL})
2934

3035
if (WIN32)
@@ -145,6 +150,7 @@ else()
145150
endif()
146151

147152
if (GGML_SYCL_GRAPH)
153+
message(STATUS "find GGML_SYCL_GRAPH")
148154
target_compile_definitions(ggml-sycl PRIVATE GGML_SYCL_GRAPH)
149155
endif()
150156

ggml/src/ggml-sycl/backend.hpp

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,7 @@
2323
#include "dequantize.hpp"
2424
#include "dmmv.hpp"
2525
#include "element_wise.hpp"
26+
#include "fattn.hpp"
2627
#include "gla.hpp"
2728
#include "im2col.hpp"
2829
#include "mmq.hpp"

0 commit comments

Comments
 (0)