Commit 865ab2b
[rocm-libraries] ROCm/rocm-libraries#6209 (commit 89c9f3e)
Improve the performance of qr_ks_vs_whole_k_prefetch pipeline
(#6209)
## About qr_ks_vs_whole_k_prefetch pipeline
This PR updates and enhances the qr_ks_vs_whole_k_prefetch pipeline to
improve performance on both MI300 and MI350 GPUs through better MFMA instruction
usage, transposed V-loading support, and an N0-loop implementation. The
pipeline targets scenarios where the number of workgroups is low,
enabling better CU occupancy by using smaller MTile sizes (kM0=64 vs
128) while prefetching entire K tiles.
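
The occupancy argument above can be illustrated with a rough back-of-the-envelope sketch. It assumes one workgroup per MTile-row slice of Q per (batch, head) pair, which this PR does not state explicitly; the attention shape and CU count below are hypothetical values chosen only for illustration.

```python
import math


def num_workgroups(batch, heads, seqlen_q, m_tile):
    """Workgroups launched, assuming one per m_tile-row slice of Q per (batch, head)."""
    return batch * heads * math.ceil(seqlen_q / m_tile)


def cu_utilization(workgroups, num_cus):
    """Fraction of CUs that receive at least one workgroup in the first wave."""
    return min(workgroups, num_cus) / num_cus


# Hypothetical shape and CU count (not taken from the PR).
batch, heads, seqlen_q, num_cus = 1, 16, 1024, 256

for m_tile in (128, 64):
    wgs = num_workgroups(batch, heads, seqlen_q, m_tile)
    print(f"kM0={m_tile}: {wgs} workgroups, {cu_utilization(wgs, num_cus):.0%} of CUs busy")
```

Under these assumptions, halving kM0 from 128 to 64 doubles the workgroup count from 128 to 256, which takes the hypothetical 256-CU device from half occupied to fully occupied.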
## Changes:
- Adds transposed V-loading support (qr_ks_vs_whole_k_prefetch_trload)
to avoid using shuffle instructions on MI350
- Implements N0-loop based Gemm0 to reduce tile window movement overhead
and eliminate `clear_tile` calls
- Adds full support for hdim96/hdim160 without padding requirements
- Updates MFMA instruction selection to ensure optimal choices for MI350
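
One possible reading of the N0-loop change, sketched in plain Python rather than ck_tile code: when Gemm0 iterates over head-dim (K) chunks, partial products are accumulated into a tile that must be zeroed first; when the whole K extent is already available and the loop runs over N0 column slices instead, each slice is a complete product and can be written directly. The function names below are hypothetical, and the mapping to `clear_tile` is an assumption rather than something the PR spells out.

```python
def matmul_nt(a, b):
    """Return a @ b^T for row-major lists: a is M x K, b is N x K."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]


def gemm0_k_loop(q, k, k_step):
    """Loop over head-dim chunks; partial results are accumulated."""
    m, n, hdim = len(q), len(k), len(q[0])
    s = [[0.0] * n for _ in range(m)]  # zero-init, the role clear_tile would play
    for k0 in range(0, hdim, k_step):
        q_part = [row[k0:k0 + k_step] for row in q]
        k_part = [row[k0:k0 + k_step] for row in k]
        part = matmul_nt(q_part, k_part)
        for i in range(m):
            for j in range(n):
                s[i][j] += part[i][j]
    return s


def gemm0_n_loop(q, k, n_step):
    """Loop over N0 column slices; each slice uses the full head dim, so it is written once."""
    s = [[] for _ in q]
    for n0 in range(0, len(k), n_step):
        part = matmul_nt(q, k[n0:n0 + n_step])  # complete result for these columns
        for row, part_row in zip(s, part):
            row.extend(part_row)
    return s
```

Both loops produce the same S = Q·Kᵀ tile; the second never needs a zeroed accumulator because no slice is revisited.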
## Performance results
1. For attention shapes that lead to kM0=64,
`qr_ks_vs_async_whole_k_prefetch_trload` performs much better than
`qr_ks_vs_async_trload` on the same case (execution time `41.02ms`
for whole_k_prefetch_trload vs `58.50ms` for qr_ks_vs_async_trload), and
`qr_ks_vs_async_whole_k_prefetch_trload` also clearly outperforms
the recently tuned `qr_ks_vs_async` on the same case
(execution time `41.02ms` for whole_k_prefetch_trload vs `47.60ms` for
qr_ks_vs_async)
2. Also on MI300, for attention shapes that lead to kM0=64,
`qr_ks_vs_async_whole_k_prefetch` performs much better than
`qr_ks_vs_async` (which is supposed to be highly efficient) on the
same case (execution time `64.50ms` for whole_k_prefetch vs `80.20ms` for
qr_ks_vs_async)
3. For attention shapes that lead to kM0=128,
`qr_ks_vs_async_whole_k_prefetch_trload` performs slightly better
than `qr_ks_vs_async` on MI350 (execution time `104.50ms` for
whole_k_prefetch_trload vs `106.50ms` for qr_ks_vs_async). The two
pipelines are on par on MI300
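
For a quick sense of scale, the reported execution times can be turned into speedup ratios. The sketch below only reuses the timings quoted in the list above; the row labels are shorthand introduced here for readability.

```python
def speedup(baseline_ms, new_ms):
    """How many times faster the new pipeline is than the baseline."""
    return baseline_ms / new_ms


# (label, baseline time in ms, whole_k_prefetch time in ms), copied from the results above.
results = [
    ("item 1, kM0=64, vs qr_ks_vs_async_trload", 58.50, 41.02),
    ("item 1, kM0=64, vs qr_ks_vs_async", 47.60, 41.02),
    ("item 2, kM0=64 on MI300, vs qr_ks_vs_async", 80.20, 64.50),
    ("item 3, kM0=128 on MI350, vs qr_ks_vs_async", 106.50, 104.50),
]

for label, baseline_ms, new_ms in results:
    print(f"{label}: {speedup(baseline_ms, new_ms):.2f}x")
```

The kM0=64 cases come out at roughly 1.16x to 1.43x, while the kM0=128 case on MI350 is about 1.02x, consistent with the "slightly better" wording.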
## Test/Verify
1. Use the ROCm xformers branch `test_whole_k_prefetch_n0loop` to
test/verify the qr_ks_vs_whole_k_prefetch pipeline, since this pipeline
cannot yet be used by the ck_tile fmha example
2. Use the following commands to build and test xformers
> ```bash
> git clone -b test_whole_k_prefetch_n0loop https://github.com/ROCm/xformers
> cd xformers
> git submodule update --init --recursive
> pip install --no-build-isolation -e ./
> pytest tests/test_mem_eff_attention.py::test_forward
> ```
3. Any script that runs on xformers can be used to evaluate the
qr_ks_vs_whole_k_prefetch pipeline. Use the following two environment
variables to switch between pipelines
> ```bash
> export FMHA_DISABLE_SPECIAL_TREATMENT=1  # to disable using FAV3 and the qr_ks_vs_async_trload pipeline
> export FMHA_ENABLE_ASYNC_PIPELINE=1      # to disable using the qr_ks_vs_async pipeline, for comparison
> ```
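
When comparing pipelines from a script, it can help to build the environment per run instead of exporting variables in the shell. The helper below is a minimal sketch: `pipeline_env` and `forward_test_command` are hypothetical names, and treating an unset variable as "default behaviour" is an assumption about how the xformers branch reads these variables.

```python
import os


def pipeline_env(disable_special_treatment=False, enable_async_pipeline=False, base=None):
    """Return a copy of the environment with the two FMHA switches set to "1" or removed."""
    env = dict(os.environ if base is None else base)
    switches = (
        ("FMHA_DISABLE_SPECIAL_TREATMENT", disable_special_treatment),
        ("FMHA_ENABLE_ASYNC_PIPELINE", enable_async_pipeline),
    )
    for name, enabled in switches:
        if enabled:
            env[name] = "1"
        else:
            env.pop(name, None)  # assumption: unset means default behaviour
    return env


def forward_test_command():
    """The pytest invocation from the build/test steps above."""
    return ["pytest", "tests/test_mem_eff_attention.py::test_forward"]
```

From the root of the xformers checkout, a run could then look like `subprocess.run(forward_test_command(), env=pipeline_env(disable_special_treatment=True), check=True)`, repeated with different switch combinations to compare timings.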
## Discussion

1 parent b2ea5fd commit 865ab2b
12 files changed
Lines changed: 2875 additions & 799 deletions
File tree
- include/ck_tile
  - host
  - ops
    - fmha
      - kernel
      - pipeline
    - gemm/block