[ET-VK][ez] Fix 8 bit linear compute shader dispatch#9531
facebook-github-bot merged 3 commits into gh/SS-JIA/200/base
## Context
Currently, for the `q_8w_linear` shader, both the texture and the buffer variants use the same global and local work group settings.
Specifically, the global work group is set to `{out.numel(), 1, 1}` and the local work group is set to `{64, 1, 1}`.
However, I believe this results in very poor memory re-use for the texture shader. In this configuration:
* Within a work group, each invocation requests a different row of A, so 64 rows of A are requested in total
* All invocations in a work group request the same row of B
* One work group therefore loads 65 unique rows from A and B
Compare this to a local work group size of `{8, 8, 1}`:
* Across the work group, 8 rows are loaded from A and 8 rows are loaded from B
* One work group loads 16 unique rows in total from A and B
Evidently, the latter work group size gives better memory re-use, since fewer unique rows are loaded.
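The row counts above can be reproduced with a small sketch. This is only a model of the access pattern described here, not the shader itself: it assumes each invocation computes one output element, that the local `x` extent spans rows of A, and that the local `y` extent spans rows of B. The function name `unique_rows_per_work_group` is hypothetical.

```python
def unique_rows_per_work_group(local_wg):
    """Count the distinct rows of A and B touched by one work group.

    Assumption (not taken from the shader source): the local x extent
    selects rows of A and the local y extent selects rows of B.
    """
    x, y, z = local_wg
    rows_a = x  # one distinct row of A per x index
    rows_b = y  # one distinct row of B per y index
    invocations = x * y * z
    return rows_a + rows_b, invocations


if __name__ == "__main__":
    for wg in [(64, 1, 1), (8, 8, 1)]:
        rows, invocations = unique_rows_per_work_group(wg)
        # Outputs computed per unique row loaded: higher means more re-use.
        print(wg, "unique rows:", rows, "outputs per row:", invocations / rows)
```

Both sizes run 64 invocations per work group, so the difference is entirely in how many distinct rows those 64 invocations need.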
## Changes
Modify the `q_8w_linear` shader to use a `{8, 8, 1}` local work group if possible. If `M` is small, use `{4, 16, 1}` or `{2, 32, 1}` instead to reduce the number of inactive invocations.
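A rough sketch of the selection logic, to show why a narrower `x` extent helps when `M` is small. The cutoffs used here (`M >= 8`, `M >= 4`) are assumptions for illustration, since the description does not state the exact thresholds, and `pick_local_wg` and `inactive_fraction` are hypothetical names, not functions from the codebase.

```python
def pick_local_wg(m):
    """Pick a 64-invocation local work group size based on M.

    The thresholds are assumed for illustration; the actual cutoffs
    used by the change are not given in the description.
    """
    if m >= 8:
        return (8, 8, 1)
    if m >= 4:
        return (4, 16, 1)
    return (2, 32, 1)


def inactive_fraction(m, local_wg):
    """Fraction of invocations along x that fall outside the M rows."""
    x = local_wg[0]
    padded = -(-m // x) * x  # round M up to a multiple of the x extent
    return (padded - m) / padded


if __name__ == "__main__":
    for m in (1, 2, 4, 8, 32):
        wg = pick_local_wg(m)
        print(m, wg, inactive_fraction(m, (8, 8, 1)), inactive_fraction(m, wg))
```

Under these assumptions, `M = 2` with `{8, 8, 1}` leaves 6 of every 8 invocations along `x` idle, while `{2, 32, 1}` leaves none idle.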
Differential Revision: [D71706489](https://our.internmc.facebook.com/intern/diff/D71706489/)
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9531
❌ 2 New Failures as of commit a3a3d85 with merge base 7159650.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: 273548740
Pull Request resolved: #9531
This pull request was exported from Phabricator. Differential Revision: D71706489
ghstack-source-id: 274198011
@exported-using-ghexport
ghstack-source-id: 274260277
@exported-using-ghexport
Merged commit e918ec2 into gh/SS-JIA/200/base