[PTX][MMA] Added support for migrating m16n8k16 by TejaX-Alaghari · Pull Request #2821 · oneapi-src/SYCLomatic

TejaX-Alaghari · 2025-05-07T08:43:14Z

This PR adds support for migrating the following m16n8k16 configurations:

.f32.f16.f16.f32
.s32.s8.s8.s32

tomflinda

LGTM

tomflinda

Pls address the comments and verify your update with e2e test cases

zhimingwang36 · 2025-05-09T08:16:03Z

+        auto rb = reinterpret_cast<MulType *>(recv_b);
+
+        for (int j = 0; j < 4; j++) {
+          c[0] += static_cast<CDType>(ra[j]) * static_cast<CDType>(rb[j]);


c matrix should not be updated.

Changed the logic to do:

d = c;
d += a * b;
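A minimal sketch of the `d = c; d += a * b;` pattern from the reply above, reduced to a single accumulator. The function name `mad` and its signature are hypothetical and are not taken from the PR; the point is only that `c` is read-only and the result lands in a separate `d`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not the PR's interface): returns c + dot(a, b).
// c is taken by const reference, so the caller's C value is never written.
template <typename CDType, typename ABType, std::size_t K>
CDType mad(const ABType (&a)[K], const ABType (&b)[K], const CDType &c) {
  CDType d = c; // d = c;
  for (std::size_t k = 0; k < K; ++k)
    d += static_cast<CDType>(a[k]) * static_cast<CDType>(b[k]); // d += a * b;
  return d;
}
```

With this shape, the same `c` can be reused across several calls without carrying over a previous result.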

tomflinda · 2025-05-09T09:19:48Z

+        for (int j = 0; j < 4; j++) {
+          c[0] += static_cast<CDType>(ra[j]) * static_cast<CDType>(rb[j]);
+          c[1] += static_cast<CDType>(ra[j]) * static_cast<CDType>(rb[j + 4]);
+          c[2] += static_cast<CDType>(ra[j + 4]) * static_cast<CDType>(rb[j]);
+          c[3] +=
+              static_cast<CDType>(ra[j + 4]) * static_cast<CDType>(rb[j + 4]);
+        }


Pls add more comments to explain the code piece here and the reason offset 4 is used.

Added comments to clarify why the offset of 4 is used and why it cannot overflow.
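The role of the offset can be sketched in isolation. The layout below is an assumption made for illustration: `ra[0..3]` holds four elements of one row of A and `ra[4..7]` the matching elements of the row 8 below it, while `rb[0..3]` and `rb[4..7]` hold the matching elements of two adjacent columns of B. The function name `partial_mad` is hypothetical.

```cpp
#include <array>
#include <cassert>

// Hypothetical helper: one gather step's contribution to the four results.
// Assumed layout: ra = { row, row + 8 }, rb = { col, col + 1 }, four elements
// each. The largest index used is j + 4 == 7, so the 8-element buffers are
// never read out of bounds.
template <typename CDType, typename ABType>
std::array<CDType, 4> partial_mad(const std::array<ABType, 8> &ra,
                                  const std::array<ABType, 8> &rb,
                                  std::array<CDType, 4> d) {
  for (int j = 0; j < 4; ++j) {
    d[0] += static_cast<CDType>(ra[j]) * static_cast<CDType>(rb[j]);         // row     x col
    d[1] += static_cast<CDType>(ra[j]) * static_cast<CDType>(rb[j + 4]);     // row     x col + 1
    d[2] += static_cast<CDType>(ra[j + 4]) * static_cast<CDType>(rb[j]);     // row + 8 x col
    d[3] += static_cast<CDType>(ra[j + 4]) * static_cast<CDType>(rb[j + 4]); // row + 8 x col + 1
  }
  return d;
}
```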

tomflinda · 2025-05-09T09:42:30Z

+/// \tparam [in] M The rows of A, C & D matrix
+/// \tparam [in] N The columns of B, C, D matrix
+/// \tparam [in] K The columns & rows of A & B matrices respectively
+/// \tparam [in] MulType The type used to multiply A and B matrix elements as


MulType is easily confused with ABType; pls add more comments to explain it.

Modified the comment to explain better

tomflinda · 2025-05-11T06:44:01Z

+/// Multiplies 2 matrices (A & B) and adds the result to C matrix and
+/// accumulates the result to a D matrix (MAD). Requires the sub-group size of


The functionality description for this helper function is not accurate. The helper is called by each work item of a sub-group (the sub-group size is limited to 32). The current work item i (i = 0, 1, ..., 31) calculates only four elements of the result matrix D (e.g. D = A*B + C, where D is 16x8, A is 16x16, B is 16x8, and C is 16x8) for shape and type m16n8k16 (f32.f16.f16.f32). Pls update the description for this helper function.

Added more description to the algo functionality

tomflinda · 2025-05-11T07:02:03Z

+        // d2 += row8{ a0, a1, a8, a9 } * col0{ b0, b1, b8, b9 }
+        // d3 += row8{ a1, a1, a8, a9 } * col1{ b0, b1, b8, b9 }
+        for (int j = 0; j < 4; j++) {
+          *d[0] += static_cast<CDType>(ra[j]) * static_cast<CDType>(rb[j]);


d0~d3 are the four elements of the result D (D = AxB + C for m16n8k16 (f32.f16.f16.f32)) computed by the current work item. By the definition of matrix multiplication, d0 (e.g. at position [i, j] of matrix D) is the accumulated dot product of the whole row i of matrix A and the whole column j of matrix B. At the sub-group level, pls explain how the current work item loads the whole row i of matrix A and the whole column j of matrix B. For example, take the parameters void *a_mat, void *b_mat, void *c_mat shown in the lit test mma.cu:

asm("mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
    " { %0, %1, %2, %3 }, "
    " { %4, %5, %6, %7 }, "
    " { %8, %9 }, "
    " { %0, %1, %2, %3 };"
    : "+f"(fc[0]), "+f"(fc[1]), "+f"(fc[2]), "+f"(fc[3])
    : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]), "r"(b[0]), "r"(b[1]));

Only 8 elements of matrix A and 4 elements of matrix B are passed into the ASM instruction, yet it computes four elements of the result D. So for each of these four elements, pls explain in the helper function how the whole row i of matrix A and the whole column j of matrix B are loaded.

Changed the description to reflect this and added more description of the algorithm's functionality.
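The question above can be checked with a host-side simulation. The sketch below is not the PR's helper: it models f16 as `float`, replaces the sub-group exchange with direct reads from another lane's fragment, and assumes the per-lane fragment layout given in the PTX ISA documentation for `mma.m16n8k16` with `.f16` operands. All names (`Fragment`, `distribute`, `lane_mma`, `matches_reference`) are hypothetical.

```cpp
#include <array>
#include <cassert>

constexpr int M = 16, N = 8, K = 16, LANES = 32;
using MatA = std::array<std::array<float, K>, M>; // 16x16, indexed [m][k]
using MatB = std::array<std::array<float, N>, K>; // 16x8, indexed [k][n]
using MatC = std::array<std::array<float, N>, M>; // 16x8, indexed [m][n]

// Registers one lane passes to the instruction (f16 modelled as float).
struct Fragment {
  std::array<float, 8> a; // 4 x .f16x2
  std::array<float, 4> b; // 2 x .f16x2
  std::array<float, 4> c; // 4 x .f32
};

// Scatter A, B and C into per-lane fragments (assumed PTX layout).
inline std::array<Fragment, LANES> distribute(const MatA &A, const MatB &B,
                                              const MatC &C) {
  std::array<Fragment, LANES> f{};
  for (int lane = 0; lane < LANES; ++lane) {
    const int g = lane >> 2, t = lane % 4; // group id, thread id in group
    for (int i = 0; i < 8; ++i)
      f[lane].a[i] = A[g + ((i >> 1) & 1) * 8][t * 2 + (i & 1) + (i >= 4) * 8];
    for (int i = 0; i < 4; ++i) {
      f[lane].b[i] = B[t * 2 + (i & 1) + (i >= 2) * 8][g];
      f[lane].c[i] = C[g + (i >= 2) * 8][t * 2 + (i & 1)];
    }
  }
  return f;
}

// Four elements of D for one lane. Each full row of A is spread over the 4
// lanes of one group; each full column of B over the 4 lanes whose group id
// equals that column. Four gather steps therefore cover all K = 16 terms.
inline std::array<float, 4> lane_mma(const std::array<Fragment, LANES> &f,
                                     int lane) {
  const int row_base = lane & ~3; // first lane of this lane's group
  const int col0 = (lane % 4) * 2, col1 = col0 + 1;
  const int lo[4] = {0, 1, 4, 5}; // a-registers holding row g
  const int hi[4] = {2, 3, 6, 7}; // a-registers holding row g + 8
  std::array<float, 4> d = f[lane].c; // d = c
  for (int i = 0; i < 4; ++i) {
    const auto &ra = f[row_base + i].a;  // stands in for a sub-group exchange
    const auto &rb0 = f[col0 * 4 + i].b; // column col0, rows 2i, 2i+1, 2i+8, 2i+9
    const auto &rb1 = f[col1 * 4 + i].b; // column col1, same rows
    for (int j = 0; j < 4; ++j) {
      d[0] += ra[lo[j]] * rb0[j];
      d[1] += ra[lo[j]] * rb1[j];
      d[2] += ra[hi[j]] * rb0[j];
      d[3] += ra[hi[j]] * rb1[j];
    }
  }
  return d;
}

// Compare every lane's four results against a plain triple-loop D = A*B + C.
inline bool matches_reference() {
  MatA A{}; MatB B{}; MatC C{};
  for (int r = 0; r < M; ++r)
    for (int k = 0; k < K; ++k) A[r][k] = float((r * 3 + k) % 7 - 3);
  for (int k = 0; k < K; ++k)
    for (int n = 0; n < N; ++n) B[k][n] = float((k + 2 * n) % 5 - 2);
  for (int r = 0; r < M; ++r)
    for (int n = 0; n < N; ++n) C[r][n] = float(r - n);
  const auto f = distribute(A, B, C);
  for (int lane = 0; lane < LANES; ++lane) {
    const auto d = lane_mma(f, lane);
    for (int i = 0; i < 4; ++i) {
      const int r = (lane >> 2) + (i >= 2) * 8, n = (lane % 4) * 2 + (i & 1);
      float ref = C[r][n];
      for (int k = 0; k < K; ++k) ref += A[r][k] * B[k][n];
      if (d[i] != ref) return false; // small integers, so float math is exact
    }
  }
  return true;
}
```

Under these assumptions, a lane never holds a whole row or column itself: it sees all 16 terms of each dot product only after reading from four other lanes.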


tomflinda · 2025-05-13T07:57:42Z

+template <typename T> struct MMAType {
+  using PackType = uint32_t;
+};


If only uint32_t is enough, we can use uint32_t directly instead of introducing MMAType

Some shapes involving f64 require a pack type of double, so I suggest keeping this.
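A sketch of how such a trait could accommodate f64. The primary template is the one quoted in the diff; the `double` specialization is an assumption added here for illustration and is not shown anywhere in this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Primary template, as in the quoted diff: operands travel packed in
// 32-bit words (e.g. two f16 or four s8 values per word).
template <typename T> struct MMAType {
  using PackType = std::uint32_t;
};

// Hypothetical specialization (assumption, not from the PR): an f64 operand
// does not fit in a 32-bit word, so its pack type is double itself.
template <> struct MMAType<double> {
  using PackType = double;
};

static_assert(std::is_same<MMAType<float>::PackType, std::uint32_t>::value,
              "non-f64 types use the 32-bit pack type");
static_assert(std::is_same<MMAType<double>::PackType, double>::value,
              "f64 uses double as its pack type");
```

Keeping the indirection means callers write `typename MMAType<T>::PackType` once and pick up the right register type per element type.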

tomflinda · 2025-05-13T08:14:04Z

+      // Each work item Wi (i=0...31) gathers 2 row & 2 col matrix fragments
+      // of length k (8) from A & B matrices respectively into recv_a & recv_b
+      // across 4 iterations using 4 neighboring work items with below mapping


Could you refine this comment block? It is difficult for users to understand.

Simplified it

tomflinda · 2025-05-13T08:15:01Z

+      // logic:
+      // row0 = (lane >> 2)    & row1 = (lane >> 2) + 8
+      // col0 = (lane % 4) * 2 & col1 = (lane % 4) * 2 + 1
+      for (int i = 0; i < 4; i++) {


Could you explain the meaning of 4?

Added comments to describe the distribution of rows & cols across 4 work items
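The row and column formulas in the quoted comment can be exercised on their own. The sketch below uses hypothetical names (`DCoords`, `d_coords`, `covers_d_exactly_once`) and checks that, with 32 lanes each owning the 2x2 set of positions {row0, row1} x {col0, col1}, every element of the 16x8 result is produced exactly once. It also shows where the 4 comes from: lanes 4g to 4g+3 share the same pair of rows, so the 32 lanes form 8 groups of 4.

```cpp
#include <cassert>

// Positions in the 16x8 matrix D owned by one lane.
struct DCoords { int row0, row1, col0, col1; };

// Formulas from the quoted comment:
//   row0 = (lane >> 2),    row1 = (lane >> 2) + 8
//   col0 = (lane % 4) * 2, col1 = (lane % 4) * 2 + 1
constexpr DCoords d_coords(int lane) {
  return DCoords{lane >> 2, (lane >> 2) + 8, (lane % 4) * 2,
                 (lane % 4) * 2 + 1};
}

// True if the 32 lanes together write each of the 16 * 8 elements once.
inline bool covers_d_exactly_once() {
  int hits[16][8] = {};
  for (int lane = 0; lane < 32; ++lane) {
    const DCoords p = d_coords(lane);
    ++hits[p.row0][p.col0];
    ++hits[p.row0][p.col1];
    ++hits[p.row1][p.col0];
    ++hits[p.row1][p.col1];
  }
  for (int r = 0; r < 16; ++r)
    for (int c = 0; c < 8; ++c)
      if (hits[r][c] != 1) return false;
  return true;
}
```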

tomflinda

Pls address the comment I left.

tomflinda

LGTM

TejaX-Alaghari requested a review from a team as a code owner May 7, 2025 08:43

TejaX-Alaghari requested review from the-slow-one and zhimingwang36 May 7, 2025 08:43

zhiweij1 reviewed May 7, 2025

Comment thread clang/runtime/dpct-rt/include/dpct/math.hpp Outdated

zhiweij1 reviewed May 7, 2025

Comment thread clang/runtime/dpct-rt/include/dpct/math.hpp

zhiweij1 reviewed May 7, 2025

Comment thread clang/runtime/dpct-rt/include/dpct/math.hpp Outdated

zhiweij1 reviewed May 7, 2025

Comment thread clang/runtime/dpct-rt/include/dpct/math.hpp Outdated

zhiweij1 reviewed May 7, 2025

Comment thread clang/runtime/dpct-rt/include/dpct/math.hpp

zhiweij1 reviewed May 7, 2025

Comment thread clang/runtime/dpct-rt/include/dpct/math.hpp Outdated

tomflinda approved these changes May 7, 2025

zhiweij1 reviewed May 7, 2025

Comment thread clang/runtime/dpct-rt/include/dpct/math.hpp Outdated

zhiweij1 reviewed May 7, 2025

Comment thread clang/runtime/dpct-rt/include/dpct/math.hpp Outdated

zhiweij1 approved these changes May 7, 2025

tomflinda reviewed May 9, 2025

Comment thread clang/runtime/dpct-rt/include/dpct/math.hpp

tomflinda requested changes May 9, 2025

zhimingwang36 reviewed May 9, 2025

tomflinda reviewed May 9, 2025

Comment thread clang/runtime/dpct-rt/include/dpct/math.hpp Outdated

tomflinda reviewed May 9, 2025

TejaX-Alaghari force-pushed the mma_m16n8k16 branch from 2dda4c7 to 9e2c234 Compare May 9, 2025 10:25

tomflinda reviewed May 11, 2025

zhiweij1 reviewed May 12, 2025

Comment thread clang/runtime/dpct-rt/include/dpct/math.hpp Outdated

zhiweij1 reviewed May 12, 2025

Comment thread clang/test/dpct/asm/mma.cu Outdated

tomflinda reviewed May 12, 2025

Comment thread clang/runtime/dpct-rt/include/dpct/math.hpp Outdated

TejaX-Alaghari added 5 commits May 13, 2025 11:20

Added support for mma m16n8k16 migration

11eb1d7

* f32.f16.f16.f32
* s32.s8.s8.s32

Created helper function for m16n8k16

d32895e

Added new type logic for A & B matrix elements

2a98af9

Fixed format & addressed comments

d385cb3

Changed the interface to accept void *

e5cf736

TejaX-Alaghari added 3 commits May 13, 2025 11:20

Refined comments

685a7b8

Added more inline comments for loops and added volatile type to D matrix elements

e915a0b

Merged MulType and ABType into 1

7527809

TejaX-Alaghari force-pushed the mma_m16n8k16 branch from 9e2c234 to 7527809 Compare May 13, 2025 03:48

tomflinda reviewed May 13, 2025

Comment thread clang/runtime/dpct-rt/include/dpct/math.hpp Outdated

tomflinda reviewed May 13, 2025

tomflinda requested changes May 13, 2025

Added comments to describe the algo better

eced506

tomflinda approved these changes May 13, 2025

zhimingwang36 approved these changes May 14, 2025

zhimingwang36 merged commit 8f31872 into oneapi-src:SYCLomatic May 14, 2025
5 of 7 checks passed

TejaX-Alaghari deleted the mma_m16n8k16 branch May 17, 2025 02:57
