Skip to content

[SYCL][Matrix] Add get-coord API and general query example#7964

Merged
dm-vodopyanov merged 53 commits into
intel:syclfrom
dkhaldi:get-coord-doc
Aug 28, 2023
Merged

[SYCL][Matrix] Add get-coord API and general query example#7964
dm-vodopyanov merged 53 commits into
intel:syclfrom
dkhaldi:get-coord-doc

Conversation

@dkhaldi

@dkhaldi dkhaldi commented Jan 9, 2023

Copy link
Copy Markdown
Contributor

…he llvm-test-suite

- Add get coord API and remove it from TODO list
- Remove the local memory future API looking as it is no more relevant

@bader bader left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few fixes: markdown linter issues and one typo.

@dkhaldi dkhaldi marked this pull request as ready for review January 10, 2023 16:41
@dkhaldi dkhaldi requested a review from a team as a code owner January 10, 2023 16:41
@dkhaldi dkhaldi requested review from JackAKirk and gmlueck January 10, 2023 16:44
@JackAKirk

Copy link
Copy Markdown
Contributor

Looks pretty good to me. For the Query interface, I think it would be good to try to get some community feedback if possible. I suppose the argument for the general query is that with several backends, it could be easier to ask the API for the set of valid combinations rather than search for the documentation. This is fair. Although I think we should still make an effort to make documentation of supported sizes/types for different backends as accessible and clear as possible; so that people are not forced to use the general query when they may prefer just looking at the docs.

At the moment the documentation for supported types is in e.g. sycl_ext_intel_matrix doc. For the Nvidia case a current slight problem is that we don't actually have any Nvidia only features at the moment, so it is a bit of a misnomer to have a e.g. sycl_ext_cuda_matrix doc similar to what I have here (https://github.com/intel/llvm/pull/6968/files) which currently only lists the currently supported values of sycl_ext_oneapi_matrix APIs in the ext_oneapi_cuda backend. In the future even if we do add the cuda only matrix features there can be other backends that encounter the situation where they need to document supported values of sycl_ext_oneapi_matrix APIs in that backend but don't have a backend specific matrix features extension.

I thought there could be two better options.

a) I can rename sycl_ext_oneapi_matrix_cuda.asciidoc sycl_ext_oneapi_matrix_cuda_supported_vals.asciidoc or similar, remove all the dpc++ extension boilerplate docs from that doc, just leaving the supported value information. Then move the "Supported Combinations Per Hardware" section in sycl_ext_intel_matrix.asciidoc to a similar file like sycl_ext_oneapi_matrix_intel_supported_vals.asciidoc.

OR

b) we just move the "Supported Combinations Per Hardware" section for all backends to the main sycl_ext_oneapi_matrix.asciidoc doc and I just delete this file completely: https://github.com/intel/llvm/blob/e50a2f5f97acb12db1de78c9ad739b931c77b03f/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_cuda_matrix.asciidoc.

What do you think? cc @gmlueck also.

@gmlueck

gmlueck commented Jan 11, 2023

Copy link
Copy Markdown
Contributor

we just move the "Supported Combinations Per Hardware" section for all backends to the main sycl_ext_oneapi_matrix.asciidoc doc

If we want to document all the matrix constraints for each device, I think it probably makes sense to document them all in a single table (i.e. in the same document). For now, this could be a non-normative appendix in the main "sycl_ext_oneapi_matrix.asciidoc" document. If the matrix API is eventually adopted into the core SYCL language (and the extension goes away), we will need to find some other place to list these constraints, but we can worry about this later.

@JackAKirk

Copy link
Copy Markdown
Contributor

we just move the "Supported Combinations Per Hardware" section for all backends to the main sycl_ext_oneapi_matrix.asciidoc doc

If we want to document all the matrix constraints for each device, I think it probably makes sense to document them all in a single table (i.e. in the same document). For now, this could be a non-normative appendix in the main "sycl_ext_oneapi_matrix.asciidoc" document. If the matrix API is eventually adopted into the core SYCL language (and the extension goes away), we will need to find some other place to list these constraints, but we can worry about this later.

Sounds good to me.

@dkhaldi

dkhaldi commented Jan 12, 2023

Copy link
Copy Markdown
Contributor Author

@JackAKirk

At the moment the documentation for supported types is in e.g. sycl_ext_intel_matrix doc. For the Nvidia case a current slight problem is that we don't actually have any Nvidia only features at the moment, so it is a bit of a misnomer to have a e.g. sycl_ext_cuda_matrix doc similar to what I have here (https://github.com/intel/llvm/pull/6968/files) which currently only lists the currently supported values of sycl_ext_oneapi_matrix APIs in the ext_oneapi_cuda backend. In the future even if we do add the cuda only matrix features there can be other backends that encounter the situation where they need to document supported values of sycl_ext_oneapi_matrix APIs in that backend but don't have a backend specific matrix features extension.

You should document what your implementation is supporting, not what Nvidia hardware supports. In the joint matrix code in the CUDA backend, there are very specific combinations that are allowed, this is what should be documented and returned by this query. It is worth mentioning that what we specify in the documentation and the query is not what the hardware supports (note that the XMX sizes are disclosed information). We document what the implementation can do in an optimal way. You can refer to them as logical sizes rather than hardware sizes. In all cases, performance kernels should care about the maximum load it can do at a time not about the matrix hardware mad instruction. Then, reuse that in an optimal way and feet it to mad instruction.

A specific use case appears in one of our performance kernels: a SG should do more than one DPAS instruction to get optimal results. In most cases, especially when matrix sizes are large, the optimal size MxN is 32x64 on PVC, so instead of the user having to fully unroll 32x64 loop and then create multiple joint_matrix_mad operations, the implementation can provide such combination, document it in the document and in the query. In this case, the user will have one iteration in the SG to worry about.

I thought there could be two better options.

Having all the combinations per backend (AMX, XMX8, XMX16, different SM versions for Nvidia) in the main document is fine, especially that the query interface is in the main document. So the combinations will complement the query API so the user knows what to expect when they use the query interface.

- Put all combinations in appendix
- move get_coord to the main document
- Correct the example by converting USM pointers to multi_ptr
@JackAKirk

JackAKirk commented Jan 13, 2023

Copy link
Copy Markdown
Contributor

You should document what your implementation is supporting, not what Nvidia hardware supports. In the joint matrix code in the CUDA backend, there are very specific combinations that are allowed, this is what should be documented and returned by this query.

Yes the table here, https://github.com/intel/llvm/blob/e50a2f5f97acb12db1de78c9ad739b931c77b03f/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_cuda_matrix.asciidoc#valid-joint_matrix-types-and-shapes, is up to date with what the implementation is supporting. I can add it to the Appendix following what you did here in a subsequent PR. Or if you prefer to add it directly in this PR, feel free.

It is worth mentioning that what we specify in the documentation and the query is not what the hardware supports (note that the XMX sizes are disclosed information). We document what the implementation can do in an optimal way. You can refer to them as logical sizes rather than hardware sizes. In all cases, performance kernels should care about the maximum load it can do at a time not about the matrix hardware mad instruction. Then, reuse that in an optimal way and feet it to mad instruction.

A specific use case appears in one of our performance kernels: a SG should do more than one DPAS instruction to get optimal results. In most cases, especially when matrix sizes are large, the optimal size MxN is 32x64 on PVC, so instead of the user having to fully unroll 32x64 loop and then create multiple joint_matrix_mad operations, the implementation can provide such combination, document it in the document and in the query. In this case, the user will have one iteration in the SG to worry about.

I see what you mean by logical vs hardware sizes. The initial sycl-blas commit on joint_matrix also has some relevance to your point I think: https://github.com/codeplaysoftware/sycl-blas . BTW the initial sycl-dnn joint_matrix accelerated commit will follow shortly (it is quite a bit larger so review takes a while).


While this document presents the core API that unifies Intel AMX,
Intel XMX, and Nvidia Tensor Cores, the implementations support
slightly different versions of the API. For this reason, we introduce

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The situation is different now from what this paragraph states; because this document is specifically for the unified matrix interfaces which are portable. So I think it is best to replace this paragraph completely with e.g. the standard template for feature macro versioning.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JackAKirk, the standard template for feature macro versioning if not for experimental feature.
Once this moves out of experimental and becomes supported, this whole section "Matrix API versions" and SYCL_EXT_ONEAPI_MATRIX_VERSION macro will be removed. We won't need to keep the legacy API and tests. Right now, we only keep them to ensure current users have something working while we guide them through all these changes until we have something final (hopefully in this PR).

Do you suggest I remove this now?

@JackAKirk JackAKirk Feb 10, 2023

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just think that anyone reading this now as documentation on joint_matrix will be thrown by this (wrong) statement

"the implementations support
slightly different versions of the API"

And since this does seem to be the main place that people will arrive at for joint_matrix documentation currently, it makes sense to address this now.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove it especially that we made value 4 as the default already.


IMPORTANT: Matrix layout defaulting to `layout::dynamic` applies only to matrix with `use::accumulator`
IMPORTANT: Matrix layout defaulting to `layout::dynamic` applies only
to matrix with `use::accumulator`

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
to matrix with `use::accumulator`
to `joint_matrix` with `use::accumulator`

#### Use
Specifying the usage of the matrix: matrix left (A), matrix right (B) or accumulator +(C)+ is required by backend implementations to reason about the layout of the matrix in registers.
==== Use
Specifying the usage of the matrix: matrix left (A), matrix right (B)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

matrix left and matrix right are not defined (or equivalently A / B aren't defined).

} // namespace sycl::ext::oneapi::experimental::matrix
```
This function copies `Rows x Cols` elements of type `T` from joint
matrix `src` to joint matrix `dest`. The two matrcies must have the

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
matrix `src` to joint matrix `dest`. The two matrcies must have the
matrix `src` to joint matrix `dest`. The two matrices must have the

the user whether a specific combination is valid or not. This takes
place when the user specifies all template parameters.

- Default values: this provides a default shape if the user does not

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make the implementation choose default values with portability in mind? e.g return a value for XMX that matches the default value for AMX (If I remember correctly there is a unique case satisfying this?).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is currently no such case that satisfies XMX of DG2 and XMX of PVC. But it can be added. Currently the default is the max.

@bader bader left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sycl/ReleaseNotes.md changes look good to me.

@gmlueck

gmlueck commented Jul 31, 2023

Copy link
Copy Markdown
Contributor

Hi @dkhaldi. I just wanted to let you know that I have 4 unresolved comments above. I'm not pushing to resolve them faster, but I wanted to make sure you weren't waiting for me to do something. Some of the comments are hidden and you need to click "Load more" to see them. They are:

  • Two broken links
  • One table formatting problem
  • An unresolved issue with the TF32 overload of joint_matrix_load

@dkhaldi

dkhaldi commented Aug 2, 2023

Copy link
Copy Markdown
Contributor Author

Hi @dkhaldi. I just wanted to let you know that I have 4 unresolved comments above. I'm not pushing to resolve them faster, but I wanted to make sure you weren't waiting for me to do something. Some of the comments are hidden and you need to click "Load more" to see them. They are:

  • Two broken links
  • One table formatting problem
  • An unresolved issue with the TF32 overload of joint_matrix_load

Hi @gmlueck, I just fixed the 4 unresolved comments and added clarifications for joint_matrix_copy.
Do they look good to you now?

Comment thread sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc Outdated
Comment thread sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc Outdated
Comment thread sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc Outdated
@dkhaldi

dkhaldi commented Aug 28, 2023

Copy link
Copy Markdown
Contributor Author

@intel/llvm-gatekeepers can you please merge?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants