New rocblas hipblaslt integration#8082
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. ❌ Your project status has failed because the head coverage (77.83%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## develop #8082 +/- ##
===========================================
+ Coverage 61.53% 61.58% +0.05%
===========================================
Files 2095 2095
Lines 361435 361501 +66
Branches 54717 54717
===========================================
+ Hits 222391 222625 +234
+ Misses 120198 119997 -201
- Partials 18846 18879 +33
*This pull request uses carry forward flags. Click here to find out more. 🚀 New features to boost your workflow:
|
TorreZuk
left a comment
There was a problem hiding this comment.
Some style comments from earlier
There was a problem hiding this comment.
Pull request overview
This PR updates rocBLAS’s hipBLASLt backend integration to use the standard hipBLASLt matmul APIs for General Batched GEMM when hipBLASLt is new enough, while retaining the existing extension-API-based integration as a fallback for older hipBLASLt versions.
Changes:
- Adds a new hipBLASLt matmul code path guarded by a hipBLASLt version check (intended for >= 1.4.1).
- Introduces helper macros/utilities for hipBLASLt status checking, solution selection, and workspace size validation.
- Adds an alpha/beta type-mapping helper to match hipBLASLt expectations.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
a233172 to
46c697c
Compare
TorreZuk
left a comment
There was a problem hiding this comment.
My EXPECT to simpler CHECK_ was not addressed, all these that are just success remove the status and just use the CHECK_ pattern we use elsewhere in the library
evedovelli
left a comment
There was a problem hiding this comment.
I left a few comments. Some might be too nitpicking so I'll let you decide what to change or keep as is.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…ng a new device pointer array with offset update before passing into hipblasLtMatmul
…ion, changed error handling
b8170a5 to
b76e86c
Compare
TorreZuk
left a comment
There was a problem hiding this comment.
More questions on design
…allocators and memcpys with async equivalents
…d corresponding tests (#8412) ## Motivation The GEMM kernel implementation in hipBLASLt initializes the SrdA and SrdB from AddressA and AddressB values passed into the kernel as arguments for each workgroup but the dereferencing of the addresses in SrdA and SrdB only happens when alpha != 0. But with General Batched GEMM, the AddressA and AddressB is pointer to a pointer array on the device side. In order to initialize the SrdA and SrdB, the AddressA and AddressB needs to be dereferenced before the alpha != 0 check. This gap was exposed by rocblas batched gemm bad args tests. The PR at #8082 has a dependency on this fix. ## Technical Details When K = 0, the size of A and B matrices will be 0 as well. We have two scenarios which can happen here for the bad args: 1. A = 0, B = 0 and alpha = 0: This is the case where as long as beta != 0, C/D != 0, then it's still a valid problem. The current implementation failed here since the SrdA and SrdB initialization was involving dereferencing of the pointer array passed in AddressA and AddressB even before the alpha != 0 check. This PR adds this additional check of AddressA/B != 0 before dereferencing in Kernel Assembly. 2. A = 0, B = 0 and alpha != 0: This will result invalid data pointer error in rocblaslt_matmul() function (handled on the host side itself). ## Test Plan hipBLASLt didn't have a test for this earlier. Added a new test with K=0 and the client code for General Batched GEMM checks the size of A and B matrices to decide if it should pass the pointer array or just nullptr into the kernel. This mimics the rocblas test's behavior. ## Test Result All tests are passing locally ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Madhusoodhanan Prabha <amadhuso@ctr2-alola-ctrl-01.amd.com>
evedovelli
left a comment
There was a problem hiding this comment.
@mpanoop thanks for the changes! I left a few more suggestions in the PR.
TorreZuk
left a comment
There was a problem hiding this comment.
Looking better, our we getting downstream testing on these changes?
…evisted the error handling to avoid duplicate code
…_HIPBLASLT_ERROR to not overwrite previous error status
TorreZuk
left a comment
There was a problem hiding this comment.
Okay I think all concerns now been addressed. Need to discuss timeline to merge in meeting
New rocblas hipblaslt integration ## Motivation rocBLAS integration to hipBLASLt currently uses the hipBLASLt extension APIs. This was because unlike non-strided and strided batched GEMM, the batched GEMM wasn't supported using standard hipBLASLt APIs. Instead, the batched GEMM in rocBLAS was routed to hipBLASLt extension Grouped GEMM APIs. In order to keep the integration code consistent across all 3 categories of GEMM, hipBLASLt extension API based integration was chosen. With hipBLASLt version 1.3.0, General Batched GEMM support is introduced. PR at ROCm/rocm-libraries#7464 adds missing support for StreamK = 3 + Parallel Reduction as well along with new modifications needed in hipBLASLt side to make the rocblas hipblaslt integration work. The new integration also has a dependency on a newly introduced hipblaslt-ext API isSupportSolution() will be invoked from the rocBLAS side. Hence the new hipblaslt integration is currently guarded for compile time dependency on hipblaslt version 1.4.1 or above. Otherwise, the older hipblaslt integration will be exercised. ## Technical Details With the newly introduced support for hipblasLtBatchMode_t enum which enables support for General Batched GEMM workflow using the standard hipBLASLt APIs, the new hipBLASLt integration code will mimic any other hipBLASLt customer code. This is also a performant alternative for General Batched GEMM since the previous approach was routing to Grouped GEMM which wasn't taking advantage of the properties of the General Batched GEMM, the solution space wasn't exhaustive to cover all data types across GPUs, and the solution selection approach was getAllSolutions() instead of heuristic approach with no CacheLibrary support. The new approach mitigates these performance bottlenecks. ## Test Plan No new tests are added. Existing testcases will exercise the new integration when the hipBLASLt version is 1.4.1 or above. ## Test Result All the tests were passing when run on MI350 node where the BF16 and FP16 GEMMs are defaulted to hipBLASLt backend by default. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
…8746) ## Motivation The PR #8082 introduces a new hipBLASLt integration from rocBLAS. Previous hipBLASLt integration for non-batched, strided-batched and general batched GEMMs was routed via hipblaslt-ext APIs. This was because hipBLASLt didn't support for General Batched GEMM and rocBLAS was routing the General Batched GEMM APIs to Grouped GEMM APIs which was exposed via hipblaslt-ext. Now that General Batched GEMM is supported in hipBLASLt, the hipBLASLt integration in rocBLAS can be streamlined to look similar to any other customer code which directly consumes hipBLASLt APIs. ## Technical Details There are still some rocBLAS specific scenarios like given a solution index in hipBLASLt Solution Library, validate if this solution is supported for a given GPU and problem type were the hipBLASLt APIs currently exposed won't suffice. Going via the hipblasLtMatmulAlgoGetHeuristics() API will mean we have to use the hipblasLtMatmulHeuristicResult_t parameter to pass the solution-index as input. But as per the API contract, this parameter is strictly output parameter. That's why we decided to expose an extension API which can directly route this to equivalent rocblaslt layer API. ## Test Plan rocBLAS tests in PR 8082 were using this new API for the scenarios where the solution-index were explicitly passed in. ## Test Result All tests are passing. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: randyh62 <42045079+randyh62@users.noreply.github.com>
Motivation
rocBLAS integration to hipBLASLt currently uses the hipBLASLt extension APIs. This was because unlike non-strided and strided batched GEMM, the batched GEMM wasn't supported using standard hipBLASLt APIs. Instead, the batched GEMM in rocBLAS was routed to hipBLASLt extension Grouped GEMM APIs. In order to keep the integration code consistent across all 3 categories of GEMM, hipBLASLt extension API based integration was chosen.
With hipBLASLt version 1.3.0, General Batched GEMM support is introduced. PR at #7464 adds missing support for StreamK = 3 + Parallel Reduction as well along with new modifications needed in hipBLASLt side to make the rocblas hipblaslt integration work. The new integration also has a dependency on a newly introduced hipblaslt-ext API isSupportSolution() will be invoked from the rocBLAS side. Hence the new hipblaslt integration is currently guarded for compile time dependency on hipblaslt version 1.4.1 or above. Otherwise, the older hipblaslt integration will be exercised.
Technical Details
With the newly introduced support for hipblasLtBatchMode_t enum which enables support for General Batched GEMM workflow using the standard hipBLASLt APIs, the new hipBLASLt integration code will mimic any other hipBLASLt customer code. This is also a performant alternative for General Batched GEMM since the previous approach was routing to Grouped GEMM which wasn't taking advantage of the properties of the General Batched GEMM, the solution space wasn't exhaustive to cover all data types across GPUs, and the solution selection approach was getAllSolutions() instead of heuristic approach with no CacheLibrary support. The new approach mitigates these performance bottlenecks.
Test Plan
No new tests are added. Existing testcases will exercise the new integration when the hipBLASLt version is 1.4.1 or above.
Test Result
All the tests were passing when run on MI350 node where the BF16 and FP16 GEMMs are defaulted to hipBLASLt backend by default.
Submission Checklist