Skip to content

[JIRA: AIHPBLAS-1456] - 64-bit offset support in hipBLASLt#7585

Open
shbae wants to merge 27 commits into
developfrom
users/sbae/64bit_offset_support
Open

[JIRA: AIHPBLAS-1456] - 64-bit offset support in hipBLASLt#7585
shbae wants to merge 27 commits into
developfrom
users/sbae/64bit_offset_support

Conversation

@shbae

@shbae shbae commented May 19, 2026

Copy link
Copy Markdown
Contributor

Motivation

hipBLASLt currently lacks support for 64-bit batch offsets in matrix operations. This feature enables batched GEMM operations to specify element-level offsets for input/output matrices, allowing computation on specific regions within larger buffers without requiring data copies. This is critical for applications that manage large pre-allocated memory pools or need to operate on sub-matrices within batched operations, which is directly related to rocblas backend unification efforts. The feature is currently supported in rocblas and used by rocsolver, and it requires hipMemcpy overhead with hipblaslt backend, and this new feature would avoid that unnecessary hipMemcpy overhead.

Technical Details

This PR implements end-to-end 64-bit batch offset support across the hipBLASLt stack:

API Layer:

  • Extended HIPBLASLT_MATRIX_LAYOUT_OFFSET attribute to accept 64-bit offset values for A, B, C, D matrices
  • Batch offsets are specified in elements (not bytes) for consistency with matrix dimensions
  • Offsets are applied per-batch in pointer array mode (batch_mode=1)

Host Implementation:

  • Modified tensile_host.cpp to pass offset values as kernel arguments
  • Placed batch offset arguments at the tail of the kernarg buffer for backward compatibility
  • Updated kernel dispatch logic to handle 64-bit offset arithmetic

Kernel Generation (TensileLite):

  • Updated kernel signature generation to include offset parameters in kernarg buffer
  • Modified KernelWriterAssembly.py to:
    • Use only 2 additional temporary SGPRs for 64-bit offset handling (minimal register pressure)
    • Generate s_load_b64 instructions to load 64-bit offset values
    • Insert proper s_waitcnt synchronization after scalar loads
    • Apply 64-bit address arithmetic when computing buffer pointers
  • Extended KernelWriterConversion.py for Conversion kernel types
  • Updated computeStoreSrd() to properly handle offset calculations

Test Infrastructure:

  • Created dedicated test suite testing_matmul_batch_offset.hpp with dual-validation approach:
    a. Offset API results vs manual pointer adjustment (validates implementation correctness)
    b. GPU results vs CPU reference (validates numerical accuracy)
  • Added 5 test categories in matmul_gtest.yaml:
    • matmul_batch_offset_quick: smoke test (category: quick)
    • matmul_batch_offset_values: various offset values 0-512 (category: pre_checkin)
    • matmul_batch_offset_transpose: transposed matrix combinations (category: pre_checkin)
    • matmul_batch_offset_alpha_beta: various alpha/beta combinations (category: pre_checkin)
    • matmul_batch_offset_large: matrices with very large offsets which requires 64-bit integer type (category: nightly)
  • Scoped large tests to tested GPU architectures to avoid CI failures due to limited device memory resources.

Misc.

  • Fixed minor typo of an internal function name:
    • rocblaslt_matrix_layout_destory() --> rocblaslt_matrix_layout_destroy().

Test Plan

  1. Unit tests: Run new matmul_batch_offset test suite across quick/pre_checkin/nightly categories
  2. Precision coverage: All tests execute across f32, f16, bf16 data types
  3. Transpose modes: Validated with NN, NT, TN, TT matrix configurations
  4. Alpha/Beta combinations: Tested all GEMM modes (alpha-only, beta-only, alpha+beta)
  5. Offset values: Validated with offsets of various number of elements, including very large offset values, which actually requires 64-bit integer type.
  6. Batch counts: Tested with 1-4 batches
  7. Locally build and run relevant tests as well as look at the CI test results.

Test Result

  • All matmul_batch_offset tests passing across all categories
  • No regressions in existing test suites
  • Successful builds and test execution on gfx942 / gfx950 locally
  • CI tests PASSED

Risk level

Low

  • Changes are feature-additive (no modification to existing behavior when offset=0)
  • Kernel changes are scoped to new offset parameter handling
  • Minimal register pressure impact (only 2 extra temporary SGPRs)
  • Offset arguments placed at kernarg buffer tail to avoid breaking existing kernel binaries

Submission Checklist

Associated ticket: AIHPBLAS-1456

@codecov-commenter

codecov-commenter commented May 19, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 68.50000% with 63 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...tail/rocblaslt/src/include/rocblaslt_mat_utils.hpp 30.95% 21 Missing and 8 partials ⚠️
...c/amd_detail/rocblaslt/src/rocblaslt_auxiliary.cpp 46.88% 15 Missing and 2 partials ⚠️
...rary/src/amd_detail/rocblaslt/src/tensile_host.cpp 60.00% 11 Missing and 1 partial ⚠️
...laslt/library/src/amd_detail/include/auxiliary.hpp 54.55% 1 Missing and 4 partials ⚠️

❌ Your project status has failed because the head coverage (76.92%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff            @@
##           develop    #7585    +/-   ##
=========================================
  Coverage    71.33%   71.33%            
=========================================
  Files         2628     2628            
  Lines       413043   413216   +173     
  Branches     61875    61905    +30     
=========================================
+ Hits        294613   294744   +131     
- Misses       96656    96678    +22     
- Partials     21774    21794    +20     
Flag Coverage Δ *Carryforward flag
TensileLite 76.67% <100.00%> (+0.01%) ⬆️
hipBLAS 90.81% <ø> (ø) Carriedforward from 0ff7418
hipBLASLt 41.62% <59.09%> (+0.27%) ⬆️
hipCUB 82.68% <ø> (ø) Carriedforward from 0ff7418
hipDNN 85.91% <ø> (ø) Carriedforward from 0ff7418
hipFFT 50.17% <ø> (ø) Carriedforward from 0ff7418
hipRAND 76.12% <ø> (ø) Carriedforward from 0ff7418
hipSOLVER 69.18% <ø> (ø) Carriedforward from 0ff7418
hipSPARSE 86.55% <ø> (ø) Carriedforward from 0ff7418
rocBLAS 48.06% <ø> (ø) Carriedforward from 0ff7418
rocFFT 46.30% <ø> (ø) Carriedforward from 0ff7418
rocRAND 57.07% <ø> (ø) Carriedforward from 0ff7418
rocSOLVER 76.92% <ø> (ø) Carriedforward from 0ff7418
rocSPARSE 72.37% <ø> (ø) Carriedforward from 0ff7418
rocThrust 91.36% <ø> (ø) Carriedforward from 0ff7418

*This pull request uses carry forward flags. Click here to find out more.

Files with missing lines Coverage Δ
...ts/hipblaslt/library/include/hipblaslt/hipblaslt.h 75.00% <ø> (ø)
...cts/hipblaslt/library/src/amd_detail/hipblaslt.cpp 47.54% <100.00%> (ø)
...rary/src/amd_detail/rocblaslt/src/include/handle.h 84.44% <ø> (ø)
...ary/src/amd_detail/rocblaslt/src/rocblaslt_mat.cpp 83.59% <100.00%> (+0.43%) ⬆️
...t/library/src/amd_detail/rocblaslt/src/utility.cpp 28.17% <100.00%> (+0.81%) ⬆️
...blaslt/tensilelite/Tensile/Components/Signature.py 92.12% <100.00%> (+0.38%) ⬆️
...ects/hipblaslt/tensilelite/Tensile/KernelWriter.py 70.84% <100.00%> (+0.02%) ⬆️
...blaslt/tensilelite/Tensile/KernelWriterAssembly.py 69.08% <100.00%> (+0.04%) ⬆️
...aslt/tensilelite/Tensile/KernelWriterConversion.py 83.73% <100.00%> (+0.08%) ⬆️
...laslt/library/src/amd_detail/include/auxiliary.hpp 2.36% <54.55%> (+2.36%) ⬆️
... and 3 more

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread projects/hipblaslt/tensilelite/Tensile/KernelWriterAssembly.py Outdated
@shbae shbae force-pushed the users/sbae/64bit_offset_support branch from 6312fcd to 8009839 Compare May 20, 2026 17:59
@shbae shbae force-pushed the users/sbae/64bit_offset_support branch from a5d7f24 to 6175079 Compare May 29, 2026 02:14
@mpanoop

mpanoop commented May 29, 2026

Copy link
Copy Markdown
Contributor

@shbae, we need account for post GSU scenario and KernelOutputConversion.py should be updated to add the offsets for General Batched GEMM scenario.

@shbae shbae force-pushed the users/sbae/64bit_offset_support branch from 90e3d97 to a4a15d7 Compare May 29, 2026 23:18
@KKyang KKyang requested a review from jichangjichang May 30, 2026 09:43
@KKyang

KKyang commented May 30, 2026

Copy link
Copy Markdown
Contributor

@jichangjichang this will greatly increase the sgpr usage and affect the preload data.

Comment thread projects/hipblaslt/tensilelite/Tensile/KernelWriterAssembly.py Outdated
Comment thread projects/hipblaslt/tensilelite/Tensile/KernelWriterAssembly.py Outdated
Comment thread projects/hipblaslt/tensilelite/Tensile/KernelWriterAssembly.py Outdated

@randyh62 randyh62 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me

@shbae shbae force-pushed the users/sbae/64bit_offset_support branch from 1e8f008 to a68b688 Compare June 3, 2026 00:09
@shbae shbae force-pushed the users/sbae/64bit_offset_support branch 2 times, most recently from 59fabe4 to 0da219f Compare June 12, 2026 02:57
@shbae shbae changed the title [Draft] 64-bit offset support in hipBLASLt 64-bit offset support in hipBLASLt Jun 12, 2026
@shbae shbae marked this pull request as ready for review June 12, 2026 21:39
@shbae shbae force-pushed the users/sbae/64bit_offset_support branch from d8b4b88 to 3bb065b Compare June 15, 2026 21:46
@mpanoop mpanoop mentioned this pull request Jun 16, 2026
1 task
@shbae shbae force-pushed the users/sbae/64bit_offset_support branch from 58f2d93 to 669e2d1 Compare June 17, 2026 16:58
@shbae

shbae commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

@jichangjichang this will greatly increase the sgpr usage and affect the preload data.

Hi @KKyang and @jichangjichang, this PR is ready to be reviewed, and I've implemented it with minimum usage of SGPR, which requires only 2 temporary SGPRs during updating corresponding offset to each matrix pointer. Please, let me know if you have any comments or questions for this PR. Thank you!

@jichangjichang

Copy link
Copy Markdown
Contributor

Could you add test to verify it with all solution for some small sizes for batch offset test?
You can refer to "matmul_heuristic_all_solutions"

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds end-to-end 64-bit batch offset support for hipBLASLt general-batched (pointer-array) GEMM by plumbing new matrix-layout offset attributes through the rocblaslt/hipblaslt API layers, TensileLite host argument packing, and TensileLite kernel generation/assembly address calculations, plus introducing a dedicated test suite.

Changes:

  • Extend matrix layout descriptors and validation to carry per-matrix 64-bit batch offsets (A/B/C/D) and pass them through rocblaslt → TensileLite inputs/args.
  • Update TensileLite kernel signature generation and assembly/kernel writers to load/apply 64-bit offsets when computing per-batch base addresses in pointer-array mode.
  • Add new matmul_batch_offset gtest entry + YAML coverage and a dedicated client-side test implementation.

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
projects/hipblaslt/tensilelite/Tensile/KernelWriterConversion.py Adds offset args to conversion kernel signature and applies offsets when indexing pointer arrays for C/D.
projects/hipblaslt/tensilelite/Tensile/KernelWriterAssembly.py Loads batch offsets from kernargs and applies 64-bit address arithmetic for A/B loads and C/D stores in pointer-array mode.
projects/hipblaslt/tensilelite/Tensile/KernelWriter.py Tracks kernarg byte offsets for batchOffset* fields in writer state.
projects/hipblaslt/tensilelite/Tensile/Components/Signature.py Appends batchOffsetA/B/C/D u64 args to kernarg tail and records their byte offsets for assembly loaders.
projects/hipblaslt/tensilelite/src/ContractionSolution.cpp Appends batchOffset* args to kernel invocations (SupportUserArgs and conversion paths).
projects/hipblaslt/tensilelite/rocisa/rocisa/src/code.cpp Exposes signature offset metadata to Python bindings.
projects/hipblaslt/tensilelite/include/Tensile/ContractionProblem.hpp Extends ContractionInputs with batchOffsetA/B/C/D.
projects/hipblaslt/library/src/amd_detail/rocblaslt/src/utility.cpp Updates layout-attribute stringification and adds OFFSET attribute name.
projects/hipblaslt/library/src/amd_detail/rocblaslt/src/tensile_host.cpp Converts user offsets (elements) to byte offsets for kernel consumption.
projects/hipblaslt/library/src/amd_detail/rocblaslt/src/rocblaslt_mat.cpp Plumbs batch_offset_* through problem construction and kernel selection paths.
projects/hipblaslt/library/src/amd_detail/rocblaslt/src/rocblaslt_auxiliary.cpp Implements matrix layout OFFSET attribute and fixes destroy API typo in implementation.
projects/hipblaslt/library/src/amd_detail/rocblaslt/src/include/rocblaslt_mat_utils.hpp Adds offset validation rules (incl. MX-type restriction) and plumbs offsets through arg validation.
projects/hipblaslt/library/src/amd_detail/rocblaslt/src/include/handle.h Adds batch_offset field to matrix layout descriptor.
projects/hipblaslt/library/src/amd_detail/rocblaslt/include/rocblaslt-types.h Adds ROCBLASLT_MATRIX_LAYOUT_OFFSET enum and batch_offset_* fields to RocblasltContractionProblem.
projects/hipblaslt/library/src/amd_detail/rocblaslt/include/rocblaslt-auxiliary.h Renames rocblaslt_matrix_layout_destoryrocblaslt_matrix_layout_destroy in public header.
projects/hipblaslt/library/src/amd_detail/include/auxiliary.hpp Adds hip_datatype_is_mxtype helper for sub-byte datatype checks.
projects/hipblaslt/library/src/amd_detail/hipblaslt.cpp Updates hipblasLt wrapper to call the corrected destroy function name.
projects/hipblaslt/library/include/hipblaslt/hipblaslt.h Adds HIPBLASLT_MATRIX_LAYOUT_OFFSET attribute to public hipblasLt API.
projects/hipblaslt/clients/tests/src/matmul_gtest.cpp Wires new matmul_batch_offset test function into gtest dispatch/filter.
projects/hipblaslt/clients/tests/data/matmul_gtest.yaml Adds quick/pre_checkin/nightly batch-offset test cases (including very large offsets).
projects/hipblaslt/clients/tests/data/hipblaslt_common.yaml Adds CLI/YAML argument definitions and defaults for batch_offset_{a,b,c,d}.
projects/hipblaslt/clients/common/include/testing_matmul_batch_offset.hpp New test implementation validating offset behavior vs CPU reference.
projects/hipblaslt/clients/common/include/hipblaslt_arguments.hpp Adds batch_offset_{a,b,c,d} fields to Arguments struct and serialization macros.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread projects/hipblaslt/clients/common/include/testing_matmul_batch_offset.hpp Outdated
shbae added 24 commits July 1, 2026 03:58
… to add arguments appropriately and to use only two extra SGPRs
…d excludes custom kernels solutions for General batched GEMM.
@shbae shbae force-pushed the users/sbae/64bit_offset_support branch from 24de4c7 to 05fff7d Compare July 1, 2026 07:59
@therock-pr-bot

therock-pr-bot Bot commented Jul 1, 2026

Copy link
Copy Markdown

❌ PR Check — Action Required

Check Status Details
🌿 Branch Name ✅ Pass
📝 PR Title/Description ❌ Fail Error: Title does not follow Conventional Commits style.
Expected: start with a valid type (feat, fix, docs, …).
Desired format: type(optional-scope): short description
───
Error: PR description must reference a JIRA ID, ISSUE ID, or a GitHub closing keyword.
Expected: include a JIRA ID / ISSUE ID line (separator : or -, or omitted; value may be a JIRA key, a number with/without #, or a link), OR a closing keyword + issue reference. Accepted examples:
JIRA ID : TESTAUTO-6039
JIRA ID - #330
JIRA ID #330
ISSUE ID : TESTUTO-3334
ISSUE ID #3334
ISSUE ID - TESTAUTO-3433
ISSUE ID : https://github.com/<org_name>/<repo_name>/issues/1234
Closes #10
Fixes octo-org/octo-repo#100
Resolves: #123
#123
https://github.com/<org_name>/<repo_name>/issues/123
Current: no valid JIRA/ISSUE/closing-keyword reference found
Forbidden Files ✅ Pass
🧪 Unit Test ❌ Fail Error: Source/code files changed without an accompanying unit test.
Expected: add at least one test file named like test_<name>.py / test_<name>.cpp (or <name>_test.*).
Current: code file(s) changed: projects/hipblaslt/clients/common/include/hipblaslt_arguments.hpp, projects/hipblaslt/clients/common/include/testing_matmul_batch_offset.hpp, projects/hipblaslt/clients/tests/src/matmul_gtest.cpp, projects/hipblaslt/library/include/hipblaslt/hipblaslt.h, projects/hipblaslt/library/src/amd_detail/hipblaslt.cpp (+18 more); no test file found
🔎 pre-commit ✅ Pass
🚫 Draft PR 🔜 To Be Enabled
🚩 Feature Flag 🔜 To Be Enabled
📊 Code Coverage 🔜 To Be Enabled

⚠️ 2 policy check(s) failed. Please address the issues above before this PR can be Reviewed.

🚫 Please fix the failed policies

  • ❌ PR Title/Description
  • ❌ Unit Test

The Not ready to Review label was added to this PR. Once all policies pass, the label is removed automatically.

📖 Need help? See the Policy FAQ for details on every check and how to fix failures.

@therock-pr-bot

therock-pr-bot Bot commented Jul 1, 2026

Copy link
Copy Markdown

🚫 Please fix the failed policies before requesting reviews.

The following policy checks failed:

  • ❌ PR Title/Description
  • ❌ Unit Test

The Not ready to Review label has been added to this PR.
Once all policies pass, the label will be removed automatically.

@shbae shbae changed the title 64-bit offset support in hipBLASLt JIRA ID : AIHPBLAS-1456 -- 64-bit offset support in hipBLASLt Jul 1, 2026
@shbae shbae changed the title JIRA ID : AIHPBLAS-1456 -- 64-bit offset support in hipBLASLt [JIRA: AIHPBLAS-1456] - 64-bit offset support in hipBLASLt Jul 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants