Add swizzled local memory accessors for FP16 and INT8 on the PTX backend #841
Merged
stratika merged 4 commits on May 28, 2026
Pull request overview
This PR adds PTX-only swizzled local/shared-memory accessors to KernelContext for FP16 and INT8 tile layouts, with backend plugin registration and unit coverage for the new API.
Changes:
- Adds public `KernelContext` swizzled load/store helpers for FP16 stride-32, FP16 stride-16, and INT8.
- Implements PTX graph builder plugins, Graal nodes, and PTX LIR emission for the new accessors.
- Registers unsupported-backend stubs and adds unit tests to the Tornado test suite.
Reviewed changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tornado-api/src/main/java/uk/ac/manchester/tornado/api/KernelContext.java | Adds Java API and fallback implementations for swizzled local-memory accessors. |
| tornado-drivers/ptx/src/main/java/uk/ac/manchester/tornado/drivers/ptx/graal/compiler/plugins/PTXGraphBuilderPlugins.java | Registers PTX lowering plugins for the new accessors. |
| tornado-drivers/ptx/src/main/java/uk/ac/manchester/tornado/drivers/ptx/graal/lir/PTXLIRStmt.java | Emits PTX shared-memory swizzle address calculations and load/store instructions. |
| tornado-drivers/ptx/src/main/java/uk/ac/manchester/tornado/drivers/ptx/graal/nodes/SwizzledLoadFP16Stride32Node.java | Adds FP16 stride-32 swizzled load node. |
| tornado-drivers/ptx/src/main/java/uk/ac/manchester/tornado/drivers/ptx/graal/nodes/SwizzledStoreFP16Stride32Node.java | Adds FP16 stride-32 swizzled store node. |
| tornado-drivers/ptx/src/main/java/uk/ac/manchester/tornado/drivers/ptx/graal/nodes/SwizzledLoadFP16Stride16Node.java | Adds FP16 stride-16 swizzled load node. |
| tornado-drivers/ptx/src/main/java/uk/ac/manchester/tornado/drivers/ptx/graal/nodes/SwizzledStoreFP16Stride16Node.java | Adds FP16 stride-16 swizzled store node. |
| tornado-drivers/ptx/src/main/java/uk/ac/manchester/tornado/drivers/ptx/graal/nodes/SwizzledLoadInt8Node.java | Adds INT8 swizzled load node. |
| tornado-drivers/ptx/src/main/java/uk/ac/manchester/tornado/drivers/ptx/graal/nodes/SwizzledStoreInt8Node.java | Adds INT8 swizzled store node. |
| tornado-drivers/opencl/src/main/java/uk/ac/manchester/tornado/drivers/opencl/graal/compiler/plugins/OCLGraphBuilderPlugins.java | Registers unsupported stubs for the new PTX-only API. |
| tornado-drivers/spirv/src/main/java/uk/ac/manchester/tornado/drivers/spirv/graal/compiler/plugins/SPIRVGraphBuilderPlugins.java | Registers unsupported stubs for the new PTX-only API. |
| tornado-drivers/metal/src/main/java/uk/ac/manchester/tornado/drivers/metal/graal/compiler/plugins/MetalGraphBuilderPlugins.java | Registers unsupported stubs for the new PTX-only API. |
| tornado-unittests/src/main/java/uk/ac/manchester/tornado/unittests/kernelcontext/local/memory/TestSwizzledLocalArrays.java | Adds PTX-focused unit tests for FP16 and INT8 swizzled local-memory round trips. |
| tornado-assembly/src/bin/tornado-test | Adds the new swizzled local-array test class to the test suite. |
    r.register(new InvocationPlugin("swizzleStoreFp16Stride32", InvocationPlugin.Receiver.class, HalfFloat[].class, int.class, int.class, int.class, HalfFloat.class) {
        @Override
        public boolean apply(GraphBuilderContext b, ResolvedJavaMethod targetMethod, Receiver receiver, ValueNode local_array, ValueNode row, ValueNode column, ValueNode stride, ValueNode value) {
            b.addPush(JavaKind.Object, new SwizzledStoreFP16Stride32Node(local_array, row, column, stride, value));

    r.register(new InvocationPlugin("swizzleStoreFp16Stride16", InvocationPlugin.Receiver.class, HalfFloat[].class, int.class, int.class, int.class, HalfFloat.class) {
        @Override
        public boolean apply(GraphBuilderContext b, ResolvedJavaMethod targetMethod, Receiver receiver, ValueNode local_array, ValueNode row, ValueNode column, ValueNode stride, ValueNode value) {
            b.addPush(JavaKind.Object, new SwizzledStoreFP16Stride16Node(local_array, row, column, stride, value));

        @Override
        public boolean apply(GraphBuilderContext b, ResolvedJavaMethod targetMethod, Receiver receiver,
                ValueNode local_array, ValueNode row, ValueNode column, ValueNode stride, ValueNode value) {
            b.addPush(JavaKind.Byte, new SwizzledStoreInt8Node(local_array, row, column, stride, value));
stratika (Collaborator) reviewed on May 27, 2026 and left a comment:
LGTM, please consider and try the comment from Copilot.
…d instead of addPush for stores, extend the unittest and include the swizzledstore* nodes in the TornadoHalfFloatReplacement)
Author (Collaborator) commented:
The comments have been applied on both this PR and PR #843.
stratika approved these changes on May 28, 2026
Description
This PR adds swizzled shared-memory accessors to the `KernelContext` for the PTX backend, providing bank-conflict-free layouts for FP16 and INT8 matrix tiles. The new accessors exposed through the `KernelContext` are:
- `swizzleLoadFp16Stride32` / `swizzleStoreFp16Stride32`
- `swizzleLoadFp16Stride16` / `swizzleStoreFp16Stride16`
- `swizzleLoadInt8` / `swizzleStoreInt8`

Each applies an involutive XOR permutation to the logical `(row, col)` coordinate before accessing shared memory, so that the resulting access pattern spreads across distinct memory banks instead of colliding. On NVIDIA GPUs shared memory has 32 banks of 4 bytes each. A naive row-major tile layout causes many threads in a warp to hit the same bank, serializing the access. The XOR rotates each row's bank assignment to avoid this.

The constants follow the CUTLASS `Swizzle<>` template, parameterized per layout.

The layout is primarily intended for staging matrix tiles for future Tensor Core (MMA) work, but is a general bank-conflict-avoidance mechanism usable by any kernel with a matching access pattern.
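To make the bank-conflict argument concrete, here is a minimal host-side Java sketch of an XOR swizzle. It is an illustration only: the class name, the tile width of 32 four-byte words, and the `col ^ (row % 32)` permutation are assumptions chosen for clarity, not the constants or the code used by this PR. It counts how many distinct banks a warp touches when 32 threads each read the same column of 32 consecutive rows.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; not the PTX lowering implemented in this PR.
final class XorSwizzleSketch {
    static final int BANKS = 32;     // 32 banks of 4 bytes each, as stated in the description
    static final int ROW_WORDS = 32; // assumed tile width, in 4-byte words

    // Naive row-major word index.
    static int linearIndex(int row, int col) {
        return row * ROW_WORDS + col;
    }

    // XOR-swizzled word index: the column is permuted by the low bits of the row.
    // Applying the same XOR twice restores the column, so the mapping is an involution.
    static int swizzledIndex(int row, int col) {
        return row * ROW_WORDS + (col ^ (row % BANKS));
    }

    // Number of distinct banks hit when 32 threads read column `col` of rows 0..31.
    static int distinctBanks(boolean swizzled, int col) {
        Set<Integer> banks = new HashSet<>();
        for (int row = 0; row < 32; row++) {
            int idx = swizzled ? swizzledIndex(row, col) : linearIndex(row, col);
            banks.add(idx % BANKS);
        }
        return banks.size();
    }

    public static void main(String[] args) {
        System.out.println("linear:   " + distinctBanks(false, 5)); // every row maps to bank 5
        System.out.println("swizzled: " + distinctBanks(true, 5));  // each row maps to a different bank
    }
}
```

Under these assumptions the linear layout sends all 32 accesses to a single bank, while the swizzled layout spreads them over all 32 banks. The real accessors work on FP16 and INT8 elements, so their constants differ from this word-granular example.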
Problem description
Efficient Tensor Core (MMA) matrix multiplication requires shared-memory tiles to be laid out so that warp-level matrix loads do not incur bank conflicts. Currently shared arrays are addressed linearly, which produces heavy bank conflicts for the strided access patterns matrix loads use. This PR adds the swizzled-layout support needed to lay those tiles out in shared memory conflict-free, ahead of the MMA work that will consume it.
Backend/s tested
Mark the backends affected by this PR.
OS tested
Mark the OS where this PR is tested.
Did you check on FPGAs?
If it is applicable, check your changes on FPGAs.
How to test the new patch?
The load/store functionality can be verified by running the unit test:

    tornado-test -V uk.ac.manchester.tornado.unittests.kernelcontext.local.memory.TestSwizzledLocalArrays

Each kernel in the test was also profiled with Nsight Compute and produces zero shared-memory bank conflicts (`l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld/st = 0`), with non-zero `smsp__inst_executed_op_shared_ld/st`, confirming the accesses execute.
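The round-trip property such a test relies on can also be sketched on the host, without a GPU. The example below is a hypothetical emulation, not the contents of `TestSwizzledLocalArrays`: the class name, the 16x64-byte tile, and the 16-byte chunk permutation are assumptions. It stores every logical element through a swizzled index, loads it back through the same mapping, and counts mismatches. A mapping that let two logical elements collide would overwrite data and show up as a mismatch.

```java
// Hypothetical host-side emulation of a swizzled INT8 store/load round trip.
final class SwizzleRoundTripSketch {
    static final int ROWS = 16;  // assumed tile height
    static final int COLS = 64;  // assumed bytes per row
    static final int CHUNK = 16; // assumed swizzle granularity in bytes

    // Permute the 16-byte chunk within a row by XOR with the low bits of the row.
    static int swizzle(int row, int col) {
        int chunk = col / CHUNK;
        int within = col % CHUNK;
        int permuted = chunk ^ (row % (COLS / CHUNK));
        return row * COLS + permuted * CHUNK + within;
    }

    static void store(byte[] tile, int row, int col, byte value) {
        tile[swizzle(row, col)] = value;
    }

    static byte load(byte[] tile, int row, int col) {
        return tile[swizzle(row, col)];
    }

    // Store a distinct-per-row pattern, read it back, and count differing elements.
    static int roundTripMismatches() {
        byte[] tile = new byte[ROWS * COLS];
        for (int row = 0; row < ROWS; row++) {
            for (int col = 0; col < COLS; col++) {
                store(tile, row, col, (byte) (row * COLS + col));
            }
        }
        int mismatches = 0;
        for (int row = 0; row < ROWS; row++) {
            for (int col = 0; col < COLS; col++) {
                if (load(tile, row, col) != (byte) (row * COLS + col)) {
                    mismatches++;
                }
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + roundTripMismatches());
    }
}
```

A host-side check like this only validates the index arithmetic. Whether the layout actually avoids bank conflicts on the device still needs the Nsight Compute counters mentioned above.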