-The solution implements a parallel row-wise sum reduction for a 2D matrix using LayoutTensor. Here's a comprehensive breakdown:
+The solution implements a parallel row-wise sum reduction for a 2D matrix using TileTensor. Here's a comprehensive breakdown:
### Matrix layout and block mapping
```txt
-Input Matrix (4×6) with LayoutTensor: Block Assignment:
+Input Matrix (4×6) with TileTensor: Block Assignment:
[[ a[0,0] a[0,1] a[0,2] a[0,3] a[0,4] a[0,5] ] → Block(0,0)
[ a[1,0] a[1,1] a[1,2] a[1,3] a[1,4] a[1,5] ] → Block(0,1)
[ a[2,0] a[2,1] a[2,2] a[2,3] a[2,4] a[2,5] ] → Block(0,2)
@@ -156,9 +156,9 @@ Input Matrix (4×6) with LayoutTensor: Block Assignment:
- Each block processes one complete row
2. **Memory Access Pattern**:
- - LayoutTensor 2D indexing for input: `a[batch, local_i]`
+ - TileTensor 2D indexing for input: `a[batch, local_i]`
- Shared memory for efficient reduction
- - LayoutTensor 2D indexing for output: `output[batch, 0]`
+ - TileTensor 2D indexing for output: `output[batch, 0]`
3. **Parallel Reduction Logic**:
@@ -196,7 +196,7 @@ Input Matrix (4×6) with LayoutTensor: Block Assignment:
### Performance optimizations
1. **Memory Efficiency**:
- - Coalesced memory access through LayoutTensor
+ - Coalesced memory access through TileTensor
- Shared memory for fast reduction
- Single write per row result
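
Putting the pieces above together, here is a minimal sketch of the kernel this section describes. It is illustrative only: it assumes the `stack_allocation`/`row_major` imports used elsewhere in this chapter, one block per row mapped along the y grid axis, and a power-of-two `TPB` at least as large as the row length.

```mojo
var shared = stack_allocation[
    dtype=dtype, address_space=AddressSpace.SHARED
](row_major[TPB]())
var batch = block_idx.y  # one block per row (grid-axis choice assumed)
var local_i = thread_idx.x

# Load this row into shared memory; zero-fill the unused tail
if local_i < size:
    shared[local_i] = a[batch, local_i]
else:
    shared[local_i] = 0
barrier()

# Tree reduction: halve the number of active threads each step
var stride = TPB // 2
while stride > 0:
    if local_i < stride:
        shared[local_i] += shared[local_i + stride]
    barrier()
    stride = stride // 2

# Single write per row result
if local_i == 0:
    output[batch, 0] = shared[0]
```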
diff --git "a/book/src/puzzle_16/na\303\257ve.md" "b/book/src/puzzle_16/na\303\257ve.md"
index 936f65aa..429e5d27 100644
--- "a/book/src/puzzle_16/na\303\257ve.md"
+++ "b/book/src/puzzle_16/na\303\257ve.md"
@@ -24,9 +24,9 @@ The key insight is understanding how to map 2D thread indices to matrix elements
Layout configuration:
-- Input A: `Layout.row_major(SIZE, SIZE)`
-- Input B: `Layout.row_major(SIZE, SIZE)`
-- Output: `Layout.row_major(SIZE, SIZE)`
+- Input A: `row_major[SIZE, SIZE]()`
+- Input B: `row_major[SIZE, SIZE]()`
+- Output: `row_major[SIZE, SIZE]()`
## Code to complete
@@ -108,7 +108,7 @@ expected: HostBuffer([4.0, 6.0, 12.0, 22.0])
-## Solution: Idiomatic LayoutTensor tiling
+## Solution: Idiomatic TileTensor tiling
@@ -335,14 +335,14 @@ This implementation achieves high performance through:
-The idiomatic tiled matrix multiplication leverages Mojo's LayoutTensor API and asynchronous memory operations for a beautifully clean implementation.
+The idiomatic tiled matrix multiplication leverages Mojo's TileTensor API and asynchronous memory operations for a beautifully clean implementation.
**Key Point: This implementation performs standard matrix multiplication A × B using coalesced loading for both matrices.**
**What this implementation does:**
- **Matrix operation**: Standard \\(A \times B\\) multiplication (not \\(A \times B^T\\))
-- **Loading pattern**: Both matrices use `Layout.row_major(1, TPB)` for coalesced access
+- **Loading pattern**: Both matrices use `row_major[1, TPB]()` for coalesced access
- **Computation**: `acc += a_shared[local_row, k] * b_shared[k, local_col]`
- **Data layout**: No transposition during loading - both matrices loaded in same orientation
@@ -354,7 +354,7 @@ The idiomatic tiled matrix multiplication leverages Mojo's LayoutTensor API and
With the \\((9 \times 9)\\) matrix size, we get perfect tiling that eliminates all boundary checks:
-1. **LayoutTensor tile API**
+1. **TileTensor tile API**
```mojo
out_tile = output.tile[TPB, TPB](block_idx.y, block_idx.x)
@@ -362,7 +362,7 @@ With the \\((9 \times 9)\\) matrix size, we get perfect tiling that eliminates a
b_tile = b.tile[TPB, TPB](idx, block_idx.x)
```
- This directly expresses "get the tile at position (block_idx.y, block_idx.x)" without manual coordinate calculation. See the [documentation](https://docs.modular.com/mojo/kernels/layout/layout_tensor/LayoutTensor/#tile) for more details.
+ This directly expresses "get the tile at position (block_idx.y, block_idx.x)" without manual coordinate calculation. See the [documentation](https://docs.modular.com/mojo/kernels/layout/tile_tensor/TileTensor/#tile) for more details.
2. **Asynchronous memory operations**
@@ -393,14 +393,14 @@ With the \\((9 \times 9)\\) matrix size, we get perfect tiling that eliminates a
3. **Optimized memory access layouts**
```mojo
- comptime load_a_layout = Layout.row_major(1, TPB) # Coalesced loading
- comptime load_b_layout = Layout.row_major(1, TPB) # Coalesced loading
+ comptime load_a_layout = row_major[1, TPB]() # Coalesced loading
+ comptime load_b_layout = row_major[1, TPB]() # Coalesced loading
# Note: Both matrices use the same layout for standard A × B multiplication
```
**Memory Access Analysis for Current Implementation:**
- Both matrices use `Layout.row_major(1, TPB)` for coalesced loading from global memory:
+ Both matrices use `row_major[1, TPB]()` for coalesced loading from global memory:
- `load_a_layout`: Threads cooperate to load consecutive elements from matrix A rows
- `load_b_layout`: Threads cooperate to load consecutive elements from matrix B rows
- **Key insight**: Thread layout determines how threads cooperate during copy, not the final data layout
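
The layouts above take effect when handed to the async copy, as in this sketch (the copy calls follow the form used later in the book; the wait primitive's name is an assumption):

```mojo
copy_dram_to_sram_async[thread_layout=load_a_layout](a_shared, a_tile)
copy_dram_to_sram_async[thread_layout=load_b_layout](b_shared, b_tile)
async_copy_wait_all()  # assumed wait call before touching shared memory
barrier()              # tiles are now visible to every thread in the block
```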
@@ -422,11 +422,11 @@ With the \\((9 \times 9)\\) matrix size, we get perfect tiling that eliminates a
- Matrix A tile: threads load A[block_row, k], A[block_row, k+1], A[block_row, k+2]... (consecutive)
- Matrix B tile: threads load B[k, block_col], B[k, block_col+1], B[k, block_col+2]... (consecutive)
- Both patterns are coalesced with Layout.row_major(1, TPB)
+ Both patterns are coalesced with row_major[1, TPB]()
```
**Three separate memory concerns:**
- 1. **Global-to-shared coalescing**: `Layout.row_major(1, TPB)` ensures coalesced global memory access
+ 1. **Global-to-shared coalescing**: `row_major[1, TPB]()` ensures coalesced global memory access
2. **Shared memory computation**: `a_shared[local_row, k] * b_shared[k, local_col]` avoids bank conflicts
3. **Matrix operation**: The computation pattern determines this is A × B, not A × B^T
@@ -465,7 +465,7 @@ This implementation shows how high-level abstractions can express complex GPU al
| Feature | Manual Tiling | Idiomatic Tiling |
|---------|--------------|------------------|
-| Memory access | Direct indexing with bounds checks | LayoutTensor tile API |
+| Memory access | Direct indexing with bounds checks | TileTensor tile API |
| Tile loading | Explicit element-by-element copying | Dedicated copy engine bulk transfers |
| Shared memory | Manual initialization (defensive) | Managed by copy functions |
| Code complexity | More verbose with explicit indexing | More concise with higher-level APIs |
@@ -481,7 +481,7 @@ The current implementation does NOT use transposed loading. This section is pure
**Current implementation recap:**
-- Uses `Layout.row_major(1, TPB)` for both matrices
+- Uses `row_major[1, TPB]()` for both matrices
- Performs standard A × B multiplication
- No data transposition during copy
@@ -492,8 +492,8 @@ While this puzzle uses standard coalesced loading for both matrices, the layout
```mojo
# Example: Loading pre-transposed matrix B^T to compute A × B
# (This is NOT what the current implementation does)
-comptime load_b_layout = Layout.row_major(TPB, 1) # Load B^T with coalesced access
-comptime store_b_layout = Layout.row_major(1, TPB) # Store as B in shared memory
+comptime load_b_layout = row_major[TPB, 1]() # Load B^T with coalesced access
+comptime store_b_layout = row_major[1, TPB]() # Store as B in shared memory
copy_dram_to_sram_async[src_thread_layout=load_b_layout, dst_thread_layout=store_b_layout](b_shared, b_tile)
```
@@ -506,7 +506,7 @@ copy_dram_to_sram_async[src_thread_layout=load_b_layout, dst_thread_layout=store
**Key distinction:**
-- **Current implementation**: Both matrices use `Layout.row_major(1, TPB)` for standard \\(A \times B\\) multiplication
+- **Current implementation**: Both matrices use `row_major[1, TPB]()` for standard \\(A \times B\\) multiplication
- **Transposed loading example**: Would use different layouts to handle pre-transposed data or different matrix operations
This demonstrates Mojo's philosophy: providing low-level control when needed while maintaining high-level abstractions for common cases.
@@ -518,13 +518,13 @@ This demonstrates Mojo's philosophy: providing low-level control when needed whi
**What the idiomatic tiled implementation actually does:**
1. **Matrix Operation**: Standard A × B multiplication
-2. **Memory Loading**: Both matrices use `Layout.row_major(1, TPB)` for coalesced access
+2. **Memory Loading**: Both matrices use `row_major[1, TPB]()` for coalesced access
3. **Computation Pattern**: `acc += a_shared[local_row, k] * b_shared[k, local_col]`
4. **Data Layout**: No transposition during loading
**Why this is optimal:**
-- **Coalesced global memory access**: `Layout.row_major(1, TPB)` ensures efficient loading
+- **Coalesced global memory access**: `row_major[1, TPB]()` ensures efficient loading
- **Bank conflict avoidance**: Shared memory access pattern avoids conflicts
- **Standard algorithm**: Implements the most common matrix multiplication pattern
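
A hedged end-to-end skeleton tying these points together (the loop bound, accumulator type, and wait call are assumptions; the full solution above is authoritative):

```mojo
out_tile = output.tile[TPB, TPB](block_idx.y, block_idx.x)
var acc: Scalar[dtype] = 0.0

for idx in range(SIZE // TPB):  # perfect tiling: no boundary checks
    a_tile = a.tile[TPB, TPB](block_idx.y, idx)
    b_tile = b.tile[TPB, TPB](idx, block_idx.x)
    copy_dram_to_sram_async[thread_layout=load_a_layout](a_shared, a_tile)
    copy_dram_to_sram_async[thread_layout=load_b_layout](b_shared, b_tile)
    async_copy_wait_all()  # assumed wait call
    barrier()
    for k in range(TPB):
        acc += a_shared[local_row, k] * b_shared[k, local_col]
    barrier()  # keep shared tiles intact until all threads finish reading

out_tile[local_row, local_col] = acc
```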
diff --git a/book/src/puzzle_17/puzzle_17.md b/book/src/puzzle_17/puzzle_17.md
index ef8f2ec5..fd7d0fe1 100644
--- a/book/src/puzzle_17/puzzle_17.md
+++ b/book/src/puzzle_17/puzzle_17.md
@@ -141,7 +141,7 @@ Let's break down how this works in the larger context:
3. **Custom op registration**:
- The `@compiler.register("conv1d")` decorator exposes our operation to MAX Graph. See [@compiler.register](https://docs.modular.com/mojo/manual/decorators/compiler-register/)
- The `execute` method parameters define the interface (inputs, outputs, context)
- - Input/output tensors are converted to LayoutTensors for use in our kernel
+ - Input/output tensors are converted to TileTensors for use in our kernel
- Device context manages GPU memory allocation and kernel execution
4. **Kernel execution**:
@@ -180,7 +180,7 @@ Let's break down how this works in the larger context:
kernel_tensor = kernel.to_layout_tensor()
```
- - MAX Graph tensors are converted to Mojo LayoutTensors
+ - MAX Graph tensors are converted to Mojo TileTensors
- This allows our kernel to work with them directly
- The layouts are extracted for compile-time optimization
diff --git a/book/src/puzzle_18/puzzle_18.md b/book/src/puzzle_18/puzzle_18.md
index b21e84e8..e1884e5c 100644
--- a/book/src/puzzle_18/puzzle_18.md
+++ b/book/src/puzzle_18/puzzle_18.md
@@ -42,8 +42,8 @@ Our GPU implementation uses parallel reduction for both finding the maximum valu
Layout configuration:
-- Input tensor: `Layout.row_major(SIZE)`
-- Output tensor: `Layout.row_major(SIZE)`
+- Input tensor: `row_major[SIZE]()`
+- Output tensor: `row_major[SIZE]()`
- Custom op parameters: `{"input_size": input_tensor.shape[0]}`
Key aspects of this puzzle include:
@@ -257,8 +257,8 @@ def softmax_gpu_kernel[
input_size: Int,
dtype: DType = DType.float32,
](
- output: LayoutTensor[mut=True, dtype, layout],
- input: LayoutTensor[mut=False, dtype, layout],
+ output: TileTensor[mut=True, dtype, layout],
+ input: TileTensor[mut=False, dtype, layout],
)
```
@@ -273,8 +273,8 @@ The kernel is parameterized with:
#### Shared memory allocation
```mojo
-shared_max = LayoutTensor[dtype, Layout.row_major(BLOCK_DIM_X), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
-shared_sum = LayoutTensor[dtype, Layout.row_major(BLOCK_DIM_X), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
+shared_max = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[BLOCK_DIM_X]())
+shared_sum = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[BLOCK_DIM_X]())
```
The kernel allocates two shared memory buffers:
diff --git a/book/src/puzzle_19/puzzle_19.md b/book/src/puzzle_19/puzzle_19.md
index 53cea305..b581784e 100644
--- a/book/src/puzzle_19/puzzle_19.md
+++ b/book/src/puzzle_19/puzzle_19.md
@@ -81,10 +81,10 @@ Our GPU implementation **reuses and combines optimized kernels from previous puz
Layout configuration:
-- Query tensor: `Layout.row_major(d)`
-- Key tensor: `Layout.row_major(seq_len, d)`
-- Value tensor: `Layout.row_major(seq_len, d)`
-- Output tensor: `Layout.row_major(d)`
+- Query tensor: `row_major[d]()`
+- Key tensor: `row_major[seq_len, d]()`
+- Value tensor: `row_major[seq_len, d]()`
+- Output tensor: `row_major[d]()`
- Custom op parameters: `{"seq_len": seq_len, "d": d, "dtype": dtype}`
Key aspects of this puzzle include:
@@ -121,7 +121,7 @@ To complete this puzzle, we'll leverage the tiled matmul kernel from [Puzzle 16]
**Transpose Kernel Implementation Guide:**
-1. **Shared Memory Setup**: Use `LayoutTensor[dtype, Layout.row_major(TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()` to create a square `TRANSPOSE_BLOCK_DIM_XY` × `TRANSPOSE_BLOCK_DIM_XY` shared memory tile for efficient data exchange between threads
+1. **Shared Memory Setup**: Use `stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY]())` to create a square `TRANSPOSE_BLOCK_DIM_XY` × `TRANSPOSE_BLOCK_DIM_XY` shared memory tile for efficient data exchange between threads
2. **Thread Indexing**: Map threads to matrix elements:
- `local_row = thread_idx.y`, `local_col = thread_idx.x` (position within the block)
diff --git a/book/src/puzzle_23/elementwise.md b/book/src/puzzle_23/elementwise.md
index 0942d44d..cb0ad5df 100644
--- a/book/src/puzzle_23/elementwise.md
+++ b/book/src/puzzle_23/elementwise.md
@@ -10,7 +10,7 @@ This puzzle covers:
- **Functional GPU programming** with `elementwise`
- **Automatic SIMD vectorization** within GPU threads
-- **LayoutTensor operations** for safe memory access
+- **TileTensor operations** for safe memory access
- **GPU thread hierarchy** vs SIMD operations
- **Capturing semantics** in nested functions
@@ -24,7 +24,7 @@ The implementation covers fundamental patterns applicable to all GPU functional
- Vector size: `SIZE = 1024`
- Data type: `DType.float32`
- SIMD width: Target-dependent (determined by GPU architecture and data type)
-- Layout: `Layout.row_major(SIZE)` (1D row-major)
+- Layout: `row_major[SIZE]()` (1D row-major)
## Code to complete
diff --git a/book/src/puzzle_23/puzzle_23.md b/book/src/puzzle_23/puzzle_23.md
index 47aeff5d..c34ebc2d 100644
--- a/book/src/puzzle_23/puzzle_23.md
+++ b/book/src/puzzle_23/puzzle_23.md
@@ -71,7 +71,7 @@ Before diving into functional patterns, ensure you're comfortable with:
- **Basic GPU concepts**: Memory hierarchy, thread execution, SIMD operations
- **Mojo fundamentals**: Parameter functions, compile-time specialization, capturing semantics
-- **LayoutTensor operations**: Loading, storing, and tensor manipulation
+- **TileTensor operations**: Loading, storing, and tensor manipulation
- **GPU memory management**: Buffer allocation, host-device synchronization
## Learning path
@@ -86,7 +86,7 @@ Start with the foundation: automatic thread management and SIMD vectorization.
- Functional GPU programming with `elementwise`
- Automatic SIMD vectorization within GPU threads
-- LayoutTensor operations for safe memory access
+- TileTensor operations for safe memory access
- Capturing semantics in nested functions
**Key pattern:**
diff --git a/book/src/puzzle_23/tile.md b/book/src/puzzle_23/tile.md
index 2a26d174..bd0000b1 100644
--- a/book/src/puzzle_23/tile.md
+++ b/book/src/puzzle_23/tile.md
@@ -31,7 +31,7 @@ But with a completely different execution strategy optimized for memory hierarch
- Tile size: `TILE_SIZE = 32`
- Data type: `DType.float32`
- SIMD width: GPU-dependent (for operations within tiles)
-- Layout: `Layout.row_major(SIZE)` (1D row-major)
+- Layout: `row_major[SIZE]()` (1D row-major)
## Code to complete
@@ -58,7 +58,7 @@ For a 1024-element vector with `TILE_SIZE=32`: `1024 ÷ 32 = 32` tiles exactly.
### 2. **Tile extraction pattern**
-Check out the [LayoutTensor `.tile` documentation](https://docs.modular.com/mojo/kernels/layout/layout_tensor/LayoutTensor/#tile).
+Check out the [TileTensor `.tile` documentation](https://docs.modular.com/mojo/kernels/layout/tile_tensor/TileTensor/#tile).
```mojo
tile_id = indices[0] # Each thread gets one tile to process
diff --git a/book/src/puzzle_23/vectorize.md b/book/src/puzzle_23/vectorize.md
index 6098ac58..5dd31e81 100644
--- a/book/src/puzzle_23/vectorize.md
+++ b/book/src/puzzle_23/vectorize.md
@@ -32,7 +32,7 @@ But with sophisticated vectorization strategies for maximum performance.
- Tile size: `TILE_SIZE = 32`
- Data type: `DType.float32`
- SIMD width: GPU-dependent
-- Layout: `Layout.row_major(SIZE)` (1D row-major)
+- Layout: `row_major[SIZE]()` (1D row-major)
## 1. Manual vectorization approach
diff --git a/book/src/puzzle_24/puzzle_24.md b/book/src/puzzle_24/puzzle_24.md
index 59733f5d..d261d862 100644
--- a/book/src/puzzle_24/puzzle_24.md
+++ b/book/src/puzzle_24/puzzle_24.md
@@ -49,9 +49,9 @@ Learn the core warp primitives from `gpu.primitives.warp`:
```mojo
# 1. Reduction through shared memory
# Complex pattern we have seen earlier (from p12.mojo):
-shared = LayoutTensor[
-    dtype,
-    Layout.row_major(WARP_SIZE),
-    MutAnyOrigin,
-    address_space = AddressSpace.SHARED,
-].stack_allocation()
+shared = stack_allocation[
+    dtype=dtype, address_space=AddressSpace.SHARED
+](row_major[WARP_SIZE]())
@@ -93,7 +93,7 @@ Before diving into warp programming, ensure you're comfortable with:
- **Part V functional patterns**: Elementwise, tiled, and vectorized approaches
- **GPU thread hierarchy**: Understanding blocks, warps, and threads
-- **LayoutTensor operations**: Loading, storing, and tensor manipulation
+- **TileTensor operations**: Loading, storing, and tensor manipulation
- **Shared memory concepts**: Why barriers and tree reduction are complex
## Learning path
diff --git a/book/src/puzzle_24/warp_sum.md b/book/src/puzzle_24/warp_sum.md
index 1289baff..c04db78f 100644
--- a/book/src/puzzle_24/warp_sum.md
+++ b/book/src/puzzle_24/warp_sum.md
@@ -25,7 +25,7 @@ But the implementation teaches fundamental patterns for all warp-level GPU progr
- Data type: `DType.float32`
- Block configuration: `(WARP_SIZE, 1)` threads per block
- Grid configuration: `(1, 1)` blocks per grid
-- Layout: `Layout.row_major(SIZE)` (1D row-major)
+- Layout: `row_major[SIZE]()` (1D row-major)
## The traditional complexity (from Puzzle 12)
diff --git a/book/src/puzzle_25/puzzle_25.md b/book/src/puzzle_25/puzzle_25.md
index ac88c36f..8f0b7798 100644
--- a/book/src/puzzle_25/puzzle_25.md
+++ b/book/src/puzzle_25/puzzle_25.md
@@ -48,9 +48,9 @@ Learn the core communication primitives from `gpu.primitives.warp`:
```mojo
# Complex neighbor access pattern (traditional approach):
-shared = LayoutTensor[
-    dtype,
-    Layout.row_major(WARP_SIZE),
-    MutAnyOrigin,
-    address_space = AddressSpace.SHARED,
-].stack_allocation()
+shared = stack_allocation[
+    dtype=dtype, address_space=AddressSpace.SHARED
+](row_major[WARP_SIZE]())
@@ -89,7 +89,7 @@ Before diving into warp communication, ensure you're comfortable with:
- **Part VII warp fundamentals**: Understanding SIMT execution and basic warp operations (see [Puzzle 24](../puzzle_24/puzzle_24.md))
- **GPU thread hierarchy**: Blocks, warps, and lane numbering
-- **LayoutTensor operations**: Loading, storing, and tensor manipulation
+- **TileTensor operations**: Loading, storing, and tensor manipulation
- **Boundary condition handling**: Managing edge cases in parallel algorithms
## Learning path
diff --git a/book/src/puzzle_25/warp_broadcast.md b/book/src/puzzle_25/warp_broadcast.md
index b6729a70..25d9f6d1 100644
--- a/book/src/puzzle_25/warp_broadcast.md
+++ b/book/src/puzzle_25/warp_broadcast.md
@@ -80,7 +80,7 @@ Implement a basic broadcast pattern where lane 0 computes a block-level statisti
- Grid configuration: `(1, 1)` blocks per grid
- Block configuration: `(WARP_SIZE, 1)` threads per block
- Data type: `DType.float32`
-- Layout: `Layout.row_major(SIZE)` (1D row-major)
+- Layout: `row_major[SIZE]()` (1D row-major)
### Code to complete
diff --git a/book/src/puzzle_25/warp_shuffle_down.md b/book/src/puzzle_25/warp_shuffle_down.md
index 55664f9e..e034fc18 100644
--- a/book/src/puzzle_25/warp_shuffle_down.md
+++ b/book/src/puzzle_25/warp_shuffle_down.md
@@ -29,7 +29,7 @@ This transforms complex neighbor access patterns into simple warp-level operatio
- Grid configuration: `(1, 1)` blocks per grid
- Block configuration: `(WARP_SIZE, 1)` threads per block
- Data type: `DType.float32`
-- Layout: `Layout.row_major(SIZE)` (1D row-major)
+- Layout: `row_major[SIZE]()` (1D row-major)
### The shuffle_down concept
diff --git a/book/src/puzzle_26/puzzle_26.md b/book/src/puzzle_26/puzzle_26.md
index 962a606f..6ddad887 100644
--- a/book/src/puzzle_26/puzzle_26.md
+++ b/book/src/puzzle_26/puzzle_26.md
@@ -48,9 +48,9 @@ Learn the sophisticated communication primitives from `gpu.primitives.warp`:
```mojo
# Complex parallel reduction (traditional approach - from Puzzle 14):
-shared = LayoutTensor[
-    dtype,
-    Layout.row_major(WARP_SIZE),
-    MutAnyOrigin,
-    address_space = AddressSpace.SHARED,
-].stack_allocation()
+shared = stack_allocation[
+    dtype=dtype, address_space=AddressSpace.SHARED
+](row_major[WARP_SIZE]())
diff --git a/book/src/puzzle_26/warp_prefix_sum.md b/book/src/puzzle_26/warp_prefix_sum.md
index 4a0d6574..bf861e87 100644
--- a/book/src/puzzle_26/warp_prefix_sum.md
+++ b/book/src/puzzle_26/warp_prefix_sum.md
@@ -26,7 +26,7 @@ This transforms multi-phase shared memory algorithms into elegant single-functio
- Grid configuration: `(1, 1)` blocks per grid
- Block configuration: `(WARP_SIZE, 1)` threads per block
- Data type: `DType.float32`
-- Layout: `Layout.row_major(SIZE)` (1D row-major)
+- Layout: `row_major[SIZE]()` (1D row-major)
### The `prefix_sum` advantage
diff --git a/book/src/puzzle_26/warp_shuffle_xor.md b/book/src/puzzle_26/warp_shuffle_xor.md
index 925b48d6..51415b98 100644
--- a/book/src/puzzle_26/warp_shuffle_xor.md
+++ b/book/src/puzzle_26/warp_shuffle_xor.md
@@ -29,7 +29,7 @@ This transforms complex parallel algorithms into elegant butterfly communication
- Grid configuration: `(1, 1)` blocks per grid
- Block configuration: `(WARP_SIZE, 1)` threads per block
- Data type: `DType.float32`
-- Layout: `Layout.row_major(SIZE)` (1D row-major)
+- Layout: `row_major[SIZE]()` (1D row-major)
### The `shuffle_xor` concept
diff --git a/book/src/puzzle_27/block_broadcast.md b/book/src/puzzle_27/block_broadcast.md
index 8e0a9fd9..6edcaefc 100644
--- a/book/src/puzzle_27/block_broadcast.md
+++ b/book/src/puzzle_27/block_broadcast.md
@@ -25,7 +25,7 @@ Each thread contributes to the mean calculation, then receives the broadcast mea
- Data type: `DType.float32`
- Block configuration: `(128, 1)` threads per block (`TPB = 128`)
- Grid configuration: `(1, 1)` blocks per grid
-- Layout: `Layout.row_major(SIZE)` (1D row-major for input and output)
+- Layout: `row_major[SIZE]()` (1D row-major for input and output)
- Test data: Values cycling 1-8, so mean = 4.5
- Expected output: Normalized vector with mean = 1.0
@@ -100,7 +100,7 @@ The algorithm follows the perfect block operations pattern:
### 2. **Data loading and sum computation (familiar patterns)**
-Load your element using the established LayoutTensor pattern:
+Load your element using the established TileTensor pattern:
```mojo
var my_value: Scalar[dtype] = 0.0
@@ -255,7 +255,7 @@ Thread indexing (consistent across all puzzles):
global_i = block_dim.x * block_idx.x + thread_idx.x // Maps to input array position
local_i = thread_idx.x // Position within block (0-127)
-Parallel element loading using LayoutTensor pattern:
+Parallel element loading using TileTensor pattern:
Thread 0: my_value = input_data[0][0] = 1.0 // First cycle value
Thread 1: my_value = input_data[1][0] = 2.0 // Second cycle value
Thread 7: my_value = input_data[7][0] = 8.0 // Last cycle value
@@ -373,10 +373,10 @@ Mathematical proof of correctness:
Algorithm produces provably correct mathematical result.
```
-### **Connection to [Puzzle 12](../puzzle_12/layout_tensor.md) (foundational patterns):**
+### **Connection to [Puzzle 12](../puzzle_12/tile_tensor.md) (foundational patterns):**
- **Thread coordination evolution**: Same `global_i`, `local_i` patterns but with block primitives
-- **Memory access patterns**: Same LayoutTensor SIMD extraction `[0]` but optimized workflow
+- **Memory access patterns**: Same TileTensor SIMD extraction `[0]` but optimized workflow
- **Complexity elimination**: Replaces 20+ lines of manual barriers with 2 block operations
- **Educational progression**: Manual → automated, complex → simple, error-prone → reliable
@@ -454,7 +454,7 @@ Mean normalization is the perfect educational example of this fundamental patter
**Complete block operations progression:**
-1. **Manual coordination** ([Puzzle 12](../puzzle_12/layout_tensor.md)): Understand parallel fundamentals
+1. **Manual coordination** ([Puzzle 12](../puzzle_12/tile_tensor.md)): Understand parallel fundamentals
2. **Warp primitives** ([Puzzle 24](../puzzle_24/warp_sum.md)): Learn hardware-accelerated patterns
3. **Block reduction** ([`block.sum()`](./block_sum.md)): Learn all→one communication
4. **Block scan** ([`block.prefix_sum()`](./block_prefix_sum.md)): Learn all→each communication
diff --git a/book/src/puzzle_27/block_prefix_sum.md b/book/src/puzzle_27/block_prefix_sum.md
index afdd50bf..d9cbf3c6 100644
--- a/book/src/puzzle_27/block_prefix_sum.md
+++ b/book/src/puzzle_27/block_prefix_sum.md
@@ -26,7 +26,7 @@ Each thread determines its element's bin assignment, with `block.prefix_sum()` c
- Block configuration: `(128, 1)` threads per block (`TPB = 128`)
- Grid configuration: `(1, 1)` blocks per grid
- Number of bins: `NUM_BINS = 8` (ranges [0.0, 0.125), [0.125, 0.25), etc.)
-- Layout: `Layout.row_major(SIZE)` (1D row-major)
+- Layout: `row_major[SIZE]()` (1D row-major)
- Warps per block: `128 / WARP_SIZE` (2 or 4 warps depending on GPU)
## The challenge: Parallel bin extraction
@@ -132,7 +132,7 @@ if belongs_to_target == 1:
bin_output[Int(offset[0])] = my_value # Convert SIMD to Int for indexing
```
-This is just like the bounds checking pattern from [Puzzle 12](../puzzle_12/layout_tensor.md), but now the condition is "belongs to target bin."
+This is just like the bounds checking pattern from [Puzzle 12](../puzzle_12/tile_tensor.md), but now the condition is "belongs to target bin."
### 6. **Final count computation**
@@ -150,7 +150,7 @@ if local_i == tpb - 1: # Last thread in block
Remember the patterns from previous puzzles:
-- `LayoutTensor` indexing returns SIMD: `input_data[i][0]`
+- `TileTensor` indexing returns SIMD: `input_data[i][0]`
- `block.prefix_sum()` returns SIMD: `offset[0]` to extract
- Array indexing needs `Int`: `Int(offset[0])` for `bin_output[...]`
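
A minimal sketch combining the three conversions (variable names follow the snippets above):

```mojo
var my_value = input_data[global_i][0]  # TileTensor indexing returns SIMD; [0] extracts the scalar
# block.prefix_sum() also returns SIMD, so extract and convert before indexing:
if belongs_to_target == 1:
    bin_output[Int(offset[0])] = my_value
```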
@@ -252,14 +252,14 @@ The `block.prefix_sum()` kernel demonstrates advanced parallel coordination patt
## **Step-by-step algorithm walkthrough:**
-### **Phase 1: Element processing (like [Puzzle 12](../puzzle_12/layout_tensor.md) dot product)**
+### **Phase 1: Element processing (like [Puzzle 12](../puzzle_12/tile_tensor.md) dot product)**
```
Thread indexing (familiar pattern):
global_i = block_dim.x * block_idx.x + thread_idx.x // Global element index
local_i = thread_idx.x // Local thread index
-Element loading (like LayoutTensor pattern):
+Element loading (like TileTensor pattern):
Thread 0: my_value = input_data[0][0] = 0.00
Thread 1: my_value = input_data[1][0] = 0.01
Thread 13: my_value = input_data[13][0] = 0.13
@@ -328,11 +328,11 @@ Last thread computes total (not thread 0!):
## **Why this advanced algorithm works:**
-### **Connection to [Puzzle 12](../puzzle_12/layout_tensor.md) (Traditional dot product):**
+### **Connection to [Puzzle 12](../puzzle_12/tile_tensor.md) (Traditional dot product):**
- **Same thread indexing**: `global_i` and `local_i` patterns
- **Same bounds checking**: `if global_i < size` validation
-- **Same data loading**: LayoutTensor SIMD extraction with `[0]`
+- **Same data loading**: TileTensor SIMD extraction with `[0]`
### **Connection to [`block.sum()`](./block_sum.md) (earlier in this puzzle):**
diff --git a/book/src/puzzle_27/block_sum.md b/book/src/puzzle_27/block_sum.md
index 179e7d4f..89317757 100644
--- a/book/src/puzzle_27/block_sum.md
+++ b/book/src/puzzle_27/block_sum.md
@@ -25,12 +25,12 @@ But the implementation teaches fundamental patterns for all block-level GPU prog
- Data type: `DType.float32`
- Block configuration: `(128, 1)` threads per block (`TPB = 128`)
- Grid configuration: `(1, 1)` blocks per grid
-- Layout: `Layout.row_major(SIZE)` (1D row-major)
+- Layout: `row_major[SIZE]()` (1D row-major)
- Warps per block: `128 / WARP_SIZE` (4 warps on NVIDIA, 2 or 4 warps on AMD)
## The traditional complexity (from Puzzle 12)
-Recall the complex approach from [Puzzle 12](../puzzle_12/layout_tensor.md) that required shared memory, barriers, and tree reduction:
+Recall the complex approach from [Puzzle 12](../puzzle_12/tile_tensor.md) that required shared memory, barriers, and tree reduction:
```mojo
{{#include ../../../solutions/p27/p27.mojo:traditional_dot_product_solution}}
@@ -171,9 +171,9 @@ Every block reduction follows the same conceptual pattern:
Each thread should handle one element pair from vectors `a` and `b`. What operation combines these into a "partial result" that can be summed across threads?
-### 3. **LayoutTensor indexing patterns**
+### 3. **TileTensor indexing patterns**
-When accessing `LayoutTensor` elements, remember that indexing returns SIMD values. You'll need to extract the scalar value for arithmetic operations.
+When accessing `TileTensor` elements, remember that indexing returns SIMD values. You'll need to extract the scalar value for arithmetic operations.
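
For example, a one-line sketch of the partial result (names assumed from the kernel skeleton):

```mojo
# Each thread's contribution to the dot product; [0] extracts the scalar lane
var partial = a[global_i][0] * b[global_i][0]
```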
### 4. **[block.sum()](https://docs.modular.com/mojo/std/gpu/primitives/block/sum) API concepts**
diff --git a/book/src/puzzle_28/puzzle_28.md b/book/src/puzzle_28/puzzle_28.md
index 8a1fb7c3..9976e13d 100644
--- a/book/src/puzzle_28/puzzle_28.md
+++ b/book/src/puzzle_28/puzzle_28.md
@@ -162,7 +162,7 @@ This concept becomes particularly important when implementing async copy operati
- Grid configuration: `(VECTOR_SIZE // CONV_TILE_SIZE, 1)` blocks per grid (64 blocks)
- Kernel size: `KERNEL_SIZE = 5` (simple 1D convolution, same as Puzzle 13)
- Data type: `DType.float32`
-- Layout: `Layout.row_major(VECTOR_SIZE)` (1D row-major)
+- Layout: `row_major[VECTOR_SIZE]()` (1D row-major)
### The async copy opportunity
@@ -325,13 +325,13 @@ The async copy overlap solution demonstrates how to hide memory latency by overl
```mojo
# Phase 1: Launch async copy for input tile
input_tile = input.tile[CONV_TILE_SIZE](block_idx.x)
-comptime load_layout = Layout.row_major(THREADS_PER_BLOCK_ASYNC)
+comptime load_layout = row_major[THREADS_PER_BLOCK_ASYNC]()
copy_dram_to_sram_async[thread_layout=load_layout](input_shared, input_tile)
```
-- **Tile Creation**: `input.tile[CONV_TILE_SIZE](block_idx.x)` creates a 256-element view of the input array starting at `block_idx.x * 256`. The Mojo [`tile` method](https://docs.modular.com/mojo/kernels/layout/layout_tensor/LayoutTensor/#tile) does **NOT** perform bounds checking or zero-padding. Accessing out-of-bounds indices results in undefined behavior. The implementation must ensure the tile size and offset remain within valid array bounds.
+- **Tile Creation**: `input.tile[CONV_TILE_SIZE](block_idx.x)` creates a 256-element view of the input array starting at `block_idx.x * 256`. The Mojo [`tile` method](https://docs.modular.com/mojo/kernels/layout/tile_tensor/TileTensor/#tile) does **NOT** perform bounds checking or zero-padding. Accessing out-of-bounds indices results in undefined behavior. The implementation must ensure the tile size and offset remain within valid array bounds.
-- **Thread Layout**: `Layout.row_major(THREADS_PER_BLOCK_ASYNC, 1)` creates a `256 x 1` layout that matches our block organization. This is **critical** - the layout must match the physical thread arrangement for optimal coalesced memory access. When layouts mismatch, threads may access non-contiguous memory addresses, breaking coalescing and severely degrading performance.
+- **Thread Layout**: `row_major[THREADS_PER_BLOCK_ASYNC, 1]()` creates a `256 x 1` layout that matches our block organization. This is **critical** - the layout must match the physical thread arrangement for optimal coalesced memory access. When layouts mismatch, threads may access non-contiguous memory addresses, breaking coalescing and severely degrading performance.
- **Async Copy Launch**: `copy_dram_to_sram_async` initiates a background transfer from DRAM to shared memory. The hardware copies 256 floats (1KB) while the block continues executing.
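
A hedged sketch of the sequencing that follows the launch (the wait primitive's exact name is an assumption; the full solution shows the real call):

```mojo
# Phase 2: independent work overlaps with the in-flight copy
# (e.g., load the small convolution kernel into shared memory)

# Phase 3: block until the async copy has landed, then synchronize the block
async_copy_wait_all()  # assumed name for the wait primitive
barrier()

# Phase 4: compute on input_shared, now safely populated
```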
@@ -410,7 +410,7 @@ Total Time = MAX(Input_Transfer_Time, Kernel_Transfer_Time) + Compute_Time
#### **Key technical insights**
-1. **Thread Layout Matching**: The `Layout.row_major(256, 1)` layout precisely matches the block's `(256, 1)` thread organization, enabling optimal memory coalescing.
+1. **Thread Layout Matching**: The `row_major[256, 1]()` layout precisely matches the block's `(256, 1)` thread organization, enabling optimal memory coalescing.
2. **Race Condition Avoidance**: Proper sequencing (async copy → kernel load → wait → barrier → compute) eliminates all race conditions that could corrupt shared memory.
diff --git a/book/src/puzzle_32/conflict_free_patterns.md b/book/src/puzzle_32/conflict_free_patterns.md
index dcda0edf..08cc5550 100644
--- a/book/src/puzzle_32/conflict_free_patterns.md
+++ b/book/src/puzzle_32/conflict_free_patterns.md
@@ -354,7 +354,7 @@ constant = shared[0] # All threads read same address - hardware optimized
**3. Padding techniques:**
```mojo
-shared = LayoutTensor[dtype, Layout.row_major(TPB + 1), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation() # Shift access patterns
+shared = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB + 1]()) # Shift access patterns
```
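
The extra element works because shared memory is split across banks: padding shifts where successive rows (or strided accesses) start, so indices that would otherwise collide on a single bank are spread across different banks.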
**4. Access pattern analysis:**
diff --git a/book/src/puzzle_33/puzzle_33.md b/book/src/puzzle_33/puzzle_33.md
index f4abfe51..254c0b26 100644
--- a/book/src/puzzle_33/puzzle_33.md
+++ b/book/src/puzzle_33/puzzle_33.md
@@ -176,9 +176,9 @@ Your task is to complete the `tensor_core_matrix_multiplication` function. The s
Layout configuration:
-- Input A: `Layout.row_major(SIZE, SIZE)`
-- Input B: `Layout.row_major(SIZE, SIZE)`
-- Output C: `Layout.row_major(SIZE, SIZE)`
+- Input A: `row_major[SIZE, SIZE]()`
+- Input B: `row_major[SIZE, SIZE]()`
+- Output C: `row_major[SIZE, SIZE]()`
- Shared Memory: Block-sized tiles with async copy operations
## The challenge
diff --git a/book/src/puzzle_34/advanced_cluster_patterns.md b/book/src/puzzle_34/advanced_cluster_patterns.md
index 0e5ad190..a5b9cdcf 100644
--- a/book/src/puzzle_34/advanced_cluster_patterns.md
+++ b/book/src/puzzle_34/advanced_cluster_patterns.md
@@ -37,7 +37,7 @@ Real-world GPU algorithms often require **hierarchical coordination** where diff
- **Warp Size**: `WARP_SIZE = 32` threads per warp (NVIDIA standard)
- **Warps per Block**: `TPB / WARP_SIZE = 8` warps
- **Data Type**: `DType.float32`
-- **Memory Layout**: Input `Layout.row_major(SIZE)`, Output `Layout.row_major(CLUSTER_SIZE)`
+- **Memory Layout**: Input `row_major[SIZE]()`, Output `row_major[CLUSTER_SIZE]()`
**Processing Distribution:**
diff --git a/book/src/puzzle_34/cluster_collective_ops.md b/book/src/puzzle_34/cluster_collective_ops.md
index da6d0588..4e2965d5 100644
--- a/book/src/puzzle_34/cluster_collective_ops.md
+++ b/book/src/puzzle_34/cluster_collective_ops.md
@@ -40,8 +40,8 @@ Single blocks (as learned in [Puzzle 27](../puzzle_27/puzzle_27.md)) are limited
- **Block Configuration**: `TPB = 256` threads per block `(256, 1)`
- **Grid Configuration**: `CLUSTER_SIZE = 4` blocks per cluster `(4, 1)`
- **Data Type**: `DType.float32`
-- **Memory Layout**: Input `Layout.row_major(SIZE)`, Output `Layout.row_major(1)`
-- **Temporary Storage**: `Layout.row_major(CLUSTER_SIZE)` for partial results
+- **Memory Layout**: Input `row_major[SIZE]()`, Output `row_major[1]()`
+- **Temporary Storage**: `row_major[CLUSTER_SIZE]()` for partial results
**Expected Result**: Sum of the sequence `0, 0.01, 0.02, ..., 10.23` (1024 terms), i.e. 1024 × (0 + 10.23) / 2 = **5237.76**
diff --git a/book/src/puzzle_34/cluster_coordination_basics.md b/book/src/puzzle_34/cluster_coordination_basics.md
index feac8b14..bbe99959 100644
--- a/book/src/puzzle_34/cluster_coordination_basics.md
+++ b/book/src/puzzle_34/cluster_coordination_basics.md
@@ -35,7 +35,7 @@ Traditional single-block algorithms like those in [Puzzle 27](../puzzle_27/puzzl
- **Block Configuration**: `TPB = 256` threads per block `(256, 1)`
- **Grid Configuration**: `CLUSTER_SIZE = 4` blocks per cluster `(4, 1)`
- **Data Type**: `DType.float32`
-- **Memory Layout**: Input `Layout.row_major(SIZE)`, Output `Layout.row_major(CLUSTER_SIZE)`
+- **Memory Layout**: Input `row_major[SIZE]()`, Output `row_major[CLUSTER_SIZE]()`
**Thread Block Distribution:**
@@ -65,7 +65,7 @@ Traditional single-block algorithms like those in [Puzzle 27](../puzzle_27/puzzl
### **Shared memory coordination**
-- Allocate shared memory using `LayoutTensor[dtype, Layout.row_major(tpb), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()` (see [shared memory basics from Puzzle 8](../puzzle_08/puzzle_08.md))
+- Allocate shared memory using `stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[tpb]())` (see [shared memory basics from Puzzle 8](../puzzle_08/puzzle_08.md))
- Process input data scaled by `block_id + 1` to create distinct scaling per block
- Use bounds checking when accessing input data (pattern from [guards in Puzzle 3](../puzzle_03/puzzle_03.md))
@@ -153,7 +153,7 @@ block_id = Int(block_idx.x) # Block index for reliable
**Shared memory allocation and data processing:**
-- Each block allocates its own shared memory workspace: `LayoutTensor[dtype, Layout.row_major(tpb), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()`
+- Each block allocates its own shared memory workspace: `stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[tpb]())`
- **Scaling strategy**: `data_scale = Float32(block_id + 1)` ensures each block processes data differently
- Block 0: multiplies by 1.0, Block 1: by 2.0, Block 2: by 3.0, Block 3: by 4.0
- **Bounds checking**: `if global_i < size:` prevents out-of-bounds memory access
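
A hedged sketch of the per-block workspace pattern described above (kernel signature and index variables assumed):

```mojo
# Each block allocates its own shared workspace
var shared = stack_allocation[
    dtype=dtype, address_space=AddressSpace.SHARED
](row_major[tpb]())

var data_scale = Float32(block_id + 1)  # Block 0 -> 1.0, Block 1 -> 2.0, ...

if global_i < size:  # guard against out-of-bounds reads
    shared[local_i] = input[global_i][0] * data_scale
barrier()
```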
diff --git a/pixi.toml b/pixi.toml
index c886bba0..90cd7c4f 100644
--- a/pixi.toml
+++ b/pixi.toml
@@ -83,24 +83,21 @@ p03 = "mojo problems/p03/p03.mojo"
viz03 = "cd book/src/puzzle_03 && python puzzle_03_viz.py"
p04 = "mojo problems/p04/p04.mojo"
-p04_layout_tensor = "mojo problems/p04/p04_layout_tensor.mojo"
+p04_tile_tensor = "mojo problems/p04/p04_tile_tensor.mojo"
viz04 = "cd book/src/puzzle_04 && python puzzle_04_viz.py"
thread_indexing = "cd book/src/puzzle_04 && python thread_indexing_viz.py"
-layout_tensor_intro = "mojo book/src/puzzle_04/intro.mojo"
+tile_tensor_intro = "mojo book/src/puzzle_04/intro.mojo"
p05 = "mojo problems/p05/p05.mojo"
-p05_layout_tensor = "mojo problems/p05/p05_layout_tensor.mojo"
viz05 = "cd book/src/puzzle_05 && python puzzle_05_viz.py"
p06 = "mojo problems/p06/p06.mojo"
viz06 = "cd book/src/puzzle_06 && python puzzle_06_viz.py"
p07 = "mojo problems/p07/p07.mojo"
-p07_layout_tensor = "mojo problems/p07/p07_layout_tensor.mojo"
viz07 = "cd book/src/puzzle_07 && python puzzle_07_viz.py"
p08 = "mojo problems/p08/p08.mojo"
-p08_layout_tensor = "mojo problems/p08/p08_layout_tensor.mojo"
viz08 = "cd book/src/puzzle_08 && python puzzle_08_viz.py"
p09 = "mojo problems/p09/p09.mojo"
@@ -108,11 +105,9 @@ p09 = "mojo problems/p09/p09.mojo"
p10 = "mojo problems/p10/p10.mojo"
p11 = "mojo problems/p11/p11.mojo"
-p11_layout_tensor = "mojo problems/p11/p11_layout_tensor.mojo"
viz11 = "cd book/src/puzzle_11 && python puzzle_11_viz.py"
p12 = "mojo problems/p12/p12.mojo"
-p12_layout_tensor = "mojo problems/p12/p12_layout_tensor.mojo"
viz12 = "cd book/src/puzzle_12 && python puzzle_12_viz.py"
p13 = "mojo problems/p13/p13.mojo"
diff --git a/problems/p04/p04_layout_tensor.mojo b/problems/p04/p04_tile_tensor.mojo
similarity index 72%
rename from problems/p04/p04_layout_tensor.mojo
rename to problems/p04/p04_tile_tensor.mojo
index ad8aff51..f4f963f0 100644
--- a/problems/p04/p04_layout_tensor.mojo
+++ b/problems/p04/p04_tile_tensor.mojo
@@ -1,19 +1,21 @@
from std.gpu import thread_idx
from std.gpu.host import DeviceContext
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
from std.testing import assert_equal
-# ANCHOR: add_10_2d_layout_tensor
+# ANCHOR: add_10_2d_tile_tensor
comptime SIZE = 2
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = (3, 3)
comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE, SIZE)
+comptime layout = row_major[SIZE, SIZE]()
+comptime LayoutType = type_of(layout)
def add_10_2d(
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- a: LayoutTensor[dtype, layout, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ a: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
size: Int,
):
var row = thread_idx.y
@@ -21,15 +23,15 @@ def add_10_2d(
# FILL ME IN (roughly 2 lines)
-# ANCHOR_END: add_10_2d_layout_tensor
+# ANCHOR_END: add_10_2d_tile_tensor
def main() raises:
with DeviceContext() as ctx:
var out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
out_buf.enqueue_fill(0)
- var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out_buf)
- print("out shape:", out_tensor.shape[0](), "x", out_tensor.shape[1]())
+ var out_tensor = TileTensor(out_buf, layout)
+ print("out shape:", out_tensor.dim[0](), "x", out_tensor.dim[1]())
var expected = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE)
expected.enqueue_fill(0)
@@ -41,7 +43,7 @@ def main() raises:
a_host[i] = Scalar[dtype](i)
expected[i] = a_host[i] + 10
- var a_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](a)
+ var a_tensor = TileTensor(a, layout)
ctx.enqueue_function[add_10_2d, add_10_2d](
out_tensor,
diff --git a/problems/p05/p05.mojo b/problems/p05/p05.mojo
index 8dc43341..335f6feb 100644
--- a/problems/p05/p05.mojo
+++ b/problems/p05/p05.mojo
@@ -1,6 +1,7 @@
-from std.memory import UnsafePointer
from std.gpu import thread_idx
from std.gpu.host import DeviceContext
+from layout import TileTensor
+from layout.tile_layout import row_major
from std.testing import assert_equal
# ANCHOR: broadcast_add
@@ -8,12 +9,18 @@ comptime SIZE = 2
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = (3, 3)
comptime dtype = DType.float32
+comptime out_layout = row_major[SIZE, SIZE]()
+comptime a_layout = row_major[1, SIZE]()
+comptime b_layout = row_major[SIZE, 1]()
+comptime OutLayout = type_of(out_layout)
+comptime ALayout = type_of(a_layout)
+comptime BLayout = type_of(b_layout)
def broadcast_add(
- output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
- a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
- b: UnsafePointer[Scalar[dtype], MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, ALayout, ImmutAnyOrigin],
+ b: TileTensor[mut=False, dtype, BLayout, ImmutAnyOrigin],
size: Int,
):
var row = thread_idx.y
@@ -24,10 +31,15 @@ def broadcast_add(
# ANCHOR_END: broadcast_add
def main() raises:
with DeviceContext() as ctx:
- var out = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
- out.enqueue_fill(0)
- var expected = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE)
- expected.enqueue_fill(0)
+ var out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
+ out_buf.enqueue_fill(0)
+ var out_tensor = TileTensor(out_buf, out_layout)
+ print("out shape:", out_tensor.dim[0](), "x", out_tensor.dim[1]())
+
+ var expected_buf = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE)
+ expected_buf.enqueue_fill(0)
+ var expected_tensor = TileTensor(expected_buf, out_layout)
+
var a = ctx.enqueue_create_buffer[dtype](SIZE)
a.enqueue_fill(0)
var b = ctx.enqueue_create_buffer[dtype](SIZE)
@@ -39,12 +51,15 @@ def main() raises:
for i in range(SIZE):
for j in range(SIZE):
- expected[i * SIZE + j] = a_host[j] + b_host[i]
+ expected_tensor[i, j] = a_host[j] + b_host[i]
+
+ var a_tensor = TileTensor[mut=False, dtype, ALayout](a, a_layout)
+ var b_tensor = TileTensor[mut=False, dtype, BLayout](b, b_layout)
ctx.enqueue_function[broadcast_add, broadcast_add](
- out,
- a,
- b,
+ out_tensor,
+ a_tensor,
+ b_tensor,
SIZE,
grid_dim=BLOCKS_PER_GRID,
block_dim=THREADS_PER_BLOCK,
@@ -52,10 +67,12 @@ def main() raises:
ctx.synchronize()
- with out.map_to_host() as out_host:
- print("out:", out_host)
- print("expected:", expected)
+ with out_buf.map_to_host() as out_buf_host:
+ print("out:", out_buf_host)
+ print("expected:", expected_buf)
for i in range(SIZE):
for j in range(SIZE):
- assert_equal(out_host[i * SIZE + j], expected[i * SIZE + j])
+ assert_equal(
+ out_buf_host[i * SIZE + j], expected_buf[i * SIZE + j]
+ )
print("Puzzle 05 complete โ
")
diff --git a/problems/p05/p05_layout_tensor.mojo b/problems/p05/p05_layout_tensor.mojo
deleted file mode 100644
index 1e65f5a0..00000000
--- a/problems/p05/p05_layout_tensor.mojo
+++ /dev/null
@@ -1,81 +0,0 @@
-from std.gpu import thread_idx
-from std.gpu.host import DeviceContext
-from layout import Layout, LayoutTensor
-from std.testing import assert_equal
-
-# ANCHOR: broadcast_add_layout_tensor
-comptime SIZE = 2
-comptime BLOCKS_PER_GRID = 1
-comptime THREADS_PER_BLOCK = (3, 3)
-comptime dtype = DType.float32
-comptime out_layout = Layout.row_major(SIZE, SIZE)
-comptime a_layout = Layout.row_major(1, SIZE)
-comptime b_layout = Layout.row_major(SIZE, 1)
-
-
-def broadcast_add[
- out_layout: Layout,
- a_layout: Layout,
- b_layout: Layout,
-](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, a_layout, ImmutAnyOrigin],
- b: LayoutTensor[dtype, b_layout, ImmutAnyOrigin],
- size: Int,
-):
- var row = thread_idx.y
- var col = thread_idx.x
- # FILL ME IN (roughly 2 lines)
-
-
-# ANCHOR_END: broadcast_add_layout_tensor
-def main() raises:
- with DeviceContext() as ctx:
- var out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
- out_buf.enqueue_fill(0)
- var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out_buf)
- print("out shape:", out_tensor.shape[0](), "x", out_tensor.shape[1]())
-
- var expected_buf = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE)
- expected_buf.enqueue_fill(0)
- var expected_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](
- expected_buf
- )
-
- var a = ctx.enqueue_create_buffer[dtype](SIZE)
- a.enqueue_fill(0)
- var b = ctx.enqueue_create_buffer[dtype](SIZE)
- b.enqueue_fill(0)
- with a.map_to_host() as a_host, b.map_to_host() as b_host:
- for i in range(SIZE):
- a_host[i] = Scalar[dtype](i + 1)
- b_host[i] = Scalar[dtype](i * 10)
-
- for i in range(SIZE):
- for j in range(SIZE):
- expected_tensor[i, j] = a_host[j] + b_host[i]
-
- var a_tensor = LayoutTensor[dtype, a_layout, ImmutAnyOrigin](a)
- var b_tensor = LayoutTensor[dtype, b_layout, ImmutAnyOrigin](b)
-
- comptime kernel = broadcast_add[out_layout, a_layout, b_layout]
- ctx.enqueue_function[kernel, kernel](
- out_tensor,
- a_tensor,
- b_tensor,
- SIZE,
- grid_dim=BLOCKS_PER_GRID,
- block_dim=THREADS_PER_BLOCK,
- )
-
- ctx.synchronize()
-
- with out_buf.map_to_host() as out_buf_host:
- print("out:", out_buf_host)
- print("expected:", expected_buf)
- for i in range(SIZE):
- for j in range(SIZE):
- assert_equal(
- out_buf_host[i * SIZE + j], expected_buf[i * SIZE + j]
- )
- print("Puzzle 05 complete โ
")
diff --git a/problems/p07/p07.mojo b/problems/p07/p07.mojo
index f6eaa3bb..51e2de6a 100644
--- a/problems/p07/p07.mojo
+++ b/problems/p07/p07.mojo
@@ -1,6 +1,7 @@
-from std.memory import UnsafePointer
from std.gpu import thread_idx, block_idx, block_dim
from std.gpu.host import DeviceContext
+from layout import TileTensor
+from layout.tile_layout import row_major
from std.testing import assert_equal
# ANCHOR: add_10_blocks_2d
@@ -8,11 +9,15 @@ comptime SIZE = 5
comptime BLOCKS_PER_GRID = (2, 2)
comptime THREADS_PER_BLOCK = (3, 3)
comptime dtype = DType.float32
+comptime out_layout = row_major[SIZE, SIZE]()
+comptime a_layout = row_major[SIZE, SIZE]()
+comptime OutLayout = type_of(out_layout)
+comptime ALayout = type_of(a_layout)
def add_10_blocks_2d(
- output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
- a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, ALayout, ImmutAnyOrigin],
size: Int,
):
var row = block_dim.y * block_idx.y + thread_idx.y
@@ -25,10 +30,13 @@ def add_10_blocks_2d(
def main() raises:
with DeviceContext() as ctx:
- var out = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
- out.enqueue_fill(0)
- var expected = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE)
- expected.enqueue_fill(1)
+ var out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
+ out_buf.enqueue_fill(0)
+ var out_tensor = TileTensor(out_buf, out_layout)
+
+ var expected_buf = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE)
+ expected_buf.enqueue_fill(1)
+
var a = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
a.enqueue_fill(1)
@@ -37,11 +45,13 @@ def main() raises:
for i in range(SIZE):
var k = j * SIZE + i
a_host[k] = Scalar[dtype](k)
- expected[k] = Scalar[dtype](k + 10)
+ expected_buf[k] = Scalar[dtype](k + 10)
+
+ var a_tensor = TileTensor[mut=False, dtype, ALayout](a, a_layout)
ctx.enqueue_function[add_10_blocks_2d, add_10_blocks_2d](
- out,
- a,
+ out_tensor,
+ a_tensor,
SIZE,
grid_dim=BLOCKS_PER_GRID,
block_dim=THREADS_PER_BLOCK,
@@ -49,10 +59,17 @@ def main() raises:
ctx.synchronize()
- with out.map_to_host() as out_host:
- print("out:", out_host)
- print("expected:", expected)
+ var expected_tensor = TileTensor(expected_buf, out_layout)
+
+ with out_buf.map_to_host() as out_buf_host:
+ print(
+ "out:",
+ TileTensor(out_buf_host, out_layout),
+ )
+ print("expected:", expected_tensor)
for i in range(SIZE):
for j in range(SIZE):
- assert_equal(out_host[i * SIZE + j], expected[i * SIZE + j])
+ assert_equal(
+ out_buf_host[i * SIZE + j], expected_buf[i * SIZE + j]
+ )
print("Puzzle 07 complete โ
")
diff --git a/problems/p07/p07_layout_tensor.mojo b/problems/p07/p07_layout_tensor.mojo
deleted file mode 100644
index 604ac552..00000000
--- a/problems/p07/p07_layout_tensor.mojo
+++ /dev/null
@@ -1,78 +0,0 @@
-from std.gpu import thread_idx, block_idx, block_dim
-from std.gpu.host import DeviceContext
-from layout import Layout, LayoutTensor
-from std.testing import assert_equal
-
-# ANCHOR: add_10_blocks_2d_layout_tensor
-comptime SIZE = 5
-comptime BLOCKS_PER_GRID = (2, 2)
-comptime THREADS_PER_BLOCK = (3, 3)
-comptime dtype = DType.float32
-comptime out_layout = Layout.row_major(SIZE, SIZE)
-comptime a_layout = Layout.row_major(SIZE, SIZE)
-
-
-def add_10_blocks_2d[
- out_layout: Layout,
- a_layout: Layout,
-](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, a_layout, ImmutAnyOrigin],
- size: Int,
-):
- var row = block_dim.y * block_idx.y + thread_idx.y
- var col = block_dim.x * block_idx.x + thread_idx.x
- # FILL ME IN (roughly 2 lines)
-
-
-# ANCHOR_END: add_10_blocks_2d_layout_tensor
-
-
-def main() raises:
- with DeviceContext() as ctx:
- var out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
- out_buf.enqueue_fill(0)
- var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out_buf)
-
- var expected_buf = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE)
- expected_buf.enqueue_fill(1)
-
- var a = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
- a.enqueue_fill(1)
-
- with a.map_to_host() as a_host:
- for j in range(SIZE):
- for i in range(SIZE):
- var k = j * SIZE + i
- a_host[k] = Scalar[dtype](k)
- expected_buf[k] = Scalar[dtype](k + 10)
-
- var a_tensor = LayoutTensor[dtype, a_layout, ImmutAnyOrigin](a)
-
- comptime kernel = add_10_blocks_2d[out_layout, a_layout]
- ctx.enqueue_function[kernel, kernel](
- out_tensor,
- a_tensor,
- SIZE,
- grid_dim=BLOCKS_PER_GRID,
- block_dim=THREADS_PER_BLOCK,
- )
-
- ctx.synchronize()
-
- var expected_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](
- expected_buf
- )
-
- with out_buf.map_to_host() as out_buf_host:
- print(
- "out:",
- LayoutTensor[dtype, out_layout, MutAnyOrigin](out_buf_host),
- )
- print("expected:", expected_tensor)
- for i in range(SIZE):
- for j in range(SIZE):
- assert_equal(
- out_buf_host[i * SIZE + j], expected_buf[i * SIZE + j]
- )
- print("Puzzle 07 complete โ
")
diff --git a/problems/p08/p08.mojo b/problems/p08/p08.mojo
index 2f994b19..b89c6fc2 100644
--- a/problems/p08/p08.mojo
+++ b/problems/p08/p08.mojo
@@ -1,7 +1,9 @@
-from std.memory import UnsafePointer, stack_allocation
from std.gpu import thread_idx, block_idx, block_dim, barrier
from std.gpu.host import DeviceContext
from std.gpu.memory import AddressSpace
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from std.testing import assert_equal
# ANCHOR: add_10_shared
@@ -10,26 +12,26 @@ comptime SIZE = 8
comptime BLOCKS_PER_GRID = (2, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)
def add_10_shared(
- output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
- a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
size: Int,
):
+ # Allocate shared memory using stack_allocation
var shared = stack_allocation[
- TPB,
- Scalar[dtype],
- address_space=AddressSpace.SHARED,
- ]()
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB]())
+
var global_i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
- # Load local data into shared memory
+
if global_i < size:
shared[local_i] = a[global_i]
- # wait for all threads to complete
- # works within a thread block
barrier()
# FILL ME IN (roughly 2 lines)
@@ -44,9 +46,13 @@ def main() raises:
out.enqueue_fill(0)
var a = ctx.enqueue_create_buffer[dtype](SIZE)
a.enqueue_fill(1)
+
+ var out_tensor = TileTensor(out, layout)
+ var a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout)
+
ctx.enqueue_function[add_10_shared, add_10_shared](
- out,
- a,
+ out_tensor,
+ a_tensor,
SIZE,
grid_dim=BLOCKS_PER_GRID,
block_dim=THREADS_PER_BLOCK,
@@ -54,7 +60,6 @@ def main() raises:
var expected = ctx.enqueue_create_host_buffer[dtype](SIZE)
expected.enqueue_fill(11)
-
ctx.synchronize()
with out.map_to_host() as out_host:
diff --git a/problems/p08/p08_layout_tensor.mojo b/problems/p08/p08_layout_tensor.mojo
deleted file mode 100644
index 4856d2c3..00000000
--- a/problems/p08/p08_layout_tensor.mojo
+++ /dev/null
@@ -1,73 +0,0 @@
-from std.gpu import thread_idx, block_idx, block_dim, barrier
-from std.gpu.host import DeviceContext
-from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
-from std.testing import assert_equal
-
-# ANCHOR: add_10_shared_layout_tensor
-comptime TPB = 4
-comptime SIZE = 8
-comptime BLOCKS_PER_GRID = (2, 1)
-comptime THREADS_PER_BLOCK = (TPB, 1)
-comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE)
-
-
-def add_10_shared_layout_tensor[
- layout: Layout
-](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
- size: Int,
-):
- # Allocate shared memory using LayoutTensor with explicit address_space
- var shared = LayoutTensor[
- dtype,
- Layout.row_major(TPB),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
-
- var global_i = block_dim.x * block_idx.x + thread_idx.x
- var local_i = thread_idx.x
-
- if global_i < size:
- shared[local_i] = a[global_i]
-
- barrier()
-
- # FILL ME IN (roughly 2 lines)
-
-
-# ANCHOR_END: add_10_shared_layout_tensor
-
-
-def main() raises:
- with DeviceContext() as ctx:
- var out = ctx.enqueue_create_buffer[dtype](SIZE)
- out.enqueue_fill(0)
- var a = ctx.enqueue_create_buffer[dtype](SIZE)
- a.enqueue_fill(1)
-
- var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
- var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a)
-
- comptime kernel = add_10_shared_layout_tensor[layout]
- ctx.enqueue_function[kernel, kernel](
- out_tensor,
- a_tensor,
- SIZE,
- grid_dim=BLOCKS_PER_GRID,
- block_dim=THREADS_PER_BLOCK,
- )
-
- var expected = ctx.enqueue_create_host_buffer[dtype](SIZE)
- expected.enqueue_fill(11)
- ctx.synchronize()
-
- with out.map_to_host() as out_host:
- print("out:", out_host)
- print("expected:", expected)
- for i in range(SIZE):
- assert_equal(out_host[i], expected[i])
- print("Puzzle 08 complete โ
")
diff --git a/problems/p09/p09.mojo b/problems/p09/p09.mojo
index 3455c9f0..38467f2c 100644
--- a/problems/p09/p09.mojo
+++ b/problems/p09/p09.mojo
@@ -2,7 +2,9 @@ from std.memory import UnsafePointer
from std.gpu import thread_idx, barrier
from std.gpu.host import DeviceContext
from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from std.testing import assert_equal
from std.sys import argv
@@ -11,7 +13,8 @@ comptime MATRIX_SIZE = 3
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = SIZE
comptime dtype = DType.float32
-comptime vector_layout = Layout.row_major(SIZE)
+comptime vector_layout = row_major[SIZE]()
+comptime VectorLayout = type_of(vector_layout)
comptime ITER = 2
@@ -29,8 +32,8 @@ def add_10(
# ANCHOR: second_crash
def process_sliding_window(
- output: LayoutTensor[dtype, vector_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, vector_layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, VectorLayout, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, VectorLayout, ImmutAnyOrigin],
):
var thread_id = thread_idx.x
@@ -52,18 +55,15 @@ def process_sliding_window(
# ANCHOR: third_crash
def collaborative_filter(
- output: LayoutTensor[dtype, vector_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, vector_layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, VectorLayout, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, VectorLayout, ImmutAnyOrigin],
):
var thread_id = thread_idx.x
# Shared memory workspace for collaborative processing
- var shared_workspace = LayoutTensor[
- dtype,
- Layout.row_major(SIZE - 1),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var shared_workspace = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[SIZE - 1]())
# Phase 1: Initialize shared workspace (all threads participate)
if thread_id < SIZE - 1:
@@ -139,13 +139,11 @@ def main() raises:
for i in range(SIZE):
input_host[i] = Scalar[dtype](i)
- # Create LayoutTensors for structured access
- input_tensor = LayoutTensor[dtype, vector_layout, ImmutAnyOrigin](
- input_buf
- )
- output_tensor = LayoutTensor[dtype, vector_layout, MutAnyOrigin](
- output_buf
+ # Create TileTensors for structured access
+ input_tensor = TileTensor[mut=False, dtype, VectorLayout](
+ input_buf, vector_layout
)
+ output_tensor = TileTensor(output_buf, vector_layout)
print("Input array: [0, 1, 2, 3]")
print("Computing sliding window sums (window size = 3)...")
@@ -216,13 +214,11 @@ def main() raises:
for i in range(SIZE):
input_host[i] = Scalar[dtype](i + 1)
- # Create LayoutTensors
- input_tensor = LayoutTensor[dtype, vector_layout, ImmutAnyOrigin](
- input_buf
- )
- output_tensor = LayoutTensor[dtype, vector_layout, MutAnyOrigin](
- output_buf
+ # Create TileTensors
+ input_tensor = TileTensor[mut=False, dtype, VectorLayout](
+ input_buf, vector_layout
)
+ output_tensor = TileTensor(output_buf, vector_layout)
print("Input array: [1, 2, 3, 4]")
print("Applying collaborative filter using shared memory...")
diff --git a/problems/p10/p10.mojo b/problems/p10/p10.mojo
index d64fd3b4..e42e37e0 100644
--- a/problems/p10/p10.mojo
+++ b/problems/p10/p10.mojo
@@ -1,7 +1,9 @@
from std.gpu import thread_idx, block_dim, block_idx, barrier
from std.gpu.host import DeviceContext
from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from std.testing import assert_equal
from std.sys import argv
@@ -11,23 +13,21 @@ comptime SIZE = 2
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = (3, 3)
comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE, SIZE)
+comptime layout = row_major[SIZE, SIZE]()
+comptime LayoutType = type_of(layout)
def shared_memory_race(
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
size: Int,
):
var row = thread_idx.y
var col = thread_idx.x
- var shared_sum = LayoutTensor[
- dtype,
- Layout.row_major(1),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var shared_sum = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[1]())
if row < size and col < size:
shared_sum[0] += a[row, col]
@@ -43,8 +43,8 @@ def shared_memory_race(
# ANCHOR: add_10_2d_no_guard
def add_10_2d(
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
size: Int,
):
var row = thread_idx.y
@@ -68,10 +68,8 @@ def main() raises:
with DeviceContext() as ctx:
var out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
out_buf.enqueue_fill(0)
- var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
- out_buf
- ).reshape[layout]()
- print("out shape:", out_tensor.shape[0](), "x", out_tensor.shape[1]())
+ var out_tensor = TileTensor(out_buf, layout)
+ print("out shape:", out_tensor.dim[0](), "x", out_tensor.dim[1]())
var expected = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE)
expected.enqueue_fill(0)
@@ -81,9 +79,7 @@ def main() raises:
for i in range(SIZE * SIZE):
a_host[i] = Scalar[dtype](i)
- var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a).reshape[
- layout
- ]()
+ var a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout)
if flag == "--memory-bug":
print("Running memory bug example (bounds checking issue)...")
diff --git a/problems/p11/p11.mojo b/problems/p11/p11.mojo
index 06b82580..c6ed142f 100644
--- a/problems/p11/p11.mojo
+++ b/problems/p11/p11.mojo
@@ -1,7 +1,9 @@
-from std.memory import UnsafePointer, stack_allocation
from std.gpu import thread_idx, block_idx, block_dim, barrier
from std.gpu.host import DeviceContext
from std.gpu.memory import AddressSpace
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from std.testing import assert_equal
# ANCHOR: pooling
@@ -10,21 +12,23 @@ comptime SIZE = 8
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)
def pooling(
- output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
- a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
size: Int,
):
+ # Allocate shared memory using stack_allocation
var shared = stack_allocation[
- TPB,
- Scalar[dtype],
- address_space=AddressSpace.SHARED,
- ]()
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB]())
+
var global_i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
- # FILL ME IN (roughly 10 lines)
+ # FIX ME IN (roughly 10 lines)
# ANCHOR_END: pooling
@@ -36,13 +40,17 @@ def main() raises:
out.enqueue_fill(0)
var a = ctx.enqueue_create_buffer[dtype](SIZE)
a.enqueue_fill(0)
+
with a.map_to_host() as a_host:
for i in range(SIZE):
a_host[i] = Scalar[dtype](i)
+ var out_tensor = TileTensor(out, layout)
+ var a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout)
+
ctx.enqueue_function[pooling, pooling](
- out,
- a,
+ out_tensor,
+ a_tensor,
SIZE,
grid_dim=BLOCKS_PER_GRID,
block_dim=THREADS_PER_BLOCK,
@@ -50,7 +58,6 @@ def main() raises:
var expected = ctx.enqueue_create_host_buffer[dtype](SIZE)
expected.enqueue_fill(0)
-
ctx.synchronize()
with a.map_to_host() as a_host:
@@ -59,7 +66,6 @@ def main() raises:
var s = Scalar[dtype](0)
for j in range(max(i - 2, 0), i + 1):
s += ptr[j]
-
expected[i] = s
with out.map_to_host() as out_host:
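The pooling placeholder (`roughly 10 lines`) asks for a trailing-window sum matching the host-side check `for j in range(max(i - 2, 0), i + 1)`. A sketch of the conventional completion, assuming the `shared` buffer and index bindings above (one block, so `local_i == global_i`):

```mojo
    if global_i < size:
        shared[local_i] = a[global_i]
    barrier()
    if global_i < size:
        if global_i == 0:
            output[0] = shared[0]
        elif global_i == 1:
            output[1] = shared[0] + shared[1]
        else:
            # Window of three trailing elements, matching the host check.
            output[global_i] = (
                shared[local_i - 2] + shared[local_i - 1] + shared[local_i]
            )
```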
diff --git a/problems/p11/p11_layout_tensor.mojo b/problems/p11/p11_layout_tensor.mojo
deleted file mode 100644
index f7e293f1..00000000
--- a/problems/p11/p11_layout_tensor.mojo
+++ /dev/null
@@ -1,78 +0,0 @@
-from std.gpu import thread_idx, block_idx, block_dim, barrier
-from std.gpu.host import DeviceContext
-from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
-from std.testing import assert_equal
-
-# ANCHOR: pooling_layout_tensor
-comptime TPB = 8
-comptime SIZE = 8
-comptime BLOCKS_PER_GRID = (1, 1)
-comptime THREADS_PER_BLOCK = (TPB, 1)
-comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE)
-
-
-def pooling[
- layout: Layout
-](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
- size: Int,
-):
- # Allocate shared memory using tensor builder
- var shared = LayoutTensor[
- dtype,
- Layout.row_major(TPB),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
-
- var global_i = block_dim.x * block_idx.x + thread_idx.x
- var local_i = thread_idx.x
- # FIX ME IN (roughly 10 lines)
-
-
-# ANCHOR_END: pooling_layout_tensor
-
-
-def main() raises:
- with DeviceContext() as ctx:
- var out = ctx.enqueue_create_buffer[dtype](SIZE)
- out.enqueue_fill(0)
- var a = ctx.enqueue_create_buffer[dtype](SIZE)
- a.enqueue_fill(0)
-
- with a.map_to_host() as a_host:
- for i in range(SIZE):
- a_host[i] = Scalar[dtype](i)
-
- var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
- var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a)
-
- ctx.enqueue_function[pooling[layout], pooling[layout]](
- out_tensor,
- a_tensor,
- SIZE,
- grid_dim=BLOCKS_PER_GRID,
- block_dim=THREADS_PER_BLOCK,
- )
-
- var expected = ctx.enqueue_create_host_buffer[dtype](SIZE)
- expected.enqueue_fill(0)
- ctx.synchronize()
-
- with a.map_to_host() as a_host:
- var ptr = a_host
- for i in range(SIZE):
- var s = Scalar[dtype](0)
- for j in range(max(i - 2, 0), i + 1):
- s += ptr[j]
- expected[i] = s
-
- with out.map_to_host() as out_host:
- print("out:", out_host)
- print("expected:", expected)
- for i in range(SIZE):
- assert_equal(out_host[i], expected[i])
- print("Puzzle 11 complete โ
")
diff --git a/problems/p12/p12.mojo b/problems/p12/p12.mojo
index 4b2e1153..bdd36908 100644
--- a/problems/p12/p12.mojo
+++ b/problems/p12/p12.mojo
@@ -1,21 +1,29 @@
-from std.memory import UnsafePointer, stack_allocation
-from std.gpu import thread_idx, block_idx, block_dim, barrier
-from std.gpu.host import DeviceContext
-from std.gpu.memory import AddressSpace
from std.testing import assert_equal
+from std.gpu.host import DeviceContext
# ANCHOR: dot_product
+from std.gpu import thread_idx, block_idx, block_dim, barrier
+from std.gpu.memory import AddressSpace
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
+
+
comptime TPB = 8
comptime SIZE = 8
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
+comptime layout = row_major[SIZE]()
+comptime out_layout = row_major[1]()
+comptime LayoutType = type_of(layout)
+comptime OutLayout = type_of(out_layout)
def dot_product(
- output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
- a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
- b: UnsafePointer[Scalar[dtype], MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
+ b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
size: Int,
):
# FILL ME IN (roughly 13 lines)
@@ -33,15 +41,20 @@ def main() raises:
a.enqueue_fill(0)
var b = ctx.enqueue_create_buffer[dtype](SIZE)
b.enqueue_fill(0)
+
with a.map_to_host() as a_host, b.map_to_host() as b_host:
for i in range(SIZE):
a_host[i] = Scalar[dtype](i)
b_host[i] = Scalar[dtype](i)
+ var out_tensor = TileTensor(out, out_layout)
+ var a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout)
+ var b_tensor = TileTensor[mut=False, dtype, LayoutType](b, layout)
+
ctx.enqueue_function[dot_product, dot_product](
- out,
- a,
- b,
+ out_tensor,
+ a_tensor,
+ b_tensor,
SIZE,
grid_dim=BLOCKS_PER_GRID,
block_dim=THREADS_PER_BLOCK,
@@ -49,7 +62,6 @@ def main() raises:
var expected = ctx.enqueue_create_host_buffer[dtype](1)
expected.enqueue_fill(0)
-
ctx.synchronize()
with a.map_to_host() as a_host, b.map_to_host() as b_host:
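The dot-product placeholder (`roughly 13 lines`) is the classic stage-then-tree-reduce pattern; a sketch of the elided body under the new API, assuming `TPB` is a power of two (which holds here with `TPB = 8`):

```mojo
    # Stage elementwise products in shared memory.
    var shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB]())
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x
    if global_i < size:
        shared[local_i] = a[global_i] * b[global_i]
    barrier()
    # Tree reduction: halve the active stride until one value remains.
    var stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]
        barrier()
        stride = stride // 2
    if local_i == 0:
        output[0] = shared[0]
```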
diff --git a/problems/p12/p12_layout_tensor.mojo b/problems/p12/p12_layout_tensor.mojo
deleted file mode 100644
index 691730cf..00000000
--- a/problems/p12/p12_layout_tensor.mojo
+++ /dev/null
@@ -1,74 +0,0 @@
-from std.testing import assert_equal
-from std.gpu.host import DeviceContext
-
-# ANCHOR: dot_product_layout_tensor
-from std.gpu import thread_idx, block_idx, block_dim, barrier
-from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
-
-
-comptime TPB = 8
-comptime SIZE = 8
-comptime BLOCKS_PER_GRID = (1, 1)
-comptime THREADS_PER_BLOCK = (TPB, 1)
-comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE)
-comptime out_layout = Layout.row_major(1)
-
-
-def dot_product[
- in_layout: Layout, out_layout: Layout
-](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
- b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
- size: Int,
-):
- # FILL ME IN (roughly 13 lines)
- ...
-
-
-# ANCHOR_END: dot_product_layout_tensor
-
-
-def main() raises:
- with DeviceContext() as ctx:
- var out = ctx.enqueue_create_buffer[dtype](1)
- out.enqueue_fill(0)
- var a = ctx.enqueue_create_buffer[dtype](SIZE)
- a.enqueue_fill(0)
- var b = ctx.enqueue_create_buffer[dtype](SIZE)
- b.enqueue_fill(0)
-
- with a.map_to_host() as a_host, b.map_to_host() as b_host:
- for i in range(SIZE):
- a_host[i] = Scalar[dtype](i)
- b_host[i] = Scalar[dtype](i)
-
- var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out)
- var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a)
- var b_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](b)
-
- comptime kernel = dot_product[layout, out_layout]
- ctx.enqueue_function[kernel, kernel](
- out_tensor,
- a_tensor,
- b_tensor,
- SIZE,
- grid_dim=BLOCKS_PER_GRID,
- block_dim=THREADS_PER_BLOCK,
- )
-
- var expected = ctx.enqueue_create_host_buffer[dtype](1)
- expected.enqueue_fill(0)
- ctx.synchronize()
-
- with a.map_to_host() as a_host, b.map_to_host() as b_host:
- for i in range(SIZE):
- expected[0] += a_host[i] * b_host[i]
-
- with out.map_to_host() as out_host:
- print("out:", out_host)
- print("expected:", expected)
- assert_equal(out_host[0], expected[0])
- print("Puzzle 12 complete โ
")
diff --git a/problems/p13/p13.mojo b/problems/p13/p13.mojo
index 1da7b0b0..2430bc85 100644
--- a/problems/p13/p13.mojo
+++ b/problems/p13/p13.mojo
@@ -1,7 +1,9 @@
from std.gpu import thread_idx, block_idx, block_dim, barrier
from std.gpu.host import DeviceContext
from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from std.sys import argv
from std.testing import assert_equal
@@ -12,17 +14,18 @@ comptime CONV = 3
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
-comptime in_layout = Layout.row_major(SIZE)
-comptime out_layout = Layout.row_major(SIZE)
-comptime conv_layout = Layout.row_major(CONV)
-
-
-def conv_1d_simple[
- in_layout: Layout, out_layout: Layout, conv_layout: Layout
-](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
- b: LayoutTensor[dtype, conv_layout, ImmutAnyOrigin],
+comptime in_layout = row_major[SIZE]()
+comptime InLayout = type_of(in_layout)
+comptime out_layout = row_major[SIZE]()
+comptime OutLayout = type_of(out_layout)
+comptime conv_layout = row_major[CONV]()
+comptime ConvLayout = type_of(conv_layout)
+
+
+def conv_1d_simple(
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
+ b: TileTensor[mut=False, dtype, ConvLayout, ImmutAnyOrigin],
):
var global_i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
@@ -36,17 +39,18 @@ comptime SIZE_2 = 15
comptime CONV_2 = 4
comptime BLOCKS_PER_GRID_2 = (2, 1)
comptime THREADS_PER_BLOCK_2 = (TPB, 1)
-comptime in_2_layout = Layout.row_major(SIZE_2)
-comptime out_2_layout = Layout.row_major(SIZE_2)
-comptime conv_2_layout = Layout.row_major(CONV_2)
-
-
-def conv_1d_block_boundary[
- in_layout: Layout, out_layout: Layout, conv_layout: Layout, dtype: DType
-](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
- b: LayoutTensor[dtype, conv_layout, ImmutAnyOrigin],
+comptime in_2_layout = row_major[SIZE_2]()
+comptime In2Layout = type_of(in_2_layout)
+comptime out_2_layout = row_major[SIZE_2]()
+comptime Out2Layout = type_of(out_2_layout)
+comptime conv_2_layout = row_major[CONV_2]()
+comptime Conv2Layout = type_of(conv_2_layout)
+
+
+def conv_1d_block_boundary(
+ output: TileTensor[mut=True, dtype, Out2Layout, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, In2Layout, ImmutAnyOrigin],
+ b: TileTensor[mut=False, dtype, Conv2Layout, ImmutAnyOrigin],
):
var global_i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
@@ -84,11 +88,12 @@ def main() raises:
)
if argv()[1] == "--simple":
- var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out)
- var a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a)
- var b_tensor = LayoutTensor[dtype, conv_layout, ImmutAnyOrigin](b)
- comptime kernel = conv_1d_simple[in_layout, out_layout, conv_layout]
- ctx.enqueue_function[kernel, kernel](
+ var out_tensor = TileTensor(out, out_layout)
+ var a_tensor = TileTensor[mut=False, dtype, InLayout](a, in_layout)
+ var b_tensor = TileTensor[mut=False, dtype, ConvLayout](
+ b, conv_layout
+ )
+ ctx.enqueue_function[conv_1d_simple, conv_1d_simple](
out_tensor,
a_tensor,
b_tensor,
@@ -96,15 +101,16 @@ def main() raises:
block_dim=THREADS_PER_BLOCK,
)
else:
- var out_tensor = LayoutTensor[dtype, out_2_layout, MutAnyOrigin](
- out
+ var out_tensor = TileTensor(out, out_2_layout)
+ var a_tensor = TileTensor[mut=False, dtype, In2Layout](
+ a, in_2_layout
+ )
+ var b_tensor = TileTensor[mut=False, dtype, Conv2Layout](
+ b, conv_2_layout
)
- var a_tensor = LayoutTensor[dtype, in_2_layout, ImmutAnyOrigin](a)
- var b_tensor = LayoutTensor[dtype, conv_2_layout, ImmutAnyOrigin](b)
- comptime kernel = conv_1d_block_boundary[
- in_2_layout, out_2_layout, conv_2_layout, dtype
- ]
- ctx.enqueue_function[kernel, kernel](
+ ctx.enqueue_function[
+ conv_1d_block_boundary, conv_1d_block_boundary
+ ](
out_tensor,
a_tensor,
b_tensor,
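For the simple case (`SIZE = 6`, one block), the elided convolution body needs no shared staging at all; a direct sketch, assuming the `global_i` binding above and the comptime constants `SIZE` and `CONV`:

```mojo
    if global_i < SIZE:
        var local_sum: output.ElementType = 0
        # Unrolled at compile time since CONV is comptime.
        comptime for j in range(CONV):
            if global_i + j < SIZE:
                local_sum += a[global_i + j] * b[j]
        output[global_i] = local_sum
```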
diff --git a/problems/p14/p14.mojo b/problems/p14/p14.mojo
index e48e0c5f..a3673699 100644
--- a/problems/p14/p14.mojo
+++ b/problems/p14/p14.mojo
@@ -1,7 +1,9 @@
from std.gpu import thread_idx, block_idx, block_dim, barrier
from std.gpu.host import DeviceContext
from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from std.sys import argv
from std.math import log2
from std.testing import assert_equal
@@ -12,14 +14,13 @@ comptime SIZE = 8
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)
-def prefix_sum_simple[
- layout: Layout
-](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+def prefix_sum_simple(
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
size: Int,
):
var global_i = block_dim.x * block_idx.x + thread_idx.x
@@ -34,16 +35,16 @@ comptime SIZE_2 = 15
comptime BLOCKS_PER_GRID_2 = (2, 1)
comptime THREADS_PER_BLOCK_2 = (TPB, 1)
comptime EXTENDED_SIZE = SIZE_2 + 2 # SIZE_2 elements + one block-sum slot per block (2 blocks)
-comptime layout_2 = Layout.row_major(SIZE_2)
-comptime extended_layout = Layout.row_major(EXTENDED_SIZE)
+comptime layout_2 = row_major[SIZE_2]()
+comptime Layout2Type = type_of(layout_2)
+comptime extended_layout = row_major[EXTENDED_SIZE]()
+comptime ExtendedLayoutType = type_of(extended_layout)
# Kernel 1: Compute local prefix sums and store block sums in out
-def prefix_sum_local_phase[
- out_layout: Layout, in_layout: Layout
-](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
+def prefix_sum_local_phase(
+ output: TileTensor[mut=True, dtype, ExtendedLayoutType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, Layout2Type, ImmutAnyOrigin],
size: Int,
):
var global_i = block_dim.x * block_idx.x + thread_idx.x
@@ -52,9 +53,10 @@ def prefix_sum_local_phase[
# Kernel 2: Add block sums to their respective blocks
-def prefix_sum_block_sum_phase[
- layout: Layout
-](output: LayoutTensor[dtype, layout, MutAnyOrigin], size: Int):
+def prefix_sum_block_sum_phase(
+ output: TileTensor[mut=True, dtype, ExtendedLayoutType, MutAnyOrigin],
+ size: Int,
+):
var global_i = block_dim.x * block_idx.x + thread_idx.x
# FILL ME IN (roughly 3 lines)
@@ -91,11 +93,10 @@ def main() raises:
a_host[i] = Scalar[dtype](i)
if use_simple:
- a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a)
- out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
+ a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout)
+ out_tensor = TileTensor(out, layout)
- comptime kernel = prefix_sum_simple[layout]
- ctx.enqueue_function[kernel, kernel](
+ ctx.enqueue_function[prefix_sum_simple, prefix_sum_simple](
out_tensor,
a_tensor,
size,
@@ -103,15 +104,16 @@ def main() raises:
block_dim=THREADS_PER_BLOCK,
)
else:
- var a_tensor = LayoutTensor[dtype, layout_2, ImmutAnyOrigin](a)
- var out_tensor = LayoutTensor[dtype, extended_layout, MutAnyOrigin](
- out
+ var a_tensor = TileTensor[mut=False, dtype, Layout2Type](
+ a, layout_2
)
+ var out_tensor = TileTensor(out, extended_layout)
# ANCHOR: prefix_sum_complete_block_level_sync
# Phase 1: Local prefix sums
- comptime kernel = prefix_sum_local_phase[extended_layout, layout_2]
- ctx.enqueue_function[kernel, kernel](
+ ctx.enqueue_function[
+ prefix_sum_local_phase, prefix_sum_local_phase
+ ](
out_tensor,
a_tensor,
size,
@@ -123,8 +125,9 @@ def main() raises:
# No explicit ctx.synchronize() needed in this case.
# Phase 2: Add block sums
- comptime kernel2 = prefix_sum_block_sum_phase[extended_layout]
- ctx.enqueue_function[kernel2, kernel2](
+ ctx.enqueue_function[
+ prefix_sum_block_sum_phase, prefix_sum_block_sum_phase
+ ](
out_tensor,
size,
grid_dim=BLOCKS_PER_GRID_2,
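The simple prefix-sum placeholder is conventionally a Hillis-Steele inclusive scan in shared memory; a sketch for the single-block case, assuming `local_i = thread_idx.x` alongside the `global_i` binding above:

```mojo
    var shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB]())
    var local_i = thread_idx.x
    if global_i < size:
        shared[local_i] = a[global_i]
    barrier()
    var offset = 1
    while offset < TPB:
        # Read-then-write split by barriers so no lane sees a partial update.
        var prev: output.ElementType = 0
        if local_i >= offset:
            prev = shared[local_i - offset]
        barrier()
        if local_i >= offset:
            shared[local_i] += prev
        barrier()
        offset *= 2
    if global_i < size:
        output[global_i] = shared[local_i]
```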
diff --git a/problems/p15/p15.mojo b/problems/p15/p15.mojo
index 4a4f79a2..c9f7ead5 100644
--- a/problems/p15/p15.mojo
+++ b/problems/p15/p15.mojo
@@ -4,7 +4,9 @@ from std.gpu.host import DeviceContext
# ANCHOR: axis_sum
from std.gpu import thread_idx, block_idx, block_dim, barrier
from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
comptime TPB = 8
@@ -13,15 +15,15 @@ comptime SIZE = 6
comptime BLOCKS_PER_GRID = (1, BATCH)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
-comptime in_layout = Layout.row_major(BATCH, SIZE)
-comptime out_layout = Layout.row_major(BATCH, 1)
+comptime in_layout = row_major[BATCH, SIZE]()
+comptime InLayout = type_of(in_layout)
+comptime out_layout = row_major[BATCH, 1]()
+comptime OutLayout = type_of(out_layout)
-def axis_sum[
- in_layout: Layout, out_layout: Layout
-](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
+def axis_sum(
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
size: Int,
):
var global_i = block_dim.x * block_idx.x + thread_idx.x
@@ -44,11 +46,10 @@ def main() raises:
for col in range(SIZE):
inp_host[row * SIZE + col] = Scalar[dtype](row * SIZE + col)
- var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out)
- var inp_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](inp)
+ var out_tensor = TileTensor(out, out_layout)
+ var inp_tensor = TileTensor[mut=False, dtype, InLayout](inp, in_layout)
- comptime kernel = axis_sum[in_layout, out_layout]
- ctx.enqueue_function[kernel, kernel](
+ ctx.enqueue_function[axis_sum, axis_sum](
out_tensor,
inp_tensor,
SIZE,
diff --git a/problems/p16/p16.mojo b/problems/p16/p16.mojo
index 8ff079ad..e16cf873 100644
--- a/problems/p16/p16.mojo
+++ b/problems/p16/p16.mojo
@@ -5,7 +5,9 @@ from std.gpu.host import DeviceContext
# ANCHOR: naive_matmul
from std.gpu import thread_idx, block_idx, block_dim, barrier
from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
comptime TPB = 3
@@ -13,15 +15,14 @@ comptime SIZE = 2
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (TPB, TPB)
comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE, SIZE)
+comptime layout = row_major[SIZE, SIZE]()
+comptime LayoutType = type_of(layout)
-def naive_matmul[
- layout: Layout, size: Int
-](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
- b: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+def naive_matmul(
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
+ b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
var row = block_dim.y * block_idx.y + thread_idx.y
var col = block_dim.x * block_idx.x + thread_idx.x
@@ -32,12 +33,10 @@ def naive_matmul[
# ANCHOR: single_block_matmul
-def single_block_matmul[
- layout: Layout, size: Int
-](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
- b: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+def single_block_matmul(
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
+ b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
var row = block_dim.y * block_idx.y + thread_idx.y
var col = block_dim.x * block_idx.x + thread_idx.x
@@ -52,15 +51,14 @@ def single_block_matmul[
comptime SIZE_TILED = 9
comptime BLOCKS_PER_GRID_TILED = (3, 3) # each block covers 3x3 elements
comptime THREADS_PER_BLOCK_TILED = (TPB, TPB)
-comptime layout_tiled = Layout.row_major(SIZE_TILED, SIZE_TILED)
+comptime layout_tiled = row_major[SIZE_TILED, SIZE_TILED]()
+comptime LayoutTiledType = type_of(layout_tiled)
-def matmul_tiled[
- layout: Layout, size: Int
-](
- output: LayoutTensor[dtype, layout_tiled, MutAnyOrigin],
- a: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
- b: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
+def matmul_tiled(
+ output: TileTensor[mut=True, dtype, LayoutTiledType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutTiledType, ImmutAnyOrigin],
+ b: TileTensor[mut=False, dtype, LayoutTiledType, ImmutAnyOrigin],
):
var local_row = thread_idx.y
var local_col = thread_idx.x
@@ -109,13 +107,12 @@ def main() raises:
inp1_host[i * size + k] * inp2_host[k * size + j]
)
- var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
- var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp1)
- var b_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp2)
+ var out_tensor = TileTensor(out, layout)
+ var a_tensor = TileTensor[mut=False, dtype, LayoutType](inp1, layout)
+ var b_tensor = TileTensor[mut=False, dtype, LayoutType](inp2, layout)
if argv()[1] == "--naive":
- comptime kernel = naive_matmul[layout, SIZE]
- ctx.enqueue_function[kernel, kernel](
+ ctx.enqueue_function[naive_matmul, naive_matmul](
out_tensor,
a_tensor,
b_tensor,
@@ -123,8 +120,7 @@ def main() raises:
block_dim=THREADS_PER_BLOCK,
)
elif argv()[1] == "--single-block":
- comptime kernel = single_block_matmul[layout, SIZE]
- ctx.enqueue_function[kernel, kernel](
+ ctx.enqueue_function[single_block_matmul, single_block_matmul](
out_tensor,
a_tensor,
b_tensor,
@@ -133,18 +129,15 @@ def main() raises:
)
elif argv()[1] == "--tiled":
# Need to update the layout of the tensors to the tiled layout
- var out_tensor_tiled = LayoutTensor[
- dtype, layout_tiled, MutAnyOrigin
- ](out)
- var a_tensor_tiled = LayoutTensor[
- dtype, layout_tiled, ImmutAnyOrigin
- ](inp1)
- var b_tensor_tiled = LayoutTensor[
- dtype, layout_tiled, ImmutAnyOrigin
- ](inp2)
-
- comptime kernel = matmul_tiled[layout_tiled, SIZE_TILED]
- ctx.enqueue_function[kernel, kernel](
+ var out_tensor_tiled = TileTensor(out, layout_tiled)
+ var a_tensor_tiled = TileTensor[mut=False, dtype, LayoutTiledType](
+ inp1, layout_tiled
+ )
+ var b_tensor_tiled = TileTensor[mut=False, dtype, LayoutTiledType](
+ inp2, layout_tiled
+ )
+
+ ctx.enqueue_function[matmul_tiled, matmul_tiled](
out_tensor_tiled,
a_tensor_tiled,
b_tensor_tiled,
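The `naive_matmul` body elided above is one thread per output element; a sketch, assuming the `row`/`col` bindings from the hunk (the guard matters because the `TPB x TPB = 3 x 3` block is larger than the `2 x 2` matrix):

```mojo
    if row < SIZE and col < SIZE:
        var acc: output.ElementType = 0
        # SIZE is comptime, so the inner product unrolls at compile time.
        comptime for k in range(SIZE):
            acc += a[row, k] * b[k, col]
        output[row, col] = acc
```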
diff --git a/problems/p17/op/conv1d.mojo b/problems/p17/op/conv1d.mojo
index 808de38c..97de1bb7 100644
--- a/problems/p17/op/conv1d.mojo
+++ b/problems/p17/op/conv1d.mojo
@@ -1,7 +1,9 @@
from std.gpu import thread_idx, block_idx, block_dim, barrier
from std.gpu.host import DeviceContext
from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major, TensorLayout
+from layout.tile_tensor import stack_allocation
# ANCHOR: conv1d_kernel
comptime TPB = 15
@@ -9,32 +11,26 @@ comptime BLOCKS_PER_GRID = (2, 1)
def conv1d_kernel[
- in_layout: Layout,
- out_layout: Layout,
- conv_layout: Layout,
input_size: Int,
conv_size: Int,
+ OutLayout: TensorLayout,
+ InLayout: TensorLayout,
+ ConvLayout: TensorLayout,
dtype: DType = DType.float32,
](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- input: LayoutTensor[dtype, in_layout, MutAnyOrigin],
- kernel: LayoutTensor[dtype, conv_layout, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
+ kernel: TileTensor[mut=False, dtype, ConvLayout, ImmutAnyOrigin],
):
var global_i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
# first: need to account for padding
- var shared_a = LayoutTensor[
- dtype,
- Layout.row_major(TPB + conv_size - 1),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var shared_b = LayoutTensor[
- dtype,
- Layout.row_major(conv_size),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var shared_a = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB + conv_size - 1]())
+ var shared_b = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[conv_size]())
if global_i < input_size:
shared_a[local_i] = input[global_i]
@@ -53,7 +49,7 @@ def conv1d_kernel[
barrier()
if global_i < input_size:
- var local_sum: output.element_type = 0
+ var local_sum: output.ElementType = 0
comptime for j in range(conv_size):
if local_i + j < TPB + conv_size - 1:
@@ -92,9 +88,6 @@ struct Conv1DCustomOp:
var output_tensor = output.to_layout_tensor()
var input_tensor = input.to_layout_tensor()
var kernel_tensor = kernel.to_layout_tensor()
- comptime in_layout = input_tensor.layout
- comptime out_layout = output_tensor.layout
- comptime conv_layout = kernel_tensor.layout
comptime if target == "gpu":
var gpu_ctx = ctx.get_device_context()
diff --git a/problems/p18/op/softmax.mojo b/problems/p18/op/softmax.mojo
index 8839d538..7b3025ec 100644
--- a/problems/p18/op/softmax.mojo
+++ b/problems/p18/op/softmax.mojo
@@ -4,26 +4,28 @@ from std.memory import UnsafePointer
from std.gpu import thread_idx, block_idx, block_dim, barrier
from std.gpu.host import DeviceContext, HostBuffer, DeviceBuffer
from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from std.math import exp
from std.bit import log2_ceil
from std.utils.numerics import max_finite, min_finite
comptime SIZE = 128 # This must be equal to INPUT_SIZE in p18.py
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)
comptime GRID_DIM_X = 1
# Tree-based reduction requires the number of threads to be the next power of two >= SIZE for correctness.
comptime BLOCK_DIM_X = 1 << log2_ceil(SIZE)
def softmax_gpu_kernel[
- layout: Layout,
input_size: Int,
dtype: DType = DType.float32,
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
comptime assert (
dtype.is_floating_point()
@@ -37,12 +39,11 @@ def softmax_gpu_kernel[
# ANCHOR: softmax_cpu_kernel
def softmax_cpu_kernel[
- layout: Layout,
input_size: Int,
dtype: DType = DType.float32,
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
comptime assert (
dtype.is_floating_point()
@@ -71,12 +72,12 @@ struct SoftmaxCustomOp:
ctx: DeviceContextPtr,
) raises:
# Note: rebind is necessary now but it shouldn't be!
- var output_tensor = rebind[LayoutTensor[dtype, layout, MutAnyOrigin]](
- output.to_layout_tensor()
- )
- var input_tensor = rebind[LayoutTensor[dtype, layout, ImmutAnyOrigin]](
- input.to_layout_tensor()
- )
+ var output_tensor = rebind[
+ TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin]
+ ](output.to_layout_tensor())
+ var input_tensor = rebind[
+ TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin]
+ ](input.to_layout_tensor())
comptime if target == "gpu":
var gpu_ctx = ctx.get_device_context()
@@ -91,7 +92,7 @@ struct SoftmaxCustomOp:
0,
)
- comptime kernel = softmax_gpu_kernel[layout, input_size, dtype]
+ comptime kernel = softmax_gpu_kernel[input_size, dtype]
gpu_ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
@@ -100,8 +101,6 @@ struct SoftmaxCustomOp:
)
elif target == "cpu":
- softmax_cpu_kernel[layout, input_size, dtype](
- output_tensor, input_tensor
- )
+ softmax_cpu_kernel[input_size, dtype](output_tensor, input_tensor)
else:
raise Error("Unsupported target: " + target)
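The comment on `BLOCK_DIM_X` above is worth unpacking: a tree reduction halves the active stride each step, so every live slot must have a partner, which only holds when the thread count is a power of two. A sketch of the max-reduction step it refers to, assuming a `shared_max` buffer of `BLOCK_DIM_X` elements and `local_i = thread_idx.x` (out-of-range lanes seeded with the identity):

```mojo
    shared_max[local_i] = (
        input[local_i] if local_i < input_size else min_finite[dtype]()
    )
    barrier()
    var stride = BLOCK_DIM_X // 2
    while stride > 0:
        if local_i < stride:
            shared_max[local_i] = max(
                shared_max[local_i], shared_max[local_i + stride]
            )
        barrier()
        stride = stride // 2
    # shared_max[0] now holds the row max used to stabilize exp().
```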
diff --git a/problems/p18/test/test_softmax.mojo b/problems/p18/test/test_softmax.mojo
index 70b25871..483a2321 100644
--- a/problems/p18/test/test_softmax.mojo
+++ b/problems/p18/test/test_softmax.mojo
@@ -1,12 +1,14 @@
from std.gpu.host import DeviceContext
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
from std.testing import assert_almost_equal
from std.bit import log2_ceil
from op import softmax_gpu_kernel, softmax_cpu_kernel
comptime SIZE = 128
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)
comptime GRID_DIM_X = 1
comptime BLOCK_DIM_X = 1 << log2_ceil(SIZE)
comptime dtype = DType.float32
@@ -21,9 +23,7 @@ def test_softmax() raises:
# for CPU testing
var expected = ctx.enqueue_create_host_buffer[DType.float32](SIZE)
expected.enqueue_fill(0)
- var expected_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
- expected
- )
+ var expected_tensor = TileTensor(expected, layout)
# Initialize input with more reasonable values
with inp.map_to_host() as inp_host:
for i in range(SIZE):
@@ -34,21 +34,19 @@ def test_softmax() raises:
print(inp_host[i], end=" ")
print()
# Create layout tensors for CPU calculation
- input_host_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
- inp_host
+ input_host_tensor = TileTensor[mut=False, dtype, LayoutType](
+ inp_host, layout
)
# for GPU testing
- var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
- var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp)
+ var output_tensor = TileTensor(out, layout)
+ var input_tensor = TileTensor[mut=False, dtype, LayoutType](inp, layout)
# Compute expected results using our CPU kernel
- softmax_cpu_kernel[layout, SIZE, dtype](
- expected_tensor, input_host_tensor
- )
+ softmax_cpu_kernel[SIZE, dtype](expected_tensor, input_host_tensor)
# Run GPU kernel
- comptime kernel = softmax_gpu_kernel[layout, SIZE, dtype]
+ comptime kernel = softmax_gpu_kernel[SIZE, dtype]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
diff --git a/problems/p19/op/attention.mojo b/problems/p19/op/attention.mojo
index 3e71bdb2..99aca727 100644
--- a/problems/p19/op/attention.mojo
+++ b/problems/p19/op/attention.mojo
@@ -2,7 +2,9 @@ from std.memory import UnsafePointer
from std.gpu import thread_idx, block_idx, block_dim, barrier
from std.gpu.host import DeviceContext, HostBuffer, DeviceBuffer
from std.gpu.memory import AddressSpace, async_copy_wait_all
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major, TensorLayout
+from layout.tile_tensor import stack_allocation
from layout.layout_tensor import copy_dram_to_sram_async
from std.math import exp
from std.bit import log2_ceil
@@ -22,24 +24,24 @@ comptime SOFTMAX_BLOCK_DIM_X = 1 << log2_ceil(SEQ_LEN)
# Tiled matrix multiplication (from p16), updated to:
-# 1) Support different layouts for input (a, b) and output LayoutTensors.
+# 1) Support different layouts for input (a, b) and output TileTensors.
# 2) Handle cases where the inner dimension is not a multiple of MATMUL_BLOCK_DIM_XY.
# 3) Explicitly check for out-of-bounds elements.
-# The approach still tiles all three LayoutTensors (a, b, and output) into identical square tiles
+# The approach still tiles all three TileTensors (a, b, and output) into identical square tiles
# of size (MATMUL_BLOCK_DIM_XY x MATMUL_BLOCK_DIM_XY) with each thread loading one element
# from a and b, and writing one element to output.
def matmul_idiomatic_tiled[
- a_layout: Layout,
- b_layout: Layout,
- out_layout: Layout,
rows: Int,
cols: Int,
inner: Int,
+ OutLayout: TensorLayout,
+ ALayout: TensorLayout,
+ BLayout: TensorLayout,
dtype: DType = DType.float32,
](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, a_layout, MutAnyOrigin],
- b: LayoutTensor[dtype, b_layout, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, ALayout, MutAnyOrigin],
+ b: TileTensor[mut=False, dtype, BLayout, MutAnyOrigin],
):
"""Updated idiomatic tiled matrix multiplication from p16."""
var local_row = thread_idx.y
@@ -51,26 +53,23 @@ def matmul_idiomatic_tiled[
var out_tile = output.tile[MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY](
block_idx.y, block_idx.x
)
- var a_shared = LayoutTensor[
- dtype,
- Layout.row_major(MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var b_shared = LayoutTensor[
- dtype,
- Layout.row_major(MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var acc: output.element_type = 0
-
- comptime load_a_layout = Layout.row_major(
+ comptime shared_layout = row_major[
MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY
- ) # Coalesced loading
- comptime load_b_layout = Layout.row_major(
+ ]()
+ var a_shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](shared_layout)
+ var b_shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](shared_layout)
+ var acc: output.ElementType = 0
+
+ comptime load_a_layout = row_major[
MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY
- ) # Coalesced loading
+ ]() # Coalesced loading
+ comptime load_b_layout = row_major[
+ MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY
+ ]() # Coalesced loading
comptime for idx in range(
(inner + MATMUL_BLOCK_DIM_XY - 1) // MATMUL_BLOCK_DIM_XY
@@ -118,14 +117,14 @@ def matmul_idiomatic_tiled[
# ANCHOR: transpose_kernel
def transpose_kernel[
- layout_in: Layout, # Layout for input matrix (seq_len, d)
- layout_out: Layout, # Layout for output matrix (d, seq_len)
rows: Int,
cols: Int,
+ OutLayout: TensorLayout,
+ InLayout: TensorLayout,
dtype: DType = DType.float32,
](
- output: LayoutTensor[dtype, layout_out, MutAnyOrigin],
- inp: LayoutTensor[dtype, layout_in, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ inp: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
):
# FILL ME IN (roughly 18 lines)
...
@@ -136,28 +135,23 @@ def transpose_kernel[
# Apply softmax to attention scores taken from p16
def softmax_gpu_kernel[
- layout: Layout,
input_size: Int,
+ LayoutType: TensorLayout,
dtype: DType = DType.float32,
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutType, MutAnyOrigin],
):
comptime assert (
dtype.is_floating_point()
), "dtype must be a floating-point type"
- var shared_max = LayoutTensor[
- dtype,
- Layout.row_major(SOFTMAX_BLOCK_DIM_X),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var shared_sum = LayoutTensor[
- dtype,
- Layout.row_major(SOFTMAX_BLOCK_DIM_X),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ comptime softmax_layout = row_major[SOFTMAX_BLOCK_DIM_X]()
+ var shared_max = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](softmax_layout)
+ var shared_sum = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](softmax_layout)
var global_i = thread_idx.x
# Initialize out-of-bounds (shared_max[local_i], global_i >= input_size) shared memory addresses to the minimum
@@ -208,18 +202,18 @@ def softmax_gpu_kernel[
# CPU implementation for vector attention
def attention_cpu_kernel[
- layout_q: Layout,
- layout_k: Layout,
- layout_v: Layout,
- layout_out: Layout,
seq_len: Int,
d: Int,
+ OutLayout: TensorLayout,
+ QLayout: TensorLayout,
+ KLayout: TensorLayout,
+ VLayout: TensorLayout,
dtype: DType = DType.float32,
](
- output: LayoutTensor[dtype, layout_out, MutAnyOrigin],
- q: LayoutTensor[dtype, layout_q, MutAnyOrigin],
- k: LayoutTensor[dtype, layout_k, ImmutAnyOrigin],
- v: LayoutTensor[dtype, layout_v, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ q: TileTensor[mut=False, dtype, QLayout, MutAnyOrigin],
+ k: TileTensor[mut=False, dtype, KLayout, ImmutAnyOrigin],
+ v: TileTensor[mut=False, dtype, VLayout, MutAnyOrigin],
):
"""CPU implementation of vector attention."""
var scores = List[Float32]()
@@ -273,25 +267,30 @@ struct AttentionCustomOp:
ctx: DeviceContextPtr,
) raises:
# Define layouts
- comptime layout_q = Layout.row_major(d)
- comptime layout_k = Layout.row_major(seq_len, d)
- comptime layout_v = Layout.row_major(seq_len, d)
- comptime layout_out = Layout.row_major(d)
- comptime layout_scores = Layout.row_major(seq_len)
+ comptime layout_q = row_major[d]()
+ comptime layout_k = row_major[seq_len, d]()
+ comptime layout_v = row_major[seq_len, d]()
+ comptime layout_out = row_major[d]()
+ comptime layout_scores = row_major[seq_len]()
+ comptime QLayout = type_of(layout_q)
+ comptime KLayout = type_of(layout_k)
+ comptime VLayout = type_of(layout_v)
+ comptime OutLayout = type_of(layout_out)
+ comptime ScoresLayout = type_of(layout_scores)
# Convert to layout tensors
var output_tensor = rebind[
- LayoutTensor[dtype, layout_out, MutAnyOrigin]
+ TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin]
](output.to_layout_tensor())
- var q_tensor = rebind[LayoutTensor[dtype, layout_q, MutAnyOrigin]](
- q.to_layout_tensor()
- )
- var k_tensor = rebind[LayoutTensor[dtype, layout_k, ImmutAnyOrigin]](
- k.to_layout_tensor()
- )
- var v_tensor = rebind[LayoutTensor[dtype, layout_v, MutAnyOrigin]](
- v.to_layout_tensor()
- )
+ var q_tensor = rebind[
+ TileTensor[mut=False, dtype, QLayout, MutAnyOrigin]
+ ](q.to_layout_tensor())
+ var k_tensor = rebind[
+ TileTensor[mut=False, dtype, KLayout, ImmutAnyOrigin]
+ ](k.to_layout_tensor())
+ var v_tensor = rebind[
+ TileTensor[mut=False, dtype, VLayout, MutAnyOrigin]
+ ](v.to_layout_tensor())
comptime if target == "gpu":
# ANCHOR: attention_orchestration
@@ -299,15 +298,20 @@ struct AttentionCustomOp:
# Define layouts for matrix multiplication
# Q reshaped to (1, d)
- comptime layout_q_2d = Layout.row_major(1, d)
+ comptime layout_q_2d = row_major[1, d]()
+ comptime Q2DLayout = type_of(layout_q_2d)
# K^T is (d, seq_len)
- comptime layout_k_t = Layout.row_major(d, seq_len)
+ comptime layout_k_t = row_major[d, seq_len]()
+ comptime KTLayout = type_of(layout_k_t)
# Scores as (1, seq_len)
- comptime layout_scores_2d = Layout.row_major(1, seq_len)
+ comptime layout_scores_2d = row_major[1, seq_len]()
+ comptime Scores2DLayout = type_of(layout_scores_2d)
# Weights as (1, seq_len)
- comptime layout_weights_2d = Layout.row_major(1, seq_len)
+ comptime layout_weights_2d = row_major[1, seq_len]()
+ comptime Weights2DLayout = type_of(layout_weights_2d)
# Result as (1, d)
- comptime layout_result_2d = Layout.row_major(1, d)
+ comptime layout_result_2d = row_major[1, d]()
+ comptime Result2DLayout = type_of(layout_result_2d)
# Transpose implementation limited to square (TRANSPOSE_BLOCK_DIM_XY x TRANSPOSE_BLOCK_DIM_XY) thread blocks
comptime transpose_threads_per_block = (
@@ -344,7 +348,7 @@ struct AttentionCustomOp:
seq_len
) # Reused for scores and weights
- var k_t = LayoutTensor[dtype, layout_k_t, MutAnyOrigin](k_t_buf)
+ var k_t = TileTensor(k_t_buf, layout_k_t)
# Step 1: Reshape Q from (d,) to (1, d) - no buffer needed
# FILL ME IN 1 line
@@ -373,9 +377,9 @@ struct AttentionCustomOp:
# ANCHOR_END: attention_orchestration
elif target == "cpu":
- attention_cpu_kernel[
- layout_q, layout_k, layout_v, layout_out, seq_len, d, dtype
- ](output_tensor, q_tensor, k_tensor, v_tensor)
+ attention_cpu_kernel[seq_len, d, dtype](
+ output_tensor, q_tensor, k_tensor, v_tensor
+ )
else:
raise Error("Unsupported target: " + target)
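The `transpose_kernel` placeholder (`roughly 18 lines`) is the standard shared-memory tile transpose, which the orchestration comment above already constrains to square `TRANSPOSE_BLOCK_DIM_XY` blocks. A sketch under the new API; all names besides the kernel's own parameters are assumptions:

```mojo
    comptime shared_layout = row_major[
        TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY
    ]()
    var tile = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](shared_layout)
    var local_row = thread_idx.y
    var local_col = thread_idx.x
    var in_row = block_idx.y * TRANSPOSE_BLOCK_DIM_XY + local_row
    var in_col = block_idx.x * TRANSPOSE_BLOCK_DIM_XY + local_col
    if in_row < rows and in_col < cols:
        tile[local_row, local_col] = inp[in_row, in_col]
    barrier()
    # Swap block indices and local indices so writes stay coalesced.
    var out_row = block_idx.x * TRANSPOSE_BLOCK_DIM_XY + local_row
    var out_col = block_idx.y * TRANSPOSE_BLOCK_DIM_XY + local_col
    if out_row < cols and out_col < rows:
        output[out_row, out_col] = tile[local_col, local_row]
```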
diff --git a/problems/p20/op/conv1d.mojo b/problems/p20/op/conv1d.mojo
index 07d29d9a..7c6bec92 100644
--- a/problems/p20/op/conv1d.mojo
+++ b/problems/p20/op/conv1d.mojo
@@ -2,7 +2,9 @@
from std.gpu import thread_idx, block_idx, block_dim, barrier
from std.gpu.host import DeviceContext
from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major, TensorLayout
+from layout.tile_tensor import stack_allocation
from std.sys import argv
from std.testing import assert_equal
@@ -12,32 +14,26 @@ comptime BLOCKS_PER_GRID = (2, 1)
# ANCHOR: conv1d_kernel
def conv1d_kernel[
- in_layout: Layout,
- out_layout: Layout,
- conv_layout: Layout,
input_size: Int,
conv_size: Int,
+ OutLayout: TensorLayout,
+ InLayout: TensorLayout,
+ ConvLayout: TensorLayout,
dtype: DType = DType.float32,
](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- input: LayoutTensor[dtype, in_layout, MutAnyOrigin],
- kernel: LayoutTensor[dtype, conv_layout, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, InLayout, MutAnyOrigin],
+ kernel: TileTensor[mut=False, dtype, ConvLayout, MutAnyOrigin],
):
var global_i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
# first: need to account for padding
- var shared_a = LayoutTensor[
- dtype,
- Layout.row_major(TPB + conv_size - 1),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var shared_b = LayoutTensor[
- dtype,
- Layout.row_major(conv_size),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var shared_a = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB + conv_size - 1]())
+ var shared_b = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[conv_size]())
if global_i < input_size:
shared_a[local_i] = input[global_i]
@@ -58,7 +54,7 @@ def conv1d_kernel[
barrier()
if global_i < input_size:
- var local_sum: output.element_type = 0
+ var local_sum: output.ElementType = 0
comptime for j in range(conv_size):
if local_i + j < TPB + conv_size - 1:
@@ -95,9 +91,6 @@ struct Conv1DCustomOp:
var out_tensor = output.to_layout_tensor()
var input_tensor = input.to_layout_tensor()
var kernel_tensor = kernel.to_layout_tensor()
- comptime in_layout = input_tensor.layout
- comptime out_layout = out_tensor.layout
- comptime conv_layout = kernel_tensor.layout
comptime if target == "gpu":
var gpu_ctx = ctx.get_device_context()
@@ -111,9 +104,7 @@ struct Conv1DCustomOp:
),
0,
)
- comptime kernel = conv1d_kernel[
- in_layout, out_layout, conv_layout, input_size, conv_size
- ]
+ comptime kernel = conv1d_kernel[input_size, conv_size]
gpu_ctx.enqueue_function[kernel, kernel](
out_tensor,
input_tensor,
diff --git a/problems/p21/op/embedding.mojo b/problems/p21/op/embedding.mojo
index 8108d7ba..22e73d51 100644
--- a/problems/p21/op/embedding.mojo
+++ b/problems/p21/op/embedding.mojo
@@ -1,7 +1,8 @@
from std.math import ceildiv
from std.gpu import thread_idx, block_idx, block_dim, grid_dim, barrier
from std.gpu.host import DeviceContext
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major, TensorLayout
from std.sys import argv
from std.testing import assert_equal
@@ -10,18 +11,18 @@ comptime THREADS_PER_BLOCK = 256
def embedding_kernel_coalesced[
- indices_layout: Layout,
- weights_layout: Layout,
- out_layout: Layout,
batch_size: Int,
seq_len: Int,
vocab_size: Int,
embed_dim: Int,
+ OutLayout: TensorLayout,
+ IndicesLayout: TensorLayout,
+ WeightsLayout: TensorLayout,
dtype: DType = DType.float32,
](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- indices: LayoutTensor[DType.int32, indices_layout, MutAnyOrigin],
- weights: LayoutTensor[dtype, weights_layout, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ indices: TileTensor[mut=False, DType.int32, IndicesLayout, MutAnyOrigin],
+ weights: TileTensor[mut=False, dtype, WeightsLayout, MutAnyOrigin],
):
"""
Memory-coalescing focused embedding kernel.
@@ -54,18 +55,18 @@ def embedding_kernel_coalesced[
# ANCHOR: embedding_kernel_2d
def embedding_kernel_2d[
- indices_layout: Layout,
- weights_layout: Layout,
- out_layout: Layout,
batch_size: Int,
seq_len: Int,
vocab_size: Int,
embed_dim: Int,
+ OutLayout: TensorLayout,
+ IndicesLayout: TensorLayout,
+ WeightsLayout: TensorLayout,
dtype: DType = DType.float32,
](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- indices: LayoutTensor[DType.int32, indices_layout, MutAnyOrigin],
- weights: LayoutTensor[dtype, weights_layout, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ indices: TileTensor[mut=False, DType.int32, IndicesLayout, MutAnyOrigin],
+ weights: TileTensor[mut=False, dtype, WeightsLayout, MutAnyOrigin],
):
"""
2D grid non-coalesced embedding kernel.
@@ -128,10 +129,6 @@ struct EmbeddingCustomOp:
var indices_tensor = indices.to_layout_tensor()
var weights_tensor = weights.to_layout_tensor()
- comptime indices_layout = indices_tensor.layout
- comptime weights_layout = weights_tensor.layout
- comptime out_layout = output_tensor.layout
-
comptime if target == "gpu":
var gpu_ctx = ctx.get_device_context()
@@ -152,9 +149,6 @@ struct EmbeddingCustomOp:
# Compile and launch optimized kernel
comptime kernel = embedding_kernel_coalesced[
- indices_layout,
- weights_layout,
- out_layout,
batch_size,
seq_len,
vocab_size,
@@ -210,10 +204,6 @@ struct Embedding2DCustomOp:
var indices_tensor = indices.to_layout_tensor()
var weights_tensor = weights.to_layout_tensor()
- comptime indices_layout = indices_tensor.layout
- comptime weights_layout = weights_tensor.layout
- comptime out_layout = output_tensor.layout
-
comptime if target == "gpu":
var gpu_ctx = ctx.get_device_context()
@@ -237,9 +227,6 @@ struct Embedding2DCustomOp:
# Compile and launch 2D kernel
comptime kernel = embedding_kernel_2d[
- indices_layout,
- weights_layout,
- out_layout,
batch_size,
seq_len,
vocab_size,
diff --git a/problems/p22/op/layernorm_linear.mojo b/problems/p22/op/layernorm_linear.mojo
index 8519c015..dc659b12 100644
--- a/problems/p22/op/layernorm_linear.mojo
+++ b/problems/p22/op/layernorm_linear.mojo
@@ -2,7 +2,9 @@ from std.math import sqrt
from std.gpu import thread_idx, block_idx, block_dim, barrier
from std.gpu.memory import AddressSpace, async_copy_wait_all
from std.os.atomic import Atomic
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major, TensorLayout
+from layout.tile_tensor import stack_allocation
from layout.layout_tensor import copy_dram_to_sram_async
import compiler
from std.runtime.asyncrt import DeviceContextPtr
@@ -20,17 +22,17 @@ comptime dtype = DType.float32
# ANCHOR: matmul_idiomatic_tiled
# Idiomatic tiled matmul from p19.mojo
def matmul_idiomatic_tiled[
- a_layout: Layout,
- b_layout: Layout,
- out_layout: Layout,
rows: Int,
cols: Int,
inner: Int,
+ OutLayout: TensorLayout,
+ ALayout: TensorLayout,
+ BLayout: TensorLayout,
dtype: DType = DType.float32,
](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, a_layout, MutAnyOrigin],
- b: LayoutTensor[dtype, b_layout, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, ALayout, MutAnyOrigin],
+ b: TileTensor[mut=False, dtype, BLayout, MutAnyOrigin],
):
"""Idiomatic tiled matrix multiplication from p19."""
var local_row = thread_idx.y
@@ -42,26 +44,23 @@ def matmul_idiomatic_tiled[
var out_tile = output.tile[MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY](
block_idx.y, block_idx.x
)
- var a_shared = LayoutTensor[
- dtype,
- Layout.row_major(MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var b_shared = LayoutTensor[
- dtype,
- Layout.row_major(MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var acc: output.element_type = 0
-
- comptime load_a_layout = Layout.row_major(
+ comptime shared_layout = row_major[
MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY
- ) # Coalesced loading
- comptime load_b_layout = Layout.row_major(
+ ]()
+ var a_shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](shared_layout)
+ var b_shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](shared_layout)
+ var acc: output.ElementType = 0
+
+ comptime load_a_layout = row_major[
MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY
- ) # Coalesced loading
+ ]() # Coalesced loading
+ comptime load_b_layout = row_major[
+ MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY
+ ]() # Coalesced loading
comptime for idx in range(
(inner + MATMUL_BLOCK_DIM_XY - 1) // MATMUL_BLOCK_DIM_XY
@@ -112,17 +111,17 @@ def matmul_idiomatic_tiled[
# ANCHOR: layernorm_kernel
def layernorm_kernel[
- input_layout: Layout,
- ln_params_layout: Layout,
- output_layout: Layout,
batch_size: Int,
seq_len: Int,
hidden_dim: Int,
+ OutputLayout: TensorLayout,
+ InputLayout: TensorLayout,
+ LnParamsLayout: TensorLayout,
](
- output: LayoutTensor[dtype, output_layout, MutAnyOrigin],
- input: LayoutTensor[dtype, input_layout, ImmutAnyOrigin],
- ln_weight: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
- ln_bias: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutputLayout, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, InputLayout, ImmutAnyOrigin],
+ ln_weight: TileTensor[mut=False, dtype, LnParamsLayout, ImmutAnyOrigin],
+ ln_bias: TileTensor[mut=False, dtype, LnParamsLayout, ImmutAnyOrigin],
):
var batch_idx = block_idx.x
var seq_idx = block_idx.y
@@ -147,24 +146,24 @@ def layernorm_kernel[
# ANCHOR: transpose_kernel
def transpose_kernel[
- layout_in: Layout,
- layout_out: Layout,
rows: Int,
cols: Int,
+ OutLayout: TensorLayout,
+ InLayout: TensorLayout,
dtype: DType = DType.float32,
](
- output: LayoutTensor[dtype, layout_out, MutAnyOrigin],
- inp: LayoutTensor[dtype, layout_in, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ inp: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
):
"""Transpose matrix using shared memory tiling for coalesced access.
We will learn more about coalesced access in the next part.
"""
- var shared_tile = LayoutTensor[
- dtype,
- Layout.row_major(TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ comptime shared_layout = row_major[
+ TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY
+ ]()
+ var shared_tile = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](shared_layout)
var local_row = thread_idx.y
var local_col = thread_idx.x
@@ -191,16 +190,16 @@ def transpose_kernel[
# ANCHOR: add_bias_kernel
def add_bias_kernel[
- input_layout: Layout,
- bias_layout: Layout,
- output_layout: Layout,
batch_size: Int,
seq_len: Int,
output_dim: Int,
+ OutputLayout: TensorLayout,
+ InputLayout: TensorLayout,
+ BiasLayout: TensorLayout,
](
- output: LayoutTensor[dtype, output_layout, MutAnyOrigin],
- input: LayoutTensor[dtype, input_layout, MutAnyOrigin],
- bias: LayoutTensor[dtype, bias_layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutputLayout, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, InputLayout, MutAnyOrigin],
+ bias: TileTensor[mut=False, dtype, BiasLayout, ImmutAnyOrigin],
):
"""Simple bias addition."""
var batch_idx = block_idx.x
@@ -220,22 +219,22 @@ def add_bias_kernel[
# ANCHOR: minimal_fused_forward_kernel
def minimal_fused_kernel[
- input_layout: Layout,
- ln_params_layout: Layout,
- weight_layout: Layout,
- bias_layout: Layout,
- output_layout: Layout,
batch_size: Int,
seq_len: Int,
hidden_dim: Int,
output_dim: Int,
+ OutputLayout: TensorLayout,
+ InputLayout: TensorLayout,
+ LnParamsLayout: TensorLayout,
+ WeightLayout: TensorLayout,
+ BiasLayout: TensorLayout,
](
- output: LayoutTensor[dtype, output_layout, MutAnyOrigin],
- input: LayoutTensor[dtype, input_layout, ImmutAnyOrigin],
- ln_weight: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
- ln_bias: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
- linear_weight: LayoutTensor[dtype, weight_layout, ImmutAnyOrigin],
- linear_bias: LayoutTensor[dtype, bias_layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutputLayout, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, InputLayout, ImmutAnyOrigin],
+ ln_weight: TileTensor[mut=False, dtype, LnParamsLayout, ImmutAnyOrigin],
+ ln_bias: TileTensor[mut=False, dtype, LnParamsLayout, ImmutAnyOrigin],
+ linear_weight: TileTensor[mut=False, dtype, WeightLayout, ImmutAnyOrigin],
+ linear_bias: TileTensor[mut=False, dtype, BiasLayout, ImmutAnyOrigin],
):
"""Minimal fused kernel - one thread per sequence position to avoid redundancy.
"""
@@ -261,30 +260,32 @@ def minimal_fused_kernel[
# ANCHOR: minimal_fused_backward_kernel
def minimal_fused_kernel_backward[
- grad_output_layout: Layout,
- input_layout: Layout,
- ln_params_layout: Layout,
- weight_layout: Layout,
- grad_input_layout: Layout,
- grad_ln_weight_layout: Layout,
- grad_ln_bias_layout: Layout,
- grad_weight_layout: Layout,
- grad_bias_layout: Layout,
batch_size: Int,
seq_len: Int,
hidden_dim: Int,
output_dim: Int,
+ GradInputLayout: TensorLayout,
+ GradLnWeightLayout: TensorLayout,
+ GradLnBiasLayout: TensorLayout,
+ GradWeightLayout: TensorLayout,
+ GradBiasLayout: TensorLayout,
+ GradOutputLayout: TensorLayout,
+ InputLayout: TensorLayout,
+ LnParamsLayout: TensorLayout,
+ WeightLayout: TensorLayout,
](
- grad_input: LayoutTensor[dtype, grad_input_layout, MutAnyOrigin],
- grad_ln_weight: LayoutTensor[dtype, grad_ln_weight_layout, MutAnyOrigin],
- grad_ln_bias: LayoutTensor[dtype, grad_ln_bias_layout, MutAnyOrigin],
- grad_weight: LayoutTensor[dtype, grad_weight_layout, MutAnyOrigin],
- grad_bias: LayoutTensor[dtype, grad_bias_layout, MutAnyOrigin],
- grad_output: LayoutTensor[dtype, grad_output_layout, ImmutAnyOrigin],
- input: LayoutTensor[dtype, input_layout, ImmutAnyOrigin],
- ln_weight: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
- ln_bias: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
- linear_weight: LayoutTensor[dtype, weight_layout, ImmutAnyOrigin],
+ grad_input: TileTensor[mut=True, dtype, GradInputLayout, MutAnyOrigin],
+ grad_ln_weight: TileTensor[
+ mut=True, dtype, GradLnWeightLayout, MutAnyOrigin
+ ],
+ grad_ln_bias: TileTensor[mut=True, dtype, GradLnBiasLayout, MutAnyOrigin],
+ grad_weight: TileTensor[mut=True, dtype, GradWeightLayout, MutAnyOrigin],
+ grad_bias: TileTensor[mut=True, dtype, GradBiasLayout, MutAnyOrigin],
+ grad_output: TileTensor[mut=False, dtype, GradOutputLayout, ImmutAnyOrigin],
+ input: TileTensor[mut=False, dtype, InputLayout, ImmutAnyOrigin],
+ ln_weight: TileTensor[mut=False, dtype, LnParamsLayout, ImmutAnyOrigin],
+ ln_bias: TileTensor[mut=False, dtype, LnParamsLayout, ImmutAnyOrigin],
+ linear_weight: TileTensor[mut=False, dtype, WeightLayout, ImmutAnyOrigin],
):
"""Fused backward kernel using atomic operations for safe gradient accumulation.
"""
@@ -372,25 +373,30 @@ struct LayerNormLinearCustomOp:
comptime weight_layout = linear_weight.static_spec.to_layout()
comptime bias_layout = linear_bias.static_spec.to_layout()
comptime output_layout = output.static_spec.to_layout()
+ comptime InputLayout = type_of(input_layout)
+ comptime LnParamsLayout = type_of(ln_params_layout)
+ comptime WeightLayout = type_of(weight_layout)
+ comptime BiasLayout = type_of(bias_layout)
+ comptime OutputLayout = type_of(output_layout)
# Note: rebind is necessary now but it shouldn't be!
var output_tensor = rebind[
- LayoutTensor[dtype, output_layout, MutAnyOrigin]
+ TileTensor[mut=True, dtype, OutputLayout, MutAnyOrigin]
](output.to_layout_tensor())
var input_tensor = rebind[
- LayoutTensor[dtype, input_layout, ImmutAnyOrigin]
+ TileTensor[mut=False, dtype, InputLayout, ImmutAnyOrigin]
](input.to_layout_tensor())
var ln_weight_tensor = rebind[
- LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin]
+ TileTensor[mut=False, dtype, LnParamsLayout, ImmutAnyOrigin]
](ln_weight.to_layout_tensor())
var ln_bias_tensor = rebind[
- LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin]
+ TileTensor[mut=False, dtype, LnParamsLayout, ImmutAnyOrigin]
](ln_bias.to_layout_tensor())
var linear_weight_tensor = rebind[
- LayoutTensor[dtype, weight_layout, ImmutAnyOrigin]
+ TileTensor[mut=False, dtype, WeightLayout, ImmutAnyOrigin]
](linear_weight.to_layout_tensor())
var linear_bias_tensor = rebind[
- LayoutTensor[dtype, bias_layout, ImmutAnyOrigin]
+ TileTensor[mut=False, dtype, BiasLayout, ImmutAnyOrigin]
](linear_bias.to_layout_tensor())
comptime if target == "gpu":
@@ -400,11 +406,6 @@ struct LayerNormLinearCustomOp:
comptime if algorithm == "fused":
# fused case - one thread per sequence position
comptime kernel = minimal_fused_kernel[
- input_layout,
- ln_params_layout,
- weight_layout,
- bias_layout,
- output_layout,
batch_size,
seq_len,
hidden_dim,
@@ -426,15 +427,12 @@ struct LayerNormLinearCustomOp:
var normalized_buffer = gpu_ctx.enqueue_create_buffer[dtype](
batch_size * seq_len * hidden_dim
)
- var normalized_tensor = LayoutTensor[
- dtype, input_layout, MutAnyOrigin
- ](normalized_buffer)
+ var normalized_tensor = TileTensor[
+ mut=True, dtype, InputLayout, MutAnyOrigin
+ ](normalized_buffer, input_layout)
# Step 1: LayerNorm kernel
comptime kernel = layernorm_kernel[
- input_layout,
- ln_params_layout,
- input_layout,
batch_size,
seq_len,
hidden_dim,
@@ -457,19 +455,26 @@ struct LayerNormLinearCustomOp:
var matmul_buffer = gpu_ctx.enqueue_create_buffer[dtype](
batch_size * seq_len * output_dim
)
- var matmul_tensor = LayoutTensor[
- dtype, output_layout, MutAnyOrigin
- ](matmul_buffer)
+ var matmul_tensor = TileTensor[
+ mut=True, dtype, OutputLayout, MutAnyOrigin
+ ](matmul_buffer, output_layout)
# Create transposed weight matrix: [output_dim, hidden_dim] -> [hidden_dim, output_dim]
var transposed_weight_buffer = gpu_ctx.enqueue_create_buffer[
dtype
](hidden_dim * output_dim)
- var transposed_weight_tensor = LayoutTensor[
+ comptime transposed_weight_layout = row_major[
+ hidden_dim, output_dim
+ ]()
+ comptime TransposedWeightLayout = type_of(
+ transposed_weight_layout
+ )
+ var transposed_weight_tensor = TileTensor[
+ mut=True,
dtype,
- Layout.row_major(hidden_dim, output_dim),
+ TransposedWeightLayout,
MutAnyOrigin,
- ](transposed_weight_buffer)
+ ](transposed_weight_buffer, transposed_weight_layout)
# Transpose the weight matrix
var transpose_blocks_x = (
@@ -479,8 +484,6 @@ struct LayerNormLinearCustomOp:
output_dim + TRANSPOSE_BLOCK_DIM_XY - 1
) // TRANSPOSE_BLOCK_DIM_XY
comptime kernel2 = transpose_kernel[
- weight_layout,
- transposed_weight_tensor.layout,
output_dim,
hidden_dim,
]
@@ -492,17 +495,20 @@ struct LayerNormLinearCustomOp:
)
# Reshape tensors for matmul: [batch*seq, hidden] @ [hidden, output] -> [batch*seq, output]
- var flat_normalized = normalized_tensor.reshape[
- Layout.row_major(batch_size * seq_len, hidden_dim)
+ comptime flat_normalized_layout = row_major[
+ batch_size * seq_len, hidden_dim
]()
- var flat_matmul = matmul_tensor.reshape[
- Layout.row_major(batch_size * seq_len, output_dim)
+ comptime FlatNormalizedLayout = type_of(flat_normalized_layout)
+ comptime flat_matmul_layout = row_major[
+ batch_size * seq_len, output_dim
]()
+ comptime FlatMatmulLayout = type_of(flat_matmul_layout)
+ var flat_normalized = normalized_tensor.reshape[
+ flat_normalized_layout
+ ]()
+ var flat_matmul = matmul_tensor.reshape[flat_matmul_layout]()
comptime kernel3 = matmul_idiomatic_tiled[
- flat_normalized.layout,
- transposed_weight_tensor.layout,
- flat_matmul.layout,
batch_size * seq_len,
output_dim,
hidden_dim,
@@ -516,14 +522,15 @@ struct LayerNormLinearCustomOp:
)
# Step 3: Add bias - reshape matmul result back to 3D for bias addition
+ comptime reshaped_matmul_layout = row_major[
+ batch_size, seq_len, output_dim
+ ]()
+ comptime ReshapedMatmulLayout = type_of(reshaped_matmul_layout)
var reshaped_matmul = matmul_tensor.reshape[
- Layout.row_major(batch_size, seq_len, output_dim)
+ reshaped_matmul_layout
]()
comptime kernel4 = add_bias_kernel[
- reshaped_matmul.layout,
- bias_layout,
- output_layout,
batch_size,
seq_len,
output_dim,
@@ -612,36 +619,45 @@ struct LayerNormLinearBackwardCustomOp:
comptime grad_ln_bias_layout = grad_ln_bias.static_spec.to_layout()
comptime grad_weight_layout = grad_weight.static_spec.to_layout()
comptime grad_bias_layout = grad_bias.static_spec.to_layout()
+ comptime GradOutputLayout = type_of(grad_output_layout)
+ comptime InputLayout = type_of(input_layout)
+ comptime LnParamsLayout = type_of(ln_params_layout)
+ comptime WeightLayout = type_of(weight_layout)
+ comptime GradInputLayout = type_of(grad_input_layout)
+ comptime GradLnWeightLayout = type_of(grad_ln_weight_layout)
+ comptime GradLnBiasLayout = type_of(grad_ln_bias_layout)
+ comptime GradWeightLayout = type_of(grad_weight_layout)
+ comptime GradBiasLayout = type_of(grad_bias_layout)
var grad_input_tensor = rebind[
- LayoutTensor[dtype, grad_input_layout, MutAnyOrigin]
+ TileTensor[mut=True, dtype, GradInputLayout, MutAnyOrigin]
](grad_input.to_layout_tensor())
var grad_ln_weight_tensor = rebind[
- LayoutTensor[dtype, grad_ln_weight_layout, MutAnyOrigin]
+ TileTensor[mut=True, dtype, GradLnWeightLayout, MutAnyOrigin]
](grad_ln_weight.to_layout_tensor())
var grad_ln_bias_tensor = rebind[
- LayoutTensor[dtype, grad_ln_bias_layout, MutAnyOrigin]
+ TileTensor[mut=True, dtype, GradLnBiasLayout, MutAnyOrigin]
](grad_ln_bias.to_layout_tensor())
var grad_weight_tensor = rebind[
- LayoutTensor[dtype, grad_weight_layout, MutAnyOrigin]
+ TileTensor[mut=True, dtype, GradWeightLayout, MutAnyOrigin]
](grad_weight.to_layout_tensor())
var grad_bias_tensor = rebind[
- LayoutTensor[dtype, grad_bias_layout, MutAnyOrigin]
+ TileTensor[mut=True, dtype, GradBiasLayout, MutAnyOrigin]
](grad_bias.to_layout_tensor())
var grad_output_tensor = rebind[
- LayoutTensor[dtype, grad_output_layout, ImmutAnyOrigin]
+ TileTensor[mut=False, dtype, GradOutputLayout, ImmutAnyOrigin]
](grad_output.to_layout_tensor())
var input_tensor = rebind[
- LayoutTensor[dtype, input_layout, ImmutAnyOrigin]
+ TileTensor[mut=False, dtype, InputLayout, ImmutAnyOrigin]
](input.to_layout_tensor())
var ln_weight_tensor = rebind[
- LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin]
+ TileTensor[mut=False, dtype, LnParamsLayout, ImmutAnyOrigin]
](ln_weight.to_layout_tensor())
var ln_bias_tensor = rebind[
- LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin]
+ TileTensor[mut=False, dtype, LnParamsLayout, ImmutAnyOrigin]
](ln_bias.to_layout_tensor())
var linear_weight_tensor = rebind[
- LayoutTensor[dtype, weight_layout, ImmutAnyOrigin]
+ TileTensor[mut=False, dtype, WeightLayout, ImmutAnyOrigin]
](linear_weight.to_layout_tensor())
comptime if target == "gpu":
@@ -649,15 +665,6 @@ struct LayerNormLinearBackwardCustomOp:
# Launch backward kernel
comptime kernel = minimal_fused_kernel_backward[
- grad_output_layout,
- input_layout,
- ln_params_layout,
- weight_layout,
- grad_input_layout,
- grad_ln_weight_layout,
- grad_ln_bias_layout,
- grad_weight_layout,
- grad_bias_layout,
batch_size,
seq_len,
hidden_dim,
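
The custom-op hunks above all follow one migration idiom: each `Layout.row_major(...)` becomes a `comptime` layout *value* built with `row_major[...]()`, its *type* is captured with `type_of`, and that type (a `TensorLayout`) parameterizes both the kernel signature and the `TileTensor` arguments — so explicit layout parameters disappear from kernel instantiations. Reshapes follow the same rule: the target layout is hoisted into a `comptime` value (e.g. `flat_normalized_layout`) and passed to `reshape[...]()` instead of an inline `Layout.row_major(...)`. A condensed sketch of the idiom, using the API exactly as it appears in this diff (the kernel name and size are illustrative only):

```mojo
from layout import TileTensor
from layout.tile_layout import row_major, TensorLayout

comptime my_layout = row_major[1024]()      # layout is now a comptime value
comptime MyLayoutType = type_of(my_layout)  # its type parameterizes TileTensor

def my_kernel[
    LayoutT: TensorLayout,  # inferred from the TileTensor arguments
    size: Int,
](
    output: TileTensor[mut=True, DType.float32, LayoutT, MutAnyOrigin],
    input: TileTensor[mut=False, DType.float32, LayoutT, ImmutAnyOrigin],
):
    pass  # kernel body elided in this sketch
```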
diff --git a/problems/p23/p23.mojo b/problems/p23/p23.mojo
index 69843290..5b390bb4 100644
--- a/problems/p23/p23.mojo
+++ b/problems/p23/p23.mojo
@@ -1,7 +1,9 @@
from std.gpu import thread_idx, block_dim, block_idx, barrier
from std.gpu.host import DeviceContext
from std.gpu.host.compile import get_gpu_target
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major, TensorLayout
+from layout.tile_tensor import stack_allocation
from std.utils import IndexList
from std.math import log2
from std.algorithm.functional import elementwise, vectorize
@@ -12,17 +14,18 @@ from std.benchmark import Bench, BenchConfig, Bencher, BenchId, keep
# ANCHOR: elementwise_add
comptime SIZE = 1024
comptime rank = 1
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)
comptime dtype = DType.float32
comptime SIMD_WIDTH = simd_width_of[dtype, target=get_gpu_target()]()
def elementwise_add[
- layout: Layout, dtype: DType, simd_width: Int, rank: Int, size: Int
+ LayoutT: TensorLayout, dtype: DType, simd_width: Int, rank: Int, size: Int
](
- output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin],
- a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
- b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutT, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
+ b: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
ctx: DeviceContext,
) raises:
@parameter
@@ -34,7 +37,7 @@ def elementwise_add[
print("idx:", idx)
# FILL IN (2 to 4 lines)
- elementwise[add, SIMD_WIDTH, target="gpu"](a.size(), ctx)
+ elementwise[add, SIMD_WIDTH, target="gpu"](size, ctx)
# ANCHOR_END: elementwise_add
@@ -45,16 +48,16 @@ comptime TILE_SIZE = 32
def tiled_elementwise_add[
- layout: Layout,
+ LayoutT: TensorLayout,
dtype: DType,
simd_width: Int,
rank: Int,
size: Int,
tile_size: Int,
](
- output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin],
- a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
- b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutT, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
+ b: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
ctx: DeviceContext,
) raises:
@parameter
@@ -79,7 +82,7 @@ def tiled_elementwise_add[
# ANCHOR: manual_vectorized_tiled_elementwise_add
def manual_vectorized_tiled_elementwise_add[
- layout: Layout,
+ LayoutT: TensorLayout,
dtype: DType,
simd_width: Int,
num_threads_per_tile: Int,
@@ -87,9 +90,9 @@ def manual_vectorized_tiled_elementwise_add[
size: Int,
tile_size: Int,
](
- output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin],
- a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
- b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutT, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
+ b: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
ctx: DeviceContext,
) raises:
# Each tile contains tile_size groups of simd_width elements
@@ -120,7 +123,7 @@ def manual_vectorized_tiled_elementwise_add[
# ANCHOR: vectorize_within_tiles_elementwise_add
def vectorize_within_tiles_elementwise_add[
- layout: Layout,
+ LayoutT: TensorLayout,
dtype: DType,
simd_width: Int,
num_threads_per_tile: Int,
@@ -128,9 +131,9 @@ def vectorize_within_tiles_elementwise_add[
size: Int,
tile_size: Int,
](
- output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin],
- a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
- b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutT, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
+ b: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
ctx: DeviceContext,
) raises:
# Each tile contains tile_size elements (not SIMD groups)
@@ -171,7 +174,8 @@ def benchmark_elementwise_parameterized[
test_size: Int, tile_size: Int
](mut b: Bencher) raises:
var bench_ctx = DeviceContext()
- comptime layout = Layout.row_major(test_size)
+ comptime bench_layout = row_major[test_size]()
+ comptime BenchLayoutType = type_of(bench_layout)
var out = bench_ctx.enqueue_create_buffer[dtype](test_size)
out.enqueue_fill(0)
var a = bench_ctx.enqueue_create_buffer[dtype](test_size)
@@ -184,20 +188,20 @@ def benchmark_elementwise_parameterized[
a_host[i] = Scalar[dtype](2 * i)
b_host[i] = Scalar[dtype](2 * i + 1)
- var a_tensor = LayoutTensor[mut=False, dtype, layout, MutAnyOrigin](
- a.unsafe_ptr()
- )
- var b_tensor = LayoutTensor[mut=False, dtype, layout, MutAnyOrigin](
- b_buf.unsafe_ptr()
- )
- var out_tensor = LayoutTensor[mut=True, dtype, layout, MutAnyOrigin](
- out.unsafe_ptr()
+ var a_tensor = TileTensor[
+ mut=False, dtype, BenchLayoutType, ImmutAnyOrigin
+ ](a, bench_layout)
+ var b_tensor = TileTensor[
+ mut=False, dtype, BenchLayoutType, ImmutAnyOrigin
+ ](b_buf, bench_layout)
+ var out_tensor = TileTensor[mut=True, dtype, BenchLayoutType, MutAnyOrigin](
+ out, bench_layout
)
@parameter
@always_inline
def elementwise_workflow(ctx: DeviceContext) raises:
- elementwise_add[layout, dtype, SIMD_WIDTH, rank, test_size](
+ elementwise_add[BenchLayoutType, dtype, SIMD_WIDTH, rank, test_size](
out_tensor, a_tensor, b_tensor, ctx
)
@@ -212,7 +216,8 @@ def benchmark_tiled_parameterized[
test_size: Int, tile_size: Int
](mut b: Bencher) raises:
var bench_ctx = DeviceContext()
- comptime layout = Layout.row_major(test_size)
+ comptime bench_layout = row_major[test_size]()
+ comptime BenchLayoutType = type_of(bench_layout)
var out = bench_ctx.enqueue_create_buffer[dtype](test_size)
out.enqueue_fill(0)
var a = bench_ctx.enqueue_create_buffer[dtype](test_size)
@@ -225,15 +230,21 @@ def benchmark_tiled_parameterized[
a_host[i] = Scalar[dtype](2 * i)
b_host[i] = Scalar[dtype](2 * i + 1)
- var a_tensor = LayoutTensor[mut=False, dtype, layout](a.unsafe_ptr())
- var b_tensor = LayoutTensor[mut=False, dtype, layout](b_buf.unsafe_ptr())
- var out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr())
+ var a_tensor = TileTensor[
+ mut=False, dtype, BenchLayoutType, ImmutAnyOrigin
+ ](a, bench_layout)
+ var b_tensor = TileTensor[
+ mut=False, dtype, BenchLayoutType, ImmutAnyOrigin
+ ](b_buf, bench_layout)
+ var out_tensor = TileTensor[mut=True, dtype, BenchLayoutType, MutAnyOrigin](
+ out, bench_layout
+ )
@parameter
@always_inline
def tiled_workflow(ctx: DeviceContext) raises:
tiled_elementwise_add[
- layout, dtype, SIMD_WIDTH, rank, test_size, tile_size
+ BenchLayoutType, dtype, SIMD_WIDTH, rank, test_size, tile_size
](out_tensor, a_tensor, b_tensor, ctx)
b.iter_custom[tiled_workflow](bench_ctx)
@@ -247,7 +258,8 @@ def benchmark_manual_vectorized_parameterized[
test_size: Int, tile_size: Int
](mut b: Bencher) raises:
var bench_ctx = DeviceContext()
- comptime layout = Layout.row_major(test_size)
+ comptime bench_layout = row_major[test_size]()
+ comptime BenchLayoutType = type_of(bench_layout)
var out = bench_ctx.enqueue_create_buffer[dtype](test_size)
out.enqueue_fill(0)
var a = bench_ctx.enqueue_create_buffer[dtype](test_size)
@@ -260,15 +272,21 @@ def benchmark_manual_vectorized_parameterized[
a_host[i] = Scalar[dtype](2 * i)
b_host[i] = Scalar[dtype](2 * i + 1)
- var a_tensor = LayoutTensor[mut=False, dtype, layout](a.unsafe_ptr())
- var b_tensor = LayoutTensor[mut=False, dtype, layout](b_buf.unsafe_ptr())
- var out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr())
+ var a_tensor = TileTensor[
+ mut=False, dtype, BenchLayoutType, ImmutAnyOrigin
+ ](a, bench_layout)
+ var b_tensor = TileTensor[
+ mut=False, dtype, BenchLayoutType, ImmutAnyOrigin
+ ](b_buf, bench_layout)
+ var out_tensor = TileTensor[mut=True, dtype, BenchLayoutType, MutAnyOrigin](
+ out, bench_layout
+ )
@parameter
@always_inline
def manual_vectorized_workflow(ctx: DeviceContext) raises:
manual_vectorized_tiled_elementwise_add[
- layout, dtype, SIMD_WIDTH, 1, rank, test_size, tile_size
+ BenchLayoutType, dtype, SIMD_WIDTH, 1, rank, test_size, tile_size
](out_tensor, a_tensor, b_tensor, ctx)
b.iter_custom[manual_vectorized_workflow](bench_ctx)
@@ -282,7 +300,8 @@ def benchmark_vectorized_parameterized[
test_size: Int, tile_size: Int
](mut b: Bencher) raises:
var bench_ctx = DeviceContext()
- comptime layout = Layout.row_major(test_size)
+ comptime bench_layout = row_major[test_size]()
+ comptime BenchLayoutType = type_of(bench_layout)
var out = bench_ctx.enqueue_create_buffer[dtype](test_size)
out.enqueue_fill(0)
var a = bench_ctx.enqueue_create_buffer[dtype](test_size)
@@ -295,15 +314,21 @@ def benchmark_vectorized_parameterized[
a_host[i] = Scalar[dtype](2 * i)
b_host[i] = Scalar[dtype](2 * i + 1)
- var a_tensor = LayoutTensor[mut=False, dtype, layout](a.unsafe_ptr())
- var b_tensor = LayoutTensor[mut=False, dtype, layout](b_buf.unsafe_ptr())
- var out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr())
+ var a_tensor = TileTensor[
+ mut=False, dtype, BenchLayoutType, ImmutAnyOrigin
+ ](a, bench_layout)
+ var b_tensor = TileTensor[
+ mut=False, dtype, BenchLayoutType, ImmutAnyOrigin
+ ](b_buf, bench_layout)
+ var out_tensor = TileTensor[mut=True, dtype, BenchLayoutType, MutAnyOrigin](
+ out, bench_layout
+ )
@parameter
@always_inline
def vectorized_workflow(ctx: DeviceContext) raises:
vectorize_within_tiles_elementwise_add[
- layout, dtype, SIMD_WIDTH, 1, rank, test_size, tile_size
+ BenchLayoutType, dtype, SIMD_WIDTH, 1, rank, test_size, tile_size
](out_tensor, a_tensor, b_tensor, ctx)
b.iter_custom[vectorized_workflow](bench_ctx)
@@ -328,8 +353,12 @@ def main() raises:
b_host[i] = Scalar[dtype](2 * i + 1)
expected[i] = a_host[i] + b_host[i]
- var a_tensor = LayoutTensor[mut=False, dtype, layout](a.unsafe_ptr())
- var b_tensor = LayoutTensor[mut=False, dtype, layout](b.unsafe_ptr())
+ var a_tensor = TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin](
+ a, layout
+ )
+ var b_tensor = TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin](
+ b, layout
+ )
ctx.synchronize()
@@ -337,8 +366,10 @@ def main() raises:
print("simd_width:", SIMD_WIDTH)
if argv()[1] == "--elementwise":
- out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr())
- elementwise_add[layout, dtype, SIMD_WIDTH, rank, SIZE](
+ var out_tensor = TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin](
+ out, layout
+ )
+ elementwise_add[LayoutType, dtype, SIMD_WIDTH, rank, SIZE](
out_tensor, a_tensor, b_tensor, ctx
)
@@ -350,11 +381,13 @@ def main() raises:
print("Puzzle 23 complete โ
")
elif argv()[1] == "--tiled":
- out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr())
- print("tile size:", TILE_SIZE)
- tiled_elementwise_add[layout, dtype, SIMD_WIDTH, rank, SIZE, TILE_SIZE](
- out_tensor, a_tensor, b_tensor, ctx
+ var out_tensor = TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin](
+ out, layout
)
+ print("tile size:", TILE_SIZE)
+ tiled_elementwise_add[
+ LayoutType, dtype, SIMD_WIDTH, rank, SIZE, TILE_SIZE
+ ](out_tensor, a_tensor, b_tensor, ctx)
with out.map_to_host() as out_host:
print("out:", out_host)
@@ -364,10 +397,12 @@ def main() raises:
print("Puzzle 23 complete โ
")
elif argv()[1] == "--manual-vectorized":
- out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr())
+ var out_tensor = TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin](
+ out, layout
+ )
print("tile size:", TILE_SIZE)
manual_vectorized_tiled_elementwise_add[
- layout, dtype, SIMD_WIDTH, 1, rank, SIZE, TILE_SIZE
+ LayoutType, dtype, SIMD_WIDTH, 1, rank, SIZE, TILE_SIZE
](out_tensor, a_tensor, b_tensor, ctx)
with out.map_to_host() as out_host:
@@ -378,10 +413,12 @@ def main() raises:
print("Puzzle 23 complete โ
")
elif argv()[1] == "--vectorized":
- out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr())
+ var out_tensor = TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin](
+ out, layout
+ )
print("tile size:", TILE_SIZE)
vectorize_within_tiles_elementwise_add[
- layout, dtype, SIMD_WIDTH, 1, rank, SIZE, TILE_SIZE
+ LayoutType, dtype, SIMD_WIDTH, 1, rank, SIZE, TILE_SIZE
](out_tensor, a_tensor, b_tensor, ctx)
with out.map_to_host() as out_host:
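
Note how host-side tensors are now built in p23: `LayoutTensor` wrapped a raw pointer, while `TileTensor` takes the buffer itself plus the layout value, and read-only views spell out `ImmutAnyOrigin`. Side by side, with both forms taken from the hunks above:

```mojo
# Before: raw pointer, origin often left implicit
var a_tensor = LayoutTensor[mut=False, dtype, layout](a.unsafe_ptr())

# After: buffer + layout value, immutable origin made explicit
var a_tensor = TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin](
    a, layout
)
```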
diff --git a/problems/p24/p24.mojo b/problems/p24/p24.mojo
index 10ae8958..ed901d15 100644
--- a/problems/p24/p24.mojo
+++ b/problems/p24/p24.mojo
@@ -4,7 +4,9 @@ from std.gpu.host import DeviceContext, HostBuffer, DeviceBuffer
from std.gpu.memory import AddressSpace
from std.gpu.primitives.warp import sum as warp_sum, WARP_SIZE
from std.algorithm.functional import elementwise
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from std.utils import IndexList
from std.sys import argv, simd_width_of, align_of
from std.testing import assert_equal
@@ -27,26 +29,25 @@ comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (WARP_SIZE, 1)
comptime dtype = DType.float32
comptime SIMD_WIDTH = simd_width_of[dtype]()
-comptime in_layout = Layout.row_major(SIZE)
-comptime out_layout = Layout.row_major(1)
+comptime in_layout = row_major[SIZE]()
+comptime InLayoutType = type_of(in_layout)
+comptime out_layout = row_major[1]()
+comptime OutLayoutType = type_of(out_layout)
def traditional_dot_product_p12_style[
- in_layout: Layout, out_layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
- b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayoutType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
+ b: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
):
"""
This is the complex approach from p12_layout_tensor.mojo - kept for comparison.
"""
- var shared = LayoutTensor[
- dtype,
- Layout.row_major(WARP_SIZE),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[WARP_SIZE]())
var global_i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
@@ -73,11 +74,11 @@ def traditional_dot_product_p12_style[
# ANCHOR: simple_warp_kernel
def simple_warp_dot_product[
- in_layout: Layout, out_layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
- b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayoutType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
+ b: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
):
var global_i = block_dim.x * block_idx.x + thread_idx.x
# FILL IN (6 lines at most)
@@ -88,16 +89,14 @@ def simple_warp_dot_product[
# ANCHOR: functional_warp_approach
def functional_warp_dot_product[
- layout: Layout,
- out_layout: Layout,
dtype: DType,
simd_width: Int,
rank: Int,
size: Int,
](
- output: LayoutTensor[mut=True, dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
- b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayoutType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, InLayoutType, MutAnyOrigin],
+ b: TileTensor[mut=False, dtype, InLayoutType, MutAnyOrigin],
ctx: DeviceContext,
) raises:
@parameter
@@ -162,8 +161,10 @@ def benchmark_simple_warp_parameterized[
test_size: Int
](mut bencher: Bencher) raises:
comptime n_warps = test_size // WARP_SIZE
- comptime in_layout = Layout.row_major(test_size)
- comptime out_layout = Layout.row_major(n_warps)
+ comptime bench_in_layout = row_major[test_size]()
+ comptime BenchInLayoutType = type_of(bench_in_layout)
+ comptime bench_out_layout = row_major[n_warps]()
+ comptime BenchOutLayoutType = type_of(bench_out_layout)
comptime n_threads = WARP_SIZE
comptime n_blocks = (ceildiv(test_size, n_threads), 1)
@@ -182,16 +183,18 @@ def benchmark_simple_warp_parameterized[
rand_int[dtype, test_size](b)
expected_output[dtype, n_warps](expected, a, b)
- var a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a)
- var b_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](b)
- var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out)
+ var a_tensor = TileTensor[
+ mut=False, dtype, BenchInLayoutType, ImmutAnyOrigin
+ ](a, bench_in_layout)
+ var b_tensor = TileTensor[
+ mut=False, dtype, BenchInLayoutType, ImmutAnyOrigin
+ ](b, bench_in_layout)
+ var out_tensor = TileTensor(out, bench_out_layout)
@parameter
@always_inline
def traditional_workflow(ctx: DeviceContext) raises:
- comptime kernel = simple_warp_dot_product[
- in_layout, out_layout, test_size
- ]
+ comptime kernel = simple_warp_dot_product[test_size]
ctx.enqueue_function[kernel, kernel](
out_tensor,
a_tensor,
@@ -214,8 +217,10 @@ def benchmark_functional_warp_parameterized[
test_size: Int
](mut bencher: Bencher) raises:
comptime n_warps = test_size // WARP_SIZE
- comptime in_layout = Layout.row_major(test_size)
- comptime out_layout = Layout.row_major(n_warps)
+ comptime bench_in_layout = row_major[test_size]()
+ comptime BenchInLayoutType = type_of(bench_in_layout)
+ comptime bench_out_layout = row_major[n_warps]()
+ comptime BenchOutLayoutType = type_of(bench_out_layout)
var bench_ctx = DeviceContext()
@@ -232,16 +237,20 @@ def benchmark_functional_warp_parameterized[
rand_int[dtype, test_size](b)
expected_output[dtype, n_warps](expected, a, b)
- var a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a)
- var b_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](b)
- var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out)
+ var a_tensor = TileTensor[
+ mut=False, dtype, BenchInLayoutType, ImmutAnyOrigin
+ ](a, bench_in_layout)
+ var b_tensor = TileTensor[
+ mut=False, dtype, BenchInLayoutType, ImmutAnyOrigin
+ ](b, bench_in_layout)
+ var out_tensor = TileTensor(out, bench_out_layout)
@parameter
@always_inline
def functional_warp_workflow(ctx: DeviceContext) raises:
- functional_warp_dot_product[
- in_layout, out_layout, dtype, SIMD_WIDTH, 1, test_size
- ](out_tensor, a_tensor, b_tensor, ctx)
+ functional_warp_dot_product[dtype, SIMD_WIDTH, 1, test_size](
+ out_tensor, a_tensor, b_tensor, ctx
+ )
bencher.iter_custom[functional_warp_workflow](bench_ctx)
check_result[dtype, n_warps](out, expected)
@@ -257,8 +266,10 @@ def benchmark_traditional_parameterized[
test_size: Int
](mut bencher: Bencher) raises:
comptime n_warps = test_size // WARP_SIZE
- comptime in_layout = Layout.row_major(test_size)
- comptime out_layout = Layout.row_major(n_warps)
+ comptime bench_in_layout = row_major[test_size]()
+ comptime BenchInLayoutType = type_of(bench_in_layout)
+ comptime bench_out_layout = row_major[n_warps]()
+ comptime BenchOutLayoutType = type_of(bench_out_layout)
comptime n_blocks = (ceildiv(test_size, WARP_SIZE), 1)
var bench_ctx = DeviceContext()
@@ -276,16 +287,20 @@ def benchmark_traditional_parameterized[
rand_int[dtype, test_size](b)
expected_output[dtype, n_warps](expected, a, b)
- var a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a)
- var b_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](b)
- var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out)
+ var a_tensor = TileTensor[
+ mut=False, dtype, BenchInLayoutType, ImmutAnyOrigin
+ ](a, bench_in_layout)
+ var b_tensor = TileTensor[
+ mut=False, dtype, BenchInLayoutType, ImmutAnyOrigin
+ ](b, bench_in_layout)
+ var out_tensor = TileTensor(out, bench_out_layout)
@parameter
@always_inline
def traditional_workflow(ctx: DeviceContext) raises:
ctx.enqueue_function[
- traditional_dot_product_p12_style[in_layout, out_layout, test_size],
- traditional_dot_product_p12_style[in_layout, out_layout, test_size],
+ traditional_dot_product_p12_style[test_size],
+ traditional_dot_product_p12_style[test_size],
](
out_tensor,
a_tensor,
@@ -318,9 +333,13 @@ def main() raises:
var expected = ctx.enqueue_create_host_buffer[dtype](n_warps)
expected.enqueue_fill(0)
- var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out)
- var a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a)
- var b_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](b)
+ var out_tensor = TileTensor(out, out_layout)
+ var a_tensor = TileTensor[
+ mut=False, dtype, InLayoutType, ImmutAnyOrigin
+ ](a, in_layout)
+ var b_tensor = TileTensor[
+ mut=False, dtype, InLayoutType, ImmutAnyOrigin
+ ](b, in_layout)
with a.map_to_host() as a_host, b.map_to_host() as b_host:
for i in range(SIZE):
@@ -329,12 +348,8 @@ def main() raises:
if argv()[1] == "--traditional":
ctx.enqueue_function[
- traditional_dot_product_p12_style[
- in_layout, out_layout, SIZE
- ],
- traditional_dot_product_p12_style[
- in_layout, out_layout, SIZE
- ],
+ traditional_dot_product_p12_style[SIZE],
+ traditional_dot_product_p12_style[SIZE],
](
out_tensor,
a_tensor,
@@ -344,8 +359,8 @@ def main() raises:
)
elif argv()[1] == "--kernel":
ctx.enqueue_function[
- simple_warp_dot_product[in_layout, out_layout, SIZE],
- simple_warp_dot_product[in_layout, out_layout, SIZE],
+ simple_warp_dot_product[SIZE],
+ simple_warp_dot_product[SIZE],
](
out_tensor,
a_tensor,
@@ -354,9 +369,9 @@ def main() raises:
block_dim=THREADS_PER_BLOCK,
)
elif argv()[1] == "--functional":
- functional_warp_dot_product[
- in_layout, out_layout, dtype, SIMD_WIDTH, 1, SIZE
- ](out_tensor, a_tensor, b_tensor, ctx)
+ functional_warp_dot_product[dtype, SIMD_WIDTH, 1, SIZE](
+ out_tensor, a_tensor, b_tensor, ctx
+ )
expected_output[dtype, n_warps](expected, a, b)
check_result[dtype, n_warps, True](out, expected)
print("Puzzle 24 complete โ
")
diff --git a/problems/p25/p25.mojo b/problems/p25/p25.mojo
index 8aaa6edf..ba29aa7e 100644
--- a/problems/p25/p25.mojo
+++ b/problems/p25/p25.mojo
@@ -1,7 +1,8 @@
from std.gpu import thread_idx, block_idx, block_dim, lane_id
from std.gpu.host import DeviceContext
from std.gpu.primitives.warp import shuffle_down, broadcast, WARP_SIZE
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
from std.sys import argv
from std.testing import assert_equal, assert_almost_equal
@@ -10,14 +11,15 @@ comptime SIZE = WARP_SIZE
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (WARP_SIZE, 1)
comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)
def neighbor_difference[
- layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
"""
Compute finite differences: output[i] = input[i+1] - input[i]
@@ -36,14 +38,15 @@ def neighbor_difference[
comptime SIZE_2 = 64
comptime BLOCKS_PER_GRID_2 = (2, 1)
comptime THREADS_PER_BLOCK_2 = (WARP_SIZE, 1)
-comptime layout_2 = Layout.row_major(SIZE_2)
+comptime layout_2 = row_major[SIZE_2]()
+comptime LayoutType_2 = type_of(layout_2)
def moving_average_3[
- layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType_2, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutType_2, ImmutAnyOrigin],
):
"""
Compute 3-point moving average: output[i] = (input[i] + input[i+1] + input[i+2]) / 3
@@ -61,10 +64,10 @@ def moving_average_3[
# ANCHOR: broadcast_shuffle_coordination
def broadcast_shuffle_coordination[
- layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
"""
Combine broadcast() and shuffle_down() for advanced warp coordination.
@@ -74,7 +77,7 @@ def broadcast_shuffle_coordination[
var global_i = block_dim.x * block_idx.x + thread_idx.x
var lane = Int(lane_id())
if global_i < size:
- var scale_factor: output.element_type = 0.0
+ var scale_factor: output.ElementType = 0.0
# FILL IN (roughly 14 lines)
@@ -84,10 +87,10 @@ def broadcast_shuffle_coordination[
# ANCHOR: basic_broadcast
def basic_broadcast[
- layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
"""
Basic broadcast: Lane 0 computes a block-local value, broadcasts it to all lanes.
@@ -96,7 +99,7 @@ def basic_broadcast[
var global_i = block_dim.x * block_idx.x + thread_idx.x
var lane = Int(lane_id())
if global_i < size:
- var broadcast_value: output.element_type = 0.0
+ var broadcast_value: output.ElementType = 0.0
# FILL IN (roughly 10 lines)
@@ -106,10 +109,10 @@ def basic_broadcast[
# ANCHOR: conditional_broadcast
def conditional_broadcast[
- layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
"""
Conditional broadcast: Lane 0 makes a decision based on block-local data, broadcasts it to all lanes.
@@ -118,7 +121,7 @@ def conditional_broadcast[
var global_i = block_dim.x * block_idx.x + thread_idx.x
var lane = Int(lane_id())
if global_i < size:
- var decision_value: output.element_type = 0.0
+ var decision_value: output.ElementType = 0.0
# FILL IN (roughly 10 lines)
@@ -145,14 +148,12 @@ def test_neighbor_difference() raises:
for i in range(SIZE):
input_host[i] = Scalar[dtype](i * i)
- var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
- input_buf
- )
- var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
- output_buf
- )
+ var input_tensor = TileTensor[
+ mut=False, dtype, LayoutType, ImmutAnyOrigin
+ ](input_buf, layout)
+ var output_tensor = TileTensor(output_buf, layout)
- comptime kernel = neighbor_difference[layout, SIZE]
+ comptime kernel = neighbor_difference[SIZE]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
@@ -193,14 +194,12 @@ def test_moving_average() raises:
for i in range(1, SIZE_2):
input_host[i] = input_host[i - 1] + Scalar[dtype](i + 1)
- var input_tensor = LayoutTensor[dtype, layout_2, ImmutAnyOrigin](
- input_buf
- )
- var output_tensor = LayoutTensor[dtype, layout_2, MutAnyOrigin](
- output_buf
- )
+ var input_tensor = TileTensor[
+ mut=False, dtype, LayoutType_2, ImmutAnyOrigin
+ ](input_buf, layout_2)
+ var output_tensor = TileTensor(output_buf, layout_2)
- comptime kernel = moving_average_3[layout_2, SIZE_2]
+ comptime kernel = moving_average_3[SIZE_2]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
@@ -263,14 +262,12 @@ def test_broadcast_shuffle_coordination() raises:
else:
input_host[i] = Scalar[dtype](((i - 4) % 4) * 2 + 1)
- var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
- input_buf
- )
- var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
- output_buf
- )
+ var input_tensor = TileTensor[
+ mut=False, dtype, LayoutType, ImmutAnyOrigin
+ ](input_buf, layout)
+ var output_tensor = TileTensor(output_buf, layout)
- comptime kernel = broadcast_shuffle_coordination[layout, SIZE]
+ comptime kernel = broadcast_shuffle_coordination[SIZE]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
@@ -317,14 +314,12 @@ def test_basic_broadcast() raises:
for i in range(SIZE):
input_host[i] = Scalar[dtype](i + 1)
- var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
- input_buf
- )
- var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
- output_buf
- )
+ var input_tensor = TileTensor[
+ mut=False, dtype, LayoutType, ImmutAnyOrigin
+ ](input_buf, layout)
+ var output_tensor = TileTensor(output_buf, layout)
- comptime kernel = basic_broadcast[layout, SIZE]
+ comptime kernel = basic_broadcast[SIZE]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
@@ -377,14 +372,12 @@ def test_conditional_broadcast() raises:
for i in range(SIZE):
input_host[i] = test_values[i % len(test_values)]
- var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
- input_buf
- )
- var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
- output_buf
- )
+ var input_tensor = TileTensor[
+ mut=False, dtype, LayoutType, ImmutAnyOrigin
+ ](input_buf, layout)
+ var output_tensor = TileTensor(output_buf, layout)
- comptime kernel = conditional_broadcast[layout, SIZE]
+ comptime kernel = conditional_broadcast[SIZE]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
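
Two details in the p25 hunks are easy to miss: kernels that only ever run with the module-level layout now name `LayoutType` directly in their signatures instead of taking a layout parameter, and the element-type accessor is renamed from `element_type` to `ElementType`. In sketch form (the kernel name is hypothetical; everything else mirrors the hunks above):

```mojo
comptime SIZE = 32
comptime dtype = DType.float32
comptime layout = row_major[SIZE]()
comptime LayoutType = type_of(layout)

def warp_kernel[size: Int](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
    var acc: output.ElementType = 0.0  # was: output.element_type
```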
diff --git a/problems/p26/p26.mojo b/problems/p26/p26.mojo
index d4e82266..33e127fe 100644
--- a/problems/p26/p26.mojo
+++ b/problems/p26/p26.mojo
@@ -1,7 +1,8 @@
from std.gpu import thread_idx, block_idx, block_dim, lane_id
from std.gpu.host import DeviceContext
from std.gpu.primitives.warp import shuffle_xor, prefix_sum, WARP_SIZE
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
from std.sys import argv
from std.testing import assert_equal, assert_almost_equal
@@ -10,14 +11,15 @@ comptime SIZE = WARP_SIZE
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (WARP_SIZE, 1)
comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)
def butterfly_pair_swap[
- layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
"""
Basic butterfly pair swap: Exchange values between adjacent pairs using XOR pattern.
@@ -35,10 +37,10 @@ def butterfly_pair_swap[
# ANCHOR: butterfly_parallel_max
def butterfly_parallel_max[
- layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
"""
Parallel maximum reduction using butterfly pattern.
@@ -59,14 +61,15 @@ def butterfly_parallel_max[
comptime SIZE_2 = 64
comptime BLOCKS_PER_GRID_2 = (2, 1)
comptime THREADS_PER_BLOCK_2 = (WARP_SIZE, 1)
-comptime layout_2 = Layout.row_major(SIZE_2)
+comptime layout_2 = row_major[SIZE_2]()
+comptime LayoutType_2 = type_of(layout_2)
def butterfly_conditional_max[
- layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType_2, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutType_2, ImmutAnyOrigin],
):
"""
Conditional butterfly maximum: Perform butterfly max reduction, but only store result
@@ -88,10 +91,10 @@ def butterfly_conditional_max[
# ANCHOR: warp_inclusive_prefix_sum
def warp_inclusive_prefix_sum[
- layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
"""
Inclusive prefix sum using warp primitive:
@@ -123,10 +126,10 @@ def warp_inclusive_prefix_sum[
# ANCHOR: warp_partition
def warp_partition[
- layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
pivot: Float32,
):
"""
@@ -167,14 +170,12 @@ def test_butterfly_pair_swap() raises:
for i in range(SIZE):
input_host[i] = Scalar[dtype](i)
- var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
- input_buf
- )
- var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
- output_buf
- )
+ var input_tensor = TileTensor[
+ mut=False, dtype, LayoutType, ImmutAnyOrigin
+ ](input_buf, layout)
+ var output_tensor = TileTensor(output_buf, layout)
- comptime kernel = butterfly_pair_swap[layout, SIZE]
+ comptime kernel = butterfly_pair_swap[SIZE]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
@@ -218,14 +219,12 @@ def test_butterfly_parallel_max() raises:
# Make sure we have a clear maximum
input_host[SIZE - 1] = 1000.0
- var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
- input_buf
- )
- var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
- output_buf
- )
+ var input_tensor = TileTensor[
+ mut=False, dtype, LayoutType, ImmutAnyOrigin
+ ](input_buf, layout)
+ var output_tensor = TileTensor(output_buf, layout)
- comptime kernel = butterfly_parallel_max[layout, SIZE]
+ comptime kernel = butterfly_parallel_max[SIZE]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
@@ -264,14 +263,12 @@ def test_butterfly_conditional_max() raises:
else:
input_host[i] = Scalar[dtype](i % 10)
- var input_tensor = LayoutTensor[dtype, layout_2, ImmutAnyOrigin](
- input_buf
- )
- var output_tensor = LayoutTensor[dtype, layout_2, MutAnyOrigin](
- output_buf
- )
+ var input_tensor = TileTensor[
+ mut=False, dtype, LayoutType_2, ImmutAnyOrigin
+ ](input_buf, layout_2)
+ var output_tensor = TileTensor(output_buf, layout_2)
- comptime kernel = butterfly_conditional_max[layout_2, SIZE_2]
+ comptime kernel = butterfly_conditional_max[SIZE_2]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
@@ -324,14 +321,12 @@ def test_warp_inclusive_prefix_sum() raises:
for i in range(SIZE):
input_host[i] = Scalar[dtype](i + 1)
- var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
- input_buf
- )
- var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
- output_buf
- )
+ var input_tensor = TileTensor[
+ mut=False, dtype, LayoutType, ImmutAnyOrigin
+ ](input_buf, layout)
+ var output_tensor = TileTensor(output_buf, layout)
- comptime kernel = warp_inclusive_prefix_sum[layout, SIZE]
+ comptime kernel = warp_inclusive_prefix_sum[SIZE]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
@@ -390,14 +385,12 @@ def test_warp_partition() raises:
for i in range(SIZE):
input_host[i] = Scalar[dtype](test_values[i % len(test_values)])
- var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
- input_buf
- )
- var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
- output_buf
- )
+ var input_tensor = TileTensor[
+ mut=False, dtype, LayoutType, ImmutAnyOrigin
+ ](input_buf, layout)
+ var output_tensor = TileTensor(output_buf, layout)
- comptime kernel = warp_partition[layout, SIZE]
+ comptime kernel = warp_partition[SIZE]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
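
The p26 test code leans on constructor inference: when the buffer's mutability and origin already match what the kernel expects, `TileTensor(buf, layout)` infers every parameter, while read-only views still spell out `mut=False` and `ImmutAnyOrigin`. Both forms appear in the hunks above:

```mojo
# Fully inferred: mut, dtype, layout type, and origin come from the arguments.
var output_tensor = TileTensor(output_buf, layout)

# Explicit: the buffer is writable, but the kernel wants an immutable view.
var input_tensor = TileTensor[
    mut=False, dtype, LayoutType, ImmutAnyOrigin
](input_buf, layout)
```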
diff --git a/problems/p27/p27.mojo b/problems/p27/p27.mojo
index 78835fc3..6dbbd7de 100644
--- a/problems/p27/p27.mojo
+++ b/problems/p27/p27.mojo
@@ -4,7 +4,9 @@ from std.gpu.primitives.warp import WARP_SIZE
from std.gpu.primitives import block
from std.gpu.host import DeviceContext
from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from std.sys import argv
from std.testing import assert_equal
from std.math import floor
@@ -12,22 +14,19 @@ from std.math import floor
# ANCHOR: traditional_dot_product
def traditional_dot_product[
- in_layout: Layout, out_layout: Layout, tpb: Int
+ tpb: Int
](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
- b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayoutType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
+ b: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
size: Int,
):
"""Traditional dot product using shared memory + barriers + tree reduction.
Educational but complex - shows the manual coordination needed."""
- var shared = LayoutTensor[
- dtype,
- Layout.row_major(tpb),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[tpb]())
var global_i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
@@ -58,17 +57,19 @@ def traditional_dot_product[
comptime SIZE = 128
comptime TPB = 128
comptime NUM_BINS = 8
-comptime in_layout = Layout.row_major(SIZE)
-comptime out_layout = Layout.row_major(1)
+comptime in_layout = row_major[SIZE]()
+comptime InLayoutType = type_of(in_layout)
+comptime out_layout = row_major[1]()
+comptime OutLayoutType = type_of(out_layout)
comptime dtype = DType.float32
def block_sum_dot_product[
- in_layout: Layout, out_layout: Layout, tpb: Int
+ tpb: Int
](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
- b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayoutType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
+ b: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
size: Int,
):
"""Dot product using block.sum() - convenience function like warp.sum()!
@@ -83,15 +84,18 @@ def block_sum_dot_product[
# ANCHOR_END: block_sum_dot_product
# ANCHOR: block_histogram
-comptime bin_layout = Layout.row_major(SIZE) # Max SIZE elements per bin
+comptime bin_layout = row_major[SIZE]() # Max SIZE elements per bin
+comptime BinLayoutType = type_of(bin_layout)
def block_histogram_bin_extract[
- in_layout: Layout, bin_layout: Layout, out_layout: Layout, tpb: Int
+ tpb: Int
](
- input_data: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
- bin_output: LayoutTensor[dtype, bin_layout, MutAnyOrigin],
- count_output: LayoutTensor[DType.int32, out_layout, MutAnyOrigin],
+ input_data: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
+ bin_output: TileTensor[mut=True, dtype, BinLayoutType, MutAnyOrigin],
+ count_output: TileTensor[
+ mut=True, DType.int32, OutLayoutType, MutAnyOrigin
+ ],
size: Int,
target_bin: Int,
num_bins: Int,
@@ -133,14 +137,15 @@ def block_histogram_bin_extract[
# ANCHOR: block_normalize
-comptime vector_layout = Layout.row_major(SIZE)
+comptime vector_layout = row_major[SIZE]()
+comptime VectorLayoutType = type_of(vector_layout)
def block_normalize_vector[
- in_layout: Layout, out_layout: Layout, tpb: Int
+ tpb: Int
](
- input_data: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
- output_data: LayoutTensor[dtype, out_layout, MutAnyOrigin],
+ input_data: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
+ output_data: TileTensor[mut=True, dtype, VectorLayoutType, MutAnyOrigin],
size: Int,
):
"""Vector mean normalization using block.sum() + block.broadcast() combination.
@@ -208,14 +213,16 @@ def main() raises:
print("TPB:", TPB)
print("Expected result:", expected)
- a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a)
- b_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](b_buf)
- out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out)
+ a_tensor = TileTensor[
+ mut=False, dtype, InLayoutType, ImmutAnyOrigin
+ ](a, in_layout)
+ b_tensor = TileTensor[
+ mut=False, dtype, InLayoutType, ImmutAnyOrigin
+ ](b_buf, in_layout)
+ out_tensor = TileTensor(out, out_layout)
# Traditional approach: works perfectly when size == TPB
- comptime kernel = traditional_dot_product[
- in_layout, out_layout, TPB
- ]
+ comptime kernel = traditional_dot_product[TPB]
ctx.enqueue_function[kernel, kernel](
out_tensor,
a_tensor,
@@ -253,12 +260,16 @@ def main() raises:
print("TPB:", TPB)
print("Expected result:", expected)
- a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a)
- b_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](b_buf)
- out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out)
+ a_tensor = TileTensor[
+ mut=False, dtype, InLayoutType, ImmutAnyOrigin
+ ](a, in_layout)
+ b_tensor = TileTensor[
+ mut=False, dtype, InLayoutType, ImmutAnyOrigin
+ ](b_buf, in_layout)
+ out_tensor = TileTensor(out, out_layout)
# Block.sum(): Same result with dramatically simpler code!
- comptime kernel = block_sum_dot_product[in_layout, out_layout, TPB]
+ comptime kernel = block_sum_dot_product[TPB]
ctx.enqueue_function[kernel, kernel](
out_tensor,
a_tensor,
@@ -307,9 +318,9 @@ def main() raises:
print("...")
print()
- input_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](
- input_buf
- )
+ input_tensor = TileTensor[
+ mut=False, dtype, InLayoutType, ImmutAnyOrigin
+ ](input_buf, in_layout)
# Demonstrate histogram for each bin using block.prefix_sum()
for target_bin in range(NUM_BINS):
@@ -329,17 +340,11 @@ def main() raises:
var bin_count = ctx.enqueue_create_buffer[DType.int32](1)
bin_count.enqueue_fill(0)
- var bin_tensor = LayoutTensor[dtype, bin_layout, MutAnyOrigin](
- bin_data
- )
- var count_tensor = LayoutTensor[
- DType.int32, out_layout, MutAnyOrigin
- ](bin_count)
+ var bin_tensor = TileTensor(bin_data, bin_layout)
+ var count_tensor = TileTensor(bin_count, out_layout)
# Execute histogram kernel for this specific bin
- comptime kernel = block_histogram_bin_extract[
- in_layout, bin_layout, out_layout, TPB
- ]
+ comptime kernel = block_histogram_bin_extract[TPB]
ctx.enqueue_function[kernel, kernel](
input_tensor,
bin_tensor,
@@ -405,17 +410,13 @@ def main() raises:
print("Mean value:", mean_value)
print()
- input_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](
- input_buf
- )
- var output_tensor = LayoutTensor[
- dtype, vector_layout, MutAnyOrigin
- ](output_buf)
+ input_tensor = TileTensor[
+ mut=False, dtype, InLayoutType, ImmutAnyOrigin
+ ](input_buf, in_layout)
+ var output_tensor = TileTensor(output_buf, vector_layout)
# Execute vector normalization kernel
- comptime kernel = block_normalize_vector[
- in_layout, vector_layout, TPB
- ]
+ comptime kernel = block_normalize_vector[TPB]
ctx.enqueue_function[kernel, kernel](
input_tensor,
output_tensor,
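
The payoff of binding module-level layout types in kernel signatures shows up at the p27 call sites: once `InLayoutType` and `OutLayoutType` live in the signature, every instantiation shrinks to the parameters that actually vary. For example (both lines from the hunks above):

```mojo
# Before: layouts threaded through every instantiation
comptime kernel = traditional_dot_product[in_layout, out_layout, TPB]

# After: only the block size remains
comptime kernel = traditional_dot_product[TPB]
```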
diff --git a/problems/p28/p28.mojo b/problems/p28/p28.mojo
index 86b48d7c..ec1c8928 100644
--- a/problems/p28/p28.mojo
+++ b/problems/p28/p28.mojo
@@ -1,7 +1,9 @@
from std.gpu import thread_idx, block_idx, block_dim, grid_dim, barrier
from std.gpu.host import DeviceContext
from std.gpu.memory import AddressSpace, async_copy_wait_all
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from layout.layout_tensor import copy_dram_to_sram_async
from std.sys import argv, info
from std.testing import assert_equal, assert_almost_equal
@@ -17,15 +19,18 @@ comptime BLOCKS_PER_GRID_ASYNC = (
) // CONV_TILE_SIZE
comptime THREADS_PER_BLOCK_ASYNC = 256
comptime dtype = DType.float32
-comptime layout_async = Layout.row_major(VECTOR_SIZE)
+comptime layout_async = row_major[VECTOR_SIZE]()
+comptime LayoutAsyncType = type_of(layout_async)
+comptime kernel_layout = row_major[KERNEL_SIZE]()
+comptime KernelLayoutType = type_of(kernel_layout)
def async_copy_overlap_convolution[
- dtype: DType, layout: Layout
+ dtype: DType
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
- kernel: LayoutTensor[dtype, Layout.row_major(KERNEL_SIZE), ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutAsyncType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutAsyncType, ImmutAnyOrigin],
+ kernel: TileTensor[mut=False, dtype, KernelLayoutType, ImmutAnyOrigin],
):
"""Demonstrates async copy operations building on p14 patterns.
@@ -34,18 +39,12 @@ def async_copy_overlap_convolution[
"""
# Shared memory buffers (like p14, but without .fill(0) to avoid race)
- var input_shared = LayoutTensor[
- dtype,
- Layout.row_major(CONV_TILE_SIZE),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var kernel_shared = LayoutTensor[
- dtype,
- Layout.row_major(KERNEL_SIZE),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
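+    # stack_allocation is now a free function: dtype and address space are
+    # compile-time parameters, and the layout is passed as a value.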
+ var input_shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[CONV_TILE_SIZE]())
+ var kernel_shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[KERNEL_SIZE]())
# FILL IN HERE (roughly 19 lines)
@@ -73,17 +72,15 @@ def test_async_copy_overlap_convolution() raises:
for i in range(KERNEL_SIZE):
kernel_host[i] = Scalar[dtype](i + 1)
- var input_tensor = LayoutTensor[dtype, layout_async, ImmutAnyOrigin](
- input_buf
+ var input_tensor = TileTensor[
+ mut=False, dtype, LayoutAsyncType, ImmutAnyOrigin
+ ](input_buf, layout_async)
+ var output_tensor = TileTensor(output_buf, layout_async)
+ var kernel_tensor = TileTensor[mut=False, dtype, KernelLayoutType](
+ kernel_buf, kernel_layout
)
- var output_tensor = LayoutTensor[dtype, layout_async, MutAnyOrigin](
- output_buf
- )
- var kernel_tensor = LayoutTensor[
- mut=False, dtype, Layout.row_major(KERNEL_SIZE)
- ](kernel_buf)
- comptime kernel = async_copy_overlap_convolution[dtype, layout_async]
+ comptime kernel = async_copy_overlap_convolution[dtype]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
diff --git a/problems/p29/p29.mojo b/problems/p29/p29.mojo
index a645526f..89156067 100644
--- a/problems/p29/p29.mojo
+++ b/problems/p29/p29.mojo
@@ -9,7 +9,9 @@ from std.gpu.sync import (
)
from std.gpu.host import DeviceContext
from std.gpu.memory import AddressSpace, async_copy_wait_all
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from layout.layout_tensor import copy_dram_to_sram_async
from std.sys import argv, info
from std.testing import assert_true, assert_almost_equal
@@ -21,7 +23,8 @@ comptime SIZE = 1024 # Image size (1D for simplicity)
comptime BLOCKS_PER_GRID = (4, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)
# Multi-stage processing configuration
comptime STAGE1_THREADS = TPB // 2
@@ -29,11 +32,9 @@ comptime STAGE2_THREADS = TPB // 2
comptime BLUR_RADIUS = 2
-def multi_stage_image_blur_pipeline[
- layout: Layout
-](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+def multi_stage_image_blur_pipeline(
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
size: Int,
):
"""Multi-stage image blur pipeline with barrier coordination.
@@ -44,18 +45,12 @@ def multi_stage_image_blur_pipeline[
"""
# Shared memory buffers for pipeline stages
- var input_shared = LayoutTensor[
- dtype,
- Layout.row_major(TPB),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var blur_shared = LayoutTensor[
- dtype,
- Layout.row_major(TPB),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var input_shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB]())
+ var blur_shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB]())
var global_i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
@@ -88,11 +83,9 @@ comptime STENCIL_ITERATIONS = 3
comptime BUFFER_COUNT = 2
-def double_buffered_stencil_computation[
- layout: Layout
-](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+def double_buffered_stencil_computation(
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
size: Int,
):
"""Double-buffered stencil computation with memory barrier coordination.
@@ -102,38 +95,23 @@ def double_buffered_stencil_computation[
"""
# Double-buffering: Two shared memory buffers
- var buffer_A = LayoutTensor[
- dtype,
- Layout.row_major(TPB),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var buffer_B = LayoutTensor[
- dtype,
- Layout.row_major(TPB),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var buffer_A = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB]())
+ var buffer_B = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB]())
# Memory barriers for coordinating buffer swaps
- var init_barrier = LayoutTensor[
- DType.uint64,
- Layout.row_major(1),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var iter_barrier = LayoutTensor[
- DType.uint64,
- Layout.row_major(1),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var final_barrier = LayoutTensor[
- DType.uint64,
- Layout.row_major(1),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
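+    # Each barrier is a single uint64 counter, hence the row_major[1]() layout.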
+ var init_barrier = stack_allocation[
+ dtype=DType.uint64, address_space=AddressSpace.SHARED
+ ](row_major[1]())
+ var iter_barrier = stack_allocation[
+ dtype=DType.uint64, address_space=AddressSpace.SHARED
+ ](row_major[1]())
+ var final_barrier = stack_allocation[
+ dtype=DType.uint64, address_space=AddressSpace.SHARED
+ ](row_major[1]())
var global_i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
@@ -205,11 +183,13 @@ def test_multi_stage_pipeline() raises:
# Create a simple wave pattern for blurring
inp_host[i] = Scalar[dtype](i % 10) + Scalar[dtype](i) / 100.0
- # Create LayoutTensors
- var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
- var inp_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp)
+ # Create TileTensors
+ var out_tensor = TileTensor(out, layout)
+ var inp_tensor = TileTensor[
+ mut=False, dtype, LayoutType, ImmutAnyOrigin
+ ](inp, layout)
- comptime kernel = multi_stage_image_blur_pipeline[layout]
+ comptime kernel = multi_stage_image_blur_pipeline
ctx.enqueue_function[kernel, kernel](
out_tensor,
inp_tensor,
@@ -267,11 +247,13 @@ def test_double_buffered_stencil() raises:
# Create a step pattern that will be smoothed by stencil
inp_host[i] = Scalar[dtype](1.0 if i % 20 < 10 else 0.0)
- # Create LayoutTensors for Puzzle 29B
- var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
- var inp_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp)
+ # Create TileTensors for Puzzle 29B
+ var out_tensor = TileTensor(out, layout)
+ var inp_tensor = TileTensor[
+ mut=False, dtype, LayoutType, ImmutAnyOrigin
+ ](inp, layout)
- comptime kernel = double_buffered_stencil_computation[layout]
+ comptime kernel = double_buffered_stencil_computation
ctx.enqueue_function[kernel, kernel](
out_tensor,
inp_tensor,
diff --git a/problems/p30/p30.mojo b/problems/p30/p30.mojo
index 57ca41dd..4cee7205 100644
--- a/problems/p30/p30.mojo
+++ b/problems/p30/p30.mojo
@@ -1,6 +1,7 @@
from std.gpu import thread_idx, block_dim, block_idx
from std.gpu.host import DeviceContext
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
from std.sys import argv
from std.testing import assert_almost_equal
from std.benchmark import Bench, BenchConfig, Bencher, BenchId, keep
@@ -12,16 +13,15 @@ comptime BLOCKS_PER_GRID = (
1,
) # Enough blocks to cover all elements
comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)
# ANCHOR: kernel1
-def kernel1[
- layout: Layout
-](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
- b: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+def kernel1(
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
+ b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
size: Int,
):
var i = block_dim.x * block_idx.x + thread_idx.x
@@ -33,12 +33,10 @@ def kernel1[
# ANCHOR: kernel2
-def kernel2[
- layout: Layout
-](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
- b: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+def kernel2(
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
+ b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
size: Int,
):
var tid = block_idx.x * block_dim.x + thread_idx.x
@@ -54,12 +52,10 @@ def kernel2[
# ANCHOR: kernel3
-def kernel3[
- layout: Layout
-](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
- b: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+def kernel3(
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
+ b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
size: Int,
):
var tid = block_idx.x * block_dim.x + thread_idx.x
@@ -81,7 +77,8 @@ def benchmark_kernel1_parameterized[test_size: Int](mut b: Bencher) raises:
@parameter
@always_inline
def kernel1_workflow(ctx: DeviceContext) raises:
- comptime layout = Layout.row_major(test_size)
+ comptime layout = row_major[test_size]()
+ comptime LayoutType = type_of(layout)
var out = ctx.enqueue_create_buffer[dtype](test_size)
out.enqueue_fill(0)
var a = ctx.enqueue_create_buffer[dtype](test_size)
@@ -94,11 +91,11 @@ def benchmark_kernel1_parameterized[test_size: Int](mut b: Bencher) raises:
a_host[i] = Scalar[dtype](i + 1)
b_host[i] = Scalar[dtype](i + 2)
- var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
- var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a)
- var b_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](b_buf)
+ var out_tensor = TileTensor(out, layout)
+ var a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout)
+ var b_tensor = TileTensor[mut=False, dtype, LayoutType](b_buf, layout)
- ctx.enqueue_function[kernel1[layout], kernel1[layout]](
+ ctx.enqueue_function[kernel1, kernel1](
out_tensor,
a_tensor,
b_tensor,
@@ -119,7 +116,8 @@ def benchmark_kernel2_parameterized[test_size: Int](mut b: Bencher) raises:
@parameter
@always_inline
def kernel2_workflow(ctx: DeviceContext) raises:
- comptime layout = Layout.row_major(test_size)
+ comptime layout = row_major[test_size]()
+ comptime LayoutType = type_of(layout)
var out = ctx.enqueue_create_buffer[dtype](test_size)
out.enqueue_fill(0)
var a = ctx.enqueue_create_buffer[dtype](test_size)
@@ -132,11 +130,11 @@ def benchmark_kernel2_parameterized[test_size: Int](mut b: Bencher) raises:
a_host[i] = Scalar[dtype](i + 1)
b_host[i] = Scalar[dtype](i + 2)
- var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
- var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a)
- var b_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](b_buf)
+ var out_tensor = TileTensor(out, layout)
+ var a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout)
+ var b_tensor = TileTensor[mut=False, dtype, LayoutType](b_buf, layout)
- ctx.enqueue_function[kernel2[layout], kernel2[layout]](
+ ctx.enqueue_function[kernel2, kernel2](
out_tensor,
a_tensor,
b_tensor,
@@ -157,7 +155,8 @@ def benchmark_kernel3_parameterized[test_size: Int](mut b: Bencher) raises:
@parameter
@always_inline
def kernel3_workflow(ctx: DeviceContext) raises:
- comptime layout = Layout.row_major(test_size)
+ comptime layout = row_major[test_size]()
+ comptime LayoutType = type_of(layout)
var out = ctx.enqueue_create_buffer[dtype](test_size)
out.enqueue_fill(0)
var a = ctx.enqueue_create_buffer[dtype](test_size)
@@ -170,11 +169,11 @@ def benchmark_kernel3_parameterized[test_size: Int](mut b: Bencher) raises:
a_host[i] = Scalar[dtype](i + 1)
b_host[i] = Scalar[dtype](i + 2)
- var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
- var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a)
- var b_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](b_buf)
+ var out_tensor = TileTensor(out, layout)
+ var a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout)
+ var b_tensor = TileTensor[mut=False, dtype, LayoutType](b_buf, layout)
- ctx.enqueue_function[kernel3[layout], kernel3[layout]](
+ ctx.enqueue_function[kernel3, kernel3](
out_tensor,
a_tensor,
b_tensor,
@@ -206,12 +205,12 @@ def test_kernel1() raises:
a_host[i] = Scalar[dtype](i + 1)
b_host[i] = Scalar[dtype](i + 2)
- # Create LayoutTensors
- var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
- var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a)
- var b_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](b)
+ # Create TileTensors
+ var out_tensor = TileTensor(out, layout)
+ var a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout)
+ var b_tensor = TileTensor[mut=False, dtype, LayoutType](b, layout)
- ctx.enqueue_function[kernel1[layout], kernel1[layout]](
+ ctx.enqueue_function[kernel1, kernel1](
out_tensor,
a_tensor,
b_tensor,
@@ -249,12 +248,12 @@ def test_kernel2() raises:
a_host[i] = Scalar[dtype](i + 1)
b_host[i] = Scalar[dtype](i + 2)
- # Create LayoutTensors
- var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
- var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a)
- var b_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](b)
+ # Create TileTensors
+ var out_tensor = TileTensor(out, layout)
+ var a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout)
+ var b_tensor = TileTensor[mut=False, dtype, LayoutType](b, layout)
- ctx.enqueue_function[kernel2[layout], kernel2[layout]](
+ ctx.enqueue_function[kernel2, kernel2](
out_tensor,
a_tensor,
b_tensor,
@@ -295,12 +294,12 @@ def test_kernel3() raises:
a_host[i] = Scalar[dtype](i + 1)
b_host[i] = Scalar[dtype](i + 2)
- # Create LayoutTensors
- var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
- var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a)
- var b_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](b)
+ # Create TileTensors
+ var out_tensor = TileTensor(out, layout)
+ var a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout)
+ var b_tensor = TileTensor[mut=False, dtype, LayoutType](b, layout)
- ctx.enqueue_function[kernel3[layout], kernel3[layout]](
+ ctx.enqueue_function[kernel3, kernel3](
out_tensor,
a_tensor,
b_tensor,
diff --git a/problems/p31/p31.mojo b/problems/p31/p31.mojo
index d70f583b..36930648 100644
--- a/problems/p31/p31.mojo
+++ b/problems/p31/p31.mojo
@@ -1,7 +1,9 @@
from std.gpu import thread_idx, block_dim, block_idx, barrier
from std.gpu.host import DeviceContext
from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from std.sys import argv
from std.testing import assert_almost_equal
from std.benchmark import Bench, BenchConfig, Bencher, BenchId, keep
@@ -11,15 +13,14 @@ comptime SIZE = 32 * 1024 * 1024 # 32M elements - larger workload to show occup
comptime THREADS_PER_BLOCK = (1024, 1)
comptime BLOCKS_PER_GRID = (SIZE // 1024, 1)
comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)
comptime ALPHA = Scalar[dtype](2.5) # SAXPY coefficient
-def minimal_kernel[
- layout: Layout
-](
- y: LayoutTensor[dtype, layout, MutAnyOrigin],
- x: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+def minimal_kernel(
+ y: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ x: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
alpha: Float32,
size: Int,
):
@@ -35,23 +36,20 @@ def minimal_kernel[
# ANCHOR: sophisticated_kernel
-def sophisticated_kernel[
- layout: Layout
-](
- y: LayoutTensor[dtype, layout, MutAnyOrigin],
- x: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+def sophisticated_kernel(
+ y: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ x: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
alpha: Float32,
size: Int,
):
"""Sophisticated SAXPY kernel - over-engineered with excessive resource usage.
"""
# Maximum shared memory allocation (close to 48KB limit)
- var shared_cache = LayoutTensor[
- dtype,
- Layout.row_major(1024 * 12),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation() # 48KB
+ var shared_cache = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](
+ row_major[1024 * 12]()
+ ) # 48KB
var i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
@@ -132,23 +130,20 @@ def sophisticated_kernel[
# ANCHOR: balanced_kernel
-def balanced_kernel[
- layout: Layout
-](
- y: LayoutTensor[dtype, layout, MutAnyOrigin],
- x: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+def balanced_kernel(
+ y: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ x: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
alpha: Float32,
size: Int,
):
"""Balanced SAXPY kernel - efficient optimization with moderate resources.
"""
# Reasonable shared memory usage for effective caching (16KB)
- var shared_cache = LayoutTensor[
- dtype,
- Layout.row_major(1024 * 4),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation() # 16KB total
+ var shared_cache = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](
+ row_major[1024 * 4]()
+ ) # 16KB total
var i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
@@ -195,7 +190,8 @@ def benchmark_minimal_parameterized[test_size: Int](mut b: Bencher) raises:
@parameter
@always_inline
def minimal_workflow(ctx: DeviceContext) raises:
- comptime layout = Layout.row_major(test_size)
+ comptime layout = row_major[test_size]()
+ comptime LayoutType = type_of(layout)
var y = ctx.enqueue_create_buffer[dtype](test_size)
y.enqueue_fill(0)
var x = ctx.enqueue_create_buffer[dtype](test_size)
@@ -206,10 +202,10 @@ def benchmark_minimal_parameterized[test_size: Int](mut b: Bencher) raises:
x_host[i] = Scalar[dtype](i + 1)
y_host[i] = Scalar[dtype](i + 2)
- var y_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](y)
- var x_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](x)
+ var y_tensor = TileTensor(y, layout)
+ var x_tensor = TileTensor[mut=False, dtype, LayoutType](x, layout)
- comptime kernel = minimal_kernel[layout]
+ comptime kernel = minimal_kernel
ctx.enqueue_function[kernel, kernel](
y_tensor,
x_tensor,
@@ -233,7 +229,8 @@ def benchmark_sophisticated_parameterized[
@parameter
@always_inline
def sophisticated_workflow(ctx: DeviceContext) raises:
- comptime layout = Layout.row_major(test_size)
+ comptime layout = row_major[test_size]()
+ comptime LayoutType = type_of(layout)
var y = ctx.enqueue_create_buffer[dtype](test_size)
y.enqueue_fill(0)
var x = ctx.enqueue_create_buffer[dtype](test_size)
@@ -244,10 +241,10 @@ def benchmark_sophisticated_parameterized[
x_host[i] = Scalar[dtype](i + 1)
y_host[i] = Scalar[dtype](i + 2)
- var y_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](y)
- var x_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](x)
+ var y_tensor = TileTensor(y, layout)
+ var x_tensor = TileTensor[mut=False, dtype, LayoutType](x, layout)
- comptime kernel = sophisticated_kernel[layout]
+ comptime kernel = sophisticated_kernel
ctx.enqueue_function[kernel, kernel](
y_tensor,
x_tensor,
@@ -269,7 +266,8 @@ def benchmark_balanced_parameterized[test_size: Int](mut b: Bencher) raises:
@parameter
@always_inline
def balanced_workflow(ctx: DeviceContext) raises:
- comptime layout = Layout.row_major(test_size)
+ comptime layout = row_major[test_size]()
+ comptime LayoutType = type_of(layout)
var y = ctx.enqueue_create_buffer[dtype](test_size)
y.enqueue_fill(0)
var x = ctx.enqueue_create_buffer[dtype](test_size)
@@ -280,10 +278,10 @@ def benchmark_balanced_parameterized[test_size: Int](mut b: Bencher) raises:
x_host[i] = Scalar[dtype](i + 1)
y_host[i] = Scalar[dtype](i + 2)
- var y_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](y)
- var x_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](x)
+ var y_tensor = TileTensor(y, layout)
+ var x_tensor = TileTensor[mut=False, dtype, LayoutType](x, layout)
- comptime kernel = balanced_kernel[layout]
+ comptime kernel = balanced_kernel
ctx.enqueue_function[kernel, kernel](
y_tensor,
x_tensor,
@@ -314,11 +312,11 @@ def test_minimal() raises:
x_host[i] = Scalar[dtype](i + 1)
y_host[i] = Scalar[dtype](i + 2)
- # Create LayoutTensors
- var y_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](y)
- var x_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](x)
+ # Create TileTensors
+ var y_tensor = TileTensor(y, layout)
+ var x_tensor = TileTensor[mut=False, dtype, LayoutType](x, layout)
- comptime kernel = minimal_kernel[layout]
+ comptime kernel = minimal_kernel
ctx.enqueue_function[kernel, kernel](
y_tensor,
x_tensor,
@@ -357,11 +355,11 @@ def test_sophisticated() raises:
x_host[i] = Scalar[dtype](i + 1)
y_host[i] = Scalar[dtype](i + 2)
- # Create LayoutTensors
- var y_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](y)
- var x_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](x)
+ # Create TileTensors
+ var y_tensor = TileTensor(y, layout)
+ var x_tensor = TileTensor[mut=False, dtype, LayoutType](x, layout)
- comptime kernel = sophisticated_kernel[layout]
+ comptime kernel = sophisticated_kernel
ctx.enqueue_function[kernel, kernel](
y_tensor,
x_tensor,
@@ -401,11 +399,11 @@ def test_balanced() raises:
x_host[i] = Scalar[dtype](i + 1)
y_host[i] = Scalar[dtype](i + 2)
- # Create LayoutTensors
- var y_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](y)
- var x_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](x)
+ # Create TileTensors
+ var y_tensor = TileTensor(y, layout)
+ var x_tensor = TileTensor[mut=False, dtype, LayoutType](x, layout)
- comptime kernel = balanced_kernel[layout]
+ comptime kernel = balanced_kernel
ctx.enqueue_function[kernel, kernel](
y_tensor,
x_tensor,
diff --git a/problems/p32/p32.mojo b/problems/p32/p32.mojo
index 21e3c543..d8b406af 100644
--- a/problems/p32/p32.mojo
+++ b/problems/p32/p32.mojo
@@ -1,7 +1,9 @@
from std.gpu import thread_idx, block_dim, block_idx, barrier
from std.gpu.host import DeviceContext
from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from std.sys import argv
from std.testing import assert_almost_equal
from std.benchmark import Bench, BenchConfig, Bencher, BenchId, keep
@@ -12,14 +14,13 @@ comptime TPB = 256 # Threads per block - divisible by 32 (warp size)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime BLOCKS_PER_GRID = (SIZE // TPB, 1)
comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)
-def no_conflict_kernel[
- layout: Layout
-](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+def no_conflict_kernel(
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
size: Int,
):
"""Perfect shared memory access - no bank conflicts.
@@ -29,12 +30,9 @@ def no_conflict_kernel[
"""
# Shared memory buffer - each thread loads one element
- var shared_buf = LayoutTensor[
- dtype,
- Layout.row_major(TPB),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var shared_buf = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB]())
var global_i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
@@ -58,26 +56,21 @@ def no_conflict_kernel[
# ANCHOR: two_way_conflict_kernel
-def two_way_conflict_kernel[
- layout: Layout
-](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+def two_way_conflict_kernel(
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
size: Int,
):
"""Stride-2 shared memory access - creates 2-way bank conflicts.
-    Threads 0,16 → Bank 0, Threads 1,17 → Bank 1, etc.
+ Threads 0,16 -> Bank 0, Threads 1,17 -> Bank 1, etc.
Each bank serves 2 threads, doubling access time.
"""
# Shared memory buffer - stride-2 access pattern creates conflicts
- var shared_buf = LayoutTensor[
- dtype,
- Layout.row_major(TPB),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var shared_buf = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB]())
var global_i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
@@ -111,7 +104,8 @@ def benchmark_no_conflict[test_size: Int](mut b: Bencher) raises:
@parameter
@always_inline
def kernel_workflow(ctx: DeviceContext) raises:
- comptime layout = Layout.row_major(test_size)
+ comptime layout = row_major[test_size]()
+ comptime LayoutType = type_of(layout)
var out = ctx.enqueue_create_buffer[dtype](test_size)
out.enqueue_fill(0)
var input_buf = ctx.enqueue_create_buffer[dtype](test_size)
@@ -121,12 +115,12 @@ def benchmark_no_conflict[test_size: Int](mut b: Bencher) raises:
for i in range(test_size):
input_host[i] = Scalar[dtype](i + 1)
- var out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr())
- var input_tensor = LayoutTensor[mut=False, dtype, layout](
- input_buf.unsafe_ptr()
+ var out_tensor = TileTensor(out, layout)
+ var input_tensor = TileTensor[mut=False, dtype, LayoutType](
+ input_buf, layout
)
- comptime kernel = no_conflict_kernel[layout]
+ comptime kernel = no_conflict_kernel
ctx.enqueue_function[kernel, kernel](
out_tensor,
input_tensor,
@@ -147,7 +141,8 @@ def benchmark_two_way_conflict[test_size: Int](mut b: Bencher) raises:
@parameter
@always_inline
def kernel_workflow(ctx: DeviceContext) raises:
- comptime layout = Layout.row_major(test_size)
+ comptime layout = row_major[test_size]()
+ comptime LayoutType = type_of(layout)
var out = ctx.enqueue_create_buffer[dtype](test_size)
out.enqueue_fill(0)
var input_buf = ctx.enqueue_create_buffer[dtype](test_size)
@@ -157,12 +152,12 @@ def benchmark_two_way_conflict[test_size: Int](mut b: Bencher) raises:
for i in range(test_size):
input_host[i] = Scalar[dtype](i + 1)
- var out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr())
- var input_tensor = LayoutTensor[mut=False, dtype, layout](
- input_buf.unsafe_ptr()
+ var out_tensor = TileTensor(out, layout)
+ var input_tensor = TileTensor[mut=False, dtype, LayoutType](
+ input_buf, layout
)
- comptime kernel = two_way_conflict_kernel[layout]
+ comptime kernel = two_way_conflict_kernel
ctx.enqueue_function[kernel, kernel](
out_tensor,
input_tensor,
@@ -189,12 +184,12 @@ def test_no_conflict() raises:
for i in range(SIZE):
input_host[i] = Scalar[dtype](i + 1)
- var out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr())
- var input_tensor = LayoutTensor[mut=False, dtype, layout](
- input_buf.unsafe_ptr()
+ var out_tensor = TileTensor(out, layout)
+ var input_tensor = TileTensor[mut=False, dtype, LayoutType](
+ input_buf, layout
)
- comptime kernel = no_conflict_kernel[layout]
+ comptime kernel = no_conflict_kernel
ctx.enqueue_function[kernel, kernel](
out_tensor,
input_tensor,
@@ -223,12 +218,12 @@ def test_two_way_conflict() raises:
for i in range(SIZE):
input_host[i] = Scalar[dtype](i + 1)
- var out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr())
- var input_tensor = LayoutTensor[mut=False, dtype, layout](
- input_buf.unsafe_ptr()
+ var out_tensor = TileTensor(out, layout)
+ var input_tensor = TileTensor[mut=False, dtype, LayoutType](
+ input_buf, layout
)
- comptime kernel = two_way_conflict_kernel[layout]
+ comptime kernel = two_way_conflict_kernel
ctx.enqueue_function[kernel, kernel](
out_tensor,
input_tensor,
diff --git a/problems/p33/p33.mojo b/problems/p33/p33.mojo
index 9ffa00a4..1e18af61 100644
--- a/problems/p33/p33.mojo
+++ b/problems/p33/p33.mojo
@@ -1,7 +1,9 @@
from std.gpu import thread_idx, block_idx, block_dim, barrier, WARP_SIZE
from std.gpu.host import DeviceContext
from std.gpu.memory import AddressSpace, async_copy_wait_all
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from layout.tensor_core import TensorCore
from layout.layout_tensor import copy_dram_to_sram_async
from std.utils import Index
@@ -10,7 +12,8 @@ from std.testing import assert_equal, assert_almost_equal
comptime dtype = DType.float32
comptime SIZE = 1024
-comptime layout = Layout.row_major(SIZE, SIZE)
+comptime layout = row_major[SIZE, SIZE]()
+comptime LayoutType = type_of(layout)
comptime BLOCK_DIM_COUNT = 2
comptime TILE_SIZE = 32
@@ -23,11 +26,11 @@ comptime THREADS_PER_BLOCK_TILED = (TILE_SIZE, TILE_SIZE)
# ANCHOR: matmul_idiomatic_tiled_solution
def matmul_idiomatic_tiled[
- layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
- b: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
+ b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
# Use block_dim to get actual tile size dynamically
var tile_size_x = block_dim.x
@@ -40,23 +43,17 @@ def matmul_idiomatic_tiled[
# Get the tile of the output matrix that this thread block is responsible for
var out_tile = output.tile[TILE_SIZE, TILE_SIZE](block_idx.y, block_idx.x)
- var a_shared = LayoutTensor[
- dtype,
- Layout.row_major(TILE_SIZE, TILE_SIZE),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var b_shared = LayoutTensor[
- dtype,
- Layout.row_major(TILE_SIZE, TILE_SIZE),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
-
- var acc: output.element_type = 0
-
- comptime load_a_layout = Layout.row_major(1, TILE_SIZE) # Coalesced loading
- comptime load_b_layout = Layout.row_major(1, TILE_SIZE) # Coalesced loading
+ var a_shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TILE_SIZE, TILE_SIZE]())
+ var b_shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TILE_SIZE, TILE_SIZE]())
+
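+    # ElementType is TileTensor's name for the element scalar type
+    # (formerly element_type on LayoutTensor).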
+ var acc: output.ElementType = 0
+
+ comptime load_a_layout = row_major[1, TILE_SIZE]() # Coalesced loading
+ comptime load_b_layout = row_major[1, TILE_SIZE]() # Coalesced loading
# Note: Both matrices stored in same orientation for correct matrix multiplication
# Transposed loading would be useful if B were pre-transposed in global memory
@@ -121,9 +118,6 @@ comptime BLOCKS_PER_GRID_TENSOR_CORE = (
def tensor_core_matrix_multiplication[
dtype: DType,
- layout_a: Layout,
- layout_b: Layout,
- layout_c: Layout,
BM: Int,
BN: Int,
BK: Int,
@@ -133,13 +127,13 @@ def tensor_core_matrix_multiplication[
MMA_N: Int,
MMA_K: Int,
](
- A: LayoutTensor[dtype, layout_a, ImmutAnyOrigin],
- B: LayoutTensor[dtype, layout_b, ImmutAnyOrigin],
- C: LayoutTensor[dtype, layout_c, MutAnyOrigin],
+ A: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
+ B: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
+ C: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
):
- comptime M = C.shape[0]()
- comptime N = C.shape[1]()
- comptime K = A.shape[1]()
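+    # TileTensor exposes compile-time extents via dim[i]() rather than shape[i]().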
+ comptime M = C.dim[0]()
+ comptime N = C.dim[1]()
+ comptime K = A.dim[1]()
var warp_id = thread_idx.x // WARP_SIZE
var warps_in_n = BN // WN
@@ -155,26 +149,17 @@ def tensor_core_matrix_multiplication[
var mma_op = TensorCore[A.dtype, C.dtype, Index(MMA_M, MMA_N, MMA_K)]()
# Shared SRAM tiles (no padding to stay under shared memory limit)
- var A_sram_tile = LayoutTensor[
- A.dtype,
- Layout.row_major(BM, BK),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var B_sram_tile = LayoutTensor[
- B.dtype,
- Layout.row_major(BK, BN),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var A_sram_tile = stack_allocation[
+ dtype=A.dtype, address_space=AddressSpace.SHARED
+ ](row_major[BM, BK]())
+ var B_sram_tile = stack_allocation[
+ dtype=B.dtype, address_space=AddressSpace.SHARED
+ ](row_major[BK, BN]())
# One per-warp accumulator tile of shape [WM, WN]
- var C_warp_accum = LayoutTensor[
- C.dtype,
- Layout.row_major(WM, WN),
- MutAnyOrigin,
- address_space=AddressSpace.GENERIC,
- ].stack_allocation()
+ var C_warp_accum = stack_allocation[
+ dtype=C.dtype, address_space=AddressSpace.GENERIC
+ ](row_major[WM, WN]())
# Zero initialize accumulator (only for active warps)
if warp_is_active:
@@ -190,12 +175,12 @@ def tensor_core_matrix_multiplication[
var B_dram_tile = B.tile[BK, BN](k_i, block_idx.x)
copy_dram_to_sram_async[
- thread_layout=Layout.row_major(4, 8),
+ thread_layout=row_major[4, 8](),
num_threads=256,
block_dim_count=BLOCK_DIM_COUNT,
](A_sram_tile.vectorize[1, 4](), A_dram_tile.vectorize[1, 4]())
copy_dram_to_sram_async[
- thread_layout=Layout.row_major(4, 8),
+ thread_layout=row_major[4, 8](),
num_threads=256,
block_dim_count=BLOCK_DIM_COUNT,
](B_sram_tile.vectorize[1, 4](), B_dram_tile.vectorize[1, 4]())
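+        # Both copies use a 4x8 thread arrangement with 4-wide vectorized
+        # transfers per thread.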
@@ -274,19 +259,18 @@ def main() raises:
inp1_host[i * SIZE + k] * inp2_host[k * SIZE + j]
)
-        # Create layout tensors
+        # Create TileTensors
- var out_tensor_core_layout = LayoutTensor[dtype, layout](
- out_tensor_core.unsafe_ptr()
+ var out_tensor_core_layout = TileTensor(out_tensor_core, layout)
+ var a_tensor = TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin](
+ inp1, layout
+ )
+ var b_tensor = TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin](
+ inp2, layout
)
- var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp1)
- var b_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp2)
if mode == "--tensor-core":
print("\n=== Running ACTUAL Tensor Core Matrix Multiplication ===")
comptime kernel = tensor_core_matrix_multiplication[
dtype,
- layout,
- layout,
- layout,
BM,
BN,
BK,
@@ -313,12 +297,10 @@ def main() raises:
# Create separate buffer for tiled result
out_tiled = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
out_tiled.enqueue_fill(0)
- out_tiled_layout = LayoutTensor[dtype, layout](
- out_tiled.unsafe_ptr()
- )
+ out_tiled_layout = TileTensor(out_tiled, layout)
# Run idiomatic tiled version with proper 2D block configuration
- comptime kernel = matmul_idiomatic_tiled[layout, SIZE]
+ comptime kernel = matmul_idiomatic_tiled[SIZE]
ctx.enqueue_function[kernel, kernel](
out_tiled_layout,
a_tensor,
@@ -341,9 +323,6 @@ def main() raises:
print("\n--- Test 1: Tensor Core vs CPU Reference ---")
comptime kernel = tensor_core_matrix_multiplication[
dtype,
- layout,
- layout,
- layout,
BM,
BN,
BK,
@@ -420,11 +399,9 @@ def main() raises:
print("\n--- Test 2: Idiomatic Tiled vs CPU Reference ---")
out_tiled = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
out_tiled.enqueue_fill(0)
- out_tiled_layout = LayoutTensor[dtype, layout](
- out_tiled.unsafe_ptr()
- )
+ out_tiled_layout = TileTensor(out_tiled, layout)
- comptime kernel2 = matmul_idiomatic_tiled[layout, SIZE]
+ comptime kernel2 = matmul_idiomatic_tiled[SIZE]
ctx.enqueue_function[kernel2, kernel2](
out_tiled_layout,
a_tensor,
diff --git a/problems/p34/p34.mojo b/problems/p34/p34.mojo
index 373f9c13..71c5c9af 100644
--- a/problems/p34/p34.mojo
+++ b/problems/p34/p34.mojo
@@ -8,7 +8,9 @@ from std.gpu.primitives.cluster import (
elect_one_sync,
)
from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from std.sys import argv
from std.testing import assert_equal, assert_almost_equal, assert_true
@@ -16,16 +18,20 @@ comptime SIZE = 1024
comptime TPB = 256
comptime CLUSTER_SIZE = 4
comptime dtype = DType.float32
-comptime in_layout = Layout.row_major(SIZE)
-comptime out_layout = Layout.row_major(1)
+comptime in_layout = row_major[SIZE]()
+comptime InLayoutType = type_of(in_layout)
+comptime out_layout = row_major[1]()
+comptime OutLayoutType = type_of(out_layout)
+comptime cluster_layout = row_major[CLUSTER_SIZE]()
+comptime ClusterLayoutType = type_of(cluster_layout)
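+# One output slot per cooperating block; the cluster kernels reuse this layout
+# for per-block results and temp storage.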
# ANCHOR: cluster_coordination_basics
def cluster_coordination_basics[
- in_layout: Layout, out_layout: Layout, tpb: Int
+ tpb: Int
](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- input: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, ClusterLayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
size: Int,
):
"""Real cluster coordination using SM90+ cluster APIs."""
@@ -36,12 +42,9 @@ def cluster_coordination_basics[
var my_block_rank = Int(block_rank_in_cluster())
var block_id = block_idx.x
- var shared_data = LayoutTensor[
- dtype,
- Layout.row_major(tpb),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var shared_data = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[tpb]())
# FIX: Use block_idx.x for data distribution instead of cluster rank
# Each block should process different portions of the data
@@ -77,13 +80,11 @@ def cluster_coordination_basics[
# ANCHOR: cluster_collective_operations
def cluster_collective_operations[
- in_layout: Layout, out_layout: Layout, tpb: Int
+ tpb: Int
](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- input: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
- temp_storage: LayoutTensor[
- dtype, Layout.row_major(CLUSTER_SIZE), MutAnyOrigin
- ],
+ output: TileTensor[mut=True, dtype, OutLayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
+ temp_storage: TileTensor[mut=True, dtype, ClusterLayoutType, MutAnyOrigin],
size: Int,
):
"""Cluster-wide collective operations using real cluster APIs."""
@@ -98,10 +99,10 @@ def cluster_collective_operations[
# ANCHOR: advanced_cluster_patterns
def advanced_cluster_patterns[
- in_layout: Layout, out_layout: Layout, tpb: Int
+ tpb: Int
](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- input: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, ClusterLayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
size: Int,
):
"""Advanced cluster programming using cluster masks and relaxed synchronization.
@@ -135,16 +136,12 @@ def main() raises:
for i in range(SIZE):
input_host[i] = Scalar[dtype](i % 10) * 0.1
- input_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](
- input_buf
- )
- output_tensor = LayoutTensor[
- dtype, Layout.row_major(CLUSTER_SIZE), MutAnyOrigin
- ](output_buf)
+ input_tensor = TileTensor[
+ mut=False, dtype, InLayoutType, ImmutAnyOrigin
+ ](input_buf, in_layout)
+ output_tensor = TileTensor(output_buf, cluster_layout)
- comptime kernel = cluster_coordination_basics[
- in_layout, Layout.row_major(CLUSTER_SIZE), TPB
- ]
+ comptime kernel = cluster_coordination_basics[TPB]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
@@ -199,19 +196,13 @@ def main() raises:
print("Expected sum:", expected_sum)
- input_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](
- input_buf
- )
- var output_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](
- output_buf
- )
- var temp_tensor = LayoutTensor[
- dtype, Layout.row_major(CLUSTER_SIZE), MutAnyOrigin
- ](temp_buf)
+ input_tensor = TileTensor[
+ mut=False, dtype, InLayoutType, ImmutAnyOrigin
+ ](input_buf, in_layout)
+ var output_tensor = TileTensor(output_buf, out_layout)
+ var temp_tensor = TileTensor(temp_buf, cluster_layout)
- comptime kernel = cluster_collective_operations[
- in_layout, out_layout, TPB
- ]
+ comptime kernel = cluster_collective_operations[TPB]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
@@ -251,16 +242,12 @@ def main() raises:
Scalar[dtype](i % 50) * 0.02
) # Pattern for testing
- input_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](
- input_buf
- )
- output_tensor = LayoutTensor[
- dtype, Layout.row_major(CLUSTER_SIZE), MutAnyOrigin
- ](output_buf)
+ input_tensor = TileTensor[
+ mut=False, dtype, InLayoutType, ImmutAnyOrigin
+ ](input_buf, in_layout)
+ output_tensor = TileTensor(output_buf, cluster_layout)
- comptime kernel = advanced_cluster_patterns[
- in_layout, Layout.row_major(CLUSTER_SIZE), TPB
- ]
+ comptime kernel = advanced_cluster_patterns[TPB]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
diff --git a/solutions/p04/p04_layout_tensor.mojo b/solutions/p04/p04_tile_tensor.mojo
similarity index 70%
rename from solutions/p04/p04_layout_tensor.mojo
rename to solutions/p04/p04_tile_tensor.mojo
index 394b7a26..c47d4b94 100644
--- a/solutions/p04/p04_layout_tensor.mojo
+++ b/solutions/p04/p04_tile_tensor.mojo
@@ -1,19 +1,21 @@
from std.gpu import thread_idx
from std.gpu.host import DeviceContext
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
from std.testing import assert_equal
comptime SIZE = 2
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = (3, 3)
comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE, SIZE)
+comptime layout = row_major[SIZE, SIZE]()
+comptime LayoutType = type_of(layout)
-# ANCHOR: add_10_2d_layout_tensor_solution
+# ANCHOR: add_10_2d_tile_tensor_solution
def add_10_2d(
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- a: LayoutTensor[dtype, layout, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ a: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
size: Int,
):
var row = thread_idx.y
@@ -22,17 +24,15 @@ def add_10_2d(
output[row, col] = a[row, col] + 10.0
-# ANCHOR_END: add_10_2d_layout_tensor_solution
+# ANCHOR_END: add_10_2d_tile_tensor_solution
def main() raises:
with DeviceContext() as ctx:
var out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
out_buf.enqueue_fill(0)
- var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
- out_buf
- ).reshape[layout]()
- print("out shape:", out_tensor.shape[0](), "x", out_tensor.shape[1]())
+ var out_tensor = TileTensor(out_buf, layout)
+ print("out shape:", out_tensor.dim[0](), "x", out_tensor.dim[1]())
var expected = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE)
expected.enqueue_fill(0)
@@ -44,9 +44,7 @@ def main() raises:
a_host[i] = Scalar[dtype](i)
expected[i] = a_host[i] + 10
- var a_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](a).reshape[
- layout
- ]()
+ var a_tensor = TileTensor(a, layout)
ctx.enqueue_function[add_10_2d, add_10_2d](
out_tensor,
diff --git a/solutions/p05/p05.mojo b/solutions/p05/p05.mojo
index eeb61d5d..224a57e5 100644
--- a/solutions/p05/p05.mojo
+++ b/solutions/p05/p05.mojo
@@ -1,25 +1,32 @@
-from std.memory import UnsafePointer
from std.gpu import thread_idx
from std.gpu.host import DeviceContext
+from layout import TileTensor
+from layout.tile_layout import row_major
from std.testing import assert_equal
comptime SIZE = 2
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = (3, 3)
comptime dtype = DType.float32
+comptime out_layout = row_major[SIZE, SIZE]()
+comptime a_layout = row_major[1, SIZE]()
+comptime b_layout = row_major[SIZE, 1]()
+comptime OutLayout = type_of(out_layout)
+comptime ALayout = type_of(a_layout)
+comptime BLayout = type_of(b_layout)
# ANCHOR: broadcast_add_solution
def broadcast_add(
- output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
- a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
- b: UnsafePointer[Scalar[dtype], MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, ALayout, ImmutAnyOrigin],
+ b: TileTensor[mut=False, dtype, BLayout, ImmutAnyOrigin],
size: Int,
):
var row = thread_idx.y
var col = thread_idx.x
if row < size and col < size:
- output[row * size + col] = a[col] + b[row]
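+        # a is laid out (1, SIZE) and b (SIZE, 1), so 2D indexing expresses the
+        # broadcast directly instead of manual flat-index arithmetic.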
+ output[row, col] = a[0, col] + b[row, 0]
# ANCHOR_END: broadcast_add_solution
@@ -27,10 +34,15 @@ def broadcast_add(
def main() raises:
with DeviceContext() as ctx:
- var out = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
- out.enqueue_fill(0)
- var expected = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE)
- expected.enqueue_fill(0)
+ var out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
+ out_buf.enqueue_fill(0)
+ var out_tensor = TileTensor(out_buf, out_layout)
+ print("out shape:", out_tensor.dim[0](), "x", out_tensor.dim[1]())
+
+ var expected_buf = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE)
+ expected_buf.enqueue_fill(0)
+ var expected_tensor = TileTensor(expected_buf, out_layout)
+
var a = ctx.enqueue_create_buffer[dtype](SIZE)
a.enqueue_fill(0)
var b = ctx.enqueue_create_buffer[dtype](SIZE)
@@ -40,14 +52,17 @@ def main() raises:
a_host[i] = Scalar[dtype](i + 1)
b_host[i] = Scalar[dtype](i * 10)
- for y in range(SIZE):
- for x in range(SIZE):
- expected[y * SIZE + x] = a_host[x] + b_host[y]
+ for i in range(SIZE):
+ for j in range(SIZE):
+ expected_tensor[i, j] = a_host[j] + b_host[i]
+
+ var a_tensor = TileTensor[mut=False, dtype, ALayout](a, a_layout)
+ var b_tensor = TileTensor[mut=False, dtype, BLayout](b, b_layout)
ctx.enqueue_function[broadcast_add, broadcast_add](
- out,
- a,
- b,
+ out_tensor,
+ a_tensor,
+ b_tensor,
SIZE,
grid_dim=BLOCKS_PER_GRID,
block_dim=THREADS_PER_BLOCK,
@@ -55,10 +70,12 @@ def main() raises:
ctx.synchronize()
- with out.map_to_host() as out_host:
- print("out:", out_host)
- print("expected:", expected)
- for y in range(SIZE):
- for x in range(SIZE):
- assert_equal(out_host[y * SIZE + x], expected[y * SIZE + x])
+ with out_buf.map_to_host() as out_buf_host:
+ print("out:", out_buf_host)
+ print("expected:", expected_buf)
+ for i in range(SIZE):
+ for j in range(SIZE):
+ assert_equal(
+ out_buf_host[i * SIZE + j], expected_buf[i * SIZE + j]
+ )
print("Puzzle 05 complete โ
")
diff --git a/solutions/p05/p05_layout_tensor.mojo b/solutions/p05/p05_layout_tensor.mojo
deleted file mode 100644
index 3573c21d..00000000
--- a/solutions/p05/p05_layout_tensor.mojo
+++ /dev/null
@@ -1,84 +0,0 @@
-from std.gpu import thread_idx
-from std.gpu.host import DeviceContext
-from layout import Layout, LayoutTensor
-from std.testing import assert_equal
-
-comptime SIZE = 2
-comptime BLOCKS_PER_GRID = 1
-comptime THREADS_PER_BLOCK = (3, 3)
-comptime dtype = DType.float32
-comptime out_layout = Layout.row_major(SIZE, SIZE)
-comptime a_layout = Layout.row_major(1, SIZE)
-comptime b_layout = Layout.row_major(SIZE, 1)
-
-
-# ANCHOR: broadcast_add_layout_tensor_solution
-def broadcast_add[
- out_layout: Layout,
- a_layout: Layout,
- b_layout: Layout,
-](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, a_layout, ImmutAnyOrigin],
- b: LayoutTensor[dtype, b_layout, ImmutAnyOrigin],
- size: Int,
-):
- var row = thread_idx.y
- var col = thread_idx.x
- if row < size and col < size:
- output[row, col] = a[0, col] + b[row, 0]
-
-
-# ANCHOR_END: broadcast_add_layout_tensor_solution
-
-
-def main() raises:
- with DeviceContext() as ctx:
- var out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
- out_buf.enqueue_fill(0)
- var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out_buf)
- print("out shape:", out_tensor.shape[0](), "x", out_tensor.shape[1]())
-
- var expected_buf = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE)
- expected_buf.enqueue_fill(0)
- var expected_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](
- expected_buf
- )
-
- var a = ctx.enqueue_create_buffer[dtype](SIZE)
- a.enqueue_fill(0)
- var b = ctx.enqueue_create_buffer[dtype](SIZE)
- b.enqueue_fill(0)
- with a.map_to_host() as a_host, b.map_to_host() as b_host:
- for i in range(SIZE):
- a_host[i] = Scalar[dtype](i + 1)
- b_host[i] = Scalar[dtype](i * 10)
-
- for i in range(SIZE):
- for j in range(SIZE):
- expected_tensor[i, j] = a_host[j] + b_host[i]
-
- var a_tensor = LayoutTensor[dtype, a_layout, ImmutAnyOrigin](a)
- var b_tensor = LayoutTensor[dtype, b_layout, ImmutAnyOrigin](b)
-
- comptime kernel = broadcast_add[out_layout, a_layout, b_layout]
- ctx.enqueue_function[kernel, kernel](
- out_tensor,
- a_tensor,
- b_tensor,
- SIZE,
- grid_dim=BLOCKS_PER_GRID,
- block_dim=THREADS_PER_BLOCK,
- )
-
- ctx.synchronize()
-
- with out_buf.map_to_host() as out_buf_host:
- print("out:", out_buf_host)
- print("expected:", expected_buf)
- for i in range(SIZE):
- for j in range(SIZE):
- assert_equal(
- out_buf_host[i * SIZE + j], expected_buf[i * SIZE + j]
- )
- print("Puzzle 05 complete โ
")
diff --git a/solutions/p07/p07.mojo b/solutions/p07/p07.mojo
index 35639053..9523bb37 100644
--- a/solutions/p07/p07.mojo
+++ b/solutions/p07/p07.mojo
@@ -1,24 +1,29 @@
-from std.memory import UnsafePointer
from std.gpu import thread_idx, block_idx, block_dim
from std.gpu.host import DeviceContext
+from layout import TileTensor
+from layout.tile_layout import row_major
from std.testing import assert_equal
comptime SIZE = 5
comptime BLOCKS_PER_GRID = (2, 2)
comptime THREADS_PER_BLOCK = (3, 3)
comptime dtype = DType.float32
+comptime out_layout = row_major[SIZE, SIZE]()
+comptime a_layout = row_major[SIZE, SIZE]()
+comptime OutLayout = type_of(out_layout)
+comptime ALayout = type_of(a_layout)
# ANCHOR: add_10_blocks_2d_solution
def add_10_blocks_2d(
- output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
- a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, ALayout, ImmutAnyOrigin],
size: Int,
):
var row = block_dim.y * block_idx.y + thread_idx.y
var col = block_dim.x * block_idx.x + thread_idx.x
if row < size and col < size:
- output[row * size + col] = a[row * size + col] + 10.0
+ output[row, col] = a[row, col] + 10.0
# ANCHOR_END: add_10_blocks_2d_solution
@@ -26,10 +31,13 @@ def add_10_blocks_2d(
def main() raises:
with DeviceContext() as ctx:
- var out = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
- out.enqueue_fill(0)
- var expected = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE)
- expected.enqueue_fill(1)
+ var out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
+ out_buf.enqueue_fill(0)
+ var out_tensor = TileTensor(out_buf, out_layout)
+
+ var expected_buf = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE)
+ expected_buf.enqueue_fill(1)
+
var a = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
a.enqueue_fill(1)
@@ -38,11 +46,13 @@ def main() raises:
for i in range(SIZE):
var k = j * SIZE + i
a_host[k] = Scalar[dtype](k)
- expected[k] = Scalar[dtype](k + 10)
+ expected_buf[k] = Scalar[dtype](k + 10)
+
+ var a_tensor = TileTensor[mut=False, dtype, ALayout](a, a_layout)
ctx.enqueue_function[add_10_blocks_2d, add_10_blocks_2d](
- out,
- a,
+ out_tensor,
+ a_tensor,
SIZE,
grid_dim=BLOCKS_PER_GRID,
block_dim=THREADS_PER_BLOCK,
@@ -50,10 +60,17 @@ def main() raises:
ctx.synchronize()
- with out.map_to_host() as out_host:
- print("out:", out_host)
- print("expected:", expected)
+ var expected_tensor = TileTensor(expected_buf, out_layout)
+
+ with out_buf.map_to_host() as out_buf_host:
+ print(
+ "out:",
+ TileTensor(out_buf_host, out_layout),
+ )
+ print("expected:", expected_tensor)
for i in range(SIZE):
for j in range(SIZE):
- assert_equal(out_host[i * SIZE + j], expected[i * SIZE + j])
+ assert_equal(
+ out_buf_host[i * SIZE + j], expected_buf[i * SIZE + j]
+ )
print("Puzzle 07 complete โ
")
diff --git a/solutions/p07/p07_layout_tensor.mojo b/solutions/p07/p07_layout_tensor.mojo
deleted file mode 100644
index 2f1b397f..00000000
--- a/solutions/p07/p07_layout_tensor.mojo
+++ /dev/null
@@ -1,79 +0,0 @@
-from std.gpu import thread_idx, block_idx, block_dim
-from std.gpu.host import DeviceContext
-from layout import Layout, LayoutTensor
-from std.testing import assert_equal
-
-comptime SIZE = 5
-comptime BLOCKS_PER_GRID = (2, 2)
-comptime THREADS_PER_BLOCK = (3, 3)
-comptime dtype = DType.float32
-comptime out_layout = Layout.row_major(SIZE, SIZE)
-comptime a_layout = Layout.row_major(SIZE, SIZE)
-
-
-# ANCHOR: add_10_blocks_2d_layout_tensor_solution
-def add_10_blocks_2d[
- out_layout: Layout,
- a_layout: Layout,
-](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, a_layout, ImmutAnyOrigin],
- size: Int,
-):
- var row = block_dim.y * block_idx.y + thread_idx.y
- var col = block_dim.x * block_idx.x + thread_idx.x
- if row < size and col < size:
- output[row, col] = a[row, col] + 10.0
-
-
-# ANCHOR_END: add_10_blocks_2d_layout_tensor_solution
-
-
-def main() raises:
- with DeviceContext() as ctx:
- var out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
- out_buf.enqueue_fill(0)
- var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out_buf)
-
- var expected_buf = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE)
- expected_buf.enqueue_fill(1)
-
- var a = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
- a.enqueue_fill(1)
-
- with a.map_to_host() as a_host:
- for j in range(SIZE):
- for i in range(SIZE):
- var k = j * SIZE + i
- a_host[k] = Scalar[dtype](k)
- expected_buf[k] = Scalar[dtype](k + 10)
-
- var a_tensor = LayoutTensor[dtype, a_layout, ImmutAnyOrigin](a)
-
- comptime kernel = add_10_blocks_2d[out_layout, a_layout]
- ctx.enqueue_function[kernel, kernel](
- out_tensor,
- a_tensor,
- SIZE,
- grid_dim=BLOCKS_PER_GRID,
- block_dim=THREADS_PER_BLOCK,
- )
-
- ctx.synchronize()
-
- var expected_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](
- expected_buf
- )
-
- with out_buf.map_to_host() as out_buf_host:
- print(
- "out:",
- LayoutTensor[dtype, out_layout, MutAnyOrigin](out_buf_host),
- )
- print("expected:", expected_tensor)
- for i in range(SIZE):
- for j in range(SIZE):
- assert_equal(
- out_buf_host[i * SIZE + j], expected_buf[i * SIZE + j]
- )
- print("Puzzle 07 complete โ
")
diff --git a/solutions/p08/p08.mojo b/solutions/p08/p08.mojo
index 3349960c..035d744a 100644
--- a/solutions/p08/p08.mojo
+++ b/solutions/p08/p08.mojo
@@ -1,7 +1,9 @@
-from std.memory import UnsafePointer, stack_allocation
from std.gpu import thread_idx, block_idx, block_dim, barrier
from std.gpu.host import DeviceContext
from std.gpu.memory import AddressSpace
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from std.testing import assert_equal
comptime TPB = 4
@@ -9,33 +11,33 @@ comptime SIZE = 8
comptime BLOCKS_PER_GRID = (2, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)
# ANCHOR: add_10_shared_solution
-def add_10_shared(
- output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
- a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
+def add_10_shared_tile_tensor(
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
size: Int,
):
+ # Allocate shared memory using stack_allocation
var shared = stack_allocation[
- TPB,
- Scalar[dtype],
- address_space=AddressSpace.SHARED,
- ]()
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB]())
+
var global_i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
- # Load local data into shared memory
+
if global_i < size:
shared[local_i] = a[global_i]
- # Wait for all threads to complete (works within a thread block).
# Note: barrier is not strictly needed here since each thread only accesses
# its own shared memory location. However, it's included to teach proper
# shared memory synchronization patterns for more complex scenarios where
# threads need to coordinate access to shared data.
barrier()
- # process using shared memory
if global_i < size:
output[global_i] = shared[local_i] + 10
@@ -49,9 +51,15 @@ def main() raises:
out.enqueue_fill(0)
var a = ctx.enqueue_create_buffer[dtype](SIZE)
a.enqueue_fill(1)
- ctx.enqueue_function[add_10_shared, add_10_shared](
- out,
- a,
+
+ var out_tensor = TileTensor(out, layout)
+ var a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout)
+
+ ctx.enqueue_function[
+ add_10_shared_tile_tensor, add_10_shared_tile_tensor
+ ](
+ out_tensor,
+ a_tensor,
SIZE,
grid_dim=BLOCKS_PER_GRID,
block_dim=THREADS_PER_BLOCK,
@@ -59,7 +67,6 @@ def main() raises:
var expected = ctx.enqueue_create_host_buffer[dtype](SIZE)
expected.enqueue_fill(11)
-
ctx.synchronize()
with out.map_to_host() as out_host:
diff --git a/solutions/p08/p08_layout_tensor.mojo b/solutions/p08/p08_layout_tensor.mojo
deleted file mode 100644
index 9864fe8e..00000000
--- a/solutions/p08/p08_layout_tensor.mojo
+++ /dev/null
@@ -1,78 +0,0 @@
-from std.gpu import thread_idx, block_idx, block_dim, barrier
-from std.gpu.host import DeviceContext
-from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
-from std.testing import assert_equal
-
-comptime TPB = 4
-comptime SIZE = 8
-comptime BLOCKS_PER_GRID = (2, 1)
-comptime THREADS_PER_BLOCK = (TPB, 1)
-comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE)
-
-
-# ANCHOR: add_10_shared_layout_tensor_solution
-def add_10_shared_layout_tensor[
- layout: Layout
-](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
- size: Int,
-):
- # Allocate shared memory using tensor builder
- var shared = LayoutTensor[
- dtype,
- Layout.row_major(TPB),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
-
- var global_i = block_dim.x * block_idx.x + thread_idx.x
- var local_i = thread_idx.x
-
- if global_i < size:
- shared[local_i] = a[global_i]
-
- # Note: barrier is not strictly needed here since each thread only accesses
- # its own shared memory location. However, it's included to teach proper
- # shared memory synchronization patterns for more complex scenarios where
- # threads need to coordinate access to shared data.
- barrier()
-
- if global_i < size:
- output[global_i] = shared[local_i] + 10
-
-
-# ANCHOR_END: add_10_shared_layout_tensor_solution
-
-
-def main() raises:
- with DeviceContext() as ctx:
- var out = ctx.enqueue_create_buffer[dtype](SIZE)
- out.enqueue_fill(0)
- var a = ctx.enqueue_create_buffer[dtype](SIZE)
- a.enqueue_fill(1)
-
- var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
- var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a)
-
- comptime kernel = add_10_shared_layout_tensor[layout]
- ctx.enqueue_function[kernel, kernel](
- out_tensor,
- a_tensor,
- SIZE,
- grid_dim=BLOCKS_PER_GRID,
- block_dim=THREADS_PER_BLOCK,
- )
-
- var expected = ctx.enqueue_create_host_buffer[dtype](SIZE)
- expected.enqueue_fill(11)
- ctx.synchronize()
-
- with out.map_to_host() as out_host:
- print("out:", out_host)
- print("expected:", expected)
- for i in range(SIZE):
- assert_equal(out_host[i], expected[i])
-            print("Puzzle 08 complete ✅")
diff --git a/solutions/p10/p10.mojo b/solutions/p10/p10.mojo
index fa7bce95..7c25c4ed 100644
--- a/solutions/p10/p10.mojo
+++ b/solutions/p10/p10.mojo
@@ -1,7 +1,9 @@
from std.gpu import thread_idx, block_dim, block_idx, barrier
from std.gpu.host import DeviceContext
from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from std.testing import assert_equal
from std.sys import argv
from std.os.atomic import Atomic
@@ -12,24 +14,22 @@ comptime SIZE = 2
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = (3, 3)
comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE, SIZE)
+comptime layout = row_major[SIZE, SIZE]()
+comptime LayoutType = type_of(layout)
def shared_memory_race(
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
size: Int,
):
"""Fixed: sequential access with barriers eliminates race conditions."""
var row = thread_idx.y
var col = thread_idx.x
- var shared_sum = LayoutTensor[
- dtype,
- Layout.row_major(1),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var shared_sum = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[1]())
# Only thread 0 does all the accumulation work to prevent races
if row == 0 and col == 0:
@@ -53,8 +53,8 @@ def shared_memory_race(
# ANCHOR: add_10_2d_solution
def add_10_2d(
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
size: Int,
):
var row = thread_idx.y
@@ -79,10 +79,8 @@ def main() raises:
with DeviceContext() as ctx:
var out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
out_buf.enqueue_fill(0)
- var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
- out_buf
- ).reshape[layout]()
- print("out shape:", out_tensor.shape[0](), "x", out_tensor.shape[1]())
+ var out_tensor = TileTensor(out_buf, layout)
+ print("out shape:", out_tensor.dim[0](), "x", out_tensor.dim[1]())
var expected = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE)
expected.enqueue_fill(0)
@@ -92,9 +90,7 @@ def main() raises:
for i in range(SIZE * SIZE):
a_host[i] = Scalar[dtype](i)
- var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a).reshape[
- layout
- ]()
+ var a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout)
if flag == "--memory-bug":
print("Running memory bug example (bounds checking issue)...")
diff --git a/solutions/p11/p11.mojo b/solutions/p11/p11.mojo
index de3e243d..89d16c70 100644
--- a/solutions/p11/p11.mojo
+++ b/solutions/p11/p11.mojo
@@ -1,7 +1,9 @@
-from std.memory import UnsafePointer, stack_allocation
from std.gpu import thread_idx, block_idx, block_dim, barrier
from std.gpu.host import DeviceContext
from std.gpu.memory import AddressSpace
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from std.testing import assert_equal
comptime TPB = 8
@@ -9,30 +11,37 @@ comptime SIZE = 8
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)
# ANCHOR: pooling_solution
def pooling(
- output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
- a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
size: Int,
):
+ # Allocate shared memory using stack_allocation
var shared = stack_allocation[
- TPB,
- Scalar[dtype],
- address_space=AddressSpace.SHARED,
- ]()
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB]())
+
var global_i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
+
+ # Load data into shared memory
if global_i < size:
shared[local_i] = a[global_i]
+ # Synchronize threads within block
barrier()
+ # Handle first two special cases
if global_i == 0:
output[0] = shared[0]
elif global_i == 1:
output[1] = shared[0] + shared[1]
+ # Handle general case
elif 1 < global_i < size:
output[global_i] = (
shared[local_i - 2] + shared[local_i - 1] + shared[local_i]
@@ -48,13 +57,17 @@ def main() raises:
out.enqueue_fill(0)
var a = ctx.enqueue_create_buffer[dtype](SIZE)
a.enqueue_fill(0)
+
with a.map_to_host() as a_host:
for i in range(SIZE):
a_host[i] = Scalar[dtype](i)
+ var out_tensor = TileTensor(out, layout)
+ var a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout)
+
ctx.enqueue_function[pooling, pooling](
- out,
- a,
+ out_tensor,
+ a_tensor,
SIZE,
grid_dim=BLOCKS_PER_GRID,
block_dim=THREADS_PER_BLOCK,
@@ -62,7 +75,6 @@ def main() raises:
var expected = ctx.enqueue_create_host_buffer[dtype](SIZE)
expected.enqueue_fill(0)
-
ctx.synchronize()
with a.map_to_host() as a_host:
@@ -71,7 +83,6 @@ def main() raises:
var s = Scalar[dtype](0)
for j in range(max(i - 2, 0), i + 1):
s += ptr[j]
-
expected[i] = s
with out.map_to_host() as out_host:
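
`pooling` computes `out[i]` as the sum of the clipped window `a[max(i - 2, 0) .. i]`; the `global_i == 0` and `global_i == 1` branches in the kernel are just the positions where the window is shorter than three elements. A plain CPU reference of the same recurrence, mirroring the expected-value loop in `main()`:

```mojo
def pooling_reference(a: List[Float32]) -> List[Float32]:
    # out[i] = a[i - 2] + a[i - 1] + a[i], with the window clipped at 0.
    var out = List[Float32]()
    for i in range(len(a)):
        var s: Float32 = 0
        for j in range(max(i - 2, 0), i + 1):
            s += a[j]
        out.append(s)
    return out
```
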
diff --git a/solutions/p11/p11_layout_tensor.mojo b/solutions/p11/p11_layout_tensor.mojo
deleted file mode 100644
index 7cfa112c..00000000
--- a/solutions/p11/p11_layout_tensor.mojo
+++ /dev/null
@@ -1,95 +0,0 @@
-from std.gpu import thread_idx, block_idx, block_dim, barrier
-from std.gpu.host import DeviceContext
-from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
-from std.testing import assert_equal
-
-comptime TPB = 8
-comptime SIZE = 8
-comptime BLOCKS_PER_GRID = (1, 1)
-comptime THREADS_PER_BLOCK = (TPB, 1)
-comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE)
-
-
-# ANCHOR: pooling_layout_tensor_solution
-def pooling[
- layout: Layout
-](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
- size: Int,
-):
- # Allocate shared memory using tensor builder
- var shared = LayoutTensor[
- dtype,
- Layout.row_major(TPB),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
-
- var global_i = block_dim.x * block_idx.x + thread_idx.x
- var local_i = thread_idx.x
-
- # Load data into shared memory
- if global_i < size:
- shared[local_i] = a[global_i]
-
- # Synchronize threads within block
- barrier()
-
- # Handle first two special cases
- if global_i == 0:
- output[0] = shared[0]
- elif global_i == 1:
- output[1] = shared[0] + shared[1]
- # Handle general case
- elif 1 < global_i < size:
- output[global_i] = (
- shared[local_i - 2] + shared[local_i - 1] + shared[local_i]
- )
-
-
-# ANCHOR_END: pooling_layout_tensor_solution
-
-
-def main() raises:
- with DeviceContext() as ctx:
- var out = ctx.enqueue_create_buffer[dtype](SIZE)
- out.enqueue_fill(0)
- var a = ctx.enqueue_create_buffer[dtype](SIZE)
- a.enqueue_fill(0)
-
- with a.map_to_host() as a_host:
- for i in range(SIZE):
- a_host[i] = Scalar[dtype](i)
-
- var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
- var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a)
-
- ctx.enqueue_function[pooling[layout], pooling[layout]](
- out_tensor,
- a_tensor,
- SIZE,
- grid_dim=BLOCKS_PER_GRID,
- block_dim=THREADS_PER_BLOCK,
- )
-
- var expected = ctx.enqueue_create_host_buffer[dtype](SIZE)
- expected.enqueue_fill(0)
- ctx.synchronize()
-
- with a.map_to_host() as a_host:
- var ptr = a_host
- for i in range(SIZE):
- var s = Scalar[dtype](0)
- for j in range(max(i - 2, 0), i + 1):
- s += ptr[j]
- expected[i] = s
-
- with out.map_to_host() as out_host:
- print("out:", out_host)
- print("expected:", expected)
- for i in range(SIZE):
- assert_equal(out_host[i], expected[i])
-            print("Puzzle 11 complete ✅")
diff --git a/solutions/p12/p12.mojo b/solutions/p12/p12.mojo
index 99393982..b8b962dd 100644
--- a/solutions/p12/p12.mojo
+++ b/solutions/p12/p12.mojo
@@ -1,7 +1,9 @@
-from std.memory import UnsafePointer, stack_allocation
from std.gpu import thread_idx, block_idx, block_dim, barrier
from std.gpu.host import DeviceContext
from std.gpu.memory import AddressSpace
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from std.testing import assert_equal
comptime TPB = 8
@@ -9,37 +11,33 @@ comptime SIZE = 8
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
+comptime layout = row_major[SIZE]()
+comptime out_layout = row_major[1]()
+comptime LayoutType = type_of(layout)
+comptime OutLayout = type_of(out_layout)
# ANCHOR: dot_product_solution
def dot_product(
- output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
- a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
- b: UnsafePointer[Scalar[dtype], MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
+ b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
size: Int,
):
var shared = stack_allocation[
- TPB,
- Scalar[dtype],
- address_space=AddressSpace.SHARED,
- ]()
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB]())
var global_i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
+
+ # Compute element-wise multiplication into shared memory
if global_i < size:
shared[local_i] = a[global_i] * b[global_i]
+ # Synchronize threads within block
barrier()
- # The following causes race condition: all threads writing to the same location
- # out[0] += shared[local_i]
-
- # Instead can do parallel reduction in shared memory as opposed to
- # global memory which has no guarantee on synchronization.
- # Loops using global memory can cause thread divergence because
- # fundamentally GPUs execute threads in warps (groups of 32 threads typically)
- # and warps can be scheduled independently.
- # However, shared memory does not have such issues as long as we use `barrier()`
- # correctly when we're in the same thread block.
+ # Parallel reduction in shared memory
var stride = TPB // 2
while stride > 0:
if local_i < stride:
@@ -48,7 +46,7 @@ def dot_product(
barrier()
stride //= 2
- # only thread 0 writes the final result
+ # Only thread 0 writes the final result
if local_i == 0:
output[0] = shared[0]
@@ -64,15 +62,20 @@ def main() raises:
a.enqueue_fill(0)
var b = ctx.enqueue_create_buffer[dtype](SIZE)
b.enqueue_fill(0)
+
with a.map_to_host() as a_host, b.map_to_host() as b_host:
for i in range(SIZE):
a_host[i] = Scalar[dtype](i)
b_host[i] = Scalar[dtype](i)
+ var out_tensor = TileTensor(out, out_layout)
+ var a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout)
+ var b_tensor = TileTensor[mut=False, dtype, LayoutType](b, layout)
+
ctx.enqueue_function[dot_product, dot_product](
- out,
- a,
- b,
+ out_tensor,
+ a_tensor,
+ b_tensor,
SIZE,
grid_dim=BLOCKS_PER_GRID,
block_dim=THREADS_PER_BLOCK,
@@ -80,7 +83,6 @@ def main() raises:
var expected = ctx.enqueue_create_host_buffer[dtype](1)
expected.enqueue_fill(0)
-
ctx.synchronize()
with a.map_to_host() as a_host, b.map_to_host() as b_host:
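
The reduction loop in `dot_product` halves the number of active threads each pass: with `TPB = 8`, stride 4 folds `shared[4..7]` into `shared[0..3]`, stride 2 folds `shared[2..3]` into `shared[0..1]`, and stride 1 leaves the block total in `shared[0]`. A self-contained sketch of just that loop (stand-in data; same `stack_allocation` API as these hunks):

```mojo
from std.gpu import thread_idx, barrier
from std.gpu.memory import AddressSpace
from layout.tile_layout import row_major
from layout.tile_tensor import stack_allocation

comptime TPB = 8
comptime dtype = DType.float32


def tree_reduce_sketch():
    var local_i = thread_idx.x
    var shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB]())
    shared[local_i] = 1  # stand-in for a[global_i] * b[global_i]
    barrier()
    # log2(TPB) = 3 passes; the barrier after each pass keeps one pass's
    # writes from racing with the next pass's reads.
    var stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]
        barrier()
        stride //= 2
    # shared[0] now holds the block total; only thread 0 should write it out.
```
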
diff --git a/solutions/p12/p12_layout_tensor.mojo b/solutions/p12/p12_layout_tensor.mojo
deleted file mode 100644
index a5359bb5..00000000
--- a/solutions/p12/p12_layout_tensor.mojo
+++ /dev/null
@@ -1,98 +0,0 @@
-from std.gpu import thread_idx, block_idx, block_dim, barrier
-from std.gpu.host import DeviceContext
-from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
-from std.testing import assert_equal
-
-comptime TPB = 8
-comptime SIZE = 8
-comptime BLOCKS_PER_GRID = (1, 1)
-comptime THREADS_PER_BLOCK = (TPB, 1)
-comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE)
-comptime out_layout = Layout.row_major(1)
-
-
-# ANCHOR: dot_product_layout_tensor_solution
-def dot_product[
- in_layout: Layout, out_layout: Layout
-](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
- b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
- size: Int,
-):
- var shared = LayoutTensor[
- dtype,
- Layout.row_major(TPB),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var global_i = block_dim.x * block_idx.x + thread_idx.x
- var local_i = thread_idx.x
-
- # Compute element-wise multiplication into shared memory
- if global_i < size:
- shared[local_i] = a[global_i] * b[global_i]
-
- # Synchronize threads within block
- barrier()
-
- # Parallel reduction in shared memory
- var stride = TPB // 2
- while stride > 0:
- if local_i < stride:
- shared[local_i] += shared[local_i + stride]
-
- barrier()
- stride //= 2
-
- # Only thread 0 writes the final result
- if local_i == 0:
- output[0] = shared[0]
-
-
-# ANCHOR_END: dot_product_layout_tensor_solution
-
-
-def main() raises:
- with DeviceContext() as ctx:
- var out = ctx.enqueue_create_buffer[dtype](1)
- out.enqueue_fill(0)
- var a = ctx.enqueue_create_buffer[dtype](SIZE)
- a.enqueue_fill(0)
- var b = ctx.enqueue_create_buffer[dtype](SIZE)
- b.enqueue_fill(0)
-
- with a.map_to_host() as a_host, b.map_to_host() as b_host:
- for i in range(SIZE):
- a_host[i] = Scalar[dtype](i)
- b_host[i] = Scalar[dtype](i)
-
- var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out)
- var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a)
- var b_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](b)
-
- comptime kernel = dot_product[layout, out_layout]
- ctx.enqueue_function[kernel, kernel](
- out_tensor,
- a_tensor,
- b_tensor,
- SIZE,
- grid_dim=BLOCKS_PER_GRID,
- block_dim=THREADS_PER_BLOCK,
- )
-
- var expected = ctx.enqueue_create_host_buffer[dtype](1)
- expected.enqueue_fill(0)
- ctx.synchronize()
-
- with a.map_to_host() as a_host, b.map_to_host() as b_host:
- for i in range(SIZE):
- expected[0] += a_host[i] * b_host[i]
-
- with out.map_to_host() as out_host:
- print("out:", out_host)
- print("expected:", expected)
- assert_equal(out_host[0], expected[0])
-        print("Puzzle 12 complete ✅")
diff --git a/solutions/p13/p13.mojo b/solutions/p13/p13.mojo
index 7307a6f0..aacf2514 100644
--- a/solutions/p13/p13.mojo
+++ b/solutions/p13/p13.mojo
@@ -1,7 +1,9 @@
from std.gpu import thread_idx, block_idx, block_dim, barrier
from std.gpu.host import DeviceContext
from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from std.sys import argv
from std.testing import assert_equal
@@ -11,33 +13,28 @@ comptime CONV = 3
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
-comptime in_layout = Layout.row_major(SIZE)
-comptime out_layout = Layout.row_major(SIZE)
-comptime conv_layout = Layout.row_major(CONV)
+comptime in_layout = row_major[SIZE]()
+comptime out_layout = row_major[SIZE]()
+comptime conv_layout = row_major[CONV]()
+comptime InLayout = type_of(in_layout)
+comptime OutLayout = type_of(out_layout)
+comptime ConvLayout = type_of(conv_layout)
# ANCHOR: conv_1d_simple_solution
-def conv_1d_simple[
- in_layout: Layout, out_layout: Layout, conv_layout: Layout
-](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
- b: LayoutTensor[dtype, conv_layout, ImmutAnyOrigin],
+def conv_1d_simple(
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
+ b: TileTensor[mut=False, dtype, ConvLayout, ImmutAnyOrigin],
):
var global_i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
- var shared_a = LayoutTensor[
- dtype,
- Layout.row_major(SIZE),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var shared_b = LayoutTensor[
- dtype,
- Layout.row_major(CONV),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var shared_a = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[SIZE]())
+ var shared_b = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[CONV]())
if global_i < SIZE:
shared_a[local_i] = a[global_i]
@@ -58,8 +55,8 @@ def conv_1d_simple[
# Safe and correct:
if global_i < SIZE:
        # Note: declaring with `var` lets us spell out the element type explicitly for inference
- # `out.element_type` is available in LayoutTensor
- var local_sum: output.element_type = 0
+ # `out.ElementType` is available in TileTensor
+ var local_sum: output.ElementType = 0
# Note: `@parameter` decorator unrolls the loop at compile time given `CONV` is a compile-time constant
# See: https://docs.modular.com/mojo/manual/decorators/parameter/#parametric-for-statement
@@ -77,34 +74,29 @@ comptime SIZE_2 = 15
comptime CONV_2 = 4
comptime BLOCKS_PER_GRID_2 = (2, 1)
comptime THREADS_PER_BLOCK_2 = (TPB, 1)
-comptime in_2_layout = Layout.row_major(SIZE_2)
-comptime out_2_layout = Layout.row_major(SIZE_2)
-comptime conv_2_layout = Layout.row_major(CONV_2)
+comptime in_2_layout = row_major[SIZE_2]()
+comptime out_2_layout = row_major[SIZE_2]()
+comptime conv_2_layout = row_major[CONV_2]()
+comptime In2Layout = type_of(in_2_layout)
+comptime Out2Layout = type_of(out_2_layout)
+comptime Conv2Layout = type_of(conv_2_layout)
# ANCHOR: conv_1d_block_boundary_solution
-def conv_1d_block_boundary[
- in_layout: Layout, out_layout: Layout, conv_layout: Layout, dtype: DType
-](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
- b: LayoutTensor[dtype, conv_layout, ImmutAnyOrigin],
+def conv_1d_block_boundary(
+ output: TileTensor[mut=True, dtype, Out2Layout, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, In2Layout, ImmutAnyOrigin],
+ b: TileTensor[mut=False, dtype, Conv2Layout, ImmutAnyOrigin],
):
var global_i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
# first: need to account for padding
- var shared_a = LayoutTensor[
- dtype,
- Layout.row_major(TPB + CONV_2 - 1),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var shared_b = LayoutTensor[
- dtype,
- Layout.row_major(CONV_2),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var shared_a = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB + CONV_2 - 1]())
+ var shared_b = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[CONV_2]())
if global_i < SIZE_2:
shared_a[local_i] = a[global_i]
else:
@@ -127,7 +119,7 @@ def conv_1d_block_boundary[
barrier()
if global_i < SIZE_2:
- var local_sum: output.element_type = 0
+ var local_sum: output.ElementType = 0
comptime for j in range(CONV_2):
if global_i + j < SIZE_2:
@@ -158,11 +150,12 @@ def main() raises:
b_host[i] = Scalar[dtype](i)
if argv()[1] == "--simple":
- var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out)
- var a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a)
- var b_tensor = LayoutTensor[dtype, conv_layout, ImmutAnyOrigin](b)
- comptime kernel = conv_1d_simple[in_layout, out_layout, conv_layout]
- ctx.enqueue_function[kernel, kernel](
+ var out_tensor = TileTensor(out, out_layout)
+ var a_tensor = TileTensor[mut=False, dtype, InLayout](a, in_layout)
+ var b_tensor = TileTensor[mut=False, dtype, ConvLayout](
+ b, conv_layout
+ )
+ ctx.enqueue_function[conv_1d_simple, conv_1d_simple](
out_tensor,
a_tensor,
b_tensor,
@@ -170,15 +163,16 @@ def main() raises:
block_dim=THREADS_PER_BLOCK,
)
elif argv()[1] == "--block-boundary":
- var out_tensor = LayoutTensor[dtype, out_2_layout, MutAnyOrigin](
- out
+ var out_tensor = TileTensor(out, out_2_layout)
+ var a_tensor = TileTensor[mut=False, dtype, In2Layout](
+ a, in_2_layout
+ )
+ var b_tensor = TileTensor[mut=False, dtype, Conv2Layout](
+ b, conv_2_layout
)
- var a_tensor = LayoutTensor[dtype, in_2_layout, ImmutAnyOrigin](a)
- var b_tensor = LayoutTensor[dtype, conv_2_layout, ImmutAnyOrigin](b)
- comptime kernel = conv_1d_block_boundary[
- in_2_layout, out_2_layout, conv_2_layout, dtype
- ]
- ctx.enqueue_function[kernel, kernel](
+ ctx.enqueue_function[
+ conv_1d_block_boundary, conv_1d_block_boundary
+ ](
out_tensor,
a_tensor,
b_tensor,
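
The boundary-handling variant above allocates `TPB + CONV_2 - 1` shared slots per block: `TPB` for the block's own elements plus a `CONV_2 - 1` halo from the next block, so the last few threads of each block still see a full convolution window. A sketch of just the halo load, with stand-in values and the same shapes as `conv_1d_block_boundary`:

```mojo
from std.gpu import thread_idx, block_idx, block_dim
from std.gpu.memory import AddressSpace
from layout.tile_layout import row_major
from layout.tile_tensor import stack_allocation

comptime TPB = 8
comptime CONV_2 = 4
comptime SIZE_2 = 15
comptime dtype = DType.float32


def halo_load_sketch():
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x
    # TPB main slots + (CONV_2 - 1) halo slots from the next block.
    var shared_a = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB + CONV_2 - 1]())
    if global_i < SIZE_2:
        shared_a[local_i] = 1  # stand-in for a[global_i]
    else:
        shared_a[local_i] = 0
    # The first CONV_2 - 1 threads also fetch the halo elements.
    if local_i < CONV_2 - 1:
        var next_idx = global_i + TPB
        if next_idx < SIZE_2:
            shared_a[TPB + local_i] = 1  # stand-in for a[next_idx]
        else:
            shared_a[TPB + local_i] = 0  # zero-pad past the end
```
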
diff --git a/solutions/p14/p14.mojo b/solutions/p14/p14.mojo
index 55d6a800..794b46ee 100644
--- a/solutions/p14/p14.mojo
+++ b/solutions/p14/p14.mojo
@@ -1,7 +1,9 @@
from std.gpu import thread_idx, block_idx, block_dim, barrier
from std.gpu.host import DeviceContext
from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from std.sys import argv
from std.math import log2
from std.testing import assert_equal
@@ -11,25 +13,21 @@ comptime SIZE = 8
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)
# ANCHOR: prefix_sum_simple_solution
-def prefix_sum_simple[
- layout: Layout
-](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+def prefix_sum_simple(
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
size: Int,
):
var global_i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
- var shared = LayoutTensor[
- dtype,
- Layout.row_major(TPB),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB]())
if global_i < size:
shared[local_i] = a[global_i]
@@ -37,7 +35,7 @@ def prefix_sum_simple[
var offset = 1
for i in range(Int(log2(Scalar[dtype](TPB)))):
- var current_val: output.element_type = 0
+ var current_val: output.ElementType = 0
if local_i >= offset and local_i < size:
current_val = shared[local_i - offset] # read
@@ -59,28 +57,25 @@ comptime SIZE_2 = 15
comptime BLOCKS_PER_GRID_2 = (2, 1)
comptime THREADS_PER_BLOCK_2 = (TPB, 1)
comptime EXTENDED_SIZE = SIZE_2 + 2 # up to 2 blocks
-comptime layout_2 = Layout.row_major(SIZE_2)
-comptime extended_layout = Layout.row_major(EXTENDED_SIZE)
+comptime layout_2 = row_major[SIZE_2]()
+comptime extended_layout = row_major[EXTENDED_SIZE]()
+comptime Layout2Type = type_of(layout_2)
+comptime ExtendedLayout = type_of(extended_layout)
# ANCHOR: prefix_sum_complete_solution
# Kernel 1: Compute local prefix sums and store block sums in out
-def prefix_sum_local_phase[
- out_layout: Layout, in_layout: Layout
-](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
+def prefix_sum_local_phase(
+ output: TileTensor[mut=True, dtype, ExtendedLayout, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, Layout2Type, ImmutAnyOrigin],
size: Int,
):
var global_i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
- var shared = LayoutTensor[
- dtype,
- Layout.row_major(TPB),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB]())
# Load data into shared memory
# Example with SIZE_2=15, TPB=8, BLOCKS=2:
@@ -104,7 +99,7 @@ def prefix_sum_local_phase[
# Block 1 follows same pattern to get [8,17,27,38,50,63,77,???]
var offset = 1
for i in range(Int(log2(Scalar[dtype](TPB)))):
- var current_val: output.element_type = 0
+ var current_val: output.ElementType = 0
if local_i >= offset and local_i < TPB:
current_val = shared[local_i - offset] # read
@@ -132,9 +127,10 @@ def prefix_sum_local_phase[
# Kernel 2: Add block sums to their respective blocks
-def prefix_sum_block_sum_phase[
- layout: Layout
-](output: LayoutTensor[dtype, layout, MutAnyOrigin], size: Int):
+def prefix_sum_block_sum_phase(
+ output: TileTensor[mut=True, dtype, ExtendedLayout, MutAnyOrigin],
+ size: Int,
+):
var global_i = block_dim.x * block_idx.x + thread_idx.x
# Second pass: add previous block's sum to each element
@@ -172,11 +168,10 @@ def main() raises:
a_host[i] = Scalar[dtype](i)
if use_simple:
- a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a)
- out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
+ a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout)
+ out_tensor = TileTensor(out, layout)
- comptime kernel = prefix_sum_simple[layout]
- ctx.enqueue_function[kernel, kernel](
+ ctx.enqueue_function[prefix_sum_simple, prefix_sum_simple](
out_tensor,
a_tensor,
size,
@@ -184,15 +179,16 @@ def main() raises:
block_dim=THREADS_PER_BLOCK,
)
else:
- var a_tensor = LayoutTensor[dtype, layout_2, ImmutAnyOrigin](a)
- var out_tensor = LayoutTensor[dtype, extended_layout, MutAnyOrigin](
- out
+ var a_tensor = TileTensor[mut=False, dtype, Layout2Type](
+ a, layout_2
)
+ var out_tensor = TileTensor(out, extended_layout)
# ANCHOR: prefix_sum_complete_block_level_sync
# Phase 1: Local prefix sums
- comptime kernel = prefix_sum_local_phase[extended_layout, layout_2]
- ctx.enqueue_function[kernel, kernel](
+ ctx.enqueue_function[
+ prefix_sum_local_phase, prefix_sum_local_phase
+ ](
out_tensor,
a_tensor,
size,
@@ -201,8 +197,9 @@ def main() raises:
)
# Phase 2: Add block sums
- comptime kernel2 = prefix_sum_block_sum_phase[extended_layout]
- ctx.enqueue_function[kernel2, kernel2](
+ ctx.enqueue_function[
+ prefix_sum_block_sum_phase, prefix_sum_block_sum_phase
+ ](
out_tensor,
size,
grid_dim=BLOCKS_PER_GRID_2,
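
Both prefix-sum kernels use the Hillis–Steele scheme: on pass `p` each thread adds the value `2^p` slots to its left, so after `log2(TPB)` passes every slot holds an inclusive prefix sum of its block. The split into a read phase and a write phase, with a `barrier()` between them, is what keeps the in-place update race-free. A sketch of one block's scan (stand-in input, same API as these hunks):

```mojo
from std.gpu import thread_idx, barrier
from std.gpu.memory import AddressSpace
from layout.tile_layout import row_major
from layout.tile_tensor import stack_allocation
from std.math import log2

comptime TPB = 8
comptime dtype = DType.float32


def scan_sketch():
    var local_i = thread_idx.x
    var shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB]())
    shared[local_i] = 1  # stand-in for a[global_i]
    barrier()
    var offset = 1
    for _ in range(Int(log2(Scalar[dtype](TPB)))):  # offsets 1, 2, 4
        var current_val: Scalar[dtype] = 0
        if local_i >= offset:
            current_val = shared[local_i - offset]  # read phase
        barrier()
        if local_i >= offset:
            shared[local_i] += current_val  # write phase
        barrier()
        offset *= 2
    # shared[local_i] is now the inclusive prefix sum up to local_i.
```
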
diff --git a/solutions/p15/p15.mojo b/solutions/p15/p15.mojo
index 218c34b7..df06ed5a 100644
--- a/solutions/p15/p15.mojo
+++ b/solutions/p15/p15.mojo
@@ -1,7 +1,9 @@
from std.gpu import thread_idx, block_idx, block_dim, barrier
from std.gpu.host import DeviceContext
from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from std.testing import assert_equal
comptime TPB = 8
@@ -10,27 +12,24 @@ comptime SIZE = 6
comptime BLOCKS_PER_GRID = (1, BATCH)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
-comptime in_layout = Layout.row_major(BATCH, SIZE)
-comptime out_layout = Layout.row_major(BATCH, 1)
+comptime in_layout = row_major[BATCH, SIZE]()
+comptime out_layout = row_major[BATCH, 1]()
+comptime InLayout = type_of(in_layout)
+comptime OutLayout = type_of(out_layout)
# ANCHOR: axis_sum_solution
-def axis_sum[
- in_layout: Layout, out_layout: Layout
-](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
+def axis_sum(
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
size: Int,
):
var global_i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
var batch = block_idx.y
- var cache = LayoutTensor[
- dtype,
- Layout.row_major(TPB),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var cache = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB]())
# Visualize:
# Block(0,0): [T0,T1,T2,T3,T4,T5,T6,T7] -> Row 0: [0,1,2,3,4,5]
@@ -52,7 +51,7 @@ def axis_sum[
var stride = TPB // 2
while stride > 0:
# Read phase: all threads read the values they need first to avoid race conditions
- var temp_val: output.element_type = 0
+ var temp_val: output.ElementType = 0
if local_i < stride:
temp_val = cache[local_i + stride]
@@ -84,11 +83,10 @@ def main() raises:
for col in range(SIZE):
inp_host[row * SIZE + col] = Scalar[dtype](row * SIZE + col)
- var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out)
- var inp_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](inp)
+ var out_tensor = TileTensor(out, out_layout)
+ var inp_tensor = TileTensor[mut=False, dtype, InLayout](inp, in_layout)
- comptime kernel = axis_sum[in_layout, out_layout]
- ctx.enqueue_function[kernel, kernel](
+ ctx.enqueue_function[axis_sum, axis_sum](
out_tensor,
inp_tensor,
SIZE,
diff --git a/solutions/p16/p16.mojo b/solutions/p16/p16.mojo
index 62b76e00..e800a5f1 100644
--- a/solutions/p16/p16.mojo
+++ b/solutions/p16/p16.mojo
@@ -1,7 +1,9 @@
from std.gpu import thread_idx, block_idx, block_dim, barrier
from std.gpu.host import DeviceContext
from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from std.sys import argv
from std.testing import assert_equal
@@ -10,22 +12,23 @@ comptime SIZE = 2
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (TPB, TPB)
comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE, SIZE)
+comptime layout = row_major[SIZE, SIZE]()
+comptime LayoutType = type_of(layout)
# ANCHOR: naive_matmul_solution
def naive_matmul[
- layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
- b: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
+ b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
var row = block_dim.y * block_idx.y + thread_idx.y
var col = block_dim.x * block_idx.x + thread_idx.x
if row < size and col < size:
- var acc: output.element_type = 0
+ var acc: output.ElementType = 0
comptime for k in range(size):
acc += a[row, k] * b[k, col]
@@ -38,29 +41,23 @@ def naive_matmul[
# ANCHOR: single_block_matmul_solution
def single_block_matmul[
- layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
- b: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
+ b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
var row = block_dim.y * block_idx.y + thread_idx.y
var col = block_dim.x * block_idx.x + thread_idx.x
var local_row = thread_idx.y
var local_col = thread_idx.x
- var a_shared = LayoutTensor[
- dtype,
- Layout.row_major(TPB, TPB),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var b_shared = LayoutTensor[
- dtype,
- Layout.row_major(TPB, TPB),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var a_shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB, TPB]())
+ var b_shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB, TPB]())
if row < size and col < size:
a_shared[local_row, local_col] = a[row, col]
@@ -69,7 +66,7 @@ def single_block_matmul[
barrier()
if row < size and col < size:
- var acc: output.element_type = 0
+ var acc: output.ElementType = 0
comptime for k in range(size):
acc += a_shared[local_row, k] * b_shared[k, local_col]
@@ -83,36 +80,31 @@ def single_block_matmul[
comptime SIZE_TILED = 9
comptime BLOCKS_PER_GRID_TILED = (3, 3) # each block covers 3x3 elements
comptime THREADS_PER_BLOCK_TILED = (TPB, TPB)
-comptime layout_tiled = Layout.row_major(SIZE_TILED, SIZE_TILED)
+comptime layout_tiled = row_major[SIZE_TILED, SIZE_TILED]()
+comptime LayoutTiledType = type_of(layout_tiled)
# ANCHOR: matmul_tiled_solution
def matmul_tiled[
- layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, layout_tiled, MutAnyOrigin],
- a: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
- b: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutTiledType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutTiledType, ImmutAnyOrigin],
+ b: TileTensor[mut=False, dtype, LayoutTiledType, ImmutAnyOrigin],
):
var local_row = thread_idx.y
var local_col = thread_idx.x
var tiled_row = block_idx.y * TPB + local_row
var tiled_col = block_idx.x * TPB + local_col
- var a_shared = LayoutTensor[
- dtype,
- Layout.row_major(TPB, TPB),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var b_shared = LayoutTensor[
- dtype,
- Layout.row_major(TPB, TPB),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
-
- var acc: output.element_type = 0
+ var a_shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB, TPB]())
+ var b_shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB, TPB]())
+
+ var acc: output.ElementType = 0
# Iterate over tiles to compute matrix product
comptime for tile in range((size + TPB - 1) // TPB):
@@ -147,17 +139,18 @@ def matmul_tiled[
# ANCHOR: matmul_idiomatic_tiled_solution
from std.gpu.memory import async_copy_wait_all
from layout.layout_tensor import copy_dram_to_sram_async
+from layout import Layout as IntTupleLayout
comptime NUM_THREADS = TPB * TPB
comptime BLOCK_DIM_COUNT = 2
def matmul_idiomatic_tiled[
- layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, layout_tiled, MutAnyOrigin],
- a: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
- b: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutTiledType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutTiledType, ImmutAnyOrigin],
+ b: TileTensor[mut=False, dtype, LayoutTiledType, ImmutAnyOrigin],
):
var local_row = thread_idx.y
var local_col = thread_idx.x
@@ -166,23 +159,21 @@ def matmul_idiomatic_tiled[
# Get the tile of the output matrix that this thread block is responsible for
var out_tile = output.tile[TPB, TPB](block_idx.y, block_idx.x)
- var a_shared = LayoutTensor[
- dtype,
- Layout.row_major(TPB, TPB),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var b_shared = LayoutTensor[
- dtype,
- Layout.row_major(TPB, TPB),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
-
- var acc: output.element_type = 0
-
- comptime load_a_layout = Layout.row_major(1, TPB) # Coalesced loading
- comptime load_b_layout = Layout.row_major(1, TPB) # Coalesced loading
+ var a_shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB, TPB]())
+ var b_shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB, TPB]())
+
+ var acc: output.ElementType = 0
+
+ comptime load_a_layout = IntTupleLayout.row_major(
+ 1, TPB
+ ) # Coalesced loading
+ comptime load_b_layout = IntTupleLayout.row_major(
+ 1, TPB
+ ) # Coalesced loading
    # Note: Both matrices are stored in the same orientation for correct matrix multiplication
# Transposed loading would be useful if B were pre-transposed in global memory
@@ -198,12 +189,12 @@ def matmul_idiomatic_tiled[
thread_layout=load_a_layout,
num_threads=NUM_THREADS,
block_dim_count=BLOCK_DIM_COUNT,
- ](a_shared, a_tile)
+ ](a_shared.to_layout_tensor(), a_tile.to_layout_tensor())
copy_dram_to_sram_async[
thread_layout=load_b_layout,
num_threads=NUM_THREADS,
block_dim_count=BLOCK_DIM_COUNT,
- ](b_shared, b_tile)
+ ](b_shared.to_layout_tensor(), b_tile.to_layout_tensor())
# Wait for all async copies to complete
async_copy_wait_all()
@@ -254,12 +245,12 @@ def main() raises:
inp1_host[i * size + k] * inp2_host[k * size + j]
)
- var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
- var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp1)
- var b_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp2)
+ var out_tensor = TileTensor(out, layout)
+ var a_tensor = TileTensor[mut=False, dtype, LayoutType](inp1, layout)
+ var b_tensor = TileTensor[mut=False, dtype, LayoutType](inp2, layout)
if argv()[1] == "--naive":
- comptime kernel = naive_matmul[layout, SIZE]
+ comptime kernel = naive_matmul[SIZE]
ctx.enqueue_function[kernel, kernel](
out_tensor,
a_tensor,
@@ -268,7 +259,7 @@ def main() raises:
block_dim=THREADS_PER_BLOCK,
)
elif argv()[1] == "--single-block":
- comptime kernel = single_block_matmul[layout, SIZE]
+ comptime kernel = single_block_matmul[SIZE]
ctx.enqueue_function[kernel, kernel](
out_tensor,
a_tensor,
@@ -278,17 +269,15 @@ def main() raises:
)
elif argv()[1] == "--tiled":
# Need to update the layout of the tensors to the tiled layout
- out_tensor_tiled = LayoutTensor[dtype, layout_tiled, MutAnyOrigin](
- out
- )
- a_tensor_tiled = LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin](
- inp1
+ out_tensor_tiled = TileTensor(out, layout_tiled)
+ a_tensor_tiled = TileTensor[mut=False, dtype, LayoutTiledType](
+ inp1, layout_tiled
)
- b_tensor_tiled = LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin](
- inp2
+ b_tensor_tiled = TileTensor[mut=False, dtype, LayoutTiledType](
+ inp2, layout_tiled
)
- comptime kernel = matmul_tiled[layout_tiled, SIZE_TILED]
+ comptime kernel = matmul_tiled[SIZE_TILED]
ctx.enqueue_function[kernel, kernel](
out_tensor_tiled,
a_tensor_tiled,
@@ -297,17 +286,15 @@ def main() raises:
block_dim=THREADS_PER_BLOCK_TILED,
)
elif argv()[1] == "--idiomatic-tiled":
- out_tensor_tiled = LayoutTensor[dtype, layout_tiled, MutAnyOrigin](
- out
- )
- a_tensor_tiled = LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin](
- inp1
+ out_tensor_tiled = TileTensor(out, layout_tiled)
+ a_tensor_tiled = TileTensor[mut=False, dtype, LayoutTiledType](
+ inp1, layout_tiled
)
- b_tensor_tiled = LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin](
- inp2
+ b_tensor_tiled = TileTensor[mut=False, dtype, LayoutTiledType](
+ inp2, layout_tiled
)
- comptime kernel = matmul_idiomatic_tiled[layout_tiled, SIZE_TILED]
+ comptime kernel = matmul_idiomatic_tiled[SIZE_TILED]
ctx.enqueue_function[kernel, kernel](
out_tensor_tiled,
a_tensor_tiled,
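
Underneath the API changes, `matmul_tiled` keeps the same structure: walk the inner dimension one `TPB`-wide tile at a time, stage a tile of A and a tile of B in shared memory, accumulate a partial dot product, and synchronize before the next tile overwrites the staging buffers. A compressed sketch with stand-in loads; the 9×9 size divides evenly by `TPB = 3`, so bounds checks are omitted as in the solution:

```mojo
from std.gpu import thread_idx, barrier
from std.gpu.memory import AddressSpace
from layout.tile_layout import row_major
from layout.tile_tensor import stack_allocation

comptime TPB = 3
comptime SIZE_TILED = 9
comptime dtype = DType.float32


def tiled_accumulate_sketch():
    var local_row = thread_idx.y
    var local_col = thread_idx.x
    var a_shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB, TPB]())
    var b_shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB, TPB]())
    var acc: Scalar[dtype] = 0
    comptime for tile in range(SIZE_TILED // TPB):  # 3 tiles
        # Stage one element of each input tile per thread (stand-ins for
        # a[tiled_row, tile * TPB + local_col] and the matching b element).
        a_shared[local_row, local_col] = 1
        b_shared[local_row, local_col] = 1
        barrier()  # both tiles fully staged before anyone reads
        comptime for k in range(TPB):
            acc += a_shared[local_row, k] * b_shared[k, local_col]
        barrier()  # done reading before the next tile overwrites
```
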
diff --git a/solutions/p17/op/conv1d.mojo b/solutions/p17/op/conv1d.mojo
index 77517480..b83f4a74 100644
--- a/solutions/p17/op/conv1d.mojo
+++ b/solutions/p17/op/conv1d.mojo
@@ -1,7 +1,10 @@
from std.gpu import thread_idx, block_idx, block_dim, barrier
from std.gpu.host import DeviceContext
from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major, TensorLayout
+from layout.tile_tensor import stack_allocation
+from std.utils import Index
from std.sys import argv
from std.testing import assert_equal
@@ -10,59 +13,57 @@ comptime BLOCKS_PER_GRID = (2, 1)
def conv1d_kernel[
- in_layout: Layout,
- out_layout: Layout,
- conv_layout: Layout,
input_size: Int,
conv_size: Int,
+ OutLayout: TensorLayout,
+ InLayout: TensorLayout,
+ ConvLayout: TensorLayout,
dtype: DType = DType.float32,
](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- input: LayoutTensor[dtype, in_layout, MutAnyOrigin],
- kernel: LayoutTensor[dtype, conv_layout, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ input: TileTensor[mut=True, dtype, InLayout, MutAnyOrigin],
+ kernel: TileTensor[mut=True, dtype, ConvLayout, MutAnyOrigin],
):
var global_i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
+ # Convert generic TileTensors to LayoutTensor for indexing (flat_rank proof required)
+ var input_lt = input.to_layout_tensor()
+ var kernel_lt = kernel.to_layout_tensor()
+ var output_lt = output.to_layout_tensor()
# first: need to account for padding
- var shared_a = LayoutTensor[
- dtype,
- Layout.row_major(TPB + conv_size - 1),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var shared_b = LayoutTensor[
- dtype,
- Layout.row_major(conv_size),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var shared_a = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB + conv_size - 1]())
+ var shared_b = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[conv_size]())
if global_i < input_size:
- shared_a[local_i] = input[global_i]
+ shared_a[local_i] = rebind[Scalar[dtype]](input_lt[global_i])
# second: load elements needed for convolution at block boundary
if local_i < conv_size - 1:
# indices from next block
var next_idx = global_i + TPB
if next_idx < input_size:
- shared_a[TPB + local_i] = input[next_idx]
+ shared_a[TPB + local_i] = rebind[Scalar[dtype]](input_lt[next_idx])
else:
# Initialize out-of-bounds elements to 0 to avoid reading from uninitialized memory
            # which is undefined behavior
shared_a[TPB + local_i] = 0
if local_i < conv_size:
- shared_b[local_i] = kernel[local_i]
+ shared_b[local_i] = rebind[Scalar[dtype]](kernel_lt[local_i])
barrier()
if global_i < input_size:
- var local_sum: output.element_type = 0
+ var local_sum: Scalar[dtype] = 0
comptime for j in range(conv_size):
if local_i + j < TPB + conv_size - 1:
local_sum += shared_a[local_i + j] * shared_b[j]
- output[global_i] = local_sum
+ output_lt.store[1](Index(global_i), local_sum)
import compiler
@@ -82,18 +83,26 @@ struct Conv1DCustomOp:
conv_size: Int,
dtype: DType = DType.float32,
](
- output: OutputTensor[rank=1, static_spec=_],
- input: InputTensor[rank=output.rank, static_spec=_],
- kernel: InputTensor[rank=output.rank, static_spec=_],
+ output: OutputTensor[dtype=dtype, rank=1, static_spec=_],
+ input: InputTensor[dtype=dtype, rank=output.rank, static_spec=_],
+ kernel: InputTensor[dtype=dtype, rank=output.rank, static_spec=_],
# the context is needed for some GPU calls
ctx: DeviceContextPtr,
) raises:
- var output_tensor = output.to_layout_tensor()
- var input_tensor = input.to_layout_tensor()
- var kernel_tensor = kernel.to_layout_tensor()
- comptime in_layout = input_tensor.layout
- comptime out_layout = output_tensor.layout
- comptime conv_layout = kernel_tensor.layout
+ comptime out_layout_val = row_major[input_size]()
+ comptime OutLayout = type_of(out_layout_val)
+ comptime conv_layout_val = row_major[conv_size]()
+ comptime ConvLayout = type_of(conv_layout_val)
+
+ var output_tensor = TileTensor[
+ mut=True, dtype, OutLayout, MutAnyOrigin
+ ](output.unsafe_ptr(), out_layout_val)
+ var input_tensor = TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin](
+ input.unsafe_ptr(), out_layout_val
+ )
+ var kernel_tensor = TileTensor[
+ mut=True, dtype, ConvLayout, MutAnyOrigin
+ ](kernel.unsafe_ptr(), conv_layout_val)
comptime if target == "gpu":
var gpu_ctx = ctx.get_device_context()
@@ -101,7 +110,7 @@ struct Conv1DCustomOp:
gpu_ctx.enqueue_memset(
DeviceBuffer[output_tensor.dtype](
gpu_ctx,
- output_tensor.ptr,
+ output.unsafe_ptr(),
input_size,
owning=False,
),
@@ -109,7 +118,7 @@ struct Conv1DCustomOp:
)
# ANCHOR: conv1d_custom_op_solution
comptime kernel = conv1d_kernel[
- in_layout, out_layout, conv_layout, input_size, conv_size
+ input_size, conv_size, OutLayout, OutLayout, ConvLayout
]
gpu_ctx.enqueue_function[kernel, kernel](
output_tensor,
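
A pattern worth calling out in this file: kernels that take layout-generic `TileTensor` parameters first convert them with `to_layout_tensor()` and then `rebind` each load back to `Scalar[dtype]`, since the element type of the converted view is opaque to the type checker. A minimal sketch of the round trip, under the same signatures these hunks use:

```mojo
from layout import TileTensor
from layout.tile_layout import TensorLayout

comptime dtype = DType.float32


def copy_first_element_sketch[
    LayoutType: TensorLayout
](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
):
    # Layout-generic TileTensors are indexed through a LayoutTensor view.
    var in_lt = input.to_layout_tensor()
    var out_lt = output.to_layout_tensor()
    # rebind pins the opaque element type back to a concrete Scalar.
    out_lt[0] = rebind[Scalar[dtype]](in_lt[0])
```
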
diff --git a/solutions/p18/op/softmax.mojo b/solutions/p18/op/softmax.mojo
index d9674e7a..eea5d67c 100644
--- a/solutions/p18/op/softmax.mojo
+++ b/solutions/p18/op/softmax.mojo
@@ -2,14 +2,17 @@ from std.memory import UnsafePointer
from std.gpu import thread_idx, block_idx, block_dim, barrier
from std.gpu.host import DeviceContext, HostBuffer, DeviceBuffer
from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from std.math import exp
from std.bit import log2_ceil
from std.utils.numerics import max_finite, min_finite
comptime SIZE = 128 # This must be equal to INPUT_SIZE in p18.py
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)
comptime GRID_DIM_X = 1
 # Tree-based reduction requires the number of threads to be the next power of two >= SIZE for correctness.
comptime BLOCK_DIM_X = 1 << log2_ceil(SIZE)
@@ -17,28 +20,21 @@ comptime BLOCK_DIM_X = 1 << log2_ceil(SIZE)
# ANCHOR: softmax_gpu_kernel_solution
def softmax_gpu_kernel[
- layout: Layout,
input_size: Int,
dtype: DType = DType.float32,
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
):
comptime assert (
dtype.is_floating_point()
), "dtype must be a floating-point type"
- var shared_max = LayoutTensor[
- dtype,
- Layout.row_major(BLOCK_DIM_X),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var shared_sum = LayoutTensor[
- dtype,
- Layout.row_major(BLOCK_DIM_X),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var shared_max = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[BLOCK_DIM_X]())
+ var shared_sum = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[BLOCK_DIM_X]())
var global_i = thread_idx.x
# Initialize out-of-bounds (shared_max[local_i], global_i >= input_size) shared memory addresses to the minimum
@@ -92,12 +88,11 @@ def softmax_gpu_kernel[
# ANCHOR: softmax_cpu_kernel_solution
def softmax_cpu_kernel[
- layout: Layout,
input_size: Int,
dtype: DType = DType.float32,
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
):
comptime assert (
dtype.is_floating_point()
@@ -131,32 +126,31 @@ struct SoftmaxCustomOp:
input_size: Int,
dtype: DType = DType.float32,
](
- output: OutputTensor[rank=1, static_spec=_],
- input: InputTensor[rank=output.rank, static_spec=_],
+ output: OutputTensor[dtype=dtype, rank=1, static_spec=_],
+ input: InputTensor[dtype=dtype, rank=output.rank, static_spec=_],
ctx: DeviceContextPtr,
) raises:
- # Note: rebind is necessary now but it shouldn't be!
- var output_tensor = rebind[LayoutTensor[dtype, layout, MutAnyOrigin]](
- output.to_layout_tensor()
- )
- var input_tensor = rebind[LayoutTensor[dtype, layout, ImmutAnyOrigin]](
- input.to_layout_tensor()
- )
+ var output_tensor = TileTensor[
+ mut=True, dtype, LayoutType, MutAnyOrigin
+ ](output.unsafe_ptr(), layout)
+ var input_tensor = TileTensor[
+ mut=True, dtype, LayoutType, MutAnyOrigin
+ ](input.unsafe_ptr(), layout)
comptime if target == "gpu":
var gpu_ctx = ctx.get_device_context()
# making sure the output tensor is zeroed out before the kernel is called
gpu_ctx.enqueue_memset(
- DeviceBuffer[output_tensor.dtype](
+ DeviceBuffer[dtype](
gpu_ctx,
- output_tensor.ptr,
+ output.unsafe_ptr(),
input_size,
owning=False,
),
0,
)
- comptime kernel = softmax_gpu_kernel[layout, input_size, dtype]
+ comptime kernel = softmax_gpu_kernel[input_size, dtype]
gpu_ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
@@ -165,8 +159,6 @@ struct SoftmaxCustomOp:
)
elif target == "cpu":
- softmax_cpu_kernel[layout, input_size, dtype](
- output_tensor, input_tensor
- )
+ softmax_cpu_kernel[input_size, dtype](output_tensor, input_tensor)
else:
raise Error("Unsupported target: " + target)
diff --git a/solutions/p18/test/test_softmax.mojo b/solutions/p18/test/test_softmax.mojo
index 70b25871..c506ae79 100644
--- a/solutions/p18/test/test_softmax.mojo
+++ b/solutions/p18/test/test_softmax.mojo
@@ -1,12 +1,14 @@
from std.gpu.host import DeviceContext
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
from std.testing import assert_almost_equal
from std.bit import log2_ceil
from op import softmax_gpu_kernel, softmax_cpu_kernel
comptime SIZE = 128
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)
comptime GRID_DIM_X = 1
comptime BLOCK_DIM_X = 1 << log2_ceil(SIZE)
comptime dtype = DType.float32
@@ -21,10 +23,11 @@ def test_softmax() raises:
# for CPU testing
var expected = ctx.enqueue_create_host_buffer[DType.float32](SIZE)
expected.enqueue_fill(0)
- var expected_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
- expected
- )
- # Initialize input with more reasonable values
+ var expected_tensor = TileTensor[
+ mut=True, dtype, LayoutType, MutAnyOrigin
+ ](expected, layout)
+
+        # Initialize the input and compute expected values (CPU) inside the map_to_host block
with inp.map_to_host() as inp_host:
for i in range(SIZE):
inp_host[i] = Scalar[dtype](i)
@@ -33,22 +36,21 @@ def test_softmax() raises:
for i in range(SIZE):
print(inp_host[i], end=" ")
print()
- # Create layout tensors for CPU calculation
- input_host_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
- inp_host
- )
+            # Create the tensor view for the CPU calculation (must stay inside the `with` block)
+ var input_host_tensor = TileTensor[
+ mut=True, dtype, LayoutType, MutAnyOrigin
+ ](inp_host, layout)
+ # Compute expected results using our CPU kernel while inp_host is valid
+ softmax_cpu_kernel[SIZE, dtype](expected_tensor, input_host_tensor)
# for GPU testing
- var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
- var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp)
-
- # Compute expected results using our CPU kernel
- softmax_cpu_kernel[layout, SIZE, dtype](
- expected_tensor, input_host_tensor
- )
+ var output_tensor = TileTensor(out, layout)
+ var input_tensor = TileTensor[
+ mut=True, dtype, LayoutType, MutAnyOrigin
+ ](inp, layout)
# Run GPU kernel
- comptime kernel = softmax_gpu_kernel[layout, SIZE, dtype]
+ comptime kernel = softmax_gpu_kernel[SIZE, dtype]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
diff --git a/solutions/p19/op/attention.mojo b/solutions/p19/op/attention.mojo
index 600b5ebe..66962e7f 100644
--- a/solutions/p19/op/attention.mojo
+++ b/solutions/p19/op/attention.mojo
@@ -1,9 +1,10 @@
from std.memory import UnsafePointer
from std.gpu import thread_idx, block_idx, block_dim, barrier
from std.gpu.host import DeviceContext, HostBuffer, DeviceBuffer
-from std.gpu.memory import AddressSpace, async_copy_wait_all
-from layout import Layout, LayoutTensor
-from layout.layout_tensor import copy_dram_to_sram_async
+from std.gpu.memory import AddressSpace
+from layout import TileTensor
+from layout.tile_layout import row_major, TensorLayout
+from layout.tile_tensor import stack_allocation
from std.math import exp
from std.bit import log2_ceil
from std.utils.numerics import max_finite, min_finite
@@ -22,24 +23,24 @@ comptime SOFTMAX_BLOCK_DIM_X = 1 << log2_ceil(SEQ_LEN)
# Tiled matrix multiplication (from p16), updated to:
-# 1) Support different layouts for input (a, b) and output LayoutTensors.
+# 1) Support different layouts for input (a, b) and output TileTensors.
# 2) Handle cases where the inner dimension is not a multiple of MATMUL_BLOCK_DIM_XY.
# 3) Explicitly check for out-of-bounds elements.
-# The approach still tiles all three LayoutTensors (a, b, and output) into identical square tiles
+# The approach still tiles all three TileTensors (a, b, and output) into identical square tiles
# of size (MATMUL_BLOCK_DIM_XY x MATMUL_BLOCK_DIM_XY) with each thread loading one element
# from a and b, and writing one element to output.
def matmul_idiomatic_tiled[
- a_layout: Layout,
- b_layout: Layout,
- out_layout: Layout,
rows: Int,
cols: Int,
inner: Int,
+ OutLayout: TensorLayout,
+ ALayout: TensorLayout,
+ BLayout: TensorLayout,
dtype: DType = DType.float32,
](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, a_layout, MutAnyOrigin],
- b: LayoutTensor[dtype, b_layout, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ a: TileTensor[mut=True, dtype, ALayout, MutAnyOrigin],
+ b: TileTensor[mut=True, dtype, BLayout, MutAnyOrigin],
):
"""Updated idiomatic tiled matrix multiplication from p16."""
var local_row = thread_idx.y
@@ -51,89 +52,84 @@ def matmul_idiomatic_tiled[
var out_tile = output.tile[MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY](
block_idx.y, block_idx.x
)
- var a_shared = LayoutTensor[
- dtype,
- Layout.row_major(MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var b_shared = LayoutTensor[
- dtype,
- Layout.row_major(MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var acc: output.element_type = 0
-
- comptime load_a_layout = Layout.row_major(
+ comptime shared_layout = row_major[
MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY
- ) # Coalesced loading
- comptime load_b_layout = Layout.row_major(
- MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY
- ) # Coalesced loading
+ ]()
+ var a_shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](shared_layout)
+ var b_shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](shared_layout)
+ var acc: output.ElementType = 0
+
+ var a_lt = a.to_layout_tensor()
+ var b_lt = b.to_layout_tensor()
+ var out_tile_lt = out_tile.to_layout_tensor()
+ var a_shared_lt = a_shared.to_layout_tensor()
+ var b_shared_lt = b_shared.to_layout_tensor()
comptime for idx in range(
(inner + MATMUL_BLOCK_DIM_XY - 1) // MATMUL_BLOCK_DIM_XY
):
# Get tiles from A and B matrices
- var a_tile = a.tile[MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY](
- block_idx.y, idx
- )
- var b_tile = b.tile[MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY](
- idx, block_idx.x
- )
+ var a_tile_row_start = block_idx.y * MATMUL_BLOCK_DIM_XY
+ var a_tile_col_start = idx * MATMUL_BLOCK_DIM_XY
+ var b_tile_row_start = idx * MATMUL_BLOCK_DIM_XY
+ var b_tile_col_start = block_idx.x * MATMUL_BLOCK_DIM_XY
+
+ # Synchronously load tiles to shared memory - each thread loads one element
+ var a_global_row = a_tile_row_start + local_row
+ var a_global_col = a_tile_col_start + local_col
+ if a_global_row < rows and a_global_col < inner:
+ a_shared_lt[local_row, local_col] = a_lt[a_global_row, a_global_col]
+ else:
+ a_shared_lt[local_row, local_col] = 0
+
+ var b_global_row = b_tile_row_start + local_row
+ var b_global_col = b_tile_col_start + local_col
+ if b_global_row < inner and b_global_col < cols:
+ b_shared_lt[local_row, local_col] = b_lt[b_global_row, b_global_col]
+ else:
+ b_shared_lt[local_row, local_col] = 0
- # Asynchronously copy tiles to shared memory with consistent orientation
- copy_dram_to_sram_async[
- thread_layout=load_a_layout,
- num_threads=MATMUL_NUM_THREADS,
- block_dim_count=MATMUL_BLOCK_DIM_COUNT,
- ](a_shared, a_tile)
- copy_dram_to_sram_async[
- thread_layout=load_b_layout,
- num_threads=MATMUL_NUM_THREADS,
- block_dim_count=MATMUL_BLOCK_DIM_COUNT,
- ](b_shared, b_tile)
-
- # Wait for all async copies to complete
- async_copy_wait_all()
barrier()
# Compute partial matrix multiplication for this tile
- comptime for k in range(MATMUL_BLOCK_DIM_XY):
- if (
- tiled_row < rows and tiled_col < cols
- ): # Only perform calculation for valid outputs
- if k < a_tile.dim(
- 1
- ): # Only perform calculation on valid inputs
- acc += a_shared[local_row, k] * b_shared[k, local_col]
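+ # Clamp the inner-loop bound so the final partial tile never indexes past inner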
+ comptime k_max = min(
+ MATMUL_BLOCK_DIM_XY, inner - idx * MATMUL_BLOCK_DIM_XY
+ )
+ comptime for k in range(k_max):
+ if tiled_row < rows and tiled_col < cols:
+ acc += rebind[Scalar[dtype]](
+ a_shared_lt[local_row, k]
+ ) * rebind[Scalar[dtype]](b_shared_lt[k, local_col])
barrier()
# Write final result with bounds checking (needed for attention's variable sizes)
if tiled_row < rows and tiled_col < cols:
- out_tile[local_row, local_col] = acc
+ out_tile_lt[local_row, local_col] = acc
# ANCHOR: transpose_kernel_solution
def transpose_kernel[
- layout_in: Layout, # Layout for input matrix (seq_len, d)
- layout_out: Layout, # Layout for output matrix (d, seq_len)
rows: Int,
cols: Int,
+ OutLayout: TensorLayout,
+ InLayout: TensorLayout,
dtype: DType = DType.float32,
](
- output: LayoutTensor[dtype, layout_out, MutAnyOrigin],
- inp: LayoutTensor[dtype, layout_in, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ inp: TileTensor[mut=True, dtype, InLayout, MutAnyOrigin],
):
"""Transpose matrix using shared memory tiling for coalesced access."""
- var shared_tile = LayoutTensor[
- dtype,
- Layout.row_major(TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ comptime shared_layout = row_major[
+ TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY
+ ]()
+ var shared_tile = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](shared_layout)
var local_row = thread_idx.y
var local_col = thread_idx.x
@@ -141,8 +137,12 @@ def transpose_kernel[
var global_row = block_idx.y * TRANSPOSE_BLOCK_DIM_XY + local_row
var global_col = block_idx.x * TRANSPOSE_BLOCK_DIM_XY + local_col
+ var inp_lt = inp.to_layout_tensor()
+ var output_lt = output.to_layout_tensor()
+ var shared_tile_lt = shared_tile.to_layout_tensor()
+
if global_row < rows and global_col < cols:
- shared_tile[local_row, local_col] = inp[global_row, global_col]
+ shared_tile_lt[local_row, local_col] = inp_lt[global_row, global_col]
barrier()
@@ -152,7 +152,7 @@ def transpose_kernel[
# Store data from shared memory to global memory (coalesced write)
# Note: we transpose the shared memory access pattern
if out_row < cols and out_col < rows:
- output[out_row, out_col] = shared_tile[local_col, local_row]
+ output_lt[out_row, out_col] = shared_tile_lt[local_col, local_row]
# ANCHOR_END: transpose_kernel_solution
@@ -160,36 +160,33 @@ def transpose_kernel[
# Apply softmax to attention scores taken from p16
def softmax_gpu_kernel[
- layout: Layout,
input_size: Int,
+ LayoutType: TensorLayout,
dtype: DType = DType.float32,
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
):
comptime assert (
dtype.is_floating_point()
), "dtype must be a floating-point type"
- var shared_max = LayoutTensor[
- dtype,
- Layout.row_major(SOFTMAX_BLOCK_DIM_X),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var shared_sum = LayoutTensor[
- dtype,
- Layout.row_major(SOFTMAX_BLOCK_DIM_X),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
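+ # One shared slot per thread for the block-wide max and sum reductions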
+ comptime softmax_layout = row_major[SOFTMAX_BLOCK_DIM_X]()
+ var shared_max = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](softmax_layout)
+ var shared_sum = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](softmax_layout)
var global_i = thread_idx.x
+ var input_lt = input.to_layout_tensor()
+ var output_lt = output.to_layout_tensor()
# Initialize out-of-bounds shared memory slots (shared_max[global_i] for global_i >= input_size) to the
# minimum finite value for dtype, so that if they are read during the parallel max reduction below they
# cannot influence the result (max(min_finite, x) == x for any finite x).
var val: Scalar[dtype] = min_finite[dtype]()
if global_i < input_size:
- val = rebind[Scalar[dtype]](input[global_i])
+ val = rebind[Scalar[dtype]](input_lt[global_i])
shared_max[global_i] = val
barrier()
@@ -227,25 +224,29 @@ def softmax_gpu_kernel[
# Normalize by sum
if global_i < input_size:
- output[global_i] = exp_val / block_sum
+ output_lt[global_i] = exp_val / block_sum
# CPU implementation for vector attention
def attention_cpu_kernel[
- layout_q: Layout,
- layout_k: Layout,
- layout_v: Layout,
- layout_out: Layout,
seq_len: Int,
d: Int,
+ OutLayout: TensorLayout,
+ QLayout: TensorLayout,
+ KLayout: TensorLayout,
+ VLayout: TensorLayout,
dtype: DType = DType.float32,
](
- output: LayoutTensor[dtype, layout_out, MutAnyOrigin],
- q: LayoutTensor[dtype, layout_q, MutAnyOrigin],
- k: LayoutTensor[dtype, layout_k, ImmutAnyOrigin],
- v: LayoutTensor[dtype, layout_v, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ q: TileTensor[mut=True, dtype, QLayout, MutAnyOrigin],
+ k: TileTensor[mut=True, dtype, KLayout, MutAnyOrigin],
+ v: TileTensor[mut=True, dtype, VLayout, MutAnyOrigin],
):
"""CPU implementation of vector attention."""
+ var output_lt = output.to_layout_tensor()
+ var q_lt = q.to_layout_tensor()
+ var k_lt = k.to_layout_tensor()
+ var v_lt = v.to_layout_tensor()
var scores = List[Float32]()
var weights = List[Float32]()
for _ in range(seq_len):
@@ -256,7 +257,9 @@ def attention_cpu_kernel[
for i in range(seq_len):
var score: Float32 = 0.0
for dim in range(d):
- score = score + rebind[Float32](q[dim]) * rebind[Float32](k[i, dim])
+ score = score + rebind[Float32](q_lt[dim]) * rebind[Float32](
+ k_lt[i, dim]
+ )
scores[i] = score
var max_score: Float32 = scores[0]
@@ -276,9 +279,9 @@ def attention_cpu_kernel[
var weighted_sum: Float32 = 0.0
for i in range(seq_len):
weighted_sum = weighted_sum + weights[i] * rebind[Float32](
- v[i, dim]
+ v_lt[i, dim]
)
- output[dim] = rebind[Scalar[dtype]](weighted_sum)
+ output_lt[dim] = rebind[Scalar[dtype]](weighted_sum)
@compiler.register("attention")
@@ -290,31 +293,42 @@ struct AttentionCustomOp:
d: Int,
dtype: DType = DType.float32,
](
- output: OutputTensor[rank=1, static_spec=_], # Output vector (d,)
- q: InputTensor[rank=1, static_spec=_], # Query vector (d,)
- k: InputTensor[rank=2, static_spec=_], # Key matrix (seq_len, d)
- v: InputTensor[rank=2, static_spec=_], # Value matrix (seq_len, d)
+ output: OutputTensor[
+ dtype=dtype, rank=1, static_spec=_
+ ], # Output vector (d,)
+ q: InputTensor[dtype=dtype, rank=1, static_spec=_], # Query vector (d,)
+ k: InputTensor[
+ dtype=dtype, rank=2, static_spec=_
+ ], # Key matrix (seq_len, d)
+ v: InputTensor[
+ dtype=dtype, rank=2, static_spec=_
+ ], # Value matrix (seq_len, d)
ctx: DeviceContextPtr,
) raises:
# Define layouts
- comptime layout_q = Layout.row_major(d)
- comptime layout_k = Layout.row_major(seq_len, d)
- comptime layout_v = Layout.row_major(seq_len, d)
- comptime layout_out = Layout.row_major(d)
- comptime layout_scores = Layout.row_major(seq_len)
+ comptime layout_q = row_major[d]()
+ comptime layout_k = row_major[seq_len, d]()
+ comptime layout_v = row_major[seq_len, d]()
+ comptime layout_out = row_major[d]()
+ comptime layout_scores = row_major[seq_len]()
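+ # Capture each layout value's concrete type so it can be passed as a kernel parameter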
+ comptime QLayout = type_of(layout_q)
+ comptime KLayout = type_of(layout_k)
+ comptime VLayout = type_of(layout_v)
+ comptime OutLayout = type_of(layout_out)
+ comptime ScoresLayout = type_of(layout_scores)
- # Convert to layout tensors
+ # Wrap the op buffers as TileTensors
- var output_tensor = rebind[
- LayoutTensor[dtype, layout_out, MutAnyOrigin]
- ](output.to_layout_tensor())
- var q_tensor = rebind[LayoutTensor[dtype, layout_q, MutAnyOrigin]](
- q.to_layout_tensor()
+ var output_tensor = TileTensor[
+ mut=True, dtype, OutLayout, MutAnyOrigin
+ ](output.unsafe_ptr(), layout_out)
+ var q_tensor = TileTensor[mut=True, dtype, QLayout, MutAnyOrigin](
+ q.unsafe_ptr(), layout_q
)
- var k_tensor = rebind[LayoutTensor[dtype, layout_k, ImmutAnyOrigin]](
- k.to_layout_tensor()
+ var k_tensor = TileTensor[mut=True, dtype, KLayout, MutAnyOrigin](
+ k.unsafe_ptr(), layout_k
)
- var v_tensor = rebind[LayoutTensor[dtype, layout_v, MutAnyOrigin]](
- v.to_layout_tensor()
+ var v_tensor = TileTensor[mut=True, dtype, VLayout, MutAnyOrigin](
+ v.unsafe_ptr(), layout_v
)
comptime if target == "gpu":
@@ -322,15 +336,20 @@ struct AttentionCustomOp:
# Define layouts for matrix multiplication
# Q reshaped to (1, d)
- comptime layout_q_2d = Layout.row_major(1, d)
+ comptime layout_q_2d = row_major[1, d]()
+ comptime Q2DLayout = type_of(layout_q_2d)
# K^T is (d, seq_len)
- comptime layout_k_t = Layout.row_major(d, seq_len)
+ comptime layout_k_t = row_major[d, seq_len]()
+ comptime KTLayout = type_of(layout_k_t)
# Scores as (1, seq_len)
- comptime layout_scores_2d = Layout.row_major(1, seq_len)
+ comptime layout_scores_2d = row_major[1, seq_len]()
+ comptime Scores2DLayout = type_of(layout_scores_2d)
# Weights as (1, seq_len)
- comptime layout_weights_2d = Layout.row_major(1, seq_len)
+ comptime layout_weights_2d = row_major[1, seq_len]()
+ comptime Weights2DLayout = type_of(layout_weights_2d)
# Result as (1, d)
- comptime layout_result_2d = Layout.row_major(1, d)
+ comptime layout_result_2d = row_major[1, d]()
+ comptime Result2DLayout = type_of(layout_result_2d)
# Transpose implementation limited to square (TRANSPOSE_BLOCK_DIM_XY x TRANSPOSE_BLOCK_DIM_XY) thread blocks
comptime transpose_threads_per_block = (
@@ -367,16 +386,16 @@ struct AttentionCustomOp:
seq_len
) # Reused for scores and weights
- var k_t = LayoutTensor[dtype, layout_k_t, MutAnyOrigin](k_t_buf)
+ var k_t = TileTensor(k_t_buf, layout_k_t)
# ANCHOR: attention_orchestration_solution
# Step 1: Reshape Q from (d,) to (1, d) - no buffer needed
- var q_2d = q_tensor.reshape[layout_q_2d]()
+ var q_2d = q_tensor.reshape(layout_q_2d)
# Step 2: Transpose K from (seq_len, d) to K^T (d, seq_len)
comptime kernel = transpose_kernel[
- layout_k, layout_k_t, seq_len, d, dtype
+ seq_len, d, KTLayout, KLayout, dtype
]
gpu_ctx.enqueue_function[kernel, kernel](
k_t,
@@ -388,16 +407,14 @@ struct AttentionCustomOp:
# Step 3: Compute attention scores using matmul: Q @ K^T = (1, d) @ (d, seq_len) -> (1, seq_len)
# This computes Q · K^T[i] = Q · K[i] for each column i of K^T (which is row i of K)
# Reuse scores_weights_buf as (1, seq_len) for scores
- var scores_2d = LayoutTensor[dtype, layout_scores_2d, MutAnyOrigin](
- scores_weights_buf
- )
+ var scores_2d = TileTensor(scores_weights_buf, layout_scores_2d)
comptime kernel2 = matmul_idiomatic_tiled[
- layout_q_2d,
- layout_k_t,
- layout_scores_2d,
1,
seq_len,
d,
+ Scores2DLayout,
+ Q2DLayout,
+ KTLayout,
dtype,
]
gpu_ctx.enqueue_function[kernel2, kernel2](
@@ -409,30 +426,38 @@ struct AttentionCustomOp:
)
# Step 4: Reshape scores from (1, seq_len) to (seq_len,) for softmax
- var weights = scores_2d.reshape[layout_scores]()
-
- # Step 5: Apply softmax to get attention weights
- comptime kernel3 = softmax_gpu_kernel[layout_scores, seq_len, dtype]
+ var weights = scores_2d.reshape(layout_scores)
+
+ # Step 5: Apply softmax to get attention weights (in-place)
+ comptime ScoresLayout = type_of(layout_scores)
+ comptime kernel3 = softmax_gpu_kernel[seq_len, ScoresLayout, dtype]
+ # Create two TileTensor views of the same buffer so the in-place softmax does not trigger an aliasing error
+ var weights_out = TileTensor[
+ mut=True, dtype, ScoresLayout, MutAnyOrigin
+ ](scores_weights_buf, layout_scores)
+ var weights_in = TileTensor[
+ mut=True, dtype, ScoresLayout, MutAnyOrigin
+ ](scores_weights_buf, layout_scores)
gpu_ctx.enqueue_function[kernel3, kernel3](
- weights,
- weights,
+ weights_out,
+ weights_in,
grid_dim=softmax_blocks_per_grid,
block_dim=softmax_threads,
)
# Step 6: Reshape weights from (seq_len,) to (1, seq_len) for final matmul
- var weights_2d = weights.reshape[layout_weights_2d]()
+ var weights_2d = weights.reshape(layout_weights_2d)
# Step 7: Compute final result using matmul: weights @ V = (1, seq_len) @ (seq_len, d) -> (1, d)
# Reuse out_tensor reshaped as (1, d) for result
- var result_2d = output_tensor.reshape[layout_result_2d]()
+ var result_2d = output_tensor.reshape(layout_result_2d)
comptime kernel4 = matmul_idiomatic_tiled[
- layout_weights_2d,
- layout_v,
- layout_result_2d,
1,
d,
seq_len,
+ Result2DLayout,
+ Weights2DLayout,
+ VLayout,
dtype,
]
gpu_ctx.enqueue_function[kernel4, kernel4](
@@ -447,7 +472,7 @@ struct AttentionCustomOp:
elif target == "cpu":
attention_cpu_kernel[
- layout_q, layout_k, layout_v, layout_out, seq_len, d, dtype
+ seq_len, d, OutLayout, QLayout, KLayout, VLayout, dtype
](output_tensor, q_tensor, k_tensor, v_tensor)
else:
diff --git a/solutions/p20/op/conv1d.mojo b/solutions/p20/op/conv1d.mojo
index eb9d86af..63ea3ee4 100644
--- a/solutions/p20/op/conv1d.mojo
+++ b/solutions/p20/op/conv1d.mojo
@@ -2,7 +2,10 @@
from std.gpu import thread_idx, block_idx, block_dim, barrier
from std.gpu.host import DeviceContext
from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major, TensorLayout
+from layout.tile_tensor import stack_allocation
+from std.utils import Index
from std.sys import argv
from std.testing import assert_equal
@@ -11,59 +14,56 @@ comptime BLOCKS_PER_GRID = (2, 1)
def conv1d_kernel[
- in_layout: Layout,
- out_layout: Layout,
- conv_layout: Layout,
input_size: Int,
conv_size: Int,
+ OutLayout: TensorLayout,
+ InLayout: TensorLayout,
+ ConvLayout: TensorLayout,
dtype: DType = DType.float32,
](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- input: LayoutTensor[dtype, in_layout, MutAnyOrigin],
- kernel: LayoutTensor[dtype, conv_layout, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ input: TileTensor[mut=True, dtype, InLayout, MutAnyOrigin],
+ kernel: TileTensor[mut=True, dtype, ConvLayout, MutAnyOrigin],
):
var global_i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
+ var input_lt = input.to_layout_tensor()
+ var kernel_lt = kernel.to_layout_tensor()
+ var output_lt = output.to_layout_tensor()
# first: need to account for padding
- var shared_a = LayoutTensor[
- dtype,
- Layout.row_major(TPB + conv_size - 1),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var shared_b = LayoutTensor[
- dtype,
- Layout.row_major(conv_size),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var shared_a = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB + conv_size - 1]())
+ var shared_b = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[conv_size]())
if global_i < input_size:
- shared_a[local_i] = input[global_i]
+ shared_a[local_i] = rebind[Scalar[dtype]](input_lt[global_i])
# second: load elements needed for convolution at block boundary
if local_i < conv_size - 1:
# indices from next block
var next_idx = global_i + TPB
if next_idx < input_size:
- shared_a[TPB + local_i] = input[next_idx]
+ shared_a[TPB + local_i] = rebind[Scalar[dtype]](input_lt[next_idx])
else:
# Initialize out-of-bounds elements to 0 to avoid reading from uninitialized memory
# which is undefined behavior
shared_a[TPB + local_i] = 0
if local_i < conv_size:
- shared_b[local_i] = kernel[local_i]
+ shared_b[local_i] = rebind[Scalar[dtype]](kernel_lt[local_i])
barrier()
if global_i < input_size:
- var local_sum: output.element_type = 0
+ var local_sum: Scalar[dtype] = 0
comptime for j in range(conv_size):
if local_i + j < TPB + conv_size - 1:
local_sum += shared_a[local_i + j] * shared_b[j]
- output[global_i] = local_sum
+ output_lt.store[1](Index(global_i), local_sum)
import compiler
@@ -89,12 +89,20 @@ struct Conv1DCustomOp:
# the context is needed for some GPU calls
ctx: DeviceContextPtr,
) raises:
- var out_tensor = output.to_layout_tensor()
- var input_tensor = input.to_layout_tensor()
- var kernel_tensor = kernel.to_layout_tensor()
- comptime in_layout = input_tensor.layout
- comptime out_layout = out_tensor.layout
- comptime conv_layout = kernel_tensor.layout
+ comptime out_layout_val = row_major[input_size]()
+ comptime OutLayout = type_of(out_layout_val)
+ comptime conv_layout_val = row_major[conv_size]()
+ comptime ConvLayout = type_of(conv_layout_val)
+
+ var out_tensor = TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin](
+ output.unsafe_ptr(), out_layout_val
+ )
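+ # Input and output share the (input_size,) shape, so the output layout is reused for the input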
+ var input_tensor = TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin](
+ input.unsafe_ptr(), out_layout_val
+ )
+ var kernel_tensor = TileTensor[
+ mut=True, dtype, ConvLayout, MutAnyOrigin
+ ](kernel.unsafe_ptr(), conv_layout_val)
comptime if target == "gpu":
var gpu_ctx = ctx.get_device_context()
@@ -102,7 +110,7 @@ struct Conv1DCustomOp:
gpu_ctx.enqueue_memset(
DeviceBuffer[output.dtype](
gpu_ctx,
- out_tensor.ptr,
+ output.unsafe_ptr(),
input_size,
owning=False,
),
@@ -110,7 +118,7 @@ struct Conv1DCustomOp:
)
# ANCHOR: conv1d_custom_op_solution
comptime kernel = conv1d_kernel[
- in_layout, out_layout, conv_layout, input_size, conv_size
+ input_size, conv_size, OutLayout, OutLayout, ConvLayout
]
gpu_ctx.enqueue_function[kernel, kernel](
out_tensor,
diff --git a/solutions/p21/op/embedding.mojo b/solutions/p21/op/embedding.mojo
index 10487336..6f8a47be 100644
--- a/solutions/p21/op/embedding.mojo
+++ b/solutions/p21/op/embedding.mojo
@@ -1,7 +1,8 @@
from std.math import ceildiv
from std.gpu import thread_idx, block_idx, block_dim, grid_dim, barrier
from std.gpu.host import DeviceContext
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major, TensorLayout
from std.sys import argv
from std.testing import assert_equal
@@ -10,18 +11,18 @@ comptime THREADS_PER_BLOCK = 256
# ANCHOR: embedding_kernel_coalesced_solution
def embedding_kernel_coalesced[
- indices_layout: Layout,
- weights_layout: Layout,
- out_layout: Layout,
batch_size: Int,
seq_len: Int,
vocab_size: Int,
embed_dim: Int,
+ OutLayout: TensorLayout,
+ IndicesLayout: TensorLayout,
+ WeightsLayout: TensorLayout,
dtype: DType = DType.float32,
](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- indices: LayoutTensor[DType.int32, indices_layout, MutAnyOrigin],
- weights: LayoutTensor[dtype, weights_layout, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ indices: TileTensor[mut=True, DType.int32, IndicesLayout, MutAnyOrigin],
+ weights: TileTensor[mut=True, dtype, WeightsLayout, MutAnyOrigin],
):
"""
Memory-coalescing focused embedding kernel.
@@ -39,6 +40,10 @@ def embedding_kernel_coalesced[
if global_idx >= total_elements:
return
+ var output_lt = output.to_layout_tensor()
+ var indices_lt = indices.to_layout_tensor()
+ var weights_lt = weights.to_layout_tensor()
+
# Convert to (batch, seq, embed) coordinates
var batch_idx = global_idx // (seq_len * embed_dim)
var remaining = global_idx % (seq_len * embed_dim)
@@ -46,15 +51,15 @@ def embedding_kernel_coalesced[
var embed_idx = remaining % embed_dim
# Get token index
- var token_idx_val = Int(indices[batch_idx, seq_idx])
+ var token_idx_val = Int(indices_lt[batch_idx, seq_idx])
# Simple, correct assignment
if token_idx_val >= 0 and token_idx_val < vocab_size:
- output[batch_idx, seq_idx, embed_idx] = weights[
+ output_lt[batch_idx, seq_idx, embed_idx] = weights_lt[
token_idx_val, embed_idx
]
else:
- output[batch_idx, seq_idx, embed_idx] = 0
+ output_lt[batch_idx, seq_idx, embed_idx] = 0
# ANCHOR_END: embedding_kernel_coalesced_solution
@@ -62,18 +67,18 @@ def embedding_kernel_coalesced[
# ANCHOR: embedding_kernel_2d_solution
def embedding_kernel_2d[
- indices_layout: Layout,
- weights_layout: Layout,
- out_layout: Layout,
batch_size: Int,
seq_len: Int,
vocab_size: Int,
embed_dim: Int,
+ OutLayout: TensorLayout,
+ IndicesLayout: TensorLayout,
+ WeightsLayout: TensorLayout,
dtype: DType = DType.float32,
](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- indices: LayoutTensor[DType.int32, indices_layout, MutAnyOrigin],
- weights: LayoutTensor[dtype, weights_layout, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ indices: TileTensor[mut=True, DType.int32, IndicesLayout, MutAnyOrigin],
+ weights: TileTensor[mut=True, dtype, WeightsLayout, MutAnyOrigin],
):
"""
2D grid non-coalesced embedding kernel.
@@ -94,20 +99,24 @@ def embedding_kernel_2d[
if batch_seq_idx >= total_positions or embed_idx >= embed_dim:
return
+ var output_lt = output.to_layout_tensor()
+ var indices_lt = indices.to_layout_tensor()
+ var weights_lt = weights.to_layout_tensor()
+
# Convert to (batch, seq) coordinates
var batch_idx = batch_seq_idx // seq_len
var seq_idx = batch_seq_idx % seq_len
# Get token index
- var token_idx_val = Int(indices[batch_idx, seq_idx])
+ var token_idx_val = Int(indices_lt[batch_idx, seq_idx])
# Assignment with 2D grid pattern
if token_idx_val >= 0 and token_idx_val < vocab_size:
- output[batch_idx, seq_idx, embed_idx] = weights[
+ output_lt[batch_idx, seq_idx, embed_idx] = weights_lt[
token_idx_val, embed_idx
]
else:
- output[batch_idx, seq_idx, embed_idx] = 0
+ output_lt[batch_idx, seq_idx, embed_idx] = 0
# ANCHOR_END: embedding_kernel_2d_solution
@@ -141,13 +150,22 @@ struct EmbeddingCustomOp:
], # [vocab_size, embed_dim]
ctx: DeviceContextPtr,
) raises:
- var output_tensor = output.to_layout_tensor()
- var indices_tensor = indices.to_layout_tensor()
- var weights_tensor = weights.to_layout_tensor()
-
- comptime indices_layout = indices_tensor.layout
- comptime weights_layout = weights_tensor.layout
- comptime out_layout = output_tensor.layout
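+ # Build layouts from the op's static shape parameters instead of deriving them from the tensors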
+ comptime out_layout_val = row_major[batch_size, seq_len, embed_dim]()
+ comptime OutLayout = type_of(out_layout_val)
+ comptime indices_layout_val = row_major[batch_size, seq_len]()
+ comptime IndicesLayout = type_of(indices_layout_val)
+ comptime weights_layout_val = row_major[vocab_size, embed_dim]()
+ comptime WeightsLayout = type_of(weights_layout_val)
+
+ var output_tensor = TileTensor[
+ mut=True, output.dtype, OutLayout, MutAnyOrigin
+ ](output.unsafe_ptr(), out_layout_val)
+ var indices_tensor = TileTensor[
+ mut=True, DType.int32, IndicesLayout, MutAnyOrigin
+ ](indices.unsafe_ptr(), indices_layout_val)
+ var weights_tensor = TileTensor[
+ mut=True, output.dtype, WeightsLayout, MutAnyOrigin
+ ](weights.unsafe_ptr(), weights_layout_val)
comptime if target == "gpu":
var gpu_ctx = ctx.get_device_context()
@@ -156,7 +174,7 @@ struct EmbeddingCustomOp:
gpu_ctx.enqueue_memset(
DeviceBuffer[output.dtype](
gpu_ctx,
- output_tensor.ptr,
+ output.unsafe_ptr(),
batch_size * seq_len * embed_dim,
owning=False,
),
@@ -169,13 +187,13 @@ struct EmbeddingCustomOp:
# Compile and launch optimized kernel
comptime kernel = embedding_kernel_coalesced[
- indices_layout,
- weights_layout,
- out_layout,
batch_size,
seq_len,
vocab_size,
embed_dim,
+ OutLayout,
+ IndicesLayout,
+ WeightsLayout,
output.dtype,
]
var compiled_kernel = gpu_ctx.compile_function[kernel, kernel]()
@@ -227,13 +245,22 @@ struct Embedding2DCustomOp:
], # [vocab_size, embed_dim]
ctx: DeviceContextPtr,
) raises:
- var output_tensor = output.to_layout_tensor()
- var indices_tensor = indices.to_layout_tensor()
- var weights_tensor = weights.to_layout_tensor()
-
- comptime indices_layout = indices_tensor.layout
- comptime weights_layout = weights_tensor.layout
- comptime out_layout = output_tensor.layout
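+ # Build layouts from the op's static shape parameters instead of deriving them from the tensors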
+ comptime out_layout_val = row_major[batch_size, seq_len, embed_dim]()
+ comptime OutLayout = type_of(out_layout_val)
+ comptime indices_layout_val = row_major[batch_size, seq_len]()
+ comptime IndicesLayout = type_of(indices_layout_val)
+ comptime weights_layout_val = row_major[vocab_size, embed_dim]()
+ comptime WeightsLayout = type_of(weights_layout_val)
+
+ var output_tensor = TileTensor[
+ mut=True, output.dtype, OutLayout, MutAnyOrigin
+ ](output.unsafe_ptr(), out_layout_val)
+ var indices_tensor = TileTensor[
+ mut=True, DType.int32, IndicesLayout, MutAnyOrigin
+ ](indices.unsafe_ptr(), indices_layout_val)
+ var weights_tensor = TileTensor[
+ mut=True, output.dtype, WeightsLayout, MutAnyOrigin
+ ](weights.unsafe_ptr(), weights_layout_val)
comptime if target == "gpu":
var gpu_ctx = ctx.get_device_context()
@@ -242,7 +269,7 @@ struct Embedding2DCustomOp:
gpu_ctx.enqueue_memset(
DeviceBuffer[output.dtype](
gpu_ctx,
- output_tensor.ptr,
+ output.unsafe_ptr(),
batch_size * seq_len * embed_dim,
owning=False,
),
@@ -258,13 +285,13 @@ struct Embedding2DCustomOp:
# Compile and launch 2D kernel
comptime kernel = embedding_kernel_2d[
- indices_layout,
- weights_layout,
- out_layout,
batch_size,
seq_len,
vocab_size,
embed_dim,
+ OutLayout,
+ IndicesLayout,
+ WeightsLayout,
output.dtype,
]
diff --git a/solutions/p22/op/layernorm_linear.mojo b/solutions/p22/op/layernorm_linear.mojo
index 3e5a4153..fe9fea3e 100644
--- a/solutions/p22/op/layernorm_linear.mojo
+++ b/solutions/p22/op/layernorm_linear.mojo
@@ -1,9 +1,10 @@
from std.math import sqrt
from std.gpu import thread_idx, block_idx, block_dim, barrier
-from std.gpu.memory import async_copy_wait_all, AddressSpace
+from std.gpu.memory import AddressSpace
from std.os.atomic import Atomic
-from layout import Layout, LayoutTensor
-from layout.layout_tensor import copy_dram_to_sram_async
+from layout import TileTensor
+from layout.tile_layout import row_major, TensorLayout
+from layout.tile_tensor import stack_allocation
import compiler
from std.runtime.asyncrt import DeviceContextPtr
from tensor import InputTensor, OutputTensor
@@ -18,17 +19,17 @@ comptime TRANSPOSE_BLOCK_DIM_XY = 16 # Square blocks for input and output
# ANCHOR: matmul_idiomatic_tiled
# Idiomatic tiled matmul from p19.mojo
def matmul_idiomatic_tiled[
- a_layout: Layout,
- b_layout: Layout,
- out_layout: Layout,
rows: Int,
cols: Int,
inner: Int,
+ OutLayout: TensorLayout,
+ ALayout: TensorLayout,
+ BLayout: TensorLayout,
dtype: DType = DType.float32,
](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, a_layout, MutAnyOrigin],
- b: LayoutTensor[dtype, b_layout, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ a: TileTensor[mut=True, dtype, ALayout, MutAnyOrigin],
+ b: TileTensor[mut=True, dtype, BLayout, MutAnyOrigin],
):
"""Idiomatic tiled matrix multiplication from p19."""
var local_row = thread_idx.y
@@ -40,69 +41,63 @@ def matmul_idiomatic_tiled[
var out_tile = output.tile[MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY](
block_idx.y, block_idx.x
)
- var a_shared = LayoutTensor[
- dtype,
- Layout.row_major(MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var b_shared = LayoutTensor[
- dtype,
- Layout.row_major(MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var acc: output.element_type = 0
-
- comptime load_a_layout = Layout.row_major(
+ comptime shared_layout = row_major[
MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY
- ) # Coalesced loading
- comptime load_b_layout = Layout.row_major(
- MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY
- ) # Coalesced loading
+ ]()
+ var a_shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](shared_layout)
+ var b_shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](shared_layout)
+ var acc: output.ElementType = 0
+
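+ # Convert each TileTensor to a LayoutTensor view for element-wise indexing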
+ var a_lt = a.to_layout_tensor()
+ var b_lt = b.to_layout_tensor()
+ var out_tile_lt = out_tile.to_layout_tensor()
+ var a_shared_lt = a_shared.to_layout_tensor()
+ var b_shared_lt = b_shared.to_layout_tensor()
comptime for idx in range(
(inner + MATMUL_BLOCK_DIM_XY - 1) // MATMUL_BLOCK_DIM_XY
):
- # Get tiles from A and B matrices
- var a_tile = a.tile[MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY](
- block_idx.y, idx
- )
- var b_tile = b.tile[MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY](
- idx, block_idx.x
- )
+ # Synchronously load tiles to shared memory - each thread loads one element
+ var a_tile_row_start = block_idx.y * MATMUL_BLOCK_DIM_XY
+ var a_tile_col_start = idx * MATMUL_BLOCK_DIM_XY
+ var b_tile_row_start = idx * MATMUL_BLOCK_DIM_XY
+ var b_tile_col_start = block_idx.x * MATMUL_BLOCK_DIM_XY
+
+ var a_global_row = a_tile_row_start + local_row
+ var a_global_col = a_tile_col_start + local_col
+ if a_global_row < rows and a_global_col < inner:
+ a_shared_lt[local_row, local_col] = a_lt[a_global_row, a_global_col]
+ else:
+ a_shared_lt[local_row, local_col] = 0
+
+ var b_global_row = b_tile_row_start + local_row
+ var b_global_col = b_tile_col_start + local_col
+ if b_global_row < inner and b_global_col < cols:
+ b_shared_lt[local_row, local_col] = b_lt[b_global_row, b_global_col]
+ else:
+ b_shared_lt[local_row, local_col] = 0
- # Asynchronously copy tiles to shared memory with consistent orientation
- copy_dram_to_sram_async[
- thread_layout=load_a_layout,
- num_threads=MATMUL_NUM_THREADS,
- block_dim_count=MATMUL_BLOCK_DIM_COUNT,
- ](a_shared, a_tile)
- copy_dram_to_sram_async[
- thread_layout=load_b_layout,
- num_threads=MATMUL_NUM_THREADS,
- block_dim_count=MATMUL_BLOCK_DIM_COUNT,
- ](b_shared, b_tile)
-
- # Wait for all async copies to complete
- async_copy_wait_all()
barrier()
# Compute partial matrix multiplication for this tile
- comptime for k in range(MATMUL_BLOCK_DIM_XY):
- if (
- tiled_row < rows and tiled_col < cols
- ): # Only perform calculation for valid outputs
- if k < a_tile.dim(
- 1
- ): # Only perform calculation on valid inputs
- acc += a_shared[local_row, k] * b_shared[k, local_col]
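+ # Clamp the inner-loop bound so the final partial tile never indexes past inner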
+ comptime k_max = min(
+ MATMUL_BLOCK_DIM_XY, inner - idx * MATMUL_BLOCK_DIM_XY
+ )
+ comptime for k in range(k_max):
+ if tiled_row < rows and tiled_col < cols:
+ acc += rebind[Scalar[dtype]](
+ a_shared_lt[local_row, k]
+ ) * rebind[Scalar[dtype]](b_shared_lt[k, local_col])
barrier()
# Write final result with bounds checking (needed for variable matrix sizes)
if tiled_row < rows and tiled_col < cols:
- out_tile[local_row, local_col] = acc
+ out_tile_lt[local_row, local_col] = acc
# ANCHOR_END: matmul_idiomatic_tiled
@@ -110,18 +105,18 @@ def matmul_idiomatic_tiled[
# ANCHOR: layernorm_kernel_solution
def layernorm_kernel[
- input_layout: Layout,
- ln_params_layout: Layout,
- output_layout: Layout,
batch_size: Int,
seq_len: Int,
hidden_dim: Int,
+ OutputLayout: TensorLayout,
+ InputLayout: TensorLayout,
+ LnParamsLayout: TensorLayout,
dtype: DType = DType.float32,
](
- output: LayoutTensor[dtype, output_layout, MutAnyOrigin],
- input: LayoutTensor[dtype, input_layout, ImmutAnyOrigin],
- ln_weight: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
- ln_bias: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutputLayout, MutAnyOrigin],
+ input: TileTensor[mut=True, dtype, InputLayout, MutAnyOrigin],
+ ln_weight: TileTensor[mut=True, dtype, LnParamsLayout, MutAnyOrigin],
+ ln_bias: TileTensor[mut=True, dtype, LnParamsLayout, MutAnyOrigin],
):
var batch_idx = block_idx.x
var seq_idx = block_idx.y
@@ -134,12 +129,17 @@ def layernorm_kernel[
):
return
+ var output_lt = output.to_layout_tensor()
+ var input_lt = input.to_layout_tensor()
+ var ln_weight_lt = ln_weight.to_layout_tensor()
+ var ln_bias_lt = ln_bias.to_layout_tensor()
+
# Compute statistics for this sequence position (redundant but simple)
var sum_val: Scalar[dtype] = 0
var sq_sum: Scalar[dtype] = 0
comptime for h in range(hidden_dim):
- var val = input[batch_idx, seq_idx, h]
+ var val = input_lt[batch_idx, seq_idx, h]
sum_val += rebind[Scalar[dtype]](val)
sq_sum += rebind[Scalar[dtype]](val * val)
@@ -148,11 +148,11 @@ def layernorm_kernel[
var inv_std = 1.0 / sqrt(var_val + 1e-5)
# Apply LayerNorm to this element
- var input_val = input[batch_idx, seq_idx, hidden_idx]
+ var input_val = input_lt[batch_idx, seq_idx, hidden_idx]
var normalized = (input_val - mean_val) * inv_std * rebind[Scalar[dtype]](
- ln_weight[hidden_idx]
- ) + rebind[Scalar[dtype]](ln_bias[hidden_idx])
- output[batch_idx, seq_idx, hidden_idx] = normalized
+ ln_weight_lt[hidden_idx]
+ ) + rebind[Scalar[dtype]](ln_bias_lt[hidden_idx])
+ output_lt[batch_idx, seq_idx, hidden_idx] = normalized
# ANCHOR_END: layernorm_kernel_solution
@@ -160,33 +160,37 @@ def layernorm_kernel[
# ANCHOR: transpose_kernel_solution
def transpose_kernel[
- layout_in: Layout,
- layout_out: Layout,
rows: Int,
cols: Int,
+ OutLayout: TensorLayout,
+ InLayout: TensorLayout,
dtype: DType = DType.float32,
](
- output: LayoutTensor[dtype, layout_out, MutAnyOrigin],
- inp: LayoutTensor[dtype, layout_in, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ inp: TileTensor[mut=True, dtype, InLayout, MutAnyOrigin],
):
"""Transpose matrix using shared memory tiling for coalesced access.
We will learn more about coalesced access in the next part.
"""
- var shared_tile = LayoutTensor[
- dtype,
- Layout.row_major(TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ comptime shared_layout = row_major[
+ TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY
+ ]()
+ var shared_tile = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](shared_layout)
var local_row = thread_idx.y
var local_col = thread_idx.x
+ var inp_lt = inp.to_layout_tensor()
+ var output_lt = output.to_layout_tensor()
+ var shared_tile_lt = shared_tile.to_layout_tensor()
+
var global_row = block_idx.y * TRANSPOSE_BLOCK_DIM_XY + local_row
var global_col = block_idx.x * TRANSPOSE_BLOCK_DIM_XY + local_col
if global_row < rows and global_col < cols:
- shared_tile[local_row, local_col] = inp[global_row, global_col]
+ shared_tile_lt[local_row, local_col] = inp_lt[global_row, global_col]
barrier()
@@ -196,7 +200,7 @@ def transpose_kernel[
# Store data from shared memory to global memory (coalesced write)
# Note: we transpose the shared memory access pattern
if out_row < cols and out_col < rows:
- output[out_row, out_col] = shared_tile[local_col, local_row]
+ output_lt[out_row, out_col] = shared_tile_lt[local_col, local_row]
# ANCHOR_END: transpose_kernel_solution
@@ -204,17 +208,17 @@ def transpose_kernel[
# ANCHOR: add_bias_kernel
def add_bias_kernel[
- input_layout: Layout,
- bias_layout: Layout,
- output_layout: Layout,
batch_size: Int,
seq_len: Int,
output_dim: Int,
+ OutputLayout: TensorLayout,
+ InputLayout: TensorLayout,
+ BiasLayout: TensorLayout,
dtype: DType = DType.float32,
](
- output: LayoutTensor[dtype, output_layout, MutAnyOrigin],
- input: LayoutTensor[dtype, input_layout, MutAnyOrigin],
- bias: LayoutTensor[dtype, bias_layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutputLayout, MutAnyOrigin],
+ input: TileTensor[mut=True, dtype, InputLayout, MutAnyOrigin],
+ bias: TileTensor[mut=True, dtype, BiasLayout, MutAnyOrigin],
):
"""Simple bias addition."""
var batch_idx = block_idx.x
@@ -224,9 +228,13 @@ def add_bias_kernel[
if batch_idx >= batch_size or seq_idx >= seq_len or out_idx >= output_dim:
return
- output[batch_idx, seq_idx, out_idx] = input[
+ var output_lt = output.to_layout_tensor()
+ var input_lt = input.to_layout_tensor()
+ var bias_lt = bias.to_layout_tensor()
+
+ output_lt[batch_idx, seq_idx, out_idx] = input_lt[
batch_idx, seq_idx, out_idx
- ] + rebind[Scalar[dtype]](bias[out_idx])
+ ] + rebind[Scalar[dtype]](bias_lt[out_idx])
# ANCHOR_END: add_bias_kernel
@@ -234,23 +242,23 @@ def add_bias_kernel[
# ANCHOR: minimal_fused_forward_kernel_solution
def minimal_fused_kernel[
- input_layout: Layout,
- ln_params_layout: Layout,
- weight_layout: Layout,
- bias_layout: Layout,
- output_layout: Layout,
batch_size: Int,
seq_len: Int,
hidden_dim: Int,
output_dim: Int,
+ OutputLayout: TensorLayout,
+ InputLayout: TensorLayout,
+ LnParamsLayout: TensorLayout,
+ WeightLayout: TensorLayout,
+ BiasLayout: TensorLayout,
dtype: DType = DType.float32,
](
- output: LayoutTensor[dtype, output_layout, MutAnyOrigin],
- input: LayoutTensor[dtype, input_layout, ImmutAnyOrigin],
- ln_weight: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
- ln_bias: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
- linear_weight: LayoutTensor[dtype, weight_layout, ImmutAnyOrigin],
- linear_bias: LayoutTensor[dtype, bias_layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutputLayout, MutAnyOrigin],
+ input: TileTensor[mut=True, dtype, InputLayout, MutAnyOrigin],
+ ln_weight: TileTensor[mut=True, dtype, LnParamsLayout, MutAnyOrigin],
+ ln_bias: TileTensor[mut=True, dtype, LnParamsLayout, MutAnyOrigin],
+ linear_weight: TileTensor[mut=True, dtype, WeightLayout, MutAnyOrigin],
+ linear_bias: TileTensor[mut=True, dtype, BiasLayout, MutAnyOrigin],
):
"""Minimal fused kernel - one thread per sequence position to avoid redundancy.
"""
@@ -262,12 +270,19 @@ def minimal_fused_kernel[
if batch_idx >= batch_size or seq_idx >= seq_len:
return
+ var output_lt = output.to_layout_tensor()
+ var input_lt = input.to_layout_tensor()
+ var ln_weight_lt = ln_weight.to_layout_tensor()
+ var ln_bias_lt = ln_bias.to_layout_tensor()
+ var linear_weight_lt = linear_weight.to_layout_tensor()
+ var linear_bias_lt = linear_bias.to_layout_tensor()
+
# Step 1: Compute LayerNorm statistics once per sequence position
var sum_val: Scalar[dtype] = 0
var sq_sum: Scalar[dtype] = 0
comptime for h in range(hidden_dim):
- var val = input[batch_idx, seq_idx, h]
+ var val = input_lt[batch_idx, seq_idx, h]
sum_val += rebind[Scalar[dtype]](val)
sq_sum += rebind[Scalar[dtype]](val * val)
@@ -280,14 +295,16 @@ def minimal_fused_kernel[
var acc: Scalar[dtype] = 0
comptime for h in range(hidden_dim):
- var input_val = input[batch_idx, seq_idx, h]
+ var input_val = input_lt[batch_idx, seq_idx, h]
var normalized = (input_val - mean_val) * inv_std * rebind[
Scalar[dtype]
- ](ln_weight[h]) + rebind[Scalar[dtype]](ln_bias[h])
- acc += rebind[Scalar[dtype]](normalized * linear_weight[out_idx, h])
+ ](ln_weight_lt[h]) + rebind[Scalar[dtype]](ln_bias_lt[h])
+ acc += rebind[Scalar[dtype]](
+ normalized * linear_weight_lt[out_idx, h]
+ )
- output[batch_idx, seq_idx, out_idx] = acc + rebind[Scalar[dtype]](
- linear_bias[out_idx]
+ output_lt[batch_idx, seq_idx, out_idx] = acc + rebind[Scalar[dtype]](
+ linear_bias_lt[out_idx]
)
@@ -296,31 +313,33 @@ def minimal_fused_kernel[
# ANCHOR: minimal_fused_backward_kernel_solution
def minimal_fused_kernel_backward[
- grad_output_layout: Layout,
- input_layout: Layout,
- ln_params_layout: Layout,
- weight_layout: Layout,
- grad_input_layout: Layout,
- grad_ln_weight_layout: Layout,
- grad_ln_bias_layout: Layout,
- grad_weight_layout: Layout,
- grad_bias_layout: Layout,
batch_size: Int,
seq_len: Int,
hidden_dim: Int,
output_dim: Int,
+ GradInputLayout: TensorLayout,
+ GradLnWeightLayout: TensorLayout,
+ GradLnBiasLayout: TensorLayout,
+ GradWeightLayout: TensorLayout,
+ GradBiasLayout: TensorLayout,
+ GradOutputLayout: TensorLayout,
+ InputLayout: TensorLayout,
+ LnParamsLayout: TensorLayout,
+ WeightLayout: TensorLayout,
dtype: DType = DType.float32,
](
- grad_input: LayoutTensor[dtype, grad_input_layout, MutAnyOrigin],
- grad_ln_weight: LayoutTensor[dtype, grad_ln_weight_layout, MutAnyOrigin],
- grad_ln_bias: LayoutTensor[dtype, grad_ln_bias_layout, MutAnyOrigin],
- grad_weight: LayoutTensor[dtype, grad_weight_layout, MutAnyOrigin],
- grad_bias: LayoutTensor[dtype, grad_bias_layout, MutAnyOrigin],
- grad_output: LayoutTensor[dtype, grad_output_layout, ImmutAnyOrigin],
- input: LayoutTensor[dtype, input_layout, ImmutAnyOrigin],
- ln_weight: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
- ln_bias: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
- linear_weight: LayoutTensor[dtype, weight_layout, ImmutAnyOrigin],
+ grad_input: TileTensor[mut=True, dtype, GradInputLayout, MutAnyOrigin],
+ grad_ln_weight: TileTensor[
+ mut=True, dtype, GradLnWeightLayout, MutAnyOrigin
+ ],
+ grad_ln_bias: TileTensor[mut=True, dtype, GradLnBiasLayout, MutAnyOrigin],
+ grad_weight: TileTensor[mut=True, dtype, GradWeightLayout, MutAnyOrigin],
+ grad_bias: TileTensor[mut=True, dtype, GradBiasLayout, MutAnyOrigin],
+ grad_output: TileTensor[mut=True, dtype, GradOutputLayout, MutAnyOrigin],
+ input: TileTensor[mut=True, dtype, InputLayout, MutAnyOrigin],
+ ln_weight: TileTensor[mut=True, dtype, LnParamsLayout, MutAnyOrigin],
+ ln_bias: TileTensor[mut=True, dtype, LnParamsLayout, MutAnyOrigin],
+ linear_weight: TileTensor[mut=True, dtype, WeightLayout, MutAnyOrigin],
):
"""Fused backward kernel using atomic operations for safe gradient accumulation.
"""
@@ -332,6 +351,17 @@ def minimal_fused_kernel_backward[
if batch_idx >= batch_size or seq_idx >= seq_len:
return
+ var grad_input_lt = grad_input.to_layout_tensor()
+ var grad_ln_weight_lt = grad_ln_weight.to_layout_tensor()
+ var grad_ln_bias_lt = grad_ln_bias.to_layout_tensor()
+ var grad_weight_lt = grad_weight.to_layout_tensor()
+ var grad_bias_lt = grad_bias.to_layout_tensor()
+ var grad_output_lt = grad_output.to_layout_tensor()
+ var input_lt = input.to_layout_tensor()
+ var ln_weight_lt = ln_weight.to_layout_tensor()
+ var ln_bias_lt = ln_bias.to_layout_tensor()
+ var linear_weight_lt = linear_weight.to_layout_tensor()
+
# Initialize gradient tensors to zero (block 0,0 only to avoid UB with atomic ops)
if batch_idx == 0 and seq_idx == 0:
# Initialize grad_ln_weight and grad_ln_bias
@@ -356,7 +386,7 @@ def minimal_fused_kernel_backward[
var sq_sum: Scalar[dtype] = 0
comptime for h in range(hidden_dim):
- var val = input[batch_idx, seq_idx, h]
+ var val = input_lt[batch_idx, seq_idx, h]
sum_val += rebind[Scalar[dtype]](val)
sq_sum += rebind[Scalar[dtype]](val * val)
@@ -369,28 +399,28 @@ def minimal_fused_kernel_backward[
var grad_bias_ptr = grad_bias.ptr + out_idx
_ = Atomic[dtype].fetch_add(
grad_bias_ptr,
- rebind[Scalar[dtype]](grad_output[batch_idx, seq_idx, out_idx]),
+ rebind[Scalar[dtype]](grad_output_lt[batch_idx, seq_idx, out_idx]),
)
# Step 3: Atomically accumulate gradients w.r.t. linear weight
comptime for out_idx in range(output_dim):
comptime for h in range(hidden_dim):
- var input_val = input[batch_idx, seq_idx, h]
+ var input_val = input_lt[batch_idx, seq_idx, h]
var normalized = (input_val - mean_val) * inv_std
var ln_output_val = normalized * rebind[Scalar[dtype]](
- ln_weight[h]
- ) + rebind[Scalar[dtype]](ln_bias[h])
+ ln_weight_lt[h]
+ ) + rebind[Scalar[dtype]](ln_bias_lt[h])
# Atomic gradient accumulation for linear weight
var grad_w = (
- grad_output[batch_idx, seq_idx, out_idx] * ln_output_val
+ grad_output_lt[batch_idx, seq_idx, out_idx] * ln_output_val
)
var grad_weight_ptr = grad_weight.ptr + out_idx * hidden_dim + h
_ = Atomic.fetch_add(grad_weight_ptr, rebind[Scalar[dtype]](grad_w))
# Step 4: Atomically accumulate gradients w.r.t. LayerNorm parameters
comptime for h in range(hidden_dim):
- input_val = input[batch_idx, seq_idx, h]
+ input_val = input_lt[batch_idx, seq_idx, h]
normalized = (input_val - mean_val) * inv_std
# Compute gradient w.r.t. LayerNorm output for this h
@@ -398,8 +428,8 @@ def minimal_fused_kernel_backward[
comptime for out_idx in range(output_dim):
grad_ln_out = grad_ln_out + rebind[Scalar[dtype]](
- grad_output[batch_idx, seq_idx, out_idx]
- * linear_weight[out_idx, h]
+ grad_output_lt[batch_idx, seq_idx, out_idx]
+ * linear_weight_lt[out_idx, h]
)
# Atomic accumulation of LayerNorm parameter gradients
@@ -418,18 +448,18 @@ def minimal_fused_kernel_backward[
var sum_grad_normalized_times_normalized: Scalar[dtype] = 0
comptime for h in range(hidden_dim):
- h_input_val = input[batch_idx, seq_idx, h]
+ h_input_val = input_lt[batch_idx, seq_idx, h]
h_normalized = (h_input_val - mean_val) * inv_std
var h_grad_ln_out: Scalar[dtype] = 0
comptime for out_idx in range(output_dim):
h_grad_ln_out = h_grad_ln_out + rebind[Scalar[dtype]](
- grad_output[batch_idx, seq_idx, out_idx]
- * linear_weight[out_idx, h]
+ grad_output_lt[batch_idx, seq_idx, out_idx]
+ * linear_weight_lt[out_idx, h]
)
- h_grad_norm = h_grad_ln_out * rebind[Scalar[dtype]](ln_weight[h])
+ h_grad_norm = h_grad_ln_out * rebind[Scalar[dtype]](ln_weight_lt[h])
sum_grad_normalized = sum_grad_normalized + rebind[Scalar[dtype]](
h_grad_norm
)
@@ -440,19 +470,19 @@ def minimal_fused_kernel_backward[
# Compute actual input gradients (no race conditions here - each thread writes to different positions)
comptime for h in range(hidden_dim):
- h_input_val = input[batch_idx, seq_idx, h]
+ h_input_val = input_lt[batch_idx, seq_idx, h]
h_normalized = (h_input_val - mean_val) * inv_std
var h_grad_ln_out: Scalar[dtype] = 0
comptime for out_idx in range(output_dim):
h_grad_ln_out = h_grad_ln_out + rebind[Scalar[dtype]](
- grad_output[batch_idx, seq_idx, out_idx]
- * linear_weight[out_idx, h]
+ grad_output_lt[batch_idx, seq_idx, out_idx]
+ * linear_weight_lt[out_idx, h]
)
- h_grad_norm = h_grad_ln_out * rebind[Scalar[dtype]](ln_weight[h])
- grad_input[batch_idx, seq_idx, h] = inv_std * (
+ h_grad_norm = h_grad_ln_out * rebind[Scalar[dtype]](ln_weight_lt[h])
+ grad_input_lt[batch_idx, seq_idx, h] = inv_std * (
h_grad_norm
- (sum_grad_normalized / hidden_dim)
- (h_normalized * sum_grad_normalized_times_normalized / hidden_dim)
@@ -482,31 +512,37 @@ struct LayerNormLinearCustomOp:
linear_bias: InputTensor[dtype=dtype, rank=1, static_spec=_],
ctx: DeviceContextPtr,
) raises:
- comptime input_layout = input.static_spec.to_layout()
- comptime ln_params_layout = ln_weight.static_spec.to_layout()
- comptime weight_layout = linear_weight.static_spec.to_layout()
- comptime bias_layout = linear_bias.static_spec.to_layout()
- comptime output_layout = output.static_spec.to_layout()
-
- # Note: rebind is necessary now but it shouldn't be!
- var output_tensor = rebind[
- LayoutTensor[dtype, output_layout, MutAnyOrigin]
- ](output.to_layout_tensor())
- var input_tensor = rebind[
- LayoutTensor[dtype, input_layout, ImmutAnyOrigin]
- ](input.to_layout_tensor())
- var ln_weight_tensor = rebind[
- LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin]
- ](ln_weight.to_layout_tensor())
- var ln_bias_tensor = rebind[
- LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin]
- ](ln_bias.to_layout_tensor())
- var linear_weight_tensor = rebind[
- LayoutTensor[dtype, weight_layout, ImmutAnyOrigin]
- ](linear_weight.to_layout_tensor())
- var linear_bias_tensor = rebind[
- LayoutTensor[dtype, bias_layout, ImmutAnyOrigin]
- ](linear_bias.to_layout_tensor())
+ comptime input_layout_val = row_major[batch_size, seq_len, hidden_dim]()
+ comptime ln_params_layout_val = row_major[hidden_dim]()
+ comptime weight_layout_val = row_major[output_dim, hidden_dim]()
+ comptime bias_layout_val = row_major[output_dim]()
+ comptime output_layout_val = row_major[
+ batch_size, seq_len, output_dim
+ ]()
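+ # Capture each layout value's concrete type for use as a kernel parameter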
+ comptime InputLayout = type_of(input_layout_val)
+ comptime LnParamsLayout = type_of(ln_params_layout_val)
+ comptime WeightLayout = type_of(weight_layout_val)
+ comptime BiasLayout = type_of(bias_layout_val)
+ comptime OutputLayout = type_of(output_layout_val)
+
+ var output_tensor = TileTensor[
+ mut=True, dtype, OutputLayout, MutAnyOrigin
+ ](output.unsafe_ptr(), output_layout_val)
+ var input_tensor = TileTensor[
+ mut=True, dtype, InputLayout, MutAnyOrigin
+ ](input.unsafe_ptr(), input_layout_val)
+ var ln_weight_tensor = TileTensor[
+ mut=True, dtype, LnParamsLayout, MutAnyOrigin
+ ](ln_weight.unsafe_ptr(), ln_params_layout_val)
+ var ln_bias_tensor = TileTensor[
+ mut=True, dtype, LnParamsLayout, MutAnyOrigin
+ ](ln_bias.unsafe_ptr(), ln_params_layout_val)
+ var linear_weight_tensor = TileTensor[
+ mut=True, dtype, WeightLayout, MutAnyOrigin
+ ](linear_weight.unsafe_ptr(), weight_layout_val)
+ var linear_bias_tensor = TileTensor[
+ mut=True, dtype, BiasLayout, MutAnyOrigin
+ ](linear_bias.unsafe_ptr(), bias_layout_val)
comptime if target == "gpu":
var gpu_ctx = ctx.get_device_context()
@@ -515,15 +551,15 @@ struct LayerNormLinearCustomOp:
comptime if algorithm == "fused":
# fused case - one thread per sequence position
comptime kernel = minimal_fused_kernel[
- input_layout,
- ln_params_layout,
- weight_layout,
- bias_layout,
- output_layout,
batch_size,
seq_len,
hidden_dim,
output_dim,
+ OutputLayout,
+ InputLayout,
+ LnParamsLayout,
+ WeightLayout,
+ BiasLayout,
]
gpu_ctx.enqueue_function[kernel, kernel](
output_tensor,
@@ -541,18 +577,18 @@ struct LayerNormLinearCustomOp:
var normalized_buffer = gpu_ctx.enqueue_create_buffer[dtype](
batch_size * seq_len * hidden_dim
)
- var normalized_tensor = LayoutTensor[
- dtype, input_layout, MutAnyOrigin
- ](normalized_buffer)
+ var normalized_tensor = TileTensor[
+ mut=True, dtype, InputLayout, MutAnyOrigin
+ ](normalized_buffer, input_layout_val)
# Step 1: LayerNorm kernel
comptime kernel = layernorm_kernel[
- input_layout,
- ln_params_layout,
- input_layout,
batch_size,
seq_len,
hidden_dim,
+ InputLayout,
+ InputLayout,
+ LnParamsLayout,
]
gpu_ctx.enqueue_function[kernel, kernel](
normalized_tensor,
@@ -577,19 +613,26 @@ struct LayerNormLinearCustomOp:
var matmul_buffer = gpu_ctx.enqueue_create_buffer[dtype](
batch_size * seq_len * output_dim
)
- var matmul_tensor = LayoutTensor[
- dtype, output_layout, MutAnyOrigin
- ](matmul_buffer)
+ var matmul_tensor = TileTensor[
+ mut=True, dtype, OutputLayout, MutAnyOrigin
+ ](matmul_buffer, output_layout_val)
# Create transposed weight matrix: [output_dim, hidden_dim] -> [hidden_dim, output_dim]
var transposed_weight_buffer = gpu_ctx.enqueue_create_buffer[
dtype
](hidden_dim * output_dim)
- var transposed_weight_tensor = LayoutTensor[
+ comptime transposed_weight_layout = row_major[
+ hidden_dim, output_dim
+ ]()
+ comptime TransposedWeightLayout = type_of(
+ transposed_weight_layout
+ )
+ var transposed_weight_tensor = TileTensor[
+ mut=True,
dtype,
- Layout.row_major(hidden_dim, output_dim),
+ TransposedWeightLayout,
MutAnyOrigin,
- ](transposed_weight_buffer)
+ ](transposed_weight_buffer, transposed_weight_layout)
# Transpose the weight matrix
var transpose_blocks_x = (
@@ -599,10 +642,10 @@ struct LayerNormLinearCustomOp:
output_dim + TRANSPOSE_BLOCK_DIM_XY - 1
) // TRANSPOSE_BLOCK_DIM_XY
comptime kernel2 = transpose_kernel[
- weight_layout,
- transposed_weight_tensor.layout,
output_dim,
hidden_dim,
+ TransposedWeightLayout,
+ WeightLayout,
]
gpu_ctx.enqueue_function[kernel2, kernel2](
transposed_weight_tensor,
@@ -612,20 +655,26 @@ struct LayerNormLinearCustomOp:
)
# Reshape tensors for matmul: [batch*seq, hidden] @ [hidden, output] -> [batch*seq, output]
- var flat_normalized = normalized_tensor.reshape[
- Layout.row_major(batch_size * seq_len, hidden_dim)
+ comptime flat_normalized_layout = row_major[
+ batch_size * seq_len, hidden_dim
]()
- var flat_matmul = matmul_tensor.reshape[
- Layout.row_major(batch_size * seq_len, output_dim)
+ comptime FlatNormalizedLayout = type_of(flat_normalized_layout)
+ comptime flat_matmul_layout = row_major[
+ batch_size * seq_len, output_dim
]()
+ comptime FlatMatmulLayout = type_of(flat_matmul_layout)
+ var flat_normalized = normalized_tensor.reshape(
+ flat_normalized_layout
+ )
+ var flat_matmul = matmul_tensor.reshape(flat_matmul_layout)
comptime kernel3 = matmul_idiomatic_tiled[
- flat_normalized.layout,
- transposed_weight_tensor.layout,
- flat_matmul.layout,
batch_size * seq_len,
output_dim,
hidden_dim,
+ FlatMatmulLayout,
+ FlatNormalizedLayout,
+ TransposedWeightLayout,
]
gpu_ctx.enqueue_function[kernel3, kernel3](
flat_matmul,
@@ -636,17 +685,21 @@ struct LayerNormLinearCustomOp:
)
# Step 3: Add bias - reshape matmul result back to 3D for bias addition
- var reshaped_matmul = matmul_tensor.reshape[
- Layout.row_major(batch_size, seq_len, output_dim)
+ comptime reshaped_matmul_layout = row_major[
+ batch_size, seq_len, output_dim
]()
+ comptime ReshapedMatmulLayout = type_of(reshaped_matmul_layout)
+ var reshaped_matmul = matmul_tensor.reshape(
+ reshaped_matmul_layout
+ )
comptime kernel4 = add_bias_kernel[
- reshaped_matmul.layout,
- bias_layout,
- output_layout,
batch_size,
seq_len,
output_dim,
+ OutputLayout,
+ ReshapedMatmulLayout,
+ BiasLayout,
]
gpu_ctx.enqueue_function[kernel4, kernel4](
output_tensor,
@@ -722,65 +775,78 @@ struct LayerNormLinearBackwardCustomOp:
linear_weight: InputTensor[dtype=dtype, rank=2, static_spec=_],
ctx: DeviceContextPtr,
) raises:
- comptime grad_output_layout = grad_output.static_spec.to_layout()
- comptime input_layout = input.static_spec.to_layout()
- comptime ln_params_layout = ln_weight.static_spec.to_layout()
- comptime weight_layout = linear_weight.static_spec.to_layout()
- comptime grad_input_layout = grad_input.static_spec.to_layout()
- comptime grad_ln_weight_layout = grad_ln_weight.static_spec.to_layout()
- comptime grad_ln_bias_layout = grad_ln_bias.static_spec.to_layout()
- comptime grad_weight_layout = grad_weight.static_spec.to_layout()
- comptime grad_bias_layout = grad_bias.static_spec.to_layout()
-
- var grad_input_tensor = rebind[
- LayoutTensor[dtype, grad_input_layout, MutAnyOrigin]
- ](grad_input.to_layout_tensor())
- var grad_ln_weight_tensor = rebind[
- LayoutTensor[dtype, grad_ln_weight_layout, MutAnyOrigin]
- ](grad_ln_weight.to_layout_tensor())
- var grad_ln_bias_tensor = rebind[
- LayoutTensor[dtype, grad_ln_bias_layout, MutAnyOrigin]
- ](grad_ln_bias.to_layout_tensor())
- var grad_weight_tensor = rebind[
- LayoutTensor[dtype, grad_weight_layout, MutAnyOrigin]
- ](grad_weight.to_layout_tensor())
- var grad_bias_tensor = rebind[
- LayoutTensor[dtype, grad_bias_layout, MutAnyOrigin]
- ](grad_bias.to_layout_tensor())
- var grad_output_tensor = rebind[
- LayoutTensor[dtype, grad_output_layout, ImmutAnyOrigin]
- ](grad_output.to_layout_tensor())
- var input_tensor = rebind[
- LayoutTensor[dtype, input_layout, ImmutAnyOrigin]
- ](input.to_layout_tensor())
- var ln_weight_tensor = rebind[
- LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin]
- ](ln_weight.to_layout_tensor())
- var ln_bias_tensor = rebind[
- LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin]
- ](ln_bias.to_layout_tensor())
- var linear_weight_tensor = rebind[
- LayoutTensor[dtype, weight_layout, ImmutAnyOrigin]
- ](linear_weight.to_layout_tensor())
+ comptime input_layout_val = row_major[batch_size, seq_len, hidden_dim]()
+ comptime ln_params_layout_val = row_major[hidden_dim]()
+ comptime weight_layout_val = row_major[output_dim, hidden_dim]()
+ comptime grad_input_layout_val = row_major[
+ batch_size, seq_len, hidden_dim
+ ]()
+ comptime grad_ln_weight_layout_val = row_major[hidden_dim]()
+ comptime grad_ln_bias_layout_val = row_major[hidden_dim]()
+ comptime grad_weight_layout_val = row_major[output_dim, hidden_dim]()
+ comptime grad_bias_layout_val = row_major[output_dim]()
+ comptime grad_output_layout_val = row_major[
+ batch_size, seq_len, output_dim
+ ]()
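+ # Capture each layout value's concrete type for use as a kernel parameter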
+ comptime GradOutputLayout = type_of(grad_output_layout_val)
+ comptime InputLayout = type_of(input_layout_val)
+ comptime LnParamsLayout = type_of(ln_params_layout_val)
+ comptime WeightLayout = type_of(weight_layout_val)
+ comptime GradInputLayout = type_of(grad_input_layout_val)
+ comptime GradLnWeightLayout = type_of(grad_ln_weight_layout_val)
+ comptime GradLnBiasLayout = type_of(grad_ln_bias_layout_val)
+ comptime GradWeightLayout = type_of(grad_weight_layout_val)
+ comptime GradBiasLayout = type_of(grad_bias_layout_val)
+
+ var grad_input_tensor = TileTensor[
+ mut=True, dtype, GradInputLayout, MutAnyOrigin
+ ](grad_input.unsafe_ptr(), grad_input_layout_val)
+ var grad_ln_weight_tensor = TileTensor[
+ mut=True, dtype, GradLnWeightLayout, MutAnyOrigin
+ ](grad_ln_weight.unsafe_ptr(), grad_ln_weight_layout_val)
+ var grad_ln_bias_tensor = TileTensor[
+ mut=True, dtype, GradLnBiasLayout, MutAnyOrigin
+ ](grad_ln_bias.unsafe_ptr(), grad_ln_bias_layout_val)
+ var grad_weight_tensor = TileTensor[
+ mut=True, dtype, GradWeightLayout, MutAnyOrigin
+ ](grad_weight.unsafe_ptr(), grad_weight_layout_val)
+ var grad_bias_tensor = TileTensor[
+ mut=True, dtype, GradBiasLayout, MutAnyOrigin
+ ](grad_bias.unsafe_ptr(), grad_bias_layout_val)
+ var grad_output_tensor = TileTensor[
+ mut=True, dtype, GradOutputLayout, MutAnyOrigin
+ ](grad_output.unsafe_ptr(), grad_output_layout_val)
+ var input_tensor = TileTensor[
+ mut=True, dtype, InputLayout, MutAnyOrigin
+ ](input.unsafe_ptr(), input_layout_val)
+ var ln_weight_tensor = TileTensor[
+ mut=True, dtype, LnParamsLayout, MutAnyOrigin
+ ](ln_weight.unsafe_ptr(), ln_params_layout_val)
+ var ln_bias_tensor = TileTensor[
+ mut=True, dtype, LnParamsLayout, MutAnyOrigin
+ ](ln_bias.unsafe_ptr(), ln_params_layout_val)
+ var linear_weight_tensor = TileTensor[
+ mut=True, dtype, WeightLayout, MutAnyOrigin
+ ](linear_weight.unsafe_ptr(), weight_layout_val)
comptime if target == "gpu":
var gpu_ctx = ctx.get_device_context()
# Launch backward kernel
comptime kernel = minimal_fused_kernel_backward[
- grad_output_layout,
- input_layout,
- ln_params_layout,
- weight_layout,
- grad_input_layout,
- grad_ln_weight_layout,
- grad_ln_bias_layout,
- grad_weight_layout,
- grad_bias_layout,
batch_size,
seq_len,
hidden_dim,
output_dim,
+ GradInputLayout,
+ GradLnWeightLayout,
+ GradLnBiasLayout,
+ GradWeightLayout,
+ GradBiasLayout,
+ GradOutputLayout,
+ InputLayout,
+ LnParamsLayout,
+ WeightLayout,
]
gpu_ctx.enqueue_function[kernel, kernel](
grad_input_tensor,
diff --git a/solutions/p23/p23.mojo b/solutions/p23/p23.mojo
index 49a1c04a..9ab446a6 100644
--- a/solutions/p23/p23.mojo
+++ b/solutions/p23/p23.mojo
@@ -1,7 +1,9 @@
from std.gpu import thread_idx, block_dim, block_idx, barrier
from std.gpu.host import DeviceContext
from std.gpu.host.compile import get_gpu_target
-from layout import Layout, LayoutTensor
+from layout import TileTensor, LayoutTensor
+from layout.tile_layout import row_major, TensorLayout
+from layout.tile_tensor import stack_allocation
from std.utils import Index, IndexList
from std.math import log2
from std.algorithm.functional import elementwise, vectorize
@@ -11,18 +13,19 @@ from std.benchmark import Bench, BenchConfig, Bencher, BenchId, keep
comptime SIZE = 1024
comptime rank = 1
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)
comptime dtype = DType.float32
comptime SIMD_WIDTH = simd_width_of[dtype, target=get_gpu_target()]()
# ANCHOR: elementwise_add_solution
def elementwise_add[
- layout: Layout, dtype: DType, simd_width: Int, rank: Int, size: Int
+ LayoutT: TensorLayout, dtype: DType, simd_width: Int, rank: Int, size: Int
](
- output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin],
- a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
- b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutT, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
+ b: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
ctx: DeviceContext,
) raises:
@parameter
@@ -31,18 +34,19 @@ def elementwise_add[
simd_width: Int, rank: Int, alignment: Int = align_of[dtype]()
](indices: IndexList[rank]) capturing -> None:
var idx = indices[0]
+ # Convert inside GPU kernel to avoid host-captured LayoutTensor issues
+ var a_lt = a.to_layout_tensor()
+ var b_lt = b.to_layout_tensor()
+ var out_lt = output.to_layout_tensor()
# Note: This is thread-local SIMD - each thread processes its own vector of data
# we'll later better see this hierarchy in Mojo:
# SIMD within threads, warp across threads, block across warps
- var a_simd = a.aligned_load[width=simd_width](Index(idx))
- var b_simd = b.aligned_load[width=simd_width](Index(idx))
+ var a_simd = a_lt.aligned_load[width=simd_width](Index(idx))
+ var b_simd = b_lt.aligned_load[width=simd_width](Index(idx))
var ret = a_simd + b_simd
- # print(
- # "idx:", idx, ", a_simd:", a_simd, ", b_simd:", b_simd, " sum:", ret
- # )
- output.store[simd_width](Index(idx), ret)
+ out_lt.store[simd_width](Index(idx), ret)
- elementwise[add, SIMD_WIDTH, target="gpu"](a.size(), ctx)
+ elementwise[add, SIMD_WIDTH, target="gpu"](size, ctx)
# ANCHOR_END: elementwise_add_solution
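
Aside: the p23 rewrite above leans on one recurring pattern: build a compile-time layout value with `row_major[...]()`, derive its type with `type_of`, and hand both to the `TileTensor` constructor. A minimal sketch of that pattern (the `demo` scaffolding is illustrative; the constructor spelling is copied from the diff):

```mojo
from std.gpu.host import DeviceContext
from layout import TileTensor
from layout.tile_layout import row_major

comptime dtype = DType.float32
comptime SIZE = 1024
comptime layout = row_major[SIZE]()    # 1. layout *value*, built at compile time
comptime LayoutType = type_of(layout)  # 2. layout *type*, usable in parameter lists


def demo() raises:
    with DeviceContext() as ctx:
        var out = ctx.enqueue_create_buffer[dtype](SIZE)
        out.enqueue_fill(0)
        # 3. pair the buffer with the layout value; LayoutType carries the
        #    static shape through kernel parameter lists.
        var out_tensor = TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin](
            out, layout
        )
        _ = out_tensor
```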
@@ -53,16 +57,16 @@ comptime TILE_SIZE = 32
def tiled_elementwise_add[
- layout: Layout,
+ LayoutT: TensorLayout,
dtype: DType,
simd_width: Int,
rank: Int,
size: Int,
tile_size: Int,
](
- output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin],
- a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
- b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutT, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
+ b: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
ctx: DeviceContext,
) raises:
@parameter
@@ -72,13 +76,13 @@ def tiled_elementwise_add[
](indices: IndexList[rank]) capturing -> None:
var tile_id = indices[0]
- var output_tile = output.tile[tile_size](tile_id)
- var a_tile = a.tile[tile_size](tile_id)
- var b_tile = b.tile[tile_size](tile_id)
+ var output_tile = output.tile[tile_size](tile_id).to_layout_tensor()
+ var a_tile = a.tile[tile_size](tile_id).to_layout_tensor()
+ var b_tile = b.tile[tile_size](tile_id).to_layout_tensor()
comptime for i in range(tile_size):
- var a_vec = a_tile.load[simd_width](Index(i))
- var b_vec = b_tile.load[simd_width](Index(i))
+ var a_vec = a_tile.aligned_load[width=simd_width](Index(i))
+ var b_vec = b_tile.aligned_load[width=simd_width](Index(i))
var ret = a_vec + b_vec
output_tile.store[simd_width](Index(i), ret)
@@ -91,7 +95,7 @@ def tiled_elementwise_add[
# ANCHOR: manual_vectorized_tiled_elementwise_add_solution
def manual_vectorized_tiled_elementwise_add[
- layout: Layout,
+ LayoutT: TensorLayout,
dtype: DType,
simd_width: Int,
num_threads_per_tile: Int,
@@ -99,9 +103,9 @@ def manual_vectorized_tiled_elementwise_add[
size: Int,
tile_size: Int,
](
- output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin],
- a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
- b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutT, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
+ b: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
ctx: DeviceContext,
) raises:
# Each tile contains tile_size groups of simd_width elements
@@ -113,20 +117,18 @@ def manual_vectorized_tiled_elementwise_add[
num_threads_per_tile: Int, rank: Int, alignment: Int = align_of[dtype]()
](indices: IndexList[rank]) capturing -> None:
var tile_id = indices[0]
-
- var output_tile = output.tile[chunk_size](tile_id)
- var a_tile = a.tile[chunk_size](tile_id)
- var b_tile = b.tile[chunk_size](tile_id)
+ # Convert inside GPU kernel to avoid host-captured LayoutTensor issues
+ var a_lt = a.to_layout_tensor()
+ var b_lt = b.to_layout_tensor()
+ var out_lt = output.to_layout_tensor()
comptime for i in range(tile_size):
var global_start = tile_id * chunk_size + i * simd_width
- var a_vec = a.aligned_load[simd_width](Index(global_start))
- var b_vec = b.aligned_load[simd_width](Index(global_start))
+ var a_vec = a_lt.aligned_load[width=simd_width](Index(global_start))
+ var b_vec = b_lt.aligned_load[width=simd_width](Index(global_start))
var ret = a_vec + b_vec
- # print("tile:", tile_id, "simd_group:", i, "global_start:", global_start, "a_vec:", a_vec, "b_vec:", b_vec, "result:", ret)
-
- output.store[simd_width](Index(global_start), ret)
+ out_lt.store[simd_width](Index(global_start), ret)
# Number of tiles needed: each tile processes chunk_size elements
var num_tiles = (size + chunk_size - 1) // chunk_size
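
The comment repeated in these kernels ("Convert inside GPU kernel to avoid host-captured LayoutTensor issues") is the crux of the capture change: the closure captures the `TileTensor` and builds its `LayoutTensor` view on-device. A condensed, hypothetical sketch of that shape (the `scale_in_place` wrapper and its body are illustrative; the nested-function signature mirrors the ones above):

```mojo
from std.gpu.host import DeviceContext
from std.algorithm.functional import elementwise
from std.sys import align_of
from std.utils import Index, IndexList
from layout import TileTensor
from layout.tile_layout import TensorLayout


def scale_in_place[
    LayoutT: TensorLayout, dtype: DType, size: Int
](
    data: TileTensor[mut=True, dtype, LayoutT, MutAnyOrigin],
    ctx: DeviceContext,
) raises:
    @parameter
    @always_inline
    fn scale[
        simd_width: Int, rank: Int, alignment: Int = align_of[dtype]()
    ](indices: IndexList[rank]) capturing -> None:
        var idx = indices[0]
        # Capture the TileTensor, then create the LayoutTensor view here,
        # on-device, so no host-created view crosses into the closure.
        var lt = data.to_layout_tensor()
        var v = lt.load[1](Index(idx))
        lt.store[1](Index(idx), v * 2)

    elementwise[scale, 1, target="gpu"](size, ctx)
```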
@@ -140,7 +142,7 @@ def manual_vectorized_tiled_elementwise_add[
# ANCHOR: vectorize_within_tiles_elementwise_add_solution
def vectorize_within_tiles_elementwise_add[
- layout: Layout,
+ LayoutT: TensorLayout,
dtype: DType,
simd_width: Int,
num_threads_per_tile: Int,
@@ -148,9 +150,9 @@ def vectorize_within_tiles_elementwise_add[
size: Int,
tile_size: Int,
](
- output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin],
- a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
- b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutT, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
+ b: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
ctx: DeviceContext,
) raises:
# Each tile contains tile_size elements (not SIMD groups)
@@ -163,16 +165,20 @@ def vectorize_within_tiles_elementwise_add[
var tile_start = tile_id * tile_size
var tile_end = min(tile_start + tile_size, size)
var actual_tile_size = tile_end - tile_start
+ # Convert inside GPU kernel to avoid host-captured LayoutTensor issues
+ var a_lt = a.to_layout_tensor()
+ var b_lt = b.to_layout_tensor()
+ var out_lt = output.to_layout_tensor()
def vectorized_add[
width: Int
- ](i: Int) unified {read tile_start, read a, read b, mut output}:
+ ](i: Int) unified {read tile_start, read a_lt, read b_lt, mut out_lt}:
var global_idx = tile_start + i
if global_idx + width <= size:
- var a_vec = a.aligned_load[width](Index(global_idx))
- var b_vec = b.aligned_load[width](Index(global_idx))
+ var a_vec = a_lt.aligned_load[width](Index(global_idx))
+ var b_vec = b_lt.aligned_load[width](Index(global_idx))
var result = a_vec + b_vec
- output.store[width](Index(global_idx), result)
+ out_lt.store[width](Index(global_idx), result)
# Use vectorize within each tile
vectorize[simd_width](actual_tile_size, vectorized_add)
@@ -192,7 +198,8 @@ def benchmark_elementwise_parameterized[
test_size: Int, tile_size: Int
](mut b: Bencher) raises:
var bench_ctx = DeviceContext()
- comptime layout = Layout.row_major(test_size)
+ comptime bench_layout = row_major[test_size]()
+ comptime BenchLayoutType = type_of(bench_layout)
var out = bench_ctx.enqueue_create_buffer[dtype](test_size)
out.enqueue_fill(0)
var a = bench_ctx.enqueue_create_buffer[dtype](test_size)
@@ -205,20 +212,20 @@ def benchmark_elementwise_parameterized[
a_host[i] = Scalar[dtype](2 * i)
b_host[i] = Scalar[dtype](2 * i + 1)
- var a_tensor = LayoutTensor[mut=False, dtype, layout, MutAnyOrigin](
- a.unsafe_ptr()
- )
- var b_tensor = LayoutTensor[mut=False, dtype, layout, MutAnyOrigin](
- b_buf.unsafe_ptr()
- )
- var out_tensor = LayoutTensor[mut=True, dtype, layout, MutAnyOrigin](
- out.unsafe_ptr()
+ var a_tensor = TileTensor[
+ mut=False, dtype, BenchLayoutType, ImmutAnyOrigin
+ ](a, bench_layout)
+ var b_tensor = TileTensor[
+ mut=False, dtype, BenchLayoutType, ImmutAnyOrigin
+ ](b_buf, bench_layout)
+ var out_tensor = TileTensor[mut=True, dtype, BenchLayoutType, MutAnyOrigin](
+ out, bench_layout
)
@parameter
@always_inline
def elementwise_workflow(ctx: DeviceContext) raises:
- elementwise_add[layout, dtype, SIMD_WIDTH, rank, test_size](
+ elementwise_add[BenchLayoutType, dtype, SIMD_WIDTH, rank, test_size](
out_tensor, a_tensor, b_tensor, ctx
)
@@ -233,7 +240,8 @@ def benchmark_tiled_parameterized[
test_size: Int, tile_size: Int
](mut b: Bencher) raises:
var bench_ctx = DeviceContext()
- comptime layout = Layout.row_major(test_size)
+ comptime bench_layout = row_major[test_size]()
+ comptime BenchLayoutType = type_of(bench_layout)
var out = bench_ctx.enqueue_create_buffer[dtype](test_size)
out.enqueue_fill(0)
var a = bench_ctx.enqueue_create_buffer[dtype](test_size)
@@ -246,15 +254,21 @@ def benchmark_tiled_parameterized[
a_host[i] = Scalar[dtype](2 * i)
b_host[i] = Scalar[dtype](2 * i + 1)
- var a_tensor = LayoutTensor[mut=False, dtype, layout](a.unsafe_ptr())
- var b_tensor = LayoutTensor[mut=False, dtype, layout](b_buf.unsafe_ptr())
- var out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr())
+ var a_tensor = TileTensor[
+ mut=False, dtype, BenchLayoutType, ImmutAnyOrigin
+ ](a, bench_layout)
+ var b_tensor = TileTensor[
+ mut=False, dtype, BenchLayoutType, ImmutAnyOrigin
+ ](b_buf, bench_layout)
+ var out_tensor = TileTensor[mut=True, dtype, BenchLayoutType, MutAnyOrigin](
+ out, bench_layout
+ )
@parameter
@always_inline
def tiled_workflow(ctx: DeviceContext) raises:
tiled_elementwise_add[
- layout, dtype, SIMD_WIDTH, rank, test_size, tile_size
+ BenchLayoutType, dtype, SIMD_WIDTH, rank, test_size, tile_size
](out_tensor, a_tensor, b_tensor, ctx)
b.iter_custom[tiled_workflow](bench_ctx)
@@ -268,7 +282,8 @@ def benchmark_manual_vectorized_parameterized[
test_size: Int, tile_size: Int
](mut b: Bencher) raises:
var bench_ctx = DeviceContext()
- comptime layout = Layout.row_major(test_size)
+ comptime bench_layout = row_major[test_size]()
+ comptime BenchLayoutType = type_of(bench_layout)
var out = bench_ctx.enqueue_create_buffer[dtype](test_size)
out.enqueue_fill(0)
var a = bench_ctx.enqueue_create_buffer[dtype](test_size)
@@ -281,15 +296,21 @@ def benchmark_manual_vectorized_parameterized[
a_host[i] = Scalar[dtype](2 * i)
b_host[i] = Scalar[dtype](2 * i + 1)
- var a_tensor = LayoutTensor[mut=False, dtype, layout](a.unsafe_ptr())
- var b_tensor = LayoutTensor[mut=False, dtype, layout](b_buf.unsafe_ptr())
- var out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr())
+ var a_tensor = TileTensor[
+ mut=False, dtype, BenchLayoutType, ImmutAnyOrigin
+ ](a, bench_layout)
+ var b_tensor = TileTensor[
+ mut=False, dtype, BenchLayoutType, ImmutAnyOrigin
+ ](b_buf, bench_layout)
+ var out_tensor = TileTensor[mut=True, dtype, BenchLayoutType, MutAnyOrigin](
+ out, bench_layout
+ )
@parameter
@always_inline
def manual_vectorized_workflow(ctx: DeviceContext) raises:
manual_vectorized_tiled_elementwise_add[
- layout, dtype, SIMD_WIDTH, 1, rank, test_size, tile_size
+ BenchLayoutType, dtype, SIMD_WIDTH, 1, rank, test_size, tile_size
](out_tensor, a_tensor, b_tensor, ctx)
b.iter_custom[manual_vectorized_workflow](bench_ctx)
@@ -303,7 +324,8 @@ def benchmark_vectorized_parameterized[
test_size: Int, tile_size: Int
](mut b: Bencher) raises:
var bench_ctx = DeviceContext()
- comptime layout = Layout.row_major(test_size)
+ comptime bench_layout = row_major[test_size]()
+ comptime BenchLayoutType = type_of(bench_layout)
var out = bench_ctx.enqueue_create_buffer[dtype](test_size)
out.enqueue_fill(0)
var a = bench_ctx.enqueue_create_buffer[dtype](test_size)
@@ -316,15 +338,21 @@ def benchmark_vectorized_parameterized[
a_host[i] = Scalar[dtype](2 * i)
b_host[i] = Scalar[dtype](2 * i + 1)
- var a_tensor = LayoutTensor[mut=False, dtype, layout](a.unsafe_ptr())
- var b_tensor = LayoutTensor[mut=False, dtype, layout](b_buf.unsafe_ptr())
- var out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr())
+ var a_tensor = TileTensor[
+ mut=False, dtype, BenchLayoutType, ImmutAnyOrigin
+ ](a, bench_layout)
+ var b_tensor = TileTensor[
+ mut=False, dtype, BenchLayoutType, ImmutAnyOrigin
+ ](b_buf, bench_layout)
+ var out_tensor = TileTensor[mut=True, dtype, BenchLayoutType, MutAnyOrigin](
+ out, bench_layout
+ )
@parameter
@always_inline
def vectorized_workflow(ctx: DeviceContext) raises:
vectorize_within_tiles_elementwise_add[
- layout, dtype, SIMD_WIDTH, 1, rank, test_size, tile_size
+ BenchLayoutType, dtype, SIMD_WIDTH, 1, rank, test_size, tile_size
](out_tensor, a_tensor, b_tensor, ctx)
b.iter_custom[vectorized_workflow](bench_ctx)
@@ -349,8 +377,12 @@ def main() raises:
b_host[i] = Scalar[dtype](2 * i + 1)
expected[i] = a_host[i] + b_host[i]
- var a_tensor = LayoutTensor[mut=False, dtype, layout](a.unsafe_ptr())
- var b_tensor = LayoutTensor[mut=False, dtype, layout](b.unsafe_ptr())
+ var a_tensor = TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin](
+ a, layout
+ )
+ var b_tensor = TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin](
+ b, layout
+ )
ctx.synchronize()
@@ -358,8 +390,10 @@ def main() raises:
print("simd_width:", SIMD_WIDTH)
if argv()[1] == "--elementwise":
- out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr())
- elementwise_add[layout, dtype, SIMD_WIDTH, rank, SIZE](
+ var out_tensor = TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin](
+ out, layout
+ )
+ elementwise_add[LayoutType, dtype, SIMD_WIDTH, rank, SIZE](
out_tensor, a_tensor, b_tensor, ctx
)
@@ -371,11 +405,13 @@ def main() raises:
print("Puzzle 23 complete โ
")
elif argv()[1] == "--tiled":
- out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr())
- print("tile size:", TILE_SIZE)
- tiled_elementwise_add[layout, dtype, SIMD_WIDTH, rank, SIZE, TILE_SIZE](
- out_tensor, a_tensor, b_tensor, ctx
+ var out_tensor = TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin](
+ out, layout
)
+ print("tile size:", TILE_SIZE)
+ tiled_elementwise_add[
+ LayoutType, dtype, SIMD_WIDTH, rank, SIZE, TILE_SIZE
+ ](out_tensor, a_tensor, b_tensor, ctx)
with out.map_to_host() as out_host:
print("out:", out_host)
@@ -385,10 +421,12 @@ def main() raises:
print("Puzzle 23 complete โ
")
elif argv()[1] == "--manual-vectorized":
- out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr())
+ var out_tensor = TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin](
+ out, layout
+ )
print("tile size:", TILE_SIZE)
manual_vectorized_tiled_elementwise_add[
- layout, dtype, SIMD_WIDTH, 1, rank, SIZE, TILE_SIZE
+ LayoutType, dtype, SIMD_WIDTH, 1, rank, SIZE, TILE_SIZE
](out_tensor, a_tensor, b_tensor, ctx)
with out.map_to_host() as out_host:
@@ -399,10 +437,12 @@ def main() raises:
print("Puzzle 23 complete โ
")
elif argv()[1] == "--vectorized":
- out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr())
+ var out_tensor = TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin](
+ out, layout
+ )
print("tile size:", TILE_SIZE)
vectorize_within_tiles_elementwise_add[
- layout, dtype, SIMD_WIDTH, 1, rank, SIZE, TILE_SIZE
+ LayoutType, dtype, SIMD_WIDTH, 1, rank, SIZE, TILE_SIZE
](out_tensor, a_tensor, b_tensor, ctx)
with out.map_to_host() as out_host:
diff --git a/solutions/p24/p24.mojo b/solutions/p24/p24.mojo
index c2f77764..55c3efc8 100644
--- a/solutions/p24/p24.mojo
+++ b/solutions/p24/p24.mojo
@@ -4,7 +4,9 @@ from std.gpu.host import DeviceContext, HostBuffer, DeviceBuffer
from std.gpu.primitives.warp import sum as warp_sum, WARP_SIZE
from std.gpu.memory import AddressSpace
from std.algorithm.functional import elementwise
-from layout import Layout, LayoutTensor
+from layout import TileTensor, LayoutTensor
+from layout.tile_layout import row_major, TensorLayout
+from layout.tile_tensor import stack_allocation
from std.utils import Index, IndexList
from std.sys import argv, simd_width_of, align_of
from std.testing import assert_equal
@@ -26,32 +28,36 @@ comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (WARP_SIZE, 1) # optimal choice for warp kernel
comptime dtype = DType.float32
comptime SIMD_WIDTH = simd_width_of[dtype]()
-comptime in_layout = Layout.row_major(SIZE)
-comptime out_layout = Layout.row_major(1)
+comptime in_layout = row_major[SIZE]()
+comptime out_layout = row_major[1]()
+comptime InLayout = type_of(in_layout)
+comptime OutLayout = type_of(out_layout)
# ANCHOR: traditional_approach_from_p12
def traditional_dot_product_p12_style[
- in_layout: Layout, out_layout: Layout, size: Int
+ InLayoutT: TensorLayout, OutLayoutT: TensorLayout, size: Int
](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
- b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayoutT, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, InLayoutT, MutAnyOrigin],
+ b: TileTensor[mut=False, dtype, InLayoutT, MutAnyOrigin],
):
"""
This is the complex approach from p12_layout_tensor.mojo - kept for comparison.
"""
- var shared = LayoutTensor[
- dtype,
- Layout.row_major(WARP_SIZE),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var a_lt = a.to_layout_tensor()
+ var b_lt = b.to_layout_tensor()
+ var out_lt = output.to_layout_tensor()
+ var shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[WARP_SIZE]())
var global_i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
if global_i < size:
- shared[local_i] = (a[global_i] * b[global_i]).reduce_add()
+ shared[local_i] = rebind[Scalar[dtype]](a_lt[global_i]) * rebind[
+ Scalar[dtype]
+ ](b_lt[global_i])
else:
shared[local_i] = 0.0
@@ -65,7 +71,7 @@ def traditional_dot_product_p12_style[
stride //= 2
if local_i == 0:
- output[global_i // WARP_SIZE] = shared[0]
+ out_lt.store[1](Index(global_i // WARP_SIZE), shared[0])
# ANCHOR_END: traditional_approach_from_p12
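
Note the shared-memory change in this hunk: the old `LayoutTensor[...].stack_allocation()` spelling becomes the free function `stack_allocation`, parameterized by dtype and address space and taking a layout value. A minimal sketch, assuming a kernel context (the kernel itself is hypothetical; the allocation spelling is from the diff):

```mojo
from std.gpu import thread_idx, barrier
from std.gpu.memory import AddressSpace
from layout.tile_layout import row_major
from layout.tile_tensor import stack_allocation

comptime TPB = 128


fn shared_memory_demo():
    # One shared-memory tile per block, shaped by the row_major[TPB]() value.
    var shared = stack_allocation[
        dtype=DType.float32, address_space=AddressSpace.SHARED
    ](row_major[TPB]())
    shared[thread_idx.x] = 0.0
    barrier()
```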
@@ -73,25 +79,30 @@ def traditional_dot_product_p12_style[
# ANCHOR: simple_warp_kernel_solution
def simple_warp_dot_product[
- in_layout: Layout, out_layout: Layout, size: Int
+ InLayoutT: TensorLayout, OutLayoutT: TensorLayout, size: Int
](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
- b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayoutT, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, InLayoutT, MutAnyOrigin],
+ b: TileTensor[mut=False, dtype, InLayoutT, MutAnyOrigin],
):
+ var a_lt = a.to_layout_tensor()
+ var b_lt = b.to_layout_tensor()
+ var out_lt = output.to_layout_tensor()
var global_i = block_dim.x * block_idx.x + thread_idx.x
# Each thread computes one partial product using vectorized approach as values in Mojo are SIMD based
var partial_product: Scalar[dtype] = 0
if global_i < size:
- partial_product = (a[global_i] * b[global_i]).reduce_add()
+ partial_product = rebind[Scalar[dtype]](a_lt[global_i]) * rebind[
+ Scalar[dtype]
+ ](b_lt[global_i])
# warp_sum() replaces all the shared memory + barriers + tree reduction
var total = warp_sum(partial_product)
# Only lane 0 writes the result (all lanes have the same total)
if lane_id() == 0:
- output[global_i // WARP_SIZE] = total
+ out_lt.store[1](Index(global_i // WARP_SIZE), total)
# ANCHOR_END: simple_warp_kernel_solution
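
The other recurring edit in p24 is the `rebind` around element reads: indexing a `LayoutTensor` view yields a generic element type, and `rebind[Scalar[dtype]]` pins it down before arithmetic. A small illustrative helper (the function itself is not in the repo; the rebind spelling is):

```mojo
from layout import TileTensor
from layout.tile_layout import row_major

comptime dtype = DType.float32
comptime layout = row_major[32]()
comptime LayoutType = type_of(layout)


fn pairwise_product(
    a: TileTensor[mut=False, dtype, LayoutType, MutAnyOrigin],
    b: TileTensor[mut=False, dtype, LayoutType, MutAnyOrigin],
    i: Int,
) -> Scalar[dtype]:
    var a_lt = a.to_layout_tensor()
    var b_lt = b.to_layout_tensor()
    # rebind pins the generic element type to a concrete Scalar[dtype]
    # so the multiplication type-checks.
    return rebind[Scalar[dtype]](a_lt[i]) * rebind[Scalar[dtype]](b_lt[i])
```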
@@ -99,16 +110,16 @@ def simple_warp_dot_product[
# ANCHOR: functional_warp_approach_solution
def functional_warp_dot_product[
- layout: Layout,
- out_layout: Layout,
+ InLayoutT: TensorLayout,
+ OutLayoutT: TensorLayout,
dtype: DType,
simd_width: Int,
rank: Int,
size: Int,
](
- output: LayoutTensor[mut=True, dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
- b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayoutT, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, InLayoutT, MutAnyOrigin],
+ b: TileTensor[mut=False, dtype, InLayoutT, MutAnyOrigin],
ctx: DeviceContext,
) raises:
@parameter
@@ -117,13 +128,19 @@ def functional_warp_dot_product[
simd_width: Int, rank: Int, alignment: Int = align_of[dtype]()
](indices: IndexList[rank]) capturing -> None:
var idx = indices[0]
+ # Convert inside GPU kernel to avoid host-captured LayoutTensor issues
+ var a_lt = a.to_layout_tensor()
+ var b_lt = b.to_layout_tensor()
+ var out_lt = output.to_layout_tensor()
# Each thread computes one partial product
var partial_product: Scalar[dtype] = 0.0
if idx < size:
- var a_val = a.load[1](Index(idx))
- var b_val = b.load[1](Index(idx))
- partial_product = a_val * b_val
+ var a_val = a_lt.load[1](Index(idx))
+ var b_val = b_lt.load[1](Index(idx))
+ partial_product = rebind[Scalar[dtype]](a_val) * rebind[
+ Scalar[dtype]
+ ](b_val)
else:
partial_product = 0.0
@@ -132,7 +149,7 @@ def functional_warp_dot_product[
# Only lane 0 writes the result (all lanes have the same total)
if lane_id() == 0:
- output.store[1](Index(idx // WARP_SIZE), total)
+ out_lt.store[1](Index(idx // WARP_SIZE), total)
# Launch exactly size == WARP_SIZE threads (one warp) to process all elements
elementwise[compute_dot_product, 1, target="gpu"](size, ctx)
@@ -187,8 +204,10 @@ def benchmark_simple_warp_parameterized[
test_size: Int
](mut bencher: Bencher) raises:
comptime n_warps = test_size // WARP_SIZE
- comptime in_layout = Layout.row_major(test_size)
- comptime out_layout = Layout.row_major(n_warps)
+ comptime bench_in_layout = row_major[test_size]()
+ comptime bench_out_layout = row_major[n_warps]()
+ comptime BenchInLayout = type_of(bench_in_layout)
+ comptime BenchOutLayout = type_of(bench_out_layout)
comptime n_threads = WARP_SIZE
comptime n_blocks = (ceildiv(test_size, n_threads), 1)
@@ -207,15 +226,21 @@ def benchmark_simple_warp_parameterized[
rand_int[dtype, test_size](b)
expected_output[dtype, n_warps](expected, a, b)
- var a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a)
- var b_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](b)
- var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out)
+ var a_tensor = TileTensor[mut=False, dtype, BenchInLayout](
+ a, bench_in_layout
+ )
+ var b_tensor = TileTensor[mut=False, dtype, BenchInLayout](
+ b, bench_in_layout
+ )
+ var out_tensor = TileTensor[mut=True, dtype, BenchOutLayout](
+ out, bench_out_layout
+ )
@parameter
@always_inline
def traditional_workflow(ctx: DeviceContext) raises:
comptime kernel = simple_warp_dot_product[
- in_layout, out_layout, test_size
+ BenchInLayout, BenchOutLayout, test_size
]
ctx.enqueue_function[kernel, kernel](
out_tensor,
@@ -239,8 +264,10 @@ def benchmark_functional_warp_parameterized[
test_size: Int
](mut bencher: Bencher) raises:
comptime n_warps = test_size // WARP_SIZE
- comptime in_layout = Layout.row_major(test_size)
- comptime out_layout = Layout.row_major(n_warps)
+ comptime bench_in_layout = row_major[test_size]()
+ comptime bench_out_layout = row_major[n_warps]()
+ comptime BenchInLayout = type_of(bench_in_layout)
+ comptime BenchOutLayout = type_of(bench_out_layout)
var bench_ctx = DeviceContext()
@@ -257,15 +284,21 @@ def benchmark_functional_warp_parameterized[
rand_int[dtype, test_size](b)
expected_output[dtype, n_warps](expected, a, b)
- var a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a)
- var b_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](b)
- var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out)
+ var a_tensor = rebind[
+ TileTensor[mut=False, dtype, BenchInLayout, ImmutAnyOrigin]
+ ](TileTensor[mut=False, dtype, BenchInLayout](a, bench_in_layout))
+ var b_tensor = rebind[
+ TileTensor[mut=False, dtype, BenchInLayout, ImmutAnyOrigin]
+ ](TileTensor[mut=False, dtype, BenchInLayout](b, bench_in_layout))
+ var out_tensor = rebind[
+ TileTensor[mut=True, dtype, BenchOutLayout, MutAnyOrigin]
+ ](TileTensor[mut=True, dtype, BenchOutLayout](out, bench_out_layout))
@parameter
@always_inline
def functional_warp_workflow(ctx: DeviceContext) raises:
functional_warp_dot_product[
- in_layout, out_layout, dtype, SIMD_WIDTH, 1, test_size
+ BenchInLayout, BenchOutLayout, dtype, SIMD_WIDTH, 1, test_size
](out_tensor, a_tensor, b_tensor, ctx)
bencher.iter_custom[functional_warp_workflow](bench_ctx)
@@ -282,8 +315,10 @@ def benchmark_traditional_parameterized[
test_size: Int
](mut bencher: Bencher) raises:
comptime n_warps = test_size // WARP_SIZE
- comptime in_layout = Layout.row_major(test_size)
- comptime out_layout = Layout.row_major(n_warps)
+ comptime bench_in_layout = row_major[test_size]()
+ comptime bench_out_layout = row_major[n_warps]()
+ comptime BenchInLayout = type_of(bench_in_layout)
+ comptime BenchOutLayout = type_of(bench_out_layout)
comptime n_blocks = (ceildiv(test_size, WARP_SIZE), 1)
var bench_ctx = DeviceContext()
@@ -301,16 +336,26 @@ def benchmark_traditional_parameterized[
rand_int[dtype, test_size](b)
expected_output[dtype, n_warps](expected, a, b)
- var a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a)
- var b_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](b)
- var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out)
+ var a_tensor = TileTensor[mut=False, dtype, BenchInLayout](
+ a, bench_in_layout
+ )
+ var b_tensor = TileTensor[mut=False, dtype, BenchInLayout](
+ b, bench_in_layout
+ )
+ var out_tensor = TileTensor[mut=True, dtype, BenchOutLayout](
+ out, bench_out_layout
+ )
@parameter
@always_inline
def traditional_workflow(ctx: DeviceContext) raises:
ctx.enqueue_function[
- traditional_dot_product_p12_style[in_layout, out_layout, test_size],
- traditional_dot_product_p12_style[in_layout, out_layout, test_size],
+ traditional_dot_product_p12_style[
+ BenchInLayout, BenchOutLayout, test_size
+ ],
+ traditional_dot_product_p12_style[
+ BenchInLayout, BenchOutLayout, test_size
+ ],
](
out_tensor,
a_tensor,
@@ -333,6 +378,8 @@ def main() raises:
print("WARP_SIZE:", WARP_SIZE)
print("SIMD_WIDTH:", SIMD_WIDTH)
comptime n_warps = SIZE // WARP_SIZE
+ comptime main_out_layout = row_major[n_warps]()
+ comptime MainOutLayout = type_of(main_out_layout)
with DeviceContext() as ctx:
var out = ctx.enqueue_create_buffer[dtype](n_warps)
out.enqueue_fill(0)
@@ -343,9 +390,15 @@ def main() raises:
var expected = ctx.enqueue_create_host_buffer[dtype](n_warps)
expected.enqueue_fill(0)
- var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out)
- var a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a)
- var b_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](b)
+ var out_tensor = rebind[
+ TileTensor[mut=True, dtype, MainOutLayout, MutAnyOrigin]
+ ](TileTensor[mut=True, dtype, MainOutLayout](out, main_out_layout))
+ var a_tensor = rebind[
+ TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin]
+ ](TileTensor[mut=False, dtype, InLayout](a, in_layout))
+ var b_tensor = rebind[
+ TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin]
+ ](TileTensor[mut=False, dtype, InLayout](b, in_layout))
with a.map_to_host() as a_host, b.map_to_host() as b_host:
for i in range(SIZE):
@@ -355,10 +408,10 @@ def main() raises:
if argv()[1] == "--traditional":
ctx.enqueue_function[
traditional_dot_product_p12_style[
- in_layout, out_layout, SIZE
+ InLayout, MainOutLayout, SIZE
],
traditional_dot_product_p12_style[
- in_layout, out_layout, SIZE
+ InLayout, MainOutLayout, SIZE
],
](
out_tensor,
@@ -369,8 +422,8 @@ def main() raises:
)
elif argv()[1] == "--kernel":
ctx.enqueue_function[
- simple_warp_dot_product[in_layout, out_layout, SIZE],
- simple_warp_dot_product[in_layout, out_layout, SIZE],
+ simple_warp_dot_product[InLayout, MainOutLayout, SIZE],
+ simple_warp_dot_product[InLayout, MainOutLayout, SIZE],
](
out_tensor,
a_tensor,
@@ -380,7 +433,7 @@ def main() raises:
)
elif argv()[1] == "--functional":
functional_warp_dot_product[
- in_layout, out_layout, dtype, SIMD_WIDTH, 1, SIZE
+ InLayout, MainOutLayout, dtype, SIMD_WIDTH, 1, SIZE
](out_tensor, a_tensor, b_tensor, ctx)
expected_output[dtype, n_warps](expected, a, b)
check_result[dtype, n_warps, True](out, expected)
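
p24's `main` and the functional benchmark wrap construction in a `rebind` over the whole tensor: the constructor is called with its inferred origin, then rebound to the origin the kernel signatures expect. Sketched in isolation (the `demo` scaffolding is illustrative; the rebind-over-constructor spelling is copied from above):

```mojo
from std.gpu.host import DeviceContext
from layout import TileTensor
from layout.tile_layout import row_major

comptime dtype = DType.float32
comptime in_layout = row_major[32]()
comptime InLayout = type_of(in_layout)


def demo() raises:
    with DeviceContext() as ctx:
        var a = ctx.enqueue_create_buffer[dtype](32)
        a.enqueue_fill(0)
        # Construct with the constructor's inferred origin, then rebind the
        # whole TileTensor to the ImmutAnyOrigin the kernels expect.
        var a_tensor = rebind[
            TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin]
        ](TileTensor[mut=False, dtype, InLayout](a, in_layout))
        _ = a_tensor
```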
diff --git a/solutions/p25/p25.mojo b/solutions/p25/p25.mojo
index 36130776..c33da7f8 100644
--- a/solutions/p25/p25.mojo
+++ b/solutions/p25/p25.mojo
@@ -1,7 +1,8 @@
from std.gpu import thread_idx, block_idx, block_dim, lane_id
from std.gpu.host import DeviceContext
from std.gpu.primitives.warp import shuffle_down, broadcast, WARP_SIZE
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major, TensorLayout
from std.sys import argv
from std.testing import assert_equal, assert_almost_equal
@@ -10,15 +11,16 @@ comptime SIZE = WARP_SIZE
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (WARP_SIZE, 1)
comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)
# ANCHOR: neighbor_difference_solution
def neighbor_difference[
- layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutType, MutAnyOrigin],
):
"""
Compute finite differences: output[i] = input[i+1] - input[i]
@@ -52,15 +54,16 @@ def neighbor_difference[
comptime SIZE_2 = 64
comptime BLOCKS_PER_GRID_2 = (2, 1)
comptime THREADS_PER_BLOCK_2 = (WARP_SIZE, 1)
-comptime layout_2 = Layout.row_major(SIZE_2)
+comptime layout_2 = row_major[SIZE_2]()
+comptime Layout2Type = type_of(layout_2)
# ANCHOR: moving_average_3_solution
def moving_average_3[
- layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, Layout2Type, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, Layout2Type, MutAnyOrigin],
):
"""
Compute 3-point moving average: output[i] = (input[i] + input[i+1] + input[i+2]) / 3
@@ -92,10 +95,10 @@ def moving_average_3[
# ANCHOR: broadcast_shuffle_coordination_solution
def broadcast_shuffle_coordination[
- layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutType, MutAnyOrigin],
):
"""
Combine broadcast() and shuffle_down() for advanced warp coordination.
@@ -107,11 +110,11 @@ def broadcast_shuffle_coordination[
if global_i < size:
# Step 1: Lane 0 computes block-local scaling factor
- var scale_factor: output.element_type = 0.0
+ var scale_factor: output.ElementType = 0.0
if lane == 0:
# Compute average of first 4 elements in this block's data
var block_start = block_idx.x * block_dim.x
- var sum: output.element_type = 0.0
+ var sum: output.ElementType = 0.0
for i in range(4):
if block_start + i < size:
sum += input[block_start + i]
@@ -138,10 +141,10 @@ def broadcast_shuffle_coordination[
# ANCHOR: basic_broadcast_solution
def basic_broadcast[
- layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutType, MutAnyOrigin],
):
"""
Basic broadcast: Lane 0 computes a block-local value, broadcasts it to all lanes.
@@ -152,10 +155,10 @@ def basic_broadcast[
if global_i < size:
# Step 1: Lane 0 computes special value (sum of first 4 elements in this block)
- var broadcast_value: output.element_type = 0.0
+ var broadcast_value: output.ElementType = 0.0
if lane == 0:
var block_start = block_idx.x * block_dim.x
- var sum: output.element_type = 0.0
+ var sum: output.ElementType = 0.0
for i in range(4):
if block_start + i < size:
sum += input[block_start + i]
@@ -173,10 +176,10 @@ def basic_broadcast[
# ANCHOR: conditional_broadcast_solution
def conditional_broadcast[
- layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutType, MutAnyOrigin],
):
"""
Conditional broadcast: Lane 0 makes a decision based on block-local data, broadcasts it to all lanes.
@@ -187,7 +190,7 @@ def conditional_broadcast[
if global_i < size:
# Step 1: Lane 0 analyzes block-local data and makes decision (find max of first 8 in block)
- var decision_value: output.element_type = 0.0
+ var decision_value: output.ElementType = 0.0
if lane == 0:
var block_start = block_idx.x * block_dim.x
decision_value = input[block_start] if block_start < size else 0.0
@@ -224,14 +227,14 @@ def test_neighbor_difference() raises:
for i in range(SIZE):
input_host[i] = Scalar[dtype](i * i)
- var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
- input_buf
+ var input_tensor = TileTensor[mut=False, dtype, LayoutType](
+ input_buf, layout
)
- var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
- output_buf
+ var output_tensor = TileTensor[mut=True, dtype, LayoutType](
+ output_buf, layout
)
- comptime kernel = neighbor_difference[layout, SIZE]
+ comptime kernel = neighbor_difference[SIZE]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
@@ -273,14 +276,14 @@ def test_moving_average() raises:
for i in range(1, SIZE_2):
input_host[i] = input_host[i - 1] + Scalar[dtype](i + 1)
- var input_tensor = LayoutTensor[dtype, layout_2, ImmutAnyOrigin](
- input_buf
+ var input_tensor = TileTensor[mut=False, dtype, Layout2Type](
+ input_buf, layout_2
)
- var output_tensor = LayoutTensor[dtype, layout_2, MutAnyOrigin](
- output_buf
+ var output_tensor = TileTensor[mut=True, dtype, Layout2Type](
+ output_buf, layout_2
)
- comptime kernel = moving_average_3[layout_2, SIZE_2]
+ comptime kernel = moving_average_3[SIZE_2]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
@@ -344,14 +347,14 @@ def test_broadcast_shuffle_coordination() raises:
else:
input_host[i] = Scalar[dtype](((i - 4) % 4) * 2 + 1)
- var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
- input_buf
+ var input_tensor = TileTensor[mut=False, dtype, LayoutType](
+ input_buf, layout
)
- var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
- output_buf
+ var output_tensor = TileTensor[mut=True, dtype, LayoutType](
+ output_buf, layout
)
- comptime kernel = broadcast_shuffle_coordination[layout, SIZE]
+ comptime kernel = broadcast_shuffle_coordination[SIZE]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
@@ -399,14 +402,14 @@ def test_basic_broadcast() raises:
for i in range(SIZE):
input_host[i] = Scalar[dtype](i + 1)
- var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
- input_buf
+ var input_tensor = TileTensor[mut=False, dtype, LayoutType](
+ input_buf, layout
)
- var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
- output_buf
+ var output_tensor = TileTensor[mut=True, dtype, LayoutType](
+ output_buf, layout
)
- comptime kernel = basic_broadcast[layout, SIZE]
+ comptime kernel = basic_broadcast[SIZE]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
@@ -460,14 +463,14 @@ def test_conditional_broadcast() raises:
for i in range(SIZE):
input_host[i] = test_values[i % len(test_values)]
- var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
- input_buf
+ var input_tensor = TileTensor[mut=False, dtype, LayoutType](
+ input_buf, layout
)
- var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
- output_buf
+ var output_tensor = TileTensor[mut=True, dtype, LayoutType](
+ output_buf, layout
)
- comptime kernel = conditional_broadcast[layout, SIZE]
+ comptime kernel = conditional_broadcast[SIZE]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
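
One detail worth pulling out of the p25 hunks: accumulators switch from `output.element_type` to `output.ElementType`, the associated type on the tensor parameter. A tiny hypothetical helper showing that spelling (only the `ElementType` access and the rebind are taken from the diff):

```mojo
from layout import TileTensor
from layout.tile_layout import row_major

comptime dtype = DType.float32
comptime layout = row_major[32]()
comptime LayoutType = type_of(layout)


fn first_four_sum(
    input: TileTensor[mut=False, dtype, LayoutType, MutAnyOrigin],
) -> Scalar[dtype]:
    # Accumulator declared through the tensor's associated element type,
    # the ElementType spelling this diff migrates to.
    var total: input.ElementType = 0.0
    for i in range(4):
        total += input[i]
    return rebind[Scalar[dtype]](total)
```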
diff --git a/solutions/p26/p26.mojo b/solutions/p26/p26.mojo
index ebf102b6..f3a68270 100644
--- a/solutions/p26/p26.mojo
+++ b/solutions/p26/p26.mojo
@@ -1,7 +1,8 @@
from std.gpu import thread_idx, block_idx, block_dim, lane_id
from std.gpu.host import DeviceContext
from std.gpu.primitives.warp import shuffle_xor, prefix_sum, WARP_SIZE
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
from std.sys import argv
from std.testing import assert_equal, assert_almost_equal
@@ -10,15 +11,16 @@ comptime SIZE = WARP_SIZE
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (WARP_SIZE, 1)
comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)
# ANCHOR: butterfly_pair_swap_solution
def butterfly_pair_swap[
- layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
"""
Basic butterfly pair swap: Exchange values between adjacent pairs using XOR pattern.
@@ -45,10 +47,10 @@ def butterfly_pair_swap[
# ANCHOR: butterfly_parallel_max_solution
def butterfly_parallel_max[
- layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
"""
Parallel maximum reduction using butterfly pattern.
@@ -78,15 +80,16 @@ def butterfly_parallel_max[
comptime SIZE_2 = 64
comptime BLOCKS_PER_GRID_2 = (2, 1)
comptime THREADS_PER_BLOCK_2 = (WARP_SIZE, 1)
-comptime layout_2 = Layout.row_major(SIZE_2)
+comptime layout_2 = row_major[SIZE_2]()
+comptime Layout2Type = type_of(layout_2)
# ANCHOR: butterfly_conditional_max_solution
def butterfly_conditional_max[
- layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, Layout2Type, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, Layout2Type, ImmutAnyOrigin],
):
"""
Conditional butterfly maximum: Perform butterfly max reduction, but only store result
@@ -123,10 +126,10 @@ def butterfly_conditional_max[
# ANCHOR: warp_inclusive_prefix_sum_solution
def warp_inclusive_prefix_sum[
- layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
"""
Inclusive prefix sum using warp primitive: Each thread gets sum of all elements up to and including its position.
@@ -166,10 +169,10 @@ def warp_inclusive_prefix_sum[
# ANCHOR: warp_partition_solution
def warp_partition[
- layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
pivot: Float32,
):
"""
@@ -237,14 +240,12 @@ def test_butterfly_pair_swap() raises:
for i in range(SIZE):
input_host[i] = Scalar[dtype](i)
- var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
- input_buf
- )
- var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
- output_buf
+ var input_tensor = TileTensor[mut=False, dtype, LayoutType](
+ input_buf, layout
)
+ var output_tensor = TileTensor(output_buf, layout)
- comptime kernel = butterfly_pair_swap[layout, SIZE]
+ comptime kernel = butterfly_pair_swap[SIZE]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
@@ -288,14 +289,12 @@ def test_butterfly_parallel_max() raises:
# Make sure we have a clear maximum
input_host[SIZE - 1] = 1000.0
- var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
- input_buf
- )
- var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
- output_buf
+ var input_tensor = TileTensor[mut=False, dtype, LayoutType](
+ input_buf, layout
)
+ var output_tensor = TileTensor(output_buf, layout)
- comptime kernel = butterfly_parallel_max[layout, SIZE]
+ comptime kernel = butterfly_parallel_max[SIZE]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
@@ -334,14 +333,12 @@ def test_butterfly_conditional_max() raises:
else:
input_host[i] = Scalar[dtype](i % 10)
- var input_tensor = LayoutTensor[dtype, layout_2, ImmutAnyOrigin](
- input_buf
- )
- var output_tensor = LayoutTensor[dtype, layout_2, MutAnyOrigin](
- output_buf
+ var input_tensor = TileTensor[mut=False, dtype, Layout2Type](
+ input_buf, layout_2
)
+ var output_tensor = TileTensor(output_buf, layout_2)
- comptime kernel = butterfly_conditional_max[layout_2, SIZE_2]
+ comptime kernel = butterfly_conditional_max[SIZE_2]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
@@ -394,14 +391,12 @@ def test_warp_inclusive_prefix_sum() raises:
for i in range(SIZE):
input_host[i] = Scalar[dtype](i + 1)
- var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
- input_buf
- )
- var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
- output_buf
+ var input_tensor = TileTensor[mut=False, dtype, LayoutType](
+ input_buf, layout
)
+ var output_tensor = TileTensor(output_buf, layout)
- comptime kernel = warp_inclusive_prefix_sum[layout, SIZE]
+ comptime kernel = warp_inclusive_prefix_sum[SIZE]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
@@ -461,14 +456,12 @@ def test_warp_partition() raises:
for i in range(SIZE):
input_host[i] = Scalar[dtype](test_values[i % len(test_values)])
- var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
- input_buf
- )
- var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
- output_buf
+ var input_tensor = TileTensor[mut=False, dtype, LayoutType](
+ input_buf, layout
)
+ var output_tensor = TileTensor(output_buf, layout)
- comptime kernel = warp_partition[layout, SIZE]
+ comptime kernel = warp_partition[SIZE]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
diff --git a/solutions/p27/p27.mojo b/solutions/p27/p27.mojo
index 7d2c6206..2f095125 100644
--- a/solutions/p27/p27.mojo
+++ b/solutions/p27/p27.mojo
@@ -4,7 +4,9 @@ from std.gpu.primitives.warp import WARP_SIZE
from std.gpu.primitives import block
from std.gpu.host import DeviceContext
from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from std.sys import argv
from std.testing import assert_equal
from std.math import floor
@@ -12,18 +14,20 @@ from std.math import floor
comptime SIZE = 128
comptime TPB = 128
comptime NUM_BINS = 8
-comptime in_layout = Layout.row_major(SIZE)
-comptime out_layout = Layout.row_major(1)
+comptime in_layout = row_major[SIZE]()
+comptime out_layout = row_major[1]()
comptime dtype = DType.float32
+comptime InLayout = type_of(in_layout)
+comptime OutLayout = type_of(out_layout)
# ANCHOR: block_sum_dot_product_solution
def block_sum_dot_product[
- in_layout: Layout, out_layout: Layout, tpb: Int
+ tpb: Int
](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
- b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
+ b: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
size: Int,
):
"""Dot product using block.sum() - convenience function like warp.sum()!
@@ -35,7 +39,7 @@ def block_sum_dot_product[
# Each thread computes partial product
var partial_product: Scalar[dtype] = 0.0
if global_i < size:
- # LayoutTensor indexing `[0]` returns the underlying SIMD value
+ # TileTensor indexing `[0]` returns the underlying SIMD value
partial_product = a[global_i][0] * b[global_i][0]
# The magic: block.sum() replaces 15+ lines of manual reduction!
@@ -54,22 +58,19 @@ def block_sum_dot_product[
# ANCHOR: traditional_dot_product_solution
def traditional_dot_product[
- in_layout: Layout, out_layout: Layout, tpb: Int
+ tpb: Int
](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
- b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
+ b: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
size: Int,
):
"""Traditional dot product using shared memory + barriers + tree reduction.
Educational but complex - shows the manual coordination needed."""
- var shared = LayoutTensor[
- dtype,
- Layout.row_major(tpb),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[tpb]())
var global_i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
@@ -96,16 +97,17 @@ def traditional_dot_product[
# ANCHOR_END: traditional_dot_product_solution
-comptime bin_layout = Layout.row_major(SIZE) # Max SIZE elements per bin
+comptime bin_layout = row_major[SIZE]() # Max SIZE elements per bin
+comptime BinLayout = type_of(bin_layout)
# ANCHOR: block_histogram_solution
def block_histogram_bin_extract[
- in_layout: Layout, bin_layout: Layout, out_layout: Layout, tpb: Int
+ tpb: Int
](
- input_data: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
- bin_output: LayoutTensor[dtype, bin_layout, MutAnyOrigin],
- count_output: LayoutTensor[DType.int32, out_layout, MutAnyOrigin],
+ input_data: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
+ bin_output: TileTensor[mut=True, dtype, BinLayout, MutAnyOrigin],
+ count_output: TileTensor[mut=True, DType.int32, OutLayout, MutAnyOrigin],
size: Int,
target_bin: Int,
num_bins: Int,
@@ -160,23 +162,24 @@ def block_histogram_bin_extract[
# ANCHOR_END: block_histogram_solution
-comptime vector_layout = Layout.row_major(SIZE) # For full vector output
+comptime vector_layout = row_major[SIZE]() # For full vector output
+comptime VectorLayout = type_of(vector_layout)
# ANCHOR: block_normalize_solution
def block_normalize_vector[
- in_layout: Layout, out_layout: Layout, tpb: Int
+ tpb: Int
](
- input_data: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
- output_data: LayoutTensor[dtype, out_layout, MutAnyOrigin],
+ input_data: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
+ output_data: TileTensor[mut=True, dtype, VectorLayout, MutAnyOrigin],
size: Int,
):
"""Vector mean normalization using block.sum() + block.broadcast() combination.
This demonstrates the complete block operations workflow:
- 1. Use block.sum() to compute sum of all elements (all → one)
+ 1. Use block.sum() to compute sum of all elements (all -> one)
2. Thread 0 computes mean = sum / size
- 3. Use block.broadcast() to share mean to all threads (one → all)
+ 3. Use block.broadcast() to share mean to all threads (one -> all)
4. Each thread normalizes: output[i] = input[i] / mean
"""
@@ -242,20 +245,18 @@ def main() raises:
print("TPB:", TPB)
print("Expected result:", expected)
- a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a)
- b_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](b_buf)
- out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out)
+ a_tensor = TileTensor[mut=False, dtype, InLayout](a, in_layout)
+ b_tensor = TileTensor[mut=False, dtype, InLayout](b_buf, in_layout)
+ out_tensor = TileTensor(out, out_layout)
# Traditional approach: works perfectly when size == TPB
- comptime kernel = traditional_dot_product[
- in_layout, out_layout, TPB
- ]
+ comptime kernel = traditional_dot_product[TPB]
ctx.enqueue_function[kernel, kernel](
out_tensor,
a_tensor,
b_tensor,
SIZE,
- grid_dim=(1, 1), # ✅ Single block works when size == TPB
+ grid_dim=(1, 1),
block_dim=(TPB, 1),
)
@@ -287,12 +288,12 @@ def main() raises:
print("TPB:", TPB)
print("Expected result:", expected)
- a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a)
- b_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](b_buf)
- out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out)
+ a_tensor = TileTensor[mut=False, dtype, InLayout](a, in_layout)
+ b_tensor = TileTensor[mut=False, dtype, InLayout](b_buf, in_layout)
+ out_tensor = TileTensor(out, out_layout)
# Block.sum(): Same result with dramatically simpler code!
- comptime kernel = block_sum_dot_product[in_layout, out_layout, TPB]
+ comptime kernel = block_sum_dot_product[TPB]
ctx.enqueue_function[kernel, kernel](
out_tensor,
a_tensor,
@@ -341,8 +342,8 @@ def main() raises:
print("...")
print()
- input_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](
- input_buf
+ input_tensor = TileTensor[mut=False, dtype, InLayout](
+ input_buf, in_layout
)
# Demonstrate histogram for each bin using block.prefix_sum()
@@ -363,17 +364,11 @@ def main() raises:
var bin_count = ctx.enqueue_create_buffer[DType.int32](1)
bin_count.enqueue_fill(0)
- var bin_tensor = LayoutTensor[dtype, bin_layout, MutAnyOrigin](
- bin_data
- )
- var count_tensor = LayoutTensor[
- DType.int32, out_layout, MutAnyOrigin
- ](bin_count)
+ var bin_tensor = TileTensor(bin_data, bin_layout)
+ var count_tensor = TileTensor(bin_count, out_layout)
# Execute histogram kernel for this specific bin
- comptime kernel = block_histogram_bin_extract[
- in_layout, bin_layout, out_layout, TPB
- ]
+ comptime kernel = block_histogram_bin_extract[TPB]
ctx.enqueue_function[kernel, kernel](
input_tensor,
bin_tensor,
@@ -439,17 +434,13 @@ def main() raises:
print("Mean value:", mean_value)
print()
- input_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](
- input_buf
+ input_tensor = TileTensor[mut=False, dtype, InLayout](
+ input_buf, in_layout
)
- var output_tensor = LayoutTensor[
- dtype, vector_layout, MutAnyOrigin
- ](output_buf)
+ var output_tensor = TileTensor(output_buf, vector_layout)
# Execute vector normalization kernel
- comptime kernel = block_normalize_vector[
- in_layout, vector_layout, TPB
- ]
+ comptime kernel = block_normalize_vector[TPB]
ctx.enqueue_function[kernel, kernel](
input_tensor,
output_tensor,
diff --git a/solutions/p28/p28.mojo b/solutions/p28/p28.mojo
index 047e8b94..a6438d95 100644
--- a/solutions/p28/p28.mojo
+++ b/solutions/p28/p28.mojo
@@ -1,7 +1,9 @@
from std.gpu import thread_idx, block_idx, block_dim, grid_dim, barrier
from std.gpu.host import DeviceContext
from std.gpu.memory import async_copy_wait_all, AddressSpace
-from layout import Layout, LayoutTensor
+from layout import Layout, LayoutTensor, TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from layout.layout_tensor import copy_dram_to_sram_async
from std.sys import argv, info
from std.testing import assert_equal, assert_almost_equal
@@ -17,16 +19,18 @@ comptime BLOCKS_PER_GRID_ASYNC = (
) // CONV_TILE_SIZE
comptime THREADS_PER_BLOCK_ASYNC = 256
comptime dtype = DType.float32
-comptime layout_async = Layout.row_major(VECTOR_SIZE)
+comptime layout_async = row_major[VECTOR_SIZE]()
+comptime AsyncLayoutType = type_of(layout_async)
+comptime kernel_layout = Layout.row_major(KERNEL_SIZE)
# ANCHOR: async_copy_overlap_convolution_solution
def async_copy_overlap_convolution[
- dtype: DType, layout: Layout
+ dtype: DType
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
- kernel: LayoutTensor[dtype, Layout.row_major(KERNEL_SIZE), ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, AsyncLayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, AsyncLayoutType, MutAnyOrigin],
+ kernel: LayoutTensor[dtype, kernel_layout, ImmutAnyOrigin],
):
"""Demonstrates async copy operations building on p14 patterns.
@@ -52,7 +56,7 @@ def async_copy_overlap_convolution[
# Phase 1: Launch async copy for input tile
# Note: tile() does NOT perform bounds checking - ensure valid tile bounds
- var input_tile = input.tile[CONV_TILE_SIZE](block_idx.x)
+ var input_tile = input.tile[CONV_TILE_SIZE](block_idx.x).to_layout_tensor()
# Use async copy with thread layout matching p14 pattern
comptime load_layout = Layout.row_major(THREADS_PER_BLOCK_ASYNC)
@@ -68,8 +72,8 @@ def async_copy_overlap_convolution[
# Phase 4: Compute convolution
var global_i = block_idx.x * CONV_TILE_SIZE + local_i
- if local_i < CONV_TILE_SIZE and global_i < output.shape[0]():
- var result: output.element_type = 0
+ if local_i < CONV_TILE_SIZE and global_i < Int(output.dim[0]()):
+ var result: output.ElementType = 0
# Simple convolution avoiding boundary issues
if local_i >= HALO_SIZE and local_i < CONV_TILE_SIZE - HALO_SIZE:
@@ -77,10 +81,12 @@ def async_copy_overlap_convolution[
for k in range(KERNEL_SIZE):
var input_idx = local_i + k - HALO_SIZE
if input_idx >= 0 and input_idx < CONV_TILE_SIZE:
- result += input_shared[input_idx] * kernel_shared[k]
+ result += rebind[Scalar[dtype]](
+ input_shared[input_idx]
+ ) * rebind[Scalar[dtype]](kernel_shared[k])
else:
# For boundary elements, just copy input (no convolution)
- result = input_shared[local_i]
+ result = rebind[Scalar[dtype]](input_shared[local_i])
output[global_i] = result
@@ -108,17 +114,17 @@ def test_async_copy_overlap_convolution() raises:
for i in range(KERNEL_SIZE):
kernel_host[i] = Scalar[dtype](i + 1)
- var input_tensor = LayoutTensor[dtype, layout_async, ImmutAnyOrigin](
- input_buf
+ var input_tensor = TileTensor[mut=False, dtype, AsyncLayoutType](
+ input_buf, layout_async
)
- var output_tensor = LayoutTensor[dtype, layout_async, MutAnyOrigin](
- output_buf
+ var output_tensor = TileTensor[mut=True, dtype, AsyncLayoutType](
+ output_buf, layout_async
+ )
+ var kernel_tensor = LayoutTensor[dtype, kernel_layout, ImmutAnyOrigin](
+ kernel_buf
)
- var kernel_tensor = LayoutTensor[
- mut=False, dtype, Layout.row_major(KERNEL_SIZE)
- ](kernel_buf)
- comptime kernel = async_copy_overlap_convolution[dtype, layout_async]
+ comptime kernel = async_copy_overlap_convolution[dtype]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
diff --git a/solutions/p29/p29.mojo b/solutions/p29/p29.mojo
index 31a9d858..9d92a2eb 100644
--- a/solutions/p29/p29.mojo
+++ b/solutions/p29/p29.mojo
@@ -6,7 +6,9 @@ from std.gpu.sync import (
)
from std.gpu.host import DeviceContext
from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from layout.layout_tensor import copy_dram_to_sram_async
from std.sys import argv, info
from std.testing import assert_true, assert_almost_equal
@@ -16,7 +18,8 @@ comptime SIZE = 1024 # Image size (1D for simplicity)
comptime BLOCKS_PER_GRID = (4, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)
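+# type_of captures the layout's concrete type so kernel signatures
+# can name it directly instead of taking a Layout parameter.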
# Multi-stage processing configuration
comptime STAGE1_THREADS = TPB // 2
@@ -25,11 +28,9 @@ comptime BLUR_RADIUS = 2
# ANCHOR: multi_stage_pipeline_solution
-def multi_stage_image_blur_pipeline[
- layout: Layout
-](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
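+# Mutability is now part of the TileTensor type itself (mut=True / mut=False).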
+def multi_stage_image_blur_pipeline(
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutType, MutAnyOrigin],
size: Int,
):
"""Multi-stage image blur pipeline with barrier coordination.
@@ -40,18 +41,12 @@ def multi_stage_image_blur_pipeline[
"""
# Shared memory buffers for pipeline stages
- var input_shared = LayoutTensor[
- dtype,
- Layout.row_major(TPB),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var blur_shared = LayoutTensor[
- dtype,
- Layout.row_major(TPB),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
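+    # stack_allocation is now a free function: dtype and address space are
+    # parameters, and the layout is passed as a value.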
+ var input_shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB]())
+ var blur_shared = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB]())
var global_i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
@@ -133,11 +128,9 @@ comptime BUFFER_COUNT = 2
# ANCHOR: double_buffered_stencil_solution
-def double_buffered_stencil_computation[
- layout: Layout
-](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+def double_buffered_stencil_computation(
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, LayoutType, MutAnyOrigin],
size: Int,
):
"""Double-buffered stencil computation with memory barrier coordination.
@@ -147,38 +140,23 @@ def double_buffered_stencil_computation[
"""
# Double-buffering: Two shared memory buffers
- var buffer_A = LayoutTensor[
- dtype,
- Layout.row_major(TPB),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var buffer_B = LayoutTensor[
- dtype,
- Layout.row_major(TPB),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var buffer_A = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB]())
+ var buffer_B = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[TPB]())
# Memory barriers for coordinating buffer swaps
- var init_barrier = LayoutTensor[
- DType.uint64,
- Layout.row_major(1),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var iter_barrier = LayoutTensor[
- DType.uint64,
- Layout.row_major(1),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
- var final_barrier = LayoutTensor[
- DType.uint64,
- Layout.row_major(1),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
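+    # Each barrier is a single uint64 word in shared memory, allocated
+    # the same way as the data buffers above.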
+ var init_barrier = stack_allocation[
+ dtype=DType.uint64, address_space=AddressSpace.SHARED
+ ](row_major[1]())
+ var iter_barrier = stack_allocation[
+ dtype=DType.uint64, address_space=AddressSpace.SHARED
+ ](row_major[1]())
+ var final_barrier = stack_allocation[
+ dtype=DType.uint64, address_space=AddressSpace.SHARED
+ ](row_major[1]())
var global_i = block_dim.x * block_idx.x + thread_idx.x
var local_i = thread_idx.x
@@ -284,11 +262,11 @@ def test_multi_stage_pipeline() raises:
# Create a simple wave pattern for blurring
inp_host[i] = Scalar[dtype](i % 10) + Scalar[dtype](i) / 100.0
- # Create LayoutTensors
- var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
- var inp_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp)
+ # Create TileTensors
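+    # TileTensor's constructor takes the buffer plus the layout value;
+    # the layout's type is the comptime LayoutType parameter.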
+ var out_tensor = TileTensor[mut=True, dtype, LayoutType](out, layout)
+ var inp_tensor = TileTensor[mut=False, dtype, LayoutType](inp, layout)
- comptime kernel = multi_stage_image_blur_pipeline[layout]
+ comptime kernel = multi_stage_image_blur_pipeline
ctx.enqueue_function[kernel, kernel](
out_tensor,
inp_tensor,
@@ -346,11 +324,11 @@ def test_double_buffered_stencil() raises:
# Create a step pattern that will be smoothed by stencil
inp_host[i] = Scalar[dtype](1.0 if i % 20 < 10 else 0.0)
- # Create LayoutTensors for Puzzle 26B
- var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
- var inp_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp)
+    # Create TileTensors for the double-buffered stencil test
+ var out_tensor = TileTensor[mut=True, dtype, LayoutType](out, layout)
+ var inp_tensor = TileTensor[mut=False, dtype, LayoutType](inp, layout)
- comptime kernel = double_buffered_stencil_computation[layout]
+ comptime kernel = double_buffered_stencil_computation
ctx.enqueue_function[kernel, kernel](
out_tensor,
inp_tensor,
diff --git a/solutions/p33/p33.mojo b/solutions/p33/p33.mojo
index 0e57b8fd..6d6c4a10 100644
--- a/solutions/p33/p33.mojo
+++ b/solutions/p33/p33.mojo
@@ -1,6 +1,7 @@
from std.gpu import thread_idx, block_idx, block_dim, barrier, WARP_SIZE
from std.gpu.host import DeviceContext
-from layout import Layout, LayoutTensor
+from layout import Layout, LayoutTensor, TileTensor
+from layout.tile_layout import row_major
from layout.tensor_core import TensorCore
from layout.layout_tensor import copy_dram_to_sram_async
from std.gpu.memory import async_copy_wait_all, AddressSpace
@@ -10,7 +11,8 @@ from std.testing import assert_equal, assert_almost_equal
comptime dtype = DType.float32
comptime SIZE = 1024
-comptime layout = Layout.row_major(SIZE, SIZE)
+comptime layout = row_major[SIZE, SIZE]()
+comptime LayoutType = type_of(layout)
comptime BLOCK_DIM_COUNT = 2
comptime TILE_SIZE = 32
@@ -23,11 +25,11 @@ comptime THREADS_PER_BLOCK_TILED = (TILE_SIZE, TILE_SIZE)
# ANCHOR: matmul_idiomatic_tiled_solution
def matmul_idiomatic_tiled[
- layout: Layout, size: Int
+ size: Int
](
- output: LayoutTensor[dtype, layout, MutAnyOrigin],
- a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
- b: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+ a: TileTensor[mut=False, dtype, LayoutType, MutAnyOrigin],
+ b: TileTensor[mut=False, dtype, LayoutType, MutAnyOrigin],
):
# Use block_dim to get actual tile size dynamically
var tile_size_x = block_dim.x
@@ -53,7 +55,7 @@ def matmul_idiomatic_tiled[
address_space=AddressSpace.SHARED,
].stack_allocation()
- var acc: output.element_type = 0
+ var acc: output.ElementType = 0
comptime load_a_layout = Layout.row_major(1, TILE_SIZE) # Coalesced loading
comptime load_b_layout = Layout.row_major(1, TILE_SIZE) # Coalesced loading
@@ -62,8 +64,12 @@ def matmul_idiomatic_tiled[
for idx in range(size // TILE_SIZE): # Iterate over K tiles
# Get tiles from A and B matrices
- var a_tile = a.tile[TILE_SIZE, TILE_SIZE](block_idx.y, idx)
- var b_tile = b.tile[TILE_SIZE, TILE_SIZE](idx, block_idx.x)
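+        # tile() yields TileTensor views; convert them to LayoutTensor for
+        # the async DRAM-to-SRAM copy below.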
+ var a_tile = a.tile[TILE_SIZE, TILE_SIZE](
+ block_idx.y, idx
+ ).to_layout_tensor()
+ var b_tile = b.tile[TILE_SIZE, TILE_SIZE](
+ idx, block_idx.x
+ ).to_layout_tensor()
# Asynchronously copy tiles to shared memory with consistent orientation
copy_dram_to_sram_async[
@@ -87,7 +93,9 @@ def matmul_idiomatic_tiled[
and local_col < TILE_SIZE
and k < TILE_SIZE
):
- acc += a_shared[local_row, k] * b_shared[k, local_col]
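+                # rebind pins both shared-memory operands to Scalar[dtype]
+                # so the product matches acc's element type.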
+ acc += rebind[Scalar[dtype]](a_shared[local_row, k]) * rebind[
+ Scalar[dtype]
+ ](b_shared[k, local_col])
barrier()
@@ -289,19 +297,29 @@ def main() raises:
inp1_host[i * SIZE + k] * inp2_host[k * SIZE + j]
)
# Create layout tensors
- var out_tensor_core_layout = LayoutTensor[dtype, layout](
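+    # The tensor-core kernel still takes LayoutTensor, so keep a
+    # Layout-based view of the same buffers alongside the TileTensors.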
+ comptime old_layout = Layout.row_major(SIZE, SIZE)
+ var out_tensor_core_layout = LayoutTensor[dtype, old_layout](
out_tensor_core.unsafe_ptr()
)
- var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp1)
- var b_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp2)
+ var a_tensor = LayoutTensor[dtype, old_layout, ImmutAnyOrigin](inp1)
+ var b_tensor = LayoutTensor[dtype, old_layout, ImmutAnyOrigin](inp2)
+
+ # Create TileTensors for the tiled kernel
+ var out_tile_tensor = TileTensor(out_tensor_core, layout)
+ var a_tile_tensor = TileTensor[mut=False, dtype, LayoutType](
+ inp1, layout
+ )
+ var b_tile_tensor = TileTensor[mut=False, dtype, LayoutType](
+ inp2, layout
+ )
if mode == "--tensor-core":
print("\n=== Running ACTUAL Tensor Core Matrix Multiplication ===")
comptime kernel = tensor_core_matrix_multiplication[
dtype,
- layout,
- layout,
- layout,
+ old_layout,
+ old_layout,
+ old_layout,
BM,
BN,
BK,
@@ -328,16 +346,14 @@ def main() raises:
# Create separate buffer for tiled result
out_tiled = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
out_tiled.enqueue_fill(0)
- out_tiled_layout = LayoutTensor[dtype, layout](
- out_tiled.unsafe_ptr()
- )
+ out_tiled_layout = TileTensor(out_tiled, layout)
# Run idiomatic tiled version with proper 2D block configuration
- comptime kernel = matmul_idiomatic_tiled[layout, SIZE]
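+        # The layout parameter is gone; it now travels with the TileTensor
+        # argument types.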
+ comptime kernel = matmul_idiomatic_tiled[SIZE]
ctx.enqueue_function[kernel, kernel](
out_tiled_layout,
- a_tensor,
- b_tensor,
+ a_tile_tensor,
+ b_tile_tensor,
grid_dim=BLOCK_PER_GRID_TILED,
block_dim=THREADS_PER_BLOCK_TILED,
)
@@ -356,9 +372,9 @@ def main() raises:
print("\n--- Test 1: Tensor Core vs CPU Reference ---")
comptime kernel = tensor_core_matrix_multiplication[
dtype,
- layout,
- layout,
- layout,
+ old_layout,
+ old_layout,
+ old_layout,
BM,
BN,
BK,
@@ -435,15 +451,13 @@ def main() raises:
print("\n--- Test 2: Idiomatic Tiled vs CPU Reference ---")
out_tiled = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
out_tiled.enqueue_fill(0)
- out_tiled_layout = LayoutTensor[dtype, layout](
- out_tiled.unsafe_ptr()
- )
+ out_tiled_layout = TileTensor(out_tiled, layout)
- comptime kernel2 = matmul_idiomatic_tiled[layout, SIZE]
+ comptime kernel2 = matmul_idiomatic_tiled[SIZE]
ctx.enqueue_function[kernel2, kernel2](
out_tiled_layout,
- a_tensor,
- b_tensor,
+ a_tile_tensor,
+ b_tile_tensor,
grid_dim=BLOCK_PER_GRID_TILED,
block_dim=THREADS_PER_BLOCK_TILED,
)
diff --git a/solutions/p34/p34.mojo b/solutions/p34/p34.mojo
index e2a62a55..4aa02af4 100644
--- a/solutions/p34/p34.mojo
+++ b/solutions/p34/p34.mojo
@@ -8,7 +8,9 @@ from std.gpu.primitives.cluster import (
elect_one_sync,
)
from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
from std.sys import argv
from std.testing import assert_equal, assert_almost_equal, assert_true
@@ -16,16 +18,20 @@ comptime SIZE = 1024
comptime TPB = 256
comptime CLUSTER_SIZE = 4
comptime dtype = DType.float32
-comptime in_layout = Layout.row_major(SIZE)
-comptime out_layout = Layout.row_major(1)
+comptime in_layout = row_major[SIZE]()
+comptime out_layout = row_major[1]()
+comptime InLayout = type_of(in_layout)
+comptime OutLayout = type_of(out_layout)
+comptime cluster_layout = row_major[CLUSTER_SIZE]()
+comptime ClusterLayout = type_of(cluster_layout)
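+# cluster_layout provides one result slot per block in the cluster.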
# ANCHOR: cluster_coordination_basics_solution
def cluster_coordination_basics[
- in_layout: Layout, out_layout: Layout, tpb: Int
+ tpb: Int
](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- input: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, ClusterLayout, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, InLayout, MutAnyOrigin],
size: Int,
):
"""Real cluster coordination using SM90+ cluster APIs."""
@@ -36,12 +42,9 @@ def cluster_coordination_basics[
var my_block_rank = Int(block_rank_in_cluster())
var block_id = block_idx.x
- var shared_data = LayoutTensor[
- dtype,
- Layout.row_major(tpb),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var shared_data = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[tpb]())
# FIX: Use block_idx.x for data distribution instead of cluster rank
# Each block should process different portions of the data
@@ -77,13 +80,11 @@ def cluster_coordination_basics[
# ANCHOR: cluster_collective_operations_solution
def cluster_collective_operations[
- in_layout: Layout, out_layout: Layout, tpb: Int
+ tpb: Int
](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- input: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
- temp_storage: LayoutTensor[
- dtype, Layout.row_major(CLUSTER_SIZE), MutAnyOrigin
- ],
+ output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, InLayout, MutAnyOrigin],
+ temp_storage: TileTensor[mut=True, dtype, ClusterLayout, MutAnyOrigin],
size: Int,
):
"""Cluster-wide collective operations using real cluster APIs."""
@@ -98,12 +99,9 @@ def cluster_collective_operations[
my_value = input[global_i][0]
# Block-level reduction using shared memory
- var shared_mem = LayoutTensor[
- dtype,
- Layout.row_major(tpb),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var shared_mem = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[tpb]())
shared_mem[local_i] = my_value
barrier()
@@ -135,10 +133,10 @@ def cluster_collective_operations[
# ANCHOR: advanced_cluster_patterns_solution
def advanced_cluster_patterns[
- in_layout: Layout, out_layout: Layout, tpb: Int
+ tpb: Int
](
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
- input: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
+ output: TileTensor[mut=True, dtype, ClusterLayout, MutAnyOrigin],
+ input: TileTensor[mut=False, dtype, InLayout, MutAnyOrigin],
size: Int,
):
"""Advanced cluster programming using cluster masks and relaxed synchronization.
@@ -148,12 +146,9 @@ def advanced_cluster_patterns[
var my_block_rank = Int(block_rank_in_cluster())
var block_id = block_idx.x
- var shared_data = LayoutTensor[
- dtype,
- Layout.row_major(tpb),
- MutAnyOrigin,
- address_space=AddressSpace.SHARED,
- ].stack_allocation()
+ var shared_data = stack_allocation[
+ dtype=dtype, address_space=AddressSpace.SHARED
+ ](row_major[tpb]())
# Compute cluster mask for advanced coordination
# base_mask = cluster_mask_base() # Requires cluster_shape parameter
@@ -216,16 +211,14 @@ def main() raises:
for i in range(SIZE):
input_host[i] = Scalar[dtype](i % 10) * 0.1
- input_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](
- input_buf
+ input_tensor = TileTensor[mut=False, dtype, InLayout](
+ input_buf, in_layout
+ )
+ output_tensor = TileTensor[mut=True, dtype, ClusterLayout](
+ output_buf, cluster_layout
)
- output_tensor = LayoutTensor[
- dtype, Layout.row_major(CLUSTER_SIZE), MutAnyOrigin
- ](output_buf)
- comptime kernel = cluster_coordination_basics[
- in_layout, Layout.row_major(CLUSTER_SIZE), TPB
- ]
+ comptime kernel = cluster_coordination_basics[TPB]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
@@ -280,19 +273,17 @@ def main() raises:
print("Expected sum:", expected_sum)
- input_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](
- input_buf
+ input_tensor = TileTensor[mut=False, dtype, InLayout](
+ input_buf, in_layout
+ )
+ var output_tensor = TileTensor[mut=True, dtype, OutLayout](
+ output_buf, out_layout
)
- var output_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](
- output_buf
+ var temp_tensor = TileTensor[mut=True, dtype, ClusterLayout](
+ temp_buf, cluster_layout
)
- var temp_tensor = LayoutTensor[
- dtype, Layout.row_major(CLUSTER_SIZE), MutAnyOrigin
- ](temp_buf)
- comptime kernel = cluster_collective_operations[
- in_layout, out_layout, TPB
- ]
+ comptime kernel = cluster_collective_operations[TPB]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,
@@ -332,16 +323,14 @@ def main() raises:
Scalar[dtype](i % 50) * 0.02
) # Pattern for testing
- input_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](
- input_buf
+ input_tensor = TileTensor[mut=False, dtype, InLayout](
+ input_buf, in_layout
+ )
+ output_tensor = TileTensor[mut=True, dtype, ClusterLayout](
+ output_buf, cluster_layout
)
- output_tensor = LayoutTensor[
- dtype, Layout.row_major(CLUSTER_SIZE), MutAnyOrigin
- ](output_buf)
- comptime kernel = advanced_cluster_patterns[
- in_layout, Layout.row_major(CLUSTER_SIZE), TPB
- ]
+ comptime kernel = advanced_cluster_patterns[TPB]
ctx.enqueue_function[kernel, kernel](
output_tensor,
input_tensor,