@@ -6,7 +6,7 @@ modern GPU programming abstracts low-level details while preserving high
 performance.

 **Key insight:** _The
-[elementwise](https://docs.modular.com/mojo/std/algorithm/functional/elementwise/)
+[elementwise](https://mojolang.org/docs/std/algorithm/functional/elementwise/)
 function automatically handles thread management, SIMD vectorization, and memory
 coalescing for you._

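To make the "SIMD vectorization" part of the key insight concrete, here is a minimal CPU-only sketch of how a 1D range is carved into SIMD-width chunks, with one invocation per chunk. It deliberately does not call `elementwise`. The values `size = 16` and `simd_width = 4` are illustrative assumptions, since the real width is target-dependent.

```mojo
def main():
    # Illustrative values only (assumptions, not taken from the puzzle):
    # the puzzle uses SIZE = 1024 and a target-dependent SIMD width.
    var size = 16
    var simd_width = 4

    # Each iteration stands in for one invocation of the nested function:
    # `start` plays the role of the index it receives, and that invocation
    # is responsible for `simd_width` consecutive elements from there.
    for start in range(0, size, simd_width):
        print("invocation at", start, "covers elements", start, "to", start + simd_width - 1)
```

With these values the loop visits the start indices 0, 4, 8, and 12, so four invocations cover all sixteen elements.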
@@ -26,13 +26,24 @@ The mathematical operation is simple element-wise addition:
 The implementation covers fundamental patterns applicable to all GPU functional
 programming in Mojo.

+**Where to start:** You begin from the `elementwise` template in the problem file
+— there is no manual shared memory or thread-index math here. The key shift from
+earlier puzzles is that each invocation of your nested function processes a whole
+SIMD vector, not a single element. That's why you load and store with
+`aligned_load[simd_width]` / `store[simd_width]` (vectorized) instead of indexing
+one scalar at a time.
+
 ## Configuration

 - Vector size: `SIZE = 1024`
 - Data type: `DType.float32`
 - SIMD width: Target-dependent (determined by GPU architecture and data type)
 - Layout: `row_major[SIZE]()` (1D row-major)

+> **Scope:** This is a single-kernel, per-element operation. The `elementwise`
+> abstraction handles thread, block, and grid configuration for you — there is no
+> cross-thread or cross-block communication to reason about here.
+
 ## Code to complete

 ```mojo
@@ -53,7 +64,9 @@ The `elementwise` function expects a nested function with this exact signature:
 ```mojo
 @parameter
 @always_inline
-def your_function[simd_width: Int, rank: Int](indices: IndexList[rank]) capturing -> None:
+def your_function[
+    simd_width: Int, alignment: Int = align_of[dtype]()
+](indices: Coord) capturing -> None:
     # Your implementation here
 ```

@@ -65,13 +78,13 @@ def your_function[simd_width: Int, rank: Int](indices: IndexList[rank]) capturin
   kernels
 - `capturing`: Allows access to variables from the outer scope (the input/output
   tensors)
-- `IndexList[rank]`: Provides multi-dimensional indexing (rank=1 for vectors,
-  rank=2 for matrices)
+- `Coord`: Carries the per-dimension indices for the current SIMD chunk; use
+  `indices[0]` for 1D operations

 ### 2. **Index extraction and SIMD processing**

 ```mojo
-idx = indices[0] # Extract linear index for 1D operations
+idx = Int(indices[0].value()) # Extract linear index for 1D operations
 ```

 This `idx` represents the **starting position** for a SIMD vector, not a single
@@ -239,25 +252,27 @@ elementwise[add_function, simd_width, target="gpu"](size, ctx)
 ```mojo
 @parameter
 @always_inline
-def add[simd_width: Int, rank: Int](indices: IndexList[rank]) capturing -> None:
+def add[
+    simd_width: Int, alignment: Int = align_of[dtype]()
+](indices: Coord) capturing -> None:
 ```

 **Parameter Analysis:**

 - **`@parameter`**: This decorator provides **compile-time specialization**. The
-  function is generated separately for each unique `simd_width` and `rank`,
-  allowing aggressive optimization.
+  function is generated separately for each unique `simd_width`, allowing
+  aggressive optimization.
 - **`@always_inline`**: Critical for GPU performance - eliminates function call
   overhead by embedding the code directly into the kernel.
 - **`capturing`**: Enables **lexical scoping** - the inner function can access
   variables from the outer scope without explicit parameter passing.
-- **`IndexList[rank]`**: Provides **dimension-agnostic indexing** - the same
-  pattern works for 1D vectors, 2D matrices, 3D tensors, etc.
+- **`Coord`**: Carries the per-dimension indices for the SIMD chunk being
+  processed; `indices[0]` is the linear start position for 1D operations.

 ### 3. **SIMD execution model deep dive**

 ```mojo
-idx = indices[0] # Linear index: 0, 4, 8, 12... (GPU-dependent spacing)
+idx = Int(indices[0].value()) # Linear index: 0, 4, 8, 12... (GPU-dependent spacing)
 a_simd = a.aligned_load[simd_width](Index(idx)) # Load: [a[0:4], a[4:8], a[8:12]...] (4 elements per load)
 b_simd = b.aligned_load[simd_width](Index(idx)) # Load: [b[0:4], b[4:8], b[8:12]...] (4 elements per load)
 ret = a_simd + b_simd # SIMD: 4 additions in parallel (GPU-dependent)