
Commit f410b78

Updates for puzzle 23 (#226)
Additional fixes following #203 for puzzle 23.
1 parent 34402ff commit f410b78

4 files changed

Lines changed: 13 additions & 13 deletions

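All of the changes here apply one pattern: the tensor `load`, `aligned_load`, and `store` methods now take a single coordinate built with `Index` instead of the old separate `(idx, 0)` arguments. A minimal before/after sketch (assuming `Index` is imported from Mojo's `utils` package, and borrowing `output`, `simd_width`, `idx`, and `result` from the hunks below):

```mojo
from utils import Index  # Constructs the coordinate value the updated calls expect

# Before (the form removed in this commit):
#   output.store[simd_width](idx, 0, result)

# After: one Index(...) coordinate, then the SIMD value to write
output.store[simd_width](Index(idx), result)
```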

book/src/puzzle_23/elementwise.md

Lines changed: 2 additions & 2 deletions
@@ -90,7 +90,7 @@ This performs element-wise addition across the entire SIMD vector (if supported)
 ### 5. **SIMD storing**
 
 ```mojo
-output.store[simd_width](idx, 0, result) # Store 4 results at once (GPU-dependent)
+output.store[simd_width](Index(idx), result) # Store 4 results at once (GPU-dependent)
 ```
 
 Writes the entire SIMD vector back to memory in one operation.
@@ -237,7 +237,7 @@ idx = indices[0] # Linear index: 0, 4, 8, 12...
 a_simd = a.aligned_load[simd_width](Index(idx)) # Load: [a[0:4], a[4:8], a[8:12]...] (4 elements per load)
 b_simd = b.aligned_load[simd_width](Index(idx)) # Load: [b[0:4], b[4:8], b[8:12]...] (4 elements per load)
 ret = a_simd + b_simd # SIMD: 4 additions in parallel (GPU-dependent)
-output.store[simd_width](Index(global_start), ret) # Store: 4 results simultaneously (GPU-dependent)
+output.store[simd_width](Index(idx), ret) # Store: 4 results simultaneously (GPU-dependent)
 ```
 
 **Execution Hierarchy Visualization:**

book/src/puzzle_23/gpu-thread-vs-simd.md

Lines changed: 3 additions & 3 deletions
@@ -38,10 +38,10 @@ Each GPU thread can process multiple data elements simultaneously using **SIMD (
 
 ```mojo
 # Within one GPU thread:
-a_simd = a.load[simd_width](idx, 0) # Load 4 floats simultaneously
-b_simd = b.load[simd_width](idx, 0) # Load 4 floats simultaneously
+a_simd = a.load[simd_width](Index(idx)) # Load 4 floats simultaneously
+b_simd = b.load[simd_width](Index(idx)) # Load 4 floats simultaneously
 result = a_simd + b_simd # Add 4 pairs simultaneously
-output.store[simd_width](idx, 0, result) # Store 4 results simultaneously
+output.store[simd_width](Index(idx), result) # Store 4 results simultaneously
 ```
 
 ## Pattern comparison and thread-to-work mapping
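On the thread-to-work mapping this section compares: each thread's `idx` is typically derived from its global thread id so that consecutive threads own consecutive SIMD-width slices. A sketch of that mapping (the index math here is illustrative and not part of this commit):

```mojo
from gpu import block_dim, block_idx, thread_idx

# Global thread id: which thread am I across the whole grid?
tid = block_idx.x * block_dim.x + thread_idx.x
# Each thread owns simd_width consecutive elements, so scale by simd_width.
idx = tid * simd_width  # Thread 0 -> 0, thread 1 -> 4, thread 2 -> 8, ... (for simd_width = 4)
```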

book/src/puzzle_23/tile.md

Lines changed: 6 additions & 6 deletions
@@ -84,10 +84,10 @@ This `@parameter` loop unrolls at compile-time for optimal performance.
 ### 4. **SIMD operations within tile elements**
 
 ```mojo
-a_vec = a_tile.load[simd_width](i, 0) # Load from position i in tile
-b_vec = b_tile.load[simd_width](i, 0) # Load from position i in tile
+a_vec = a_tile.load[simd_width](Index(i)) # Load from position i in tile
+b_vec = b_tile.load[simd_width](Index(i)) # Load from position i in tile
 result = a_vec + b_vec # SIMD addition (GPU-dependent width)
-out_tile.store[simd_width](i, 0, result) # Store to position i in tile
+out_tile.store[simd_width](Index(i), result) # Store to position i in tile
 ```
 
 ### 5. **Thread configuration difference**
@@ -232,10 +232,10 @@ Tile 31 (thread 31): [992, 993, ..., 1023] ← Elements 992-1023
 ```mojo
 @parameter
 for i in range(tile_size):
-    a_vec = a_tile.load[simd_width](i, 0)
-    b_vec = b_tile.load[simd_width](i, 0)
+    a_vec = a_tile.load[simd_width](Index(i))
+    b_vec = b_tile.load[simd_width](Index(i))
     ret = a_vec + b_vec
-    out_tile.store[simd_width](i, 0, ret)
+    out_tile.store[simd_width](Index(i), ret)
 ```
 
 **Why sequential processing?**
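For orientation, the tile views iterated above are created once per thread before the loop; each thread then walks its own slice sequentially while thousands of tiles proceed in parallel. A sketch following the puzzle's pattern (the `tile` method call and the `tid` name are assumptions, not shown in this diff):

```mojo
# Each thread takes a contiguous tile_size-element view of the full tensors,
# then processes it element by element in the unrolled loop shown above.
a_tile = a.tile[tile_size](tid)         # Covers elements [tid * tile_size, (tid + 1) * tile_size)
b_tile = b.tile[tile_size](tid)
out_tile = output.tile[tile_size](tid)  # The matching slice of the output tensor
```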

book/src/puzzle_23/vectorize.md

Lines changed: 2 additions & 2 deletions
@@ -68,8 +68,8 @@ This calculates the exact global position for each SIMD vector within the chunk.
 ### 3. **Direct tensor access**
 
 ```mojo
-a_vec = a.load[simd_width](global_start, 0) # Load from global tensor
-output.store[simd_width](global_start, 0, ret) # Store to global tensor
+a_vec = a.aligned_load[simd_width](Index(global_start)) # Load from global tensor
+output.store[simd_width](Index(global_start), ret) # Store to global tensor
 ```
 
 Note: Access the original tensors, not the tile views.
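That note flags the easy mistake in this pattern: inside the per-chunk body, the local index must be translated back to a global offset before touching `a` or `output`. A rough sketch of the surrounding structure (assuming the standard-library `vectorize` helper and the names `chunk_start` and `chunk_size`, none of which appear in this hunk):

```mojo
from algorithm import vectorize
from utils import Index

@parameter
fn body[width: Int](i: Int):
    # i is chunk-local; translate it to a position in the full tensor.
    global_start = chunk_start + i
    a_vec = a.aligned_load[width](Index(global_start))  # Read the original tensor, not a tile view
    b_vec = b.aligned_load[width](Index(global_start))
    output.store[width](Index(global_start), a_vec + b_vec)  # Write back at the same global offset

vectorize[body, simd_width](chunk_size)  # Invokes body over the chunk, simd_width lanes at a time
```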
