[EDU] Improve cross-puzzle consistency

dunnoconnor · modularbot · commit 9fb9c25e52b8 · 2026-06-25T15:52:45.000Z
MODULAR_ORIG_COMMIT_REV_ID: 233276c087fd504f86549164fe577e6b2837c7c2
diff --git a/book/src/puzzle_18/puzzle_18.md b/book/src/puzzle_18/puzzle_18.md
@@ -37,6 +37,11 @@ Our GPU implementation uses parallel reduction for both finding the maximum
 value and computing the sum of exponentials, making it highly efficient for
 large vectors.
 
+> **Scope:** This puzzle runs in a single block (grid is \\(1 \times 1\\)). Both
+> reductions use shared memory and `barrier()` within that one block. Because the
+> whole vector fits in a single block's threads, no cross-block communication is
+> needed.
+
 ## Key concepts
 
 - Parallel reduction for efficient maximum and sum calculations
diff --git a/book/src/puzzle_19/puzzle_19.md b/book/src/puzzle_19/puzzle_19.md
@@ -31,6 +31,11 @@ The computation involves three main steps:
 3. **Weighted Sum**: Combine value vectors using attention weights to produce
    the final output
 
+> **Scope:** This puzzle composes existing kernels (transpose, tiled matmul,
+> softmax) into one attention op for a single query vector. Cross-block
+> coordination lives inside each reused kernel — your focus is the transpose
+> kernel and the host-side orchestration that connects the pieces.
+
 ## Understanding attention: a step-by-step breakdown
 
 Think of attention as a **smart lookup mechanism**. Given a query (what you're
@@ -71,15 +76,22 @@ Step 3: Weights(1,16) @ V(16,16) → Output(1,16) → reshape → Output(16,)
 
 **Key insight**: We reshape the query vector \\(Q\\) from shape \\((16,)\\) to
 \\((1,16)\\) so we can use matrix multiplication instead of manual dot products.
-This allows us to leverage the highly optimized tiled matmul kernel from Puzzle
-18!
+This allows us to leverage the highly optimized
+[tiled matmul kernel from Puzzle 16](../puzzle_16/tiled.md)!
+
+In Mojo, you reshape a `LayoutTensor` by calling `reshape[new_layout]()` with the
+target layout as a compile-time parameter (for example,
+`q_tensor.reshape[layout_q_2d]()`); the data is reinterpreted under the new layout,
+not copied. You'll see this idiom in the orchestration code below.
 
 Our GPU implementation
-**reuses and combines optimized kernels from previous puzzles**:
+**reuses and combines optimized kernels, mostly from previous puzzles**:
 
 - **[Tiled matrix multiplication from Puzzle 16](../puzzle_16/puzzle_16.md)**
   for efficient \\(Q \cdot K^T\\) and \\(\text{weights} \cdot V\\) operations
-- **Shared memory transpose** for computing \\(K^T\\) efficiently
+- **[Shared memory transpose](#1-implement-the-transpose-kernel)** for computing
+  \\(K^T\\) efficiently — this is the one kernel you implement in this puzzle
+  (see below)
 - **[Parallel softmax from Puzzle 18](../puzzle_18/puzzle_18.md)** for
   numerically stable attention weight computation
 
@@ -88,6 +100,13 @@ Our GPU implementation
 > Rather than writing everything from scratch, we leverage the
 > `matmul_idiomatic_tiled` from Puzzle 16 and `softmax_kernel` from Puzzle 18,
 > showcasing the power of modular GPU kernel design.
+>
+> **Reuse checkpoint**: Before continuing, revisit the kernels you're about to
+> compose — `matmul_idiomatic_tiled` in
+> [Puzzle 16's tiled solution](../puzzle_16/tiled.md) and `softmax_kernel` in
+> [Puzzle 18](../puzzle_18/puzzle_18.md). Treat this puzzle as a
+> composition/refactor exercise: your job is to wire these existing building
+> blocks together (plus the transpose you write here), not to reinvent them.
 
 ## Key concepts
 
@@ -199,6 +218,22 @@ kernel in the Mojo file using shared memory.
 
 ### 2. Orchestrate the attention
 
+So far you've written a single kernel. Attention, however, is a *pipeline* of
+kernels: the transpose you just implemented, the tiled matmul from Puzzle 16, the
+softmax from Puzzle 18, and a second matmul. **Orchestration** is the host-side
+code that runs these kernels in sequence and wires the output of each step into
+the input of the next:
+
+```text
+K → transpose → Kᵀ → matmul(Q, Kᵀ) → scores → softmax → weights → matmul(weights, V) → output
+```
+
+The orchestration function below allocates the intermediate buffers (`Kᵀ`,
+`scores`, `weights`), reshapes \\(Q\\) to \\((1, 16)\\) with `reshape[...]()` as
+shown above, and enqueues each kernel launch on the GPU. There's no new kernel
+math here — the work is choosing buffer layouts and calling the existing kernels
+in the right order.
+
 ```mojo
 {{#include ../../../problems/p19/op/attention.mojo:attention_orchestration}}
 ```
diff --git a/book/src/puzzle_23/elementwise.md b/book/src/puzzle_23/elementwise.md
@@ -6,7 +6,7 @@ modern GPU programming abstracts low-level details while preserving high
 performance.
 
 **Key insight:** _The
-[elementwise](https://docs.modular.com/mojo/std/algorithm/functional/elementwise/)
+[elementwise](https://mojolang.org/docs/std/algorithm/functional/elementwise/)
 function automatically handles thread management, SIMD vectorization, and memory
 coalescing for you._
 
@@ -26,13 +26,24 @@ The mathematical operation is simple element-wise addition:
 The implementation covers fundamental patterns applicable to all GPU functional
 programming in Mojo.
 
+**Where to start:** You begin from the `elementwise` template in the problem file
+— there is no manual shared memory or thread-index math here. The key shift from
+earlier puzzles is that each invocation of your nested function processes a whole
+SIMD vector, not a single element. That's why you load and store with
+`aligned_load[simd_width]` / `store[simd_width]` (vectorized) instead of indexing
+one scalar at a time.
+
 ## Configuration
 
 - Vector size: `SIZE = 1024`
 - Data type: `DType.float32`
 - SIMD width: Target-dependent (determined by GPU architecture and data type)
 - Layout: `row_major[SIZE]()` (1D row-major)
 
+> **Scope:** This is a single-kernel, per-element operation. The `elementwise`
+> abstraction handles thread, block, and grid configuration for you — there is no
+> cross-thread or cross-block communication to reason about here.
+
 ## Code to complete
 
 ```mojo
@@ -53,7 +64,9 @@ The `elementwise` function expects a nested function with this exact signature:
 ```mojo
 @parameter
 @always_inline
-def your_function[simd_width: Int, rank: Int](indices: IndexList[rank]) capturing -> None:
+def your_function[
+    simd_width: Int, alignment: Int = align_of[dtype]()
+](indices: Coord) capturing -> None:
     # Your implementation here
 ```
 
@@ -65,13 +78,13 @@ def your_function[simd_width: Int, rank: Int](indices: IndexList[rank]) capturin
   kernels
 - `capturing`: Allows access to variables from the outer scope (the input/output
   tensors)
-- `IndexList[rank]`: Provides multi-dimensional indexing (rank=1 for vectors,
-  rank=2 for matrices)
+- `Coord`: Carries the per-dimension indices for the current SIMD chunk; use
+  `indices[0]` for 1D operations
 
 ### 2. **Index extraction and SIMD processing**
 
 ```mojo
-idx = indices[0]  # Extract linear index for 1D operations
+idx = Int(indices[0].value())  # Extract linear index for 1D operations
 ```
 
 This `idx` represents the **starting position** for a SIMD vector, not a single
@@ -239,25 +252,27 @@ elementwise[add_function, simd_width, target="gpu"](size, ctx)
 ```mojo
 @parameter
 @always_inline
-def add[simd_width: Int, rank: Int](indices: IndexList[rank]) capturing -> None:
+def add[
+    simd_width: Int, alignment: Int = align_of[dtype]()
+](indices: Coord) capturing -> None:
 ```
 
 **Parameter Analysis:**
 
 - **`@parameter`**: This decorator provides **compile-time specialization**. The
-  function is generated separately for each unique `simd_width` and `rank`,
-  allowing aggressive optimization.
+  function is generated separately for each unique `simd_width`, allowing
+  aggressive optimization.
 - **`@always_inline`**: Critical for GPU performance - eliminates function call
   overhead by embedding the code directly into the kernel.
 - **`capturing`**: Enables **lexical scoping** - the inner function can access
   variables from the outer scope without explicit parameter passing.
-- **`IndexList[rank]`**: Provides **dimension-agnostic indexing** - the same
-  pattern works for 1D vectors, 2D matrices, 3D tensors, etc.
+- **`Coord`**: Carries the per-dimension indices for the SIMD chunk being
+  processed; `indices[0]` is the linear start position for 1D operations.
 
 ### 3. **SIMD execution model deep dive**
 
 ```mojo
-idx = indices[0]                                  # Linear index: 0, 4, 8, 12... (GPU-dependent spacing)
+idx = Int(indices[0].value())                     # Linear index: 0, 4, 8, 12... (GPU-dependent spacing)
 a_simd = a.aligned_load[simd_width](Index(idx))       # Load: [a[0:4], a[4:8], a[8:12]...] (4 elements per load)
 b_simd = b.aligned_load[simd_width](Index(idx))       # Load: [b[0:4], b[4:8], b[8:12]...] (4 elements per load)
 ret = a_simd + b_simd                             # SIMD: 4 additions in parallel (GPU-dependent)
diff --git a/book/src/puzzle_23/vectorize.md b/book/src/puzzle_23/vectorize.md
@@ -4,7 +4,7 @@
 
 This puzzle explores **advanced vectorization techniques** using manual
 vectorization and
-[vectorize](https://docs.modular.com/mojo/std/algorithm/functional/vectorize/)
+[vectorize](https://mojolang.org/docs/std/algorithm/backend/vectorize/vectorize/)
 that give you precise control over SIMD operations within GPU kernels. You'll
 implement two different approaches to vectorized computation:
 
@@ -42,6 +42,10 @@ But with sophisticated vectorization strategies for maximum performance.
 - SIMD width: GPU-dependent
 - Layout: `row_major[SIZE]()` (1D row-major)
 
+> **Scope:** Both approaches operate within a single tile at a time; bounds
+> checking is per-tile and there is no cross-tile or cross-block communication.
+> The focus is SIMD control inside a tile, not coordination across tiles.
+
 ## 1. Manual vectorization approach
 
 ### Code to complete
@@ -235,6 +239,32 @@ for i in range(tile_size):  # i = 0, 1, 2, ..., 31
 
 <div class="solution-tips">
 
+### 0. **From scalar to vectorized**
+
+Start by writing the addition as a plain scalar loop over a tile, then convert it
+to `vectorize`. The transformation is mechanical: replace the per-element loop
+body with a SIMD load/add/store, and hand the loop to `vectorize`, which calls
+your body in `width`-sized steps and processes the leftover remainder for you.
+
+```mojo
+# Before: scalar loop over the tile (one element at a time)
+for i in range(actual_tile_size):
+    global_idx = tile_start + i
+    out_lt[global_idx] = a_lt[global_idx] + b_lt[global_idx]
+
+# After: same logic, but the body operates on a SIMD vector of `width`
+def vectorized_add[width: Int](i: Int) {read tile_start, read a_lt, read b_lt, mut out_lt}:
+    global_idx = tile_start + i
+    if global_idx + width <= size:                       # bounds check
+        a_vec = a_lt.aligned_load[width](Index(global_idx))
+        b_vec = b_lt.aligned_load[width](Index(global_idx))
+        out_lt.store[width](Index(global_idx), a_vec + b_vec)
+
+vectorize[simd_width](actual_tile_size, vectorized_add)  # drives the loop + remainder
+```
+
+The remaining tips break this down piece by piece.
+
 ### 1. **Tile boundary calculation**
 
 ```mojo
diff --git a/book/src/puzzle_24/warp_sum.md b/book/src/puzzle_24/warp_sum.md
@@ -35,6 +35,10 @@ programming in Mojo.
 - Grid configuration: `(1, 1)` blocks per grid
 - Layout: `row_major[SIZE]()` (1D row-major)
 
+> **Scope:** This puzzle works within a single warp (`SIZE = WARP_SIZE`). The
+> reduction happens across lanes of one warp via `warp.sum()`; there is no
+> cross-warp or cross-block reduction here.
+
 ## The traditional complexity (from Puzzle 12)
 
 Recall the complex approach from
@@ -55,6 +59,12 @@ memory, barriers, and tree reduction:
 This works, but it's verbose, error-prone, and requires deep understanding of
 GPU synchronization.
 
+> **Note:** This is intentionally a *different* approach from the
+> [Puzzle 12 solution](../../../solutions/p12/p12.mojo). Puzzle 12 uses shared
+> memory, `barrier()`, and a tree reduction; this puzzle deliberately replaces
+> all of that with a single `warp.sum()`. The code below won't match the P12
+> solution line-for-line — that contrast is the point.
+
 **Test the traditional approach:**
 <div class="code-tabs" data-tab-group="package-manager">
   <div class="tab-buttons">
diff --git a/book/src/puzzle_25/warp_shuffle_down.md b/book/src/puzzle_25/warp_shuffle_down.md
@@ -37,6 +37,11 @@ This transforms complex neighbor access patterns into simple warp-level
 operations, enabling efficient stencil computations without explicit memory
 indexing.
 
+> **Scope:** `shuffle_down()` only moves data *within a warp*. In the multi-block
+> section below, each block's warp handles its own boundary lanes independently;
+> there is no cross-warp or cross-block data exchange. The highest lanes of a
+> warp have no higher lane to read from, which is why boundary handling matters.
+
 ## 1. Basic neighbor difference
 
 ### Configuration
@@ -393,6 +398,7 @@ boundary lanes of each block.
   <div class="tab-buttons">
     <button class="tab-button">pixi NVIDIA (default)</button>
     <button class="tab-button">pixi AMD</button>
+    <button class="tab-button">pixi Apple</button>
     <button class="tab-button">uv</button>
   </div>
   <div class="tab-content">
@@ -411,6 +417,13 @@ pixi run -e amd p25 --average
   </div>
   <div class="tab-content">
 
+```bash
+pixi run -e apple p25 --average
+```
+
+  </div>
+  <div class="tab-content">
+
 ```bash
 uv run poe p25 --average
 ```
diff --git a/book/src/puzzle_26/warp_shuffle_xor.md b/book/src/puzzle_26/warp_shuffle_xor.md
@@ -38,6 +38,11 @@ This transforms complex parallel algorithms into elegant butterfly communication
 patterns, enabling efficient tree reductions and sorting networks without
 explicit coordination.
 
+> **Scope:** `shuffle_xor()` exchanges data *within a single warp*. Every
+> reduction and butterfly here is per-warp; the results are global only because
+> each section runs a single warp over the data. There is no cross-warp or
+> cross-block communication.
+
 ## 1. Basic butterfly pair swap
 
 ### Configuration
@@ -388,6 +393,7 @@ Result: All lanes have global maximum = 7
   <div class="tab-buttons">
     <button class="tab-button">pixi NVIDIA (default)</button>
     <button class="tab-button">pixi AMD</button>
+    <button class="tab-button">pixi Apple</button>
     <button class="tab-button">uv</button>
   </div>
   <div class="tab-content">
@@ -406,6 +412,13 @@ pixi run -e amd p26 --parallel-max
   </div>
   <div class="tab-content">
 
+```bash
+pixi run -e apple p26 --parallel-max
+```
+
+  </div>
+  <div class="tab-content">
+
 ```bash
 uv run poe p26 --parallel-max
 ```
@@ -595,6 +608,7 @@ This puzzle uses multiple blocks. Consider how this affects the reduction scope.
   <div class="tab-buttons">
     <button class="tab-button">pixi NVIDIA (default)</button>
     <button class="tab-button">pixi AMD</button>
+    <button class="tab-button">pixi Apple</button>
     <button class="tab-button">uv</button>
   </div>
   <div class="tab-content">
@@ -613,6 +627,13 @@ pixi run -e amd p26 --conditional-max
   </div>
   <div class="tab-content">
 
+```bash
+pixi run -e apple p26 --conditional-max
+```
+
+  </div>
+  <div class="tab-content">
+
 ```bash
 uv run poe p26 --conditional-max
 ```