book/src/puzzle_05/puzzle_05.md (8 additions, 4 deletions)
@@ -4,6 +4,8 @@
Implement a kernel that broadcast-adds 1D TileTensor `a` and 1D TileTensor `b` and stores the result in 2D TileTensor `output`.
+
+**Broadcasting** in parallel programming refers to the operation where lower-dimensional arrays are automatically expanded to match the shape of higher-dimensional arrays during element-wise operations. Instead of physically replicating data in memory, values are logically repeated across the additional dimensions. For example, adding a 1D vector to each row (or column) of a 2D matrix applies the same vector elements repeatedly without creating multiple copies.
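As a concrete illustration of the idea (the numbers here are chosen for the example, not taken from the puzzle), let \\(a = (1, 2)\\) have shape \\((1, 2)\\) and \\(b = (10, 20)^T\\) have shape \\((2, 1)\\). Broadcasting the addition produces a \\(2 \times 2\\) output in which each entry is \\(a[0, col] + b[row, 0]\\):

\\[
\begin{pmatrix} 10 \\\\ 20 \end{pmatrix} + \begin{pmatrix} 1 & 2 \end{pmatrix} =
\begin{pmatrix} 1 + 10 & 2 + 10 \\\\ 1 + 20 & 2 + 20 \end{pmatrix} =
\begin{pmatrix} 11 & 12 \\\\ 21 & 22 \end{pmatrix}
\\]

Neither vector is ever materialized as a full \\(2 \times 2\\) matrix; each element is simply read repeatedly by index.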
@@ -13,14 +15,16 @@ Implement a kernel that broadcast adds 1D TileTensor `a` and 1D TileTensor `b` a
In this puzzle, you'll learn about:

-- Using `TileTensor` for broadcast operations
-- Working with different tensor shapes
-- Handling 2D indexing with `TileTensor`
+- Broadcasting 1D vectors across different dimensions with `TileTensor`
+- Using 2D thread indices to map GPU threads to a 2D output matrix
+- Working with different tensor shapes for mixed-dimension operations
+- Handling boundary conditions in broadcast patterns
The key insight is that `TileTensor` allows natural broadcasting through different tensor shapes: \\((1, n)\\) and \\((n, 1)\\) to \\((n,n)\\), while still requiring bounds checking.
- **Tensor shapes**: Input vectors have shapes \\((1, n)\\) and \\((n, 1)\\)
-- **Broadcasting**: Output combines both dimensions to \\((n,n)\\)
+- **Broadcasting**: Each element of `a` combines with each element of `b`; output expands both dimensions to \\((n,n)\\)
+- **Access patterns**: `a[0, col]` broadcasts horizontally across rows; `b[row, 0]` broadcasts vertically across columns
- **Guard condition**: Still need bounds checking for output size
- **Thread bounds**: More threads \\((3 \times 3)\\) than tensor elements \\((2 \times 2)\\)
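Putting the access patterns and the guard condition together, the kernel body reduces to a few lines. The sketch below expresses the idea in CUDA rather than the book's Mojo/TileTensor API; the kernel name, raw-pointer signature, and `size` parameter are assumptions made for the illustration:

```cuda
// Illustrative CUDA sketch of the broadcast add: a has shape (1, size),
// b has shape (size, 1), output has shape (size, size), stored row-major.
__global__ void broadcast_add(const float* a, const float* b,
                              float* output, int size) {
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    int row = blockDim.y * blockIdx.y + threadIdx.y;

    // Guard: the launch may provide more threads (e.g. 3 x 3)
    // than there are output elements (e.g. 2 x 2).
    if (row < size && col < size) {
        // a broadcasts down the rows, b broadcasts across the columns.
        output[row * size + col] = a[col] + b[row];
    }
}
```

Launching this with a \\(3 \times 3\\) thread block over a \\(2 \times 2\\) output reproduces the thread-bounds situation listed above: the extra threads simply fail the guard and do nothing.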
book/src/puzzle_06/puzzle_06.md (2 additions, 0 deletions)
@@ -4,6 +4,8 @@
Implement a kernel that adds 10 to each position of vector `a` and stores the result in `output`.
+
+A **thread block** (or just **block**) is a group of threads that execute together on a single GPU multiprocessor. All threads in a block share the same shared memory and can synchronize with each other. When data is larger than one block can handle, the GPU schedules multiple blocks, and each block independently processes its portion of the data. The global position of a thread is computed from both its position within the block (`thread_idx.x`) and which block it belongs to (`block_idx.x`): `global_i = block_dim.x * block_idx.x + thread_idx.x`.
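To make the index formula concrete, here is a small CUDA-style sketch of the same idea, written with CUDA's `blockDim`/`blockIdx`/`threadIdx` names rather than the book's spellings; the kernel name, sizes, and launch shown are invented for the example:

```cuda
// Illustrative CUDA sketch: add 10 to each element using several blocks,
// because a single block has fewer threads than the array has elements.
__global__ void add_ten(const float* a, float* output, int size) {
    // Block offset plus in-block offset gives the global position.
    int global_i = blockDim.x * blockIdx.x + threadIdx.x;
    if (global_i < size) {  // the last block may be only partially used
        output[global_i] = a[global_i] + 10.0f;
    }
}

// Example launch: 9 elements with 4 threads per block needs 3 blocks,
// since ceil(9 / 4) = 3; the last block has 3 idle threads.
// add_ten<<<3, 4>>>(d_a, d_out, 9);
```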
**Note:** _You have fewer threads per block than the size of `a`._
book/src/puzzle_08/puzzle_08.md (2 additions, 0 deletions)
@@ -4,6 +4,8 @@
Implement a kernel that adds 10 to each position of a 1D TileTensor `a` and stores the result in 1D TileTensor `output`.
+
+**Shared memory** is fast, on-chip storage that is visible to all threads within the same block. Unlike global memory (which all blocks can access but is slow), shared memory has latency comparable to an L1 cache. Each block gets its own private shared memory region; threads in one block cannot see the shared memory of another block. Because threads can read and write to the same shared memory locations, coordination via `barrier()` is required to prevent one thread from reading a value before another thread has finished writing it.
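The usual pattern is: each thread copies one element from global memory into shared memory, all threads hit a barrier, and only then does any thread read what the others wrote. Below is a minimal CUDA-style sketch of that pattern, using `__shared__` and `__syncthreads()` as the CUDA counterparts of the book's shared allocation and `barrier()`; the tile size, kernel name, and pointer signature are assumptions for the illustration:

```cuda
#define TPB 4  // threads per block, fixed here only for the example

// Illustrative CUDA sketch: stage a block-sized tile of `a` in shared
// memory, synchronize, then compute from the staged copy.
__global__ void add_ten_shared(const float* a, float* output, int size) {
    __shared__ float tile[TPB];      // private to this block
    int local_i  = threadIdx.x;
    int global_i = blockDim.x * blockIdx.x + threadIdx.x;

    if (global_i < size) {
        tile[local_i] = a[global_i]; // each thread fills one slot
    }
    __syncthreads();                 // barrier: all writes must land first

    if (global_i < size) {
        output[global_i] = tile[local_i] + 10.0f;
    }
}
```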
**Note:** _You have fewer threads per block than the size of `a`._
book/src/puzzle_11/puzzle_11.md (2 additions, 0 deletions)
@@ -4,6 +4,8 @@
Implement a kernel that computes the running sum of the last 3 positions of 1D TileTensor `a` and stores the result in 1D TileTensor `output`.
+
+**Pooling** is an operation that condenses a region of values into a single summary value, for example their sum, maximum, or average. A **sliding window** applies this condensation repeatedly by moving a fixed-size window one step at a time across the input, producing one output value per window position. Here the window is 3 elements wide and the summary function is a sum, so each output element equals the sum of the current element and the two preceding it (with special cases at the boundaries where fewer than 3 elements are available).
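A minimal CUDA-style sketch of this windowed sum, with one thread per output position and the left-boundary cases handled explicitly (the kernel name and raw-pointer interface are assumptions; the puzzle itself works with TileTensor):

```cuda
// Illustrative CUDA sketch of the 3-wide sliding-window sum:
// output[i] = a[i - 2] + a[i - 1] + a[i], with shorter sums near the left edge.
__global__ void window_sum_3(const float* a, float* output, int size) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < size) {
        float sum = a[i];
        if (i >= 1) sum += a[i - 1];  // previous position, if it exists
        if (i >= 2) sum += a[i - 2];  // position before that, if it exists
        output[i] = sum;
    }
}
```

Note that this naive version performs up to three global reads per thread; staging the block's slice of `a` in shared memory first (as in puzzle 8) is what brings it down to the one global read and one global write per thread mentioned in the note below.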
**Note:** _You have 1 thread per position. You only need 1 global read and 1 global write per thread._
book/src/puzzle_12/puzzle_12.md (2 additions, 0 deletions)
@@ -19,6 +19,8 @@ For example, if you have two vectors:
## Key concepts
+
+**Parallel reduction** is an algorithm that combines \\(n\\) values into one using a binary operation (here, addition) in \\(O(\log n)\\) steps instead of \\(O(n)\\) sequential steps. In each step, half the active threads each add one value into another, halving the number of remaining partial results. After \\(\log_2 n\\) steps, thread 0 holds the final sum. This tree-shaped computation requires a `barrier()` between steps so no thread reads a partially updated value.
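A minimal CUDA-style sketch of this tree reduction within a single block, using shared memory and a halving stride loop (it assumes the block size is a power of two and that one block covers the whole input; the kernel name and sizes are invented for the illustration):

```cuda
#define TPB 8  // threads per block; a power of two keeps the halving loop simple

// Illustrative CUDA sketch of a single-block parallel sum reduction.
__global__ void sum_reduce(const float* a, float* output, int size) {
    __shared__ float cache[TPB];
    int i = threadIdx.x;

    // Stage one value per thread; pad with 0 so surplus threads are harmless.
    cache[i] = (i < size) ? a[i] : 0.0f;
    __syncthreads();

    // Each step halves the number of active threads: stride = TPB/2, TPB/4, ..., 1.
    for (int stride = TPB / 2; stride > 0; stride /= 2) {
        if (i < stride) {
            cache[i] += cache[i + stride];
        }
        __syncthreads();  // all additions of this step must finish before the next
    }

    if (i == 0) {
        output[0] = cache[0];  // after log2(TPB) steps, thread 0 holds the sum
    }
}
```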
This puzzle covers:
- Similar to [puzzle 8](../puzzle_08/puzzle_08.md) and [puzzle 11](../puzzle_11/puzzle_11.md), implementing parallel reduction with TileTensor