
Commit f7d1b1f

restore definition of terms

1 parent dcd5afc, commit f7d1b1f

6 files changed: 16 additions & 270 deletions

book/src/puzzle_05/puzzle_05.md

Lines changed: 8 additions & 4 deletions
@@ -4,6 +4,8 @@
 
 Implement a kernel that broadcast adds 1D TileTensor `a` and 1D TileTensor `b` and stores it in 2D TileTensor `output`.
 
+**Broadcasting** in parallel programming refers to the operation where lower-dimensional arrays are automatically expanded to match the shape of higher-dimensional arrays during element-wise operations. Instead of physically replicating data in memory, values are logically repeated across the additional dimensions. For example, adding a 1D vector to each row (or column) of a 2D matrix applies the same vector elements repeatedly without creating multiple copies.
+
 **Note:** _You have more threads than positions._
 
 <img src="./media/05.png" alt="Broadcast visualization" class="light-mode-img">
@@ -13,14 +15,16 @@ Implement a kernel that broadcast adds 1D TileTensor `a` and 1D TileTensor `b` a
 
 In this puzzle, you'll learn about:
 
-- Using `TileTensor` for broadcast operations
-- Working with different tensor shapes
-- Handling 2D indexing with `TileTensor`
+- Broadcasting 1D vectors across different dimensions with `TileTensor`
+- Using 2D thread indices to map GPU threads to a 2D output matrix
+- Working with different tensor shapes for mixed-dimension operations
+- Handling boundary conditions in broadcast patterns
 
 The key insight is that `TileTensor` allows natural broadcasting through different tensor shapes: \\((1, n)\\) and \\((n, 1)\\) to \\((n,n)\\), while still requiring bounds checking.
 
 - **Tensor shapes**: Input vectors have shapes \\((1, n)\\) and \\((n, 1)\\)
-- **Broadcasting**: Output combines both dimensions to \\((n,n)\\)
+- **Broadcasting**: Each element of `a` combines with each element of `b`; output expands both dimensions to \\((n,n)\\)
+- **Access patterns**: `a[0, col]` broadcasts horizontally across rows; `b[row, 0]` broadcasts vertically across columns
 - **Guard condition**: Still need bounds checking for output size
 - **Thread bounds**: More threads \\((3 \times 3)\\) than tensor elements \\((2 \times 2)\\)
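The broadcast access pattern added above (`a[0, col]` plus `b[row, 0]` into an \\((n, n)\\) output) can be sketched in plain Python — a hypothetical stand-in for the Mojo kernel, not the actual `TileTensor` API:

```python
# Sketch of broadcast addition: a has shape (1, n), b has shape (n, 1),
# output has shape (n, n). No expanded copies of a or b are materialized.
def broadcast_add(a_row, b_col):
    n = len(a_row)
    # output[row][col] combines a[0, col] with b[row, 0]
    return [[a_row[col] + b_col[row] for col in range(n)] for row in range(n)]

out = broadcast_add([10, 20], [1, 2])
# out == [[11, 21], [12, 22]]
```

Each of the \\(n \times n\\) output positions reads one element of `a` and one element of `b`, which is exactly the one-read-per-input pattern a GPU thread would use.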

book/src/puzzle_06/puzzle_06.md

Lines changed: 2 additions & 0 deletions
@@ -4,6 +4,8 @@
 
 Implement a kernel that adds 10 to each position of vector `a` and stores it in `output`.
 
+A **thread block** (or just **block**) is a group of threads that execute together on a single GPU multiprocessor. All threads in a block share the same shared memory and can synchronize with each other. When data is larger than one block can handle, the GPU schedules multiple blocks — each block independently processes its portion of the data. The global position of a thread is computed from both its position within the block (`thread_idx.x`) and which block it belongs to (`block_idx.x`): `global_i = block_dim.x * block_idx.x + thread_idx.x`.
+
 **Note:** _You have fewer threads per block than the size of a._
 
 <img src="./media/06.png" alt="Blocks visualization" class="light-mode-img">
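The `global_i` formula in the paragraph added above can be checked with a small pure-Python model (illustrative names, not the Mojo API):

```python
# Model of global thread indexing across blocks: with block_dim threads
# per block, block block_idx covers global indices
# [block_dim * block_idx, block_dim * (block_idx + 1)).
def global_indices(block_dim, num_blocks):
    for block_idx in range(num_blocks):
        for thread_idx in range(block_dim):
            yield block_dim * block_idx + thread_idx

# 4 threads per block, 2 blocks -> every position 0..7 is covered exactly once
assert list(global_indices(4, 2)) == [0, 1, 2, 3, 4, 5, 6, 7]
```

Because the blocks tile the index space without gaps or overlap, each element of `a` is handled by exactly one thread.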

book/src/puzzle_08/puzzle_08.md

Lines changed: 2 additions & 0 deletions
@@ -4,6 +4,8 @@
 
 Implement a kernel that adds 10 to each position of a 1D TileTensor `a` and stores it in 1D TileTensor `output`.
 
+**Shared memory** is fast, on-chip storage that is visible to all threads within the same block. Unlike global memory (which all blocks can access but is slow), shared memory has latency comparable to a CPU's L1 cache. Each block gets its own private shared memory region — threads in one block cannot see the shared memory of another block. Because threads can read and write to the same shared memory locations, coordination via `barrier()` is required to prevent one thread from reading a value before another thread has finished writing it.
+
 **Note:** _You have fewer threads per block than the size of `a`._
 
 <img src="./media/08.png" alt="Shared memory visualization" class="light-mode-img">
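The stage-then-barrier pattern described in the paragraph added above can be sketched with CPU threads standing in for the threads of one block (illustrative names only, not the Mojo API):

```python
import threading

# Sketch: each "GPU thread" stages one element into block-private
# "shared memory", waits at a barrier, then reads and writes its result.
def run_block(a, block_dim):
    shared = [0] * block_dim            # stand-in for shared memory
    output = [0] * block_dim
    barrier = threading.Barrier(block_dim)

    def worker(i):
        shared[i] = a[i]                # write phase
        barrier.wait()                  # no read starts before all writes finish
        output[i] = shared[i] + 10      # read phase

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(block_dim)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return output

assert run_block([1, 2, 3, 4], 4) == [11, 12, 13, 14]
```

Dropping the `barrier.wait()` would make the read phase race against the write phase — the same hazard `barrier()` prevents on the GPU.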

book/src/puzzle_11/puzzle_11.md

Lines changed: 2 additions & 0 deletions
@@ -4,6 +4,8 @@
 
 Implement a kernel that computes the running sum of the last 3 positions of 1D TileTensor `a` and stores it in 1D TileTensor `output`.
 
+**Pooling** is an operation that condenses a region of values into a single summary value — for example, their sum, maximum, or average. A **sliding window** applies this condensation repeatedly by moving a fixed-size window one step at a time across the input, producing one output value per window position. Here the window is 3 elements wide and the summary function is a sum, so each output element equals the sum of the current element and the two preceding it (with special cases at the boundaries where fewer than 3 elements are available).
+
 **Note:** _You have 1 thread per position. You only need 1 global read and 1 global write per thread._
 
 <img src="./media/11-w.png" alt="Pooling visualization" class="light-mode-img">
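The sliding-window sum defined in the paragraph added above, including its boundary cases, can be sketched in plain Python (a reference model, not the kernel itself):

```python
# Windowed running sum: output[i] = a[i-2] + a[i-1] + a[i], clipped at the
# left boundary so positions 0 and 1 sum only the elements that exist.
def pool_last3(a):
    return [sum(a[max(0, i - 2): i + 1]) for i in range(len(a))]

# Position 0 sees 1 element, position 1 sees 2, all later positions see 3.
assert pool_last3([1, 2, 3, 4, 5]) == [1, 3, 6, 9, 12]
```

On the GPU, one thread computes one output position, which is why each window position maps cleanly to a thread index.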

book/src/puzzle_12/puzzle_12.md

Lines changed: 2 additions & 0 deletions
@@ -19,6 +19,8 @@ For example, if you have two vectors:
 
 ## Key concepts
 
+**Parallel reduction** is an algorithm that combines \\(n\\) values into one using a binary operation (here, addition) in \\(O(\log n)\\) steps instead of \\(O(n)\\) sequential steps. In each step, half the active threads each add one value into another, halving the number of remaining partial results. After \\(\log_2 n\\) steps, thread 0 holds the final sum. This tree-shaped computation requires a `barrier()` between steps so no thread reads a partially-updated value.
+
 This puzzle covers:
 
 - Similar to [puzzle 8](../puzzle_08/puzzle_08.md) and [puzzle 11](../puzzle_11/puzzle_11.md), implementing parallel reduction with TileTensor
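The tree-shaped reduction described in the paragraph added above can be sketched as a sequential Python model of the \\(\log_2 n\\) steps (assuming a power-of-two input length; on a real GPU a `barrier()` separates the steps):

```python
# Tree reduction: each step folds the upper half of the active values
# into the lower half, halving the number of partial results.
def tree_reduce(values):
    vals = list(values)
    stride = len(vals) // 2
    while stride > 0:
        for i in range(stride):       # these additions run in parallel on a GPU
            vals[i] += vals[i + stride]
        stride //= 2                  # half as many active "threads" next step
    return vals[0]                    # "thread 0" holds the final sum

assert tree_reduce([1, 2, 3, 4, 5, 6, 7, 8]) == 36
```

For \\(n = 8\\) this takes 3 steps instead of 7 sequential additions; the gap widens quickly as \\(n\\) grows.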

tests/canary_tile_tensor.mojo

Lines changed: 0 additions & 266 deletions
This file was deleted.

0 commit comments
