book/src/puzzle_05/puzzle_05.md (8 additions, 4 deletions)
@@ -4,6 +4,8 @@
Implement a kernel that broadcast-adds 1D TileTensor `a` and 1D TileTensor `b` and stores the result in 2D TileTensor `output`.
+
+**Broadcasting** in parallel programming refers to the operation where lower-dimensional arrays are automatically expanded to match the shape of higher-dimensional arrays during element-wise operations. Instead of physically replicating data in memory, values are logically repeated across the additional dimensions. For example, adding a 1D vector to each row (or column) of a 2D matrix applies the same vector elements repeatedly without creating multiple copies.
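As a concrete illustration of the idea (the numbers here are chosen for the example, not taken from the puzzle), let \\(a = (1, 2)\\) have shape \\((1, 2)\\) and \\(b = (10, 20)^T\\) have shape \\((2, 1)\\). Broadcasting the addition produces a \\(2 \times 2\\) output in which each entry is \\(a[0, col] + b[row, 0]\\):

\\[
\begin{pmatrix} 10 \\\\ 20 \end{pmatrix} + \begin{pmatrix} 1 & 2 \end{pmatrix} =
\begin{pmatrix} 1 + 10 & 2 + 10 \\\\ 1 + 20 & 2 + 20 \end{pmatrix} =
\begin{pmatrix} 11 & 12 \\\\ 21 & 22 \end{pmatrix}
\\]

Neither vector is ever materialized as a full \\(2 \times 2\\) matrix; each element is simply read repeatedly by index.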
@@ -13,14 +15,16 @@ Implement a kernel that broadcast adds 1D TileTensor `a` and 1D TileTensor `b` a
In this puzzle, you'll learn about:

-- Using `TileTensor` for broadcast operations
-- Working with different tensor shapes
-- Handling 2D indexing with `TileTensor`
+- Broadcasting 1D vectors across different dimensions with `TileTensor`
+- Using 2D thread indices to map GPU threads to a 2D output matrix
+- Working with different tensor shapes for mixed-dimension operations
+- Handling boundary conditions in broadcast patterns
The key insight is that `TileTensor` allows natural broadcasting through different tensor shapes: \\((1, n)\\) and \\((n, 1)\\) to \\((n,n)\\), while still requiring bounds checking.
- **Tensor shapes**: Input vectors have shapes \\((1, n)\\) and \\((n, 1)\\)
-- **Broadcasting**: Output combines both dimensions to \\((n,n)\\)
+- **Broadcasting**: Each element of `a` combines with each element of `b`; output expands both dimensions to \\((n,n)\\)
+- **Access patterns**: `a[0, col]` broadcasts horizontally across rows; `b[row, 0]` broadcasts vertically across columns
- **Guard condition**: Still need bounds checking for output size
- **Thread bounds**: More threads \\((3 \times 3)\\) than tensor elements \\((2 \times 2)\\)
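Putting the access patterns and the guard condition together, the kernel body reduces to a few lines. The sketch below expresses the idea in CUDA rather than the book's Mojo/TileTensor API; the kernel name, raw-pointer signature, and `size` parameter are assumptions made for the illustration:

```cuda
// Illustrative CUDA sketch of the broadcast add: a has shape (1, size),
// b has shape (size, 1), output has shape (size, size), stored row-major.
__global__ void broadcast_add(const float* a, const float* b,
                              float* output, int size) {
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    int row = blockDim.y * blockIdx.y + threadIdx.y;

    // Guard: the launch may provide more threads (e.g. 3 x 3)
    // than there are output elements (e.g. 2 x 2).
    if (row < size && col < size) {
        // a broadcasts down the rows, b broadcasts across the columns.
        output[row * size + col] = a[col] + b[row];
    }
}
```

Launching this with a \\(3 \times 3\\) thread block over a \\(2 \times 2\\) output reproduces the thread-bounds situation listed above: the extra threads simply fail the guard and do nothing.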
book/src/puzzle_06/puzzle_06.md (2 additions, 0 deletions)
@@ -4,6 +4,8 @@
Implement a kernel that adds 10 to each position of vector `a` and stores the result in `output`.
+
+A **thread block** (or just **block**) is a group of threads that execute together on a single GPU multiprocessor. All threads in a block share the same shared memory and can synchronize with each other. When data is larger than one block can handle, the GPU schedules multiple blocks, and each block independently processes its portion of the data. The global position of a thread is computed from both its position within the block (`thread_idx.x`) and which block it belongs to (`block_idx.x`): `global_i = block_dim.x * block_idx.x + thread_idx.x`.
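To make the index formula concrete, here is a small CUDA-style sketch of the same idea, written with CUDA's `blockDim`/`blockIdx`/`threadIdx` names rather than the book's spellings; the kernel name, sizes, and launch shown are invented for the example:

```cuda
// Illustrative CUDA sketch: add 10 to each element using several blocks,
// because a single block has fewer threads than the array has elements.
__global__ void add_ten(const float* a, float* output, int size) {
    // Block offset plus in-block offset gives the global position.
    int global_i = blockDim.x * blockIdx.x + threadIdx.x;
    if (global_i < size) {  // the last block may be only partially used
        output[global_i] = a[global_i] + 10.0f;
    }
}

// Example launch: 9 elements with 4 threads per block needs 3 blocks,
// since ceil(9 / 4) = 3; the last block has 3 idle threads.
// add_ten<<<3, 4>>>(d_a, d_out, 9);
```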
**Note:** _You have fewer threads per block than the size of `a`._
book/src/puzzle_08/puzzle_08.md (2 additions, 0 deletions)
@@ -4,6 +4,8 @@
Implement a kernel that adds 10 to each position of a 1D TileTensor `a` and stores the result in 1D TileTensor `output`.
+
+**Shared memory** is fast, on-chip storage that is visible to all threads within the same block. Unlike global memory (which all blocks can access but is slow), shared memory has latency comparable to an L1 cache. Each block gets its own private shared memory region; threads in one block cannot see the shared memory of another block. Because threads can read and write to the same shared memory locations, coordination via `barrier()` is required to prevent one thread from reading a value before another thread has finished writing it.
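The usual pattern is: each thread copies one element from global memory into shared memory, all threads hit a barrier, and only then does any thread read what the others wrote. Below is a minimal CUDA-style sketch of that pattern, using `__shared__` and `__syncthreads()` as the CUDA counterparts of the book's shared allocation and `barrier()`; the tile size, kernel name, and pointer signature are assumptions for the illustration:

```cuda
#define TPB 4  // threads per block, fixed here only for the example

// Illustrative CUDA sketch: stage a block-sized tile of `a` in shared
// memory, synchronize, then compute from the staged copy.
__global__ void add_ten_shared(const float* a, float* output, int size) {
    __shared__ float tile[TPB];      // private to this block
    int local_i  = threadIdx.x;
    int global_i = blockDim.x * blockIdx.x + threadIdx.x;

    if (global_i < size) {
        tile[local_i] = a[global_i]; // each thread fills one slot
    }
    __syncthreads();                 // barrier: all writes must land first

    if (global_i < size) {
        output[global_i] = tile[local_i] + 10.0f;
    }
}
```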
**Note:** _You have fewer threads per block than the size of `a`._
book/src/puzzle_11/puzzle_11.md (2 additions, 0 deletions)
@@ -4,6 +4,8 @@
Implement a kernel that computes the running sum of the last 3 positions of 1D TileTensor `a` and stores the result in 1D TileTensor `output`.
+
+**Pooling** is an operation that condenses a region of values into a single summary value, for example their sum, maximum, or average. A **sliding window** applies this condensation repeatedly by moving a fixed-size window one step at a time across the input, producing one output value per window position. Here the window is 3 elements wide and the summary function is a sum, so each output element equals the sum of the current element and the two preceding it (with special cases at the boundaries where fewer than 3 elements are available).
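A minimal CUDA-style sketch of this windowed sum, with one thread per output position and the left-boundary cases handled explicitly (the kernel name and raw-pointer interface are assumptions; the puzzle itself works with TileTensor):

```cuda
// Illustrative CUDA sketch of the 3-wide sliding-window sum:
// output[i] = a[i - 2] + a[i - 1] + a[i], with shorter sums near the left edge.
__global__ void window_sum_3(const float* a, float* output, int size) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < size) {
        float sum = a[i];
        if (i >= 1) sum += a[i - 1];  // previous position, if it exists
        if (i >= 2) sum += a[i - 2];  // position before that, if it exists
        output[i] = sum;
    }
}
```

Note that this naive version performs up to three global reads per thread; staging the block's slice of `a` in shared memory first (as in puzzle 8) is what brings it down to the one global read and one global write per thread mentioned in the note below.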
**Note:** _You have 1 thread per position. You only need 1 global read and 1 global write per thread._
book/src/puzzle_12/puzzle_12.md (2 additions, 0 deletions)
@@ -19,6 +19,8 @@ For example, if you have two vectors:
## Key concepts
+
+**Parallel reduction** is an algorithm that combines \\(n\\) values into one using a binary operation (here, addition) in \\(O(\log n)\\) steps instead of \\(O(n)\\) sequential steps. In each step, half the active threads each add one value into another, halving the number of remaining partial results. After \\(\log_2 n\\) steps, thread 0 holds the final sum. This tree-shaped computation requires a `barrier()` between steps so no thread reads a partially updated value.
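A minimal CUDA-style sketch of this tree reduction within a single block, using shared memory and a halving stride loop (it assumes the block size is a power of two and that one block covers the whole input; the kernel name and sizes are invented for the illustration):

```cuda
#define TPB 8  // threads per block; a power of two keeps the halving loop simple

// Illustrative CUDA sketch of a single-block parallel sum reduction.
__global__ void sum_reduce(const float* a, float* output, int size) {
    __shared__ float cache[TPB];
    int i = threadIdx.x;

    // Stage one value per thread; pad with 0 so surplus threads are harmless.
    cache[i] = (i < size) ? a[i] : 0.0f;
    __syncthreads();

    // Each step halves the number of active threads: stride = TPB/2, TPB/4, ..., 1.
    for (int stride = TPB / 2; stride > 0; stride /= 2) {
        if (i < stride) {
            cache[i] += cache[i + stride];
        }
        __syncthreads();  // all additions of this step must finish before the next
    }

    if (i == 0) {
        output[0] = cache[0];  // after log2(TPB) steps, thread 0 holds the sum
    }
}
```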
This puzzle covers:
- Similar to [puzzle 8](../puzzle_08/puzzle_08.md) and [puzzle 11](../puzzle_11/puzzle_11.md), implementing parallel reduction with TileTensor