Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
60 changes: 30 additions & 30 deletions book/src/puzzle_09/second_case.md
Original file line number Diff line number Diff line change
Expand Up @@ -153,20 +153,20 @@ Each position should sum its neighbors: [left + center + right]
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

CUDA thread hit application kernel entry function breakpoint, p09_process_sliding_window_...
<<<(1,1,1),(4,1,1)>>> (output=..., input=...)
at /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo:30
30 input: TileTensor[mut=False, dtype, vector_layout],
<<<(1,1,1),(4,1,1)>>> (output=..., a=...)
at /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo:36
36 a: TileTensor[mut=False, dtype, VectorLayout, ImmutAnyOrigin],
```

#### Step 4: Navigate to the main logic

```bash
(cuda-gdb) n
29 output: TileTensor[mut=True, dtype, vector_layout],
35 output: TileTensor[mut=True, dtype, VectorLayout, MutAnyOrigin],
(cuda-gdb) n
32 thread_id = thread_idx.x
38 var thread_id = thread_idx.x
(cuda-gdb) n
38 for offset in range(ITER):
44 for offset in range(ITER):
```

#### Step 5: Test variable accessibility - crucial discovery
Expand Down Expand Up @@ -206,15 +206,15 @@ $3 = {{0}, {1}, {2}, {3}}
#### Step 6: Set up loop monitoring

```bash
(cuda-gdb) b 42
Breakpoint 1 at 0x7fffd326ffd0: file problems/p09/p09.mojo, line 42.
(cuda-gdb) b 45
Breakpoint 1 at 0x7fffd326ffd0: file problems/p09/p09.mojo, line 45.
(cuda-gdb) c
Continuing.

CUDA thread hit Breakpoint 1, p09_process_sliding_window_...
<<<(1,1,1),(4,1,1)>>> (output=..., input=...)
at /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo:42
42 idx = thread_id + offset - 1
<<<(1,1,1),(4,1,1)>>> (output=..., a=...)
at /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo:45
45 var idx = Int(thread_id) + offset - 1
```

**🔍 We're now inside the loop body. Let's count iterations manually.**
Expand All @@ -223,50 +223,50 @@ CUDA thread hit Breakpoint 1, p09_process_sliding_window_...

```bash
(cuda-gdb) n
43 if 0 <= idx < SIZE:
46 if 0 <= idx < SIZE:
(cuda-gdb) n
41 for offset in range(ITER):
44 for offset in range(ITER):
```

**First iteration complete**: Loop went from line 4243 → back to 41. The loop continues.
**First iteration complete**: Loop went from line 4546 → back to 44. The loop continues.

#### Step 8: Second loop iteration (offset = 1)

```bash
(cuda-gdb) n

CUDA thread hit Breakpoint 1, p09_process_sliding_window_...
42 idx = thread_id + offset - 1
45 var idx = Int(thread_id) + offset - 1
(cuda-gdb) n
43 if 0 <= idx < SIZE:
46 if 0 <= idx < SIZE:
(cuda-gdb) n
44 value = rebind[Scalar[dtype]](input[idx])
47 var value = rebind[Scalar[dtype]](a[idx])
(cuda-gdb) n
45 window_sum += value
48 window_sum += value
(cuda-gdb) n
43 if 0 <= idx < SIZE:
46 if 0 <= idx < SIZE:
(cuda-gdb) n
41 for offset in range(ITER):
44 for offset in range(ITER):
```

**Second iteration complete**: This time it went through the if-block (lines 44-45).
**Second iteration complete**: This time it went through the if-block (lines 47-48).

#### Step 9: testing for third iteration

```bash
(cuda-gdb) n
47 output[thread_id] = window_sum
50 output[thread_id] = window_sum
```

**CRITICAL DISCOVERY**: The loop exited after only 2 iterations! It went directly to line 47 instead of hitting our breakpoint at line 42 again.
**CRITICAL DISCOVERY**: The loop exited after only 2 iterations! It went directly to line 50 instead of hitting our breakpoint at line 45 again.

**Conclusion**: The loop ran exactly **2 iterations** and then exited.

#### Step 10: Complete kernel execution and context loss

```bash
(cuda-gdb) n
31 fn process_sliding_window(
34 def process_sliding_window(
(cuda-gdb) n
[Switching to Thread 0x7ffff7cc0e00 (LWP 110927)]
0x00007ffff064f84a in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
Expand All @@ -290,10 +290,10 @@ From our debugging session, we observed:

Looking at what each thread should compute:

- **Thread 0**: window_sum = input[-1] + input[0] + input[1] = (boundary) + 0 + 1 = 1.0
- **Thread 1**: window_sum = input[0] + input[1] + input[2] = 0 + 1 + 2 = 3.0
- **Thread 2**: window_sum = input[1] + input[2] + input[3] = 1 + 2 + 3 = 6.0
- **Thread 3**: window_sum = input[2] + input[3] + input[4] = 2 + 3 + (boundary) = 5.0
- **Thread 0**: window_sum = a[-1] + a[0] + a[1] = (boundary) + 0 + 1 = 1.0
- **Thread 1**: window_sum = a[0] + a[1] + a[2] = 0 + 1 + 2 = 3.0
- **Thread 2**: window_sum = a[1] + a[2] + a[3] = 1 + 2 + 3 = 6.0
- **Thread 3**: window_sum = a[2] + a[3] + a[4] = 2 + 3 + (boundary) = 5.0

#### Step 12: Trace the actual execution for thread 0

Expand All @@ -309,13 +309,13 @@ With only 2 iterations (offset = 0, 1):

- `idx = thread_id + offset - 1 = 0 + 1 - 1 = 0`
- `if 0 <= idx < SIZE:` → `if 0 <= 0 < 4:` → **True**
- `window_sum += input[0]` → `window_sum += 0`
- `window_sum += a[0]` → `window_sum += 0`

**Missing Iteration 3 (offset = 2)**:

- `idx = thread_id + offset - 1 = 0 + 2 - 1 = 1`
- `if 0 <= idx < SIZE:` → `if 0 <= 1 < 4:` → **True**
- `window_sum += input[1]` → `window_sum += 1` ← **THIS NEVER HAPPENS**
- `window_sum += a[1]` → `window_sum += 1` ← **THIS NEVER HAPPENS**

**Result**: Thread 0 gets `window_sum = 0` instead of `window_sum = 0 + 1 = 1`

Expand Down
Loading