Commit 5a5fd74

dunnoconnor authored and modularbot committed
[Edu] Fix Puzzle 29 barrier sync and mbarrier_test_wait polling
Implement pre-barrier synchronization
mbarrier_test_wait called in a loop
Remove non-NVIDIA run commands

MODULAR_ORIG_COMMIT_REV_ID: 4cd429bc9af3c36bb0464e8523a0a9b51f2e4efb
1 parent 70cb76f commit 5a5fd74

7 files changed

Lines changed: 120 additions & 14440 deletions

book/i18n/ko/src/puzzle_29/barrier.md

Lines changed: 7 additions & 9 deletions
@@ -2,6 +2,12 @@
 
 # 다단계 파이프라인 조정
 
+**중요 사항**: 퍼즐 29는 현재 NVIDIA 전용입니다.
+[동반 스텐실 퍼즐](memory_barrier.md)에서 사용되는
+[`mbarrier` API](https://docs.modular.com/mojo/std/gpu/sync/sync/)
+NVIDIA 하드웨어가 필요하며, `pixi` 작업이 이를 동일한 컴파일 단위로
+가져옵니다.
+
 ## 개요
 
 조율된 3단계 파이프라인을 통해 이미지를 처리하는 커널을 구현합니다. 서로 다른
@@ -184,8 +190,7 @@
 
 <div class="code-tabs" data-tab-group="package-manager">
 <div class="tab-buttons">
-<button class="tab-button">pixi NVIDIA (default)</button>
-<button class="tab-button">pixi AMD</button>
+<button class="tab-button">pixi NVIDIA</button>
 <button class="tab-button">uv</button>
 </div>
 <div class="tab-content">
@@ -197,13 +202,6 @@ pixi run p29 --multi-stage
 </div>
 <div class="tab-content">
 
-```bash
-pixi run -e amd p29 --multi-stage
-```
-
-</div>
-<div class="tab-content">
-
 ```bash
 uv run poe p29 --multi-stage
 ```

book/i18n/ko/src/puzzle_29/memory_barrier.md

Lines changed: 35 additions & 20 deletions
@@ -2,6 +2,10 @@
 
 # 더블 버퍼링 스텐실 연산
 
+**중요 사항**: 이 퍼즐은 NVIDIA GPU 하드웨어가 필요합니다.
+[`mbarrier` API](https://docs.modular.com/mojo/std/gpu/sync/sync/)
+NVIDIA 전용입니다.
+
 > **🔬 세밀한 동기화: mbarrier vs barrier()**
 >
 > 이 퍼즐은 이전 퍼즐에서 사용한 기본
@@ -223,12 +227,20 @@ buffer\_B} & \\text{if} i \\bmod 2 = 0 \\\\
 - 각 스레드가 쓰기 연산을 완료한 후
 [`mbarrier_arrive()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_arrive)
 호출
-- 버퍼 교체 전 모든 스레드가 완료하도록
+- 그 다음
 [`mbarrier_test_wait()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_test_wait)
-사용
+를 폴링 루프로 호출: 이 API는 **비차단** 검사이므로
+`while not mbarrier_test_wait(...): pass` 안에서 호출해야 모든 스레드가
+도착할 때까지 실제로 대기할 수 있음
 - 재사용을 위해 반복 간에 배리어 재초기화:
 [`mbarrier_init()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_init)
 - 경쟁 상태를 피하기 위해 스레드 0만 배리어를 재초기화
+- 모든 `mbarrier_init` 호출(초기 설정과 반복마다의 재초기화) 직후에는
+[`barrier()`](https://docs.modular.com/mojo/std/gpu/sync/sync/barrier/)
+를 삽입해, 어떤 스레드도 `mbarrier_arrive`를 호출하기 전에 모든 스레드가
+초기화된 배리어를 관찰하도록 보장하세요. 이는
+[NVIDIA Async Barriers 초기화 패턴](https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/async-barriers.html#initialization)
+과 일치합니다.
 
 ### **출력 선택**
 
@@ -246,8 +258,7 @@ buffer\_B} & \\text{if} i \\bmod 2 = 0 \\\\
 
 <div class="code-tabs" data-tab-group="package-manager">
 <div class="tab-buttons">
-<button class="tab-button">pixi NVIDIA (default)</button>
-<button class="tab-button">pixi AMD</button>
+<button class="tab-button">pixi NVIDIA</button>
 <button class="tab-button">uv</button>
 </div>
 <div class="tab-content">
@@ -259,13 +270,6 @@ pixi run p29 --double-buffer
 </div>
 <div class="tab-content">
 
-```bash
-pixi run -e amd p29 --double-buffer
-```
-
-</div>
-<div class="tab-content">
-
 ```bash
 uv run poe p29 --double-buffer
 ```
@@ -359,14 +363,23 @@ mbarrier 조정 패턴의 이해:
 
 **핵심 타이밍 순서:**
 
-1. **모든 스레드 쓰기**: 각 스레드가 할당된 버퍼 요소를 업데이트
-2. **완료 알림**: 각 스레드가
+1. **초기화 + 동기화**: 스레드 0이
+[`mbarrier_init()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_init)
+를 호출한 다음, 모든 스레드가
+[`barrier()`](https://docs.modular.com/mojo/std/gpu/sync/sync/barrier/)
+를 실행하여 어떤 스레드도 `mbarrier_arrive`를 호출하기 전에 초기화된
+상태가 블록 전체에 보이도록 합니다
+([NVIDIA Async Barriers 문서](https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/async-barriers.html#initialization)
+참조)
+2. **모든 스레드 쓰기**: 각 스레드가 할당된 버퍼 요소를 업데이트
+3. **완료 알림**: 각 스레드가
 [`mbarrier_arrive()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_arrive)
 호출
-3. **전체 대기**: 모든 스레드가
+4. **전원 도착까지 폴링**: 모든 스레드가
+`while not mbarrier_test_wait(...): pass` 안에서 회전 —
 [`mbarrier_test_wait()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_test_wait)
-호출
-4. **진행 안전**: 이제 다음 반복을 위해 버퍼 역할을 안전하게 교체 가능
+는 비차단 검사이므로 단일 호출은 대기가 아닙니다
+5. **진행 안전**: 이제 다음 반복을 위해 버퍼 역할을 안전하게 교체 가능
 
 ## **스텐실 연산 메커니즘**
 
@@ -429,13 +442,15 @@ stencil_input = buffer_B[10] // 미정의 동작!
 buffer_B[local_i] = stencil_result
 
 # 쓰기 완료 알림
-mbarrier_arrive(barrier)
+_ = mbarrier_arrive(barrier)
 
-# 모든 스레드의 쓰기 완료까지 대기
-mbarrier_test_wait(barrier, TPB)
+# 모든 스레드의 쓰기 완료까지 폴링. mbarrier_test_wait는 비차단 호출이므로
+# 단일 호출은 대기가 아니며 반드시 루프에서 실행되어야 합니다.
+while not mbarrier_test_wait(barrier, TPB):
+    pass
 
 # 이제 읽기 안전 - 모든 쓰기 완료 보장
-stencil_input = buffer_B[neighbor_index] // 항상 올바른 값을 읽음
+stencil_input = buffer_B[neighbor_index] # 항상 올바른 값을 읽음
 ```
 
## **출력 버퍼 선택**
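
The hunk above replaces a single `mbarrier_test_wait` call with a polling loop. As a rough illustration of why, here is a small single-threaded Python model. `ArrivalCounter` and `single_check_vs_poll` are hypothetical names invented for this sketch; the class only imitates the arrive/test semantics described in the diff and is not the Mojo GPU API.

```python
class ArrivalCounter:
    """Hypothetical model of a non-blocking arrival check (not the real mbarrier)."""

    def __init__(self, expected):
        self.expected = expected  # number of arrivals required
        self.arrived = 0          # arrivals recorded so far

    def arrive(self):
        # Record one arrival and return immediately.
        self.arrived += 1

    def test_wait(self):
        # Non-blocking: report the current state, never wait for it to change.
        return self.arrived >= self.expected


def single_check_vs_poll(expected=4):
    bar = ArrivalCounter(expected)
    bar.arrive()                    # only one "thread" has arrived so far
    single_check = bar.test_wait()  # a lone check just reports "not yet"

    polls = 0
    while not bar.test_wait():      # the polling pattern from the diff
        bar.arrive()                # simulate another thread arriving between polls
        polls += 1
    return single_check, polls
```

With the default `expected=4`, `single_check_vs_poll()` returns `(False, 3)`: the lone check reports that the barrier is not complete, and the loop exits only after the three remaining arrivals.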

book/src/puzzle_29/barrier.md

Lines changed: 6 additions & 9 deletions
@@ -1,5 +1,10 @@
 # Multi-Stage Pipeline Coordination
 
+**Important note**: Puzzle 29 is currently NVIDIA-only — the
+[`mbarrier` APIs](https://docs.modular.com/mojo/std/gpu/sync/sync/) used in
+the [companion stencil puzzle](memory_barrier.md) require NVIDIA hardware,
+and the `pixi` task pulls them into the same compilation unit.
+
 ## Overview
 
 Implement a kernel that processes an image through a coordinated 3-stage
@@ -195,8 +200,7 @@ To test your solution, run the following command in your terminal:
 
 <div class="code-tabs" data-tab-group="package-manager">
 <div class="tab-buttons">
-<button class="tab-button">pixi NVIDIA (default)</button>
-<button class="tab-button">pixi AMD</button>
+<button class="tab-button">pixi NVIDIA</button>
 <button class="tab-button">uv</button>
 </div>
 <div class="tab-content">
@@ -208,13 +212,6 @@ pixi run p29 --multi-stage
 </div>
 <div class="tab-content">
 
-```bash
-pixi run -e amd p29 --multi-stage
-```
-
-</div>
-<div class="tab-content">
-
 ```bash
 uv run poe p29 --multi-stage
 ```

book/src/puzzle_29/memory_barrier.md

Lines changed: 34 additions & 20 deletions
@@ -1,5 +1,9 @@
 # Double-Buffered Stencil Computation
 
+**Important note**: This puzzle requires NVIDIA GPU hardware. The
+[`mbarrier` APIs](https://docs.modular.com/mojo/std/gpu/sync/sync/) are
+NVIDIA-only.
+
 > **🔬 Fine-Grained Synchronization: mbarrier vs barrier()**
 >
 > This puzzle introduces **explicit memory barrier APIs** that provide
@@ -232,12 +236,19 @@ when memory operations complete, essential for complex memory access patterns.
 - Call
 [`mbarrier_arrive()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_arrive)
 after each thread completes its write operations
-- Use
-[`mbarrier_test_wait()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_test_wait)
-to ensure all threads finish before buffer swap
+- Follow it with a polling loop on
+[`mbarrier_test_wait()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_test_wait):
+the API is a **non-blocking** check, so call it inside
+`while not mbarrier_test_wait(...): pass` to actually wait for every thread
+to arrive before the buffer swap
 - Reinitialize barriers between iterations for reuse:
 [`mbarrier_init()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_init)
 - Only thread 0 should reinitialize barriers to avoid race conditions
+- Insert a [`barrier()`](https://docs.modular.com/mojo/std/gpu/sync/sync/barrier/)
+after each `mbarrier_init` call (initial setup and per-iteration reinit) so
+every thread observes the initialized barrier before any thread calls
+`mbarrier_arrive`. This matches the
+[NVIDIA Async Barriers initialization pattern](https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/async-barriers.html#initialization).
 
 ### **Output selection**
 
@@ -255,8 +266,7 @@ To test your solution, run the following command in your terminal:
 
 <div class="code-tabs" data-tab-group="package-manager">
 <div class="tab-buttons">
-<button class="tab-button">pixi NVIDIA (default)</button>
-<button class="tab-button">pixi AMD</button>
+<button class="tab-button">pixi NVIDIA</button>
 <button class="tab-button">uv</button>
 </div>
 <div class="tab-content">
@@ -268,13 +278,6 @@ pixi run p29 --double-buffer
 </div>
 <div class="tab-content">
 
-```bash
-pixi run -e amd p29 --double-buffer
-```
-
-</div>
-<div class="tab-content">
-
 ```bash
 uv run poe p29 --double-buffer
 ```
@@ -380,12 +383,21 @@ Understanding the mbarrier coordination pattern:
 
 **Critical timing sequence:**
 
-1. **All threads write**: Each thread updates its assigned buffer element
-2. **Signal completion**: Each thread calls
+1. **Init + sync**: Thread 0 calls
+[`mbarrier_init()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_init),
+then every thread executes a
+[`barrier()`](https://docs.modular.com/mojo/std/gpu/sync/sync/barrier/) so
+the initialized state is visible block-wide before any
+`mbarrier_arrive` call (see the
+[NVIDIA Async Barriers docs](https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/async-barriers.html#initialization))
+2. **All threads write**: Each thread updates its assigned buffer element
+3. **Signal completion**: Each thread calls
 [`mbarrier_arrive()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_arrive)
-3. **Wait for all**: All threads call
+4. **Poll until all arrived**: Every thread spins in
+`while not mbarrier_test_wait(...): pass`;
 [`mbarrier_test_wait()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_test_wait)
-4. **Safe to proceed**: Now safe to swap buffer roles for next iteration
+is a non-blocking check, so a single call is not a wait
+5. **Safe to proceed**: Now safe to swap buffer roles for next iteration
 
 ## **Stencil operation mechanics**
 
@@ -448,13 +460,15 @@ stencil_input = buffer_B[10] // Undefined behavior!
 buffer_B[local_i] = stencil_result
 
 # Signal write completion
-mbarrier_arrive(barrier)
+_ = mbarrier_arrive(barrier)
 
-# Wait for ALL threads to complete writes
-mbarrier_test_wait(barrier, TPB)
+# Poll until ALL threads have completed writes. mbarrier_test_wait is
+# non-blocking, so a single call is NOT a wait — it must run in a loop.
+while not mbarrier_test_wait(barrier, TPB):
+    pass
 
 # Now safe to read - all writes guaranteed complete
-stencil_input = buffer_B[neighbor_index] // Always sees correct values
+stencil_input = buffer_B[neighbor_index] # Always sees correct values
 ```
 
## **Output buffer selection**
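
The five-step timing sequence in this file's diff (init + sync, write, arrive, poll, proceed) can be sketched on the CPU with Python threads. This is only an analogy under stated assumptions: `ToyMBarrier` and `run_block` are hypothetical names, `threading.Barrier` plays the role of `barrier()`, and `TPB = 4` is an arbitrary toy value. It does not model GPU memory ordering or the real mbarrier phase/state handling.

```python
import threading
import time

TPB = 4  # threads per block in this toy model (arbitrary, hypothetical value)


class ToyMBarrier:
    """Hypothetical CPU stand-in for an mbarrier: counts arrivals, never blocks."""

    def __init__(self):
        self._lock = threading.Lock()
        self._expected = 0
        self._arrived = 0

    def init(self, expected):
        # Models mbarrier_init: set the expected arrival count.
        with self._lock:
            self._expected = expected
            self._arrived = 0

    def arrive(self):
        # Models mbarrier_arrive: record one arrival and return immediately.
        with self._lock:
            self._arrived += 1

    def test_wait(self):
        # Models mbarrier_test_wait: a non-blocking completion check.
        with self._lock:
            return self._expected > 0 and self._arrived >= self._expected


def run_block():
    mbar = ToyMBarrier()
    block_sync = threading.Barrier(TPB)  # plays the role of barrier()
    buffer_b = [0] * TPB
    seen = [None] * TPB

    def worker(local_i):
        # 1. Init + sync: thread 0 initializes, then every thread synchronizes
        #    so nobody arrives at an uninitialized barrier.
        if local_i == 0:
            mbar.init(TPB)
        block_sync.wait()
        # 2. Each thread writes its own element.
        buffer_b[local_i] = local_i * 10
        # 3. Signal write completion.
        mbar.arrive()
        # 4. Poll until all threads have arrived.
        while not mbar.test_wait():
            time.sleep(0)  # yield to other Python threads; the GPU loop body is `pass`
        # 5. Safe to read a neighbour's element.
        seen[local_i] = buffer_b[(local_i + 1) % TPB]

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(TPB)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Calling `run_block()` returns `[10, 20, 30, 0]`: every thread reads its right-hand neighbour's value only after all writes have been signalled. Dropping the `block_sync.wait()` call would let a thread arrive before thread 0 has initialized the barrier and have that arrival erased by the reset, which is the race the pre-barrier synchronization in this commit addresses.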
