diff --git a/book/i18n/ko/src/puzzle_29/barrier.md b/book/i18n/ko/src/puzzle_29/barrier.md
Lines changed: 7 additions & 9 deletions
diff --git a/book/i18n/ko/src/puzzle_29/memory_barrier.md b/book/i18n/ko/src/puzzle_29/memory_barrier.md
Lines changed: 35 additions & 20 deletions
diff --git a/book/src/puzzle_29/barrier.md b/book/src/puzzle_29/barrier.md
Lines changed: 6 additions & 9 deletions
diff --git a/book/src/puzzle_29/memory_barrier.md b/book/src/puzzle_29/memory_barrier.md
Lines changed: 34 additions & 20 deletions
@@ -2,6 +2,12 @@
 
 # 다단계 파이프라인 조정
 
+**중요 사항**: 퍼즐 29는 현재 NVIDIA 전용입니다.
+[동반 스텐실 퍼즐](memory_barrier.md)에서 사용되는
+[`mbarrier` API](https://docs.modular.com/mojo/std/gpu/sync/sync/) 는
+NVIDIA 하드웨어가 필요하며, `pixi` 작업이 이를 동일한 컴파일 단위로
+가져옵니다.
+
 ## 개요
 
 조율된 3단계 파이프라인을 통해 이미지를 처리하는 커널을 구현합니다. 서로 다른
@@ -184,8 +190,7 @@
 
 <div class="code-tabs" data-tab-group="package-manager">
   <div class="tab-buttons">
-    <button class="tab-button">pixi NVIDIA (default)</button>
-    <button class="tab-button">pixi AMD</button>
+    <button class="tab-button">pixi NVIDIA</button>
     <button class="tab-button">uv</button>
   </div>
   <div class="tab-content">
@@ -197,13 +202,6 @@ pixi run p29 --multi-stage
   </div>
   <div class="tab-content">
 
-```bash
-pixi run -e amd p29 --multi-stage
-```
-
-  </div>
-  <div class="tab-content">
-
 ```bash
 uv run poe p29 --multi-stage
 ```
 
@@ -2,6 +2,10 @@
 
 # 더블 버퍼링 스텐실 연산
 
+**중요 사항**: 이 퍼즐은 NVIDIA GPU 하드웨어가 필요합니다.
+[`mbarrier` API](https://docs.modular.com/mojo/std/gpu/sync/sync/)는
+NVIDIA 전용입니다.
+
 > **🔬 세밀한 동기화: mbarrier vs barrier()**
 >
 > 이 퍼즐은 이전 퍼즐에서 사용한 기본
@@ -223,12 +227,20 @@ buffer\_B} & \\text{if} i \\bmod 2 = 0 \\\\
 - 각 스레드가 쓰기 연산을 완료한 후
   [`mbarrier_arrive()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_arrive)
   호출
-- 버퍼 교체 전 모든 스레드가 완료하도록
+- 그 다음
   [`mbarrier_test_wait()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_test_wait)
-  사용
+  를 폴링 루프로 호출: 이 API는 **비차단** 검사이므로
+  `while not mbarrier_test_wait(...): pass` 안에서 호출해야 모든 스레드가
+  도착할 때까지 실제로 대기할 수 있음
 - 재사용을 위해 반복 간에 배리어 재초기화:
   [`mbarrier_init()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_init)
 - 경쟁 상태를 피하기 위해 스레드 0만 배리어를 재초기화
+- 모든 `mbarrier_init` 호출(초기 설정과 반복마다의 재초기화) 직후에는
+  [`barrier()`](https://docs.modular.com/mojo/std/gpu/sync/sync/barrier/)
+  를 삽입해, 어떤 스레드도 `mbarrier_arrive`를 호출하기 전에 모든 스레드가
+  초기화된 배리어를 관찰하도록 보장하세요. 이는
+  [NVIDIA Async Barriers 초기화 패턴](https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/async-barriers.html#initialization)
+  과 일치합니다.
 
 ### **출력 선택**
 
@@ -246,8 +258,7 @@ buffer\_B} & \\text{if} i \\bmod 2 = 0 \\\\
 
 <div class="code-tabs" data-tab-group="package-manager">
   <div class="tab-buttons">
-    <button class="tab-button">pixi NVIDIA (default)</button>
-    <button class="tab-button">pixi AMD</button>
+    <button class="tab-button">pixi NVIDIA</button>
     <button class="tab-button">uv</button>
   </div>
   <div class="tab-content">
@@ -259,13 +270,6 @@ pixi run p29 --double-buffer
   </div>
   <div class="tab-content">
 
-```bash
-pixi run -e amd p29 --double-buffer
-```
-
-  </div>
-  <div class="tab-content">
-
 ```bash
 uv run poe p29 --double-buffer
 ```
@@ -359,14 +363,23 @@ mbarrier 조정 패턴의 이해:
 
 **핵심 타이밍 순서:**
 
-1. **모든 스레드 쓰기**: 각 스레드가 할당된 버퍼 요소를 업데이트
-2. **완료 알림**: 각 스레드가
+1. **초기화 + 동기화**: 스레드 0이
+   [`mbarrier_init()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_init)
+   를 호출한 다음, 모든 스레드가
+   [`barrier()`](https://docs.modular.com/mojo/std/gpu/sync/sync/barrier/)
+   를 실행하여 어떤 스레드도 `mbarrier_arrive`를 호출하기 전에 초기화된
+   상태가 블록 전체에 보이도록 합니다
+   ([NVIDIA Async Barriers 문서](https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/async-barriers.html#initialization)
+   참조)
+2. **모든 스레드 쓰기**: 각 스레드가 할당된 버퍼 요소를 업데이트
+3. **완료 알림**: 각 스레드가
    [`mbarrier_arrive()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_arrive)
    호출
-3. **전체 대기**: 모든 스레드가
+4. **전원 도착까지 폴링**: 모든 스레드가
+   `while not mbarrier_test_wait(...): pass` 안에서 회전 —
    [`mbarrier_test_wait()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_test_wait)
-   호출
-4. **진행 안전**: 이제 다음 반복을 위해 버퍼 역할을 안전하게 교체 가능
+   는 비차단 검사이므로 단일 호출은 대기가 아닙니다
+5. **진행 안전**: 이제 다음 반복을 위해 버퍼 역할을 안전하게 교체 가능
 
 ## **스텐실 연산 메커니즘**
 
@@ -429,13 +442,15 @@ stencil_input = buffer_B[10]  // 미정의 동작!
 buffer_B[local_i] = stencil_result
 
 # 쓰기 완료 알림
-mbarrier_arrive(barrier)
+_ = mbarrier_arrive(barrier)
 
-# 모든 스레드의 쓰기 완료까지 대기
-mbarrier_test_wait(barrier, TPB)
+# 모든 스레드의 쓰기 완료까지 폴링. mbarrier_test_wait는 비차단 호출이므로
+# 단일 호출은 대기가 아니며 반드시 루프에서 실행되어야 합니다.
+while not mbarrier_test_wait(barrier, TPB):
+    pass
 
 # 이제 읽기 안전 - 모든 쓰기 완료 보장
-stencil_input = buffer_B[neighbor_index]  // 항상 올바른 값을 읽음
+stencil_input = buffer_B[neighbor_index]  # 항상 올바른 값을 읽음
 ```
 
 ## **출력 버퍼 선택**
 
@@ -1,5 +1,10 @@
 # Multi-Stage Pipeline Coordination
 
+**Important note**: Puzzle 29 is currently NVIDIA-only. The
+[`mbarrier` APIs](https://docs.modular.com/mojo/std/gpu/sync/sync/) used in
+the [companion stencil puzzle](memory_barrier.md) require NVIDIA hardware,
+and the `pixi` task pulls them into the same compilation unit as this puzzle.
+
 ## Overview
 
 Implement a kernel that processes an image through a coordinated 3-stage
@@ -195,8 +200,7 @@ To test your solution, run the following command in your terminal:
 
 <div class="code-tabs" data-tab-group="package-manager">
   <div class="tab-buttons">
-    <button class="tab-button">pixi NVIDIA (default)</button>
-    <button class="tab-button">pixi AMD</button>
+    <button class="tab-button">pixi NVIDIA</button>
     <button class="tab-button">uv</button>
   </div>
   <div class="tab-content">
@@ -208,13 +212,6 @@ pixi run p29 --multi-stage
   </div>
   <div class="tab-content">
 
-```bash
-pixi run -e amd p29 --multi-stage
-```
-
-  </div>
-  <div class="tab-content">
-
 ```bash
 uv run poe p29 --multi-stage
 ```
 
@@ -1,5 +1,9 @@
 # Double-Buffered Stencil Computation
 
+**Important note**: This puzzle requires NVIDIA GPU hardware. The
+[`mbarrier` APIs](https://docs.modular.com/mojo/std/gpu/sync/sync/) are
+NVIDIA-only.
+
 > **🔬 Fine-Grained Synchronization: mbarrier vs barrier()**
 >
 > This puzzle introduces **explicit memory barrier APIs** that provide
@@ -232,12 +236,19 @@ when memory operations complete, essential for complex memory access patterns.
 - Call
   [`mbarrier_arrive()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_arrive)
   after each thread completes its write operations
-- Use
-  [`mbarrier_test_wait()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_test_wait)
-  to ensure all threads finish before buffer swap
+- Follow it with a polling loop on
+  [`mbarrier_test_wait()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_test_wait):
+  the API is a **non-blocking** check, so call it inside
+  `while not mbarrier_test_wait(...): pass` to actually wait for every thread
+  to arrive before the buffer swap
 - Reinitialize barriers between iterations for reuse:
   [`mbarrier_init()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_init)
 - Only thread 0 should reinitialize barriers to avoid race conditions
+- Insert a [`barrier()`](https://docs.modular.com/mojo/std/gpu/sync/sync/barrier/)
+  after each `mbarrier_init` call (initial setup and per-iteration reinit) so
+  every thread observes the initialized barrier before any thread calls
+  `mbarrier_arrive`. This matches the
+  [NVIDIA Async Barriers initialization pattern](https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/async-barriers.html#initialization).
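
Reviewer sketch (not part of the patch): the difference between a single `mbarrier_test_wait()` call and a polling loop can be illustrated with a sequential toy model in plain Python. `ToyMBarrier` and `poll` are hypothetical names invented for this illustration. They only imitate the arrive/test semantics described in the hints above and are not the Mojo or NVIDIA API.

```python
class ToyMBarrier:
    """Sequential toy model of an arrival-counting barrier (not a GPU API)."""

    def __init__(self, expected):
        self.expected = expected  # models the count passed to mbarrier_init
        self.arrived = 0

    def arrive(self):
        # Models mbarrier_arrive: count one arrival, never block.
        self.arrived += 1

    def test_wait(self):
        # Models mbarrier_test_wait: report completion, never block.
        return self.arrived >= self.expected


def poll(bar, pending_arrivals):
    """Spin on test_wait while the remaining simulated threads arrive."""
    spins = 0
    while not bar.test_wait():
        pending_arrivals.pop()()  # let one more simulated thread arrive
        spins += 1
    return spins


bar = ToyMBarrier(expected=4)
bar.arrive()                    # this thread has finished its write
single_check = bar.test_wait()  # one call: three threads are still writing
spins = poll(bar, [bar.arrive] * 3)
print(single_check, spins, bar.test_wait())  # prints: False 3 True
```

The single check returns `False` because it only samples the barrier state. Only the loop keeps the caller from proceeding until every expected arrival has been counted.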
 
 ### **Output selection**
 
@@ -255,8 +266,7 @@ To test your solution, run the following command in your terminal:
 
 <div class="code-tabs" data-tab-group="package-manager">
   <div class="tab-buttons">
-    <button class="tab-button">pixi NVIDIA (default)</button>
-    <button class="tab-button">pixi AMD</button>
+    <button class="tab-button">pixi NVIDIA</button>
     <button class="tab-button">uv</button>
   </div>
   <div class="tab-content">
@@ -268,13 +278,6 @@ pixi run p29 --double-buffer
   </div>
   <div class="tab-content">
 
-```bash
-pixi run -e amd p29 --double-buffer
-```
-
-  </div>
-  <div class="tab-content">
-
 ```bash
 uv run poe p29 --double-buffer
 ```
@@ -380,12 +383,21 @@ Understanding the mbarrier coordination pattern:
 
 **Critical timing sequence:**
 
-1. **All threads write**: Each thread updates its assigned buffer element
-2. **Signal completion**: Each thread calls
+1. **Init + sync**: Thread 0 calls
+   [`mbarrier_init()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_init),
+   then every thread executes a
+   [`barrier()`](https://docs.modular.com/mojo/std/gpu/sync/sync/barrier/) so
+   the initialized state is visible block-wide before any
+   `mbarrier_arrive` call (see the
+   [NVIDIA Async Barriers docs](https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/async-barriers.html#initialization))
+2. **All threads write**: Each thread updates its assigned buffer element
+3. **Signal completion**: Each thread calls
    [`mbarrier_arrive()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_arrive)
-3. **Wait for all**: All threads call
+4. **Poll until all arrived**: Every thread spins in
+   `while not mbarrier_test_wait(...): pass` —
    [`mbarrier_test_wait()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_test_wait)
-4. **Safe to proceed**: Now safe to swap buffer roles for next iteration
+   is a non-blocking check, so a single call is not a wait
+5. **Safe to proceed**: Now safe to swap buffer roles for next iteration
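
Reviewer sketch (not part of the patch): the five steps above can be modeled on the host with Python threads. This is a minimal illustration under stated assumptions. `ModelMBarrier` and `run_iteration` are hypothetical names, `threading.Barrier` stands in for `barrier()`, and the model covers a single iteration only, so it does not capture phase handling or per-iteration reinitialization on real hardware.

```python
import threading
import time

TPB = 8  # threads per block in this toy model


class ModelMBarrier:
    """Thread-safe toy model of an arrive / test-wait barrier (not a GPU API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._expected = 0
        self._arrived = 0

    def init(self, expected):
        # Models mbarrier_init: set the expected arrival count.
        with self._lock:
            self._expected = expected
            self._arrived = 0

    def arrive(self):
        # Models mbarrier_arrive: record one arrival and return immediately.
        with self._lock:
            self._arrived += 1

    def test_wait(self):
        # Models mbarrier_test_wait: a non-blocking completion check.
        with self._lock:
            return self._expected > 0 and self._arrived >= self._expected


def run_iteration(src):
    dst = [None] * TPB
    seen = [None] * TPB
    mbar = ModelMBarrier()
    block_sync = threading.Barrier(TPB)  # stands in for barrier()

    def worker(tid):
        if tid == 0:
            mbar.init(TPB)               # step 1: only thread 0 initializes
        block_sync.wait()                # step 1: all threads observe the init
        dst[tid] = src[tid] * 2          # step 2: write the assigned element
        mbar.arrive()                    # step 3: signal completion
        while not mbar.test_wait():      # step 4: poll until all have arrived
            time.sleep(0)                # yield so other threads can run
        seen[tid] = dst[(tid + 1) % TPB]  # step 5: neighbor read is now safe

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(TPB)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


print(run_iteration(list(range(TPB))))  # prints [2, 4, 6, 8, 10, 12, 14, 0]
```

Every thread reads its neighbor's freshly written value because no thread leaves the polling loop until all `TPB` arrivals are counted. Removing the `block_sync.wait()` line would let a thread arrive before thread 0 has initialized the model, which is the race the init-then-sync step guards against.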
 
 ## **Stencil operation mechanics**
 
@@ -448,13 +460,15 @@ stencil_input = buffer_B[10]  // Undefined behavior!
 buffer_B[local_i] = stencil_result
 
 # Signal write completion
-mbarrier_arrive(barrier)
+_ = mbarrier_arrive(barrier)
 
-# Wait for ALL threads to complete writes
-mbarrier_test_wait(barrier, TPB)
+# Poll until ALL threads have completed writes. mbarrier_test_wait is
+# non-blocking, so a single call is NOT a wait — it must run in a loop.
+while not mbarrier_test_wait(barrier, TPB):
+    pass
 
 # Now safe to read - all writes guaranteed complete
-stencil_input = buffer_B[neighbor_index]  // Always sees correct values
+stencil_input = buffer_B[neighbor_index]  # Always sees correct values
 ```
 
 ## **Output buffer selection**