 # Double-Buffered Stencil Computation

+**Important note**: This puzzle requires NVIDIA GPU hardware. The
+[`mbarrier` APIs](https://docs.modular.com/mojo/std/gpu/sync/sync/) are
+NVIDIA-only.
+
 > **🔬 Fine-Grained Synchronization: mbarrier vs barrier()**
 >
 > This puzzle introduces **explicit memory barrier APIs** that provide
@@ -232,12 +236,19 @@ when memory operations complete, essential for complex memory access patterns.
 - Call
   [`mbarrier_arrive()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_arrive)
   after each thread completes its write operations
-- Use
-  [`mbarrier_test_wait()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_test_wait)
-  to ensure all threads finish before buffer swap
+- Follow it with a polling loop on
+  [`mbarrier_test_wait()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_test_wait):
+  the API is a **non-blocking** check, so call it inside
+  `while not mbarrier_test_wait(...): pass` to actually wait for every thread
+  to arrive before the buffer swap
 - Reinitialize barriers between iterations for reuse:
   [`mbarrier_init()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_init)
 - Only thread 0 should reinitialize barriers to avoid race conditions
+- Insert a [`barrier()`](https://docs.modular.com/mojo/std/gpu/sync/sync/barrier/)
+  after each `mbarrier_init` call (initial setup and per-iteration reinit) so
+  every thread observes the initialized barrier before any thread calls
+  `mbarrier_arrive`. This matches the
+  [NVIDIA Async Barriers initialization pattern](https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/async-barriers.html#initialization)

 ### **Output selection**

@@ -255,8 +266,7 @@ To test your solution, run the following command in your terminal:

 <div class="code-tabs" data-tab-group="package-manager">
   <div class="tab-buttons">
-    <button class="tab-button">pixi NVIDIA (default)</button>
-    <button class="tab-button">pixi AMD</button>
+    <button class="tab-button">pixi NVIDIA</button>
     <button class="tab-button">uv</button>
   </div>
   <div class="tab-content">
@@ -268,13 +278,6 @@ pixi run p29 --double-buffer
   </div>
   <div class="tab-content">

-```bash
-pixi run -e amd p29 --double-buffer
-```
-
-  </div>
-  <div class="tab-content">
-
 ```bash
 uv run poe p29 --double-buffer
 ```
@@ -380,12 +383,21 @@ Understanding the mbarrier coordination pattern:

 **Critical timing sequence:**

-1. **All threads write**: Each thread updates its assigned buffer element
-2. **Signal completion**: Each thread calls
+1. **Init + sync**: Thread 0 calls
+   [`mbarrier_init()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_init),
+   then every thread executes a
+   [`barrier()`](https://docs.modular.com/mojo/std/gpu/sync/sync/barrier/) so
+   the initialized state is visible block-wide before any
+   `mbarrier_arrive` call (see the
+   [NVIDIA Async Barriers docs](https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/async-barriers.html#initialization))
+2. **All threads write**: Each thread updates its assigned buffer element
+3. **Signal completion**: Each thread calls
    [`mbarrier_arrive()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_arrive)
-3. **Wait for all**: All threads call
+4. **Poll until all arrived**: Every thread spins in
+   `while not mbarrier_test_wait(...): pass` —
    [`mbarrier_test_wait()`](https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_test_wait)
-4. **Safe to proceed**: Now safe to swap buffer roles for next iteration
+   is a non-blocking check, so a single call is not a wait
+5. **Safe to proceed**: Now safe to swap buffer roles for next iteration
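
The five-step sequence in this hunk can be modelled on the host to make the ordering concrete. The sketch below is a plain-Python analogy, not Mojo GPU code. `HostBarrierModel`, `run_block`, and the clamped three-point average are hypothetical names and choices made for illustration. `threading.Barrier` stands in for `barrier()`, and the model's `test_wait()` takes no state argument, unlike the real `mbarrier_test_wait()`.

```python
import threading
import time


class HostBarrierModel:
    """Hypothetical host-side stand-in for an mbarrier (not a GPU API)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._expected = None  # None means "not initialized yet"
        self._arrived = 0

    def init(self, expected: int) -> None:
        with self._lock:
            self._expected = expected
            self._arrived = 0

    def arrive(self) -> None:
        with self._lock:
            if self._expected is None:
                raise RuntimeError("arrive() called before init()")
            self._arrived += 1

    def test_wait(self) -> bool:
        # Non-blocking: report the current state and return immediately.
        with self._lock:
            return self._expected is not None and self._arrived >= self._expected


def run_block(values: list) -> list:
    """Run one model 'thread block' over `values`, one thread per element."""
    tpb = len(values)
    counter = HostBarrierModel()
    block_sync = threading.Barrier(tpb)  # stands in for barrier()
    shared = [0.0] * tpb
    result = [0.0] * tpb

    def worker(tid: int) -> None:
        if tid == 0:
            counter.init(tpb)           # 1. thread 0 initializes ...
        block_sync.wait()               #    ... then every thread syncs
        shared[tid] = values[tid]       # 2. write own element
        counter.arrive()                # 3. signal completion
        while not counter.test_wait():  # 4. poll until all have arrived
            time.sleep(0)               #    yield to other host threads
        # 5. safe to read neighbours (edges clamped, an assumption here)
        left = shared[max(tid - 1, 0)]
        right = shared[min(tid + 1, tpb - 1)]
        result[tid] = (left + shared[tid] + right) / 3.0

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(tpb)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

With this model, `run_block([3.0, 6.0, 9.0, 12.0])` returns `[4.0, 6.0, 9.0, 11.0]`. Removing the `block_sync.wait()` line lets a fast thread reach `arrive()` before thread 0 has run `init()`, which is the hazard step 1 guards against.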

 ## **Stencil operation mechanics**

@@ -448,13 +460,15 @@ stencil_input = buffer_B[10] // Undefined behavior!
 buffer_B[local_i] = stencil_result

 # Signal write completion
-mbarrier_arrive(barrier)
+_ = mbarrier_arrive(barrier)

-# Wait for ALL threads to complete writes
-mbarrier_test_wait(barrier, TPB)
+# Poll until ALL threads have completed writes. mbarrier_test_wait is
+# non-blocking, so a single call is NOT a wait — it must run in a loop.
+while not mbarrier_test_wait(barrier, TPB):
+    pass

 # Now safe to read - all writes guaranteed complete
-stencil_input = buffer_B[neighbor_index] // Always sees correct values
+stencil_input = buffer_B[neighbor_index]  # Always sees correct values
 ```
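
Why a lone call is not a wait can be seen in a deterministic, single-threaded trace. This is a hedged Python illustration rather than the puzzle's Mojo code. `ArrivalCounter` and `poll_trace` are hypothetical names, and the counter only mimics the "check and return immediately" behaviour the diff describes for `mbarrier_test_wait()`.

```python
class ArrivalCounter:
    """Hypothetical stand-in for an mbarrier; not a Mojo or CUDA API."""

    def __init__(self, expected: int) -> None:
        self.expected = expected
        self.arrived = 0

    def arrive(self) -> None:
        self.arrived += 1

    def test_wait(self) -> bool:
        # Non-blocking check: returns the current answer immediately.
        return self.arrived >= self.expected


def poll_trace(tpb: int) -> list:
    """Record what test_wait() reports after each of `tpb` arrivals."""
    counter = ArrivalCounter(tpb)
    trace = []
    for _ in range(tpb):
        counter.arrive()
        trace.append(counter.test_wait())
    return trace


print(poll_trace(4))  # [False, False, False, True]
```

A thread that called `test_wait()` once after its own arrival and ignored the result would have continued on any of the `False` answers. The `while not ...` loop is what turns the check into a wait.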

 ## **Output buffer selection**