Commit acd8a6d

added an update
1 parent f223cb2 commit acd8a6d

blog/2025/2025-06-19-subgroup-shuffle-reconvergence-on-nvidia/index.md

Lines changed: 42 additions & 0 deletions
@@ -364,3 +364,45 @@ Unlike AMD's RDNA ISAs where we can verify that the compiler is doing what it sh

----------------------------
_This issue was observed happening inconsistently on Nvidia driver version 576.80, released 17th June 2025._

## Update as of February 2026

We've been informed that Nvidia noticed this blog post in June 2025 and issued a fix in July 2025.
We then re-ran our benchmarks for subgroup operations on Nvidia driver version 591.86, released 27th January 2026, and can confirm there are improvements.

### Benchmarks for workgroup size 256, items per invocation=1

With only 1 item pre-scanned per invocation, native and emulated inclusive scans perform about equally (12.24 ms vs 12.22 ms dispatch time).
For exclusive scan, however, emulated is about 1.17x faster than native (17.15 ms vs 14.66 ms).
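
As a refresher on the two scan flavours compared here, the sketch below simulates one subgroup in plain Python. It assumes the emulated path is a shuffle-based Hillis-Steele scan, which this update does not confirm. `shuffle_up`, `emulated_inclusive_scan` and `emulated_exclusive_scan` are hypothetical names, and a Python list stands in for the lanes of a subgroup.

```python
def shuffle_up(values, delta):
    """Simulate a subgroup shuffle-up: lane i reads the value held by lane i - delta.

    Lanes without a source lane keep their own value here; callers must not
    rely on that value, so they mask those lanes themselves.
    """
    return [values[i - delta] if i >= delta else values[i]
            for i in range(len(values))]


def emulated_inclusive_scan(values):
    """Inclusive prefix sum built from shuffles (Hillis-Steele, assumed scheme)."""
    lanes = len(values)
    result = list(values)
    delta = 1
    while delta < lanes:
        shifted = shuffle_up(result, delta)
        # Only lanes that have a valid source lane accumulate.
        result = [result[i] + shifted[i] if i >= delta else result[i]
                  for i in range(lanes)]
        delta *= 2
    return result


def emulated_exclusive_scan(values, identity=0):
    """Exclusive prefix sum: the inclusive result shifted up by one lane."""
    inclusive = emulated_inclusive_scan(values)
    shifted = shuffle_up(inclusive, 1)
    # Lane 0 has no predecessor, so it receives the identity element.
    return [shifted[i] if i >= 1 else identity for i in range(len(values))]


if __name__ == "__main__":
    data = [1, 2, 3, 4]
    print(emulated_inclusive_scan(data))
    print(emulated_exclusive_scan(data))
```

Under that assumption, the exclusive variant costs one more shuffle than the inclusive one. The sketch says nothing about how the native operations are implemented by the driver.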

#### Inclusive scan

| Operation mode | SM throughput (%) | CS warp occupancy (%) | # registers | Dispatch time (ms) |
| :------------: | :---------------: | :-------------------: | :---------: | :----------------: |
| Native | 69.5 | 97.9 | 16 | 12.24 |
| Emulated | 37.9 | 97.7 | 16 | 12.22 |

#### Exclusive scan

| Operation mode | SM throughput (%) | CS warp occupancy (%) | # registers | Dispatch time (ms) |
| :------------: | :---------------: | :-------------------: | :---------: | :----------------: |
| Native | 60.2 | 97.8 | 16 | 17.15 |
| Emulated | 36.9 | 98.1 | 16 | 14.66 |

### Benchmarks for workgroup size 256, items per invocation=4

In contrast, with 4 items per invocation, emulated operations are roughly 1.18x faster than native (3.78 ms vs 4.45 ms dispatch time) for both inclusive and exclusive scans.

#### Inclusive scan

| Operation mode | SM throughput (%) | CS warp occupancy (%) | # registers | Dispatch time (ms) |
| :------------: | :---------------: | :-------------------: | :---------: | :----------------: |
| Native | 68.3 | 92.7 | 18 | 4.45 |
| Emulated | 44.8 | 92.6 | 17 | 3.78 |

#### Exclusive scan

| Operation mode | SM throughput (%) | CS warp occupancy (%) | # registers | Dispatch time (ms) |
| :------------: | :---------------: | :-------------------: | :---------: | :----------------: |
| Native | 70.0 | 92.7 | 18 | 4.45 |
| Emulated | 42.6 | 92.7 | 16 | 3.78 |
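
The speedup figures quoted in this update can be reproduced from the dispatch times in the four tables. This is a minimal sketch, assuming speedup is defined as native dispatch time divided by emulated dispatch time. The dictionary layout and the `speedup` helper are illustrative only.

```python
# Dispatch times in milliseconds, copied from the tables above:
# (items per invocation, scan type) -> (native, emulated)
dispatch_ms = {
    (1, "inclusive"): (12.24, 12.22),
    (1, "exclusive"): (17.15, 14.66),
    (4, "inclusive"): (4.45, 3.78),
    (4, "exclusive"): (4.45, 3.78),
}


def speedup(native_ms, emulated_ms):
    """Return how many times faster the emulated path is than the native one."""
    return native_ms / emulated_ms


# Ratio above 1.0 means the emulated path finished sooner.
speedups = {key: speedup(*times) for key, times in dispatch_ms.items()}

if __name__ == "__main__":
    for (items, scan), ratio in speedups.items():
        print(f"items per invocation={items}, {scan} scan: {ratio:.3f}x")
```

A ratio close to 1.0 means the two paths are effectively tied, which is the case only for the inclusive scan with 1 item per invocation.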
