`blog/2025/2025-06-19-subgroup-shuffle-reconvergence-on-nvidia/index.md` (42 additions, 0 deletions)
@@ -364,3 +364,45 @@ Unlike AMD's RDNA ISAs where we can verify that the compiler is doing what it sh
----------------------------
_This issue was observed happening inconsistently on Nvidia driver version 576.80, released 17th June 2025._
## Update as of February 2026
We've been informed that Nvidia noticed this blog post in June 2025 and issued a fix in July 2025.
We then re-ran our benchmarks for subgroup operations on Nvidia driver version 591.86, released 27th January 2026, and can confirm there are improvements.
### Benchmarks for workgroup size 256, items per invocation=1
When pre-scanning only 1 item per invocation, native and emulated inclusive scans perform about equally.
For exclusive scan, however, the emulated version is about 1.17x faster than the native one.
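For readers who want to reproduce the ratio from their own captures, here is a minimal sketch of how such a speedup figure can be derived. It assumes the speedup is the ratio of native to emulated dispatch time; the `speedup` helper and the sample timings are hypothetical and are not measurements from this post.

```python
def speedup(native_ms: float, emulated_ms: float) -> float:
    """Return how many times faster the emulated path is than the native one.

    Assumption: speedup is defined as native dispatch time divided by
    emulated dispatch time, both in milliseconds.
    """
    if native_ms <= 0 or emulated_ms <= 0:
        raise ValueError("dispatch times must be positive")
    return native_ms / emulated_ms


# Hypothetical dispatch times, NOT taken from the benchmark tables.
native_ms = 1.17
emulated_ms = 1.00
print(f"{speedup(native_ms, emulated_ms):.2f}x")  # prints "1.17x"
```

A value above 1 means the emulated path finished sooner; a value of 1 means both paths took the same time.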
#### Inclusive scan
| Operation mode | SM throughput (%) | CS warp occupancy (%) | # registers | Dispatch time (ms) |
### Benchmarks for workgroup size 256, items per invocation=4
In contrast, with 4 items per invocation, the emulated operations are around 1.17x faster than the native ones for both inclusive and exclusive scans.
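To make "emulated" and "items per invocation" concrete, below is a rough CPU-side model, not the shader code benchmarked here. It assumes the emulated path is built from shuffle-up steps (a Hillis-Steele scan) and that each invocation scans its own items serially before the subgroup-wide scan; all function names are hypothetical.

```python
from itertools import accumulate

SUBGROUP_SIZE = 32  # assumption: one subgroup modelled as 32 lanes


def shuffle_up(values, delta):
    """Model a shuffle-up: lane i reads lane i - delta, or keeps its own value."""
    return [values[i - delta] if i >= delta else values[i]
            for i in range(len(values))]


def emulated_inclusive_scan(values):
    """Hillis-Steele inclusive add-scan built from shuffle-up steps."""
    result = list(values)
    delta = 1
    while delta < len(result):
        shifted = shuffle_up(result, delta)
        # Lanes below delta have no partner and keep their value.
        result = [result[i] + shifted[i] if i >= delta else result[i]
                  for i in range(len(result))]
        delta *= 2
    return result


def emulated_exclusive_scan(values, identity=0):
    """Exclusive scan derived from the inclusive one by shifting one lane."""
    return [identity] + emulated_inclusive_scan(values)[:-1]


def scan_with_items_per_invocation(data, items_per_invocation):
    """Each lane scans its own items serially, then lane totals are scanned."""
    chunks = [data[i:i + items_per_invocation]
              for i in range(0, len(data), items_per_invocation)]
    local = [list(accumulate(chunk)) for chunk in chunks]
    offsets = emulated_exclusive_scan([scanned[-1] for scanned in local])
    return [v + off for scanned, off in zip(local, offsets) for v in scanned]


data = list(range(1, SUBGROUP_SIZE * 4 + 1))
assert scan_with_items_per_invocation(data, 4) == list(accumulate(data))
```

In this model, raising the items per invocation moves more of the work into the serial per-lane loop, so the subgroup-wide scan runs once per 4 inputs instead of once per input.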
#### Inclusive scan
| Operation mode | SM throughput (%) | CS warp occupancy (%) | # registers | Dispatch time (ms) |