docs/src/lectures/lecture_11/lecture.md: 193 additions & 3 deletions
@@ -403,7 +403,7 @@ c = similar(a)
@cuda threads=1024 blocks=cld(length(a), 1024) vadd!(c, a, b, length(a))
```

== KernelAbstractions with Metal

```julia
using Metal
# ... (the rest of this block is collapsed in the diff)
```
@@ -437,6 +437,11 @@ where
While the `vadd` example is nice, it is trivial and can be achieved by `map` as shown above. A simple operation that is anything but trivial to implement is *reduction*, since all elements have to be combined into a single value. It also lets us demonstrate why efficient kernels need to be written at three levels: warp, block, and grid. The exposition below is based on the [JuliaCon tutorial on GPU programming](https://github.com/maleadt/juliacon21-gpu_workshop/blob/main/deep_dive/CUDA.ipynb).
and it is pretty terrible, because all the hard work is done by a single thread. The result of the kernel also differs from that of the `sum` operation. Why is that? Floating-point addition is not associative, so the result depends on the order of the arithmetic operations. This can be verified by computing the sum in the same order as the kernel does:
```julia
foldl(+, x, init=0f0)
```
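
The effect of the summation order can be reproduced on the CPU with a tiny, hand-picked example. The following is only an illustrative sketch; the vector `x` here is an assumption chosen so that rounding is easy to follow and is not the array used in the lecture.

```julia
# Float32 has a 24-bit significand, so around 1f8 the spacing between
# representable numbers is 8 and adding 1f0 to ±1f8 is rounded away.
x = Float32[1f8, -1f8, 1f0]

left_to_right = foldl(+, x)   # (1f8 + -1f8) + 1f0
right_to_left = foldr(+, x)   # 1f8 + (-1f8 + 1f0)

println(left_to_right)   # 1.0
println(right_to_left)   # 0.0
```

Both expressions add the same three numbers, yet the results differ, which is the same kind of discrepancy as between the single-threaded kernel and `sum`.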

For the sake of completeness, we benchmark the speed of the kernel for later comparison.

We can use **atomic** operations to mark that the reduction operation has to be performed exclusively. This has the advantage that other work can proceed while the data is being fetched, but it is still a very bad idea.

::: tabs

== CUDA

```julia
function reduce_atomic(op, a, b)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    # ... (the rest of the kernel is collapsed in the diff)
```
This solution is better than the single-threaded version, but still very poor.
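
The idea of an atomic reduction can also be tried without a GPU. The sketch below is a CPU analogue, not the CUDA kernel: it uses `Atomic` and `atomic_add!` from Julia's `Base.Threads`, and the function name `atomic_sum` is a hypothetical helper introduced only for this illustration.

```julia
using Base.Threads

# Every iteration updates the same accumulator; the atomic operation
# guarantees that concurrent updates do not overwrite each other.
function atomic_sum(x::Vector{Float32})
    acc = Atomic{Float32}(0f0)
    @threads for i in eachindex(x)
        atomic_add!(acc, x[i])
    end
    return acc[]
end

x = ones(Float32, 10_000)
println(atomic_sum(x))   # 10000.0
```

All threads contend for a single memory location, so the updates are effectively serialized. This is presumably the same reason the atomic GPU kernel remains slow.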
Let's take the problem seriously. If we want to use parallelism in reduction, we need to perform a parallel reduction as shown in the figure below[^2].
@@ -489,6 +556,10 @@ Let's take the problem seriously. If we want to use paralelism in reduction, we
The parallel reduction is tricky. **Let's assume that we are allowed to overwrite the first argument `a`**. This is a relatively safe assumption, since we can always create a copy of `a` before launching the kernel.
* The while loop iterates over the levels of the reduction, performing $$2^{\log_2(\textrm{blockDim}) - d + 1}$$ reductions.
* We need to synchronize threads by `sync_threads`, such that all reductions on the level below are finished.
* The output of the reduction will be stored in `a[1]`.
* We use `@cuprintln`, which allows us to print what is happening inside the thread execution.
* Notice how the number of threads doing some work decreases, which is unfortunately an inevitable consequence of the `reduce` operation.
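
The access pattern described in the list above can be traced on the CPU. The following is a sequential emulation under stated assumptions, not the GPU kernel itself: `tree_reduce!` is a hypothetical name, the inner `for` loop stands in for the threads of one block, and finishing that loop before doubling `d` plays the role of `sync_threads`.

```julia
# Sequential emulation of the in-block parallel reduction.
# Overwrites `a`; the result ends up in a[1].
function tree_reduce!(op, a::AbstractVector)
    elements = length(a)                  # plays the role of 2 * blockDim
    d = 1
    while d < elements
        active = 0
        for thread in 1:cld(elements, 2)  # on a GPU these run in parallel
            index = 2 * d * (thread - 1) + 1
            if index + d <= elements
                a[index] = op(a[index], a[index + d])
                active += 1
            end
        end
        println("d = ", d, ": ", active, " active threads")
        d *= 2
    end
    return a[1]
end

a = Float32.(1:8)
println(tree_reduce!(+, a))
# d = 1: 4 active threads
# d = 2: 2 active threads
# d = 4: 1 active threads
# 36.0
```

The number of active threads halves at each level, which matches the last bullet above.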
To extend the above to multiple blocks, we need to add a reduction over blocks. The idea is to execute the above loop for each block independently and then, at the end, let the first thread of each block do the reduction over blocks, as

::: tabs

== CUDA

```julia
function reduce_grid_atomic(op, a, b)
    elements = 2 * blockDim().x
@@ -561,7 +677,50 @@ CUDA.@allowscalar cb[]
sum(x)
```

== KernelAbstractions with Metal

```julia
@kernel function reduce_grid_atomic(a, b)
    block_dim = prod(@groupsize())
    elements = 2 * block_dim
    offset = 2 * (@index(Group) - 1) * block_dim
    thread = @index(Local)

    # parallel reduction of values within the single block
    d = 1
    while d < elements
        @synchronize()
        index = 2 * d * (thread - 1) + 1
        if index <= elements && index + d + offset <= length(a)
            # ... (the rest of the kernel is collapsed in the diff)
```

Recall that each block is executed on a single SM, and each SM is equipped with its own local memory. So far, we have been doing all computations in the global memory, which is slow. So how about copying everything to the local memory first and then performing the reduction there? This would also have the benefit of not modifying the original arrays.

::: tabs

== CUDA

```julia
function reduce_grid_localmem(op, a::AbstractArray{T}, b) where {T}
    elements = 2 * blockDim().x
@@ -578,8 +737,7 @@ function reduce_grid_localmem(op, a::AbstractArray{T}, b) where {T}
        sync_threads()
        index = 2 * d * (thread - 1) + 1
        @inbounds if index <= elements && index + d + offset <= length(a)
    # ... (KernelAbstractions version; its earlier lines are collapsed in the diff)

    # parallel reduction of values within the single block
    d = 1
    while d < elements
        index = 2 * d * (thread - 1) + 1
        if index + d <= elements
            @inbounds shmem[index] += shmem[index+d]
        end
        d *= 2
        @synchronize()
    end

    # atomic reduction of this block's value to the global accumulator
    if thread == 1
        Atomix.@atomic b[] += shmem[1]
    end
end
```
The performance improvement is negligible, but that's because we have a relatively new GPU with lots of global memory bandwidth. On older or lower-end GPUs, using shared memory would be valuable. But at least we are no longer modifying the original array.
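
The structure of the local-memory approach (copy a block's slice, reduce the copy, then fold the block's result into a global accumulator) can be checked on the CPU. This is a sequential sketch under stated assumptions rather than either kernel above: `blockwise_reduce` and its `block_dim` keyword are hypothetical, the copied slice `shmem` stands in for shared memory, and the plain update of `acc` stands in for the atomic update of `b[]`.

```julia
# Sequential emulation: one outer iteration per "block".
function blockwise_reduce(op, a::AbstractVector{T}, init::T; block_dim::Int = 4) where {T}
    elements = 2 * block_dim
    acc = init
    for offset in 0:elements:length(a)-1
        n = min(elements, length(a) - offset)
        shmem = a[offset+1:offset+n]      # slicing copies, so `a` stays untouched
        d = 1
        while d < n
            for thread in 1:block_dim     # on a GPU these run in parallel
                index = 2 * d * (thread - 1) + 1
                if index + d <= n
                    shmem[index] = op(shmem[index], shmem[index+d])
                end
            end
            d *= 2
        end
        acc = op(acc, shmem[1])           # stands in for the atomic update of b[]
    end
    return acc
end

a = Float32.(1:20)
println(blockwise_reduce(+, a, 0f0))   # 210.0
println(a == Float32.(1:20))           # true, the input was not modified
```

With 20 elements and `block_dim = 4`, the emulation processes three blocks of 8, 8, and 4 elements, so the last block also exercises the bounds check.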
If we inspect the above kernel in the profiler, we can read that it uses 32 registers per thread. But if the SM has 16384 registers, then a block of 1024 threads needs more registers than are available, so the threads have to share them, which might lead to poor utilization. Changing the block size to 512 improves the throughput a bit, as can be seen below.
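
The register budget can be checked with a few lines of arithmetic. The numbers 32 and 16384 are taken from the paragraph above; `registers_needed` is a hypothetical helper used only for this illustration.

```julia
# Registers required by one block if every thread uses `per_thread` registers.
registers_needed(threads; per_thread = 32) = threads * per_thread

registers_per_sm = 16_384

for threads in (1024, 512)
    needed = registers_needed(threads)
    println(threads, " threads need ", needed, " registers, fits: ", needed <= registers_per_sm)
end
# 1024 threads need 32768 registers, fits: false
# 512 threads need 16384 registers, fits: true
```

A block of 1024 threads asks for twice the registers the SM has, while a block of 512 threads fits exactly.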
@@ -753,6 +942,7 @@ When we wish to launch kernel using `@cuda (config...) function(args...)`, the j
* [Using CUDA Warp-Level Primitives](https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/)
* https://juliagpu.org/post/2020-11-05-oneapi_0.1/
* https://www.youtube.com/watch?v=aKRv-W9Eg8g
* [Kernels without borders: Parallel programming with KernelAbstractions.jl, Tim Besard, 2015](https://www.youtube.com/watch?v=F4S6LpLPO7A&list=PLP8iPy9hna6TJMLEiZZiWAXlyGtOyJSL7&index=21)

[^bpf]: https://ebpf.io/
[^bessard18]: Besard, Tim, Christophe Foket, and Bjorn De Sutter. "Effective extensible programming: unleashing Julia on GPUs." IEEE Transactions on Parallel and Distributed Systems 30.4 (2018): 827-841.