Skip to content

Commit c2ab77e

Browse files
author
pevnak
committed
finished with kernel abstractions
1 parent 725ca86 commit c2ab77e

1 file changed

Lines changed: 193 additions & 3 deletions

File tree

docs/src/lectures/lecture_11/lecture.md

Lines changed: 193 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -403,7 +403,7 @@ c = similar(a)
403403
@cuda threads=1024 blocks=cld(length(a), 1024) vadd!(c, a, b, length(a))
404404
```
405405

406-
== KernelAbstractions
406+
== KernelAbstractions with Metal
407407

408408
```julia
409409
using Metal
@@ -437,6 +437,11 @@ where
437437
While the `vadd` example is nice, it is trivial and can be achieved by `map` as shown above. A simple operation that is everything but trivial to implement is *reduction*, since it ends up in a single operation. It also allows to demonstrate, why efficient kernels needs to be written at three levels: warp, block, and grid. The exposition below is based on [JuliaCon tutorial on GPU programming](https://github.com/maleadt/juliacon21-gpu_workshop/blob/main/deep_dive/CUDA.ipynb).
438438

439439
The first naive implementation might looks like
440+
441+
::: tabs
442+
443+
== Cuda
444+
440445
```julia
441446
function reduce_singlethread(op, a, b)
442447
for i in 1:length(a)
@@ -452,16 +457,53 @@ cb = CUDA.zeros(1)
452457
CUDA.@allowscalar cb[]
453458
sum(x)
454459
```
460+
461+
== KernelAbstractions with Metal
462+
463+
```julia
464+
@kernel function reduce_singlethread(op, a, b)
465+
for i in 1:length(a)
466+
@inbounds b[] = op(b[], a[i])
467+
end
468+
end
469+
470+
x = rand(Float32, 1024, 1024)
471+
cx = MtlArray(x)
472+
cb = MtlArray([0f0])
473+
backend = KA.get_backend(cx)
474+
reduce_singlethread(backend, 64)(+, cx, cb, ndrange=(1,))
475+
Metal.GPUArraysCore.@allowscalar cb[]
476+
sum(x)
477+
```
478+
479+
:::
480+
455481
and it is pretty terrible, because all the hard work is done by a single thread. The result of the kernel is different from that of `sum` operation. Why is that? This discrepancy is caused by the order of the arithmetic operations, which can be verified by computing the sum as in the kernel as
456482
```julia
457483
foldl(+, x, init=0f0)
458484
```
459485
For the sake of completness, we benchmark the speed of the kernel for comparison later on
486+
487+
::: tabs
488+
489+
== Cuda
490+
460491
```julia
461492
@benchmark CUDA.@sync @cuda threads=1 reduce_singlethread(+, $(CUDA.rand(1024,1024)), $(CUDA.zeros(1)))
462493
```
463494

495+
== KernelAbstractions with Metal
496+
497+
```julia
498+
@benchmark Metal.@sync reduce_singlethread(backend, 64)(+, cx, cb, ndrange=(1,))
499+
```
500+
464501
We can use **atomic** operations to mark that the reduction operation has to be performed exclusively. This have the advantage that we can do some operation while fetching the data, but it is still a very bad idea.
502+
503+
::: tabs
504+
505+
== Cuda
506+
465507
```julia
466508
function reduce_atomic(op, a, b)
467509
i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
@@ -480,6 +522,31 @@ sum(x)
480522

481523
@benchmark CUDA.@sync @cuda threads=1024 blocks=1024 reduce_atomic(+, $(CUDA.rand(1024,1024)), $(CUDA.zeros(1)))
482524
```
525+
526+
== KernelAbstractions with Metal
527+
528+
```julia
529+
using Atomix
530+
531+
@kernel function reduce_atomic(a, b)
532+
i = @index(Global)
533+
Atomix.@atomic b[] += a[i]
534+
end
535+
536+
x = rand(Float32, 1024, 1024);
537+
cx = MtlArray(x);
538+
backend = KA.get_backend(cx);
539+
# cb = zeros(backend, Float32, 1)
540+
cb = MtlArray([0f0]);
541+
reduce_atomic(backend, 64)(+, cx, cb, ndrange=size(cx))
542+
Metal.GPUArraysCore.@allowscalar cb[]
543+
sum(x)
544+
545+
@benchmark Metal.@sync reduce_atomic(backend, 64)(cx, cb, ndrange=size(cx))
546+
```
547+
548+
:::
549+
483550
This solution is better then the single-threadded version, but still very poor.
484551

485552
Let's take the problem seriously. If we want to use paralelism in reduction, we need to perform parallel reduction as shown in the figure below[^2]
@@ -489,6 +556,10 @@ Let's take the problem seriously. If we want to use paralelism in reduction, we
489556

490557
The parallel reduction is tricky. **Let's assume that we are allowed to overwrite the first argument a**. This is relatively safe assumption, since we can always create a copy of `a` before launching the kernel.
491558

559+
::: tabs
560+
561+
== CUDA
562+
492563
```julia
493564
function reduce_block(op, a, b)
494565
elements = 2* blockDim().x
@@ -520,13 +591,58 @@ b = CuArray([0]);
520591
CUDA.@allowscalar b[]
521592

522593
```
594+
595+
== KernelAbstractions with Metal
596+
597+
```julia
598+
using Metal, BenchmarkTools
599+
using KernelAbstractions
600+
import KernelAbstractions as KA
601+
using Atomix
602+
603+
604+
@kernel function reduce_block(a, b)
605+
elements = 2 * prod(@groupsize())
606+
thread = @index(Local)
607+
608+
# parallel reduction of values in a block
609+
d = 1
610+
while d < elements
611+
index = 2 * d * (thread-1) + 1
612+
if index <= elements && index+d <= length(a)
613+
KA.@print "thread $thread: a[$index] + a[$(index+d)] = $(a[index]) + $(a[index+d]) = $(a[index] + a[index+d]))"
614+
a[index] += a[index+d]
615+
end
616+
d *= 2
617+
thread == 1 && KA.@print "\n"
618+
@synchronize
619+
end
620+
621+
if thread == 1
622+
b[] = a[1]
623+
end
624+
end
625+
626+
a = MtlArray(1:16);
627+
b = MtlArray([0]);
628+
backend = KA.get_backend(a);
629+
reduce_block(backend, 64)(a, b, ndrange = size(a));
630+
Metal.GPUArraysCore.@allowscalar b[]
631+
```
632+
633+
:::
523634
* The while loop iterates over the levels of the reduction, performing $$2^{\log(\textrm{blockDim}) - d + 1})$$ reductions.
524635
* We need to sychronize threads by `sync_threads`, such that all reductions on the level below are finished
525636
* The output of the reduction will be stored in `a[1]`
526637
* We use `@cuprintln` which allows us to print what is happening inside the thread execution.
527638
* Notice how the number of threads doing some work decreases, which unfortunately inevitable consequence of `reduce` operation.
528639

529640
To extend the above for multiple blocks, we need to add reduction over blocks. The idea would be to execute the above loop for each block independently, and then, on the end, the first thread would do the reduction over blocks, as
641+
642+
::: tabs
643+
644+
== Cuda
645+
530646
```julia
531647
function reduce_grid_atomic(op, a, b)
532648
elements = 2*blockDim().x
@@ -561,7 +677,50 @@ CUDA.@allowscalar cb[]
561677
sum(x)
562678
```
563679

680+
== KernelAbstractions with Metal
681+
682+
```julia
683+
684+
@kernel function reduce_grid_atomic(a, b)
685+
block_dim = prod(@groupsize())
686+
elements = 2*block_dim
687+
offset = 2*(@index(Group) - 1) * block_dim
688+
thread = @index(Local)
689+
690+
# parallel reduction of values within the single block
691+
d = 1
692+
while d < elements
693+
@synchronize()
694+
index = 2 * d * (thread-1) + 1
695+
if index <= elements && index+d+offset <= length(a)
696+
index += offset
697+
@inbounds a[index] += a[index+d]
698+
end
699+
d *= 2
700+
end
701+
702+
# atomic reduction of this block's value
703+
if thread == 1
704+
Atomix.@atomic b[] += a[offset + 1]
705+
end
706+
end
707+
708+
x = rand(Float32, 1024, 1024)
709+
cx = MtlArray(x)
710+
cb = MtlArray([0f0])
711+
reduce_grid_atomic(backend, 64)(cx, cb, ndrange = size(cx));
712+
Metal.GPUArraysCore.@allowscalar cb[]
713+
sum(x)
714+
```
715+
716+
:::
717+
564718
Recall that each block is executed on a separate SM, each equipped with the local memory. So far, we have been doing all computations in the global memory, which is slow. So how about to copy everything to the local memory and then perform the reduction. This would also have the benefit of not modifying the original arrays.
719+
720+
::: tabs
721+
722+
== CUDA
723+
565724
```julia
566725
function reduce_grid_localmem(op, a::AbstractArray{T}, b) where {T}
567726
elements = 2*blockDim().x
@@ -578,8 +737,7 @@ function reduce_grid_localmem(op, a::AbstractArray{T}, b) where {T}
578737
sync_threads()
579738
index = 2 * d * (thread-1) + 1
580739
@inbounds if index <= elements && index+d+offset <= length(a)
581-
index += offset
582-
a[index] = op(a[index], a[index+d])
740+
shared[index] = op(shared[index], shared[index+d])
583741
end
584742
d *= 2
585743
end
@@ -600,6 +758,37 @@ sum(x)
600758

601759
@benchmark CUDA.@sync @cuda threads=1024 blocks=512 reduce_grid_localmem(+, $(CUDA.rand(1024,1024)), $(CUDA.zeros(1)))
602760
```
761+
762+
== Metal
763+
764+
```julia
765+
@kernel function reduce_grid_localmem(a, b)
766+
block_dim = prod(@groupsize())
767+
elements = 2*block_dim
768+
offset = 2*(@index(Group) - 1) * block_dim
769+
thread = @index(Local)
770+
771+
shmem = @localmem eltype(a) 2048
772+
@inbounds shmem[thread] = offset+thread length(a) ? a[offset+thread] : 0
773+
@inbounds shmem[thread+block_dim] = offset+thread+block_dim length(a) ? a[offset+thread+block_dim] : 0
774+
@synchronize()
775+
# parallel reduction of values within the single block
776+
d = 1
777+
while d < elements
778+
index = 2 * d * (thread-1) + 1
779+
if index + d <= elements
780+
@inbounds shmem[index] += shmem[index+d]
781+
end
782+
d *= 2
783+
@synchronize()
784+
end
785+
786+
# atomic reduction of this block's value to the global accumulator
787+
if thread == 1
788+
Atomix.@atomic b[] += shmem[1]
789+
end
790+
end
791+
```
603792
The performance improvement is negligible, but that's because we have a relatively new GPU with lots of global memory bandwith. On older or lower-end GPUs, using shared memory would be valuable. But at least, we are not modifying the original array.
604793

605794
If we inspect the above kernel in profiler, we can read that it uses 32 registers per thread. But if the SM has 16384 registers, then block of size 1024 will have to share registers, which might lead to poor utilization. Changing the blocksize to 512 improves the throughput a bit as can be seen from below
@@ -753,6 +942,7 @@ When we wish to launch kernel using `@cuda (config...) function(args...)`, the j
753942
* [Using CUDA Warp-Level Primitives](https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/)
754943
* https://juliagpu.org/post/2020-11-05-oneapi_0.1/
755944
* https://www.youtube.com/watch?v=aKRv-W9Eg8g
945+
* [Kernels without borders: Parallel programming with KernelAbstractions.jl, Tim Bessard, 2015](https://www.youtube.com/watch?v=F4S6LpLPO7A&list=PLP8iPy9hna6TJMLEiZZiWAXlyGtOyJSL7&index=21)
756946

757947
[^bpf]: https://ebpf.io/
758948
[^bessard18]: Besard, Tim, Christophe Foket, and Bjorn De Sutter. "Effective extensible programming: unleashing Julia on GPUs." IEEE Transactions on Parallel and Distributed Systems 30.4 (2018): 827-841.

0 commit comments

Comments
 (0)