Commit b44acb5

author
pevnak
committed
general part with Metal
1 parent 444b94c commit b44acb5

1 file changed

Lines changed: 47 additions & 1 deletion

File tree

docs/src/lectures/lecture_11/lecture.md

@@ -225,7 +225,6 @@ img = juliaset_pixel.(cuis, cujs, n);
 
 === Metal
 
-
 ```julia
 using Metal
 using BenchmarkTools
@@ -290,6 +289,11 @@ is about `315` μs, which still 160x faster.
 In the output of the profiler we see that launching the kernel itself causes a lot of overhead, while the execution itself is relatively fast.
 
 While Julia's JAoT greatly enhances the power of prepared kernels, you might quickly run into a case where you are able to perform the operation on the GPU, but it is very slow. Sometimes it might be faster to move the array to the CPU, perform the operation there, and move it back to the GPU. Although this sounds like a pretty bad idea, it actually works very well, as shown below.
+
+:::tabs
+
+== CUDA
+
 ```julia
 using Mill
 using Random
@@ -326,6 +330,48 @@ naive(cx, bags, cz);
 @btime CUDA.@sync CuArray(builtin(Array(cx), bags, Array(cz)));
 ```
 
+== Metal
+
+
+```julia
+using Mill
+using Random
+using CUDA
+using BenchmarkTools
+n = vcat(rand(1:10,1000), rand(11:100, 100), rand(101:1000,10))
+x = randn(Float32, 128, sum(n))
+z = zeros(Float32, 128, 1)
+bags = Mill.length2bags(n)
+
+builtin(x, bags, z) = Mill.segmented_sum_forw(x, vec(z), bags, nothing)
+
+function naive(x, bags, z)
+    o = similar(x, size(x,1), length(bags))
+    foreach(enumerate(bags)) do (i,b)
+        if isempty(b)
+            o[:,i] .= z
+        else
+            @inbounds o[:,i] = sum(@view(x[:,b]), dims = 2)
+        end
+    end
+    o
+end
+
+builtin(x, bags, z) ≈ naive(x, bags, z)
+@btime builtin(x, bags, z);
+@btime naive(x, bags, z);
+
+
+cx = CuArray(x);
+cz = CuArray(z);
+naive(cx, bags, cz);
+@btime CUDA.@sync naive(cx, bags, cz);
+@btime CUDA.@sync CuArray(builtin(Array(cx), bags, Array(cz)));
+```
+
+:::
+
+
 ## [Writing own CUDA kernels](@id gpu_lecture_yes_kernel)
 Before diving into details, let's recall some basics from the above HW section:
 * In the CUDA programming model, you usually write *kernels*, which represent the *body* of a for loop.
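
The "kernel as loop body" idea can be illustrated on the CPU with the segmented sum from the benchmark above. The sketch below is only an illustration under stated assumptions: `segsum_kernel!` and `launch_on_cpu!` are hypothetical names, `bags` is a plain vector of index ranges rather than Mill's bag type, and an ordinary loop stands in for the GPU launch, so no CUDA or Metal API is involved.

```julia
# Hypothetical "kernel": the work for ONE (row, bag) pair, i.e. the loop body.
function segsum_kernel!(o, x, bags, z, r, i)
    b = bags[i]
    if isempty(b)
        o[r, i] = z[r]              # empty bag: use the default value
    else
        acc = zero(eltype(o))
        for j in b                  # sum row r over the columns of bag i
            acc += x[r, j]
        end
        o[r, i] = acc
    end
    return nothing
end

# CPU stand-in for a kernel launch (an assumption, not a GPU API):
# on a GPU, each (r, i) pair would be handled by its own thread.
function launch_on_cpu!(kernel!, o, args...)
    for i in axes(o, 2), r in axes(o, 1)
        kernel!(o, args..., r, i)
    end
    return o
end

x = Float32[1 2 3 4 5; 10 20 30 40 50]
bags = [1:2, 3:2, 3:5]              # the second bag is empty
z = Float32[-1, -1]
o = launch_on_cpu!(segsum_kernel!, zeros(Float32, 2, length(bags)), x, bags, z)
```

Here the two nested loops in `launch_on_cpu!` play the role of the thread grid, and everything the GPU would execute per thread lives in `segsum_kernel!`.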
