|
1 | | -# Writing kernels |
| 1 | +# Writing kernels |
2 | 2 |
|
3 | | -These kernel language constructs are intended to be used as part |
4 | | -of [`@kernel`](@ref) functions and not valid outside that context. |
| 3 | +These kernel language constructs are intended to be used inside [`@kernel`](@ref) functions. |
| 4 | +They are not valid in ordinary Julia code (except when using experimental `@kernel cpu=false`). |
5 | 5 |
|
6 | 6 | ## Constant arguments |
7 | 7 |
|
8 | | -Kernel functions allow for input arguments to be marked with the |
9 | | -[`@Const`](@ref) macro. It informs the compiler that the memory |
10 | | -accessed through that marked input argument, will not be written |
11 | | -to as part of the kernel. This has the implication that input arguments |
12 | | -are **not** allowed to alias each other. If you are used to CUDA C this |
13 | | -is similar to `const restrict`. |
| 8 | +Kernel functions allow input arguments to be marked with the [`@Const`](@ref) macro. It informs |
| 9 | +the compiler that the memory accessed through that argument will not be written to as part of |
| 10 | +the kernel, and that it does not alias any other memory in the kernel. If you are used to CUDA C, |
| 11 | +this is similar to `const restrict`. |
| 12 | + |
| 13 | +```julia |
| 14 | +@kernel function saxpy!(a, @Const(X), Y) |
| 15 | + I = @index(Global) |
| 16 | + @inbounds Y[I] = a * X[I] + Y[I] |
| 17 | +end |
| 18 | +``` |
14 | 19 |
|
15 | 20 | ## Indexing |
16 | 21 |
|
17 | | -There are several [`@index`](@ref) variants. |
| 22 | +The [`@index`](@ref) macro returns the index of the current work item. Choose a **granularity** |
| 23 | +and an optional **kind**: |
| 24 | + |
| 25 | +| Granularity | Meaning | |
| 26 | +|-------------|---------| |
| 27 | +| `Global` | Index over the full `ndrange` (use for global memory) | |
| 28 | +| `Group` | Index of the current workgroup | |
| 29 | +| `Local` | Index within the current workgroup | |
| 30 | + |
| 31 | +| Kind | Result type | |
| 32 | +|------|-------------| |
| 33 | +| `Linear` (default) | `Int` linear index | |
| 34 | +| `Cartesian` | `CartesianIndex` for multi-dimensional `ndrange` | |
| 35 | +| `NTuple` | `NTuple` of `Int` indices | |
| 36 | + |
| 37 | +```julia |
| 38 | +@kernel function fill_diagonal!(A, val) |
| 39 | + I = @index(Global, Cartesian) |
| 40 | + if I[1] == I[2] |
| 41 | + @inbounds A[I] = val |
| 42 | + end |
| 43 | +end |
| 44 | + |
| 45 | +@kernel function linear_example(A) |
| 46 | + I = @index(Global, Linear) # 1, 2, 3, ... |
| 47 | + g = @index(Group, Linear) # workgroup id |
| 48 | + l = @index(Local, Linear) # lane within workgroup |
| 49 | + @inbounds A[I] = g + l |
| 50 | +end |
| 51 | +``` |
| 52 | + |
| 53 | +Inside a kernel, [`@groupsize`](@ref) and [`@ndrange`](@ref) query the launch configuration: |
| 54 | + |
| 55 | +```julia |
| 56 | +@kernel function scale!(A, factor) |
| 57 | + N = prod(@groupsize()) |
| 58 | + I = @index(Global, Linear) |
| 59 | + lmem = @localmem Float32 (N,) |
| 60 | + i = @index(Local, Linear) |
| 61 | + lmem[i] = factor |
| 62 | + @synchronize() |
| 63 | + @inbounds A[I] = lmem[i] * A[I]
| 64 | +end |
| 65 | +``` |
| 66 | + |
| 67 | +## Local memory, synchronization, and private memory |
| 68 | + |
| 69 | +[`@localmem`](@ref) declares storage shared by all work items in a workgroup. Reads and writes |
| 70 | +must be separated by [`@synchronize`](@ref) if they are performed by different work items: |
| 71 | + |
| 72 | +```julia |
| 73 | +@kernel function reverse_block!(A) |
| 74 | + I = @index(Global, Linear) |
| 75 | + i = @index(Local, Linear) |
| 76 | + N = prod(@groupsize()) |
| 77 | + buf = @localmem Int (N,) |
| 78 | + buf[i] = i |
| 79 | + @synchronize() |
| 80 | + @inbounds A[I] = buf[N - i + 1] |
| 81 | +end |
| 82 | +``` |
| 83 | + |
| 84 | +[`@private`](@ref) and [`@uniform`](@ref) are deprecated for KernelAbstractions 1.0. Prefer |
| 85 | +`MArray` for per-lane scratch storage that does not need to survive across `@synchronize`. |
| 86 | + |
| 87 | +## Launching kernels |
| 88 | + |
| 89 | +Construct a kernel by calling the kernel function with a backend and, optionally, static sizes,
| 90 | +then launch it with the kernel arguments, passing `ndrange` unless it was fixed at construction:
| 91 | + |
| 92 | +```julia |
| 93 | +# dynamic sizes — supply ndrange (and optionally workgroupsize) at launch |
| 94 | +kernel = my_kernel(backend) |
| 95 | +kernel(A, ndrange=size(A)) |
18 | 96 |
|
19 | | -## Local memory, variable lifetime and private memory |
| 97 | +# static workgroup size |
| 98 | +kernel = my_kernel(backend, 256) |
| 99 | +kernel(A, ndrange=size(A)) |
20 | 100 |
|
21 | | -[`@localmem`](@ref), [`@synchronize`](@ref), [`@private`](@ref) |
| 101 | +# static workgroup size and ndrange — fewer runtime checks, may reduce recompilation |
| 102 | +kernel = my_kernel(backend, 32, size(A)) |
| 103 | +kernel(A) |
| 104 | +``` |
22 | 105 |
|
23 | | -# Launching kernels |
| 106 | +On GPU backends, obtain the backend from an array with [`get_backend`](@ref) and always call |
| 107 | +[`synchronize`](@ref) before reading results on the host. See the [Quickstart](@ref) for a full
| 108 | +walkthrough and the Examples section of the manual for larger patterns.