
Commit 3b2a3ad

giordano and cursoragent committed
Improve public API docs and expand the manual.
Add missing docstrings for backends, kernel handles, and reflection macros; expand quickstart, kernels, and API pages with examples. Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent b179e7d commit 3b2a3ad

11 files changed

Lines changed: 391 additions & 75 deletions


.github/workflows/ci.yml

Lines changed: 11 additions & 7 deletions
@@ -159,6 +159,7 @@ jobs:
159159
- uses: codecov/codecov-action@v6
160160
with:
161161
files: lcov.info
162+
162163
docs:
163164
name: Documentation
164165
runs-on: ubuntu-latest
@@ -167,12 +168,14 @@ jobs:
167168
- uses: julia-actions/setup-julia@v3
168169
with:
169170
version: '1'
170-
- run: |
171-
julia --project=docs -e 'import Pkg; Pkg.develop(path=".")'
172-
julia --project=docs docs/make.jl
171+
- uses: julia-actions/cache@v3
172+
- name: "Build docs"
173+
run: |
174+
julia --project=docs --color=yes docs/make.jl
173175
env:
174176
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
175177
DOCUMENTER_KEY: ${{ secrets.DOCUMENTER_KEY }}
178+
176179
doctests:
177180
name: Doctests
178181
runs-on: ubuntu-latest
@@ -181,9 +184,10 @@ jobs:
181184
- uses: julia-actions/setup-julia@v3
182185
with:
183186
version: '1'
184-
- run: |
185-
julia --project=docs -e 'import Pkg; Pkg.develop(path=".")'
186-
julia --project=docs -e '
187+
- uses: julia-actions/cache@v3
188+
- name: "Run doctests"
189+
shell: julia --project=docs --color=yes {0}
190+
run: |
187191
using Documenter: doctest
188192
using KernelAbstractions
189-
doctest(KernelAbstractions; manual = true)'
193+
doctest(KernelAbstractions; manual = true)

docs/Project.toml

Lines changed: 4 additions & 0 deletions
@@ -1,5 +1,9 @@
11
[deps]
22
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
3+
KernelAbstractions = "63c18a36-062a-441e-b654-da1e3ab1ce7c"
34

45
[compat]
56
Documenter = "1"
7+
8+
[sources]
9+
KernelAbstractions = {path = ".."}

docs/make.jl

Lines changed: 0 additions & 2 deletions
@@ -1,5 +1,3 @@
1-
push!(Base.LOAD_PATH, dirname(@__DIR__))
2-
31
using KernelAbstractions
42
using Documenter
53

docs/src/api.md

Lines changed: 45 additions & 3 deletions
@@ -13,21 +13,63 @@
1313
@uniform
1414
@groupsize
1515
@ndrange
16-
synchronize
17-
allocate
1816
```
1917

2018
## Host language
2119

20+
### Backends and arrays
21+
2222
```@docs
23+
Backend
24+
GPU
25+
CPU
26+
POCLBackend
27+
get_backend
28+
KernelAbstractions.allocate
2329
KernelAbstractions.zeros
30+
KernelAbstractions.ones
31+
KernelAbstractions.copyto!
32+
KernelAbstractions.pagelock!
33+
KernelAbstractions.unsafe_free!
34+
KernelAbstractions.functional
2435
KernelAbstractions.supports_unified
36+
KernelAbstractions.supports_atomics
37+
KernelAbstractions.supports_float64
2538
```
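
As a rough illustration of how the functions listed under "Backends and arrays" fit together, the sketch below allocates arrays on a backend and copies between them. It assumes the `CPU()` backend is functional on the machine; the `KA` alias, the variable names, and the array sizes are arbitrary choices for this example, not part of the commit.

```julia
using KernelAbstractions
const KA = KernelAbstractions  # local alias, used only in this sketch

backend = CPU()                              # assumption: CPU backend is available
src = KA.ones(backend, Float32, 16)          # backend array filled with ones
dst = KA.zeros(backend, Float32, 16)         # backend array filled with zeros
scratch = KA.allocate(backend, Float32, 16)  # uninitialized storage

KA.copyto!(backend, dst, src)  # may be asynchronous, depending on the backend
synchronize(backend)           # wait before reading the result on the host

@show get_backend(dst)
@show all(dst .== 1)
```

On a GPU backend the same calls should apply with the backend object swapped, since `get_backend` recovers the backend from any array it owns.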
2639

27-
## Internal
40+
### Devices and execution
41+
42+
```@docs
43+
synchronize
44+
KernelAbstractions.device
45+
KernelAbstractions.ndevices
46+
KernelAbstractions.device!
47+
KernelAbstractions.priority!
48+
```
49+
50+
### Kernel handles
2851

2952
```@docs
3053
KernelAbstractions.Kernel
54+
KernelAbstractions.workgroupsize
55+
KernelAbstractions.ndrange
56+
KernelAbstractions.backend
57+
```
58+
59+
## Reflection
60+
61+
These macros help inspect the generated kernel code. LLVM reflection via [`@ka_code_llvm`](@ref) is
62+
currently only supported on the CPU backend.
63+
64+
```@docs
65+
@ka_code_typed
66+
@ka_code_llvm
67+
```
68+
69+
## Internal
70+
71+
```@docs
3172
KernelAbstractions.partition
3273
KernelAbstractions.@context
74+
KernelAbstractions.argconvert
3375
```

docs/src/examples/memcopy_static.md

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
1-
# Memcopy with static NDRange
1+
# [Memcopy with static NDRange](@id memcopy_static)
22

33
The first example simply copies memory from `B` to `A`. In contrast to the previous examples
44
it uses a fully static kernel configuration, specializing the kernel on the iteration range itself.

docs/src/implementations.md

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
1-
# Notes for backend implementations
1+
# [Notes for backend implementations](@id implementations_notes)
22

33
## Semantics of `KernelAbstractions.synchronize`
44

docs/src/kernels.md

Lines changed: 98 additions & 13 deletions
@@ -1,23 +1,108 @@
1-
# Writing kernels
1+
# Writing kernels
22

3-
These kernel language constructs are intended to be used as part
4-
of [`@kernel`](@ref) functions and not valid outside that context.
3+
These kernel language constructs are intended to be used inside [`@kernel`](@ref) functions.
4+
They are not valid in ordinary Julia code (except when using experimental `@kernel cpu=false`).
55

66
## Constant arguments
77

8-
Kernel functions allow for input arguments to be marked with the
9-
[`@Const`](@ref) macro. It informs the compiler that the memory
10-
accessed through that marked input argument, will not be written
11-
to as part of the kernel. This has the implication that input arguments
12-
are **not** allowed to alias each other. If you are used to CUDA C this
13-
is similar to `const restrict`.
8+
Kernel functions allow input arguments to be marked with the [`@Const`](@ref) macro. It informs
9+
the compiler that the memory accessed through that argument will not be written to as part of
10+
the kernel, and that it does not alias any other memory in the kernel. If you are used to CUDA C,
11+
this is similar to `const restrict`.
12+
13+
```julia
14+
@kernel function saxpy!(a, @Const(X), Y)
15+
I = @index(Global)
16+
@inbounds Y[I] = a * X[I] + Y[I]
17+
end
18+
```
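
A minimal host-side sketch of launching the `saxpy!` kernel is shown here for orientation. It assumes the `CPU()` backend, a workgroup size of 64, and arrays of length 1024, all of which are illustrative choices rather than anything prescribed by the manual.

```julia
using KernelAbstractions

@kernel function saxpy!(a, @Const(X), Y)
    I = @index(Global)
    @inbounds Y[I] = a * X[I] + Y[I]
end

backend = CPU()  # assumption: swap in a GPU backend and its array type as needed
X = KernelAbstractions.ones(backend, Float32, 1024)
Y = KernelAbstractions.ones(backend, Float32, 1024)

# X and Y are distinct arrays, so the no-aliasing promise made by @Const holds.
saxpy!(backend, 64)(2.0f0, X, Y, ndrange = length(Y))
synchronize(backend)  # wait for the kernel before reading Y on the host
```

Passing the same array as both `X` and `Y` would break the promise made by `@Const`, so the two arguments are kept separate here.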
1419

1520
## Indexing
1621

17-
There are several [`@index`](@ref) variants.
22+
The [`@index`](@ref) macro returns the index of the current work item. Choose a **granularity**
23+
and an optional **kind**:
24+
25+
| Granularity | Meaning |
26+
|-------------|---------|
27+
| `Global` | Index over the full `ndrange` (use for global memory) |
28+
| `Group` | Index of the current workgroup |
29+
| `Local` | Index within the current workgroup |
30+
31+
| Kind | Result type |
32+
|------|-------------|
33+
| `Linear` (default) | `Int` linear index |
34+
| `Cartesian` | `CartesianIndex` for multi-dimensional `ndrange` |
35+
| `NTuple` | `NTuple` of `Int` indices |
36+
37+
```julia
38+
@kernel function fill_diagonal!(A, val)
39+
I = @index(Global, Cartesian)
40+
if I[1] == I[2]
41+
@inbounds A[I] = val
42+
end
43+
end
44+
45+
@kernel function linear_example(A)
46+
I = @index(Global, Linear) # 1, 2, 3, ...
47+
g = @index(Group, Linear) # workgroup id
48+
l = @index(Local, Linear) # lane within workgroup
49+
@inbounds A[I] = g + l
50+
end
51+
```
52+
53+
Inside a kernel, [`@groupsize`](@ref) and [`@ndrange`](@ref) query the launch configuration:
54+
55+
```julia
56+
@kernel function scale!(A, factor)
57+
N = prod(@groupsize())
58+
I = @index(Global, Linear)
59+
lmem = @localmem Float32 (N,)
60+
i = @index(Local, Linear)
61+
lmem[i] = factor
62+
@synchronize()
63+
@inbounds A[I] = lmem[i]
64+
end
65+
```
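
The example above only exercises `@groupsize`. As a small sketch of both queries side by side, the hypothetical kernel `record_config!` below writes the workgroup size and the total `ndrange` into two arrays; it assumes the `CPU()` backend and a one-dimensional launch of 256 work items in groups of 64.

```julia
using KernelAbstractions

# Hypothetical kernel, named for this sketch only.
@kernel function record_config!(groupsizes, ndranges)
    I = @index(Global, Linear)
    @inbounds groupsizes[I] = prod(@groupsize())  # work items per workgroup
    @inbounds ndranges[I] = prod(@ndrange())      # total number of work items
end

backend = CPU()
groupsizes = KernelAbstractions.zeros(backend, Int, 256)
ndranges = KernelAbstractions.zeros(backend, Int, 256)

record_config!(backend, 64)(groupsizes, ndranges, ndrange = 256)
synchronize(backend)
```

Both macros return tuples, one entry per dimension, which is why `prod` is used to collapse them to a single count.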
66+
67+
## Local memory, synchronization, and private memory
68+
69+
[`@localmem`](@ref) declares storage shared by all work items in a workgroup. Reads and writes
70+
must be separated by [`@synchronize`](@ref) if they are performed by different work items:
71+
72+
```julia
73+
@kernel function reverse_block!(A)
74+
I = @index(Global, Linear)
75+
i = @index(Local, Linear)
76+
N = prod(@groupsize())
77+
buf = @localmem Int (N,)
78+
buf[i] = i
79+
@synchronize()
80+
@inbounds A[I] = buf[N - i + 1]
81+
end
82+
```
83+
84+
[`@private`](@ref) and [`@uniform`](@ref) are deprecated for KernelAbstractions 1.0. Prefer
85+
`MArray` (from StaticArrays.jl) for per-lane scratch storage that does not need to survive across `@synchronize`.
86+
87+
## Launching kernels
88+
89+
Construct a kernel by calling the kernel function with a backend and, optionally, static sizes, then
90+
launch it with `ndrange`:
91+
92+
```julia
93+
# dynamic sizes — supply ndrange (and optionally workgroupsize) at launch
94+
kernel = my_kernel(backend)
95+
kernel(A, ndrange=size(A))
1896

19-
## Local memory, variable lifetime and private memory
97+
# static workgroup size
98+
kernel = my_kernel(backend, 256)
99+
kernel(A, ndrange=size(A))
20100

21-
[`@localmem`](@ref), [`@synchronize`](@ref), [`@private`](@ref)
101+
# static workgroup size and ndrange — fewer runtime checks, may reduce recompilation
102+
kernel = my_kernel(backend, 32, size(A))
103+
kernel(A)
104+
```
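
Since `my_kernel` is a placeholder, here is a self-contained sketch of the same three launch styles. It assumes a `mul2_kernel` defined as below (intended to mirror the Quickstart kernel of the same name), the `CPU()` backend, and a one-dimensional array of length 1024.

```julia
using KernelAbstractions

# Assumed definition, mirroring the doubling kernel used in the Quickstart.
@kernel function mul2_kernel(A)
    I = @index(Global)
    @inbounds A[I] = 2 * A[I]
end

backend = CPU()
A = KernelAbstractions.ones(backend, Float32, 1024)

# 1. Dynamic sizes: ndrange is supplied at launch.
mul2_kernel(backend)(A, ndrange = size(A))

# 2. Static workgroup size, dynamic ndrange.
mul2_kernel(backend, 256)(A, ndrange = size(A))

# 3. Static workgroup size and static ndrange: no keyword arguments at launch.
mul2_kernel(backend, 32, size(A))(A)

synchronize(backend)  # all three launches have completed after this call
```

Each launch doubles `A` in place, so the three calls compose. Note that the third form specializes the kernel on `size(A)`, so arrays of varying size would trigger recompilation.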
22105

23-
# Launching kernels
106+
On GPU backends, obtain the backend from an array with [`get_backend`](@ref) and always call
107+
[`synchronize`](@ref) before reading results on the host. See the [Quickstart](@ref) for a full walkthrough and the Examples section of the manual
108+
for larger patterns.

docs/src/quickstart.md

Lines changed: 49 additions & 2 deletions
@@ -43,6 +43,21 @@ all(A .== 2.0)
4343
All kernels are launched asynchronously.
4444
The [`synchronize`](@ref) call blocks the *host* until the kernel has completed on the backend.
4545

46+
### Static workgroup size and `ndrange`
47+
48+
When the workgroup size and `ndrange` are known ahead of time, pass them to the kernel
49+
constructor to enable additional compile-time optimizations and avoid supplying them at
50+
every launch:
51+
52+
```julia
53+
# workgroup size 32, ndrange (128, 128) — fixed for this kernel object
54+
kernel = mul2_kernel(dev, 32, size(A))
55+
kernel(A) # ndrange inferred from construction
56+
synchronize(dev)
57+
```
58+
59+
See also [Memcopy with static NDRange](@ref memcopy_static).
60+
4661
## Launching kernel on the backend
4762

4863
To launch the kernel on a GPU backend, i.e. one for which `isa(backend, KA.GPU)` holds (e.g., `CUDABackend()`, `ROCBackend()`, `oneAPIBackend()`, `MetalBackend()`), we generate the kernel
@@ -108,6 +123,38 @@ function mymul(A, B)
108123
end
109124
```
110125

111-
## Using task programming to launch kernels in parallel.
126+
## Using task programming to launch kernels in parallel
127+
128+
As shown in the [Synchronization](@ref) section above, multiple kernels can be enqueued on the
129+
same backend before a single [`synchronize`](@ref) call. The same pattern extends to Julia's
130+
task-based parallelism: launch kernels from [`Threads.@spawn`](https://docs.julialang.org/en/stable/base/multi-threading/#Base.Threads.@spawn)
131+
tasks when you want to overlap kernel execution with other asynchronous host work.
132+
133+
On GPU backends, [`synchronize`](@ref) is **cooperative** — it yields to the Julia scheduler
134+
rather than blocking inside a driver call, so other tasks can make progress while a kernel runs.
135+
See [Notes for backend implementations](@ref implementations_notes) for the contract backend authors must follow.
136+
137+
```julia
138+
function cooperative_wait(task::Task)
139+
while !Base.istaskdone(task)
140+
yield()
141+
end
142+
return wait(task)
143+
end
144+
145+
function exchange_and_compute!(backend, A, B)
146+
recv = Threads.@spawn begin
147+
mul2_kernel(backend, 64)(A, ndrange=length(A))
148+
synchronize(backend) # cooperative on GPU backends
149+
end
150+
send = Threads.@spawn begin
151+
mul2_kernel(backend, 64)(B, ndrange=length(B))
152+
synchronize(backend)
153+
end
154+
cooperative_wait(recv)
155+
cooperative_wait(send)
156+
end
157+
```
112158

113-
TODO
159+
A full MPI example that overlaps communication with device copies is in
160+
[`examples/mpi.jl`](https://github.com/JuliaGPU/KernelAbstractions.jl/blob/master/examples/mpi.jl).
