Improve public API docs and expand the manual.

giordano · cursoragent · giordano · commit 3b2a3add9ae6 · 2026-06-09T23:45:32.000+01:00
Add missing docstrings for backends, kernel handles, and reflection macros;
expand quickstart, kernels, and API pages with examples.

Co-authored-by: Cursor <cursoragent@cursor.com>
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -159,6 +159,7 @@ jobs:
       - uses: codecov/codecov-action@v6
         with:
           files: lcov.info
+
   docs:
     name: Documentation
     runs-on: ubuntu-latest
@@ -167,12 +168,14 @@ jobs:
       - uses: julia-actions/setup-julia@v3
         with:
           version: '1'
-      - run: |
-          julia --project=docs -e 'import Pkg; Pkg.develop(path=".")'
-          julia --project=docs docs/make.jl
+      - uses: julia-actions/cache@v3
+      - name: "Build docs"
+        run: |
+          julia --project=docs --color=yes docs/make.jl
         env:
           GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
           DOCUMENTER_KEY: ${{ secrets.DOCUMENTER_KEY }}
+
   doctests:
     name: Doctests
     runs-on: ubuntu-latest
@@ -181,9 +184,10 @@ jobs:
       - uses: julia-actions/setup-julia@v3
         with:
           version: '1'
-      - run: |
-          julia --project=docs -e 'import Pkg; Pkg.develop(path=".")'
-          julia --project=docs -e '
+      - uses: julia-actions/cache@v3
+      - name: "Run doctests"
+        shell: julia --project=docs --color=yes {0}
+        run: |
             using Documenter: doctest
             using KernelAbstractions
-            doctest(KernelAbstractions; manual = true)'
+            doctest(KernelAbstractions; manual = true)
diff --git a/docs/Project.toml b/docs/Project.toml
@@ -1,5 +1,9 @@
 [deps]
 Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
+KernelAbstractions = "63c18a36-062a-441e-b654-da1e3ab1ce7c"
 
 [compat]
 Documenter = "1"
+
+[sources]
+KernelAbstractions = {path = ".."}
diff --git a/docs/make.jl b/docs/make.jl
@@ -1,5 +1,3 @@
-push!(Base.LOAD_PATH, dirname(@__DIR__))
-
 using KernelAbstractions
 using Documenter
 
diff --git a/docs/src/api.md b/docs/src/api.md
@@ -13,21 +13,63 @@
 @uniform
 @groupsize
 @ndrange
-synchronize
-allocate
 ```
 
 ## Host language
 
+### Backends and arrays
+
 ```@docs
+Backend
+GPU
+CPU
+POCLBackend
+get_backend
+KernelAbstractions.allocate
 KernelAbstractions.zeros
+KernelAbstractions.ones
+KernelAbstractions.copyto!
+KernelAbstractions.pagelock!
+KernelAbstractions.unsafe_free!
+KernelAbstractions.functional
 KernelAbstractions.supports_unified
+KernelAbstractions.supports_atomics
+KernelAbstractions.supports_float64
 ```
 
-## Internal
+### Devices and execution
+
+```@docs
+synchronize
+KernelAbstractions.device
+KernelAbstractions.ndevices
+KernelAbstractions.device!
+KernelAbstractions.priority!
+```
+
+### Kernel handles
 
 ```@docs
 KernelAbstractions.Kernel
+KernelAbstractions.workgroupsize
+KernelAbstractions.ndrange
+KernelAbstractions.backend
+```
+
+## Reflection
+
+These macros help inspect the generated kernel code. LLVM reflection via
+[`@ka_code_llvm`](@ref) is only supported on the CPU backend, not on GPU backends.
+
+```@docs
+@ka_code_typed
+@ka_code_llvm
+```
+
+## Internal
+
+```@docs
 KernelAbstractions.partition
 KernelAbstractions.@context
+KernelAbstractions.argconvert
 ```
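
Review note: a minimal sketch of how the host functions newly listed in `api.md` fit together. It assumes the CPU backend and `Float32` arrays of length 16; both are chosen here for illustration and are not taken from the PR. The `KA` alias is local to this sketch.

```julia
using KernelAbstractions
const KA = KernelAbstractions  # shorthand used only in this sketch

backend = CPU()

# Capability queries listed under "Backends and arrays".
@show KA.functional(backend)
@show KA.supports_float64(backend)

# Backend-agnostic allocation: one initialized array, one uninitialized.
src = KA.ones(backend, Float32, 16)
dst = KA.allocate(backend, Float32, 16)   # contents undefined until written

# Copies may be asynchronous on some backends, so synchronize before reading.
KA.copyto!(backend, dst, src)
KA.synchronize(backend)

# The backend can be recovered from the array itself.
@assert KA.get_backend(dst) isa KA.Backend
@assert all(dst .== 1.0f0)
```

On a GPU backend the same calls should apply with `backend` swapped for the vendor backend object, although that path is not exercised here.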
diff --git a/docs/src/examples/memcopy_static.md b/docs/src/examples/memcopy_static.md
@@ -1,4 +1,4 @@
-# Memcopy with static NDRange
+# [Memcopy with static NDRange](@id memcopy_static)
 
 The first example simple copies memory from `B` to `A`. In contrast to the previous examples
 it uses a fully static kernel configuration. Specializing the kernel on the iteration range itself.
diff --git a/docs/src/implementations.md b/docs/src/implementations.md
@@ -1,4 +1,4 @@
-# Notes for backend implementations
+# [Notes for backend implementations](@id implementations_notes)
 
 ## Semantics of `KernelAbstractions.synchronize`
 
diff --git a/docs/src/kernels.md b/docs/src/kernels.md
@@ -1,23 +1,108 @@
-# Writing kernels 
+# Writing kernels
 
-These kernel language constructs are intended to be used as part
-of [`@kernel`](@ref) functions and not valid outside that context.
+These kernel language constructs are intended to be used inside [`@kernel`](@ref) functions.
+They are not valid in ordinary Julia code (except when using experimental `@kernel cpu=false`).
 
 ## Constant arguments
 
-Kernel functions allow for input arguments to be marked with the
-[`@Const`](@ref) macro. It informs the compiler that the memory
-accessed through that marked input argument, will not be written
-to as part of the kernel. This has the implication that input arguments
-are **not** allowed to alias each other. If you are used to CUDA C this
-is similar to `const restrict`.
+Kernel functions allow input arguments to be marked with the [`@Const`](@ref) macro. It informs
+the compiler that the memory accessed through that argument will not be written to as part of
+the kernel, and that it does not alias any other memory in the kernel. If you are used to CUDA C,
+this is similar to `const restrict`.
+
+```julia
+@kernel function saxpy!(a, @Const(X), Y)
+    I = @index(Global)
+    @inbounds Y[I] = a * X[I] + Y[I]
+end
+```
 
 ## Indexing
 
-There are several [`@index`](@ref) variants.
+The [`@index`](@ref) macro returns the index of the current work item. Choose a **granularity**
+and an optional **kind**:
+
+| Granularity | Meaning |
+|-------------|---------|
+| `Global` | Index over the full `ndrange` (use for global memory) |
+| `Group` | Index of the current workgroup |
+| `Local` | Index within the current workgroup |
+
+| Kind | Result type |
+|------|-------------|
+| `Linear` (default) | `Int` linear index |
+| `Cartesian` | `CartesianIndex` for multi-dimensional `ndrange` |
+| `NTuple` | `NTuple` of `Int` indices |
+
+```julia
+@kernel function fill_diagonal!(A, val)
+    I = @index(Global, Cartesian)
+    if I[1] == I[2]
+        @inbounds A[I] = val
+    end
+end
+
+@kernel function linear_example(A)
+    I = @index(Global, Linear)   # 1, 2, 3, ...
+    g = @index(Group, Linear)    # workgroup id
+    l = @index(Local, Linear)    # lane within workgroup
+    @inbounds A[I] = g + l
+end
+```
+
+Inside a kernel, [`@groupsize`](@ref) and [`@ndrange`](@ref) query the launch configuration:
+
+```julia
+@kernel function scale!(A, factor)
+    N = prod(@groupsize())
+    I = @index(Global, Linear)
+    lmem = @localmem Float32 (N,)
+    i = @index(Local, Linear)
+    lmem[i] = factor
+    @synchronize()
+    @inbounds A[I] = lmem[i]
+end
+```
+
+## Local memory, synchronization, and private memory
+
+[`@localmem`](@ref) declares storage shared by all work items in a workgroup. Reads and writes
+must be separated by [`@synchronize`](@ref) if they are performed by different work items:
+
+```julia
+@kernel function reverse_block!(A)
+    I = @index(Global, Linear)
+    i = @index(Local, Linear)
+    N = prod(@groupsize())
+    buf = @localmem Int (N,)
+    buf[i] = i
+    @synchronize()
+    @inbounds A[I] = buf[N - i + 1]
+end
+```
+
+[`@private`](@ref) and [`@uniform`](@ref) are deprecated for KernelAbstractions 1.0. Prefer
+`MArray` for per-lane scratch storage that does not need to survive across `@synchronize`.
+
+## Launching kernels
+
+Construct a kernel by calling the kernel function on a backend and optional static sizes, then
+launch it with `ndrange`:
+
+```julia
+# dynamic sizes — supply ndrange (and optionally workgroupsize) at launch
+kernel = my_kernel(backend)
+kernel(A, ndrange=size(A))
 
-## Local memory, variable lifetime and private memory
+# static workgroup size
+kernel = my_kernel(backend, 256)
+kernel(A, ndrange=size(A))
 
-[`@localmem`](@ref), [`@synchronize`](@ref), [`@private`](@ref)
+# static workgroup size and ndrange — fewer runtime checks, may reduce recompilation
+kernel = my_kernel(backend, 32, size(A))
+kernel(A)
+```
 
-# Launching kernels
+On GPU backends, obtain the backend from an array with [`get_backend`](@ref) and always call
+[`synchronize`](@ref) before reading results on the host. See the [Quickstart](@ref) for a full
+walkthrough and the Examples section of the manual for larger patterns.
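
Review note: the new `kernels.md` shows the kernel definition and the launch forms separately. The sketch below joins the two for the `saxpy!` kernel from the "Constant arguments" section. The array length (1024), the workgroup size (64), and the scalar values are assumptions picked for illustration, and the example runs on whatever backend the plain `Array` reports.

```julia
using KernelAbstractions

# Same kernel as in the "Constant arguments" section: X is read-only
# and must not alias Y.
@kernel function saxpy!(a, @Const(X), Y)
    I = @index(Global)
    @inbounds Y[I] = a * X[I] + Y[I]
end

X = ones(Float32, 1024)
Y = fill(2.0f0, 1024)

# Recover the backend from the output array, then build a kernel object
# with a static workgroup size and a dynamic ndrange.
backend = KernelAbstractions.get_backend(Y)
kernel = saxpy!(backend, 64)
kernel(3.0f0, X, Y; ndrange = length(Y))

# Launches are asynchronous; wait before reading Y on the host.
KernelAbstractions.synchronize(backend)

@assert all(Y .== 5.0f0)   # 3 * 1 + 2
```

Passing `length(Y)` as a third constructor argument instead would give the fully static form described in the hunk above, after which the launch takes no `ndrange` keyword.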
diff --git a/docs/src/quickstart.md b/docs/src/quickstart.md
@@ -43,6 +43,21 @@ all(A .== 2.0)
 All kernels are launched asynchronously.
 The [`synchronize`](@ref) blocks the *host* until the kernel has completed on the backend.
 
+### Static workgroup size and `ndrange`
+
+When the workgroup size and `ndrange` are known ahead of time, pass them to the kernel
+constructor to enable additional compile-time optimizations and avoid supplying them at
+every launch:
+
+```julia
+# workgroup size 32, ndrange (128, 128) — fixed for this kernel object
+kernel = mul2_kernel(dev, 32, size(A))
+kernel(A)  # ndrange was fixed at construction, so none is passed here
+synchronize(dev)
+```
+
+See also [Memcopy with static NDRange](@ref memcopy_static).
+
 ## Launching kernel on the backend
 
 To launch the kernel on a backend-supported backend `isa(backend, KA.GPU)` (e.g., `CUDABackend()`, `ROCBackend()`, `oneAPIBackend()`, `MetalBackend()`), we generate the kernel
@@ -108,6 +123,38 @@ function mymul(A, B)
 end
 ```
 
-## Using task programming to launch kernels in parallel.
+## Using task programming to launch kernels in parallel
+
+As shown in the [Synchronization](@ref) section above, multiple kernels can be enqueued on the
+same backend before a single [`synchronize`](@ref) call. The same pattern extends to Julia's
+task-based parallelism: launch kernels from [`Threads.@spawn`](https://docs.julialang.org/en/stable/base/multi-threading/#Base.Threads.@spawn)
+tasks when you want to overlap kernel execution with other asynchronous host work.
+
+On GPU backends, [`synchronize`](@ref) is **cooperative** — it yields to the Julia scheduler
+rather than blocking inside a driver call, so other tasks can make progress while a kernel runs.
+See [Notes for backend implementations](@ref implementations_notes) for the contract backend authors must follow.
+
+```julia
+function cooperative_wait(task::Task)
+    while !Base.istaskdone(task)
+        yield()
+    end
+    return wait(task)
+end
+
+function exchange_and_compute!(backend, A, B)
+    recv = Threads.@spawn begin
+        mul2_kernel(backend, 64)(A, ndrange=length(A))
+        synchronize(backend)  # cooperative on GPU backends
+    end
+    send = Threads.@spawn begin
+        mul2_kernel(backend, 64)(B, ndrange=length(B))
+        synchronize(backend)
+    end
+    cooperative_wait(recv)
+    cooperative_wait(send)
+end
+```
 
-TODO
+A full MPI example that overlaps communication with device copies is in
+[`examples/mpi.jl`](https://github.com/JuliaGPU/KernelAbstractions.jl/blob/master/examples/mpi.jl).
diff --git a/src/KernelAbstractions.jl b/src/KernelAbstractions.jl
diff --git a/src/nditeration.jl b/src/nditeration.jl
diff --git a/src/reflection.jl b/src/reflection.jl
