Add precompilation workload and move SpecialFunctions to an extension by gbaraldi · Pull Request #936 · JuliaGPU/AMDGPU.jl

gbaraldi · 2026-06-23T16:33:55Z

Mirrors CUDA.jl's startup strategy to cut time-to-first-kernel and the cost of `using AMDGPU`.

Precompilation workload

Adds src/precompile.jl (included at the end of AMDGPU.jl) with a
PrecompileTools @compile_workload that runs the GPUCompiler → AMDGPU codegen
pipeline on a dummy kernel during precompilation, so the first real kernel
launch no longer JIT-compiles the entire compiler.

Measured on Julia 1.12.6 (gfx, single GPU):

metric	before	after
first kernel (cold)	~8.2 s	~1.0 s
first `A * B`	full first-use compile	0 methods compiled

Details:

- The workload builds the compiler config manually (baseline gfx900 target), so
  it needs neither a GPU nor ROCm discovery; it only uses LLVM. Guarded by
  `:AMDGPU in LLVM.backends()` and `VERSION >= v"1.12-"` (matching CUDA.jl's
  foreign-MI-during-precompile workaround).
- Because the workload compiles before `__init__`/ROCm discovery runs,
  `libdevice_libs` is empty, which would poison the `DEVICE_LIBS` /
  `_global_hostcalls` caches with empty entries and bake them into the image
  (breaking `__ocml_*` linking at runtime). They are `empty!`-ed after the
  workload so they repopulate correctly once discovery has run.
- Adds the same explicit launch/library precompile directives CUDA.jl uses:
  `hipfunction`/`actual_compilation`/`hiplink`, plus the hot entry points of the
  bundled libraries (rocBLAS `gemm!`/`*`/`mul!`, rocSOLVER factorizations, rocFFT
  plans, `rand`).
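
The cache-poisoning problem described above can be illustrated with a minimal, self-contained sketch. It uses only Base Julia. The names `LIBS`, `LIB_CACHE`, and `cached_libs`, and the library file names, are hypothetical stand-ins chosen for illustration; they are not AMDGPU.jl's actual `libdevice_libs`, `DEVICE_LIBS`, or `_global_hostcalls` definitions, and the real caches may be keyed and filled differently.

```julia
# Hypothetical stand-ins; these are NOT AMDGPU.jl's real definitions.
const LIBS = String[]                             # filled by "discovery" at init time
const LIB_CACHE = Dict{String,Vector{String}}()   # memoized per-target library lists

# Memoized lookup: the first call for a target freezes whatever LIBS holds then.
cached_libs(target::String) = get!(() -> copy(LIBS), LIB_CACHE, target)

# 1. Precompile-time workload: discovery has not run yet, so LIBS is empty
#    and the cache records an empty entry for the target.
cached_libs("gfx900")
poisoned = copy(LIB_CACHE["gfx900"])   # the entry that would be baked into the image

# 2. The fix: drop everything the workload cached before the image is written.
empty!(LIB_CACHE)

# 3. Runtime: discovery fills LIBS, and the next lookup repopulates the cache.
push!(LIBS, "ocml.bc", "ockl.bc")      # placeholder file names
resolved = cached_libs("gfx900")
```

Without step 2, the lookup in step 3 would return the stale empty list, because `get!` only calls its function when the key is missing.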

SpecialFunctions extension

Moves SpecialFunctions from a hard dependency to a weakdep +
`AMDGPUSpecialFunctionsExt`, extracting its device overrides out of
`device/gcn/math.jl` (the Base-math overrides stay in the package).
`using AMDGPU` no longer loads SpecialFunctions and its dependency tree; the device
overrides are registered when a user loads SpecialFunctions, same as CUDA.jl.

Verified: Base-math kernels (sqrt/exp/sin/pow), matmul, reductions, and
the extension's erf/loggamma device overrides all produce correct results.

🤖 Generated with Claude Code

github-actions

AMDGPU.jl Benchmarks

Details

Benchmark suite	Current: `b39908f`	Previous: `7c9aab0`	Ratio
`amdgpu/synchronization/context/device`	`600` ns	`600` ns	`1`
`amdgpu/synchronization/stream/blocking`	`250` ns	`250` ns	`1`
`amdgpu/synchronization/stream/nonblocking`	`330` ns	`330` ns	`1`
`array/accumulate/Float32/1d`	`87581` ns	`85972` ns	`1.02`
`array/accumulate/Float32/dims=1`	`340805` ns	`412075` ns	`0.83`
`array/accumulate/Float32/dims=1L`	`136932` ns	`137091` ns	`1.00`
`array/accumulate/Float32/dims=2`	`130092` ns	`130332` ns	`1.00`
`array/accumulate/Float32/dims=2L`	`2810929` ns	`2810115` ns	`1.00`
`array/accumulate/Int64/1d`	`98781` ns	`102751` ns	`0.96`
`array/accumulate/Int64/dims=1`	`342704` ns	`442706` ns	`0.77`
`array/accumulate/Int64/dims=1L`	`168402` ns	`167432` ns	`1.01`
`array/accumulate/Int64/dims=2`	`122861` ns	`127031` ns	`0.97`
`array/accumulate/Int64/dims=2L`	`2988292` ns	`2984467` ns	`1.00`
`array/broadcast`	`147262` ns	`70231` ns	`2.10`
`array/construct`	`1660` ns	`1700` ns	`0.98`
`array/copy`	`39831` ns	`40561` ns	`0.98`
`array/copyto!/cpu_to_gpu`	`182913` ns	`121541` ns	`1.50`
`array/copyto!/gpu_to_cpu`	`114891` ns	`114461` ns	`1.00`
`array/copyto!/gpu_to_gpu`	`131672` ns	`66551` ns	`1.98`
`array/iteration/findall/bool`	`183593` ns	`181832` ns	`1.01`
`array/iteration/findall/int`	`198563` ns	`192932` ns	`1.03`
`array/iteration/findfirst/bool`	`121652` ns	`122251` ns	`1.00`
`array/iteration/findfirst/int`	`117022` ns	`116342` ns	`1.01`
`array/iteration/findmin/1d`	`167612` ns	`170152` ns	`0.99`
`array/iteration/findmin/2d`	`156902` ns	`153822` ns	`1.02`
`array/iteration/logical`	`351255` ns	`350744` ns	`1.00`
`array/iteration/scalar`	`297134` ns	`296083` ns	`1.00`
`array/permutedims/2d`	`75101` ns	`74481` ns	`1.01`
`array/permutedims/3d`	`75361` ns	`74251` ns	`1.01`
`array/permutedims/4d`	`77521` ns	`76951` ns	`1.01`
`array/random/rand/Float32`	`54780` ns	`52171` ns	`1.05`
`array/random/rand/Int64`	`57970` ns	`58731` ns	`0.99`
`array/random/rand!/Float32`	`144282` ns	`85101` ns	`1.70`
`array/random/rand!/Int64`	`148042` ns	`69261` ns	`2.14`
`array/random/randn/Float32`	`88831` ns	`98642` ns	`0.90`
`array/random/randn!/Float32`	`85702` ns	`101231` ns	`0.85`
`array/reductions/mapreduce/Float32/1d`	`134732` ns	`134242` ns	`1.00`
`array/reductions/mapreduce/Float32/dims=1`	`96252` ns	`95431` ns	`1.01`
`array/reductions/mapreduce/Float32/dims=1L`	`776151` ns	`774349` ns	`1.00`
`array/reductions/mapreduce/Float32/dims=2`	`97541` ns	`97531` ns	`1.00`
`array/reductions/mapreduce/Float32/dims=2L`	`300234` ns	`297464` ns	`1.01`
`array/reductions/mapreduce/Int64/1d`	`135342` ns	`134951` ns	`1.00`
`array/reductions/mapreduce/Int64/dims=1`	`96162` ns	`95301` ns	`1.01`
`array/reductions/mapreduce/Int64/dims=1L`	`782481` ns	`781800` ns	`1.00`
`array/reductions/mapreduce/Int64/dims=2`	`97292` ns	`96801` ns	`1.01`
`array/reductions/mapreduce/Int64/dims=2L`	`298984` ns	`299524` ns	`1.00`
`array/reductions/reduce/Float32/1d`	`134262` ns	`133912` ns	`1.00`
`array/reductions/reduce/Float32/dims=1`	`95851` ns	`95711` ns	`1.00`
`array/reductions/reduce/Float32/dims=1L`	`776211` ns	`775219` ns	`1.00`
`array/reductions/reduce/Float32/dims=2`	`98121` ns	`97621` ns	`1.01`
`array/reductions/reduce/Float32/dims=2L`	`299415` ns	`297424` ns	`1.01`
`array/reductions/reduce/Int64/1d`	`135342` ns	`134602` ns	`1.01`
`array/reductions/reduce/Int64/dims=1`	`96441` ns	`95311` ns	`1.01`
`array/reductions/reduce/Int64/dims=1L`	`785441` ns	`780269` ns	`1.01`
`array/reductions/reduce/Int64/dims=2`	`97792` ns	`97121` ns	`1.01`
`array/reductions/reduce/Int64/dims=2L`	`300154` ns	`299264` ns	`1.00`
`array/reverse/1d`	`43911` ns	`44550` ns	`0.99`
`array/reverse/1dL`	`75561` ns	`76661` ns	`0.99`
`array/reverse/1dL_inplace`	`166332` ns	`173202` ns	`0.96`
`array/reverse/1d_inplace`	`75971` ns	`84571` ns	`0.90`
`array/reverse/2d`	`51641` ns	`52831` ns	`0.98`
`array/reverse/2dL`	`102322` ns	`102811` ns	`1.00`
`array/reverse/2dL_inplace`	`159442` ns	`178873` ns	`0.89`
`array/reverse/2d_inplace`	`144072` ns	`96051` ns	`1.50`
`array/sorting/1d`	`341175` ns	`379995` ns	`0.90`
`integration/byval/reference`	`39811` ns	`39540` ns	`1.01`
`integration/byval/slices=1`	`40810` ns	`40350` ns	`1.01`
`integration/byval/slices=2`	`146032` ns	`159152` ns	`0.92`
`integration/byval/slices=3`	`237953` ns	`238933` ns	`1.00`
`integration/volumerhs`	`5047571` ns	`5031334` ns	`1.00`
`kernel/indexing`	`130511` ns	`65521` ns	`1.99`
`kernel/indexing_checked`	`65341` ns	`72491` ns	`0.90`
`kernel/launch`	`1330` ns	`1280` ns	`1.04`
`kernel/rand`	`123352` ns	`124252` ns	`0.99`
`latency/import`	`1516539925` ns	`1491816057` ns	`1.02`
`latency/precompile`	`33049701658` ns	`11773992921` ns	`2.81`
`latency/ttfp`	`2079499186` ns	`10954774141` ns	`0.19`
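
The Ratio column matches current divided by previous, rounded to two decimals, so for these timings a value above 1 means the current commit is slower and a value below 1 means it is faster. A small sketch that recomputes the three latency rows from the table (the helper names `ratio` and `seconds` are illustrative, not part of the benchmark tooling):

```julia
# (current_ns, previous_ns) copied from the latency rows of the table above.
latency = Dict(
    "latency/import"     => (1516539925, 1491816057),
    "latency/precompile" => (33049701658, 11773992921),
    "latency/ttfp"       => (2079499186, 10954774141),
)

# Ratio as reported in the table: current / previous, rounded to two decimals.
ratio(current, previous) = round(current / previous; digits = 2)

ratios = Dict(name => ratio(c, p) for (name, (c, p)) in latency)

# Convert nanoseconds to seconds for readability.
seconds(ns) = ns / 1e9

for (name, (c, p)) in sort(collect(latency))
    println(name, ": ", round(seconds(p); digits = 2), " s -> ",
            round(seconds(c); digits = 2), " s (ratio ", ratios[name], ")")
end
```

Read this way, the `latency/precompile` ratio of 2.81 corresponds to roughly 11.8 s growing to roughly 33.0 s, which is consistent with compilation work moving into the precompile step, while the `latency/ttfp` ratio of 0.19 corresponds to a drop from roughly 11.0 s to roughly 2.1 s.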

This comment was automatically generated by workflow using github-action-benchmark.

…extension

Mirror CUDA.jl's startup strategy:

- Add `src/precompile.jl` with a PrecompileTools `@compile_workload` that runs the GPUCompiler -> AMDGPU codegen pipeline on a dummy kernel during precompilation. The first kernel launch no longer has to JIT-compile the whole compiler: cold first-kernel time drops ~8.2s -> ~1.0s.

  The workload builds the compiler config manually (baseline gfx900 target) so it needs neither a GPU nor ROCm discovery, and is guarded by `:AMDGPU in LLVM.backends()` and Julia >= 1.12 (matching CUDA.jl's foreign-MI workaround).

  Because the workload compiles before `__init__`/ROCm discovery runs, `libdevice_libs` is empty and the device-lib caches (`DEVICE_LIBS`, `_global_hostcalls`) would be poisoned with empty entries and baked into the image, breaking `__ocml_*` linking at runtime. We `empty!` them after the workload so they repopulate correctly once discovery has run.

  Adds the same explicit launch/library precompile directives CUDA.jl uses (hipfunction/actual_compilation/hiplink, plus rocBLAS/rocSOLVER/rocFFT/rand entry points).

- Move SpecialFunctions to a weakdep + `AMDGPUSpecialFunctionsExt`, extracting the SpecialFunctions device overrides out of `device/gcn/math.jl` (Base-math overrides stay in the package). `using AMDGPU` no longer loads SpecialFunctions and its dependency tree.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

luraess · 2026-06-24T06:05:47Z

Thanks! Feel free to merge when ready.

luraess · 2026-06-25T09:30:27Z

@gbaraldi is this ready?

gbaraldi force-pushed the gb/precompile-startup branch from 349cf73 to 2a3506a Compare June 23, 2026 17:15

github-actions Bot reviewed Jun 23, 2026

gbaraldi force-pushed the gb/precompile-startup branch from 2a3506a to 1610237 Compare June 23, 2026 21:57

gbaraldi force-pushed the gb/precompile-startup branch from 1610237 to b39908f Compare June 23, 2026 22:22

luraess approved these changes Jun 24, 2026

gbaraldi wants to merge 1 commit into main from gb/precompile-startup
