
Add precompilation workload and move SpecialFunctions to an extension #936

Open: gbaraldi wants to merge 1 commit into main from gb/precompile-startup


Conversation

@gbaraldi

Member

Mirrors CUDA.jl's startup strategy to cut time-to-first-kernel and the load cost of `using AMDGPU`.

Precompilation workload

Adds src/precompile.jl (included at the end of AMDGPU.jl) with a
PrecompileTools @compile_workload that runs the GPUCompiler → AMDGPU codegen
pipeline on a dummy kernel during precompilation, so the first real kernel
launch no longer JIT-compiles the entire compiler.

Measured on Julia 1.12.6 (gfx, single GPU):

| metric | before | after |
|---|---|---|
| first kernel (cold) | ~8.2 s | ~1.0 s |
| first A * B | full first-use compile | 0 methods compiled |

Details:

  • The workload builds the compiler config manually (baseline gfx900 target), so
    it needs neither a GPU nor ROCm discovery — it only uses LLVM. Guarded by
    :AMDGPU in LLVM.backends() and VERSION >= v"1.12-" (matching CUDA.jl's
    foreign-MI-during-precompile workaround).
  • Because the workload compiles before __init__/ROCm discovery runs,
    libdevice_libs is empty, which would poison the DEVICE_LIBS /
    _global_hostcalls caches with empty entries and bake them into the image
    (breaking __ocml_* linking at runtime). They are empty!-ed after the
    workload so they repopulate correctly once discovery has run.
  • Adds the same explicit launch/library precompile directives CUDA.jl uses:
    hipfunction/actual_compilation/hiplink, plus the hot entry points of the
    bundled libraries (rocBLAS gemm!/*/mul!, rocSOLVER factorizations, rocFFT
    plans, rand).
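
The three points above can be illustrated with a minimal, package-free sketch. It is not the code in src/precompile.jl: every name in it (`workload_enabled`, `device_libs`, `link_cache`, `libs_for`, `scale!`, and the `"ocml.bc"` string) is a hypothetical stand-in, and the backend list is passed in by hand where the PR calls `LLVM.backends()`. It only shows the shape of the guard, why a memoized cache filled during the workload has to be cleared, and what an explicit `precompile` directive looks like.

```julia
# All names below are hypothetical stand-ins, not AMDGPU.jl internals.

# 1. Guard: run the workload only when the LLVM AMDGPU backend is present and
#    Julia is 1.12 or newer. `backends` stands in for `LLVM.backends()`.
#    v"1.12-" is the lowest 1.12 version, so 1.12 prereleases pass as well.
workload_enabled(version, backends) =
    :AMDGPU in backends && version >= v"1.12-"

# 2. Cache reset: a memoized lookup snapshots a library list that is still
#    empty while the workload runs, because discovery has not happened yet.
device_libs = String[]                       # filled by discovery at runtime
link_cache = Dict{Symbol,Vector{String}}()   # per-target memo
libs_for(target) = get!(() -> copy(device_libs), link_cache, target)

if workload_enabled(v"1.12.6", [:X86, :AMDGPU])
    libs_for(:gfx900)                        # the workload caches an empty entry
end
poisoned = isempty(get(link_cache, :gfx900, ["unset"]))
empty!(link_cache)                           # drop it so it is not kept

push!(device_libs, "ocml.bc")                # stand-in for runtime discovery
resolved = libs_for(:gfx900)                 # repopulated from the real list

# 3. Explicit directive: Base.precompile compiles one concrete signature
#    ahead of the first call and returns true when it succeeds.
scale!(y::Vector{Float32}, a::Float32) = (y .*= a; y)
directive_ok = precompile(scale!, (Vector{Float32}, Float32))
```

Without the `empty!` call, the second `libs_for(:gfx900)` would return the empty vector cached in step 2, which is the sketch's analogue of the broken `__ocml_*` linking described above.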

SpecialFunctions extension

Moves SpecialFunctions from a hard dependency to a weakdep +
AMDGPUSpecialFunctionsExt, extracting its device overrides out of
device/gcn/math.jl (the Base-math overrides stay in the package).
`using AMDGPU` no longer loads SpecialFunctions and its dependency tree; the
device overrides are registered when a user loads SpecialFunctions, same as
CUDA.jl.
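
As a rough illustration of the weakdep layout, the sketch below parses a hypothetical Project.toml fragment with Julia's TOML standard library. The fragment is an assumption about how such an entry is typically written, not a copy of this PR's Project.toml, and the SpecialFunctions UUID shown should be checked against the General registry before reuse.

```julia
using TOML

# Hypothetical Project.toml fragment: SpecialFunctions is listed under
# [weakdeps] and named as the trigger for the extension module.
project_fragment = """
[weakdeps]
SpecialFunctions = "276daf66-3868-5448-9aa4-cd146d93841b"

[extensions]
AMDGPUSpecialFunctionsExt = "SpecialFunctions"
"""

project = TOML.parse(project_fragment)

# The extension is loaded only once its trigger package is loaded.
trigger = project["extensions"]["AMDGPUSpecialFunctionsExt"]
is_weak = haskey(project["weakdeps"], trigger)
```

With a layout like this, Julia loads the extension module automatically once both the parent package and the trigger package are loaded; in a session, `Base.get_extension(AMDGPU, :AMDGPUSpecialFunctionsExt)` returns `nothing` until then.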

Verified: Base-math kernels (sqrt/exp/sin/pow), matmul, reductions, and
the extension's erf/loggamma device overrides all produce correct results.

🤖 Generated with Claude Code

@gbaraldi gbaraldi force-pushed the gb/precompile-startup branch from 349cf73 to 2a3506a Compare June 23, 2026 17:15

@github-actions github-actions Bot left a comment

Contributor


AMDGPU.jl Benchmarks

Details
| Benchmark suite | Current: b39908f | Previous: 7c9aab0 | Ratio |
|---|---|---|---|
| amdgpu/synchronization/context/device | 600 ns | 600 ns | 1 |
| amdgpu/synchronization/stream/blocking | 250 ns | 250 ns | 1 |
| amdgpu/synchronization/stream/nonblocking | 330 ns | 330 ns | 1 |
| array/accumulate/Float32/1d | 87581 ns | 85972 ns | 1.02 |
| array/accumulate/Float32/dims=1 | 340805 ns | 412075 ns | 0.83 |
| array/accumulate/Float32/dims=1L | 136932 ns | 137091 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 130092 ns | 130332 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 2810929 ns | 2810115 ns | 1.00 |
| array/accumulate/Int64/1d | 98781 ns | 102751 ns | 0.96 |
| array/accumulate/Int64/dims=1 | 342704 ns | 442706 ns | 0.77 |
| array/accumulate/Int64/dims=1L | 168402 ns | 167432 ns | 1.01 |
| array/accumulate/Int64/dims=2 | 122861 ns | 127031 ns | 0.97 |
| array/accumulate/Int64/dims=2L | 2988292 ns | 2984467 ns | 1.00 |
| array/broadcast | 147262 ns | 70231 ns | 2.10 |
| array/construct | 1660 ns | 1700 ns | 0.98 |
| array/copy | 39831 ns | 40561 ns | 0.98 |
| array/copyto!/cpu_to_gpu | 182913 ns | 121541 ns | 1.50 |
| array/copyto!/gpu_to_cpu | 114891 ns | 114461 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 131672 ns | 66551 ns | 1.98 |
| array/iteration/findall/bool | 183593 ns | 181832 ns | 1.01 |
| array/iteration/findall/int | 198563 ns | 192932 ns | 1.03 |
| array/iteration/findfirst/bool | 121652 ns | 122251 ns | 1.00 |
| array/iteration/findfirst/int | 117022 ns | 116342 ns | 1.01 |
| array/iteration/findmin/1d | 167612 ns | 170152 ns | 0.99 |
| array/iteration/findmin/2d | 156902 ns | 153822 ns | 1.02 |
| array/iteration/logical | 351255 ns | 350744 ns | 1.00 |
| array/iteration/scalar | 297134 ns | 296083 ns | 1.00 |
| array/permutedims/2d | 75101 ns | 74481 ns | 1.01 |
| array/permutedims/3d | 75361 ns | 74251 ns | 1.01 |
| array/permutedims/4d | 77521 ns | 76951 ns | 1.01 |
| array/random/rand/Float32 | 54780 ns | 52171 ns | 1.05 |
| array/random/rand/Int64 | 57970 ns | 58731 ns | 0.99 |
| array/random/rand!/Float32 | 144282 ns | 85101 ns | 1.70 |
| array/random/rand!/Int64 | 148042 ns | 69261 ns | 2.14 |
| array/random/randn/Float32 | 88831 ns | 98642 ns | 0.90 |
| array/random/randn!/Float32 | 85702 ns | 101231 ns | 0.85 |
| array/reductions/mapreduce/Float32/1d | 134732 ns | 134242 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1 | 96252 ns | 95431 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=1L | 776151 ns | 774349 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2 | 97541 ns | 97531 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 300234 ns | 297464 ns | 1.01 |
| array/reductions/mapreduce/Int64/1d | 135342 ns | 134951 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1 | 96162 ns | 95301 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=1L | 782481 ns | 781800 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 97292 ns | 96801 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=2L | 298984 ns | 299524 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 134262 ns | 133912 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1 | 95851 ns | 95711 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1L | 776211 ns | 775219 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 98121 ns | 97621 ns | 1.01 |
| array/reductions/reduce/Float32/dims=2L | 299415 ns | 297424 ns | 1.01 |
| array/reductions/reduce/Int64/1d | 135342 ns | 134602 ns | 1.01 |
| array/reductions/reduce/Int64/dims=1 | 96441 ns | 95311 ns | 1.01 |
| array/reductions/reduce/Int64/dims=1L | 785441 ns | 780269 ns | 1.01 |
| array/reductions/reduce/Int64/dims=2 | 97792 ns | 97121 ns | 1.01 |
| array/reductions/reduce/Int64/dims=2L | 300154 ns | 299264 ns | 1.00 |
| array/reverse/1d | 43911 ns | 44550 ns | 0.99 |
| array/reverse/1dL | 75561 ns | 76661 ns | 0.99 |
| array/reverse/1dL_inplace | 166332 ns | 173202 ns | 0.96 |
| array/reverse/1d_inplace | 75971 ns | 84571 ns | 0.90 |
| array/reverse/2d | 51641 ns | 52831 ns | 0.98 |
| array/reverse/2dL | 102322 ns | 102811 ns | 1.00 |
| array/reverse/2dL_inplace | 159442 ns | 178873 ns | 0.89 |
| array/reverse/2d_inplace | 144072 ns | 96051 ns | 1.50 |
| array/sorting/1d | 341175 ns | 379995 ns | 0.90 |
| integration/byval/reference | 39811 ns | 39540 ns | 1.01 |
| integration/byval/slices=1 | 40810 ns | 40350 ns | 1.01 |
| integration/byval/slices=2 | 146032 ns | 159152 ns | 0.92 |
| integration/byval/slices=3 | 237953 ns | 238933 ns | 1.00 |
| integration/volumerhs | 5047571 ns | 5031334 ns | 1.00 |
| kernel/indexing | 130511 ns | 65521 ns | 1.99 |
| kernel/indexing_checked | 65341 ns | 72491 ns | 0.90 |
| kernel/launch | 1330 ns | 1280 ns | 1.04 |
| kernel/rand | 123352 ns | 124252 ns | 0.99 |
| latency/import | 1516539925 ns | 1491816057 ns | 1.02 |
| latency/precompile | 33049701658 ns | 11773992921 ns | 2.81 |
| latency/ttfp | 2079499186 ns | 10954774141 ns | 0.19 |
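
The Ratio column appears to be Current divided by Previous, so values below 1 mean the PR branch is faster. The short check below recomputes it for the three latency rows and converts the timings to seconds; the numbers are copied from those rows, and the variable and function names are illustrative only.

```julia
# (current, previous) timings in nanoseconds, copied from the latency rows.
latency_ns = Dict(
    "latency/import"     => (1516539925, 1491816057),
    "latency/precompile" => (33049701658, 11773992921),
    "latency/ttfp"       => (2079499186, 10954774141),
)

# Assumed definition of the Ratio column: current / previous, two decimals.
ratio(current, previous) = round(current / previous; digits = 2)
ratios = Dict(name => ratio(cur, prev) for (name, (cur, prev)) in latency_ns)

# The same timings expressed in seconds, for easier reading.
seconds = Dict(name => (cur / 1e9, prev / 1e9) for (name, (cur, prev)) in latency_ns)
```

Read this way, precompilation takes roughly 33 s instead of 12 s, while `latency/ttfp` drops from about 11 s to about 2 s, which matches the trade-off the PR describes: more work at precompile time in exchange for a faster first use.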

This comment was automatically generated by workflow using github-action-benchmark.

@gbaraldi gbaraldi force-pushed the gb/precompile-startup branch from 2a3506a to 1610237 Compare June 23, 2026 21:57
…extension

Mirror CUDA.jl's startup strategy:

- Add `src/precompile.jl` with a PrecompileTools `@compile_workload` that runs
  the GPUCompiler -> AMDGPU codegen pipeline on a dummy kernel during
  precompilation. The first kernel launch no longer has to JIT-compile the whole
  compiler: cold first-kernel time drops ~8.2s -> ~1.0s. The workload builds the
  compiler config manually (baseline gfx900 target) so it needs neither a GPU nor
  ROCm discovery, and is guarded by `:AMDGPU in LLVM.backends()` and Julia >= 1.12
  (matching CUDA.jl's foreign-MI workaround).

  Because the workload compiles before `__init__`/ROCm discovery runs,
  `libdevice_libs` is empty and the device-lib caches (`DEVICE_LIBS`,
  `_global_hostcalls`) would be poisoned with empty entries and baked into the
  image, breaking `__ocml_*` linking at runtime. We `empty!` them after the
  workload so they repopulate correctly once discovery has run.

  Adds the same explicit launch/library precompile directives CUDA.jl uses
  (hipfunction/actual_compilation/hiplink, plus rocBLAS/rocSOLVER/rocFFT/rand
  entry points).

- Move SpecialFunctions to a weakdep + `AMDGPUSpecialFunctionsExt`, extracting the
  SpecialFunctions device overrides out of `device/gcn/math.jl` (Base-math
  overrides stay in the package). `using AMDGPU` no longer loads SpecialFunctions
  and its dependency tree.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@gbaraldi gbaraldi force-pushed the gb/precompile-startup branch from 1610237 to b39908f Compare June 23, 2026 22:22
@luraess

luraess commented Jun 24, 2026

Member

Thanks! Feel free to merge when ready.

@luraess

luraess commented Jun 25, 2026

Member

@gbaraldi is this ready?


2 participants