Add precompilation workload and move SpecialFunctions to an extension #936
Open
gbaraldi wants to merge 1 commit into
349cf73 to
2a3506a
Compare
AMDGPU.jl Benchmarks
| Benchmark suite | Current: b39908f | Previous: 7c9aab0 | Ratio |
|---|---|---|---|
| amdgpu/synchronization/context/device | 600 ns | 600 ns | 1 |
| amdgpu/synchronization/stream/blocking | 250 ns | 250 ns | 1 |
| amdgpu/synchronization/stream/nonblocking | 330 ns | 330 ns | 1 |
| array/accumulate/Float32/1d | 87581 ns | 85972 ns | 1.02 |
| array/accumulate/Float32/dims=1 | 340805 ns | 412075 ns | 0.83 |
| array/accumulate/Float32/dims=1L | 136932 ns | 137091 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 130092 ns | 130332 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 2810929 ns | 2810115 ns | 1.00 |
| array/accumulate/Int64/1d | 98781 ns | 102751 ns | 0.96 |
| array/accumulate/Int64/dims=1 | 342704 ns | 442706 ns | 0.77 |
| array/accumulate/Int64/dims=1L | 168402 ns | 167432 ns | 1.01 |
| array/accumulate/Int64/dims=2 | 122861 ns | 127031 ns | 0.97 |
| array/accumulate/Int64/dims=2L | 2988292 ns | 2984467 ns | 1.00 |
| array/broadcast | 147262 ns | 70231 ns | 2.10 |
| array/construct | 1660 ns | 1700 ns | 0.98 |
| array/copy | 39831 ns | 40561 ns | 0.98 |
| array/copyto!/cpu_to_gpu | 182913 ns | 121541 ns | 1.50 |
| array/copyto!/gpu_to_cpu | 114891 ns | 114461 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 131672 ns | 66551 ns | 1.98 |
| array/iteration/findall/bool | 183593 ns | 181832 ns | 1.01 |
| array/iteration/findall/int | 198563 ns | 192932 ns | 1.03 |
| array/iteration/findfirst/bool | 121652 ns | 122251 ns | 1.00 |
| array/iteration/findfirst/int | 117022 ns | 116342 ns | 1.01 |
| array/iteration/findmin/1d | 167612 ns | 170152 ns | 0.99 |
| array/iteration/findmin/2d | 156902 ns | 153822 ns | 1.02 |
| array/iteration/logical | 351255 ns | 350744 ns | 1.00 |
| array/iteration/scalar | 297134 ns | 296083 ns | 1.00 |
| array/permutedims/2d | 75101 ns | 74481 ns | 1.01 |
| array/permutedims/3d | 75361 ns | 74251 ns | 1.01 |
| array/permutedims/4d | 77521 ns | 76951 ns | 1.01 |
| array/random/rand/Float32 | 54780 ns | 52171 ns | 1.05 |
| array/random/rand/Int64 | 57970 ns | 58731 ns | 0.99 |
| array/random/rand!/Float32 | 144282 ns | 85101 ns | 1.70 |
| array/random/rand!/Int64 | 148042 ns | 69261 ns | 2.14 |
| array/random/randn/Float32 | 88831 ns | 98642 ns | 0.90 |
| array/random/randn!/Float32 | 85702 ns | 101231 ns | 0.85 |
| array/reductions/mapreduce/Float32/1d | 134732 ns | 134242 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1 | 96252 ns | 95431 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=1L | 776151 ns | 774349 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2 | 97541 ns | 97531 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 300234 ns | 297464 ns | 1.01 |
| array/reductions/mapreduce/Int64/1d | 135342 ns | 134951 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1 | 96162 ns | 95301 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=1L | 782481 ns | 781800 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 97292 ns | 96801 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=2L | 298984 ns | 299524 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 134262 ns | 133912 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1 | 95851 ns | 95711 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1L | 776211 ns | 775219 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 98121 ns | 97621 ns | 1.01 |
| array/reductions/reduce/Float32/dims=2L | 299415 ns | 297424 ns | 1.01 |
| array/reductions/reduce/Int64/1d | 135342 ns | 134602 ns | 1.01 |
| array/reductions/reduce/Int64/dims=1 | 96441 ns | 95311 ns | 1.01 |
| array/reductions/reduce/Int64/dims=1L | 785441 ns | 780269 ns | 1.01 |
| array/reductions/reduce/Int64/dims=2 | 97792 ns | 97121 ns | 1.01 |
| array/reductions/reduce/Int64/dims=2L | 300154 ns | 299264 ns | 1.00 |
| array/reverse/1d | 43911 ns | 44550 ns | 0.99 |
| array/reverse/1dL | 75561 ns | 76661 ns | 0.99 |
| array/reverse/1dL_inplace | 166332 ns | 173202 ns | 0.96 |
| array/reverse/1d_inplace | 75971 ns | 84571 ns | 0.90 |
| array/reverse/2d | 51641 ns | 52831 ns | 0.98 |
| array/reverse/2dL | 102322 ns | 102811 ns | 1.00 |
| array/reverse/2dL_inplace | 159442 ns | 178873 ns | 0.89 |
| array/reverse/2d_inplace | 144072 ns | 96051 ns | 1.50 |
| array/sorting/1d | 341175 ns | 379995 ns | 0.90 |
| integration/byval/reference | 39811 ns | 39540 ns | 1.01 |
| integration/byval/slices=1 | 40810 ns | 40350 ns | 1.01 |
| integration/byval/slices=2 | 146032 ns | 159152 ns | 0.92 |
| integration/byval/slices=3 | 237953 ns | 238933 ns | 1.00 |
| integration/volumerhs | 5047571 ns | 5031334 ns | 1.00 |
| kernel/indexing | 130511 ns | 65521 ns | 1.99 |
| kernel/indexing_checked | 65341 ns | 72491 ns | 0.90 |
| kernel/launch | 1330 ns | 1280 ns | 1.04 |
| kernel/rand | 123352 ns | 124252 ns | 0.99 |
| latency/import | 1516539925 ns | 1491816057 ns | 1.02 |
| latency/precompile | 33049701658 ns | 11773992921 ns | 2.81 |
| latency/ttfp | 2079499186 ns | 10954774141 ns | 0.19 |
This comment was automatically generated by workflow using github-action-benchmark.
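The Ratio column appears to be the Current timing divided by the Previous one. That is an assumption inferred from the numbers, not something the bot comment states. A small sketch that recomputes the ratio for the three latency rows, using the timings copied from the benchmark table:

```julia
# Timings in nanoseconds, copied from the latency rows of the benchmark table.
latency = [
    ("latency/import",     1516539925,  1491816057),
    ("latency/precompile", 33049701658, 11773992921),
    ("latency/ttfp",       2079499186,  10954774141),
]

# Assumption: ratio = current / previous, so values below 1 mean the PR is faster.
ratios = Dict(name => round(current / previous; digits = 2)
              for (name, current, previous) in latency)

# Print each row in seconds next to its recomputed ratio.
for (name, current, previous) in latency
    println(rpad(name, 20), current / 1e9, " s vs ", previous / 1e9, " s, ratio ", ratios[name])
end
```

Read this way, precompilation takes roughly 2.8 times longer (about 33.0 s against 11.8 s), while `latency/ttfp` falls to roughly a fifth (about 2.08 s against 10.95 s).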
2a3506a to
1610237
Compare
…extension

Mirror CUDA.jl's startup strategy:

- Add `src/precompile.jl` with a PrecompileTools `@compile_workload` that runs the GPUCompiler -> AMDGPU codegen pipeline on a dummy kernel during precompilation. The first kernel launch no longer has to JIT-compile the whole compiler: cold first-kernel time drops ~8.2s -> ~1.0s.

  The workload builds the compiler config manually (baseline gfx900 target) so it needs neither a GPU nor ROCm discovery, and is guarded by `:AMDGPU in LLVM.backends()` and Julia >= 1.12 (matching CUDA.jl's foreign-MI workaround).

  Because the workload compiles before `__init__`/ROCm discovery runs, `libdevice_libs` is empty and the device-lib caches (`DEVICE_LIBS`, `_global_hostcalls`) would be poisoned with empty entries and baked into the image, breaking `__ocml_*` linking at runtime. We `empty!` them after the workload so they repopulate correctly once discovery has run.

  Adds the same explicit launch/library precompile directives CUDA.jl uses (hipfunction/actual_compilation/hiplink, plus rocBLAS/rocSOLVER/rocFFT/rand entry points).

- Move SpecialFunctions to a weakdep + `AMDGPUSpecialFunctionsExt`, extracting the SpecialFunctions device overrides out of `device/gcn/math.jl` (Base-math overrides stay in the package). `using AMDGPU` no longer loads SpecialFunctions and its dependency tree.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
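The cache-poisoning problem described in the commit message can be illustrated with a minimal, GPU-free sketch. Everything in it is a hypothetical stand-in: `simulate`, `libs_for`, the local `cache` and `discovered` variables, and the `"ocml.bc"` file name are not AMDGPU.jl internals. It only models the ordering issue, where a memoized lookup runs before discovery has filled in the data.

```julia
# Hypothetical model of a memoized device-library lookup; not AMDGPU.jl's real code.
function simulate(; clear_after_workload::Bool)
    discovered = String[]                    # what "discovery" has found so far
    cache = Dict{String,Vector{String}}()    # memoized per-target lookup

    # The first call for a target freezes whatever is known at that moment.
    libs_for(target) = get!(() -> copy(discovered), cache, target)

    libs_for("gfx900")                       # workload runs before discovery: caches []
    clear_after_workload && empty!(cache)    # the fix: drop the empty entry
    push!(discovered, "ocml.bc")             # discovery runs later, at load time
    return libs_for("gfx900")                # what the first real launch would see
end

without_fix = simulate(clear_after_workload = false)  # stale empty entry survives
with_fix    = simulate(clear_after_workload = true)   # entry is recomputed after discovery
```

Without the `empty!` call the lookup keeps returning the empty list that was cached during the workload. With it, the first lookup after discovery repopulates the entry.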
1610237 to
b39908f
Compare
luraess
approved these changes
Jun 24, 2026
Member
Thanks! Feel free to merge when ready.
Member
@gbaraldi is this ready?
Mirrors CUDA.jl's startup strategy to cut time-to-first-kernel and `using` cost.

Precompilation workload

Adds `src/precompile.jl` (included at the end of `AMDGPU.jl`) with a PrecompileTools `@compile_workload` that runs the GPUCompiler → AMDGPU codegen pipeline on a dummy kernel during precompilation, so the first real kernel launch no longer JIT-compiles the entire compiler.

Measured on Julia 1.12.6 (gfx, single GPU): (timing table not recovered from the page; its one surviving row label is `A * B`)

Details:

- The workload builds the compiler config manually (baseline `gfx900` target), so it needs neither a GPU nor ROCm discovery; it only uses LLVM. Guarded by `:AMDGPU in LLVM.backends()` and `VERSION >= v"1.12-"` (matching CUDA.jl's foreign-MI-during-precompile workaround).
- Because the workload compiles before `__init__`/ROCm discovery runs, `libdevice_libs` is empty, which would poison the `DEVICE_LIBS`/`_global_hostcalls` caches with empty entries and bake them into the image (breaking `__ocml_*` linking at runtime). They are `empty!`-ed after the workload so they repopulate correctly once discovery has run.
- Adds the same explicit launch/library `precompile` directives CUDA.jl uses: `hipfunction`/`actual_compilation`/`hiplink`, plus the hot entry points of the bundled libraries (rocBLAS `gemm!`/`*`/`mul!`, rocSOLVER factorizations, rocFFT plans, `rand`).

SpecialFunctions extension
Moves `SpecialFunctions` from a hard dependency to a weakdep + `AMDGPUSpecialFunctionsExt`, extracting its device overrides out of `device/gcn/math.jl` (the Base-math overrides stay in the package). `using AMDGPU` no longer loads SpecialFunctions and its dependency tree; the device overrides are registered when a user loads SpecialFunctions, same as CUDA.jl.

Verified: Base-math kernels (`sqrt`/`exp`/`sin`/`pow`), matmul, reductions, and the extension's `erf`/`loggamma` device overrides all produce correct results.

🤖 Generated with Claude Code
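For readers unfamiliar with package extensions, the weakdep arrangement described under "SpecialFunctions extension" would typically be declared in `Project.toml` along the following lines. This is a hypothetical excerpt, not the PR's actual diff, which is not shown on this page. Only the extension name `AMDGPUSpecialFunctionsExt` comes from the PR text.

```toml
# Hypothetical excerpt of a Project.toml; the PR's real file may differ.

[weakdeps]
# Listed here instead of under [deps], so `using AMDGPU` alone does not load it.
SpecialFunctions = "276daf66-3868-5448-9aa4-cd146d93841b"

[extensions]
# Extension module name = package(s) whose loading triggers the extension.
AMDGPUSpecialFunctionsExt = "SpecialFunctions"
```

By Julia's convention the extension module would live at `ext/AMDGPUSpecialFunctionsExt.jl` (or in a directory of the same name), and it is loaded automatically once both AMDGPU and SpecialFunctions are loaded in a session. On Julia 1.9 and later, `Base.get_extension(AMDGPU, :AMDGPUSpecialFunctionsExt)` returns the module if it has been loaded and `nothing` otherwise.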