Use Octavian.jl for large mixed-mode CPU calculations. #125
Interestingly, this only speeds up the tests on Julia 1.9. I can't imagine Octavian.jl being that much slower on <1.9?
Codecov Report: Patch and project coverage have no change.

Coverage diff, master vs. #125:
Coverage: 30.27% → 30.27%
Files: 11 → 11
Lines: 786 → 786
Hits: 238 → 238
Misses: 548 → 548
For timings, I get:

julia> @time using Octavian
0.217284 seconds (396.12 k allocations: 21.375 MiB, 6.10% gc time, 6.09% compilation time)
julia> @benchmark matmul!(C, A, B) setup=(C=zeros(Float32, 2048, 2048); A=rand(Float32, 2048, 2048); B=rand(Float32, 2048, 2048))
BenchmarkTools.Trial: 13 samples with 1 evaluation.
Range (min … max): 43.139 ms … 44.684 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 43.791 ms ┊ GC (median): 0.00%
Time (mean ± σ): 43.750 ms ± 447.341 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▁ ▁ ▁ ▁ ▁ ▁▁█ ▁ ▁ ▁
█▁▁█▁▁▁█▁▁▁█▁█▁▁▁█▁▁▁▁▁▁▁███▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
43.1 ms Histogram: frequency by time 44.7 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark matmul!(C, A, B) setup=(C=zeros(Float32, 2048, 2048); A=rand(Float16, 2048, 2048); B=rand(Float16, 2048, 2048))
BenchmarkTools.Trial: 14 samples with 1 evaluation.
Range (min … max): 42.711 ms … 43.548 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 43.004 ms ┊ GC (median): 0.00%
Time (mean ± σ): 43.067 ms ± 267.509 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▁▁ █ █ ▁▁ ▁ ▁ ▁ ▁▁
█▁▁▁▁▁▁██▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▁█▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁██ ▁
42.7 ms Histogram: frequency by time 43.5 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark mul!(C, A, B) setup=(C=zeros(Float32, 2048, 2048); A=rand(Float32, 2048, 2048); B=rand(Float32, 2048, 2048))
BenchmarkTools.Trial: 19 samples with 1 evaluation.
Range (min … max): 44.262 ms … 54.795 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 45.080 ms ┊ GC (median): 0.00%
Time (mean ± σ): 47.153 ms ± 3.564 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█▂
▅▅▁██▅▁▅▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▅▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▅ ▁
44.3 ms Histogram: frequency by time 54.8 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> versioninfo()
Julia Version 1.10.0-DEV.1608
Commit 0e8af1c162 (2023-06-30 04:06 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake)
Threads: 11 on 8 virtual cores
Environment:
JULIA_PATH = @.
LD_LIBRARY_PATH = /usr/local/lib/
JULIA_NUM_THREADS = 8
julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
├ [ILP64] libmkl_rt.so
└ [ LP64] libmkl_rt.so

Which, aside from

Although GitHub Actions CI is generally restricted to 1 core, so single-threaded is probably representative. I don't know about Buildkite.
I'm surprised it isn't <1.8, as 1.8 added
It should not be compiling for differently sized inputs, only different types.

julia> C=zeros(Float32, 2048, 2048); A=rand(Float16, 2048, 2048); B=rand(Float16, 2048, 2048);
julia> @time using Octavian
0.205357 seconds (396.14 k allocations: 21.375 MiB, 2.34% gc time, 6.29% compilation time)
julia> @time @eval matmul!(C,A,B);
10.354272 seconds (25.52 M allocations: 1.312 GiB, 2.72% gc time, 99.67% compilation time)

Code coverage:

julia> @time @eval matmul!(C,A,B);
202.818763 seconds (82.94 M allocations: 3.568 GiB, 0.28% gc time, 34.71% compilation time)

But hopefully only GemmKernel's coverage gets taken with
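The point that compilation is triggered by new element types rather than new sizes can be probed with a rough timing sketch like the one below. `first_call_time` is a hypothetical helper introduced here for illustration; it is not part of Octavian.jl. Timings depend heavily on the machine and Julia version, so none are quoted, and a new size may still exercise a code path that has not been compiled yet.

```julia
using Octavian

# Hypothetical helper (not an Octavian.jl API): time one matmul! call for the
# given element types and square size n. The first call for a new combination
# of element types includes compilation time.
function first_call_time(TC, TA, TB, n)
    C = zeros(TC, n, n)
    A = rand(TA, n, n)
    B = rand(TB, n, n)
    return @elapsed matmul!(C, A, B)
end

t_first     = first_call_time(Float32, Float16, Float16, 64)   # first call for these types
t_same_type = first_call_time(Float32, Float16, Float16, 128)  # new size, same types
t_new_type  = first_call_time(Float64, Float32, Float32, 64)   # new types
@show t_first t_same_type t_new_type
```

Comparing `t_same_type` against `t_first` and `t_new_type` gives a quick indication of how much of the first-call cost is tied to the type combination.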
Thanks for the input! Yes, we're only using a single thread, as we use multiple processes to run multiple tests in parallel.
We're just setting
Disabling coverage on 1.6-1.8 didn't help, so this seems like a different issue.
LinearAlgebra is hilariously slow for large mixed-mode (i.e. not supported by BLAS) multiplications:
Octavian.jl fares quite a bit better:
However, replacing all of our LinearAlgebra.mul! uses with Octavian.matmul! regresses test time. @chriselrod is that expected? I guess there's a significant compilation-time overhead for invoking Octavian.jl with many differently typed and sized inputs?

For now, only use Octavian for large mixed-mode cases, which gets test times back to before #124.
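The "large mixed-mode only" routing can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `cpu_mul!`, `is_mixed_mode`, and the `LARGE_THRESHOLD` cutoff are hypothetical names and values, and the sketch assumes element types that Octavian can handle (such as the Float16 inputs with a Float32 output benchmarked in the conversation). Only `Octavian.matmul!` and `LinearAlgebra.mul!` are existing APIs.

```julia
using LinearAlgebra
using Octavian

# Hypothetical cutoff (an assumption, not taken from the PR): only outputs with
# at least this many elements are routed to Octavian.
const LARGE_THRESHOLD = 256 * 256

# "Mixed-mode" here means the three element types are not one common BLAS
# float type, so LinearAlgebra cannot hand the product to BLAS.
is_mixed_mode(C, A, B) =
    !(eltype(C) == eltype(A) == eltype(B) && eltype(C) <: LinearAlgebra.BlasFloat)

# Hypothetical wrapper: use Octavian for large mixed-mode products and fall
# back to LinearAlgebra.mul! for everything else.
function cpu_mul!(C::AbstractMatrix, A::AbstractMatrix, B::AbstractMatrix)
    if is_mixed_mode(C, A, B) && length(C) >= LARGE_THRESHOLD
        Octavian.matmul!(C, A, B)
    else
        mul!(C, A, B)
    end
    return C
end
```

With `A` and `B` as 512×512 `Float16` matrices and `C` a 512×512 `Float32` matrix, `cpu_mul!(C, A, B)` takes the Octavian branch. Homogeneous `Float32` inputs stay on `mul!`, which limits the number of distinct type combinations Octavian has to compile for.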