As reported in luraess/JuliaGPUPerf#1, the ^ operation on Float32 values significantly degrades the performance of GPU Triad 2D kernels of the form:
A[ix,iy] = B[ix,iy] + s*C[ix,iy]^pow_float
Performance (memory throughput in GB/s) drops by nearly a factor of 2 compared to the same experiment using Float64.
See https://github.com/luraess/JuliaGPUPerf/blob/main/cuda_bench.jl for a reproducer (and the README for the performance output).
All testing was done with Julia v1.7 and CUDA v3.8, on devices using the CUDA 11.4 stack without artifacts.
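
For readers who do not want to open the linked script, here is a minimal sketch of the kind of kernel and timing described above. It is not the code from cuda_bench.jl: the names triad2d_pow! and throughput, the array size, the thread layout, and the values of s and pow_float are all assumptions made for illustration, and the throughput figure assumes two array reads and one array write per element.

```julia
using CUDA

# Triad 2D kernel with a power operation: one thread per (ix, iy) element.
function triad2d_pow!(A, B, C, s, pow_float)
    ix = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    iy = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if ix <= size(A, 1) && iy <= size(A, 2)
        @inbounds A[ix, iy] = B[ix, iy] + s * C[ix, iy]^pow_float
    end
    return nothing
end

# Hypothetical helper: effective memory throughput in GB/s for element type T.
function throughput(T; n = 4096, threads = (32, 8))
    A = CUDA.zeros(T, n, n)
    B = CUDA.rand(T, n, n)
    C = CUDA.rand(T, n, n)
    s = T(2)            # assumed scalar, same type as the arrays
    pow_float = T(2.5)  # assumed exponent, same type as the arrays
    blocks = cld.((n, n), threads)

    # Warm-up launch so compilation time is not measured.
    CUDA.@sync @cuda threads=threads blocks=blocks triad2d_pow!(A, B, C, s, pow_float)

    t = CUDA.@elapsed begin
        @cuda threads=threads blocks=blocks triad2d_pow!(A, B, C, s, pow_float)
    end

    # Assumed traffic: read B, read C, write A.
    return 3 * sizeof(T) * n^2 / t / 1e9
end
```

Calling throughput(Float32) and throughput(Float64) on the same device gives the two numbers to compare. A single timed launch is noisy, so averaging over several launches would be more representative.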