Skip to content

Commit 3dc3c7b

Browse files
efaulhabersloede
andauthored
Use fast divisions in performance-critical code (#1128)
* Use fast divisions in performance-critical code * Optimize density diffusion * Reformat * Add docs * Add tests for div_fast * Add comment * Add comments * Reformat * Fix tests * Fix docs * Update docs/src/development.md Co-authored-by: Michael Schlottke-Lakemper <michael@sloede.com> * Reformat * Optimize the other viscosity models * Reformat * Fix typos --------- Co-authored-by: Michael Schlottke-Lakemper <michael@sloede.com>
1 parent 2277802 commit 3dc3c7b

File tree

11 files changed

+231
-45
lines changed

11 files changed

+231
-45
lines changed

Project.toml

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -37,14 +37,17 @@ WriteVTK = "64499a7a-5c06-52f2-abe2-ccb03c286192"
3737
[weakdeps]
3838
OrdinaryDiffEq = "1dea7af3-3e70-54e6-95c3-0bf5283fa5ed"
3939
OrdinaryDiffEqCore = "bbf590c4-e513-4bbe-9b18-05decba2e5d8"
40+
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
4041

4142
[extensions]
4243
TrixiParticlesOrdinaryDiffEqExt = ["OrdinaryDiffEq", "OrdinaryDiffEqCore"]
44+
TrixiParticlesCUDAExt = "CUDA"
4345

4446
[compat]
4547
Accessors = "0.1.43"
4648
Adapt = "4"
4749
CSV = "0.10"
50+
CUDA = "5.9.1"
4851
DataFrames = "1.6"
4952
DelimitedFiles = "1"
5053
DiffEqCallbacks = "4"

docs/src/development.md

Lines changed: 42 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -66,3 +66,45 @@ To create a new release for TrixiParticles.jl, perform the following steps:
6666
version should be `v0.3.1-dev`. If you just released `v0.2.4`, the new development
6767
version should be `v0.2.5-dev`.
6868

69+
## [Writing GPU-compatible code](@id writing_gpu_code)
70+
71+
When implementing new functionality that should run on both CPUs and GPUs,
72+
follow these rules:
73+
74+
1. Data structures must be generic and parametric.
75+
Do not hardcode concrete CPU array types like `Vector` or `Matrix` in fields.
76+
Use type parameters, so the same structure can store CPU arrays and GPU arrays.
77+
2. Add an Adapt.jl rule in `src/general/gpu.jl`.
78+
Register the new type with `Adapt.@adapt_structure ...`, so `adapt` can recursively
79+
convert all arrays inside the structure to GPU arrays.
80+
This conversion is then applied automatically inside `semidiscretize`.
81+
3. Use `@threaded` for all loops.
82+
Accessing GPU arrays inside regular loops is not allowed.
83+
With a GPU backend, `@threaded` loops are compiled to GPU kernels.
84+
4. Write type-stable code and do not allocate inside `@threaded` loops.
85+
This is required for GPU kernels and is also essential for fast multithreaded CPU code.
86+
87+
## [Writing fast GPU code](@id writing_fast_gpu_code)
88+
89+
The following rules improve kernel performance and avoid common GPU pitfalls:
90+
91+
1. Avoid exceptions and bounds errors inside kernels.
92+
Perform all required checks before entering `@threaded` loops (that is, before GPU kernels).
93+
Then use `@inbounds` directly at the loop where bounds are guaranteed.
94+
In TrixiParticles.jl, we do not place `@inbounds` inside inner helper functions.
95+
Instead, mark helper functions with `@propagate_inbounds` so the loop-level
96+
`@inbounds` is propagated.
97+
2. Avoid implicit `Float64` literals in arithmetic.
98+
For example, prefer `x / 2` over `0.5 * x` so `Float32` simulations stay in `Float32`.
99+
Verify this with `@device_code`, or by confirming the kernel runs on an Apple GPU
100+
(most Apple GPUs do not support `Float64`).
101+
3. Use `div_fast` in performance-critical divisions, but only after benchmarking (!).
102+
It can significantly speed up kernels, but should not be applied indiscriminately.
103+
When introducing `div_fast` in code, add a reference to [this section](@ref writing_fast_gpu_code)
104+
to document the rationale and benchmarking context, e.g., like so:
105+
```julia
106+
# Since this is one of the most performance critical functions, using fast divisions
107+
# here gives a significant speedup on GPUs.
108+
# See the docs page "Development" for more details on `div_fast`.
109+
result = div_fast(dividend, divisor)
110+
```

docs/src/gpu.md

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -137,3 +137,8 @@ On GPUs that do not support `Float64`, such as most Apple GPUs, we also need to
137137
the coordinates to `Float32` by passing `coordinates_eltype=Float32` to
138138
the setup functions that create [`InitialCondition`](@ref)s, such as
139139
[`RectangularTank`](@ref), [`RectangularShape`](@ref), and [`SphereShape`](@ref).
140+
141+
## Writing GPU-compatible code
142+
143+
Please see the [development documentation](@ref writing_gpu_code) for guidelines on
144+
how to write GPU-compatible code.

ext/TrixiParticlesCUDAExt.jl

Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,34 @@
1+
# TODO this might be integrated into CUDA.jl at some point, see
2+
# https://github.com/JuliaGPU/CUDA.jl/pull/3077
3+
module TrixiParticlesCUDAExt
4+
5+
using CUDA: CUDA
6+
using TrixiParticles: TrixiParticles
7+
8+
# Use faster version of `div_fast` for `Float64` on CUDA.
9+
# By default, `div_fast` translates to `Base.FastMath.div_fast`, but there is
10+
# no fast division for `Float64` on CUDA, so we need to redefine it here to use the
11+
# improved fast reciprocal defined below.
12+
CUDA.@device_override TrixiParticles.div_fast(x, y::Float64) = x * fast_inv_cuda(y)
13+
14+
# Improved fast reciprocal for `Float64` by @Mikolaj-A-Kowalski, which is significantly
15+
# more accurate than just calling "llvm.nvvm.rcp.approx.ftz.d" without the cubic iteration,
16+
# while still being much faster than a full division.
17+
# This is copied from Oceananigans.jl, see https://github.com/CliMA/Oceananigans.jl/pull/5140.
18+
@inline function fast_inv_cuda(a::Float64)
19+
# Get the approximate reciprocal
20+
# https://docs.nvidia.com/cuda/parallel-thread-execution/#floating-point-instructions-rcp-approx-ftz-f64
21+
# This instruction chops off last 32bits of mantissa and computes inverse
22+
# while treating all subnormal numbers as 0.0
23+
# If reciprocal would be subnormal, underflows to 0.0
24+
# 32 least significant bits of the result are filled with 0s
25+
inv_a = ccall("llvm.nvvm.rcp.approx.ftz.d", llvmcall, Float64, (Float64,), a)
26+
27+
# Approximate the missing 32bits of mantissa with a single cubic iteration
28+
e = fma(inv_a, -a, 1.0)
29+
e = fma(e, e, e)
30+
inv_a = fma(e, inv_a, inv_a)
31+
return inv_a
32+
end
33+
34+
end # module

src/schemes/fluid/fluid.jl

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -150,7 +150,10 @@ end
150150
particle_system, neighbor_system,
151151
particle, neighbor, rho_a, rho_b)
152152

153-
dv[end, particle] += rho_a / rho_b * m_b * dot(vdiff, grad_kernel)
153+
# Since this is one of the most performance critical functions, using fast divisions
154+
# here gives a significant speedup on GPUs.
155+
# See the docs page "Development" for more details on `div_fast`.
156+
dv[end, particle] += div_fast(rho_a, rho_b) * m_b * dot(vdiff, grad_kernel)
154157

155158
# Artificial density diffusion should only be applied to systems representing a fluid
156159
# with the same physical properties i.e. density and viscosity.

src/schemes/fluid/pressure_acceleration.jl

Lines changed: 16 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -7,14 +7,20 @@
77
# asymmetric version.
88
@inline function pressure_acceleration_summation_density(m_a, m_b, rho_a, rho_b, p_a, p_b,
99
W_a)
10-
return -m_b * (p_a / rho_a^2 + p_b / rho_b^2) * W_a
10+
# Since this is one of the most performance critical functions, using fast divisions
11+
# here gives a significant speedup on GPUs.
12+
# See the docs page "Development" for more details on `div_fast`.
13+
return -m_b * (div_fast(p_a, rho_a^2) + div_fast(p_b, rho_b^2)) * W_a
1114
end
1215

1316
# Same as above, but not assuming symmetry of the kernel gradient. To be used with
1417
# corrections that do not produce a symmetric kernel gradient.
1518
@inline function pressure_acceleration_summation_density(m_a, m_b, rho_a, rho_b, p_a, p_b,
1619
W_a, W_b)
17-
return -m_b * (p_a / rho_a^2 * W_a - p_b / rho_b^2 * W_b)
20+
# Since this is one of the most performance critical functions, using fast divisions
21+
# here gives a significant speedup on GPUs.
22+
# See the docs page "Development" for more details on `div_fast`.
23+
return -m_b * (div_fast(p_a, rho_a^2) * W_a - div_fast(p_b, rho_b^2) * W_b)
1824
end
1925

2026
# As shown in "Variational and momentum preservation aspects of Smooth Particle Hydrodynamic
@@ -26,14 +32,20 @@ end
2632
# asymmetric version.
2733
@inline function pressure_acceleration_continuity_density(m_a, m_b, rho_a, rho_b, p_a, p_b,
2834
W_a)
29-
return -m_b * (p_a + p_b) / (rho_a * rho_b) * W_a
35+
# Since this is one of the most performance critical functions, using fast divisions
36+
# here gives a significant speedup on GPUs.
37+
# See the docs page "Development" for more details on `div_fast`.
38+
return -m_b * div_fast(p_a + p_b, rho_a * rho_b) * W_a
3039
end
3140

3241
# Same as above, but not assuming symmetry of the kernel gradient. To be used with
3342
# corrections that do not produce a symmetric kernel gradient.
3443
@inline function pressure_acceleration_continuity_density(m_a, m_b, rho_a, rho_b, p_a, p_b,
3544
W_a, W_b)
36-
return -m_b / (rho_a * rho_b) * (p_a * W_a - p_b * W_b)
45+
# Since this is one of the most performance critical functions, using fast divisions
46+
# here gives a significant speedup on GPUs.
47+
# See the docs page "Development" for more details on `div_fast`.
48+
return -div_fast(m_b, rho_a * rho_b) * (p_a * W_a - p_b * W_b)
3749
end
3850

3951
@doc raw"""

src/schemes/fluid/viscosity.jl

Lines changed: 40 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -127,8 +127,11 @@ end
127127
# approaching particles and turn it off for receding particles. In this way, the
128128
# viscosity is used for shocks and not rarefactions."
129129
if vr < 0
130-
mu = h * vr / (distance^2 + epsilon * h^2)
131-
return (alpha * c * mu + beta * mu^2) / rho_mean * grad_kernel
130+
# Since this is one of the most performance critical functions, using fast divisions
131+
# here gives a significant speedup on GPUs.
132+
# See the docs page "Development" for more details on `div_fast`.
133+
mu = div_fast(h * vr, distance^2 + epsilon * h^2)
134+
return div_fast(alpha * c * mu + beta * mu^2, rho_mean) * grad_kernel
132135
end
133136

134137
return zero(v_diff)
@@ -142,8 +145,11 @@ end
142145
mu_a = nu_a * rho_a
143146
mu_b = nu_b * rho_b
144147

145-
return (mu_a + mu_b) / (rho_a * rho_b) * dot(pos_diff, grad_kernel) /
146-
(distance^2 + epsilon * h^2) * v_diff
148+
# Since this is one of the most performance critical functions, using fast divisions
149+
# here gives a significant speedup on GPUs.
150+
# See the docs page "Development" for more details on `div_fast`.
151+
return div_fast((mu_a + mu_b) * dot(pos_diff, grad_kernel),
152+
rho_a * rho_b * (distance^2 + epsilon * h^2)) * v_diff
147153
end
148154

149155
# See, e.g.,
@@ -177,17 +183,20 @@ struct ViscosityAdami{ELTYPE}
177183
end
178184
end
179185

180-
function adami_viscosity_force(smoothing_length_average, pos_diff, distance, grad_kernel,
186+
function adami_viscosity_force(h, pos_diff, distance, grad_kernel,
181187
m_a, m_b, rho_a, rho_b, v_diff, nu_a, nu_b, epsilon)
182188
eta_a = nu_a * rho_a
183189
eta_b = nu_b * rho_b
184190

185-
eta_tilde = 2 * (eta_a * eta_b) / (eta_a + eta_b)
191+
# Since this is one of the most performance critical functions, using fast divisions
192+
# here gives a significant speedup on GPUs.
193+
# See the docs page "Development" for more details on `div_fast`.
194+
volume_a = div_fast(m_a, rho_a)
195+
volume_b = div_fast(m_b, rho_b)
186196

187-
tmp = eta_tilde / (distance^2 + epsilon * smoothing_length_average^2)
188-
189-
volume_a = m_a / rho_a
190-
volume_b = m_b / rho_b
197+
# eta_tilde = 2 * (eta_a * eta_b) / (eta_a + eta_b)
198+
# tmp = eta_tilde / (distance^2 + epsilon * h^2) / m_a
199+
tmp = div_fast(2 * eta_a * eta_b, (eta_a + eta_b) * (distance^2 + epsilon * h^2) * m_a)
191200

192201
# This formulation was introduced by Hu and Adams (2006). https://doi.org/10.1016/j.jcp.2005.09.001
193202
# They argued that the formulation is more flexible because of the possibility to formulate
@@ -198,9 +207,9 @@ function adami_viscosity_force(smoothing_length_average, pos_diff, distance, gra
198207
# Because when using this formulation for the pressure acceleration, it is not
199208
# energy conserving.
200209
# See issue: https://github.com/trixi-framework/TrixiParticles.jl/issues/394
201-
visc = (volume_a^2 + volume_b^2) * dot(grad_kernel, pos_diff) * tmp / m_a
210+
visc = (volume_a^2 + volume_b^2) * dot(grad_kernel, pos_diff) * tmp
202211

203-
return visc .* v_diff
212+
return visc * v_diff
204213
end
205214

206215
@inline function (viscosity::ViscosityAdami)(particle_system, neighbor_system,
@@ -334,7 +343,10 @@ ViscosityAdamiSGS(; nu, C_S=0.1, epsilon=0.001) = ViscosityAdamiSGS(nu, C_S, eps
334343
# and then the Smagorinsky eddy viscosity:
335344
# ν_SGS = (C_S * h̄)^2 * S_mag.
336345
#
337-
S_mag = norm(v_diff) / (distance + epsilon)
346+
# Since this is one of the most performance critical functions, using fast divisions
347+
# here gives a significant speedup on GPUs.
348+
# See the docs page "Development" for more details on `div_fast`.
349+
S_mag = div_fast(sqrt(dot(v_diff, v_diff)), (distance + epsilon))
338350
nu_SGS = (viscosity.C_S * smoothing_length_average)^2 * S_mag
339351

340352
# Effective kinematic viscosity is the sum of the standard and SGS parts.
@@ -412,7 +424,7 @@ ViscosityMorrisSGS(; nu, C_S=0.1, epsilon=0.001) = ViscosityMorrisSGS(nu, C_S, e
412424

413425
smoothing_length_particle = smoothing_length(particle_system, particle)
414426
smoothing_length_neighbor = smoothing_length(particle_system, neighbor)
415-
smoothing_length_average = (smoothing_length_particle + smoothing_length_neighbor) / 2
427+
h = (smoothing_length_particle + smoothing_length_neighbor) / 2
416428

417429
nu_a = kinematic_viscosity(particle_system,
418430
viscosity_model(neighbor_system, particle_system),
@@ -427,8 +439,11 @@ ViscosityMorrisSGS(; nu, C_S=0.1, epsilon=0.001) = ViscosityMorrisSGS(nu, C_S, e
427439

428440
# SGS part: Compute the subgrid-scale eddy viscosity.
429441
# See comments above for `ViscosityAdamiSGS`.
430-
S_mag = norm(v_diff) / (distance + epsilon)
431-
nu_SGS = (viscosity.C_S * smoothing_length_average)^2 * S_mag
442+
# Since this is one of the most performance critical functions, using fast divisions
443+
# here gives a significant speedup on GPUs.
444+
# See the docs page "Development" for more details on `div_fast`.
445+
S_mag = div_fast(sqrt(dot(v_diff, v_diff)), (distance + epsilon))
446+
nu_SGS = (viscosity.C_S * h)^2 * S_mag
432447

433448
# Effective viscosities include the SGS term.
434449
nu_a_eff = nu_a + nu_SGS
@@ -438,9 +453,11 @@ ViscosityMorrisSGS(; nu, C_S=0.1, epsilon=0.001) = ViscosityMorrisSGS(nu, C_S, e
438453
mu_a = nu_a_eff * rho_a
439454
mu_b = nu_b_eff * rho_b
440455

441-
force_Morris = (mu_a + mu_b) / (rho_a * rho_b) * (dot(pos_diff, grad_kernel)) /
442-
(distance^2 + epsilon * smoothing_length_average^2) * v_diff
443-
return m_b * force_Morris
456+
# Since this is one of the most performance critical functions, using fast divisions
457+
# here gives a significant speedup on GPUs.
458+
# See the docs page "Development" for more details on `div_fast`.
459+
return div_fast(m_b * (mu_a + mu_b) * dot(pos_diff, grad_kernel),
460+
rho_a * rho_b * (distance^2 + epsilon * h^2)) * v_diff
444461
end
445462

446463
function kinematic_viscosity(system, viscosity::ViscosityMorrisSGS, smoothing_length,
@@ -496,7 +513,10 @@ end
496513
v_b = viscous_velocity(v_neighbor_system, neighbor_system, neighbor)
497514
v_diff = v_a - v_b
498515

499-
gamma_dot = norm(v_diff) / (distance + epsilon)
516+
# Since this is one of the most performance critical functions, using fast divisions
517+
# here gives a significant speedup on GPUs.
518+
# See the docs page "Development" for more details on `div_fast`.
519+
gamma_dot = div_fast(sqrt(dot(v_diff, v_diff)), (distance + epsilon))
500520

501521
# Compute Carreau-Yasuda effective viscosity
502522
(; nu0, nu_inf, lambda, a, n) = viscosity

src/schemes/fluid/weakly_compressible_sph/density_diffusion.jl

Lines changed: 28 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -48,7 +48,10 @@ end
4848

4949
@inline function density_diffusion_psi(::DensityDiffusionMolteniColagrossi, rho_a, rho_b,
5050
pos_diff, distance, system, particle, neighbor)
51-
return 2 * (rho_a - rho_b) * pos_diff / distance^2
51+
# Since this is one of the most performance critical functions, using fast divisions
52+
# here gives a significant speedup on GPUs.
53+
# See the docs page "Development" for more details on `div_fast`.
54+
return div_fast(2 * (rho_a - rho_b), distance^2) * pos_diff
5255
end
5356

5457
@doc raw"""
@@ -77,9 +80,11 @@ end
7780

7881
@inline function density_diffusion_psi(::DensityDiffusionFerrari, rho_a, rho_b,
7982
pos_diff, distance, system, particle, neighbor)
80-
return ((rho_a - rho_b) /
81-
(smoothing_length(system, particle) + smoothing_length(system, neighbor))) *
82-
pos_diff / distance
83+
# Since this is one of the most performance critical functions, using fast divisions
84+
# here gives a significant speedup on GPUs.
85+
# See the docs page "Development" for more details on `div_fast`.
86+
h = smoothing_length(system, particle) + smoothing_length(system, neighbor)
87+
return div_fast((rho_a - rho_b), h * distance) * pos_diff
8388
end
8489

8590
@doc raw"""
@@ -154,21 +159,23 @@ function allocate_buffer(ic, dd::DensityDiffusionAntuono, buffer::SystemBuffer)
154159
return initial_condition, DensityDiffusionAntuono(initial_condition; delta=dd.delta)
155160
end
156161

157-
@inline function density_diffusion_psi(density_diffusion::DensityDiffusionAntuono,
158-
rho_a, rho_b, pos_diff, distance, system,
159-
particle, neighbor)
162+
@propagate_inbounds function density_diffusion_psi(density_diffusion::DensityDiffusionAntuono,
163+
rho_a, rho_b, pos_diff, distance, system,
164+
particle, neighbor)
160165
(; normalized_density_gradient) = density_diffusion
161166

162-
normalized_gradient_a = extract_svector(normalized_density_gradient, system, particle)
163-
normalized_gradient_b = extract_svector(normalized_density_gradient, system, neighbor)
164-
165-
# Fist term by Molteni & Colagrossi
167+
# First term by Molteni & Colagrossi
166168
result = 2 * (rho_a - rho_b)
167169

168170
# Second correction term
171+
normalized_gradient_a = extract_svector(normalized_density_gradient, system, particle)
172+
normalized_gradient_b = extract_svector(normalized_density_gradient, system, neighbor)
169173
result -= dot(normalized_gradient_a + normalized_gradient_b, pos_diff)
170174

171-
return result * pos_diff / distance^2
175+
# Since this is one of the most performance critical functions, using fast divisions
176+
# here gives a significant speedup on GPUs.
177+
# See the docs page "Development" for more details on `div_fast`.
178+
return div_fast(result, distance^2) * pos_diff
172179
end
173180

174181
function update!(density_diffusion::DensityDiffusionAntuono, v, u, system, semi)
@@ -217,19 +224,20 @@ end
217224
# See `src/general/smoothing_kernels.jl` for more details.
218225
distance^2 < eps(initial_smoothing_length(particle_system)^2) && return
219226

220-
(; delta) = density_diffusion
221-
sound_speed = system_sound_speed(particle_system)
222-
223-
volume_b = m_b / rho_b
224-
227+
# Since this is one of the most performance critical functions, using fast divisions
228+
# here gives a significant speedup on GPUs.
229+
# See the docs page "Development" for more details on `div_fast`.
230+
volume_b = div_fast(m_b, rho_b)
225231
psi = density_diffusion_psi(density_diffusion, rho_a, rho_b, pos_diff, distance,
226232
particle_system, particle, neighbor)
227-
density_diffusion_term = dot(psi, grad_kernel) * volume_b
233+
density_diffusion_term = volume_b * dot(psi, grad_kernel)
228234

229235
smoothing_length_avg = (smoothing_length(particle_system, particle) +
230236
smoothing_length(particle_system, neighbor)) / 2
231-
dv[end, particle] += delta * smoothing_length_avg * sound_speed *
232-
density_diffusion_term
237+
238+
(; delta) = density_diffusion
239+
sound_speed = system_sound_speed(particle_system)
240+
dv[end, particle] += delta * smoothing_length_avg * sound_speed * density_diffusion_term
233241
end
234242

235243
# Density diffusion `nothing` or interaction other than fluid-fluid

src/util.jl

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,9 @@
1+
# By default, use `div_fast` of `Base.FastMath`.
2+
# In the `TrixiParticlesCUDAExt` extension, this is redefined for `Float64`.
3+
@inline function div_fast(x, y)
4+
return Base.FastMath.div_fast(x, y)
5+
end
6+
17
# Same as `foreach`, but it is unrolled by the compiler for small input tuples
28
@inline function foreach_noalloc(func, collection)
39
element = first(collection)

0 commit comments

Comments
 (0)