Use fast divisions in performance-critical code (#1128)

efaulhaber · sloede · web-flow · commit 3dc3c7b5c644 · 2026-04-10T15:57:45.000+02:00
* Use fast divisions in performance-critical code

* Optimize density diffusion

* Reformat

* Add docs

* Add tests for div_fast

* Add comment

* Add comments

* Reformat

* Fix tests

* Fix docs

* Update docs/src/development.md

Co-authored-by: Michael Schlottke-Lakemper &lt;michael@sloede.com&gt;

* Reformat

* Optimize the other viscosity models

* Reformat

* Fix typos

---------

Co-authored-by: Michael Schlottke-Lakemper &lt;michael@sloede.com&gt;
diff --git a/Project.toml b/Project.toml
@@ -37,14 +37,17 @@ WriteVTK = "64499a7a-5c06-52f2-abe2-ccb03c286192"
 [weakdeps]
 OrdinaryDiffEq = "1dea7af3-3e70-54e6-95c3-0bf5283fa5ed"
 OrdinaryDiffEqCore = "bbf590c4-e513-4bbe-9b18-05decba2e5d8"
+CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
 
 [extensions]
 TrixiParticlesOrdinaryDiffEqExt = ["OrdinaryDiffEq", "OrdinaryDiffEqCore"]
+TrixiParticlesCUDAExt = "CUDA"
 
 [compat]
 Accessors = "0.1.43"
 Adapt = "4"
 CSV = "0.10"
+CUDA = "5.9.1"
 DataFrames = "1.6"
 DelimitedFiles = "1"
 DiffEqCallbacks = "4"
diff --git a/docs/src/development.md b/docs/src/development.md
@@ -66,3 +66,45 @@ To create a new release for TrixiParticles.jl, perform the following steps:
    version should be `v0.3.1-dev`. If you just released `v0.2.4`, the new development
    version should be `v0.2.5-dev`.
 
+## [Writing GPU-compatible code](@id writing_gpu_code)
+
+When implementing new functionality that should run on both CPUs and GPUs,
+follow these rules:
+
+1. Data structures must be generic and parametric.
+   Do not hardcode concrete CPU array types like `Vector` or `Matrix` in fields.
+   Use type parameters, so the same structure can store CPU arrays and GPU arrays.
+2. Add an Adapt.jl rule in `src/general/gpu.jl`.
+   Register the new type with `Adapt.@adapt_structure ...`, so `adapt` can recursively
+   convert all arrays inside the structure to GPU arrays.
+   This conversion is then applied automatically inside `semidiscretize`.
+3. Use `@threaded` for all loops.
+   Accessing GPU arrays inside regular loops is not allowed.
+   With a GPU backend, `@threaded` loops are compiled to GPU kernels.
+4. Write type-stable code and do not allocate inside `@threaded` loops.
+   This is required for GPU kernels and is also essential for fast multithreaded CPU code.
+
+## [Writing fast GPU code](@id writing_fast_gpu_code)
+
+The following rules improve kernel performance and avoid common GPU pitfalls:
+
+1. Avoid exceptions and bounds errors inside kernels.
+   Perform all required checks before entering `@threaded` loops (that is, before GPU kernels).
+   Then use `@inbounds` directly at the loop where bounds are guaranteed.
+   In TrixiParticles.jl, we do not place `@inbounds` inside inner helper functions.
+   Instead, mark helper functions with `@propagate_inbounds` so the loop-level
+   `@inbounds` is propagated.
+2. Avoid implicit `Float64` literals in arithmetic.
+   For example, prefer `x / 2` over `0.5 * x` so `Float32` simulations stay in `Float32`.
+   Verify this with `@device_code`, or by confirming the kernel runs on an Apple GPU
+   (most Apple GPUs do not support `Float64`).
+3. Use `div_fast` in performance-critical divisions, but only after benchmarking (!).
+   It can significantly speed up kernels, but should not be applied indiscriminately.
+   When introducing `div_fast` in code, add a reference to [this section](@ref writing_fast_gpu_code)
+   to document the rationale and benchmarking context, e.g., like so:
+   ```julia
+   # Since this is one of the most performance critical functions, using fast divisions
+   # here gives a significant speedup on GPUs.
+   # See the docs page "Development" for more details on `div_fast`.
+   result = div_fast(dividend, divisor)
+   ```
diff --git a/docs/src/gpu.md b/docs/src/gpu.md
@@ -137,3 +137,8 @@ On GPUs that do not support `Float64`, such as most Apple GPUs, we also need to
 the coordinates to `Float32` by passing `coordinates_eltype=Float32` to
 the setup functions that create [`InitialCondition`](@ref)s, such as
 [`RectangularTank`](@ref), [`RectangularShape`](@ref), and [`SphereShape`](@ref).
+
+## Writing GPU-compatible code
+
+Please see the [development documentation](@ref writing_gpu_code) for guidelines on
+how to write GPU-compatible code.
diff --git a/ext/TrixiParticlesCUDAExt.jl b/ext/TrixiParticlesCUDAExt.jl
@@ -0,0 +1,34 @@
+# TODO this might be integrated into CUDA.jl at some point, see
+# https://github.com/JuliaGPU/CUDA.jl/pull/3077
+module TrixiParticlesCUDAExt
+
+using CUDA: CUDA
+using TrixiParticles: TrixiParticles
+
+# Use faster version of `div_fast` for `Float64` on CUDA.
+# By default, `div_fast` translates to `Base.FastMath.div_fast`, but there is
+# no fast division for `Float64` on CUDA, so we need to redefine it here to use the
+# improved fast reciprocal defined below.
+CUDA.@device_override TrixiParticles.div_fast(x, y::Float64) = x * fast_inv_cuda(y)
+
+# Improved fast reciprocal for `Float64` by @Mikolaj-A-Kowalski, which is significantly
+# more accurate than just calling "llvm.nvvm.rcp.approx.ftz.d" without the cubic iteration,
+# while still being much faster than a full division.
+# This is copied from Oceananigans.jl, see https://github.com/CliMA/Oceananigans.jl/pull/5140.
+@inline function fast_inv_cuda(a::Float64)
+    # Get the approximate reciprocal
+    # https://docs.nvidia.com/cuda/parallel-thread-execution/#floating-point-instructions-rcp-approx-ftz-f64
+    # This instruction chops off last 32bits of mantissa and computes inverse
+    # while treating all subnormal numbers as 0.0
+    # If reciprocal would be subnormal, underflows to 0.0
+    # 32 least significant bits of the result are filled with 0s
+    inv_a = ccall("llvm.nvvm.rcp.approx.ftz.d", llvmcall, Float64, (Float64,), a)
+
+    # Approximate the missing 32bits of mantissa with a single cubic iteration
+    e = fma(inv_a, -a, 1.0)
+    e = fma(e, e, e)
+    inv_a = fma(e, inv_a, inv_a)
+    return inv_a
+end
+
+end # module
diff --git a/src/schemes/fluid/fluid.jl b/src/schemes/fluid/fluid.jl
@@ -150,7 +150,10 @@ end
                                                particle_system, neighbor_system,
                                                particle, neighbor, rho_a, rho_b)
 
-    dv[end, particle] += rho_a / rho_b * m_b * dot(vdiff, grad_kernel)
+    # Since this is one of the most performance critical functions, using fast divisions
+    # here gives a significant speedup on GPUs.
+    # See the docs page "Development" for more details on `div_fast`.
+    dv[end, particle] += div_fast(rho_a, rho_b) * m_b * dot(vdiff, grad_kernel)
 
     # Artificial density diffusion should only be applied to systems representing a fluid
     # with the same physical properties i.e. density and viscosity.
diff --git a/src/schemes/fluid/pressure_acceleration.jl b/src/schemes/fluid/pressure_acceleration.jl
@@ -7,14 +7,20 @@
 # asymmetric version.
 @inline function pressure_acceleration_summation_density(m_a, m_b, rho_a, rho_b, p_a, p_b,
                                                          W_a)
-    return -m_b * (p_a / rho_a^2 + p_b / rho_b^2) * W_a
+    # Since this is one of the most performance critical functions, using fast divisions
+    # here gives a significant speedup on GPUs.
+    # See the docs page "Development" for more details on `div_fast`.
+    return -m_b * (div_fast(p_a, rho_a^2) + div_fast(p_b, rho_b^2)) * W_a
 end
 
 # Same as above, but not assuming symmetry of the kernel gradient. To be used with
 # corrections that do not produce a symmetric kernel gradient.
 @inline function pressure_acceleration_summation_density(m_a, m_b, rho_a, rho_b, p_a, p_b,
                                                          W_a, W_b)
-    return -m_b * (p_a / rho_a^2 * W_a - p_b / rho_b^2 * W_b)
+    # Since this is one of the most performance critical functions, using fast divisions
+    # here gives a significant speedup on GPUs.
+    # See the docs page "Development" for more details on `div_fast`.
+    return -m_b * (div_fast(p_a, rho_a^2) * W_a - div_fast(p_b, rho_b^2) * W_b)
 end
 
 # As shown in "Variational and momentum preservation aspects of Smooth Particle Hydrodynamic
@@ -26,14 +32,20 @@ end
 # asymmetric version.
 @inline function pressure_acceleration_continuity_density(m_a, m_b, rho_a, rho_b, p_a, p_b,
                                                           W_a)
-    return -m_b * (p_a + p_b) / (rho_a * rho_b) * W_a
+    # Since this is one of the most performance critical functions, using fast divisions
+    # here gives a significant speedup on GPUs.
+    # See the docs page "Development" for more details on `div_fast`.
+    return -m_b * div_fast(p_a + p_b, rho_a * rho_b) * W_a
 end
 
 # Same as above, but not assuming symmetry of the kernel gradient. To be used with
 # corrections that do not produce a symmetric kernel gradient.
 @inline function pressure_acceleration_continuity_density(m_a, m_b, rho_a, rho_b, p_a, p_b,
                                                           W_a, W_b)
-    return -m_b / (rho_a * rho_b) * (p_a * W_a - p_b * W_b)
+    # Since this is one of the most performance critical functions, using fast divisions
+    # here gives a significant speedup on GPUs.
+    # See the docs page "Development" for more details on `div_fast`.
+    return -div_fast(m_b, rho_a * rho_b) * (p_a * W_a - p_b * W_b)
 end
 
 @doc raw"""
diff --git a/src/schemes/fluid/viscosity.jl b/src/schemes/fluid/viscosity.jl
@@ -127,8 +127,11 @@ end
     # approaching particles and turn it off for receding particles. In this way, the
     # viscosity is used for shocks and not rarefactions."
     if vr < 0
-        mu = h * vr / (distance^2 + epsilon * h^2)
-        return (alpha * c * mu + beta * mu^2) / rho_mean * grad_kernel
+        # Since this is one of the most performance critical functions, using fast divisions
+        # here gives a significant speedup on GPUs.
+        # See the docs page "Development" for more details on `div_fast`.
+        mu = div_fast(h * vr, distance^2 + epsilon * h^2)
+        return div_fast(alpha * c * mu + beta * mu^2, rho_mean) * grad_kernel
     end
 
     return zero(v_diff)
@@ -142,8 +145,11 @@ end
     mu_a = nu_a * rho_a
     mu_b = nu_b * rho_b
 
-    return (mu_a + mu_b) / (rho_a * rho_b) * dot(pos_diff, grad_kernel) /
-           (distance^2 + epsilon * h^2) * v_diff
+    # Since this is one of the most performance critical functions, using fast divisions
+    # here gives a significant speedup on GPUs.
+    # See the docs page "Development" for more details on `div_fast`.
+    return div_fast((mu_a + mu_b) * dot(pos_diff, grad_kernel),
+                    rho_a * rho_b * (distance^2 + epsilon * h^2)) * v_diff
 end
 
 # See, e.g.,
@@ -177,17 +183,20 @@ struct ViscosityAdami{ELTYPE}
     end
 end
 
-function adami_viscosity_force(smoothing_length_average, pos_diff, distance, grad_kernel,
+function adami_viscosity_force(h, pos_diff, distance, grad_kernel,
                                m_a, m_b, rho_a, rho_b, v_diff, nu_a, nu_b, epsilon)
     eta_a = nu_a * rho_a
     eta_b = nu_b * rho_b
 
-    eta_tilde = 2 * (eta_a * eta_b) / (eta_a + eta_b)
+    # Since this is one of the most performance critical functions, using fast divisions
+    # here gives a significant speedup on GPUs.
+    # See the docs page "Development" for more details on `div_fast`.
+    volume_a = div_fast(m_a, rho_a)
+    volume_b = div_fast(m_b, rho_b)
 
-    tmp = eta_tilde / (distance^2 + epsilon * smoothing_length_average^2)
-
-    volume_a = m_a / rho_a
-    volume_b = m_b / rho_b
+    # eta_tilde = 2 * (eta_a * eta_b) / (eta_a + eta_b)
+    # tmp = eta_tilde / (distance^2 + epsilon * h^2) / m_a
+    tmp = div_fast(2 * eta_a * eta_b, (eta_a + eta_b) * (distance^2 + epsilon * h^2) * m_a)
 
     # This formulation was introduced by Hu and Adams (2006). https://doi.org/10.1016/j.jcp.2005.09.001
     # They argued that the formulation is more flexible because of the possibility to formulate
@@ -198,9 +207,9 @@ function adami_viscosity_force(smoothing_length_average, pos_diff, distance, gra
     # Because when using this formulation for the pressure acceleration, it is not
     # energy conserving.
     # See issue: https://github.com/trixi-framework/TrixiParticles.jl/issues/394
-    visc = (volume_a^2 + volume_b^2) * dot(grad_kernel, pos_diff) * tmp / m_a
+    visc = (volume_a^2 + volume_b^2) * dot(grad_kernel, pos_diff) * tmp
 
-    return visc .* v_diff
+    return visc * v_diff
 end
 
 @inline function (viscosity::ViscosityAdami)(particle_system, neighbor_system,
@@ -334,7 +343,10 @@ ViscosityAdamiSGS(; nu, C_S=0.1, epsilon=0.001) = ViscosityAdamiSGS(nu, C_S, eps
     # and then the Smagorinsky eddy viscosity:
     #   ν_SGS = (C_S * h̄)^2 * S_mag.
     #
-    S_mag = norm(v_diff) / (distance + epsilon)
+    # Since this is one of the most performance critical functions, using fast divisions
+    # here gives a significant speedup on GPUs.
+    # See the docs page "Development" for more details on `div_fast`.
+    S_mag = div_fast(sqrt(dot(v_diff, v_diff)), (distance + epsilon))
     nu_SGS = (viscosity.C_S * smoothing_length_average)^2 * S_mag
 
     # Effective kinematic viscosity is the sum of the standard and SGS parts.
@@ -412,7 +424,7 @@ ViscosityMorrisSGS(; nu, C_S=0.1, epsilon=0.001) = ViscosityMorrisSGS(nu, C_S, e
 
     smoothing_length_particle = smoothing_length(particle_system, particle)
     smoothing_length_neighbor = smoothing_length(particle_system, neighbor)
-    smoothing_length_average = (smoothing_length_particle + smoothing_length_neighbor) / 2
+    h = (smoothing_length_particle + smoothing_length_neighbor) / 2
 
     nu_a = kinematic_viscosity(particle_system,
                                viscosity_model(neighbor_system, particle_system),
@@ -427,8 +439,11 @@ ViscosityMorrisSGS(; nu, C_S=0.1, epsilon=0.001) = ViscosityMorrisSGS(nu, C_S, e
 
     # SGS part: Compute the subgrid-scale eddy viscosity.
     # See comments above for `ViscosityAdamiSGS`.
-    S_mag = norm(v_diff) / (distance + epsilon)
-    nu_SGS = (viscosity.C_S * smoothing_length_average)^2 * S_mag
+    # Since this is one of the most performance critical functions, using fast divisions
+    # here gives a significant speedup on GPUs.
+    # See the docs page "Development" for more details on `div_fast`.
+    S_mag = div_fast(sqrt(dot(v_diff, v_diff)), (distance + epsilon))
+    nu_SGS = (viscosity.C_S * h)^2 * S_mag
 
     # Effective viscosities include the SGS term.
     nu_a_eff = nu_a + nu_SGS
@@ -438,9 +453,11 @@ ViscosityMorrisSGS(; nu, C_S=0.1, epsilon=0.001) = ViscosityMorrisSGS(nu, C_S, e
     mu_a = nu_a_eff * rho_a
     mu_b = nu_b_eff * rho_b
 
-    force_Morris = (mu_a + mu_b) / (rho_a * rho_b) * (dot(pos_diff, grad_kernel)) /
-                   (distance^2 + epsilon * smoothing_length_average^2) * v_diff
-    return m_b * force_Morris
+    # Since this is one of the most performance critical functions, using fast divisions
+    # here gives a significant speedup on GPUs.
+    # See the docs page "Development" for more details on `div_fast`.
+    return div_fast(m_b * (mu_a + mu_b) * dot(pos_diff, grad_kernel),
+                    rho_a * rho_b * (distance^2 + epsilon * h^2)) * v_diff
 end
 
 function kinematic_viscosity(system, viscosity::ViscosityMorrisSGS, smoothing_length,
@@ -496,7 +513,10 @@ end
     v_b = viscous_velocity(v_neighbor_system, neighbor_system, neighbor)
     v_diff = v_a - v_b
 
-    gamma_dot = norm(v_diff) / (distance + epsilon)
+    # Since this is one of the most performance critical functions, using fast divisions
+    # here gives a significant speedup on GPUs.
+    # See the docs page "Development" for more details on `div_fast`.
+    gamma_dot = div_fast(sqrt(dot(v_diff, v_diff)), (distance + epsilon))
 
     # Compute Carreau-Yasuda effective viscosity
     (; nu0, nu_inf, lambda, a, n) = viscosity
diff --git a/src/schemes/fluid/weakly_compressible_sph/density_diffusion.jl b/src/schemes/fluid/weakly_compressible_sph/density_diffusion.jl
@@ -48,7 +48,10 @@ end
 
 @inline function density_diffusion_psi(::DensityDiffusionMolteniColagrossi, rho_a, rho_b,
                                        pos_diff, distance, system, particle, neighbor)
-    return 2 * (rho_a - rho_b) * pos_diff / distance^2
+    # Since this is one of the most performance critical functions, using fast divisions
+    # here gives a significant speedup on GPUs.
+    # See the docs page "Development" for more details on `div_fast`.
+    return div_fast(2 * (rho_a - rho_b), distance^2) * pos_diff
 end
 
 @doc raw"""
@@ -77,9 +80,11 @@ end
 
 @inline function density_diffusion_psi(::DensityDiffusionFerrari, rho_a, rho_b,
                                        pos_diff, distance, system, particle, neighbor)
-    return ((rho_a - rho_b) /
-            (smoothing_length(system, particle) + smoothing_length(system, neighbor))) *
-           pos_diff / distance
+    # Since this is one of the most performance critical functions, using fast divisions
+    # here gives a significant speedup on GPUs.
+    # See the docs page "Development" for more details on `div_fast`.
+    h = smoothing_length(system, particle) + smoothing_length(system, neighbor)
+    return div_fast((rho_a - rho_b), h * distance) * pos_diff
 end
 
 @doc raw"""
@@ -154,21 +159,23 @@ function allocate_buffer(ic, dd::DensityDiffusionAntuono, buffer::SystemBuffer)
     return initial_condition, DensityDiffusionAntuono(initial_condition; delta=dd.delta)
 end
 
-@inline function density_diffusion_psi(density_diffusion::DensityDiffusionAntuono,
-                                       rho_a, rho_b, pos_diff, distance, system,
-                                       particle, neighbor)
+@propagate_inbounds function density_diffusion_psi(density_diffusion::DensityDiffusionAntuono,
+                                                   rho_a, rho_b, pos_diff, distance, system,
+                                                   particle, neighbor)
     (; normalized_density_gradient) = density_diffusion
 
-    normalized_gradient_a = extract_svector(normalized_density_gradient, system, particle)
-    normalized_gradient_b = extract_svector(normalized_density_gradient, system, neighbor)
-
-    # Fist term by Molteni & Colagrossi
+    # First term by Molteni & Colagrossi
     result = 2 * (rho_a - rho_b)
 
     # Second correction term
+    normalized_gradient_a = extract_svector(normalized_density_gradient, system, particle)
+    normalized_gradient_b = extract_svector(normalized_density_gradient, system, neighbor)
     result -= dot(normalized_gradient_a + normalized_gradient_b, pos_diff)
 
-    return result * pos_diff / distance^2
+    # Since this is one of the most performance critical functions, using fast divisions
+    # here gives a significant speedup on GPUs.
+    # See the docs page "Development" for more details on `div_fast`.
+    return div_fast(result, distance^2) * pos_diff
 end
 
 function update!(density_diffusion::DensityDiffusionAntuono, v, u, system, semi)
@@ -217,19 +224,20 @@ end
     # See `src/general/smoothing_kernels.jl` for more details.
     distance^2 < eps(initial_smoothing_length(particle_system)^2) && return
 
-    (; delta) = density_diffusion
-    sound_speed = system_sound_speed(particle_system)
-
-    volume_b = m_b / rho_b
-
+    # Since this is one of the most performance critical functions, using fast divisions
+    # here gives a significant speedup on GPUs.
+    # See the docs page "Development" for more details on `div_fast`.
+    volume_b = div_fast(m_b, rho_b)
     psi = density_diffusion_psi(density_diffusion, rho_a, rho_b, pos_diff, distance,
                                 particle_system, particle, neighbor)
-    density_diffusion_term = dot(psi, grad_kernel) * volume_b
+    density_diffusion_term = volume_b * dot(psi, grad_kernel)
 
     smoothing_length_avg = (smoothing_length(particle_system, particle) +
                             smoothing_length(particle_system, neighbor)) / 2
-    dv[end, particle] += delta * smoothing_length_avg * sound_speed *
-                         density_diffusion_term
+
+    (; delta) = density_diffusion
+    sound_speed = system_sound_speed(particle_system)
+    dv[end, particle] += delta * smoothing_length_avg * sound_speed * density_diffusion_term
 end
 
 # Density diffusion `nothing` or interaction other than fluid-fluid
diff --git a/src/util.jl b/src/util.jl
@@ -1,3 +1,9 @@
+# By default, use `div_fast` of `Base.FastMath`.
+# In the `TrixiParticlesCUDAExt` extension, this is redefined for `Float64`.
+@inline function div_fast(x, y)
+    return Base.FastMath.div_fast(x, y)
+end
+
 # Same as `foreach`, but it is unrolled by the compiler for small input tuples
 @inline function foreach_noalloc(func, collection)
     element = first(collection)
diff --git a/test/examples/gpu.jl b/test/examples/gpu.jl
diff --git a/test/test_util.jl b/test/test_util.jl