Use fast divisions in performance-critical code #1128

svchb merged 17 commits into trixi-framework:main from
Conversation
To-Do:
Codecov Report ❌ Patch coverage is

@@            Coverage Diff             @@
##             main    #1128      +/-   ##
==========================================
- Coverage   88.72%   88.70%   -0.02%
==========================================
  Files         128      128
  Lines        9745     9746       +1
==========================================
- Hits         8646     8645       -1
- Misses       1099     1101       +2
Very impressive results! I recommend adding a description to the performance section of the docs that briefly explains the rationale behind this optimization and references this PR for further details. It should also include a note that … In addition, I recommend adding a comment to each use of …
/run-gpu-tests

/run-gpu-tests
sloede left a comment

Very nice docs! Small suggestion from my side.
Co-authored-by: Michael Schlottke-Lakemper <michael@sloede.com>
/run-gpu-tests
Pull request overview
This PR introduces a package-level `div_fast` helper and replaces divisions in several performance-critical fluid interaction kernels with `div_fast`, enabling faster GPU execution (including a CUDA-specific `Float64` fast reciprocal via an extension).
Changes:
- Add `TrixiParticles.div_fast` wrapper (defaulting to `Base.FastMath.div_fast`) and use it in hot SPH kernels (density diffusion, viscosity, pressure acceleration, continuity equation).
- Add a CUDA extension overriding `div_fast(::Any, ::Float64)` on-device using a fast approximate reciprocal with a cubic refinement step.
- Add GPU tests to validate `div_fast` accuracy on CPU/GPU and document GPU coding guidelines in the development docs.
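The wrapper-plus-extension pattern summarized above can be sketched in a few lines. This is a hedged, simplified sketch, not the actual package code; only `Base.FastMath.div_fast` is a real API here.

```julia
# Minimal sketch of the wrapper pattern, assuming only Base Julia.
# The real package defines TrixiParticles.div_fast; a CUDA package
# extension then adds a more specific on-device method for Float64.

# Package-level default: forward to Julia's fast-math division.
div_fast(x, y) = Base.FastMath.div_fast(x, y)

# In the CUDA extension, one would override the Float64 case with a
# fast approximate reciprocal plus a refinement step, roughly:
#     div_fast(x, y::Float64) = x * fast_inv_cuda(y)   # device code only

x, y = 3.0f0, 7.0f0
div_fast(x, y) ≈ x / y   # true on the CPU fallback
```

Centralizing the division in one function is what makes the backend-specific override possible without touching the kernels themselves.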
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `src/util.jl` | Introduces `div_fast` wrapper to centralize fast-division behavior and enable backend-specific overrides. |
| `ext/TrixiParticlesCUDAExt.jl` | Adds CUDA device override for `Float64` fast division using a refined approximate reciprocal. |
| `src/schemes/fluid/weakly_compressible_sph/density_diffusion.jl` | Switches inner-loop divisions to `div_fast` for GPU performance. |
| `src/schemes/fluid/viscosity.jl` | Switches several viscosity-related divisions to `div_fast` and refactors a helper accordingly. |
| `src/schemes/fluid/pressure_acceleration.jl` | Replaces pressure-related divisions with `div_fast` in hot paths. |
| `src/schemes/fluid/fluid.jl` | Uses `div_fast` in the continuity equation update for `ContinuityDensity`. |
| `Project.toml` | Adds CUDA as a weak dependency and registers the CUDA extension. |
| `test/test_util.jl` | Imports `TrixiParticles.Adapt` so GPU tests can call `Adapt.adapt`. |
| `test/examples/gpu.jl` | Adds accuracy tests for `div_fast` on CPU/GPU for `Float32`/`Float64` (where supported). |
| `docs/src/development.md` | Adds GPU coding guidelines and a `div_fast` benchmarking/reference convention. |
| `docs/src/gpu.md` | Links GPU coding guidelines from the GPU documentation page. |
/run-gpu-tests
And what about AMD and Metal?
It doesn't make sense that I wrote Metal: Metal doesn't support FP64, so this is irrelevant there. I just benchmarked with Niklas earlier today on an MI300A, and it turned out that FP64 fast divisions just work out of the box with …
This PR switches all divisions inside the neighbor loop to fast divisions. The fluid-* interaction kernel still contains regular divisions in the PointNeighbors.jl part, but these are not executed for each neighbor particle, so they don't make a noticeable difference.
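To illustrate the per-neighbor pattern (a hedged sketch with made-up names, not the actual TrixiParticles kernels):

```julia
# Sketch of a per-neighbor hot loop, assuming hypothetical arrays
# `m` (masses) and `rho` (densities); on the CPU, div_fast just
# forwards to Base.FastMath.div_fast.
div_fast(x, y) = Base.FastMath.div_fast(x, y)

function density_sum_sketch(m, rho, neighbors)
    acc = zero(eltype(rho))
    for j in neighbors
        # was: acc += m[j] / rho[j]  -- one division per neighbor
        acc += div_fast(m[j], rho[j])
    end
    return acc
end

density_sum_sketch([1.0, 2.0], [2.0, 4.0], 1:2)  # 0.5 + 0.5 = 1.0
```

Because the division executes once per neighbor, it dominates the kernel's instruction mix, which is why replacing only these divisions already captures most of the speedup.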
As shown in the benchmarks in #1116, these fast divisions make a massive difference in the runtime of the fluid-* interaction kernel, but they don't work with `Float64`. @Mikolaj-A-Kowalski wrote this incredible `fast_inv_cuda` function to get fast divisions with `Float64` without losing much accuracy in CliMA/Oceananigans.jl#5140.

Here is a simple demonstration of the accuracy of this function vs. a simple `llvm.nvvm.rcp.approx.ftz.d` without the following iteration:

Here is a comparison of different definitions for `div_fast` with the generated LLVM and corresponding runtime of the fluid-* interact kernel on an H100.

| Definition | `Float32` LLVM | `Float64` LLVM |
|---|---|---|
| `x / y` | `fdiv float %143, %144` | `fdiv double %143, %144` |
| `x * (1 / y)` | `fdiv float 1.000000e+00, %145`<br>`fmul float %144, %146` | `fdiv double 1.000000e+00, %145`<br>`fmul double %144, %146` |
| `x * inv(y)` | `call float @llvm.nvvm.rcp.rn.f(float %118)`<br>`fmul float %117, %119` | `fdiv double 1.000000e+00, %145`<br>`fmul double %144, %146` |
| `x * fast_inv_cuda(y)` | `call float @llvm.nvvm.rcp.approx.ftz.f(float %118)`<br>`fmul float %117, %119` | `call double @llvm.nvvm.rcp.approx.ftz.d(double %118)`<br>`fneg double %118`<br>`call double @llvm.fma.f64(double %119, double %120, double 1.000000e+00)`<br>`call double @llvm.fma.f64(double %121, double %121, double %121)`<br>`call double @llvm.fma.f64(double %122, double %119, double %119)`<br>`fmul double %117, %123` |
| `Base.FastMath.div_fast(x, y)` | `call float @llvm.nvvm.div.approx.f(float %118, float %115)` | `fdiv fast double %143, %144` |
| `x * fast_inv_cuda_nofma(y)` | — | `call double @llvm.nvvm.rcp.approx.ftz.d(double %118)`<br>`fmul double %117, %119` |

With `Float32`, `1 / y` is faster than `x / y`, but it doesn't produce a reciprocal instruction. `inv` does produce a reciprocal and is faster. The approx reciprocal is even faster, but still not as fast as the fast division.

With `Float64`, `Base.FastMath.div_fast(x, y)` is translated to an `fdiv fast`, but (as expected, since there is no double fast division in PTX) this is not faster. `1 / y` and `inv` are both translated to `fdiv double 1.0 ...` and are significantly faster than the regular division. `fast_inv_cuda` is even faster, and skipping the Newton iteration makes it only 1.018x faster (at a significant loss of accuracy).
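For reference, the refinement idea behind `fast_inv_cuda`, as we read the LLVM shown above, can be reproduced in plain Julia. This is a sketch under that reading, not the actual Oceananigans implementation: the device code seeds with `llvm.nvvm.rcp.approx.ftz.d`, which we emulate here by rounding the reciprocal through `Float32`.

```julia
# Cubic refinement of an approximate reciprocal z0 ≈ 1/y:
#   e  = 1 - y*z0            (residual)
#   z1 = z0 * (1 + e + e^2)  (residual shrinks from e to e^3)
# computed with fused multiply-adds, matching the three llvm.fma.f64
# calls shown above.
function refined_inv(y::Float64)
    z0 = Float64(inv(Float32(y)))     # stand-in for rcp.approx.ftz.d
    e  = fma(-y, z0, 1.0)             # e = 1 - y*z0
    return fma(fma(e, e, e), z0, z0)  # z0*(e + e^2) + z0
end

y = 3.0
abs(Float64(inv(Float32(y))) - inv(y))   # ~1e-8 before refinement
abs(refined_inv(y) - inv(y))             # ~1e-16 after one cubic step
```

One cubic step takes the ~2^-24 seed error down to ~2^-72, below `Float64` precision, which is why skipping the iteration costs so much accuracy for only a 1.018x speedup.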