Use fast divisions in performance-critical code#1128

Merged
svchb merged 17 commits into trixi-framework:main from efaulhaber:div-fast
Apr 10, 2026

Conversation

@efaulhaber
Member

@efaulhaber efaulhaber commented Apr 1, 2026

This PR switches all divisions inside the neighbor loop to fast divisions. The fluid-* interaction kernel still contains regular divisions in the PointNeighbors.jl part, but these are not executed once per neighbor particle, so they don't make a noticeable difference.

As shown in the benchmarks in #1116, these fast divisions make a massive difference in the runtime of the fluid-* interaction kernel, but they don't work with Float64. @Mikolaj-A-Kowalski wrote this incredible fast_inv_cuda function to get fast divisions with Float64 without losing much accuracy in CliMA/Oceananigans.jl#5140.

Here is a simple demonstration of the accuracy of this function vs. a plain `llvm.nvvm.rcp.approx.ftz.d` without the subsequent refinement iteration:

```julia
julia> y2 = CUDA.rand(Float64, 100_000) .+ 0.1; # avoid numbers close to zero

julia> maximum(inv.(y2) .- fast_inv_cuda_nofma.(y2))
7.105216770497691e-6

julia> maximum(inv.(y2) .- fast_inv_cuda.(y2))
8.881784197001252e-16
```
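The idea behind the refinement can be illustrated with a CPU sketch (hedged: `refined_inv_sketch` is a hypothetical stand-in, and the truncated `Float32` reciprocal here merely plays the role of the hardware `rcp.approx` instruction):

```julia
# Sketch of the cubic refinement used after an approximate reciprocal.
# Given r ≈ 1/y, let e = 1 - y*r. Then 1/y = r/(1 - e) ≈ r*(1 + e + e^2),
# which drives the relative error from O(e) down to O(e^3).
function refined_inv_sketch(y::Float64)
    r = Float64(inv(Float32(y)))  # crude approximate reciprocal (stand-in)
    e = fma(-y, r, 1.0)           # e = 1 - y*r, computed exactly via FMA
    e = fma(e, e, e)              # e + e^2
    return fma(e, r, r)           # r*(1 + e + e^2)
end
```

With a starting error of about 1e-8 (single precision), the refined result is accurate to double-precision roundoff, matching the pattern of FMA instructions in the FP64 column of the table below.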

Here is a comparison of different definitions for `div_fast`, with the generated LLVM and the corresponding runtime of the fluid-* interaction kernel on an H100.

| Variant | LLVM Operation (FP32) | Runtime (FP32) | LLVM Operation (FP64) | Runtime (FP64) |
|---|---|---|---|---|
| `x / y` | `fdiv float %143, %144` | 4.894 ms | `fdiv double %143, %144` | 10.159 ms |
| `x * (1 / y)` | `fdiv float 1.000000e+00, %145`<br>`fmul float %144, %146` | 4.227 ms | `fdiv double 1.000000e+00, %145`<br>`fmul double %144, %146` | 6.878 ms |
| `x * inv(y)` | `call float @llvm.nvvm.rcp.rn.f(float %118)`<br>`fmul float %117, %119` | 3.620 ms | `fdiv double 1.000000e+00, %145`<br>`fmul double %144, %146` | 6.852 ms |
| `x * fast_inv_cuda(y)` | `call float @llvm.nvvm.rcp.approx.ftz.f(float %118)`<br>`fmul float %117, %119` | 3.281 ms | `call double @llvm.nvvm.rcp.approx.ftz.d(double %118)`<br>`fneg double %118`<br>`call double @llvm.fma.f64(double %119, double %120, double 1.000000e+00)`<br>`call double @llvm.fma.f64(double %121, double %121, double %121)`<br>`call double @llvm.fma.f64(double %122, double %119, double %119)`<br>`fmul double %117, %123` | 4.844 ms |
| `Base.FastMath.div_fast(x, y)` | `call float @llvm.nvvm.div.approx.f(float %118, float %115)` | 3.114 ms | `fdiv fast double %143, %144` | 10.114 ms |
| `x * fast_inv_cuda_nofma(y)` | — | — | `call double @llvm.nvvm.rcp.approx.ftz.d(double %118)`<br>`fmul double %117, %119` | 4.758 ms |

(Debug metadata `!dbg` omitted for readability.)

With Float32, `1 / y` is faster than `x / y`, but it doesn't produce a reciprocal instruction. `inv` does produce a reciprocal and is faster. The approximate reciprocal is faster still, but not as fast as the fast division.
With Float64, `x / y` is translated to `fdiv fast`, but (as expected, since PTX has no fast double-precision division) this is not faster. `1 / y` and `inv` are both translated to `fdiv double 1.0 ...` and are significantly faster than the regular division. `fast_inv_cuda` is faster still, and skipping the Newton iteration makes it only 1.018x faster, at a significant loss of accuracy.
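These lowerings can be checked directly from Julia. A minimal sketch using `InteractiveUtils.code_llvm` (CPU-side; for device code, `CUDA.@device_code_llvm` plays the same role; `f_div`/`f_fast` are throwaway names for illustration):

```julia
using InteractiveUtils  # provides code_llvm

f_div(x, y)  = x / y
f_fast(x, y) = Base.FastMath.div_fast(x, y)

# Capture the generated LLVM IR for Float64 arguments as a string.
ir_div  = sprint(io -> code_llvm(io, f_div,  Tuple{Float64, Float64}))
ir_fast = sprint(io -> code_llvm(io, f_fast, Tuple{Float64, Float64}))

println(occursin("fdiv", ir_div))   # plain fdiv for the regular division
println(occursin("fast", ir_fast))  # fast-math flags for div_fast
```

On the GPU, the same inspection reveals whether the backend actually maps the fast-math division to an approximate instruction or silently falls back to a full-precision one, as in the FP64 row above.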

@efaulhaber
Member Author

efaulhaber commented Apr 1, 2026

To-Do:

  • Tests for fast_inv_cuda.
  • Figure out what happens on AMD and Metal.

@codecov

codecov bot commented Apr 1, 2026

Codecov Report

❌ Patch coverage is 94.11765% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.70%. Comparing base (2277802) to head (a029c3b).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
...fluid/weakly_compressible_sph/density_diffusion.jl 83.33% 2 Missing ⚠️
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #1128      +/-   ##
==========================================
- Coverage   88.72%   88.70%   -0.02%
==========================================
  Files         128      128
  Lines        9745     9746       +1
==========================================
- Hits         8646     8645       -1
- Misses       1099     1101       +2
```

| Flag | Coverage Δ |
|---|---|
| total | 88.70% <94.11%> (-0.02%) ⬇️ |
| unit | 66.98% <58.82%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown.


@efaulhaber efaulhaber closed this Apr 2, 2026
@efaulhaber efaulhaber reopened this Apr 2, 2026
@sloede
Member

sloede commented Apr 7, 2026

Very impressive results! I recommend adding a description to the performance section of the docs that briefly explains the rationale behind this optimization and references this PR for further details. It should also include a note that `div_fast` should only be used where it was deemed necessary and effective after adequate performance comparisons.

In addition, I recommend adding a comment to each use of `div_fast` that points to said documentation, so that unwitting new developers don't think this is something to use at will.

@efaulhaber
Member Author

/run-gpu-tests

@efaulhaber
Member Author

/run-gpu-tests

Member

@sloede sloede left a comment


Very nice docs! Small suggestion from my side.

efaulhaber and others added 2 commits April 7, 2026 15:17
Co-authored-by: Michael Schlottke-Lakemper <michael@sloede.com>
@efaulhaber
Member Author

/run-gpu-tests


Copilot AI left a comment


Pull request overview

This PR introduces a package-level div_fast helper and replaces divisions in several performance-critical fluid interaction kernels with div_fast, enabling faster GPU execution (including a CUDA-specific Float64 fast reciprocal via an extension).

Changes:

  • Add TrixiParticles.div_fast wrapper (defaulting to Base.FastMath.div_fast) and use it in hot SPH kernels (density diffusion, viscosity, pressure acceleration, continuity equation).
  • Add a CUDA extension overriding div_fast(::Any, ::Float64) on-device using a fast approximate reciprocal with a cubic refinement step.
  • Add GPU tests to validate div_fast accuracy on CPU/GPU and document GPU coding guidelines in the development docs.
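The wrapper pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's exact code; the CUDA specialization shown in the comment is hypothetical:

```julia
# Minimal sketch of a package-level fast-division wrapper: a plain
# function that defaults to Base.FastMath.div_fast, which a backend
# extension can then override for specific element types.
@inline div_fast(x, y) = Base.FastMath.div_fast(x, y)

# A CUDA extension could specialize the Float64 case on-device, e.g.
#     @inline div_fast(x, y::Float64) = x * fast_inv_cuda(y)
```

Centralizing the choice in one function means the hot loops only ever call `div_fast`, and the decision of which hardware instruction to use lives in one place per backend.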

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/util.jl Introduces div_fast wrapper to centralize fast-division behavior and enable backend-specific overrides.
ext/TrixiParticlesCUDAExt.jl Adds CUDA device override for Float64 fast division using a refined approximate reciprocal.
src/schemes/fluid/weakly_compressible_sph/density_diffusion.jl Switches inner-loop divisions to div_fast for GPU performance.
src/schemes/fluid/viscosity.jl Switches several viscosity-related divisions to div_fast and refactors a helper accordingly.
src/schemes/fluid/pressure_acceleration.jl Replaces pressure-related divisions with div_fast in hot paths.
src/schemes/fluid/fluid.jl Uses div_fast in the continuity equation update for ContinuityDensity.
Project.toml Adds CUDA as a weak dependency and registers the CUDA extension.
test/test_util.jl Imports TrixiParticles.Adapt so GPU tests can call Adapt.adapt.
test/examples/gpu.jl Adds accuracy tests for div_fast on CPU/GPU for Float32/Float64 (where supported).
docs/src/development.md Adds GPU coding guidelines and a div_fast benchmarking/reference convention.
docs/src/gpu.md Links GPU coding guidelines from the GPU documentation page.


@efaulhaber efaulhaber marked this pull request as ready for review April 7, 2026 16:38
@efaulhaber efaulhaber self-assigned this Apr 7, 2026
@efaulhaber efaulhaber requested a review from svchb April 7, 2026 16:38
@LasNikas
Collaborator

LasNikas commented Apr 9, 2026

/run-gpu-tests

@svchb
Collaborator

svchb commented Apr 9, 2026

To-Do:

  • Tests for fast_inv_cuda.
  • Figure out what happens on AMD and Metal.

And what about AMD and Metal?

@efaulhaber
Member Author

And what about AMD and Metal?

It doesn't make sense that I wrote Metal. Metal doesn't support FP64, so this is irrelevant there. I just benchmarked with Niklas earlier today on an MI300A, and it turned out that FP64 fast divisions just work out of the box with Base.FastMath.div_fast, so this hacky workaround is not required there. I updated the benchmark results in #1131.

@svchb svchb merged commit 3dc3c7b into trixi-framework:main Apr 10, 2026
19 checks passed
@efaulhaber efaulhaber deleted the div-fast branch April 10, 2026 14:00