Use fast divisions in performance-critical code #1128

svchb merged 17 commits into trixi-framework:main from
Conversation
To-Do:
Codecov Report ❌ Patch coverage is

@@            Coverage Diff             @@
##             main    #1128      +/-   ##
==========================================
- Coverage   88.72%   88.70%   -0.02%
==========================================
  Files         128      128
  Lines        9745     9746       +1
==========================================
- Hits         8646     8645       -1
- Misses       1099     1101       +2
Very impressive results! I recommend adding a description to the performance section of the docs that briefly explains the rationale behind this optimization and references this PR for further details. It should also include a note that … In addition, I recommend adding a comment to each use of …
/run-gpu-tests

/run-gpu-tests
sloede left a comment

Very nice docs! Small suggestion from my side.
Co-authored-by: Michael Schlottke-Lakemper <michael@sloede.com>
/run-gpu-tests
Pull request overview
This PR introduces a package-level `div_fast` helper and replaces divisions in several performance-critical fluid interaction kernels with `div_fast`, enabling faster GPU execution (including a CUDA-specific `Float64` fast reciprocal via an extension).
Changes:
- Add `TrixiParticles.div_fast` wrapper (defaulting to `Base.FastMath.div_fast`) and use it in hot SPH kernels (density diffusion, viscosity, pressure acceleration, continuity equation).
- Add a CUDA extension overriding `div_fast(::Any, ::Float64)` on-device using a fast approximate reciprocal with a cubic refinement step.
- Add GPU tests to validate `div_fast` accuracy on CPU/GPU and document GPU coding guidelines in the development docs.
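The wrapper-plus-extension pattern summarized above can be sketched in a few lines. This is a hedged, simplified sketch, not the actual package code; only `Base.FastMath.div_fast` is a real API here.

```julia
# Minimal sketch of the wrapper pattern, assuming only Base Julia.
# The real package defines TrixiParticles.div_fast; a CUDA package
# extension then adds a more specific on-device method for Float64.

# Package-level default: forward to Julia's fast-math division.
div_fast(x, y) = Base.FastMath.div_fast(x, y)

# In the CUDA extension, one would override the Float64 case with a
# fast approximate reciprocal plus a refinement step, roughly:
#     div_fast(x, y::Float64) = x * fast_inv_cuda(y)   # device code only

x, y = 3.0f0, 7.0f0
div_fast(x, y) ≈ x / y   # true on the CPU fallback
```

Centralizing the division in one function is what makes the backend-specific override possible without touching the kernels themselves.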
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `src/util.jl` | Introduces `div_fast` wrapper to centralize fast-division behavior and enable backend-specific overrides. |
| `ext/TrixiParticlesCUDAExt.jl` | Adds CUDA device override for `Float64` fast division using a refined approximate reciprocal. |
| `src/schemes/fluid/weakly_compressible_sph/density_diffusion.jl` | Switches inner-loop divisions to `div_fast` for GPU performance. |
| `src/schemes/fluid/viscosity.jl` | Switches several viscosity-related divisions to `div_fast` and refactors a helper accordingly. |
| `src/schemes/fluid/pressure_acceleration.jl` | Replaces pressure-related divisions with `div_fast` in hot paths. |
| `src/schemes/fluid/fluid.jl` | Uses `div_fast` in the continuity equation update for `ContinuityDensity`. |
| `Project.toml` | Adds CUDA as a weak dependency and registers the CUDA extension. |
| `test/test_util.jl` | Imports `TrixiParticles.Adapt` so GPU tests can call `Adapt.adapt`. |
| `test/examples/gpu.jl` | Adds accuracy tests for `div_fast` on CPU/GPU for `Float32`/`Float64` (where supported). |
| `docs/src/development.md` | Adds GPU coding guidelines and a `div_fast` benchmarking/reference convention. |
| `docs/src/gpu.md` | Links GPU coding guidelines from the GPU documentation page. |
/run-gpu-tests
And what about AMD and Metal?
It doesn't make sense that I wrote Metal: Metal doesn't support FP64, so this is irrelevant there. I just benchmarked with Niklas earlier today on an MI300A, and it turned out that FP64 fast divisions just work out of the box with …
This PR switches all divisions inside the neighbor loop to fast divisions. The fluid-* interaction kernel still contains regular divisions in the PointNeighbors.jl part, but these are not executed for each neighbor particle, so they don't make a noticeable difference.
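To illustrate the per-neighbor pattern (a hedged sketch with made-up names, not the actual TrixiParticles kernels):

```julia
# Sketch of a per-neighbor hot loop, assuming hypothetical arrays
# `m` (masses) and `rho` (densities); on the CPU, div_fast just
# forwards to Base.FastMath.div_fast.
div_fast(x, y) = Base.FastMath.div_fast(x, y)

function density_sum_sketch(m, rho, neighbors)
    acc = zero(eltype(rho))
    for j in neighbors
        # was: acc += m[j] / rho[j]  -- one division per neighbor
        acc += div_fast(m[j], rho[j])
    end
    return acc
end

density_sum_sketch([1.0, 2.0], [2.0, 4.0], 1:2)  # 0.5 + 0.5 = 1.0
```

Because the division executes once per neighbor, it dominates the kernel's instruction mix, which is why replacing only these divisions already captures most of the speedup.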
As shown in the benchmarks in #1116, these fast divisions make a massive difference in the runtime of the fluid-* interaction kernel, but they don't work with `Float64`. @Mikolaj-A-Kowalski wrote this incredible `fast_inv_cuda` function to get fast divisions with `Float64` without losing much accuracy in CliMA/Oceananigans.jl#5140.

Here is a simple demonstration of the accuracy of this function vs. a simple `llvm.nvvm.rcp.approx.ftz.d` without the following iteration:

Here is a comparison of different definitions for `div_fast` with the generated LLVM and corresponding runtime of the fluid-* interact kernel on an H100.

| Definition | `Float32` LLVM | `Float64` LLVM |
|---|---|---|
| `x / y` | `fdiv float %143, %144` | `fdiv double %143, %144` |
| `x * (1 / y)` | `fdiv float 1.000000e+00, %145`<br>`fmul float %144, %146` | `fdiv double 1.000000e+00, %145`<br>`fmul double %144, %146` |
| `x * inv(y)` | `call float @llvm.nvvm.rcp.rn.f(float %118)`<br>`fmul float %117, %119` | `fdiv double 1.000000e+00, %145`<br>`fmul double %144, %146` |
| `x * fast_inv_cuda(y)` | `call float @llvm.nvvm.rcp.approx.ftz.f(float %118)`<br>`fmul float %117, %119` | `call double @llvm.nvvm.rcp.approx.ftz.d(double %118)`<br>`fneg double %118`<br>`call double @llvm.fma.f64(double %119, double %120, double 1.000000e+00)`<br>`call double @llvm.fma.f64(double %121, double %121, double %121)`<br>`call double @llvm.fma.f64(double %122, double %119, double %119)`<br>`fmul double %117, %123` |
| `Base.FastMath.div_fast(x, y)` | `call float @llvm.nvvm.div.approx.f(float %118, float %115)` | `fdiv fast double %143, %144` |
| `x * fast_inv_cuda_nofma(y)` | — | `call double @llvm.nvvm.rcp.approx.ftz.d(double %118)`<br>`fmul double %117, %119` |

With `Float32`, `1 / y` is faster than `x / y`, but it doesn't produce a reciprocal instruction. `inv` does produce a reciprocal and is faster. The approx reciprocal is even faster, but still not as fast as the fast division.

With `Float64`, `Base.FastMath.div_fast(x, y)` is translated to an `fdiv fast`, but (as expected, since there is no double fast division in PTX) this is not faster. `1 / y` and `inv` are both translated to `fdiv double 1.0 ...` and are significantly faster than the regular division. `fast_inv_cuda` is even faster, and skipping the Newton iteration makes it only 1.018x faster (at a significant loss of accuracy).
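For reference, the refinement idea behind `fast_inv_cuda`, as we read the LLVM shown above, can be reproduced in plain Julia. This is a sketch under that reading, not the actual Oceananigans implementation: the device code seeds with `llvm.nvvm.rcp.approx.ftz.d`, which we emulate here by rounding the reciprocal through `Float32`.

```julia
# Cubic refinement of an approximate reciprocal z0 ≈ 1/y:
#   e  = 1 - y*z0            (residual)
#   z1 = z0 * (1 + e + e^2)  (residual shrinks from e to e^3)
# computed with fused multiply-adds, matching the three llvm.fma.f64
# calls shown above.
function refined_inv(y::Float64)
    z0 = Float64(inv(Float32(y)))     # stand-in for rcp.approx.ftz.d
    e  = fma(-y, z0, 1.0)             # e = 1 - y*z0
    return fma(fma(e, e, e), z0, z0)  # z0*(e + e^2) + z0
end

y = 3.0
abs(Float64(inv(Float32(y))) - inv(y))   # ~1e-8 before refinement
abs(refined_inv(y) - inv(y))             # ~1e-16 after one cubic step
```

One cubic step takes the ~2^-24 seed error down to ~2^-72, below `Float64` precision, which is why skipping the iteration costs so much accuracy for only a 1.018x speedup.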