Improve GPU performance of TLSPH #1084
efaulhaber wants to merge 11 commits into trixi-framework:main from
Conversation
Pull request overview
This PR optimizes the TLSPH time-integration path for GPU workloads by reducing per-particle dispatch/branching in hot loops, adding GPU-optimized bulk copies for velocity/positions, and short-circuiting work when source terms/gravity are inactive.
Changes:
- Refactors the `drift!` velocity update into a per-system `set_velocity!` and adds a GPU fast path (`copyto!`) for velocity assignment.
- Adds a GPU fast path for TLSPH position updates (`update_tlsph_positions!`).
- Refactors `add_source_terms!` to dispatch per system and skip work when acceleration and source terms are inactive; updates special-case handling for `ParticlePackingSystem`.
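The velocity fast path can be illustrated with a small sketch. This is a hypothetical reconstruction of the dispatch pattern, not the actual TrixiParticles API: the generic method loops per particle, while the fast path replaces the loop with one bulk `copyto!` (which, for GPU arrays, maps to a single contiguous device copy instead of per-particle indexing).

```julia
# Illustrative sketch (assumed names/signatures, not the real PR code).
# Generic fallback: copy the velocity part of `v` into `du` per particle.
function set_velocity!(du::AbstractMatrix, v::AbstractMatrix)
    for particle in axes(du, 2), i in axes(du, 1)
        du[i, particle] = v[i, particle]
    end
    return du
end

# Fast path: one bulk copy instead of a per-particle loop. On GPU arrays
# the analogous method avoids launching a kernel with per-particle branching.
function set_velocity_fast!(du::Matrix{Float64}, v::Matrix{Float64})
    copyto!(du, v)
    return du
end

v = rand(2, 5)
du = zeros(2, 5)
set_velocity_fast!(du, v)
```

Both methods produce the same result; the point of the dispatch is that the cheap bulk copy is selected automatically for array types that support it.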
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| src/general/semidiscretization.jl | Refactors `drift!` velocity-setting and reworks source-term application/dispatch + skipping logic. |
| src/schemes/structure/total_lagrangian_sph/system.jl | Adds GPU-optimized TLSPH position copy. |
| src/schemes/boundary/open_boundary/system.jl | Updates open-boundary velocity-setting to the new `set_velocity!` API and a threaded loop. |
| src/preprocessing/particle_packing/system.jl | Adds a `ParticlePackingSystem` source-term no-op specialization and adapts to `set_velocity!`. |
| test/schemes/structure/total_lagrangian_sph/rhs.jl | Removes the now-unneeded mock `add_acceleration!` definition. |
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Codecov Report

❌ Patch coverage is

```
@@            Coverage Diff             @@
##             main    #1084      +/-   ##
==========================================
- Coverage   88.71%   88.61%   -0.11%
==========================================
  Files         128      128
  Lines        9745     9772      +27
==========================================
+ Hits         8645     8659      +14
- Misses       1100     1113      +13
```
/run-gpu-tests

/run-gpu-tests
```julia
function drift!(du_ode, v_ode, u_ode, semi, t)
    @trixi_timeit timer() "drift!" begin
        @trixi_timeit timer() "reset ∂u/∂t" set_zero!(du_ode)
```
Didn't we ever need `set_zero!` for `du` in that case?
Yes, we now need to be aware that we have to set this to zero for systems where it is not set to something non-zero. We can't just dispatch these functions to do nothing. Hence the comment

```julia
# Note that `du` is of length zero, so we don't have to set it to zero
```

for the boundary system, and the

```julia
if semi.integrate_tlsph[]
    set_velocity_default!(du, v, u, system, semi, t)
else
    set_zero!(du)
end
```

for TLSPH.
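The "length zero" argument can be demonstrated with a minimal sketch, assuming `set_zero!` is a simple in-place broadcast (the actual TrixiParticles definition may differ):

```julia
# Assumed minimal definition of `set_zero!` for illustration.
set_zero!(du) = (du .= zero(eltype(du)); du)

du_tlsph = rand(2, 4)
set_zero!(du_tlsph)        # all entries reset when TLSPH is not integrated

du_boundary = zeros(2, 0)  # a boundary system's `du` has length zero
set_zero!(du_boundary)     # the broadcast over zero elements is a harmless no-op
```

So the boundary system needs no special dispatch: broadcasting over an empty array simply does nothing.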
```julia
@trixi_timeit timer() "drift!" begin
    @trixi_timeit timer() "reset ∂u/∂t" set_zero!(du_ode)
```

```julia
@trixi_timeit timer() "velocity" begin
```
Why are you removing the timers?
Because there is only one thing happening now in `drift!`, which is setting the velocity.
```julia
if semi.integrate_tlsph[]
    set_velocity_default!(du, v, u, system, semi, t)
else
    set_zero!(du)
```
Have you checked whether the broadcasting in `set_zero!` causes memory allocation on some hardware? That happened to me once, but I forgot to create an issue.
I've never seen this. Please create an issue if you see this again.
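One way to check this concern on CPU arrays is to warm up the call (so compilation is excluded) and then measure with `@allocated`. The `set_zero!` below is again only a sketched in-place broadcast; GPU behavior would have to be measured on the device itself.

```julia
# Assumed minimal definition of `set_zero!` for illustration.
set_zero!(du) = (du .= zero(eltype(du)); du)

function measure_allocations()
    du = rand(3, 1000)
    set_zero!(du)                  # warm-up call (first call compiles)
    return @allocated set_zero!(du)
end

allocs = measure_allocations()     # in-place broadcast: expected 0 bytes on CPU
```

If this ever reports nonzero allocations on some hardware or Julia version, that would be exactly the situation worth an issue.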
This PR

- moves the `integrate_tlsph` checks out of the hot loops in `drift!` and the source terms (relevant for small problems on GPUs),
- adds GPU-optimized fast paths for `add_velocity!` and `update_tlsph_positions!` for GPU arrays,
- avoids unnecessary `set_zero!(du)` calls and renames `add_velocity!` to `set_velocity!` (relevant for small problems on GPUs),
- skips `add_source_terms!` when no source terms are used and `system.acceleration` is zero.

Especially for split integration on GPUs, where a small TLSPH-only simulation is run for many sub-steps, this provides a very significant speedup.
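The last point, skipping source terms entirely, can be sketched as follows. Names and signatures here are illustrative, not the actual `add_source_terms!` from the PR; the point is the early exit when both gravity and user source terms are inactive, which on GPUs also avoids a kernel launch.

```julia
# Hedged sketch of the short-circuit: skip all per-particle work when
# gravity is zero and no user source terms are given.
function add_source_terms_sketch!(dv::AbstractMatrix, acceleration::AbstractVector,
                                  source_terms)
    if source_terms === nothing && iszero(acceleration)
        return dv  # fast exit: nothing to add, skip the loop entirely
    end
    for particle in axes(dv, 2)
        for i in axes(dv, 1)
            dv[i, particle] += acceleration[i]
        end
        # user-defined source terms would be applied here
    end
    return dv
end

dv = zeros(2, 3)
add_source_terms_sketch!(dv, [0.0, 0.0], nothing)    # fast exit, `dv` unchanged
add_source_terms_sketch!(dv, [0.0, -9.81], nothing)  # gravity applied
```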
Here are the relevant changes in the timer output of my large fin simulation (#844) on an H100. Note that this is before I moved the "source terms" timer behind the dispatch. Now, the "source terms" timer should actually completely disappear when neither source terms nor gravity is used.
main:
this PR:
Here are the full timer outputs for reference.
main:
this PR: