Skip to content

A rework for the advection module that improves WENO performance#4434

Closed
simone-silvestri wants to merge 354 commits into
mainfrom
ss/optimize-weno
Closed

A rework for the advection module that improves WENO performance#4434
simone-silvestri wants to merge 354 commits into
mainfrom
ss/optimize-weno

Conversation

@simone-silvestri
Copy link
Copy Markdown
Collaborator

@simone-silvestri simone-silvestri commented Apr 22, 2025

This PR reworks how advection is implemented by passing a reduced_order integer to the interpolate functions.

At the moment, upwind and centered reconstruction do not formally change; the ifelse that was upstream before in the alt_interpolate functions is just pushed downstream in the reconstruction function.

For WENO, the reconstruction approach changes because the order is accounted for by changing the reconstruction coefficients and the smoothness coefficients.
This method should give us a boost in performance in the WENO case for bounded and immersed boundary grids, as there is much less computation to be performed. I will post some benchmarking later on.

This PR is still exploratory, so there is a bit of benchmarking and cleaning up to do.

EDIT: I think this PR might be ready

  • PRO1: faster code with a computational saving that might scale much more efficiently for larger kernels
  • PRO2: same behavior for immersed and non-immersed grids (previously halos in Bounded directions were considered inactive in an immersed grid and active in a non-immersed one, which might have made it difficult to do open boundaries for immersed boundary grids)
  • PRO3: much simpler code for the reduction of the order, which requires less metaprogramming.
  • PRO4: this PR exposed a bug in the distributed computations, where corners were not exchanged for pencil configurations near a bounded wall. This is fixed here, because of PRO3, but it might be a bit more difficult in main to find and fix.
  • CON1: not possible to mix and match different schemes together (for example, having WENO that falls back to Centered near the boundaries)
  • CON2: more difficult to implement higher orders (in main it is sufficient to add a number to the buffers and add the specific coefficients, here you would have to add also the coefficients for the intra-order reconstructions). I am not sure how important this CON is since I don't envision using advection orders higher than 10 for centered and 9 for upwind schemes
  • CON3: a more complicated (and lengthy) code for the WENO coefficients.

Removes unused features:

  • orders 12 for centered and 11 for upwind (never used, unstable, and just a waste of space)
  • vestigial code for stretched advection (now we only have constant coefficients, but some code to enable stretched advection remained after Remove stretched advection #4411)

@simone-silvestri
Copy link
Copy Markdown
Collaborator Author

Ah this was me not Claude 😅

@glwagner
Copy link
Copy Markdown
Member

@simone-silvestri in addition to the final report, please also ask for a write up of the workflow (ie the strategy used to get to this result) which can then be re-used in the future.

@glwagner
Copy link
Copy Markdown
Member

fairly modest results eh? is that surprising that its only 1.3x? still a win, but reading the report up to the final result, I would have expected a breakthrough rather than incremental improvment.

@simone-silvestri
Copy link
Copy Markdown
Collaborator Author

@simone-silvestri in addition to the final report, please also ask for a write up of the workflow (ie the strategy used to get to this result) which can then be re-used in the future.

this is still human-made :) (I hadn't discover claude at the time), so I made claude only check registers and benchmark.
However, I have put claude to work on improving the performance based on this PR, but I wanted to leave it for a different PR.

fairly modest results eh? is that surprising that its only 1.3x? still a win, but reading the report up to the final result, I would have expected a breakthrough rather than incremental improvment.

The problem is that the computations are offset by extra registers required for the selp intructions, but still improvement can be achieved. For example, in the aforementioned ulterior optimization, claude identified kernel splitting between advection and non-advection to reduce the register pressure (that increased in this PR) and the interior - boundary kernel split (based on the stencil size required) which @jackdfranklin had proposed (so I ll leave that PR to him :) ). That works quite nicely and should buy us another 70%! (but only on immersed grids). Then we should move on from optimizing advection and focus on the rest of the model.

@simone-silvestri
Copy link
Copy Markdown
Collaborator Author

I think this PR might be ready for review / merging. We might want to wait for the detailed benchmarks to kick off the optimization :)

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oceananigans.jl Benchmarks

Details
Benchmark suite Current: 4454681 Previous: 244ceed Ratio
Default/tripolar 360x180x50 F64/NVIDIA TITAN V/default 0.09555953858 s/timestep 0.09036038452 s/timestep 1.06
NSYS Kernels/EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr/NVIDIA TITAN V/gpu_compute_hydrostatic_free_surface_Gv_ 4.097272 ms (median GPU time) 3.7828245 ms (median GPU time) 1.08
NSYS Kernels/EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr/NVIDIA TITAN V/gpu__rk_substep_turbulent_kinetic_energy_ 4.012617 ms (median GPU time) 4.01252 ms (median GPU time) 1.00
NSYS Kernels/EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr/NVIDIA TITAN V/gpu_compute_hydrostatic_free_surface_Gu_ 3.923497 ms (median GPU time) 3.9576715 ms (median GPU time) 0.99
NSYS Kernels/EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr/NVIDIA TITAN V/gpu_compute_hydrostatic_free_surface_Gc_ 3.035567 ms (median GPU time) 2.557426 ms (median GPU time) 1.19
NSYS Kernels/EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr/NVIDIA TITAN V/gpu_compute_hydrostatic_free_surface_Gc_ 3.021167 ms (median GPU time) 2.557426 ms (median GPU time) 1.18
NSYS Kernels/EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr/NVIDIA TITAN V/gpu_compute_hydrostatic_free_surface_Gc_ 3.021998 ms (median GPU time) 2.557426 ms (median GPU time) 1.18
NSYS Kernels/EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr/NVIDIA TITAN V/gpu_compute_CATKE_closure_fields_ 1.618055 ms (median GPU time) 1.618183 ms (median GPU time) 1.00
NSYS Kernels/EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr/NVIDIA TITAN V/gpu_compute_TKE_diffusivity_ 1.271225 ms (median GPU time) 1.270968 ms (median GPU time) 1.00
NSYS Kernels/EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr/NVIDIA TITAN V/gpu__compute_w_from_continuity_ 0.340831 ms (median GPU time) 0.340702 ms (median GPU time) 1.00
NSYS Kernels/EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr/NVIDIA TITAN V/gpu_solve_batched_tridiagonal_system_kernel_ 0.526332 ms (median GPU time) 0.525693 ms (median GPU time) 1.00
Resolution Sweep/tripolar F64 WENOVectorInvariantDefault+WENO7 CATKE/NVIDIA TITAN V/180x90x50 0.03455006025 s/timestep 0.02946350052 s/timestep 1.17
Resolution Sweep/tripolar F64 WENOVectorInvariantDefault+WENO7 CATKE/NVIDIA TITAN V/720x360x50 0.37494769731 s/timestep 0.35170072867 s/timestep 1.07
Float Type Sweep/tripolar 360x180x50 WENOVectorInvariantDefault+WENO7 CATKE/NVIDIA TITAN V/F32 0.059498087330000005 s/timestep 0.06635712049 s/timestep 0.90
Closure Sweep/tripolar 360x180x50 F64 WENOVectorInvariantDefault+WENO7/NVIDIA TITAN V/nothing 0.05293816465 s/timestep 0.04941735279000001 s/timestep 1.07
Closure Sweep/tripolar 360x180x50 F64 WENOVectorInvariantDefault+WENO7/NVIDIA TITAN V/CATKE+Biharmonic 0.11781581828 s/timestep 0.11839795846 s/timestep 1.00
Closure Sweep/tripolar 360x180x50 F64 WENOVectorInvariantDefault+WENO7/NVIDIA TITAN V/CATKE+GM+Biharmonic 0.30343383327 s/timestep 0.29974474885 s/timestep 1.01
Advection Sweep/tripolar 360x180x50 F64 CATKE/NVIDIA TITAN V/nothing+nothing 0.048168101500000005 s/timestep 0.04813939653 s/timestep 1.00
Advection Sweep/tripolar 360x180x50 F64 CATKE/NVIDIA TITAN V/WENOVectorInvariant5+WENO5 0.07725331778 s/timestep 0.07597942123 s/timestep 1.02
Advection Sweep/tripolar 360x180x50 F64 CATKE/NVIDIA TITAN V/WENOVectorInvariant9+WENO9 0.13591710038000002 s/timestep 0.12939755415 s/timestep 1.05
Grid Type Sweep/360x180x50 F64 WENOVectorInvariantDefault+WENO7 CATKE/NVIDIA TITAN V/lat_lon_zstar 0.10406333887000001 s/timestep 0.10060468569999999 s/timestep 1.03
Grid Type Sweep/360x180x50 F64 WENOVectorInvariantDefault+WENO7 CATKE/NVIDIA TITAN V/immersed_lat_lon_zstar 0.10305966120000001 s/timestep 0.09736463723 s/timestep 1.06
Grid Type Sweep/360x180x50 F64 WENOVectorInvariantDefault+WENO7 CATKE/NVIDIA TITAN V/tripolar_zstar 0.10329602079000001 s/timestep 0.09562428461 s/timestep 1.08
Grid Type Sweep/360x180x50 F64 WENOVectorInvariantDefault+WENO7 CATKE/NVIDIA TITAN V/lat_lon 0.09008006581 s/timestep 0.0865149562 s/timestep 1.04
Grid Type Sweep/360x180x50 F64 WENOVectorInvariantDefault+WENO7 CATKE/NVIDIA TITAN V/immersed_lat_lon 0.09421945259999999 s/timestep 0.08869534321 s/timestep 1.06
Tracer Count Sweep/tripolar 360x180x50 F64 CATKE/NVIDIA TITAN V/3 tracers 0.10641676138999999 s/timestep 0.09977233119 s/timestep 1.07
Resolution Sweep/tripolar F64 WENOVectorInvariantDefault+WENO7 CATKE/NVIDIA TITAN V/360x180x50 0.09555953858 s/timestep 0.09036038452 s/timestep 1.06
Float Type Sweep/tripolar 360x180x50 WENOVectorInvariantDefault+WENO7 CATKE/NVIDIA TITAN V/F64 0.09555953858 s/timestep 0.09036038452 s/timestep 1.06
Closure Sweep/tripolar 360x180x50 F64 WENOVectorInvariantDefault+WENO7/NVIDIA TITAN V/CATKE 0.09555953858 s/timestep 0.09036038452 s/timestep 1.06
Advection Sweep/tripolar 360x180x50 F64 CATKE/NVIDIA TITAN V/WENOVectorInvariantDefault+WENO7 0.09555953858 s/timestep 0.09036038452 s/timestep 1.06
Grid Type Sweep/360x180x50 F64 WENOVectorInvariantDefault+WENO7 CATKE/NVIDIA TITAN V/tripolar 0.09555953858 s/timestep 0.09036038452 s/timestep 1.06
Tracer Count Sweep/tripolar 360x180x50 F64 CATKE/NVIDIA TITAN V/2 tracers 0.09555953858 s/timestep 0.09036038452 s/timestep 1.06

This comment was automatically generated by workflow using github-action-benchmark.

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Oceananigans.jl Benchmarks'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.10.

Benchmark suite Current: 4454681 Previous: 244ceed Ratio
NSYS Kernels/EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr/NVIDIA TITAN V/gpu_compute_hydrostatic_free_surface_Gc_ 3.035567 ms (median GPU time) 2.557426 ms (median GPU time) 1.19
NSYS Kernels/EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr/NVIDIA TITAN V/gpu_compute_hydrostatic_free_surface_Gc_ 3.021167 ms (median GPU time) 2.557426 ms (median GPU time) 1.18
NSYS Kernels/EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr/NVIDIA TITAN V/gpu_compute_hydrostatic_free_surface_Gc_ 3.021998 ms (median GPU time) 2.557426 ms (median GPU time) 1.18
Resolution Sweep/tripolar F64 WENOVectorInvariantDefault+WENO7 CATKE/NVIDIA TITAN V/180x90x50 0.03455006025 s/timestep 0.02946350052 s/timestep 1.17

This comment was automatically generated by workflow using github-action-benchmark.

@simone-silvestri
Copy link
Copy Markdown
Collaborator Author

unfortunately, given the evidence that this PR does not help at all performance, I will close it

@simone-silvestri
Copy link
Copy Markdown
Collaborator Author

simone-silvestri commented May 8, 2026

I reopened this PR to check if the enum PR changes the performance diff of this branch. Supposedly that PR reduced the register usage in the advection computation which should benefit greatly the register pressure of this PR that reduced computation in exchange for a slightly larger register pressure.
If the benchmarking results are bad as per the previous commits, I ll close this PR once again

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

Benchmark Comparison

Benchmark Comparison: PR vs Main

Benchmark PR (pts/s) Main (pts/s) Change
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 55401674.336 57689338.037 -4%
EarthOcean_tripolar_180x90x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 26066498.845 24790952.253 +5.1%
EarthOcean_tripolar_720x360x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 57973827.247 60340729.652 -3.9%
EarthOcean_tripolar_360x180x50_F32_WENOVectorInvariantDefault_WENO7_CATKE_2tr 67535959.803 64756664.503 +4.3%
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_nothing_2tr 102108289.86 101101805.76 +1%
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE+Biharmonic_2tr 39313455.507 40872705.819 -3.8%
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE+GM+Biharmonic_2tr 11934666.013 12469255.469 -4.3%
EarthOcean_tripolar_360x180x50_F64_nothing_nothing_CATKE_2tr 86983210.433 87263770.237 -0.3%
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariant5_WENO5_CATKE_2tr 62484721.774 65098048.655 -4%
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariant9_WENO9_CATKE_2tr 27832578.738 42996896.979 -35.3%
EarthOcean_lat_lon_zstar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 46138845.238 46233818.429 -0.2%
EarthOcean_immersed_lat_lon_zstar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 51311513.934 53263154.800 -3.7%
EarthOcean_tripolar_zstar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 49400829.965 51442061.219 -4%
EarthOcean_lat_lon_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 57958325.725 56420068.334 +2.7%
EarthOcean_immersed_lat_lon_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 57226780.276 59887727.888 -4.4%
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_3tr 51255159.270 53930325.901 -5%

NSYS Kernel Profiling

EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr

Kernel Median (ms) Main (ms) Change Instances Avg (ms) Min (ms) Max (ms)
gpu_compute_hydrostatic_free_surface_Gv_ 2.450 2.172 +12.8% 318 2.456 2.440 2.636
gpu_compute_hydrostatic_free_surface_Gu_ 2.236 2.455 -8.9% 318 2.242 2.229 2.434
gpu__rk_substep_turbulent_kinetic_energy_ 1.985 1.985 -0.0% 315 1.990 1.981 2.170
gpu_compute_CATKE_closure_fields_ 1.597 1.597 -0.0% 318 1.601 1.589 1.734
gpu_compute_hydrostatic_free_surface_Gc_ 1.244 0.977 +27.3% 315 1.247 1.241 1.351
gpu_compute_hydrostatic_free_surface_Gc_ 1.229 0.977 +25.8% 315 1.232 1.227 1.335
gpu_compute_hydrostatic_free_surface_Gc_ 1.227 0.977 +25.6% 315 1.230 1.224 1.332
gpu__compute_w_from_continuity_ 0.339 0.339 -0.1% 633 0.339 0.335 0.347
gpu_compute_TKE_diffusivity_ 0.644 0.644 +0.0% 315 0.645 0.639 0.694
gpu__compute_split_explicit_transport_velocities_ 0.480 0.479 +0.0% 315 0.480 0.476 0.484

@simone-silvestri
Copy link
Copy Markdown
Collaborator Author

actually it's still worse. I ll close again

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

benchmark performance runs preconfigured benchamarks and spits out timing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants