A rework for the advection module that improves WENO performance#4434
A rework for the advection module that improves WENO performance#4434simone-silvestri wants to merge 354 commits into
Conversation
|
Ah this was me not Claude 😅 |
|
@simone-silvestri in addition to the final report, please also ask for a write up of the workflow (ie the strategy used to get to this result) which can then be re-used in the future. |
|
fairly modest results eh? is that surprising that its only 1.3x? still a win, but reading the report up to the final result, I would have expected a breakthrough rather than incremental improvment. |
this is still human-made :) (I hadn't discover claude at the time), so I made claude only check registers and benchmark.
The problem is that the computations are offset by extra registers required for the selp intructions, but still improvement can be achieved. For example, in the aforementioned ulterior optimization, claude identified kernel splitting between advection and non-advection to reduce the register pressure (that increased in this PR) and the interior - boundary kernel split (based on the stencil size required) which @jackdfranklin had proposed (so I ll leave that PR to him :) ). That works quite nicely and should buy us another 70%! (but only on immersed grids). Then we should move on from optimizing advection and focus on the rest of the model. |
|
I think this PR might be ready for review / merging. We might want to wait for the detailed benchmarks to kick off the optimization :) |
There was a problem hiding this comment.
Oceananigans.jl Benchmarks
Details
| Benchmark suite | Current: 4454681 | Previous: 244ceed | Ratio |
|---|---|---|---|
Default/tripolar 360x180x50 F64/NVIDIA TITAN V/default |
0.09555953858 s/timestep |
0.09036038452 s/timestep |
1.06 |
NSYS Kernels/EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr/NVIDIA TITAN V/gpu_compute_hydrostatic_free_surface_Gv_ |
4.097272 ms (median GPU time) |
3.7828245 ms (median GPU time) |
1.08 |
NSYS Kernels/EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr/NVIDIA TITAN V/gpu__rk_substep_turbulent_kinetic_energy_ |
4.012617 ms (median GPU time) |
4.01252 ms (median GPU time) |
1.00 |
NSYS Kernels/EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr/NVIDIA TITAN V/gpu_compute_hydrostatic_free_surface_Gu_ |
3.923497 ms (median GPU time) |
3.9576715 ms (median GPU time) |
0.99 |
NSYS Kernels/EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr/NVIDIA TITAN V/gpu_compute_hydrostatic_free_surface_Gc_ |
3.035567 ms (median GPU time) |
2.557426 ms (median GPU time) |
1.19 |
NSYS Kernels/EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr/NVIDIA TITAN V/gpu_compute_hydrostatic_free_surface_Gc_ |
3.021167 ms (median GPU time) |
2.557426 ms (median GPU time) |
1.18 |
NSYS Kernels/EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr/NVIDIA TITAN V/gpu_compute_hydrostatic_free_surface_Gc_ |
3.021998 ms (median GPU time) |
2.557426 ms (median GPU time) |
1.18 |
NSYS Kernels/EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr/NVIDIA TITAN V/gpu_compute_CATKE_closure_fields_ |
1.618055 ms (median GPU time) |
1.618183 ms (median GPU time) |
1.00 |
NSYS Kernels/EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr/NVIDIA TITAN V/gpu_compute_TKE_diffusivity_ |
1.271225 ms (median GPU time) |
1.270968 ms (median GPU time) |
1.00 |
NSYS Kernels/EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr/NVIDIA TITAN V/gpu__compute_w_from_continuity_ |
0.340831 ms (median GPU time) |
0.340702 ms (median GPU time) |
1.00 |
NSYS Kernels/EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr/NVIDIA TITAN V/gpu_solve_batched_tridiagonal_system_kernel_ |
0.526332 ms (median GPU time) |
0.525693 ms (median GPU time) |
1.00 |
Resolution Sweep/tripolar F64 WENOVectorInvariantDefault+WENO7 CATKE/NVIDIA TITAN V/180x90x50 |
0.03455006025 s/timestep |
0.02946350052 s/timestep |
1.17 |
Resolution Sweep/tripolar F64 WENOVectorInvariantDefault+WENO7 CATKE/NVIDIA TITAN V/720x360x50 |
0.37494769731 s/timestep |
0.35170072867 s/timestep |
1.07 |
Float Type Sweep/tripolar 360x180x50 WENOVectorInvariantDefault+WENO7 CATKE/NVIDIA TITAN V/F32 |
0.059498087330000005 s/timestep |
0.06635712049 s/timestep |
0.90 |
Closure Sweep/tripolar 360x180x50 F64 WENOVectorInvariantDefault+WENO7/NVIDIA TITAN V/nothing |
0.05293816465 s/timestep |
0.04941735279000001 s/timestep |
1.07 |
Closure Sweep/tripolar 360x180x50 F64 WENOVectorInvariantDefault+WENO7/NVIDIA TITAN V/CATKE+Biharmonic |
0.11781581828 s/timestep |
0.11839795846 s/timestep |
1.00 |
Closure Sweep/tripolar 360x180x50 F64 WENOVectorInvariantDefault+WENO7/NVIDIA TITAN V/CATKE+GM+Biharmonic |
0.30343383327 s/timestep |
0.29974474885 s/timestep |
1.01 |
Advection Sweep/tripolar 360x180x50 F64 CATKE/NVIDIA TITAN V/nothing+nothing |
0.048168101500000005 s/timestep |
0.04813939653 s/timestep |
1.00 |
Advection Sweep/tripolar 360x180x50 F64 CATKE/NVIDIA TITAN V/WENOVectorInvariant5+WENO5 |
0.07725331778 s/timestep |
0.07597942123 s/timestep |
1.02 |
Advection Sweep/tripolar 360x180x50 F64 CATKE/NVIDIA TITAN V/WENOVectorInvariant9+WENO9 |
0.13591710038000002 s/timestep |
0.12939755415 s/timestep |
1.05 |
Grid Type Sweep/360x180x50 F64 WENOVectorInvariantDefault+WENO7 CATKE/NVIDIA TITAN V/lat_lon_zstar |
0.10406333887000001 s/timestep |
0.10060468569999999 s/timestep |
1.03 |
Grid Type Sweep/360x180x50 F64 WENOVectorInvariantDefault+WENO7 CATKE/NVIDIA TITAN V/immersed_lat_lon_zstar |
0.10305966120000001 s/timestep |
0.09736463723 s/timestep |
1.06 |
Grid Type Sweep/360x180x50 F64 WENOVectorInvariantDefault+WENO7 CATKE/NVIDIA TITAN V/tripolar_zstar |
0.10329602079000001 s/timestep |
0.09562428461 s/timestep |
1.08 |
Grid Type Sweep/360x180x50 F64 WENOVectorInvariantDefault+WENO7 CATKE/NVIDIA TITAN V/lat_lon |
0.09008006581 s/timestep |
0.0865149562 s/timestep |
1.04 |
Grid Type Sweep/360x180x50 F64 WENOVectorInvariantDefault+WENO7 CATKE/NVIDIA TITAN V/immersed_lat_lon |
0.09421945259999999 s/timestep |
0.08869534321 s/timestep |
1.06 |
Tracer Count Sweep/tripolar 360x180x50 F64 CATKE/NVIDIA TITAN V/3 tracers |
0.10641676138999999 s/timestep |
0.09977233119 s/timestep |
1.07 |
Resolution Sweep/tripolar F64 WENOVectorInvariantDefault+WENO7 CATKE/NVIDIA TITAN V/360x180x50 |
0.09555953858 s/timestep |
0.09036038452 s/timestep |
1.06 |
Float Type Sweep/tripolar 360x180x50 WENOVectorInvariantDefault+WENO7 CATKE/NVIDIA TITAN V/F64 |
0.09555953858 s/timestep |
0.09036038452 s/timestep |
1.06 |
Closure Sweep/tripolar 360x180x50 F64 WENOVectorInvariantDefault+WENO7/NVIDIA TITAN V/CATKE |
0.09555953858 s/timestep |
0.09036038452 s/timestep |
1.06 |
Advection Sweep/tripolar 360x180x50 F64 CATKE/NVIDIA TITAN V/WENOVectorInvariantDefault+WENO7 |
0.09555953858 s/timestep |
0.09036038452 s/timestep |
1.06 |
Grid Type Sweep/360x180x50 F64 WENOVectorInvariantDefault+WENO7 CATKE/NVIDIA TITAN V/tripolar |
0.09555953858 s/timestep |
0.09036038452 s/timestep |
1.06 |
Tracer Count Sweep/tripolar 360x180x50 F64 CATKE/NVIDIA TITAN V/2 tracers |
0.09555953858 s/timestep |
0.09036038452 s/timestep |
1.06 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'Oceananigans.jl Benchmarks'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.10.
| Benchmark suite | Current: 4454681 | Previous: 244ceed | Ratio |
|---|---|---|---|
NSYS Kernels/EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr/NVIDIA TITAN V/gpu_compute_hydrostatic_free_surface_Gc_ |
3.035567 ms (median GPU time) |
2.557426 ms (median GPU time) |
1.19 |
NSYS Kernels/EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr/NVIDIA TITAN V/gpu_compute_hydrostatic_free_surface_Gc_ |
3.021167 ms (median GPU time) |
2.557426 ms (median GPU time) |
1.18 |
NSYS Kernels/EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr/NVIDIA TITAN V/gpu_compute_hydrostatic_free_surface_Gc_ |
3.021998 ms (median GPU time) |
2.557426 ms (median GPU time) |
1.18 |
Resolution Sweep/tripolar F64 WENOVectorInvariantDefault+WENO7 CATKE/NVIDIA TITAN V/180x90x50 |
0.03455006025 s/timestep |
0.02946350052 s/timestep |
1.17 |
This comment was automatically generated by workflow using github-action-benchmark.
|
unfortunately, given the evidence that this PR does not help at all performance, I will close it |
|
I reopened this PR to check if the |
Benchmark ComparisonBenchmark Comparison: PR vs Main
NSYS Kernel ProfilingEarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr
|
|
actually it's still worse. I ll close again |
This PR reworks how advection is implemented by passing a
reduced_orderinteger to theinterpolatefunctions.At the moment, upwind and centered reconstruction do not formally change; the ifelse that was upstream before in the
alt_interpolatefunctions is just pushed downstream in the reconstruction function.For WENO, the reconstruction approach changes because the order is accounted for by changing the reconstruction coefficients and the smoothness coefficients.
This method should give us a boost in performance in the WENO case for bounded and immersed boundary grids, as there is much less computation to be performed. I will post some benchmarking later on.
This PR is still exploratory, so there is a bit of benchmarking and cleaning up to do.EDIT: I think this PR might be ready
Boundeddirections were considered inactive in an immersed grid and active in a non-immersed one, which might have made it difficult to do open boundaries for immersed boundary grids)WENOthat falls back toCenterednear the boundaries)Removes unused features:
orders 12 for centered and 11 for upwind (never used, unstable, and just a waste of space)