MPI for 2D parabolic system on conforming P4est mesh #2886

Merged
ranocha merged 29 commits into trixi-framework:main from TJP-Karpowski:MPI_P4est_Parabolic2D_clean
Apr 20, 2026

Conversation

@TJP-Karpowski
Contributor

@TJP-Karpowski TJP-Karpowski commented Mar 26, 2026

MPI for 2D parabolic system on conforming P4est mesh

The PR adds MPI support for the parabolic rhs on conforming 2D P4est meshes. Contrary to #2880, the cache_parabolic is not extended; the hyperbolic cache is reused.

Multiple existing testcases are repeated within the MPI tests and return the same results. Most notably, the surface integrals of the elixir_navierstokes_NACA0012airfoil_mach08 testcases return the same values as in the local version. The analysis of surface integrals is extended to include an MPI reduce on parallel P4est (and T8code) grids, which enables this analysis in parallel.

Based on this PR and #2881, the method will be extended to allow for AMR and MPI mortars.
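
The MPI reduce for surface integrals mentioned above can be illustrated with a minimal sketch (the names and values below are hypothetical, not Trixi.jl's actual API; plain partial sums stand in for the per-rank boundary integrals, and `sum` stands in for `MPI.Allreduce` with `+`):

```julia
# Hypothetical sketch: each rank integrates the quantity of interest over
# the boundary faces it owns; a sum-reduction across all ranks then yields
# the global surface integral. Four plain partial sums mimic four ranks so
# this runs without MPI; in the parallel code, `sum` would be
# MPI.Allreduce(local_integral, +, comm).
partial_integrals = [0.125, 0.250, 0.375, 0.250]  # per-rank contributions
global_integral = sum(partial_integrals)
@assert global_integral ≈ 1.0  # every rank agrees on the reduced value
```

With an allreduce, every rank holds the same global value afterwards, which is what lets each rank report identical surface-integral output.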

Disclaimer

LLMs have been used to aid in the PR.

Funding Statement

This work has been funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project Number 237267381 – TRR 150.

@github-actions
Contributor

Review checklist

This checklist is meant to assist creators of PRs (to let them know what reviewers will typically look for) and reviewers (to guide them in a structured review process). Items do not need to be checked explicitly for a PR to be eligible for merging.

Purpose and scope

  • The PR has a single goal that is clear from the PR title and/or description.
  • All code changes represent a single set of modifications that logically belong together.
  • No more than 500 lines of code are changed or there is no obvious way to split the PR into multiple PRs.

Code quality

  • The code can be understood easily.
  • Newly introduced names for variables etc. are self-descriptive and consistent with existing naming conventions.
  • There are no redundancies that can be removed by simple modularization/refactoring.
  • There are no leftover debug statements or commented code sections.
  • The code adheres to our conventions and style guide, and to the Julia guidelines.

Documentation

  • New functions and types are documented with a docstring or top-level comment.
  • Relevant publications are referenced in docstrings (see example for formatting).
  • Inline comments are used to document longer or unusual code sections.
  • Comments describe intent ("why?") and not just functionality ("what?").
  • If the PR introduces a significant change or new feature, it is documented in NEWS.md with its PR number.

Testing

  • The PR passes all tests.
  • New or modified lines of code are covered by tests.
  • New or modified tests run in less than 10 seconds.

Performance

  • There are no type instabilities or memory allocations in performance-critical parts.
  • If the PR intent is to improve performance, before/after time measurements are posted in the PR.

Verification

  • The correctness of the code was verified using appropriate tests.
  • If new equations/methods are added, a convergence test has been run and the results
    are posted in the PR.

Created with ❤️ by the Trixi.jl community.

@codecov

codecov Bot commented Mar 26, 2026

Codecov Report

❌ Patch coverage is 98.30508% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.08%. Comparing base (b02a8a4) to head (e95d55f).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
...rc/solvers/dgsem_p4est/dg_2d_parabolic_parallel.jl 98.27% 3 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##             main    #2886    +/-   ##
========================================
  Coverage   97.08%   97.08%            
========================================
  Files         621      622     +1     
  Lines       48045    48222   +177     
========================================
+ Hits        46642    46816   +174     
- Misses       1403     1406     +3     
Flag Coverage Δ
unittests 97.08% <98.31%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

@TJP-Karpowski
Contributor Author

Mmm, the tests seem to fail due to excessive allocations. Looking at the allocations from the Analysis output, I would say the MPI allocations are similar on the parabolic and hyperbolic sides (with the parabolic side, of course, needing more communication). Thus, I would increase the acceptance threshold for these cases, but I may be misunderstanding the issue, so I am happy to receive a second opinion.

Section ncalls time %tot avg alloc %tot avg
───────────────────────────────────────────────────────────────────────────────────────────────
parabolic rhs! 75 246ms 69.6% 3.28ms 125KiB 11.0% 1.66KiB
calculate gradient local 75 79.8ms 22.5% 1.06ms 5.88KiB 0.5% 80.2B
calculate gradient local 75 144μs 0.0% 1.92μs 5.88KiB 0.5% 80.2B
finish MPI receive divergence 75 2.61ms 0.7% 34.8μs 17.6KiB 1.6% 240B
parabolic rhs! 75 506μs 0.1% 6.75μs 20.3KiB 1.8% 278B
finish MPI receive gradient 75 481μs 0.1% 6.42μs 17.6KiB 1.6% 240B
prolong2mpiinterfaces gradient 75 412μs 0.1% 5.50μs 0.00B 0.0% 0.00B
start MPI send divergence 75 296μs 0.1% 3.94μs 9.38KiB 0.8% 128B
start MPI send gradient 75 283μs 0.1% 3.77μs 9.38KiB 0.8% 128B
finish MPI send gradient 75 60.7μs 0.0% 809ns 12.9KiB 1.1% 176B
finish MPI send divergence 75 59.4μs 0.0% 792ns 12.9KiB 1.1% 176B
start MPI receive gradient 75 58.9μs 0.0% 785ns 9.38KiB 0.8% 128B
start MPI receive divergence 75 53.5μs 0.0% 714ns 9.38KiB 0.8% 128B
rhs! 75 88.6ms 25.0% 1.18ms 64.4KiB 5.7% 880B
finish MPI receive 75 428μs 0.1% 5.71μs 17.6KiB 1.6% 240B
rhs! 75 341μs 0.1% 4.55μs 15.2KiB 1.3% 208B
start MPI send 75 318μs 0.1% 4.24μs 9.38KiB 0.8% 128B
finish MPI send 75 83.6μs 0.0% 1.11μs 12.9KiB 1.1% 176B
start MPI receive 75 48.2μs 0.0% 642ns 9.38KiB 0.8% 128B
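
The timer sections above reflect a two-phase exchange: gradient data is communicated first, then divergence data. A runnable sketch of that ordering (the function names are illustrative stubs, not Trixi.jl internals; the real code posts nonblocking MPI sends/receives so that local work overlaps communication):

```julia
# Stub functions record the call order of the two communication phases.
calls = String[]
phase!(name) = push!(calls, name)

function parabolic_rhs_sketch!()
    phase!("start MPI receive gradient")
    phase!("start MPI send gradient")
    phase!("calculate gradient local")     # local work overlaps the exchange
    phase!("finish MPI receive gradient")  # remote data enters the gradient
    phase!("start MPI receive divergence")
    phase!("start MPI send divergence")
    phase!("calculate divergence local")
    phase!("finish MPI receive divergence")
    return nothing
end

parabolic_rhs_sketch!()
# The gradient exchange completes before any divergence data is sent:
@assert findfirst(==("finish MPI receive gradient"), calls) <
        findfirst(==("start MPI send divergence"), calls)
```

Two rounds of sends/receives per rhs evaluation are also why the parabolic side needs roughly twice the communication-related allocations of the hyperbolic side, as the numbers above suggest.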

Member

@ranocha ranocha left a comment


Thanks a lot for this contribution! We will review the implementation in detail later.

Please consider adding your name to https://github.com/trixi-framework/Trixi.jl/blob/main/AUTHORS.md.

@ranocha ranocha requested a review from DanielDoehring April 8, 2026 04:58
Member

@DanielDoehring DanielDoehring left a comment


Looks already quite good!

So you can reuse the "hyperbolic" MPI interfaces, i.e., need no new data structures? That would be very nice.

Did you test this implementation on a genuine distributed memory system, i.e., on a cluster with really different sockets requested?

TJP-Karpowski and others added 2 commits April 8, 2026 10:57
Removed dispatch to unsupported T8codeMeshParallel

Co-authored-by: Daniel Doehring <doehringd2@gmail.com>
@TJP-Karpowski
Contributor Author

Looks already quite good!

So you can reuse the "hyperbolic" MPI interfaces, i.e., need no new data structures? That would be very nice.

Did you test this implementation on a genuine distributed memory system, i.e., on a cluster with really different sockets requested?

Yes, it reuses the "hyperbolic" MPI interfaces. I have already tested the 3D version in multi-node simulations and it works, but I still need to add a speed-up test to ensure it performs reasonably. I can do that for one of the testcases.

Member

@ranocha ranocha left a comment


Thanks a lot for your contribution!

TJP-Karpowski and others added 2 commits April 8, 2026 19:56
adhere to mpi internal functions

Co-authored-by: Hendrik Ranocha <ranocha@users.noreply.github.com>
@DanielDoehring
Member

DanielDoehring commented Apr 10, 2026

Thanks a lot for this detailed investigation!

Can you try one rank per node (that would be the configuration with the lowest amount of communication) and multiple ranks (say, 2, 4, 8)?
And for threads it probably suffices to stop at 16 or 24 threads per rank/node, as we know that memory bandwidth then becomes the limiting factor and no more speedup is really expected.

@TJP-Karpowski
Contributor Author

To clarify: you want to see N = R (R/N = 1) with T fixed at, let's say, T = 16, and then vary N from 1 to 12? This way I would only saturate ~12.5% of the available cores per node. Furthermore, I would keep the problem size the same. To better understand your question: what information do you hope to gain from this comparison? I presume the idea is to differentiate more clearly between memory limits and compute limits?

@DanielDoehring
Member

So I do not see any good reason why, in a practical simulation, I would have more than one rank per shared-memory unit (node). So I am interested in that particular case.

About problem size: I guess keeping the problem size fixed for the moment (strong scaling) is fine, although you could also try increasing the problem size (weak scaling) - but maybe start with something smaller in the first place :)

@TJP-Karpowski
Contributor Author

TJP-Karpowski commented Apr 10, 2026

Okay, I am a bit surprised. Coming from pure MPI parallelization, I would always try to saturate (close to) all CPUs. Comparing R/N=8 to R/N=16 suggests that R/N=16 is always faster, and from the PID slopes I would not expect R/N=8 to ever catch up. But I will try to see whether lower CPU utilization is offset by better communication in these cases.

@DanielDoehring
Member

Yeah if you only have MPI parallelism there is no way around having more than one rank per node.

Concerning saturating a node: unless you request exclusive node access, you should only be billed for the cores/threads you request, right? So in that sense it seems natural to me to request more nodes.

As a disclaimer: I did not run any simulations beyond 4 ranks or so with Trixi.jl, so it would be good to get some info from @sloede on this matter.

@TJP-Karpowski
Contributor Author

The PC2 cluster sets exclusive job allocation by default, and similar defaults exist on other clusters, so billing-wise I need to aim for full node allocation here too. Furthermore, you could still suffer the communication overhead of someone else's job. I have added the R/N=1, T=16 line to the figures, but needed to re-evaluate at 1500 iterations as I did not adjust the time limit. It scales roughly with 12.5% N, so it looks like it still scales similarly on one node as the others. This also suggests to me that there is no significant downside to having multiple MPI ranks per node, as otherwise I would expect this line to start higher than 12.5%. In fact, for N=1 there is only a 0.1 speedup relative to full saturation.

I also added the scaling of this setup in isolation in reference to R=N=1 T=16. It also scales similarly.
[Figures: speedupNodes, speedupNodes_1only]

@sloede
Member

sloede commented Apr 11, 2026

I'll try to take a look at this as soon as possible next week (though I have a proposal deadline on Thursday, but I'll try my best)

sloede previously approved these changes Apr 17, 2026
Member

@sloede sloede left a comment


As far as I have checked, the code looks good to me, thanks a lot! This is a very nice extension indeed, and sets up nicely for 3D support as well 💪

Kudos also on the MPI cache reuse!

One additional question: have you also run a comparison of at least one non-trivial setup in parallel and verified that you get exactly the same results as in the serial case, i.e., binary-identical error norms? IIRC, this should be the case at least for the hyperbolic MPI implementation, and probably also for the BR1 implementation (since it is symmetric), but does it also work for LDG?

If you haven't, it would be good to at least run one of the simulations a bit longer than in the tests to ensure that there is no funny business going on that will manifest itself only after more than just a few time steps.

start_mpi_receive!(cache.mpi_cache)
end

@assert isempty(eachmortar(dg, cache)) "Nonconforming meshes are not yet supported on MPI parallel P4estMesh."
Member

Is this check sufficient to guarantee no broken simulations? What if there are only mortars at an MPI interface - would this also be caught?

Note that if it is not, I think it would be OK to merge this anyway, but it should then be addressed in the next PR (which I assume #2888 will do).

Contributor Author

If the mortars are not on the MPI interface, the code should still work, as it reuses the same serial or threaded mortar treatment. As the MPI mortars are already working in #2888, I would also not worry too much.

I have not run this 2D version for longer yet, but the implementation mirrors the 3D implementation from #2880. For the 3D cases I simulated the Taylor-Green vortex and compared K and epsilon, which matched perfectly back then:

[Figure: 3D_TGV_comparison]

The DNS data from Zirwes et al. (2023), which I had lying around, is just for reference and plausibility. The second row shows total dissipation (-dK/dt), while the third shows resolved and numerical dissipation. The curves match what I would expect for an implicit LES, and more importantly, both curves match perfectly. Regarding the LDG method, I did not try that one. As I am not very familiar with the DG method, could you suggest one of the examples for which both exist, so I can try it? Maybe I can rerun the lid-driven cavity?

Member

If the mortars are not on the mpi interface the code should still work as it reuses the same serial or threaded mortar treatment. As the mpi-mortars are already working in #2888 I would also not worry to much.

I agree!

Member

I have not run this 2D version for longer yet.

Could you maybe just run a comparison of, e.g., the 2D lid-driven cavity case: once in serial for 100 time steps, once with MPI in parallel, and then compare the L2/Linf errors? IMHO they should be identical.
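
A minimal sketch of such a check (the error values below are invented placeholders; in practice they would come from the AnalysisCallback output of a serial and an MPI run):

```julia
# Hypothetical comparison of L2 errors from a serial and a parallel run.
# Bitwise identity is the interesting property here, so use exact
# equality (==) rather than approximate comparison (isapprox).
l2_serial   = [2.0466378437528e-5, 1.7348670769671e-5]  # invented values
l2_parallel = [2.0466378437528e-5, 1.7348670769671e-5]  # invented values
@assert all(l2_serial .== l2_parallel)
```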

Member

(but even if they're not, this is not a merge stopper for me)

Contributor Author

[Figures: L2_V1, L2_rho]

Here are the errors for rhoV1 and, in the second image, just rho from the lid-driven cavity. The different lines (which all overlap) correspond to different numbers of MPI ranks. 0.125 is the single-rank, single-thread run; 1 is 1 rank with 8 threads.

Member

Awesome! This is great to see that the results are still invariant to the MPI & multithreading parallelizations!

@sloede
Member

sloede commented Apr 17, 2026

@TJP-Karpowski Please also add your name to the authors list in AUTHORS.md 🙂

@TJP-Karpowski
Contributor Author

I had seen that in #2361 and #2284 there were discussions on the scaling of Trixi on a single node. I also timed the lid-driven cavity testcases (with less refinement than for the multi-node scaling) again on a single node and got this scaling plot:

[Figure: speedup_singleNode]

I utilize 8 CPUs per rank in these simulations, which corresponds to the NUMA domain size. The drop from 1 thread to 8 threads is similar to the plots in #2361 and #2284. Increasing the node utilization further improves the scaling with MPI. Near full utilization it drops again slightly, but to me that seems quite acceptable compared to the threading-only approach. I presume that for the case of fewer cores I could also utilize, e.g., 4 ranks with 2 threads each to bring the scaling curve closer to linear.
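
For reference, the speedup and efficiency behind such plots follow from S(p) = T(1)/T(p) and E(p) = S(p)/p. A small sketch with invented wall times (not measurements from this PR):

```julia
# Illustrative strong-scaling bookkeeping; all timings are made up.
t1 = 120.0                                  # serial wall time [s]
timings = [2 => 63.0, 4 => 33.0, 8 => 18.0] # ranks => wall time [s]
for (p, tp) in timings
    speedup = t1 / tp
    efficiency = speedup / p                # 1.0 would be ideal scaling
    println("p=$p  speedup=$(round(speedup; digits = 2))  " *
            "efficiency=$(round(efficiency; digits = 2))")
end
```

Efficiencies that decay only slowly with p, as in the plot above, indicate that the communication cost stays small relative to the local work.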

@ranocha ranocha requested a review from DanielDoehring April 19, 2026 13:35
ranocha previously approved these changes Apr 19, 2026
Member

@ranocha ranocha left a comment


Thanks a lot!

return nothing
end

function calc_gradient_local!(gradients, u_transformed, t,
Member

@DanielDoehring DanielDoehring Apr 20, 2026


I really like this singled-out function. I also wondered whether it makes sense to do the same for the prolong2 and calc_flux building blocks. Maybe not in this PR, but something we could consider when the 3D version is added - it would save us some lines of code and make the important building blocks clearer.

Member

Can you please open an issue for this (if Jeremy or you do not want to work on this immediately in the next PR)?

Member

I added a comment in #2894 (review)

Member

@DanielDoehring DanielDoehring left a comment


Thanks again, cool stuff!

@ranocha ranocha enabled auto-merge (squash) April 20, 2026 07:49
@ranocha ranocha disabled auto-merge April 20, 2026 09:04
@ranocha ranocha merged commit c15e806 into trixi-framework:main Apr 20, 2026
40 checks passed

Labels

enhancement New feature or request
