| 1 | +# GPU and MPI Patterns |
| 2 | + |
| 3 | +## GPU Offloading Architecture |
| 4 | + |
| 5 | +Only `src/simulation/` is GPU-accelerated; `pre_process` and `post_process` run on the CPU only. |
| 6 | + |
| 7 | +MFC uses a **backend-agnostic GPU abstraction** via Fypp macros. The same source code |
| 8 | +compiles to either OpenACC or OpenMP target offload depending on the build flag: |
| 9 | + |
| 10 | +- `./mfc.sh build --gpu acc` → OpenACC backend (NVIDIA nvfortran, Cray ftn) |
| 11 | +- `./mfc.sh build --gpu mp` → OpenMP target offload backend (Cray ftn, AMD flang) |
| 12 | +- `./mfc.sh build` (no --gpu) → CPU-only, GPU macros expand to plain Fortran |
| 13 | + |
| 14 | +### Macro Layers (in src/common/include/) |
| 15 | +- `parallel_macros.fpp` — **Use these.** Generic `GPU_*` macros that dispatch to the |
| 16 | + correct backend based on `MFC_OpenACC` / `MFC_OpenMP` compile definitions. |
| 17 | +- `acc_macros.fpp` — OpenACC-specific `ACC_*` implementations (do not call directly) |
| 18 | +- `omp_macros.fpp` — OpenMP target offload `OMP_*` implementations (do not call directly) |
| 19 | + - OMP macros generate **compiler-specific** directives: NVIDIA uses `target teams loop`, |
| 20 | + Cray uses `target teams distribute parallel do simd`, AMD uses |
| 21 | + `target teams distribute parallel do` |
| 22 | +- `shared_parallel_macros.fpp` — Shared helpers (collapse, private, reduction generators) |
| 23 | + |
| 24 | +### Key GPU Macros (always use the `GPU_*` prefix) |
| 25 | + |
| 26 | +Inline macros (use `$:` prefix): |
| 27 | +- `$:GPU_PARALLEL_LOOP(collapse=N, private=[...], reduction=[...], reductionOp='+')` — |
| 28 | + Parallel loop over GPU threads. Most common GPU macro. |
| 29 | +- `$:END_GPU_PARALLEL_LOOP()` — Required closing for GPU_PARALLEL_LOOP. |
| 30 | +- `$:GPU_LOOP(collapse=N, ...)` — Inner loop within a GPU parallel region. |
| 31 | +- `$:GPU_ENTER_DATA(create=[...])` — Allocate device memory (unscoped). |
| 32 | +- `$:GPU_EXIT_DATA(delete=[...])` — Free device memory. |
| 33 | +- `$:GPU_UPDATE(host=[...])` — Copy device → host (before MPI send). |
| 34 | +- `$:GPU_UPDATE(device=[...])` — Copy host → device (after MPI receive). |
| 35 | +- `$:GPU_ROUTINE(parallelism='[seq]')` — Mark routine for device compilation. |
| 36 | +- `$:GPU_DECLARE(create=[...])` — Declare device-resident data. |
| 37 | +- `$:GPU_ATOMIC(atomic='update')` — Atomic operation on device. |
| 38 | +- `$:GPU_WAIT()` — Synchronization barrier. |
| 39 | + |
| 40 | +Block macros (use `#:call`/`#:endcall`): |
| 41 | +- `GPU_PARALLEL(...)` — GPU parallel region (used for scalar reductions like `maxval`/`minval`). |
| 42 | +- `GPU_DATA(copy=..., create=..., ...)` — Scoped data region. |
| 43 | +- `GPU_HOST_DATA(use_device_addr=[...])` — Host code with device pointers. |
| 44 | + |
| 45 | +Typical GPU loop pattern (used 750+ times in the codebase): |
| 46 | +``` |
| 47 | +$:GPU_PARALLEL_LOOP(private='[i,j,k,l]', collapse=3) |
| 48 | +do l = idwbuff(3)%beg, idwbuff(3)%end |
| 49 | + do k = idwbuff(2)%beg, idwbuff(2)%end |
| 50 | + do j = idwbuff(1)%beg, idwbuff(1)%end |
| 51 | + ! loop body |
| 52 | + end do |
| 53 | + end do |
| 54 | +end do |
| 55 | +$:END_GPU_PARALLEL_LOOP() |
| 56 | +``` |
| 57 | + |
| 58 | +WARNING: Do NOT use `GPU_PARALLEL` wrapping `GPU_LOOP` for spatial loops. `GPU_LOOP` |
| 59 | +emits empty directives on Cray and AMD compilers, causing silent serial execution. |
| 60 | +Use `GPU_PARALLEL_LOOP` / `END_GPU_PARALLEL_LOOP` for all parallel spatial loops. |
| 61 | + |
| 62 | +NEVER write raw `!$acc` or `!$omp` directives. Always use `GPU_*` Fypp macros. |
| 63 | +The precheck source lint will catch raw directives and fail. |
| 64 | + |
| 65 | +### Memory Management Macros (from macros.fpp) |
| 66 | +- `@:ALLOCATE(var1, var2, ...)` — Fortran allocate + `GPU_ENTER_DATA(create=...)` |
| 67 | +- `@:DEALLOCATE(var1, var2, ...)` — `GPU_EXIT_DATA(delete=...)` + Fortran deallocate |
| 68 | +- `@:PREFER_GPU(var1, var2, ...)` — NVIDIA unified memory page placement hint |
| 69 | +- Every `@:ALLOCATE` MUST have a matching `@:DEALLOCATE` in finalization |
| 70 | +- Conditional allocation MUST have conditional deallocation |
| 71 | + |
| 72 | +### GPU Field Setup (Cray-specific, from macros.fpp) |
| 73 | +- `@:ACC_SETUP_VFs(...)` / `@:ACC_SETUP_SFs(...)` — GPU pointer setup for vector/scalar fields |
| 74 | +- These compile only for Cray (`_CRAYFTN`); other compilers skip them |
| 75 | + |
| 76 | +### Compiler-Backend Matrix |
| 77 | + |
| 78 | +CI-gated compilers (must always pass): gfortran, nvfortran, Cray ftn, Intel ifx. |
| 79 | +AMD flang is additionally supported for GPU builds but not in the CI matrix. |
| 80 | + |
| 81 | +| Compiler | `--gpu acc` (OpenACC) | `--gpu mp` (OpenMP) | CPU-only | |
| 82 | +|-----------------|----------------------|------------------------|----------| |
| 83 | +| GNU gfortran | No | Experimental (AMD GCN) | Yes | |
| 84 | +| NVIDIA nvfortran| Yes (primary) | Yes | Yes | |
| 85 | +| Cray ftn (CCE) | Yes | Yes (primary) | Yes | |
| 86 | +| Intel ifx | No | Experimental (SPIR64) | Yes | |
| 87 | +| AMD flang | No | Yes | Yes | |
| 88 | + |
| 89 | +## Preprocessor Defines (`#ifdef` / `#ifndef`) |
| 90 | + |
| 91 | +Raw `#ifdef` / `#ifndef` preprocessor guards are **normal and expected** in MFC. |
| 92 | +They are NOT the same as raw `!$acc`/`!$omp` pragmas (which are forbidden). |
| 93 | + |
| 94 | +Use `#ifdef` for feature, target, compiler, and library gating: |
| 95 | + |
| 96 | +### Feature gating |
| 97 | +- `MFC_MPI` — MPI-enabled build (`--mpi` flag, default ON) |
| 98 | +- `MFC_OpenACC` — OpenACC GPU backend (`--gpu acc`) |
| 99 | +- `MFC_OpenMP` — OpenMP target offload backend (`--gpu mp`) |
| 100 | +- `MFC_GPU` — Any GPU build (either OpenACC or OpenMP) |
| 101 | +- `MFC_DEBUG` — Debug build (`--debug`) |
| 102 | +- `MFC_SINGLE_PRECISION` — Single-precision mode (`--single`) |
| 103 | +- `MFC_MIXED_PRECISION` — Mixed-precision mode (`--mixed`) |
| 104 | + |
| 105 | +### Target gating (for code in `src/common/` shared across executables) |
| 106 | +- `MFC_PRE_PROCESS` — Only in pre_process builds |
| 107 | +- `MFC_SIMULATION` — Only in simulation builds |
| 108 | +- `MFC_POST_PROCESS` — Only in post_process builds |
| 109 | + |
| 110 | +### Compiler gating (for compiler-specific workarounds) |
| 111 | +- `_CRAYFTN` — Cray Fortran compiler |
| 112 | +- `__NVCOMPILER_GPU_UNIFIED_MEM` — NVIDIA unified memory (GH-200 / `--unified`) |
| 113 | +- `__PGI` — Legacy PGI/NVIDIA compiler |
| 114 | +- `__INTEL_COMPILER` — Intel compiler |
| 115 | +- `FRONTIER_UNIFIED` — Frontier HPC unified memory |
| 116 | + |
| 117 | +### Library-specific code |
| 118 | +- FFTW (`m_fftw.fpp`) uses heavy `#ifdef` gating for `MFC_GPU` and `__PGI` |
| 119 | +- CUDA Fortran (`cudafor` module) is gated behind `__NVCOMPILER_GPU_UNIFIED_MEM` |
| 120 | +- SILO/HDF5 interfaces may have conditional paths |
| 121 | + |
| 122 | +When adding new `#ifdef` blocks, close each with `#endif` and provide an `#else` branch where one is needed, so |
| 123 | +the code compiles in all configurations (CPU-only, GPU-ACC, GPU-OMP, with/without MPI). |
| 124 | + |
| 125 | +## MPI |
| 126 | + |
| 127 | +### Halo Exchange |
| 128 | +- Pack/unpack offset calculations are error-prone — verify carefully |
| 129 | +- Buffer sizing depends on dimensionality and QBMM state |
| 130 | +- GPU coherence: always `GPU_UPDATE(host=...)` before MPI send, |
| 131 | + `GPU_UPDATE(device=...)` after MPI receive |
| 132 | + |
| 133 | +### Error Handling |
| 134 | +- Use `call s_mpi_abort()` for fatal errors, never `stop` or `error stop` |
| 135 | +- MPI must be finalized before program exit |
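
The halo-exchange ordering described under "Halo Exchange" can be illustrated with a minimal standalone sketch. It assumes plain MPI with no Fypp preprocessing, so it is not MFC code: the buffer names `send_buf` and `recv_buf`, the halo width `n`, and the ring topology are hypothetical. The `GPU_UPDATE` calls appear only as comments marking where MFC would place them, with the quoted-list argument form borrowed from the `private='[i,j,k,l]'` style of the loop example. `MPI_Abort` is a stand-in for `s_mpi_abort`, whose signature is not shown in this document.

```fortran
! Standalone sketch (plain MPI, no Fypp). Build with an MPI wrapper, e.g. mpifort.
program halo_order_sketch
    use mpi
    implicit none

    integer, parameter :: n = 4                  ! hypothetical halo width
    real(kind(0d0)) :: send_buf(n), recv_buf(n)  ! hypothetical pack/unpack buffers
    integer :: rank, nprocs, left, right, ierr

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

    ! Periodic ring: each rank sends to its right neighbor, receives from its left.
    right = mod(rank + 1, nprocs)
    left = mod(rank - 1 + nprocs, nprocs)

    send_buf = real(rank, kind(0d0))
    recv_buf = -1d0

    ! Step 1 (in MFC): $:GPU_UPDATE(host='[send_buf]')
    ! Copy device data to the host before the send reads it.
    call MPI_Sendrecv(send_buf, n, MPI_DOUBLE_PRECISION, right, 0, &
                      recv_buf, n, MPI_DOUBLE_PRECISION, left, 0, &
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
    ! Step 2 (in MFC): $:GPU_UPDATE(device='[recv_buf]')
    ! Copy the received host data back to the device before any kernel uses it.

    ! Fatal errors abort through MPI rather than `stop` or `error stop`.
    if (any(recv_buf /= real(left, kind(0d0)))) then
        call MPI_Abort(MPI_COMM_WORLD, 1, ierr)  ! stand-in for s_mpi_abort
    end if

    ! MPI is finalized before the program exits.
    call MPI_Finalize(ierr)
end program halo_order_sketch
```

The point of the sketch is the ordering: the host copy must be current before the send, and the device copy must be refreshed after the receive. Skipping either update leaves stale data on one side without raising any error.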