@@ -23,26 +23,17 @@ compiles to either OpenACC or OpenMP target offload depending on the build flag:
 
 ### Key GPU Macros (always use the `GPU_*` prefix)
 
-Inline macros (use `$:` prefix):
-- `$:GPU_PARALLEL_LOOP(collapse=N, private=[...], reduction=[...], reductionOp='+')` —
-  Parallel loop over GPU threads. Most common GPU macro.
-- `$:END_GPU_PARALLEL_LOOP()` — Required closing for GPU_PARALLEL_LOOP.
-- `$:GPU_LOOP(collapse=N, ...)` — Inner loop within a GPU parallel region.
-- `$:GPU_ENTER_DATA(create=[...])` — Allocate device memory (unscoped).
-- `$:GPU_EXIT_DATA(delete=[...])` — Free device memory.
-- `$:GPU_UPDATE(host=[...])` — Copy device → host (before MPI send).
-- `$:GPU_UPDATE(device=[...])` — Copy host → device (after MPI receive).
-- `$:GPU_ROUTINE(parallelism='[seq]')` — Mark routine for device compilation.
-- `$:GPU_DECLARE(create=[...])` — Declare device-resident data.
-- `$:GPU_ATOMIC(atomic='update')` — Atomic operation on device.
-- `$:GPU_WAIT()` — Synchronization barrier.
-
-Block macros (use `#:call`/`#:endcall`):
-- `GPU_PARALLEL(...)` — GPU parallel region (used for scalar reductions like `maxval`/`minval`).
-- `GPU_DATA(copy=..., create=..., ...)` — Scoped data region.
-- `GPU_HOST_DATA(use_device_addr=[...])` — Host code with device pointers.
-
-Typical GPU loop pattern (used 750+ times in the codebase):
+Full set with signatures in `parallel_macros.fpp`. The ones you reach for most:
+- `$:GPU_PARALLEL_LOOP(collapse=N, private=[...], reduction=[...], reductionOp='+')`
+  + `$:END_GPU_PARALLEL_LOOP()` — parallel spatial loop; by far the most common (see pattern below).
+- `$:GPU_LOOP(collapse=N, ...)` — inner loop *within* a parallel region.
+- `$:GPU_UPDATE(host=[...])` / `$:GPU_UPDATE(device=[...])` — device↔host copies (around MPI; see below).
+- `#:call GPU_PARALLEL(...)` — block region for scalar reductions (`maxval`/`minval`).
+
+Others in `parallel_macros.fpp`: `GPU_ENTER_DATA`/`GPU_EXIT_DATA`, `GPU_DECLARE`, `GPU_ROUTINE`,
+`GPU_ATOMIC`, `GPU_WAIT`, and the block macros `GPU_DATA`, `GPU_HOST_DATA`.
+
+Typical GPU loop pattern (the dominant spatial-loop idiom):
 ```
 $:GPU_PARALLEL_LOOP(private='[i,j,k,l]', collapse=3)
 do l = idwbuff(3)%beg, idwbuff(3)%end
@@ -108,16 +99,10 @@ Use `#ifdef` for feature, target, compiler, and library gating:
 - `MFC_POST_PROCESS` — Only in post_process builds
 
 ### Compiler gating (for compiler-specific workarounds)
-- `_CRAYFTN` — Cray Fortran compiler
-- `__NVCOMPILER_GPU_UNIFIED_MEM` — NVIDIA unified memory (GH-200 / `--unified`)
-- `__PGI` — Legacy PGI/NVIDIA compiler
-- `__INTEL_COMPILER` — Intel compiler
-- `FRONTIER_UNIFIED` — Frontier HPC unified memory
-
-### Library-specific code
-- FFTW (`m_fftw.fpp`) uses heavy `#ifdef` gating for `MFC_GPU` and `__PGI`
-- CUDA Fortran (`cudafor` module) is gated behind `__NVCOMPILER_GPU_UNIFIED_MEM`
-- SILO/HDF5 interfaces may have conditional paths
+Compiler/feature macros: `_CRAYFTN`, `__NVCOMPILER_GPU_UNIFIED_MEM` (NVIDIA unified mem, GH-200 /
+`--unified`), `__PGI` (legacy PGI/NVIDIA), `__INTEL_COMPILER`, `FRONTIER_UNIFIED`. Library code is
+similarly gated (FFTW in `m_fftw.fpp` on `MFC_GPU`/`__PGI`; CUDA Fortran `cudafor` on
+`__NVCOMPILER_GPU_UNIFIED_MEM`; SILO/HDF5 paths). Grep the relevant file for exact usage.
 
 When adding new `#ifdef` blocks, always provide an `#else` or `#endif` path so
 the code compiles in all configurations (CPU-only, GPU-ACC, GPU-OMP, with/without MPI).