Fix silent viscosity loss on AMD flang GPU (host-capture Re_size in the Riemann solvers)#1588
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses an AMD flang OpenMP-offload GPU correctness regression (post-#1556 module split) where Re_size could remain zero on-device in solver translation units, effectively disabling viscous terms without an error. The fix makes Re_size allocatable so it is handled through the OpenMP runtime mapping machinery (consistent with Re_idx / Res_gs) and thus behaves correctly across translation units.
Changes:
- Change
Re_sizefrom a staticinteger, dimension(2)tointeger, allocatable, dimension(:). - Allocate
Re_size(1:2)during global-parameters initialization before first use and keep initialization to zero. - Deallocate
Re_sizein the module finalization path.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #1588 +/- ##
=======================================
Coverage 60.94% 60.94%
=======================================
Files 82 82
Lines 19922 19925 +3
Branches 2924 2924
=======================================
+ Hits 12141 12144 +3
Misses 5805 5805
Partials 1976 1976 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
baf0884 to
ec9c371
Compare
…host-capture (firstprivate) The declare-target-roulette case shows two identical declare-target arrays disagreeing in one kernel based only on the push mechanism (enter data map vs update to), verified on ROCm 7.2.0 and AFAR drop 23.2.0. Also corrects declare-target-static-tu's Fix section: making the variable allocatable is NOT reliable in the full app (it only moves the bug -- an inversion), the robust fix is host-capture + firstprivate (MFlowCode/MFC#1588).
dd1eef1 to
203ba70
Compare
…plete Riemann private lists) AMD flang reads the static declare-target Re_size stale across translation units, silently disabling viscosity. Host-capture it into Re_size_loc and firstprivate it into the Riemann kernels for all compilers. That firstprivate clause exposed a latent bug: several per-cell scalars (s_M/s_P/xi_M/xi_P in HLL, c_sum_Yi_Phi/flux_ene_e in HLLC, Gamm_L/Gamm_R/flux_tau_L/flux_tau_R in LF) were omitted from private() and relied on CCE-19's defaultmap(firstprivate:scalar) auto-privatization, which the firstprivate clause disrupts (gross wrong physics on Cray). Complete the private lists -- a no-op on every compiler -- so firstprivate is safe everywhere; no compiler gate needed. Verified on Frontier: AMD-flang 29/29 and Cray-CCE (rs=1/2/5) of the previously-failing cases pass.
203ba70 to
0d995d5
Compare
The host-capture firstprivate(Re_size_loc) added to the LF/HLL/HLLC solvers (the AMD-flang viscosity fix) pushed the longest GPU_PARALLEL_LOOP directive past nvfortran's ~1000-char source-line limit (1011 chars -> 'source line too long' / unbalanced parentheses; AMD and Cray accept the long line, nvfortran caps it). FOLD_DIRECTIVE wraps the assembled directive at whole-clause boundaries with repeated sentinels (!$acc&/!$omp&), so the longest emitted line is prefix+longest-single-clause (817 chars on LF) -- shorter than the single line a build with one fewer clause already compiles. firstprivate is preserved (AMD/Cray correctness) on its own continuation. Validated: nvfortran 25.5 -acc and -mp=gpu compile+run the folded marker-interleaved continuation correctly; the cpp line-markers between continuations are already emitted by master for regular Fortran continuations and compile on Cray/AMD CI. Short directives (<200 chars) are unchanged.
…AML is missing run_parallel_benchmarks.sh only hard-failed if the output YAML was absent; a non-zero PR SLURM exit (all cases SIGTERM'd/crashed/hung) was downgraded to a warning, so a broken PR benchmark passed green as long as a partial YAML existed (observed on MFlowCode#1588, where phoenix-bench node contention SIGTERM'd every PR case yet the check stayed green). Now a genuine PR-job failure exits 1. pr_exit is reliable: run_monitored_slurm_job.sh re-checks sacct and returns 0 when the job actually COMPLETED 0:0, so this does not red-cross on monitor flakiness. Scoped to PR only -- a master/baseline infra failure stays a warning.
…lvers firstprivate([Re_size_loc]) made AMD-flang allocate a per-thread copy of the array, pushing the heavy HLLC kernel off an occupancy cliff: a deterministic ~3.1x grind regression on the 5eq/viscous/ibm GPU benchmarks (AMD-flang OpenMP offload only). copyin maps one shared read-only device copy, which still bypasses the stale cross-TU declare-target read of Re_size. Same-node Frontier validation: AMD grind restored to master (5eq 0.99, viscous 1.19, ibm 3.44 vs PR 3.12/4.44/4.75); Cray 29/29 of the previously-failing #1588 UUIDs.
…with copyin) copyin carries Re_size_loc with no firstprivate, so there is no CCE defaultmap disruption: the explicit complete-private-lists are unnecessary (defaultmap auto-privatizes the scalars as on master), and directives are short enough that the nvfortran continuation-line FOLD is no longer needed. Net diff reduces to the minimal fix (Re_size_loc host-copy + copyin in the 3 Riemann solvers). Validated on Frontier: Cray 29/29 of the previously-failing #1588 UUIDs; AMD grind at master (5eq 0.996, viscous 1.19, ibm 3.45) with viscous correctness passing.
7e0e5f8 to
ecf7f22
Compare
…ang occupancy)
firstprivate('[Re_size_loc]') of an integer(2) ARRAY craters AMD-flang occupancy on the heavy HLLC/bubbles Riemann kernels: a deterministic ~3-3.8x grind regression on 5eq/viscous/ibm GPU benches (AMD-flang OMP only). Capturing Re_size into two scalars (Re_size_loc1/Re_size_loc2), firstprivate-ing those, and reading them via merge(...) keeps the per-thread by-value semantics that make this correct (incl. model_eqns=3, where a copyin/map(to:) alternative silently dropped viscosity at the 6-eq kernel) while living in registers, so the cliff is gone. Frontier validation: AMD grind back to master (5eq 0.997, viscous 1.17, ibm 3.45 vs array-firstprivate 3.12/4.44/4.75); AMD model_eqns=3 + shock-tube correctness 7/7; Cray 29/29 of the previously-failing #1588 UUIDs.
…#1588) through the s_compute_interface_reynolds extraction Phase-3 extracts the viscous Reynolds computation into a cray-inlined GPU_ROUTINE that read the declare-target Re_size static directly, which silently reintroduces the AMD flang stale-cross-TU read fixed in MFlowCode#1588. Thread MFlowCode#1588's host-captured Re_size (Re_size_loc1/2 + firstprivate) through the helper and its 10 call sites, and apply the merge() pattern to hllc's inline block.
…he Riemann solvers) (MFlowCode#1588)
Closes #1587. (The branch name is stale — this no longer makes
Re_sizeallocatable; see below.)Summary
Fixes silently-wrong viscous results on the Frontier AMD-flang OpenMP-offload GPU build — the
2D_viscous_shock_tuberegression introduced by #1556. 3 files, solver-only;Re_sizestays static.Root cause
On AMD flang OpenMP offload, a cross-translation-unit read of the static
declare targetRe_sizeinside a Riemann-solver kernel is unreliable and codegen-dependent: some solver kernels read it as0, soif (Re_size(i) > 0)is false,Re_L/Re_Rcollapse todflt_real, the interface Reynolds number becomes1e16, and viscosity is silently switched off. CPU and NVIDIA (nvfortran) builds are unaffected — verified across all 14 NVHPC CI lanes.Why not "make
Re_sizeallocatable" (the previous version of this PR)That approach only moved the bug. On AMD flang it fixed HLL/HLLC but broke the LF (
riemann_solver=5) and IBM viscous paths — they pass on master and failed with the allocatable change (an inversion: static breaks some solver kernels, allocatable breaks others). The defect is per-kernel codegen, not a clean static-vs-allocatable rule, so no choice of storage class is reliable for every kernel. (Confirmed on hardware: per-solverGPU_UPDATEworkarounds also only fixed some kernels.)The fix
Keep
Re_sizestatic; in each ofs_hll/hllc/lf_riemann_solver, capture it into a host-set localRe_size_locandfirstprivatethat value into the viscous kernels, so they never readRe_sizefrom declare-target device memory:Removing the declare-target read removes the variable the compiler bug acts on, so it's immune regardless of which kernel or solver.
Verification (Frontier AMD flang, gpu-omp)
All five previously-affected viscous tests pass with this fix:
70EC99CEviscous_shock_tube,8EAC3DA71D viscous canary,27B24CD9+4C812637LF-viscous (the two the allocatable approach broke),B0CE19C5IBM-viscous.format+precheckclean.Upstream
This is a workaround for an
amdflangbug, filed with a standalone minimal reproducer: ROCm/llvm-project#2890, llvm/llvm-project#203711 (still reproduces on the newest amdflang, AFAR 23.2.0 / LLVM 23). The always-run viscous-GPU canary in #1590 is what surfaced the inversion before merge.