Skip to content

Fix silent viscosity loss on AMD flang GPU (host-capture Re_size in the Riemann solvers)#1588

Merged
sbryngelson merged 4 commits into
masterfrom
fix-amdflang-re-size-allocatable
Jun 15, 2026
Merged

Fix silent viscosity loss on AMD flang GPU (host-capture Re_size in the Riemann solvers)#1588
sbryngelson merged 4 commits into
masterfrom
fix-amdflang-re-size-allocatable

Conversation

@sbryngelson

@sbryngelson sbryngelson commented Jun 13, 2026

Copy link
Copy Markdown
Member

Closes #1587. (The branch name is stale — this no longer makes Re_size allocatable; see below.)

Summary

Fixes silently-wrong viscous results on the Frontier AMD-flang OpenMP-offload GPU build — the 2D_viscous_shock_tube regression introduced by #1556. 3 files, solver-only; Re_size stays static.

Root cause

On AMD flang OpenMP offload, a cross-translation-unit read of the static declare target Re_size inside a Riemann-solver kernel is unreliable and codegen-dependent: some solver kernels read it as 0, so if (Re_size(i) > 0) is false, Re_L/Re_R collapse to dflt_real, the interface Reynolds number becomes 1e16, and viscosity is silently switched off. CPU and NVIDIA (nvfortran) builds are unaffected — verified across all 14 NVHPC CI lanes.

Why not "make Re_size allocatable" (the previous version of this PR)

That approach only moved the bug. On AMD flang it fixed HLL/HLLC but broke the LF (riemann_solver=5) and IBM viscous paths — they pass on master and failed with the allocatable change (an inversion: static breaks some solver kernels, allocatable breaks others). The defect is per-kernel codegen, not a clean static-vs-allocatable rule, so no choice of storage class is reliable for every kernel. (Confirmed on hardware: per-solver GPU_UPDATE workarounds also only fixed some kernels.)

The fix

Keep Re_size static; in each of s_hll/hllc/lf_riemann_solver, capture it into a host-set local Re_size_loc and firstprivate that value into the viscous kernels, so they never read Re_size from declare-target device memory:

integer, dimension(2) :: Re_size_loc   ! host copy
...
Re_size_loc = Re_size
$:GPU_PARALLEL_LOOP(collapse=3, firstprivate='[Re_size_loc]', private='[...]')
...
do q = 1, Re_size_loc(i)   ! was Re_size(i)

Removing the declare-target read removes the variable the compiler bug acts on, so it's immune regardless of which kernel or solver.

Verification (Frontier AMD flang, gpu-omp)

All five previously-affected viscous tests pass with this fix:
70EC99CE viscous_shock_tube, 8EAC3DA7 1D viscous canary, 27B24CD9 + 4C812637 LF-viscous (the two the allocatable approach broke), B0CE19C5 IBM-viscous. format + precheck clean.

Upstream

This is a workaround for an amdflang bug, filed with a standalone minimal reproducer: ROCm/llvm-project#2890, llvm/llvm-project#203711 (still reproduces on the newest amdflang, AFAR 23.2.0 / LLVM 23). The always-run viscous-GPU canary in #1590 is what surfaced the inversion before merge.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses an AMD flang OpenMP-offload GPU correctness regression (post-#1556 module split) where Re_size could remain zero on-device in solver translation units, effectively disabling viscous terms without an error. The fix makes Re_size allocatable so it is handled through the OpenMP runtime mapping machinery (consistent with Re_idx / Res_gs) and thus behaves correctly across translation units.

Changes:

  • Change Re_size from a static integer, dimension(2) to integer, allocatable, dimension(:).
  • Allocate Re_size(1:2) during global-parameters initialization before first use and keep initialization to zero.
  • Deallocate Re_size in the module finalization path.

@codecov

codecov Bot commented Jun 13, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 42.85714% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.94%. Comparing base (ac5774e) to head (da4b6e3).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
src/simulation/m_riemann_solver_hllc.fpp 50.00% 3 Missing and 2 partials ⚠️
src/simulation/m_riemann_solver_lf.fpp 42.85% 0 Missing and 4 partials ⚠️
src/simulation/m_riemann_solver_hll.fpp 25.00% 3 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1588   +/-   ##
=======================================
  Coverage   60.94%   60.94%           
=======================================
  Files          82       82           
  Lines       19922    19925    +3     
  Branches     2924     2924           
=======================================
+ Hits        12141    12144    +3     
  Misses       5805     5805           
  Partials     1976     1976           

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sbryngelson sbryngelson force-pushed the fix-amdflang-re-size-allocatable branch from baf0884 to ec9c371 Compare June 13, 2026 19:49
@sbryngelson sbryngelson changed the title Make Re_size allocatable: fix silent viscosity loss on AMD flang GPU Fix silent viscosity loss on AMD flang GPU (host-capture Re_size in the Riemann solvers) Jun 13, 2026
sbryngelson added a commit to sbryngelson/compiler-bugs that referenced this pull request Jun 13, 2026
…host-capture (firstprivate)

The declare-target-roulette case shows two identical declare-target arrays disagreeing in one kernel based only on the push mechanism (enter data map vs update to), verified on ROCm 7.2.0 and AFAR drop 23.2.0. Also corrects declare-target-static-tu's Fix section: making the variable allocatable is NOT reliable in the full app (it only moves the bug -- an inversion), the robust fix is host-capture + firstprivate (MFlowCode/MFC#1588).
@sbryngelson sbryngelson force-pushed the fix-amdflang-re-size-allocatable branch 2 times, most recently from dd1eef1 to 203ba70 Compare June 14, 2026 02:45
…plete Riemann private lists)

AMD flang reads the static declare-target Re_size stale across translation units, silently disabling viscosity. Host-capture it into Re_size_loc and firstprivate it into the Riemann kernels for all compilers. That firstprivate clause exposed a latent bug: several per-cell scalars (s_M/s_P/xi_M/xi_P in HLL, c_sum_Yi_Phi/flux_ene_e in HLLC, Gamm_L/Gamm_R/flux_tau_L/flux_tau_R in LF) were omitted from private() and relied on CCE-19's defaultmap(firstprivate:scalar) auto-privatization, which the firstprivate clause disrupts (gross wrong physics on Cray). Complete the private lists -- a no-op on every compiler -- so firstprivate is safe everywhere; no compiler gate needed. Verified on Frontier: AMD-flang 29/29 and Cray-CCE (rs=1/2/5) of the previously-failing cases pass.
@sbryngelson sbryngelson force-pushed the fix-amdflang-re-size-allocatable branch from 203ba70 to 0d995d5 Compare June 14, 2026 02:59
The host-capture firstprivate(Re_size_loc) added to the LF/HLL/HLLC solvers (the AMD-flang viscosity fix) pushed the longest GPU_PARALLEL_LOOP directive past nvfortran's ~1000-char source-line limit (1011 chars -> 'source line too long' / unbalanced parentheses; AMD and Cray accept the long line, nvfortran caps it). FOLD_DIRECTIVE wraps the assembled directive at whole-clause boundaries with repeated sentinels (!$acc&/!$omp&), so the longest emitted line is prefix+longest-single-clause (817 chars on LF) -- shorter than the single line a build with one fewer clause already compiles. firstprivate is preserved (AMD/Cray correctness) on its own continuation. Validated: nvfortran 25.5 -acc and -mp=gpu compile+run the folded marker-interleaved continuation correctly; the cpp line-markers between continuations are already emitted by master for regular Fortran continuations and compile on Cray/AMD CI. Short directives (<200 chars) are unchanged.
sbryngelson added a commit to sbryngelson/MFC that referenced this pull request Jun 14, 2026
…AML is missing

run_parallel_benchmarks.sh only hard-failed if the output YAML was absent; a non-zero PR SLURM exit (all cases SIGTERM'd/crashed/hung) was downgraded to a warning, so a broken PR benchmark passed green as long as a partial YAML existed (observed on MFlowCode#1588, where phoenix-bench node contention SIGTERM'd every PR case yet the check stayed green). Now a genuine PR-job failure exits 1. pr_exit is reliable: run_monitored_slurm_job.sh re-checks sacct and returns 0 when the job actually COMPLETED 0:0, so this does not red-cross on monitor flakiness. Scoped to PR only -- a master/baseline infra failure stays a warning.
sbryngelson added a commit that referenced this pull request Jun 14, 2026
…lvers

firstprivate([Re_size_loc]) made AMD-flang allocate a per-thread copy of the array, pushing the heavy HLLC kernel off an occupancy cliff: a deterministic ~3.1x grind regression on the 5eq/viscous/ibm GPU benchmarks (AMD-flang OpenMP offload only). copyin maps one shared read-only device copy, which still bypasses the stale cross-TU declare-target read of Re_size. Same-node Frontier validation: AMD grind restored to master (5eq 0.99, viscous 1.19, ibm 3.44 vs PR 3.12/4.44/4.75); Cray 29/29 of the previously-failing #1588 UUIDs.
sbryngelson added a commit that referenced this pull request Jun 14, 2026
…with copyin)

copyin carries Re_size_loc with no firstprivate, so there is no CCE defaultmap disruption: the explicit complete-private-lists are unnecessary (defaultmap auto-privatizes the scalars as on master), and directives are short enough that the nvfortran continuation-line FOLD is no longer needed. Net diff reduces to the minimal fix (Re_size_loc host-copy + copyin in the 3 Riemann solvers). Validated on Frontier: Cray 29/29 of the previously-failing #1588 UUIDs; AMD grind at master (5eq 0.996, viscous 1.19, ibm 3.45) with viscous correctness passing.
@sbryngelson sbryngelson force-pushed the fix-amdflang-re-size-allocatable branch from 7e0e5f8 to ecf7f22 Compare June 15, 2026 01:51
…ang occupancy)

firstprivate('[Re_size_loc]') of an integer(2) ARRAY craters AMD-flang occupancy on the heavy HLLC/bubbles Riemann kernels: a deterministic ~3-3.8x grind regression on 5eq/viscous/ibm GPU benches (AMD-flang OMP only). Capturing Re_size into two scalars (Re_size_loc1/Re_size_loc2), firstprivate-ing those, and reading them via merge(...) keeps the per-thread by-value semantics that make this correct (incl. model_eqns=3, where a copyin/map(to:) alternative silently dropped viscosity at the 6-eq kernel) while living in registers, so the cliff is gone. Frontier validation: AMD grind back to master (5eq 0.997, viscous 1.17, ibm 3.45 vs array-firstprivate 3.12/4.44/4.75); AMD model_eqns=3 + shock-tube correctness 7/7; Cray 29/29 of the previously-failing #1588 UUIDs.
@sbryngelson sbryngelson merged commit 7715b72 into master Jun 15, 2026
143 of 145 checks passed
@sbryngelson sbryngelson deleted the fix-amdflang-re-size-allocatable branch June 15, 2026 12:33
sbryngelson added a commit to sbryngelson/MFC that referenced this pull request Jun 15, 2026
…#1588) through the s_compute_interface_reynolds extraction

Phase-3 extracts the viscous Reynolds computation into a cray-inlined GPU_ROUTINE that read the declare-target Re_size static directly, which silently reintroduces the AMD flang stale-cross-TU read fixed in MFlowCode#1588. Thread MFlowCode#1588's host-captured Re_size (Re_size_loc1/2 + firstprivate) through the helper and its 10 call sites, and apply the merge() pattern to hllc's inline block.
Cowsreal pushed a commit to Cowsreal/MFCMarkZhang that referenced this pull request Jun 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

Frontier AMD (flang, gpu-omp): viscous_shock_tube fails golden tolerance deterministically on master

2 participants