Fix silent viscosity loss on AMD flang GPU (host-capture Re_size in the Riemann solvers) by sbryngelson · Pull Request #1588 · MFlowCode/MFC

sbryngelson · 2026-06-13T16:20:35Z

Closes #1587. (The branch name is stale — this no longer makes Re_size allocatable; see below.)

Summary

Fixes silently-wrong viscous results on the Frontier AMD-flang OpenMP-offload GPU build — the 2D_viscous_shock_tube regression introduced by #1556. 3 files, solver-only; Re_size stays static.

Root cause

On AMD flang OpenMP offload, a cross-translation-unit read of the static declare target Re_size inside a Riemann-solver kernel is unreliable and codegen-dependent: some solver kernels read it as 0, so if (Re_size(i) > 0) is false, Re_L/Re_R collapse to dflt_real, the interface Reynolds number becomes 1e16, and viscosity is silently switched off. CPU and NVIDIA (nvfortran) builds are unaffected — verified across all 14 NVHPC CI lanes.

Why not "make `Re_size` allocatable" (the previous version of this PR)

That approach only moved the bug. On AMD flang it fixed HLL/HLLC but broke the LF (riemann_solver=5) and IBM viscous paths — they pass on master and failed with the allocatable change (an inversion: static breaks some solver kernels, allocatable breaks others). The defect is per-kernel codegen, not a clean static-vs-allocatable rule, so no choice of storage class is reliable for every kernel. (Confirmed on hardware: per-solver GPU_UPDATE workarounds also only fixed some kernels.)

The fix

Keep Re_size static; in each of s_hll/hllc/lf_riemann_solver, capture it into a host-set local Re_size_loc and firstprivate that value into the viscous kernels, so they never read Re_size from declare-target device memory:

integer, dimension(2) :: Re_size_loc   ! host copy
...
Re_size_loc = Re_size
$:GPU_PARALLEL_LOOP(collapse=3, firstprivate='[Re_size_loc]', private='[...]')
...
do q = 1, Re_size_loc(i)   ! was Re_size(i)

Removing the declare-target read removes the variable the compiler bug acts on, so it's immune regardless of which kernel or solver.

Verification (Frontier AMD flang, gpu-omp)

All five previously-affected viscous tests pass with this fix:
70EC99CE viscous_shock_tube, 8EAC3DA7 1D viscous canary, 27B24CD9 + 4C812637 LF-viscous (the two the allocatable approach broke), B0CE19C5 IBM-viscous. format + precheck clean.

Upstream

This is a workaround for an amdflang bug, filed with a standalone minimal reproducer: ROCm/llvm-project#2890, llvm/llvm-project#203711 (still reproduces on the newest amdflang, AFAR 23.2.0 / LLVM 23). The always-run viscous-GPU canary in #1590 is what surfaced the inversion before merge.

Copilot

Pull request overview

This PR addresses an AMD flang OpenMP-offload GPU correctness regression (post-#1556 module split) where Re_size could remain zero on-device in solver translation units, effectively disabling viscous terms without an error. The fix makes Re_size allocatable so it is handled through the OpenMP runtime mapping machinery (consistent with Re_idx / Res_gs) and thus behaves correctly across translation units.

Changes:

Change Re_size from a static integer, dimension(2) to integer, allocatable, dimension(:).
Allocate Re_size(1:2) during global-parameters initialization before first use and keep initialization to zero.
Deallocate Re_size in the module finalization path.

codecov · 2026-06-13T18:26:50Z

Codecov Report

❌ Patch coverage is 42.85714% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.94%. Comparing base (ac5774e) to head (da4b6e3).
⚠️ Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
src/simulation/m_riemann_solver_hllc.fpp	50.00%	3 Missing and 2 partials ⚠️
src/simulation/m_riemann_solver_lf.fpp	42.85%	0 Missing and 4 partials ⚠️
src/simulation/m_riemann_solver_hll.fpp	25.00%	3 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #1588   +/-   ##
=======================================
  Coverage   60.94%   60.94%           
=======================================
  Files          82       82           
  Lines       19922    19925    +3     
  Branches     2924     2924           
=======================================
+ Hits        12141    12144    +3     
  Misses       5805     5805           
  Partials     1976     1976

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…host-capture (firstprivate) The declare-target-roulette case shows two identical declare-target arrays disagreeing in one kernel based only on the push mechanism (enter data map vs update to), verified on ROCm 7.2.0 and AFAR drop 23.2.0. Also corrects declare-target-static-tu's Fix section: making the variable allocatable is NOT reliable in the full app (it only moves the bug -- an inversion), the robust fix is host-capture + firstprivate (MFlowCode/MFC#1588).

…plete Riemann private lists) AMD flang reads the static declare-target Re_size stale across translation units, silently disabling viscosity. Host-capture it into Re_size_loc and firstprivate it into the Riemann kernels for all compilers. That firstprivate clause exposed a latent bug: several per-cell scalars (s_M/s_P/xi_M/xi_P in HLL, c_sum_Yi_Phi/flux_ene_e in HLLC, Gamm_L/Gamm_R/flux_tau_L/flux_tau_R in LF) were omitted from private() and relied on CCE-19's defaultmap(firstprivate:scalar) auto-privatization, which the firstprivate clause disrupts (gross wrong physics on Cray). Complete the private lists -- a no-op on every compiler -- so firstprivate is safe everywhere; no compiler gate needed. Verified on Frontier: AMD-flang 29/29 and Cray-CCE (rs=1/2/5) of the previously-failing cases pass.

The host-capture firstprivate(Re_size_loc) added to the LF/HLL/HLLC solvers (the AMD-flang viscosity fix) pushed the longest GPU_PARALLEL_LOOP directive past nvfortran's ~1000-char source-line limit (1011 chars -> 'source line too long' / unbalanced parentheses; AMD and Cray accept the long line, nvfortran caps it). FOLD_DIRECTIVE wraps the assembled directive at whole-clause boundaries with repeated sentinels (!$acc&/!$omp&), so the longest emitted line is prefix+longest-single-clause (817 chars on LF) -- shorter than the single line a build with one fewer clause already compiles. firstprivate is preserved (AMD/Cray correctness) on its own continuation. Validated: nvfortran 25.5 -acc and -mp=gpu compile+run the folded marker-interleaved continuation correctly; the cpp line-markers between continuations are already emitted by master for regular Fortran continuations and compile on Cray/AMD CI. Short directives (<200 chars) are unchanged.

…AML is missing run_parallel_benchmarks.sh only hard-failed if the output YAML was absent; a non-zero PR SLURM exit (all cases SIGTERM'd/crashed/hung) was downgraded to a warning, so a broken PR benchmark passed green as long as a partial YAML existed (observed on MFlowCode#1588, where phoenix-bench node contention SIGTERM'd every PR case yet the check stayed green). Now a genuine PR-job failure exits 1. pr_exit is reliable: run_monitored_slurm_job.sh re-checks sacct and returns 0 when the job actually COMPLETED 0:0, so this does not red-cross on monitor flakiness. Scoped to PR only -- a master/baseline infra failure stays a warning.

…lvers firstprivate([Re_size_loc]) made AMD-flang allocate a per-thread copy of the array, pushing the heavy HLLC kernel off an occupancy cliff: a deterministic ~3.1x grind regression on the 5eq/viscous/ibm GPU benchmarks (AMD-flang OpenMP offload only). copyin maps one shared read-only device copy, which still bypasses the stale cross-TU declare-target read of Re_size. Same-node Frontier validation: AMD grind restored to master (5eq 0.99, viscous 1.19, ibm 3.44 vs PR 3.12/4.44/4.75); Cray 29/29 of the previously-failing #1588 UUIDs.

…with copyin) copyin carries Re_size_loc with no firstprivate, so there is no CCE defaultmap disruption: the explicit complete-private-lists are unnecessary (defaultmap auto-privatizes the scalars as on master), and directives are short enough that the nvfortran continuation-line FOLD is no longer needed. Net diff reduces to the minimal fix (Re_size_loc host-copy + copyin in the 3 Riemann solvers). Validated on Frontier: Cray 29/29 of the previously-failing #1588 UUIDs; AMD grind at master (5eq 0.996, viscous 1.19, ibm 3.45) with viscous correctness passing.

…ang occupancy) firstprivate('[Re_size_loc]') of an integer(2) ARRAY craters AMD-flang occupancy on the heavy HLLC/bubbles Riemann kernels: a deterministic ~3-3.8x grind regression on 5eq/viscous/ibm GPU benches (AMD-flang OMP only). Capturing Re_size into two scalars (Re_size_loc1/Re_size_loc2), firstprivate-ing those, and reading them via merge(...) keeps the per-thread by-value semantics that make this correct (incl. model_eqns=3, where a copyin/map(to:) alternative silently dropped viscosity at the 6-eq kernel) while living in registers, so the cliff is gone. Frontier validation: AMD grind back to master (5eq 0.997, viscous 1.17, ibm 3.45 vs array-firstprivate 3.12/4.44/4.75); AMD model_eqns=3 + shock-tube correctness 7/7; Cray 29/29 of the previously-failing #1588 UUIDs.

…#1588) through the s_compute_interface_reynolds extraction Phase-3 extracts the viscous Reynolds computation into a cray-inlined GPU_ROUTINE that read the declare-target Re_size static directly, which silently reintroduces the AMD flang stale-cross-TU read fixed in MFlowCode#1588. Thread MFlowCode#1588's host-captured Re_size (Re_size_loc1/2 + firstprivate) through the helper and its 10 call sites, and apply the merge() pattern to hllc's inline block.

…he Riemann solvers) (MFlowCode#1588)

Copilot AI review requested due to automatic review settings June 13, 2026 16:20

Copilot started reviewing on behalf of sbryngelson June 13, 2026 16:20 View session

sbryngelson mentioned this pull request Jun 13, 2026

Harden AMD gpu-omp CI coverage + lint against silent declare-target regressions (follow-up to #1588) #1589

Open

Copilot AI reviewed Jun 13, 2026

View reviewed changes

sbryngelson mentioned this pull request Jun 13, 2026

test: always-run canary smoke set spanning physics modules #1590

Merged

sbryngelson force-pushed the fix-amdflang-re-size-allocatable branch from baf0884 to ec9c371 Compare June 13, 2026 19:49

sbryngelson changed the title ~~Make Re_size allocatable: fix silent viscosity loss on AMD flang GPU~~ Fix silent viscosity loss on AMD flang GPU (host-capture Re_size in the Riemann solvers) Jun 13, 2026

sbryngelson force-pushed the fix-amdflang-re-size-allocatable branch 2 times, most recently from dd1eef1 to 203ba70 Compare June 14, 2026 02:45

sbryngelson force-pushed the fix-amdflang-re-size-allocatable branch from 203ba70 to 0d995d5 Compare June 14, 2026 02:59

sbryngelson added 2 commits June 13, 2026 23:50

macros: avoid '...' placeholder in FOLD_DIRECTIVE comment (source lint)

ecf7f22

sbryngelson mentioned this pull request Jun 14, 2026

CI: shard the AMD case-optimization pre-build #1582

Merged

sbryngelson force-pushed the fix-amdflang-re-size-allocatable branch from 7e0e5f8 to ecf7f22 Compare June 15, 2026 01:51

sbryngelson merged commit 7715b72 into master Jun 15, 2026
143 of 145 checks passed

sbryngelson deleted the fix-amdflang-re-size-allocatable branch June 15, 2026 12:33

Cowsreal pushed a commit to Cowsreal/MFCMarkZhang that referenced this pull request Jun 24, 2026

Fix silent viscosity loss on AMD flang GPU (host-capture Re_size in t…

5e8e881

…he Riemann solvers) (MFlowCode#1588)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix silent viscosity loss on AMD flang GPU (host-capture Re_size in the Riemann solvers)#1588

Fix silent viscosity loss on AMD flang GPU (host-capture Re_size in the Riemann solvers)#1588
sbryngelson merged 4 commits into
masterfrom
fix-amdflang-re-size-allocatable

sbryngelson commented Jun 13, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

codecov Bot commented Jun 13, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

sbryngelson commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

Why not "make Re_size allocatable" (the previous version of this PR)

The fix

Verification (Frontier AMD flang, gpu-omp)

Upstream

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

codecov Bot commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants

sbryngelson commented Jun 13, 2026 •

edited

Loading

Why not "make `Re_size` allocatable" (the previous version of this PR)

codecov Bot commented Jun 13, 2026 •

edited

Loading