|
| 1 | +# Fast Dev Builds (`--fast-build`) |
| 2 | + |
| 3 | +Status: **prototype / work in progress.** The NVHPC path is implemented and measured; |
| 4 | +the AMD (LLVMFlang) path described below is an analysis with a proposed change that |
| 5 | +still needs validation on an AMD GPU + AMD-compiler machine. |
| 6 | + |
| 7 | +## Motivation |
| 8 | + |
| 9 | +GPU builds are slow to iterate on when you just want to add a `print` and re-run. |
| 10 | +Two different compilers, two different bottlenecks: |
| 11 | + |
| 12 | +- **NVHPC** — the per-iteration cost is the two-pass IPO link (`-Mextract` + `-Minline`), |
| 13 | + plus device codegen for every targeted compute capability. Editing one hot file and |
| 14 | + rebuilding takes minutes because the IPO passes re-run and device code is generated for |
| 15 | + every arch in `MFC_CUDA_CC`. |
| 16 | +- **AMD / LLVMFlang** — the per-iteration cost is the **device LTO link**. The link step |
| 17 | + can take 20+ minutes *every* build, because the OpenMP offload device link does |
| 18 | + whole-program LTO regardless of what changed. |
| 19 | + |
| 20 | +`--fast-build` is a dedicated build mode that strips the expensive, optimization-oriented |
| 21 | +machinery that is pointless during print-debugging. |
| 22 | + |
| 23 | +## Usage |
| 24 | + |
| 25 | +```bash |
| 26 | +./mfc.sh build -t simulation --gpu acc --fast-build -j 8 # NVHPC (OpenACC) |
| 27 | +./mfc.sh build -t simulation --gpu mp --fast-build -j 8 # AMD/Cray (OpenMP offload) |
| 28 | +``` |
| 29 | + |
| 30 | +`--fast-build` is mutually exclusive with `--debug` / `--reldebug`. It is **not** a
| 31 | +correctness build: no bounds checking, no `MFC_DEBUG` asserts. Add your own `print`/`write`
| 32 | +statements; switch to a separate `--debug` build when you need runtime checks.
| 33 | + |
| 34 | +## What it does |
| 35 | + |
| 36 | +`--fast-build` selects a new CMake build type, `Fast`, that deliberately matches **none** of |
| 37 | +the existing conditional flag blocks in `CMakeLists.txt`: |
| 38 | + |
| 39 | +- Not `Release`, so no IPO/LTO and no `-march=native`. |
| 40 | +- Not `Debug`/`RelDebug`, so no `MFC_DEBUG` and no `-gpu=debug`. |
| 41 | + |
| 42 | +It then adds a light `-O1` (via `add_compile_options`, since the `CMAKE_*_FLAGS_FAST` cache |
| 43 | +variables do not inject flags in this codebase). Because `MFC_DEBUG` is off, device routines |
| 44 | +contain no host-only debug aborts, so the binary compiles cleanly **without** IPO. |
| 45 | + |
| 46 | +On NVHPC GPU builds it also restricts device codegen to a **single** compute capability — |
| 47 | +the GPU on the build node, detected via `nvidia-smi` — overriding the multi-arch |
| 48 | +`MFC_CUDA_CC` that the module files set. Set `MFC_FAST_ARCH=<cc>` (e.g. `MFC_FAST_ARCH=90`) |
| 49 | +to override the detection on a login node with no visible GPU. |
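
A minimal sketch of the detection order described above, assuming an `nvidia-smi` recent enough to support the `compute_cap` query field. The function name `detect_fast_arch` is hypothetical and this is not MFC's actual implementation; only `MFC_FAST_ARCH` and the use of `nvidia-smi` come from the text.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not MFC's code): print a single compute capability such as "75".
detect_fast_arch() {
    local cc
    # 1) An explicit override wins, e.g. on a login node with no visible GPU.
    if [ -n "${MFC_FAST_ARCH:-}" ]; then
        printf '%s\n' "$MFC_FAST_ARCH"
        return 0
    fi
    # 2) Otherwise ask the driver. compute_cap prints one value per GPU, e.g. "7.5";
    #    take the first GPU and strip the dot and any spaces.
    if command -v nvidia-smi >/dev/null 2>&1; then
        cc=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null \
             | head -n 1 | tr -d '. ')
        if [ -n "$cc" ]; then
            printf '%s\n' "$cc"
            return 0
        fi
    fi
    # 3) Nothing detected: fail so the caller can keep the multi-arch MFC_CUDA_CC.
    return 1
}
```

A caller could fall back to the module-provided list when the function fails, for example `arch=$(detect_fast_arch) || arch="$MFC_CUDA_CC"`.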
| 50 | + |
| 51 | +## NVHPC results (measured) |
| 52 | + |
| 53 | +NVHPC 24.5, Quadro RTX 6000 (cc75), generic `simulation` build, 8 cores: |
| 54 | + |
| 55 | +| Scenario | Release (fat 5-arch) | `--fast-build` (single-arch) | |
| 56 | +| --- | --- | --- | |
| 57 | +| Clean full build | 641 s | 170 s (3.8x) | |
| 58 | +| Hot-module incremental (`m_riemann_solvers`) | 385 s | 79 s (4.9x) | |
| 59 | + |
| 60 | +Verified: the build uses no IPO (`-Mextract` absent), defines no `MFC_DEBUG`, and targets a
| 61 | +single `-gpu=cc75`; the resulting binary runs a 1D case on the GPU to exit code 0 with finite output.
| 62 | + |
| 63 | +## AMD / LLVMFlang: the device-LTO link (proposed, needs validation) |
| 64 | + |
| 65 | +The AMD GPU offload flags live in `CMakeLists.txt` (`MFC_SETUP_TARGET`): |
| 66 | + |
| 67 | +```cmake |
| 68 | +# compile |
| 69 | +target_compile_options(${a_target} PRIVATE |
| 70 | + -fopenmp --offload-arch=gfx90a -O3 |
| 71 | + -fopenmp-assume-threads-oversubscription |
| 72 | + -fopenmp-assume-teams-oversubscription) |
| 73 | +# link |
| 74 | +target_link_options(${a_target} PRIVATE |
| 75 | + -fopenmp --offload-arch=gfx90a -flto-partitions=${MFC_BUILD_JOBS}) |
| 76 | +``` |
| 77 | + |
| 78 | +The `-flto-partitions` at link is the tell: the OpenMP offload **device link runs |
| 79 | +whole-program LTO every time**, so even a one-file edit re-LTOs all device code. Single-arch |
| 80 | +is not a lever here (already a single `gfx90a`). |
| 81 | + |
| 82 | +### Levers, best first |
| 83 | + |
| 84 | +1. **JIT the device code: `-fopenmp-target-jit`.** Instead of AOT-compiling and LTO-linking |
| 85 | + device code into the binary at link time, embed device LLVM-IR and JIT each kernel at |
| 86 | + runtime on first launch. The device LTO link essentially disappears, so the link drops to |
| 87 | + roughly host-link time. Cost: a one-time JIT warmup on the *run* (tunable with |
| 88 | + `LIBOMPTARGET_JIT_OPT_LEVEL`). This is the real fix for AMD iteration. |
| 89 | + |
| 90 | +2. **Build with a high `-j` (no code change, so the cheapest lever to try).** `-flto-partitions` is set to
| 91 | + `MFC_BUILD_JOBS`, which is your `-j`. Building with `-j 8` on a 64-core node runs the |
| 92 | + device LTO link only 8-way parallel. Use `-j 32`/`-j 64` to give the LTO link more |
| 93 | + partitions; this alone may cut the link time substantially with no toolchain change. |
| 94 | + |
| 95 | +3. **Lower device optimization `-O3` -> `-O1`/`-O0` for dev builds.** The `-O3` drives the |
| 96 | + heavy LTO optimization; lowering it cuts link work (slower runtime, fine for debugging). |
| 97 | + |
| 98 | +4. **`-fno-lto` (AOT, non-LTO device link).** Links per-translation-unit device objects
| 99 | + instead of whole-program LTO. Potentially faster, but less certain across ROCm/flang
| 100 | + versions; try it only if JIT does not pan out.
| 101 | + |
| 102 | +### Proposed `Fast` branch for LLVMFlang |
| 103 | + |
| 104 | +Make the flags above build-type-aware so `--fast-build` emits, for `LLVMFlang`: |
| 105 | + |
| 106 | +```cmake |
| 107 | +# compile
| 108 | +target_compile_options(${a_target} PRIVATE -fopenmp --offload-arch=gfx90a -O1 -fopenmp-target-jit
| 109 | + -fopenmp-assume-threads-oversubscription -fopenmp-assume-teams-oversubscription)
| 110 | +# link (no -flto-partitions; JIT removes the whole-program device LTO)
| 111 | +target_link_options(${a_target} PRIVATE -fopenmp --offload-arch=gfx90a -fopenmp-target-jit)
| 112 | +``` |
| 113 | + |
| 114 | +### How to validate on an AMD machine |
| 115 | + |
| 116 | +On a Frontier AMD / AFAR-style node (`source ./mfc.sh load -c famd -m g` or equivalent): |
| 117 | + |
| 118 | +1. **Baseline** — time the current link: |
| 119 | + `./mfc.sh build -t simulation --gpu mp -j 8` and note the link duration. |
| 120 | +2. **Free lever** — rebuild with a high `-j` (more LTO partitions) and compare the link time: |
| 121 | + `./mfc.sh build -t simulation --gpu mp -j 64`. |
| 122 | +3. **JIT lever** — once the `Fast` LLVMFlang branch is wired in, build with `--fast-build` |
| 123 | + and confirm: (a) the link time collapses, (b) a small case runs to exit 0 (expect a |
| 124 | + one-time JIT warmup on first launch). `OMP_TARGET_OFFLOAD=MANDATORY` is already set. |
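
The comparisons in the steps above can be scripted with a small wall-clock timer. This is only a sketch: `time_build` is a hypothetical helper, and it times the whole `./mfc.sh` invocation rather than the link step alone, so the numbers are most useful for an incremental rebuild after editing a single file.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command, discard its output, and print wall-clock seconds.
time_build() {
    local start end status
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    status=$?
    end=$(date +%s)
    printf '%d s (exit %d): %s\n' "$((end - start))" "$status" "$*"
    return "$status"
}

# Example usage on an AMD node (commands taken from the steps above):
#   time_build ./mfc.sh build -t simulation --gpu mp -j 8
#   time_build ./mfc.sh build -t simulation --gpu mp -j 64
#   time_build ./mfc.sh build -t simulation --gpu mp --fast-build -j 64
```

Second-level resolution from `date +%s` is coarse, but adequate if the link really takes minutes.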
| 125 | + |
| 126 | +### Caveat |
| 127 | + |
| 128 | +None of the AMD numbers are measured — LLVMFlang is not available on the development machine |
| 129 | +used so far. The diagnosis follows directly from the build flags and from how LLVM OpenMP |
| 130 | +offload works, but the exact flag spelling and ROCm/flang-version behavior must be confirmed |
| 131 | +on real hardware before the LLVMFlang `Fast` branch is trusted. |
| 132 | + |
| 133 | +## Implementation notes / TODO |
| 134 | + |
| 135 | +- Implemented: `Fast` CMake build type, `fast_build` field in `MFCConfig` (auto |
| 136 | + `--fast-build`/`--no-fast-build`, own build slug), NVHPC single-arch autodetect, lock-file |
| 137 | + version bump for the new config field. |
| 138 | +- Not yet: a `--gpu-arch` CLI flag (only the `MFC_FAST_ARCH` env escape hatch exists), the |
| 139 | + LLVMFlang `Fast` branch above, Cray-on-AMD validation, and `--help`/docs polish. |