Commit f1ad000

docs: fast-build mode design + AMD/LLVMFlang link-time plan
Documents the --fast-build dev-iteration mode: motivation, usage, the new Fast build type, measured NVHPC results (clean 3.8x, hot-module 4.9x), and the proposed AMD/LLVMFlang path (device-LTO diagnosis, -fopenmp-target-jit + -O1, high-j partitions lever) with steps to validate on an AMD GPU + AMD compiler node. The AMD path is analysis only and unverified (no LLVMFlang on the dev box).
1 parent 30f7e8b commit f1ad000

1 file changed

Lines changed: 139 additions & 0 deletions

docs/documentation/fast_build.md

# Fast Dev Builds (`--fast-build`)

Status: **prototype / work in progress.** The NVHPC path is implemented and measured;
the AMD (LLVMFlang) path described below is an analysis with a proposed change that
still needs validation on an AMD GPU + AMD-compiler machine.

## Motivation

GPU builds are slow to iterate on when you just want to add a `print` and re-run.
Two different compilers, two different bottlenecks:

- **NVHPC**: the per-iteration cost is the two-pass IPO link (`-Mextract` + `-Minline`),
  plus device codegen for every targeted compute capability. Editing one hot file and
  rebuilding takes minutes because the IPO passes re-run and device code is generated for
  every arch in `MFC_CUDA_CC`.
- **AMD / LLVMFlang**: the per-iteration cost is the **device LTO link**. The link step
  can take 20+ minutes *every* build, because the OpenMP offload device link does
  whole-program LTO regardless of what changed.

`--fast-build` is a dedicated build mode that strips the expensive, optimization-oriented
machinery that is pointless during print-debugging.

## Usage

```bash
./mfc.sh build -t simulation --gpu acc --fast-build -j 8   # NVHPC (OpenACC)
./mfc.sh build -t simulation --gpu mp  --fast-build -j 8   # AMD/Cray (OpenMP offload)
```

`--fast-build` is mutually exclusive with `--debug` / `--reldebug`. It is **not** a
correctness build: no bounds checking, no `MFC_DEBUG` asserts. Add your own `print`/`write`
statements, and switch to a separate `--debug` build when you need runtime checks.

## What it does

`--fast-build` selects a new CMake build type, `Fast`, that deliberately matches **none** of
the existing conditional flag blocks in `CMakeLists.txt`:

- Not `Release`, so no IPO/LTO and no `-march=native`.
- Not `Debug`/`RelDebug`, so no `MFC_DEBUG` and no `-gpu=debug`.

It then adds a light `-O1` (via `add_compile_options`, since the `CMAKE_*_FLAGS_FAST` cache
variables do not inject flags in this codebase). Because `MFC_DEBUG` is off, device routines
contain no host-only debug aborts, so the binary compiles cleanly **without** IPO.

On NVHPC GPU builds it also restricts device codegen to a **single** compute capability
(that of the GPU on the build node, detected via `nvidia-smi`), overriding the multi-arch
`MFC_CUDA_CC` that the module files set. Set `MFC_FAST_ARCH=<cc>` (e.g. `MFC_FAST_ARCH=90`)
to override the detection on a login node with no visible GPU.
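
The arch selection described above can be pictured with a small shell sketch. This is only an illustration, not MFC's actual implementation: the function names `normalize_cc` and `pick_fast_arch` are hypothetical, and the `compute_cap` query field assumes a reasonably recent NVIDIA driver.

```bash
# Hypothetical helpers (not part of MFC): a sketch of "env override first,
# then nvidia-smi" for choosing the single compute capability.

# Turn a compute capability such as "7.5" into the "75" form used by -gpu=ccNN.
normalize_cc() {
    printf '%s\n' "$1" | tr -d '. \t\r'
}

# Print the single compute capability a fast build would target.
pick_fast_arch() {
    # An explicit MFC_FAST_ARCH wins (e.g. on a login node with no visible GPU).
    if [ -n "${MFC_FAST_ARCH:-}" ]; then
        normalize_cc "$MFC_FAST_ARCH"
        return 0
    fi
    # Otherwise ask the driver about the first visible GPU.
    # Assumption: the driver supports the compute_cap query field.
    if command -v nvidia-smi >/dev/null 2>&1; then
        _cc=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1)
        if [ -n "$_cc" ]; then
            normalize_cc "$_cc"
            return 0
        fi
    fi
    echo "no GPU detected; set MFC_FAST_ARCH=<cc>" >&2
    return 1
}
```

With `MFC_FAST_ARCH=90` exported, `pick_fast_arch` prints `90` without touching `nvidia-smi`; on a node with a cc 7.5 GPU and no override it would print `75`.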

## NVHPC results (measured)

NVHPC 24.5, Quadro RTX 6000 (cc75), generic `simulation` build, 8 cores:

| Scenario | Release (fat 5-arch) | `--fast-build` (single-arch) |
| --- | --- | --- |
| Clean full build | 641 s | 170 s (3.8x) |
| Hot-module incremental (`m_riemann_solvers`) | 385 s | 79 s (4.9x) |

Verified: builds with no IPO (`-Mextract` absent), no `MFC_DEBUG`, single `-gpu=cc75`, and
the resulting binary runs a 1D case on the GPU to exit code 0 with finite output.
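
The speedup factors in the table are simply the Release time divided by the `--fast-build` time, rounded to one decimal. A minimal sketch of that arithmetic (the `speedup` helper is a name introduced here for illustration):

```bash
# speedup BASE FAST: print BASE/FAST rounded to one decimal place.
speedup() {
    LC_ALL=C awk -v base="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", base / fast }'
}

speedup 641 170   # clean full build: prints 3.8
speedup 385 79    # hot-module incremental: prints 4.9
```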

## AMD / LLVMFlang: the device-LTO link (proposed, needs validation)

The AMD GPU offload flags live in `CMakeLists.txt` (`MFC_SETUP_TARGET`):

```cmake
# compile
target_compile_options(${a_target} PRIVATE
    -fopenmp --offload-arch=gfx90a -O3
    -fopenmp-assume-threads-oversubscription
    -fopenmp-assume-teams-oversubscription)
# link
target_link_options(${a_target} PRIVATE
    -fopenmp --offload-arch=gfx90a -flto-partitions=${MFC_BUILD_JOBS})
```

The `-flto-partitions` at link is the tell: the OpenMP offload **device link runs
whole-program LTO every time**, so even a one-file edit re-LTOs all device code. Single-arch
is not a lever here (already a single `gfx90a`).

### Levers, best first

1. **JIT the device code: `-fopenmp-target-jit`.** Instead of AOT-compiling and LTO-linking
   device code into the binary at link time, embed device LLVM-IR and JIT each kernel at
   runtime on first launch. The device LTO link essentially disappears, so the link drops to
   roughly host-link time. Cost: a one-time JIT warmup on the *run* (tunable with
   `LIBOMPTARGET_JIT_OPT_LEVEL`). This is the real fix for AMD iteration.

2. **Build with a high `-j` (no code change, so try this first).** `-flto-partitions` is set to
   `MFC_BUILD_JOBS`, which is your `-j`. Building with `-j 8` on a 64-core node runs the
   device LTO link only 8-way parallel. Use `-j 32`/`-j 64` to give the LTO link more
   partitions; this alone may cut the link time substantially with no toolchain change.

3. **Lower device optimization `-O3` -> `-O1`/`-O0` for dev builds.** The `-O3` drives the
   heavy LTO optimization; lowering it cuts link work (slower runtime, fine for debugging).

4. **`-fno-lto` (AOT, non-LTO device link).** Links per-translation-unit device objects
   instead of whole-program LTO. Potentially faster, but less certain across ROCm/flang
   versions; worth trying only if JIT does not pan out.
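
Lever 2 amounts to passing the node's core count as `-j`. The sketch below shows one way to pick that number; `core_count` is a hypothetical helper, and the fallback of 8 only mirrors the `-j 8` used elsewhere in this document. On a shared login node, using every core may not be appropriate.

```bash
# Print the number of online cores, falling back to 8 if it cannot be determined.
core_count() {
    _n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || _n=""
    case "$_n" in
        ''|*[!0-9]*) echo 8 ;;      # empty or non-numeric answer: use the fallback
        *)           echo "$_n" ;;
    esac
}

# Lever 2 on an AMD node (not run here):
# ./mfc.sh build -t simulation --gpu mp -j "$(core_count)"
```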

### Proposed `Fast` branch for LLVMFlang

Make the flags above build-type-aware so `--fast-build` emits, for `LLVMFlang`:

```cmake
# compile
-fopenmp --offload-arch=gfx90a -O1 -fopenmp-target-jit \
  -fopenmp-assume-threads-oversubscription -fopenmp-assume-teams-oversubscription
# link (no -flto-partitions; JIT removes the whole-program device LTO)
-fopenmp --offload-arch=gfx90a -fopenmp-target-jit
```

### How to validate on an AMD machine

On a Frontier AMD / AFAR-style node (`source ./mfc.sh load -c famd -m g` or equivalent):

1. **Baseline**: time the current link:
   `./mfc.sh build -t simulation --gpu mp -j 8` and note the link duration.
2. **Free lever**: rebuild with a high `-j` (more LTO partitions) and compare the link time:
   `./mfc.sh build -t simulation --gpu mp -j 64`.
3. **JIT lever**: once the `Fast` LLVMFlang branch is wired in, build with `--fast-build`
   and confirm: (a) the link time collapses, (b) a small case runs to exit 0 (expect a
   one-time JIT warmup on first launch). `OMP_TARGET_OFFLOAD=MANDATORY` is already set.
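
To compare the steps above on equal footing, a small wall-clock wrapper may help. This is a sketch only: `time_cmd` is a hypothetical helper, and it times the whole command rather than the link step in isolation, so the link duration still has to be read from the build output.

```bash
# time_cmd LABEL CMD...: run CMD, print "LABEL: N s" (whole-second wall time),
# and return CMD's exit status.
time_cmd() {
    _label=$1; shift
    _start=$(date +%s)
    _status=0
    "$@" || _status=$?
    echo "$_label: $(( $(date +%s) - _start )) s"
    return "$_status"
}

# Usage on an AMD node (commands from the steps above; not run here):
# time_cmd baseline ./mfc.sh build -t simulation --gpu mp -j 8
# time_cmd high-j   ./mfc.sh build -t simulation --gpu mp -j 64
```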

### Caveat

None of the AMD numbers are measured, because LLVMFlang is not available on the development
machine used so far. The diagnosis follows directly from the build flags and from how LLVM OpenMP
offload works, but the exact flag spelling and ROCm/flang-version behavior must be confirmed
on real hardware before the LLVMFlang `Fast` branch is trusted.

## Implementation notes / TODO

- Implemented: `Fast` CMake build type, `fast_build` field in `MFCConfig` (auto
  `--fast-build`/`--no-fast-build`, own build slug), NVHPC single-arch autodetect, lock-file
  version bump for the new config field.
- Not yet: a `--gpu-arch` CLI flag (only the `MFC_FAST_ARCH` env escape hatch exists), the
  LLVMFlang `Fast` branch above, Cray-on-AMD validation, and `--help`/docs polish.
