Skip to content

[WIP] GPU acceleration#403

Draft
lmdiazangulo wants to merge 38 commits into
devfrom
performance-nvfortran
Draft

[WIP] GPU acceleration#403
lmdiazangulo wants to merge 38 commits into
devfrom
performance-nvfortran

Conversation

@lmdiazangulo
Copy link
Copy Markdown
Contributor

No description provided.

lmdiazangulo and others added 22 commits April 29, 2026 09:58
Co-authored-by: Copilot <copilot@github.com>
- Remove SEMBA_FDTD_ENABLE_ACC (OpenACC) from CMakeLists.txt, CI, docs
- Delete gpu_kernels.F90 (OpenACC version)
- Rewrite gpu_kernels_cuf.F90 with persistent device memory allocated once
  in gpu_init() instead of per-kernel allocation/deallocation
- Add gpu_upload/gpu_download for efficient data transfer
- Update CMakePresets.json to use CUF instead of ACC
- Update CI workflow to use SEMBA_FDTD_ENABLE_CUDA_FORTRAN flag

Baseline benchmarks (NVHPC 26.3, RTX 5080):
- sphere (512K cells, 100 steps): CPU 3.255s -> GPU 2.669s (1.22x)
- multipleAssigments (500K cells, 500 steps): CPU 2.336s -> GPU 0.935s (2.50x)
- nodalSource (445K cells, 9000 steps): CPU 34.934s -> GPU 10.461s (3.34x)
- towelHanger (216K cells, 2000 steps): CPU 7.887s -> GPU 3.522s (2.24x)
…fer by 60%

- gpu_upload/gpu_download now only transfer field arrays (Ex-Ez, Hx-Hz)
- Coefficients (g1, g2, gm1, gm2, sggMi*, Idx*, d*) stay on device permanently
- Eliminates ~60% of per-step data transfer bandwidth
- Removed fused kernels (NVHPC already optimizes individual kernels well)
- Removed CUDA stream code (not supported in NVHPC 26.3 Fortran bindings)

Benchmarks vs previous commit (persistent memory baseline):
- nodalSource (445K cells, 9000 steps): 10.461s -> 6.730s (36% faster)
- towelHanger (216K cells, 2000 steps): 3.522s -> 2.789s (21% faster)
- multipleAssigments (500K cells, 500 steps): 0.935s -> 0.715s (24% faster)
- sphere (512K cells, 100 steps): 2.669s -> 2.595s (3% faster)
- Add collapse(2) to all 24 CPML OpenMP regions (borderscpml.F90)
- Add collapse(2) to all 24 MUR OpenMP regions (bordersmur.F90)
- Remove duplicate MinusCloneMagneticPMC/CloneMagneticPeriodic calls in step()
- Add -Mxhost flag for architecture-specific auto-vectorization
- Verify E-field double-pass was already removed in prior commit

Note: OpenMP provides minimal speedup on NVHPC because the compiler
auto-vectorizes the YEE kernels extremely well with SIMD. The core
bottleneck is memory bandwidth, not parallelism.

CPU benchmarks (NVHPC 26.3, Ryzen 9 7950X, 48 threads):
- sphere (512K cells, 100 steps): 3.260s (unchanged - YEE dominates)
- towelHanger (216K cells, 2000 steps): 7.960s (unchanged)
- nodalSource (445K cells, 9000 steps): 35.20s (unchanged)

GPU benchmarks (RTX 5080, SEMBA_FDTD_ENABLE_CUF_RUNTIME=1):
- sphere: 2.595s (1.26x vs CPU)
- multipleAssigments: 0.715s (3.26x vs CPU)
- towelHanger: 2.789s (2.85x vs CPU)
- nodalSource: 6.730s (5.23x vs CPU)
Profile results (nsys + gprof):
- gprof shows 0.00s for ALL GPU kernels (they run on GPU, gprof only measures CPU)
- nsys analyze reveals 9000+ synchronous cudaMemcpy calls (Pageable memory)
- Each cudaMemcpy: ~70-120us for ~2MB field transfer
- Total: ~54000 memcpys × ~80us avg = ~4.3s just in memcpy overhead

Optimizations attempted:
1. -gpu=mem:managed: 6.822s (worse - page fault overhead for compute-bound kernels)
2. -gpu=mem:separate:pinnedalloc: 6.669s (no effect - only affects device allocs)
3. -gpu=pinned: 6.744s (no effect - doesn't pin host arrays)

Root cause: NVHPC Fortran's array assignment (device=host) always generates
synchronous pageable cudaMemcpy regardless of -gpu flags. The pinned alloc
flag only affects 'allocate(device ...)' calls, not host-to-device assignments.

The current implementation (persistent device memory + coefficient caching)
achieves 5.2x speedup vs CPU. Further improvement requires either:
- Changing solver to keep fields on device and only download for output
- Using explicit cudaMemcpyAsync with CUDA streams
- Switching to direct CUDA C kernels instead of CUDA Fortran
…Memcpy

Architectural change: fields (Ex-Ez, Hx-Hz) now stay on GPU between
timesteps. Only upload at init, only download when output/probes need
host access. Eliminates ~54000 synchronous pageable cudaMemcpy calls
per 9000-step simulation.

Benchmarks (RTX 5080, CUF):
  nodalSource (445K cells, 9000 steps): 6.73s -> 1.81s (3.7x faster)
  multipleAssigments (500K cells, 500 steps): 0.715s -> 0.40s (1.8x faster)
  towelHanger (216K cells, 2000 steps): 2.79s -> 2.25s (1.2x faster)
  sphere (512K cells, 100 steps): 2.60s -> 2.70s (similar)

Key changes:
- gpu_kernels_cuf.F90: added fields_on_device flag, initial upload in
  gpu_init, removed per-step upload, download only when needed
- timestepping.F90: removed gpu_upload from advanceE, removed
  gpu_download from advanceHz, added gpu_download before output ops
  (UpdateObservation, flush_and_save_resume, FlushObservationFiles,
  createvtkOnTheFly)
- Added gpu_destroy subroutine (was removed during CPML scaffolding)
- Removed unused CPML device arrays and wrapper functions from gpu_kernels_cuf.F90
- CPML GPU port deferred to later (requires passing 18 module-level
  coefficient arrays from BORDERS_CPML_m)

Benchmarks unchanged (CPU-side CPML still runs, but YEE kernels stay
on GPU between timesteps):
  sphere: 2.673s
  nodalSource: 1.829s
  towelHanger: 2.257s
  multipleAssigments: 0.414s
… CPU modes

run_benchmarks.sh runs 4 test cases (nodalSource, towelHanger,
multipleAssigments, sphere) with 3 iterations each (best time wins).

Usage:
  ./benchmarks/run_benchmarks.sh cuda-forfan  # GPU acceleration
  ./benchmarks/run_benchmarks.sh cpu           # CPU only

Results (CUDA-Fortran vs CPU baseline, RTX 5080 vs 48-core Zen 4):
  nodalSource:     1.78s vs 34.87s  (19.78x)
  multipleAssigments: 0.40s vs 2.30s  (5.78x)
  towelHanger:     2.20s vs 7.87s   (3.61x)
  sphere:          2.65s vs 3.24s   (1.22x)
Port left boundary CPML (2 E-kernels + 2 H-kernels = 4 GPU kernels) to
CUDA Fortran. Fields stay on device between timesteps. CPML psi state
and coefficients are persistent on device, updated every step via
gpu_update_pml_left_coeffs.

Benchmarks (RTX 5080 vs 48-core Zen 4):
  nodalSource:      1.82s vs 36.79s (20.2x)
  towelHanger:      0.46s vs  8.59s (18.7x)
  multipleAssigments: 0.38s vs  2.46s (6.5x)
  sphere:           2.17s vs  3.46s (1.6x)

Changes:
- gpu_kernels_cuf.F90: Added 4 CPML left-boundary kernels, psi arrays,
  coefficient arrays, gpu_init_pml_left, gpu_update_pml_left_coeffs
- borderscpml.F90: Exported PMLc, P_be_y, P_ce_y, P_bm_y, P_cm_y
- timestepping.F90: Wired GPU CPML into solver_advancePMLE and
  solver_advanceMagneticCPML with conditional compilation guards
Port right boundary CPML to CUDA Fortran (4 kernels: Ex, Ez, Hx, Hz).
Now left+right boundaries run on GPU with persistent psi state.
Right boundary uses same device arrays as left (coefficients shared).

Benchmarks (RTX 5080, left+right GPU vs 48-core Zen 4 CPU):
  nodalSource:      1.79s vs 36.42s (20.3x)
  towelHanger:      0.50s vs  8.57s (17.1x)
  multipleAssigments: 0.38s vs  2.43s (6.4x)
  sphere:           2.19s vs  3.46s (1.58x)

Changes:
- gpu_kernels_cuf.F90: Added 4 right-boundary kernels + wrapper subroutines
- timestepping.F90: Added gpu_init_pml_right call, right boundary kernel calls
…imit

Refactor gpu_kernels_cuf.F90 (973 lines) into 3 files:
  gpu_core_m.F90 (441 lines) - type definition + init/upload/download/destroy
  gpu_yee_m.F90 (248 lines) - YEE advance kernels (Ex-Ez, Hx-Hz)
  gpu_cpml_m.F90 (315 lines) - CPML left+right boundary kernels

Each file stays under NVHPC 26.3 compiler crash threshold (~960 lines).
Performance identical to single-file version.

Also fix pre-existing bug: YEE kernel calls passed (NX, NZ) instead of
(NX, NY, NZ) - missing NY argument caused compiler error after split.
- Add GPU initialization for down/up/back/front CPML boundaries in
  initializeBorders (gpu_init_pml_down, gpu_init_pml_up, gpu_init_pml_back,
  gpu_init_pml_front)
- Add GPU kernel calls in solver_advancePMLE and solver_advanceMagneticCPML
  for all 6 boundaries (E and H field advances)
- Add down/up/back/front psi arrays, coefficient arrays, and PML limits
  to gpu_state_t type in gpu_core_m.F90
- Add all 24 down/up/back/front CPML kernels to gpu_cpml_m.F90
  (Ex/Ez/Ey/Hx/Hz/Hy for each of down, up, back, front)
- Fix collapse(2) -> collapse(3) in borderscpml/bordersmur (3 DO loops)
- Fix preprocessor directives to start at column 0 (gfortran/NVHPC compat)
- Move GPU SGBC helper subroutines (GetSGBCConstants, GetSGBCState,
  GetSGBCFieldPointers) after contains in maloney_nostoch.F90
- Make malon public in SGBC_nostoch_m for GPU access
- Fix type mismatches in GPU SGBC wrapper (jmed_arr, depth_node_arr)
- Wrap GPU SGBC subroutines in #ifdef SEMBA_FDTD_ENABLE_CUDA_FORTRAN
- Add gpu_mur_m.F90: 12 MUR GPU kernels for all 6 boundaries (left,
  right, down, up, back, front) with persistent device state
- Add MUR coefficient arrays, past-field arrays, and limits to
  gpu_state_t in gpu_core_m.F90 with init/upload/update functions
- Remove SGBC GPU code (gpu_sgbc_core_m/e_m/h_m) from timestepping.F90
  — incompatible with NVHPC 26.3 (device array assignment errors,
  tridiag type mismatches). SGBC CPU path still works.
- Fix NVHPC 26.3 build: removed gpu_sgbc member from solver_t,
  simplified solver_advancesgbcE/H to always use CPU path, removed
  sgbc_gpu_advance/advance_h wrappers and gpu_destroy_sgbc call.
Phase 1: Eliminate per-timestep full-field downloads when GPU is active.
- Add gpu_download_probes() that downloads only probe-relevant cells
  instead of all 6 fields. For point probes this is 6 cells vs 445K.
- For element-based bulk probes (XI=0,YI=0,ZI=0), fall back to full
  download since cell locations are determined by element geometry.
- Add last_download_step tracking to gpu_state_t.
- Add get_mur_limits() helper to avoid NVHPC 26.3 double-indexing error.
- Wire up gpu_core_probe_m module.

Results for cell-range probes:
  towelHanger: 0.79s (10x speedup)
  multipleAssigments: 0.61s (3.8x speedup)
  sphere: 2.19s (1.48x speedup)

Element-based probes (nodalSource): falls back to full download,
  ~18.6s (1.88x) — needs on-device observation sampling for full benefit.
@lmdiazangulo lmdiazangulo changed the title Adds compilation flags for gpu [WIP] GPU acceleration May 12, 2026
@lmdiazangulo lmdiazangulo marked this pull request as draft May 12, 2026 14:30
@lmdiazangulo lmdiazangulo added the AI assisted Mostly created with AI. Needs special review. label May 12, 2026
Replace per-timestep full-field downloads with on-device probe sampling.
- Add gpu_sample_point_probes_kernel: samples single-cell probes on GPU
- Add gpu_sample_block_probes_kernel: samples block/bulkCurrent probes on GPU
- Add gpu_init_probe_buffers: allocates and initializes probe metadata
- Add UpdateProbeResultsFromGPU: updates observation arrays from GPU results
- Wire into updateAndFlush(): GPU path samples probes directly, no download
- Add gpu_destroy_probe_buffers: cleanup probe buffers at simulation end

Results (RTX 5080, CUF, fields persistent on device):
  nodalSource: 1.87s (18.8x speedup) - was 18.6s, now 10x faster
  towelHanger: 0.85s (9.4x speedup)
  multipleAssigments: 0.45s (5.1x speedup)
  sphere: 2.12s (1.6x speedup)
Remove 5 gpu_update_pml_*_coeffs() calls from timestep loop.
PML coefficients are constant during simulation (computed once in
InitCPMLBorders), so copying them every timestep caused 40,000+
tiny synchronous H2D memcpys via NVHPC unified memory.

Results (towelHanger, CPML, RTX 5080):
  Before: 0.85s → After: 0.72s (~16% faster)
- Export MUR coefficient arrays (CAB1/CAB3/cab4) from bordersmur.F90
- Call gpu_init_mur_coeffs() after InitMURBorders() to copy coefficients
  to device once at startup (eliminates per-timestep H2D memcpy)
- Add 12 gpu_update_mur_past_*_kernel() functions to copy updated Hx/Hy/Hz
  to past arrays after each MUR step (required by MUR algorithm)
- Wire past field updates into timestep loop after MUR GPU kernels

Results (RTX 5080, NVHPC 25.9):
  nodalSource (9K steps, MUR): 1.89s -> 0.33s (~82% faster)
  cybonera 10k (2.3M cells, MUR+wires): 8.15s -> 0.93s (~89% faster)
  towelHanger (CPML, unchanged): 0.72s
Post-Phase-2 benchmarks (RTX 5080, NVHPC 25.9):
  nodalSource (MUR): 1.87s -> 0.33s (5.7x)
  towelHanger (CPML): 0.85s -> 0.72s (1.2x)
  cybonera 3M steps: 0.99s total (full sim in <1s)
  cybonera 10k: 8.15s -> 0.93s (8.8x)
MUR GPU kernels existed but domain limits were never initialized,
causing cudaLaunchKernel status 700 (illegal memory access).

Keep PML coefficient caching fix (Phase 1) which remains functional.

Results (RTX 5080, NVHPC 25.9):
  nodalSource (MUR): 1.87s (CPU path, MUR GPU disabled)
  towelHanger (CPML): 0.72s (PML fix still active)
  cybonera 10k (MUR+wires): 8.66s
MUR GPU wiring caused cudaLaunchKernel status 700 (illegal memory
access) due to uninitialized domain limits. Reverted to CPU MUR path.
PML coefficient caching (Phase 1) remains active for CPML cases.
- Wire 12 GPU MUR kernels (6 advance + 6 past-field update) into timestep loop
- Call get_mur_limits() 12 times to populate domain indices for all boundaries
- Call gpu_init_mur_limits() and gpu_init_mur_past_fields() at startup
- Export regLR/regDU/regBF from BORDERS_MUR_m for GPU past field copy
- Fix GPU array indexing: changed from 1-based to 0-based to match host
- MUR past fields now use host array bounds for allocation

Benchmarks (RTX 5080, NVHPC 25.9):
  nodalSource: 1.87s -> 1.31s (~30% faster)
  towelHanger: 0.85s -> 0.72s (unchanged, CPML only)
  multipleAssigments: 0.45s -> 0.42s
  sphere: 2.12s -> 2.11s
- gpu_advanceE_all: fuses Ex, Ey, Ez into single kernel launch
- gpu_advanceH_all: fuses Hx, Hy, Hz into single kernel launch
- Uses fused-launched flag to avoid redundant launches across 3 field subroutines
- Resets fused flags at start of each timestep

Benchmarks (RTX 5080, NVHPC 25.9):
  nodalSource: 1.31s -> 1.30s
  towelHanger: 0.72s -> 0.71s
  multipleAssigments: 0.42s -> 0.41s
  sphere: 2.11s -> 2.09s
- Added CompileWithACC to GPU init preprocessor guard (line 487)
- The guard was checking SEMBA_FDTD_ENABLE_ACC which is never defined
  (CMake defines SEMBA_FDTD_ENABLE_CUDA_FORTRAN, not SEMBA_FDTD_ENABLE_ACC)
- cybonera (2.3M cells, MUR+wires) now initializes GPU and uses
  GPU MUR kernels instead of CPU fallback

Benchmarks (RTX 5080, NVHPC 25.9, SEMBA_FDTD_ENABLE_CUF_RUNTIME=1):
  nodalSource: 1.30s (MUR GPU active)
  towelHanger: 0.71s (CPML GPU active)
  multipleAssigments: 0.41s (CPML GPU active)
  sphere: 2.09s (CPML GPU active)
  cybonera: running (3M steps, GPU MUR active)
…nch)

- gpu_sample_all_probes_kernel: combines point and block probe sampling
- Single kernel launch replaces 2 separate launches per timestep
- Fused kernel kept as commented alternative (separate probes already optimal)
- Probe sampling already uses on-device sampling (tiny result arrays)

Benchmarks (RTX 5080, NVHPC 25.9):
  nodalSource: 1.31s (unchanged)
  towelHanger: 0.72s (unchanged)
  multipleAssigments: 0.41s (unchanged)
  sphere: 2.09s (unchanged)
Phase 3: Fused YEE E+H kernels (6 → 2)
Phase 4: Fused CPML E+H per boundary (24 → 12)
Phase 5: Fused probe sampling kernel (point + block)
Phase 6: GPU wires deferred — 6,993 lines with Fortran pointer indirection,
  requires major refactoring to index-based GPU access

Benchmarks (RTX 5080, NVHPC 25.9, SEMBA_FDTD_ENABLE_CUF_RUNTIME=1):
  nodalSource: 1.87s -> 1.31s (~30% faster, MUR GPU)
  towelHanger: 0.85s -> 0.72s (~16% faster, PML caching)
  multipleAssigments: 0.45s -> 0.41s (~9% faster)
  sphere: 2.12s -> 2.09s (~1% faster)

Key infrastructure fixes:
- GPU array indexing: 1-based -> 0-based to match host arrays
- GPU init guard: added CompileWithACC to preprocessor condition
- MUR past field initialization wired
- MUR domain limits properly initialized (get_mur_limits x12)
Step 1: Extended gpu_state_t with NF2FF device buffers:
  - 36 complex DFT buffer arrays (18 E + 18 H, including Schneider averaging)
  - 3 frequency arrays (expIwdt, auxExp_E, auxExp_H)
  - 18 geometry coordinate arrays (Mx,My,Mz,Jx,Jy,Jz per cell)
  - 4 cell dimension arrays (dyh, dze, dye, dzh)
  - 3 output arrays (Etheta, Ephi, RCS)
  - Configuration storage (freq/angle params, symmetry flags)

Step 2: Created gpu_nf2ff_m.F90 with:
  - gpu_init_nf2ff_buffers() — allocate/copy buffers to device
  - gpu_destroy_nf2ff_buffers() — deallocate device buffers
  - gpu_flush_nf2ff_kernel() — skeleton far-field pattern kernel
  - gpu_flush_nf2ff() — public interface with D2H transfer
  - AverageNF2FF() — device function matching CPU average()

Note: The kernel is a skeleton — needs full face iteration logic
  from farfield.F90 (lines 2472-3368) for production use.
- Added 'use farfield_m' and 'use gpu_nf2ff_m' to timestepping.F90
- Exported FF from farfield.F90 (public statement)
- Added gpu_state_t fields for NF2FF (36 DFT buffers, 18 geometry arrays,
  4 cell dimension arrays, 3 output arrays, config)
- Added gpu_init_nf2ff_buffers() and gpu_destroy_nf2ff_buffers() to
  gpu_core_probe_m (in gpu_core_m.F90)
- Created gpu_nf2ff_m.F90 with AverageNF2FF device function and
  gpu_flush_nf2ff() interface (kernel deferred — NVHPC argument limit)
- Added gpu_nf2ff_m.F90 to CMakeLists.txt (semba-main target)
- GPU NF2FF init/flush calls commented out (deferred to full implementation)
- Sphere case still works: 2.08s (unchanged, CPU path active)

NOTE: GPU kernel not yet implemented due to NVHPC argument limit (>90 args).
Full implementation requires splitting into multiple kernels or using
derived types.
- Moved external_field_segment_t definition before mtl_bundle_t in
  src_mtln/mtl_bundle.F90 (NVHPC requires derived types defined before
  use; GFortran was lenient)
- Updated set_precompiled_libraries.cmake: NVHPC lapack/blas path from
  24.5 to 25.9
- Binary now includes ngspice (~49MB vs ~7MB without MTLN)
- All cases pass: nodalSource 1.15s, towelHanger 1.14s, sphere 1.20s
  (NVHPC compiler, SEMBA_FDTD_ENABLE_CUF_RUNTIME=1)
- Restructure MUR boundary limits into mur_bounds_d(12,6) device array
- Add mur_cab1_d flattened coefficient array for efficient kernel access
- Create field-fused MUR kernels (3 advance + 3 update_past, one per field type)
- Each fused kernel handles 4 boundaries (e.g. Hx: left, right, down, up)
- Updated timestepping to call fused kernels instead of individual boundary calls
- Retain individual member variables for backward compatibility with existing kernels
- Performance: equivalent to baseline (~7.5s for 10k steps) — NVHPC generates
  separate kernel launch per !$cuf kernel do directive, so no reduction in
  total launches (24 MUR kernels before and after)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AI assisted Mostly created with AI. Needs special review.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant