Skip to content
Manjaree Binjolkar edited this page Aug 1, 2025 · 3 revisions

Overview of CUDA

This project leverages CUDA (Compute Unified Device Architecture) to offload computationally intensive ODE integrations onto NVIDIA GPUs. CUDA enables massive parallelization by distributing independent simulation tasks across thousands of GPU threads, significantly accelerating model runs compared to CPU-only execution.

In this project, CUDA is used to integrate multiple independent stream systems concurrently, using a Runge-Kutta 4/5 (RK45) solver kernel with optional stiffness detection and Radau-IIA fallback for stiff systems.

The integration kernels are defined in:

  • rk45_kernel.cu – RK45 kernel (rk45_then_radau_multi) with stiffness detection

Enabling CUDA Compilation

Compilation requires nvcc (CUDA compiler) and the proper GPU architecture flags. For example:

nvcc -O3 -arch=sm_80 -o simulation main_new.cpp solver/rk45_kernel.cu ...

The runtime API is provided via:

#include <cuda_runtime.h>

GPU memory management and error handling is implemented through helper macros like CUDA_CHECK.


Parallelization Strategy

The CUDA implementation maps each stream system to an independent GPU thread. Each thread:

  1. Loads its initial state and spatial parameters from global memory.
  2. Integrates the ODEs from t0 to tf using the RK45 solver.
  3. Performs dense output interpolation at requested query times.
  4. Optionally flags stiff systems, which can then be re-integrated with the Radau-IIA kernel.

This design follows the SIMD (Single Instruction, Multiple Data) pattern: all threads execute the same solver logic on different data. Memory is organized to maximize coalesced access, improving performance.

Additionally, forcing data (e.g., precipitation, temperature) is uploaded to the device once per time chunk and accessed directly by kernels.


Thread and Block Configuration

The launch configuration for the solver kernel is determined at runtime based on occupancy:

int blockSize = 0, minGridSize = 0;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                   rk45_then_radau_multi<Model204>, 0, 0);
int numBlocks = (num_systems + blockSize - 1) / blockSize;
rk45_then_radau_multi<Model204><<<numBlocks, blockSize>>>(...);
  • numBlocks × blockSize = total threads = number of stream systems
  • Each thread solves one stream system independently.
  • GPU utilization scales with the number of streams.

GPU Device Selection (MPI-Aware)

When running under MPI across multiple nodes/GPUs, each rank is assigned a GPU device in a round-robin manner:

assignGpuDevice(rank, getGpuCount());

This ensures optimal resource usage across distributed simulations.


GPU Memory Management

The solver allocates device buffers for:

  • Initial conditions (d_y0_all)
  • Final states (d_y_final_all)
  • Dense output (d_dense_all)
  • Query times (d_query_times)
  • Forcing data (d_forc_data)
  • Stiffness flags (d_stiff)

Data transfer between host and device is handled via cudaMemcpy, with error-checked wrappers (CUDA_CHECK).

Clone this wiki locally