This guide covers common issues and solutions when building and using HPC-AI-Optimization-Lab.
Symptoms:

```
CMake Error: Could not find CUDA
```

Solutions:

- Verify the CUDA Toolkit is installed:

  ```bash
  nvcc --version  # Should show CUDA 12.4+
  ```

- Set the CUDA path explicitly:

  ```bash
  export CUDA_HOME=/usr/local/cuda
  cmake -S . -B build -DCUDA_TOOLKIT_ROOT_DIR=$CUDA_HOME
  ```

- Check that PATH includes CUDA:

  ```bash
  echo $PATH  # Should include /usr/local/cuda/bin
  ```
Symptoms:

```
CMake Error: CMake was unable to find a build program corresponding to "Unix Makefiles"
```

Solutions:

- Install CMake 3.24+:

  ```bash
  # Ubuntu/Debian
  wget https://github.com/Kitware/CMake/releases/download/v3.28.0/cmake-3.28.0-linux-x86_64.sh
  chmod +x cmake-*.sh
  sudo ./cmake-*.sh --prefix=/usr/local
  ```

- Or use pip:

  ```bash
  pip install cmake --upgrade
  ```
Symptoms:

```
error: unrecognized command line option '-std=c++20'
```

Solutions:

- Upgrade GCC to 11+:

  ```bash
  # Ubuntu 22.04+
  sudo apt install g++-11
  export CXX=g++-11
  ```

- Or use Clang 14+:

  ```bash
  sudo apt install clang-14
  export CXX=clang++-14
  ```
Symptoms:

```
error: identifier "wmma::load_matrix_sync" is undefined
```

Solutions:

- Specify the GPU architecture explicitly:

  ```bash
  cmake -S . -B build -DCMAKE_CUDA_ARCHITECTURES="80;90"
  ```

- Or check your GPU's compute capability:

  ```bash
  nvidia-smi --query-gpu=compute_cap --format=csv
  ```
Symptoms:

```
error: shared memory array size exceeds maximum
```

Solutions:

- Reduce the tile size in the kernel configuration
- Use dynamic shared memory:

  ```cuda
  extern __shared__ float smem[];
  ```

- Check the GPU's shared memory limit (nvidia-smi does not report it; query it through the runtime API):

  ```cuda
  int smem_per_block;
  cudaDeviceGetAttribute(&smem_per_block, cudaDevAttrMaxSharedMemoryPerBlock, 0);
  ```
Symptoms:

```
CUDA error: invalid device ordinal
```

Solutions:

- Check available GPUs:

  ```bash
  nvidia-smi -L
  ```

- Set visible devices:

  ```bash
  export CUDA_VISIBLE_DEVICES=0
  ```
Symptoms:

```
CUDA error: out of memory
```

Solutions:

- Reduce batch/tensor size
- Check GPU memory:

  ```bash
  nvidia-smi
  ```

- Use a memory pool (CUDA 11.2+):

  ```cuda
  cudaMemPool_t pool;
  cudaDeviceGetDefaultMemPool(&pool, 0);
  ```
Symptoms:

```
CUDA error: too many resources requested for launch
```

Solutions:

- Reduce the block size:

  ```cuda
  // Instead of 1024 threads
  dim3 block(256);  // Use a smaller block
  ```

- Check register usage:

  ```bash
  nvcc --ptxas-options=-v kernel.cu
  ```
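When shrinking the block, the grid must grow to keep covering all elements. A quick Python sketch of the round-up arithmetic (the `launch_config` helper is ours, purely illustrative, not part of the library):

```python
# Illustrative helper: pick a 1-D launch configuration that covers
# n elements with a given block size, rounding the grid up.
def launch_config(n, block=256):
    grid = (n + block - 1) // block  # ceil(n / block) blocks
    return grid, block

print(launch_config(1000))  # (4, 256)
```

The same ceiling-division pattern applies per dimension for 2-D and 3-D launches.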
Symptoms: TFLOPS much lower than expected

Solutions:

- Ensure dimensions are multiples of 16:

  ```cuda
  // Tensor Cores require M, N, K divisible by 16
  M = ((M + 15) / 16) * 16;  // Pad to a multiple of 16
  ```

- Verify FP16 input:

  ```cuda
  // Tensor Cores require __half input
  hpc::Tensor<__half> A(M * K);  // Not float
  ```

- Check occupancy (nvprof is deprecated on recent GPUs; use Nsight Compute):

  ```bash
  ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active ./program
  ```
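The padding rule above is plain round-up arithmetic; a small Python sketch (the `pad_to_multiple` helper is ours, for illustration only):

```python
# Illustrative: round n up to the next multiple of m (16 for Tensor Cores).
def pad_to_multiple(n, m=16):
    return ((n + m - 1) // m) * m

print(pad_to_multiple(1000))  # 1008
```

Values already on the boundary are left unchanged, so padding is safe to apply unconditionally.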
Symptoms: Unexpected slowdown in shared memory operations

Solutions:

- Add padding to shared memory arrays:

  ```cuda
  __shared__ float tile[32][33];  // +1 for bank conflict avoidance
  ```

- Profile with Nsight Compute:

  ```bash
  ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld ./program
  ```
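Why the +1 column works: shared memory is divided into 32 four-byte banks, and a word's bank is its index mod 32. This Python sketch (assuming 32 banks and float elements) shows that a column read from a 32-wide tile lands entirely in one bank, while a 33-wide tile spreads across all 32:

```python
# Illustrative: banks touched when 32 threads each read tile[t][col],
# with each tile row `stride` floats wide (32 banks, 4-byte words).
def banks_hit(stride, col=0, threads=32):
    return {(t * stride + col) % 32 for t in range(threads)}

print(len(banks_hit(32)))  # 1  -> 32-way bank conflict
print(len(banks_hit(33)))  # 32 -> conflict-free
```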
Solutions:

- Build with Python bindings:

  ```bash
  cmake -S . -B build -DBUILD_PYTHON_BINDINGS=ON
  cmake --build build
  ```

- Set PYTHONPATH:

  ```bash
  export PYTHONPATH="$(pwd)/build/python:${PYTHONPATH}"
  ```

- Verify:

  ```python
  import hpc_ai_opt
  print(hpc_ai_opt.__doc__)
  ```
Symptoms:

```
ValueError: Tensors must be on CUDA device
```

Solutions:

```python
# Wrong
x = torch.randn(1024)  # CPU tensor

# Correct
x = torch.randn(1024, device="cuda")  # GPU tensor
```

Solutions:

- Check that tensor dtypes match:

  ```python
  x = torch.randn(1024, device="cuda", dtype=torch.float32)
  y = torch.empty_like(x)  # Same dtype and device
  ```

- Verify dimensions:

  ```python
  # FlashAttention requires head_dim=64
  config = {
      'head_dim': 64,  # Must be 64
      ...
  }
  ```
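Both failure modes can be caught before any GPU work starts. A hedged sketch of a pre-flight check (the `validate` helper is ours, not part of the bindings; it only assumes tensor-like objects with `.device` and `.dtype`, such as `torch.Tensor`):

```python
# Illustrative helper: raise the same errors the bindings would,
# before calling into any kernel. x and y are tensor-like objects
# exposing .device and .dtype (e.g. torch.Tensor).
def validate(x, y):
    for t in (x, y):
        dev = getattr(t.device, "type", t.device)  # torch devices have .type
        if dev != "cuda":
            raise ValueError("Tensors must be on CUDA device")
    if x.dtype != y.dtype:
        raise ValueError("Tensor dtypes must match")
```

Calling this at the top of your own wrapper functions turns late kernel failures into immediate, readable Python exceptions.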
| Error Code | Description | Common Cause |
|---|---|---|
| 1 | Invalid value | Bad parameter |
| 2 | Out of memory | GPU memory exhausted |
| 8 | Invalid device ordinal | Wrong GPU ID |
| 9 | Invalid kernel image | Architecture mismatch |
| 30 | Unknown error | Usually driver issue |
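For logging, the table can be mirrored as a simple lookup. A sketch (the `CUDA_ERRORS` dict and `describe` function are ours; `cudaGetErrorString` remains the authoritative source):

```python
# Illustrative: the error-code table above as a lookup with a fallback.
CUDA_ERRORS = {
    1: "Invalid value",
    2: "Out of memory",
    8: "Invalid device ordinal",
    9: "Invalid kernel image",
    30: "Unknown error",
}

def describe(code):
    return CUDA_ERRORS.get(code, "Unrecognized error code %d" % code)

print(describe(2))  # Out of memory
```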
Solutions:

- Check the driver version:

  ```bash
  nvidia-smi  # Look for "Driver Version"
  ```

- Update the driver:

  ```bash
  # Ubuntu
  sudo apt install nvidia-driver-535
  ```

- Match the CUDA version:

  | CUDA | Min Driver |
  |---|---|
  | 12.4 | 550.54+ |
  | 12.3 | 545.23+ |
  | 12.2 | 535.54+ |
Solutions:

- Check the GPU architecture:

  ```bash
  nvidia-smi --query-gpu=compute_cap --format=csv
  ```

- Build for the correct architecture:

  ```bash
  # For A100 (SM 8.0)
  cmake -S . -B build -DCMAKE_CUDA_ARCHITECTURES="80"

  # For H100 (SM 9.0)
  cmake -S . -B build -DCMAKE_CUDA_ARCHITECTURES="90"
  ```
```cuda
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = call; \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

// Use after kernel launch
kernel<<<grid, block>>>(args);
CUDA_CHECK(cudaGetLastError());
CUDA_CHECK(cudaDeviceSynchronize());
```

```bash
# Check for memory errors
compute-sanitizer ./build/tests/gemm/test_gemm
```
```bash
# Check for race conditions
compute-sanitizer --tool racecheck ./program

# Check for memory leaks
compute-sanitizer --tool memcheck ./program
```

```bash
# Detailed kernel analysis
ncu --set full -o profile ./program

# Focus on specific metrics
ncu --metrics gpu__time_duration.sum ./program

# Compare kernels
ncu --set basic ./program
```

```bash
# System-wide profiling
nsys profile -o timeline ./program

# View results
nsys-ui timeline.nsys-rep
```

If your issue isn't covered here:
- Search existing issues: GitHub Issues
- Check documentation: Documentation
- Ask in discussions: GitHub Discussions
- Report a bug: Use the Bug Report Template
When reporting, please include:

- OS and version
- CUDA version (`nvcc --version`)
- GPU model and driver (`nvidia-smi`)
- CMake configuration output
- Full error message
- Minimal reproduction code
Q: Can I run this library without an NVIDIA GPU?

A: No. This library requires an NVIDIA GPU with Compute Capability 7.0+. All kernels execute on the GPU.

Q: Why are my kernels slower than expected?

A: Common reasons:

- Wrong GPU architecture (compile for your GPU)
- Non-optimal dimensions (pad to multiples of 16 for Tensor Cores)
- Low occupancy (reduce register usage)
- Bank conflicts (add padding)
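The occupancy point can be made concrete with a back-of-the-envelope estimate of how register pressure alone caps active warps. This sketch assumes 65,536 registers per SM and a 64-warp-per-SM limit, typical of recent NVIDIA architectures (the `occupancy` helper is ours; Nsight Compute gives the real figure):

```python
# Illustrative: occupancy limited by registers alone. Assumes 65536
# registers per SM and a 64-warp limit (both architecture-dependent).
def occupancy(regs_per_thread, block=256, regs_per_sm=65536, max_warps=64):
    blocks_per_sm = regs_per_sm // (regs_per_thread * block)
    warps = min(blocks_per_sm * block // 32, max_warps)
    return warps / max_warps

print(occupancy(32))   # 1.0  -> full occupancy
print(occupancy(128))  # 0.25 -> register-bound
```

Halving registers per thread (via `__launch_bounds__` or `-maxrregcount`) can double the estimate, which is why the register check above matters.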
Q: Does the library build on Windows?

A: Yes, with Visual Studio 2022+ and CUDA 12.4+. Use the CMake GUI or a Developer Command Prompt.

Q: Can I use it with PyTorch?

A: Yes! Build the Python bindings and pass PyTorch CUDA tensors directly:

```python
import torch
import hpc_ai_opt

x = torch.randn(1024, device="cuda")
y = torch.empty_like(x)
hpc_ai_opt.elementwise.relu(x, y)
```

Still stuck? Open an issue and we'll help!