This guide covers common issues and solutions when building and using HPC-AI-Optimization-Lab.
Symptoms:

```
CMake Error: Could not find CUDA
```

Solutions:

- Verify the CUDA Toolkit is installed:

  ```bash
  nvcc --version  # Should show CUDA 12.4+
  ```

- Set the CUDA path explicitly:

  ```bash
  export CUDA_HOME=/usr/local/cuda
  cmake -S . -B build -DCUDA_TOOLKIT_ROOT_DIR=$CUDA_HOME
  ```

- Check that PATH includes CUDA:

  ```bash
  echo $PATH  # Should include /usr/local/cuda/bin
  ```
Symptoms:

```
CMake Error: CMake was unable to find a build program corresponding to "Unix Makefiles"
```

Solutions:

- Install CMake 3.24+:

  ```bash
  # Ubuntu/Debian
  wget https://github.com/Kitware/CMake/releases/download/v3.28.0/cmake-3.28.0-linux-x86_64.sh
  chmod +x cmake-*.sh
  sudo ./cmake-*.sh --prefix=/usr/local
  ```

- Or use pip:

  ```bash
  pip install cmake --upgrade
  ```
Symptoms:

```
error: unrecognized command line option '-std=c++20'
```

Solutions:

- Upgrade GCC to 11+:

  ```bash
  # Ubuntu 22.04+
  sudo apt install g++-11
  export CXX=g++-11
  ```

- Or use Clang 14+:

  ```bash
  sudo apt install clang-14
  export CXX=clang++-14
  ```
Symptoms:

```
error: identifier "wmma::load_matrix_sync" is undefined
```

Solutions:

- Specify the GPU architecture explicitly:

  ```bash
  cmake -S . -B build -DCMAKE_CUDA_ARCHITECTURES="80;90"
  ```

- Or check your GPU's compute capability:

  ```bash
  nvidia-smi --query-gpu=compute_cap --format=csv
  ```
Symptoms:

```
error: shared memory array size exceeds maximum
```

Solutions:

- Reduce the tile size in the kernel configuration
- Use dynamic shared memory:

  ```cuda
  extern __shared__ float smem[];
  ```

- Check the GPU's shared memory limit (nvidia-smi does not report it; query it through the runtime API):

  ```cuda
  int smem_per_block;
  cudaDeviceGetAttribute(&smem_per_block, cudaDevAttrMaxSharedMemoryPerBlock, 0);
  ```
Symptoms:

```
CUDA error: invalid device ordinal
```

Solutions:

- Check available GPUs:

  ```bash
  nvidia-smi -L
  ```

- Set visible devices:

  ```bash
  export CUDA_VISIBLE_DEVICES=0
  ```
Symptoms:

```
CUDA error: out of memory
```

Solutions:

- Reduce batch/tensor size
- Check GPU memory:

  ```bash
  nvidia-smi
  ```

- Use a memory pool (CUDA 11.2+):

  ```cuda
  cudaMemPool_t pool;
  cudaDeviceGetDefaultMemPool(&pool, 0);
  ```
Symptoms:

```
CUDA error: too many resources requested for launch
```

Solutions:

- Reduce the block size:

  ```cuda
  // Instead of 1024 threads
  dim3 block(256);  // Use a smaller block
  ```

- Check register usage:

  ```bash
  nvcc --ptxas-options=-v kernel.cu
  ```
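When shrinking the block, the grid must grow to keep covering all elements. A quick Python sketch of the round-up arithmetic (the `launch_config` helper is ours, purely illustrative, not part of the library):

```python
# Illustrative helper: pick a 1-D launch configuration that covers
# n elements with a given block size, rounding the grid up.
def launch_config(n, block=256):
    grid = (n + block - 1) // block  # ceil(n / block) blocks
    return grid, block

print(launch_config(1000))  # (4, 256)
```

The same ceiling-division pattern applies per dimension for 2-D and 3-D launches.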
Symptoms: TFLOPS much lower than expected

Solutions:

- Ensure dimensions are multiples of 16:

  ```cuda
  // Tensor Cores require M, N, K divisible by 16
  M = ((M + 15) / 16) * 16;  // Pad to a multiple of 16
  ```

- Verify FP16 input:

  ```cuda
  // Tensor Cores require __half input
  hpc::Tensor<__half> A(M * K);  // Not float
  ```

- Check occupancy (nvprof is deprecated on recent GPUs; use Nsight Compute):

  ```bash
  ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active ./program
  ```
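The padding rule above is plain round-up arithmetic; a small Python sketch (the `pad_to_multiple` helper is ours, for illustration only):

```python
# Illustrative: round n up to the next multiple of m (16 for Tensor Cores).
def pad_to_multiple(n, m=16):
    return ((n + m - 1) // m) * m

print(pad_to_multiple(1000))  # 1008
```

Values already on the boundary are left unchanged, so padding is safe to apply unconditionally.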
Symptoms: Unexpected slowdown in shared memory operations

Solutions:

- Add padding to shared memory arrays:

  ```cuda
  __shared__ float tile[32][33];  // +1 for bank conflict avoidance
  ```

- Profile with Nsight Compute:

  ```bash
  ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld ./program
  ```
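Why the +1 column works: shared memory is divided into 32 four-byte banks, and a word's bank is its index mod 32. This Python sketch (assuming 32 banks and float elements) shows that a column read from a 32-wide tile lands entirely in one bank, while a 33-wide tile spreads across all 32:

```python
# Illustrative: banks touched when 32 threads each read tile[t][col],
# with each tile row `stride` floats wide (32 banks, 4-byte words).
def banks_hit(stride, col=0, threads=32):
    return {(t * stride + col) % 32 for t in range(threads)}

print(len(banks_hit(32)))  # 1  -> 32-way bank conflict
print(len(banks_hit(33)))  # 32 -> conflict-free
```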
Solutions:

- Build with Python bindings:

  ```bash
  cmake -S . -B build -DBUILD_PYTHON_BINDINGS=ON
  cmake --build build
  ```

- Set PYTHONPATH:

  ```bash
  export PYTHONPATH="$(pwd)/build/python:${PYTHONPATH}"
  ```

- Verify:

  ```python
  import hpc_ai_opt
  print(hpc_ai_opt.__doc__)
  ```
Symptoms:

```
ValueError: Tensors must be on CUDA device
```

Solutions:

```python
# Wrong
x = torch.randn(1024)  # CPU tensor

# Correct
x = torch.randn(1024, device="cuda")  # GPU tensor
```

Solutions:

- Check that tensor dtypes match:

  ```python
  x = torch.randn(1024, device="cuda", dtype=torch.float32)
  y = torch.empty_like(x)  # Same dtype and device
  ```

- Verify dimensions:

  ```python
  # FlashAttention requires head_dim=64
  config = {
      'head_dim': 64,  # Must be 64
      ...
  }
  ```
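Both failure modes can be caught before any GPU work starts. A hedged sketch of a pre-flight check (the `validate` helper is ours, not part of the bindings; it only assumes tensor-like objects with `.device` and `.dtype`, such as `torch.Tensor`):

```python
# Illustrative helper: raise the same errors the bindings would,
# before calling into any kernel. x and y are tensor-like objects
# exposing .device and .dtype (e.g. torch.Tensor).
def validate(x, y):
    for t in (x, y):
        dev = getattr(t.device, "type", t.device)  # torch devices have .type
        if dev != "cuda":
            raise ValueError("Tensors must be on CUDA device")
    if x.dtype != y.dtype:
        raise ValueError("Tensor dtypes must match")
```

Calling this at the top of your own wrapper functions turns late kernel failures into immediate, readable Python exceptions.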
| Error Code | Description | Common Cause |
|---|---|---|
| 1 | Invalid value | Bad parameter |
| 2 | Out of memory | GPU memory exhausted |
| 8 | Invalid device ordinal | Wrong GPU ID |
| 9 | Invalid kernel image | Architecture mismatch |
| 30 | Unknown error | Usually driver issue |
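For logging, the table can be mirrored as a simple lookup. A sketch (the `CUDA_ERRORS` dict and `describe` function are ours; `cudaGetErrorString` remains the authoritative source):

```python
# Illustrative: the error-code table above as a lookup with a fallback.
CUDA_ERRORS = {
    1: "Invalid value",
    2: "Out of memory",
    8: "Invalid device ordinal",
    9: "Invalid kernel image",
    30: "Unknown error",
}

def describe(code):
    return CUDA_ERRORS.get(code, "Unrecognized error code %d" % code)

print(describe(2))  # Out of memory
```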
Solutions:

- Check the driver version:

  ```bash
  nvidia-smi  # Look for "Driver Version"
  ```

- Update the driver:

  ```bash
  # Ubuntu
  sudo apt install nvidia-driver-535
  ```

- Match the CUDA version:

  | CUDA | Min Driver |
  |---|---|
  | 12.4 | 550.54+ |
  | 12.3 | 545.23+ |
  | 12.2 | 535.54+ |
Solutions:

- Check the GPU architecture:

  ```bash
  nvidia-smi --query-gpu=compute_cap --format=csv
  ```

- Build for the correct architecture:

  ```bash
  # For A100 (SM 8.0)
  cmake -S . -B build -DCMAKE_CUDA_ARCHITECTURES="80"

  # For H100 (SM 9.0)
  cmake -S . -B build -DCMAKE_CUDA_ARCHITECTURES="90"
  ```
```cuda
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = call; \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

// Use after kernel launch
kernel<<<grid, block>>>(args);
CUDA_CHECK(cudaGetLastError());
CUDA_CHECK(cudaDeviceSynchronize());
```

```bash
# Check for memory errors
compute-sanitizer ./build/tests/gemm/test_gemm
```
```bash
# Check for race conditions
compute-sanitizer --tool racecheck ./program

# Check for memory leaks
compute-sanitizer --tool memcheck ./program
```

```bash
# Detailed kernel analysis
ncu --set full -o profile ./program

# Focus on specific metrics
ncu --metrics gpu__time_duration.sum ./program

# Compare kernels
ncu --set basic ./program
```

```bash
# System-wide profiling
nsys profile -o timeline ./program

# View results
nsys-ui timeline.nsys-rep
```

If your issue isn't covered here:
- Search existing issues: GitHub Issues
- Check documentation: Documentation
- Ask in discussions: GitHub Discussions
- Report a bug: Use the Bug Report Template
When reporting, please include:

- OS and version
- CUDA version (`nvcc --version`)
- GPU model and driver (`nvidia-smi`)
- CMake configuration output
- Full error message
- Minimal reproduction code
Q: Can I run this library without an NVIDIA GPU?

A: No. This library requires an NVIDIA GPU with Compute Capability 7.0+. All kernels execute on the GPU.

Q: Why are my kernels slower than expected?

A: Common reasons:

- Wrong GPU architecture (compile for your GPU)
- Non-optimal dimensions (pad to multiples of 16 for Tensor Cores)
- Low occupancy (reduce register usage)
- Bank conflicts (add padding)
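The occupancy point can be made concrete with a back-of-the-envelope estimate of how register pressure alone caps active warps. This sketch assumes 65,536 registers per SM and a 64-warp-per-SM limit, typical of recent NVIDIA architectures (the `occupancy` helper is ours; Nsight Compute gives the real figure):

```python
# Illustrative: occupancy limited by registers alone. Assumes 65536
# registers per SM and a 64-warp limit (both architecture-dependent).
def occupancy(regs_per_thread, block=256, regs_per_sm=65536, max_warps=64):
    blocks_per_sm = regs_per_sm // (regs_per_thread * block)
    warps = min(blocks_per_sm * block // 32, max_warps)
    return warps / max_warps

print(occupancy(32))   # 1.0  -> full occupancy
print(occupancy(128))  # 0.25 -> register-bound
```

Halving registers per thread (via `__launch_bounds__` or `-maxrregcount`) can double the estimate, which is why the register check above matters.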
Q: Does the library build on Windows?

A: Yes, with Visual Studio 2022+ and CUDA 12.4+. Use the CMake GUI or a Developer Command Prompt.

Q: Can I use it with PyTorch?

A: Yes! Build the Python bindings and pass PyTorch CUDA tensors directly:

```python
import torch
import hpc_ai_opt

x = torch.randn(1024, device="cuda")
y = torch.empty_like(x)
hpc_ai_opt.elementwise.relu(x, y)
```

Still stuck? Open an issue and we'll help!