CUDA programming practice project: hands-on CUDA kernels and performance optimization, covering GEMM, FlashAttention, Tensor Cores, CUTLASS, quantization, KV cache, NCCL, and profiling.
Updated May 11, 2026 (Cuda)
This is my 🔥 100 Days of GPU — a wild, hands-on journey through CUDA/CUTLASS kernels, Triton spells, and PTX sorcery.
Profiling with NVIDIA Nsight Tools Bootcamp
References content from the OLCF CUDA Training Series (https://github.com/olcf/cuda-training-series).
CUDA Samples and Nsight Guided Profiling Samples
A comprehensive, hardware-agnostic GPU benchmarking suite that compares CUDA, OpenCL, and DirectCompute performance using identical workloads. Built from scratch with a professional architecture, extensive documentation, and a production-ready GUI.
Learning CUDA GEMM optimization and profiling for AI infrastructure.
CUDA FP32 GEMM optimization with loop unrolling, shared memory tiling, register tiling, benchmarking, and Nsight profiling.
libHPC is a high-performance computing library focused on Linux and Windows environments. It provides SIMD-optimized kernels, concurrent data structures, GPU utilities, and HPC-oriented memory management components.
CUDA learning lab: kernel exercises, profiling experiments, and writeups.
This project demonstrates the integration of a CUDA kernel within an NVIDIA Holoscan application. It consists of two custom operators: one for memory allocation and data initialization, and another for executing the CUDA kernel. The application was profiled with Nsight Systems and the kernel with Nsight Compute.
Single-head CUDA attention kernel: naive SDPA --> fused softmax --> occupancy-tuned variants, benchmarked against cuDNN SDPA with Nsight Compute profiling.
CUDA reduction primitive using warp shuffles, grid-stride loading, and memory-bandwidth profiling with Nsight Compute.
Nsight-driven CUDA kernel profiling studies for GEMM, Tensor Core GEMM, reductions, softmax, and attention against vendor baselines.
Kernel-only profiling workflow for CUDA and Triton kernels with Nsight Compute, standardized reports, visual analysis, and vendor-portable adapters.
Sparse binary 2D FFT on CUDA/cuFFT with memory-footprint optimization, streaming tiles, Hermitian symmetry, and Nsight analysis.
GPU-accelerated Number-Theoretic Transform for ZK-Proof generation. Targets the NTT bottleneck (91% of Groth16 prover time) via two CUDA optimizations: async double-buffered pipeline eliminating CPU-GPU transfer overhead, and IADD3-path Montgomery multiplication reducing finite-field instruction latency. BLS12-381, Ampere sm_86, Nsight-profiled.
Profile-driven FP32 CUDA GEMM optimization: naive --> tiled --> coalesced --> register-blocked --> bank-padded, benchmarked against cuBLAS.
Row-wise CUDA softmax kernels: shared-memory reduction, warp-shuffle reduction, and online softmax benchmarked against cuDNN.
University project for the "Computer Architecture" course (MSc Computer Engineering, University of Pisa). Implementation of a parallelized nearest-neighbor upscaler using CUDA.
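For readers unfamiliar with the technique named in the last entry, here is a minimal CPU reference sketch of nearest-neighbor upscaling in C++. It is not taken from that repository: the function name `upscale_nearest`, the row-major grayscale layout, and the integer scale factor are all assumptions made for illustration. Each output pixel depends only on one source pixel, which is why the operation maps naturally to one GPU thread per output pixel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from any listed project): upscale a row-major
// grayscale image by an integer factor using nearest-neighbor sampling.
std::vector<std::uint8_t> upscale_nearest(const std::vector<std::uint8_t>& src,
                                          std::size_t width,
                                          std::size_t height,
                                          std::size_t scale) {
    assert(scale >= 1);
    assert(src.size() == width * height);

    const std::size_t out_w = width * scale;
    const std::size_t out_h = height * scale;
    std::vector<std::uint8_t> dst(out_w * out_h);

    // Every (x, y) iteration is independent of the others; in a CUDA version
    // the two loops would typically become the thread/block index space.
    for (std::size_t y = 0; y < out_h; ++y) {
        for (std::size_t x = 0; x < out_w; ++x) {
            // Integer division maps an output coordinate back to its source pixel.
            dst[y * out_w + x] = src[(y / scale) * width + (x / scale)];
        }
    }
    return dst;
}
```

Calling `upscale_nearest` on a 2x2 image with `scale = 2` yields a 4x4 image in which each source pixel is replicated into a 2x2 block.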