First off — thanks for taking the time to look at this project. This is a personal learning project focused on GPU inference engineering, and contributions that deepen that exploration are very welcome.
Most welcome:
- Bug fixes in CUDA kernels (incorrect output, numerical instability, race conditions)
- Performance improvements with measured benchmarks to back them up
- New kernel variants (e.g. FP16 attention, INT4 quantization, grouped-query attention)
- Better test coverage for numerical correctness
- Documentation fixes or clearer explanations in code comments
Also welcome:
- New model support beyond GPT-2 (LLaMA-style RoPE, SwiGLU FFN, etc.)
- Build system improvements (CMake, CI via GitHub Actions)
- Python API ergonomics
Out of scope for now:
- Training support — this engine is inference-only by design
- Replacing hand-written kernels with cuBLAS/cuDNN — the whole point is to understand what's happening inside
# CUDA 12.x
nvcc --version
# CMake 3.20+
cmake --version
# Python 3.10+
pip install pybind11 safetensors numpy torch  # torch only needed for correctness tests

git clone https://github.com/lecuong1502/CUDA-Inference-Engine
cd CUDA-Inference-Engine
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

# Numerical correctness vs PyTorch reference
python tests/test_kernels.py
# Full benchmark suite
python benchmarks/run_benchmark.py

- Open an issue first for anything non-trivial: kernel changes, new features, API modifications. Describe what you're trying to fix or add and why.
- Fork and branch from main:
      git checkout -b fix/attention-kernel-overflow
  Branch naming: fix/, feat/, bench/, docs/ prefixes.
- Write or update tests. Kernel changes must include a test in tests/test_kernels.py that validates output against a PyTorch reference within an acceptable tolerance:
      assert torch.allclose(engine_output, pytorch_output, atol=1e-4, rtol=1e-3)
- Include benchmark numbers for any change that claims a performance improvement. Format:
      Before: X ms/token, Y tok/s (RTX 4050, seq_len=512)
      After:  X ms/token, Y tok/s
- Open a pull request against main. Keep the PR focused: one fix or feature per PR.
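To make the tolerance in the test step concrete, here is a minimal pure-Python sketch of the elementwise criterion that torch.allclose documents, |actual - expected| <= atol + rtol * |expected|. The helper name and the sample values are illustrative assumptions, not code from tests/test_kernels.py; a real test would call torch.allclose on tensors directly.

```python
import math


def allclose(actual, expected, atol=1e-4, rtol=1e-3):
    """Hypothetical helper mirroring the documented torch.allclose criterion."""
    if len(actual) != len(expected):
        return False
    for a, e in zip(actual, expected):
        if a == e:
            continue  # exact matches (including equal infinities) are close
        error = abs(a - e)
        allowed = atol + rtol * abs(e)
        # A NaN or infinite error never counts as close.
        if not math.isfinite(error) or error > allowed:
            return False
    return True


# Made-up values standing in for PyTorch and engine outputs.
reference = [0.5, -2.0, 100.0]
within_tolerance = [0.50005, -2.001, 100.05]
out_of_tolerance = [0.5, -2.0, 100.2]

print(allclose(within_tolerance, reference))
print(allclose(out_of_tolerance, reference))
```

Note that the allowed error scales with the magnitude of the reference value, so an absolute error of 0.05 passes at a reference of 100.0 but would fail at a reference near zero, where atol dominates.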
- Follow the existing file structure: one kernel family per .cu file (gemm.cu, attention.cu, etc.)
- Use // === section headers to separate kernel variants within a file
- Document launch parameters at the top of each kernel:
      // Grid: (M/BM, N/BN)
      // Block: (BN, BM)
      // Shared: 2 * BM * BK * sizeof(float)
- Prefer constexpr for tile sizes, avoid magic numbers
- All kernels must handle edge cases: non-power-of-two dimensions, seq_len < block_size, batch_size=1
- Follow black formatting (pip install black && black .)
- Type hints on all public API functions
- Keep python/binding.cpp minimal: just the pybind11 glue, no logic
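The edge-case rule interacts with the documented grid shape: if M/BM is read as truncating integer division, dimensions that are not a multiple of the tile size lose their last partial tile. The sketch below shows the arithmetic in Python, assuming the grid is rounded up instead. BM and BN come from the launch-parameter comment above; the function names are hypothetical and not part of the project.

```python
def ceil_div(a: int, b: int) -> int:
    """Smallest integer >= a / b, for positive b."""
    return (a + b - 1) // b


def launch_grid(M: int, N: int, BM: int, BN: int) -> tuple[int, int]:
    """Grid in the documented (M/BM, N/BN) order, rounded up to cover partial tiles."""
    return (ceil_div(M, BM), ceil_div(N, BN))


def idle_rows_in_last_tile(M: int, BM: int) -> int:
    """Rows in the final tile that fall outside the matrix and need a bounds guard."""
    return ceil_div(M, BM) * BM - M


M, N, BM, BN = 100, 70, 32, 32  # illustrative, non-power-of-two sizes
print(M // BM, N // BN)               # truncating division drops the partial tiles
print(launch_grid(M, N, BM, BN))      # rounded-up grid covers every element
print(idle_rows_in_last_tile(M, BM))  # out-of-range rows in the last row tile
```

Rounding the grid up only guarantees coverage. The kernel still has to skip or mask the out-of-range rows and columns in the trailing tiles, which is where the seq_len < block_size and batch_size=1 cases tend to surface.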
Use conventional commits:
fix: resolve numerical overflow in softmax kernel for long sequences
feat: add FP16 attention kernel with 2x memory reduction
bench: add GQA benchmark vs standard MHA
docs: clarify tiling strategy in gemm.cu comments
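As a rough illustration of the convention, the following sketch checks a commit subject against the four prefixes used in the examples above. The regular expression and the function name are assumptions made for this example. The project does not necessarily enforce commit messages this way, and the wider Conventional Commits format also allows scopes and other types that this check would reject.

```python
import re

# Only the four prefixes shown in this guide; a hypothetical, deliberately strict check.
COMMIT_SUBJECT = re.compile(r"^(fix|feat|bench|docs): \S.*$")


def is_conventional(subject: str) -> bool:
    """Return True if the subject line starts with 'type: ' followed by a description."""
    return COMMIT_SUBJECT.match(subject) is not None


examples = [
    "fix: resolve numerical overflow in softmax kernel for long sequences",
    "bench: add GQA benchmark vs standard MHA",
    "Fixed the softmax kernel",
    "feat:missing space after the colon",
]
for subject in examples:
    print(is_conventional(subject), subject)
```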
Open a GitHub issue with:
- Environment: GPU model, CUDA version, OS, Python version
- Reproduction steps: minimal code to trigger the bug
- Expected vs actual output: if it's a numerical issue, include the max absolute error and which kernel is involved
- Nsight profile (if available): attach a .ncu-rep report or screenshot from Nsight Compute if it's a performance regression
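For numerical issues, the max absolute error is easier to act on when it comes with the position of the worst element. Here is a small sketch of one way to compute both over flattened outputs. The function name and sample values are made up for illustration. With PyTorch available, the error alone could instead be computed as (engine_output - pytorch_output).abs().max().

```python
import math


def max_abs_error(actual, expected):
    """Return (largest absolute error, index of that element) for two flat sequences."""
    if len(actual) != len(expected):
        raise ValueError("outputs must have the same number of elements")
    worst_error, worst_index = 0.0, 0
    for i, (a, e) in enumerate(zip(actual, expected)):
        error = abs(a - e)
        if math.isnan(error):
            error = math.inf  # surface NaNs as the worst possible error
        if error > worst_error:
            worst_error, worst_index = error, i
    return worst_error, worst_index


# Made-up values standing in for flattened engine and PyTorch outputs.
engine_output = [1.0, 2.5, -3.0]
pytorch_output = [1.0, 2.0, -3.25]

error, index = max_abs_error(engine_output, pytorch_output)
print(f"max abs error {error} at flat index {index}")  # max abs error 0.5 at flat index 1
```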
If you're unsure whether something is worth contributing, open a Discussion rather than an issue. Questions about the kernel implementation, GPU architecture, or quantization math are also welcome there.
By contributing, you agree that your contributions will be licensed under the Apache 2.0 License — same as the rest of the project.