🤖FFPA: Extends FlashAttention-2 via Split-D for large headdims, 1.5x~3×↑🎉 vs SDPA, up to 430T🎉 on H200.
Updated May 30, 2026 - Python
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
General Matrix Multiplication using NVIDIA Tensor Cores
Handwritten Flash Attention 2 CUDA kernel for Blackwell (SM120) with TMA, swizzle, double buffering & warp specialization
CUDA matrix multiplication benchmarking on Jetson Orin Nano. Four implementations, three power modes, five matrix sizes. 99.5% mathematical validation. C++/CUDA and Python.
Vulkan & GLSL implementation of FlashAttention-2
Lynn-native LLM inference engine for NVIDIA Blackwell · W4A8/NVFP4 quantization · hand-written CUDA/Triton kernels · MoE · speculative decoding
CUDA 12-first backend inference for Unsloth on Kaggle — Optimized for small GGUF models (1B-5B) on dual Tesla T4 GPUs (15GB each, SM 7.5)
Bilingual (English/Chinese) CUDA SGEMM optimization tutorial and reference implementation, from naive kernels to Tensor Core WMMA
A benchmarking framework for correlators of FX telescope arrays
INT8 Sparse Tensor Core GEMM for PyTorch — built for Windows
The MNIST classification problem is a fundamental machine learning task that involves recognizing handwritten digits (0-9) from a dataset of 70,000 grayscale images (28x28 pixels each). It serves as a benchmark for evaluating machine learning models, particularly neural networks.
High-performance CUDA kernels with step-by-step optimization, profiling, and analysis. A growing collection of GPU solutions demonstrating warp-level tuning, memory optimization, and Tensor Core acceleration.
Neural Network C is an advanced neural network implementation in pure C, optimized for high performance on CPUs and NVIDIA GPUs.
RA-SpMM: Regime-Aware Sparse Matrix Multiplication for GNN Workloads on GPUs. 8-rule router, 6 preprocessing-free kernels, 3.25x over cuSPARSE (FGCS 2026).
Tensor-core CUDA kernels for Nyström attention, linear-time forward and backward with exact autograd gradients. Faster than flash-attention at long sequence length.
🔍 Analyze CUDA matrix multiplication performance and power consumption on NVIDIA Jetson Orin Nano across multiple implementations and settings.
10,000-image LeNet-5 forward pass in ~28 ms on a single A40 via fused convolution and Tensor Cores (TF32).
TsuruTune is a comprehensive deep learning model optimization tool designed specifically for NVIDIA Jetson platforms and edge devices. It leverages Tensor Core acceleration and memory bandwidth alignment to achieve optimal performance for deep learning inference on edge devices.
Multi-GPU CUDA stress test with Tensor Core power filler for board power testing