English | 中文
This tutorial is aimed at beginners who are new to Python and GPU programming. It walks through important GPU kernels for Large Language Models (LLMs), starting from vector addition and moving on to RoPE, matmul_ogs, Top-K, Gluon Attention, and more.
If you haven't installed Python yet, or want to learn some basic Python syntax before getting started, Python.org provides a comprehensive tutorial.
OpenAI/Triton is a Python-based language and compiler that lets developers write efficient GPU kernels in Python syntax instead of C++. As of now, Triton supports multiple backends, including NVIDIA, AMD, Huawei Ascend, Cambricon, Moore Threads, and MetaX. This means a kernel written in Triton can be compiled and run on different hardware architectures. For more details, see FlagOpen/FlagGems.
Advantages: Easy-to-use Python-like syntax, with the compiler handling much of the optimization for GPU performance and memory management.
Applications: Making LLM inference more efficient by speeding up deep learning operators and enabling custom operators.
This repository requires Triton 3.4.0 (released on July 31, 2025), which comes with torch == 2.8.0. Since Triton has excellent backward compatibility, other versions of PyTorch might work as well if Triton is manually upgraded.
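To confirm that your environment matches the versions above, a small check like the following may help. It is only a sketch: the helper names (`installed_version`, `matches`, `check_environment`) are made up for illustration, and it assumes the packages are installed under the distribution names `triton` and `torch`, which may differ for some builds.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions stated in this README; adjust if the requirement changes.
EXPECTED = {"triton": "3.4.0", "torch": "2.8.0"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def matches(installed, expected):
    """Compare release numbers only, ignoring local build tags such as '+cu128'."""
    if installed is None:
        return False
    return installed.split("+")[0] == expected


def check_environment(expected=EXPECTED):
    """Map each package name to (installed_version, is_expected_version)."""
    report = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        report[package] = (found, matches(found, wanted))
    return report


if __name__ == "__main__":
    for package, (found, ok) in check_environment().items():
        status = "OK" if ok else "check this"
        print(f"{package}: installed={found}, expected={EXPECTED[package]} -> {status}")
```

A mismatch does not necessarily mean the tutorial will fail, since other PyTorch versions might work once Triton is upgraded manually. The check only tells you whether you are on the tested combination.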