Q8 Kernels

Q8Kernels is an efficient implementation of 8-bit kernels (FP8 and INT8).

Features:

-8-bit GEMM (with fused GELU and bias) / 2x faster than cuBLAS FP8 and 3.5x faster than torch.mm
-FP8 Flash Attention 2 with Fast Hadamard Transform (also supports cross-attention masks) / 2x faster than Flash Attention 2
-Mixed Precision Fast Hadamard Transform
-RMSNorm
-Mixed Precision FMA
-RoPE Layer
-Quantizers
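
To make two of the listed building blocks concrete, here is a minimal pure-Python reference sketch of a normalized fast Walsh-Hadamard transform and a symmetric per-tensor INT8 quantizer. It is not the library's CUDA code or its API: the function names (`fwht`, `quantize_int8`, `dequantize`) and the per-tensor scaling scheme are assumptions chosen for illustration. The usual motivation for pairing the two is that the transform spreads a single large outlier across all entries, which tends to shrink the quantization scale.

```python
import math

def fwht(x):
    """Normalized fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        # Butterfly step: combine entries that are h apart.
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthonormal (self-inverse)
    return [v * scale for v in y]

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization (illustrative scheme): returns (codes, scale)."""
    amax = max(abs(v) for v in x)
    scale = amax / 127.0 if amax > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in x]
    return codes, scale

def dequantize(codes, scale):
    """Map INT8 codes back to floats."""
    return [c * scale for c in codes]

# Hypothetical input with one outlier in the last position.
x = [0.1, -0.2, 0.05, 8.0]
rotated = fwht(x)
codes, scale = quantize_int8(rotated)
# Applying the normalized transform again undoes the rotation.
restored = fwht(dequantize(codes, scale))
print(max(abs(v) for v in x), max(abs(v) for v in rotated))
```

Because the normalized transform is its own inverse, the rotation can be undone after dequantization, as the last lines of the sketch do.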

All operations are implemented in CUDA. The current version supports the Ada architecture (Ampere optimizations are coming soon!).
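
A quick way to see whether a GPU falls under this requirement is to look at its CUDA compute capability. The sketch below assumes that "Ada" here means NVIDIA Ada Lovelace, which reports compute capability 8.9; the helper names `is_ada` and `current_gpu_is_ada` are hypothetical and not part of q8_kernels.

```python
def is_ada(capability):
    """True if a (major, minor) compute capability is Ada Lovelace (SM 8.9)."""
    return tuple(capability) == (8, 9)

def current_gpu_is_ada():
    """Check the default CUDA device; returns None if PyTorch or a GPU is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # torch.cuda.get_device_capability() returns a (major, minor) tuple.
    return is_ada(torch.cuda.get_device_capability())

print(current_gpu_is_ada())
```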

Installation

q8_kernels requires CUDA >= 12.4 and PyTorch >= 2.4. q8_kernels was tested on a Windows machine; building on Linux systems should not be a problem.

Install ninja:

pip install ninja

Make sure that ninja is installed and works correctly (e.g. ninja --version). Without ninja, installation is very slow.

git clone https://github.com/KONAKONA666/q8_kernels
cd q8_kernels 
git submodule init
git submodule update

python setup.py install
pip install . # for utility

It takes ~10-15 minutes to compile and install all modules.
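
Since the build is long, it can be worth checking the stated prerequisites before starting it. The following is a minimal sketch, not a script shipped with the repository: `version_tuple` and `check_prerequisites` are hypothetical helpers. Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which may differ from the locally installed toolkit.

```python
import re
import shutil
import subprocess

def version_tuple(text):
    """Parse the leading 'major.minor' of a version string, e.g. '2.4.1+cu124' -> (2, 4)."""
    match = re.match(r"(\d+)\.(\d+)", text or "")
    return (int(match.group(1)), int(match.group(2))) if match else None

def check_prerequisites():
    """Return a dict of booleans for ninja, PyTorch >= 2.4 and CUDA >= 12.4."""
    report = {"ninja": False, "torch>=2.4": False, "cuda>=12.4": False}

    # ninja must be on PATH and must actually run.
    ninja = shutil.which("ninja")
    if ninja:
        try:
            result = subprocess.run([ninja, "--version"],
                                    capture_output=True, text=True, timeout=30)
            report["ninja"] = result.returncode == 0
        except (OSError, subprocess.SubprocessError):
            pass

    # PyTorch is optional here so the check also runs before it is installed.
    try:
        import torch
    except ImportError:
        return report
    torch_version = version_tuple(torch.__version__)
    cuda_version = version_tuple(torch.version.cuda)  # None on CPU-only builds
    report["torch>=2.4"] = torch_version is not None and torch_version >= (2, 4)
    report["cuda>=12.4"] = cuda_version is not None and cuda_version >= (12, 4)
    return report

for name, ok in check_prerequisites().items():
    print(f"{name}: {'ok' if ok else 'missing or too old'}")
```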

Supported models

Speedups are computed relative to transformers inference in 16-bit precision with Flash Attention 2.

Model name	Speed up
LTXVideo	up to 2.5x
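
For readers who want to reproduce a figure like this on their own hardware, a speedup is simply the baseline latency divided by the optimized latency. The sketch below shows one way to measure it; the helpers `median_time` and `speedup` are hypothetical, and the two callables are placeholders for a 16-bit baseline and a q8_kernels run, not the author's benchmark setup.

```python
import statistics
import time

def median_time(fn, warmup=3, iters=10):
    """Median wall-clock time of fn() in seconds, after a few warmup calls.

    For GPU work, fn should synchronize the device (e.g. torch.cuda.synchronize())
    before returning, otherwise only the kernel launch is timed.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def speedup(baseline_seconds, optimized_seconds):
    """Speedup factor: how many times faster the optimized run is."""
    if optimized_seconds <= 0:
        raise ValueError("optimized_seconds must be positive")
    return baseline_seconds / optimized_seconds

# Placeholder workloads standing in for the baseline and the 8-bit pipeline.
baseline = median_time(lambda: sum(range(200_000)))
optimized = median_time(lambda: sum(range(100_000)))
print(f"speedup: {speedup(baseline, optimized):.2f}x")
```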

Acknowledgement

Thanks to:

-Flash attention
-@66RING
-fast-hadamard-transform
-cutlass
-@weishengying: check his CUTE exercises and flash attention implementations

Authors

KONAKONA666

License

MIT Free Software
