This repository contains examples of NineToothed, including implementations of several common compute kernels written using NineToothed.
After cloning this repository, install the dependencies and run the tests using pytest:

    pytest

To run all benchmarks:

    pytest -m benchmark

To run a single operator benchmark:

    pytest -m benchmark -k TestMM

Some examples apply autotuning, which can take several minutes or longer for complex kernels. If you wish to disable autotuning, you can replace the symbol definitions with concrete values.
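To give a rough idea of why autotuning can be slow, here is a minimal pure-Python sketch of an exhaustive autotuner. It is not how NineToothed or Triton implement autotuning. The names `blocked_sum` and `autotune` are hypothetical, and the toy "kernel" runs on the CPU. Each candidate block size is timed several times and the fastest one is kept, so the total cost grows with the number of candidates. For real GPU kernels, each candidate also has to be compiled.

```python
import time
from typing import Callable, Iterable


def blocked_sum(data: list[float], block_size: int) -> float:
    """Hypothetical toy stand-in for a kernel: sums `data` one block at a time."""
    total = 0.0
    for start in range(0, len(data), block_size):
        total += sum(data[start:start + block_size])
    return total


def autotune(kernel: Callable[[list[float], int], float],
             data: list[float],
             candidates: Iterable[int],
             repeats: int = 5) -> tuple[int, dict[int, float]]:
    """Time `kernel` for every candidate block size and return the fastest one.

    Each candidate runs `repeats` times, so the total work is roughly
    len(candidates) * repeats kernel calls.
    """
    timings: dict[int, float] = {}
    for block_size in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(data, block_size)
            # Keep the minimum to reduce the effect of timing noise.
            best = min(best, time.perf_counter() - start)
        timings[block_size] = best
    best_block_size = min(timings, key=timings.get)
    return best_block_size, timings


if __name__ == "__main__":
    data = [1.0] * 100_000
    best, timings = autotune(blocked_sum, data, candidates=[64, 256, 1024, 4096])
    for block_size, seconds in sorted(timings.items()):
        print(f"BLOCK_SIZE={block_size:5d}: {seconds * 1e3:.3f} ms")
    print("selected BLOCK_SIZE:", best)
```

Disabling autotuning amounts to skipping this search and committing to a single value up front.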
Consider the following example:

    BLOCK_SIZE = Symbol("BLOCK_SIZE", meta=True)

Here, meta=True marks BLOCK_SIZE as a meta symbol whose value is chosen by autotuning. To disable autotuning, you can:
- Set constexpr=True (instead of meta=True) and pass a value for BLOCK_SIZE when invoking the kernel.
- Replace the symbol definition with a fixed integer value, as shown below:

    BLOCK_SIZE = 1024

Either approach lets you obtain results in seconds. However, selecting good values is crucial for performance, so experiment with different values to find the best configuration.
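One reason the chosen value matters is that a block size determines how the data is partitioned. The following pure-Python sketch assumes a simple 1-D tiling, where each block is handled by one program instance and the unused lanes of the last block are masked off. The helper `tile_1d` is hypothetical and only does the arithmetic. It does not call NineToothed.

```python
def tile_1d(n: int, block_size: int) -> dict[str, int]:
    """Describe how a length-`n` vector splits into blocks of `block_size`."""
    if n <= 0 or block_size <= 0:
        raise ValueError("n and block_size must be positive")
    # Ceiling division: number of blocks (program instances) needed to cover n.
    num_blocks = (n + block_size - 1) // block_size
    return {
        "num_blocks": num_blocks,
        # Elements actually present in the final, possibly partial, block.
        "last_block_elements": n - (num_blocks - 1) * block_size,
        # Lanes allocated but masked off because they fall past the end.
        "masked_lanes": num_blocks * block_size - n,
    }


if __name__ == "__main__":
    for block_size in (64, 256, 1024, 4096):
        print(block_size, tile_1d(4000, block_size))
```

For n = 4000, a block size of 1024 gives 4 blocks, with 928 elements in the last block and 96 masked lanes. A block size of 64 gives 63 blocks with 32 masked lanes. Larger blocks launch fewer program instances but can waste more lanes and put more pressure on registers, while smaller blocks do the opposite. This is why the best value is usually found empirically.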
Note: Remember to also disable autotuning in the corresponding Triton kernels, for example by reducing each triton.autotune configuration list to a single triton.Config.
This project includes code modified from or inspired by the following open-source repositories:
- https://github.com/huggingface/transformers
- https://github.com/triton-lang/triton
- https://github.com/ROCm/triton
- https://github.com/l1351868270/implicit_gemm.triton
Licenses for third-party code are stored in the third_party directory. Each subdirectory contains its associated LICENSE file.
This repository is distributed under the Apache-2.0 license. See the included LICENSE file for details.