For development setup, see Development Guide below.
Please review these details before committing, especially for AI-generated code.
- Keep changes minimal — do not add what is not necessary.
- Comments are not always better when abundant. Ideally, the code should be self-explanatory.
- Files must end with a newline.
- Use Markdown syntax (backtick-fenced) for identifiers in comments and error messages.
- Comments and error messages must be in English.
- Comments and error messages should follow the language's conventions first. If the language does not specify, use complete sentences — capitalize the first letter and end with punctuation.
- Follow the Google C++ Style Guide strictly. Use the default `.clang-format`.
- Operator parameters: inputs first, outputs last, with attributes between inputs and outputs. For naming, prefer PyTorch conventions, then ONNX, then CUDA APIs.
- Do not use exceptions. Use `assert` for error handling: debug builds will trigger assertion messages (which should include `__FILE__`, `__LINE__`, and `__func__` at minimum), and release builds will have assertions compiled out.
- Error and warning messages follow the LLVM Coding Standards.
- Kernel file naming (excluding extension): custom kernels without a well-known algorithm name should be named `kernel`; multiple implementations use `kernel_v2`, `kernel_v3`, etc. Well-known algorithms use the algorithm name (e.g. `flash_attention_v2`). Library-based implementations use the library name (e.g. `blas`).
- Separate the kernel from the kernel launcher. Launchers use `.h`; kernels follow platform conventions (e.g. `.cuh`). Non-template kernels still require header/source separation (e.g. `.cuh` + `.cu`).
- Initializer list order must match member declaration order.
- One blank line between classes, between classes and functions, and between functions.
- One blank line between each member (including both functions and variables) within a class.
- One blank line before and after the contents of a namespace.
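The `assert`-based error handling and parameter-ordering rules above can be sketched as follows. This is a hypothetical illustration; the `ASSERT_MSG` macro and `Scale` function are not part of the codebase:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical helper: on failure, print a message including `__FILE__`,
// `__LINE__`, and `__func__`, then trigger the assertion. With `NDEBUG`
// defined (release builds), `assert` compiles out entirely.
#define ASSERT_MSG(cond, msg)                                            \
  assert((cond) ||                                                       \
         (std::fprintf(stderr, "%s:%d (%s): %s\n", __FILE__, __LINE__,   \
                       __func__, (msg)),                                 \
          false))

// Example operator-style function: inputs first, outputs last.
void Scale(const float* input, std::size_t n, float factor, float* output) {
  ASSERT_MSG(input != nullptr && output != nullptr,
             "`input` and `output` must be non-null.");

  for (std::size_t i = 0; i < n; ++i) {
    output[i] = input[i] * factor;
  }
}
```

Note that the assertion message uses backticks for identifiers, per the Markdown rule above.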
Follow PEP 8 as the primary style guide. For anything PEP 8 does not cover in detail, refer to the GDScript style guide; while it targets a different language, its non-syntax conventions still apply.
- Comments should be complete English sentences, starting with a capital letter and ending with punctuation. Use Markdown syntax when referencing code within comments.
- Error messages and framework conventions: when a framework has an established convention (e.g. `pytest.skip` messages are typically lowercase without a trailing period), follow that convention. Otherwise, use the same rules as comments.
- Function signatures: if a function has no docstring or comment, do not add a blank line between the function signature and the function body.
- Blank lines around control flow: add a blank line before and after `if`, `for`, and similar statements.
- Return statements: add a blank line before a `return` statement, unless it directly follows a control flow statement like `if` or `for`.
- Docstrings: follow PEP 257 conventions.
Commit messages must follow Conventional Commits.
- Small PRs should be squashed. Large PRs may keep multiple commits, but each commit must be meaningful and well-formed.
- PR titles follow the same Conventional Commits format as commit messages.
- Before merging (or after each stage of changes), build and test on all supported platforms. Include the results in PRs.
Branch names use the format `<type>/xxx-yyyy-zzzz`, where `<type>` matches the PR title's Conventional Commits type and words are joined with hyphens.
Using NVIDIA as an example:

```shell
pip install .[dev] -C cmake.define.WITH_CPU=ON -C cmake.define.WITH_NVIDIA=ON
```

Auto-detection is supported for some platforms, so you can also simply run:

```shell
pip install .[dev]
```

`[dev]` installs optional development dependencies (e.g. `pytest`) that are not needed for production but are required for development and testing. After the first install, subsequent installs only need `pip install .`.

Platform maintainers can add auto-detection in `CMakeLists.txt` under the `if(AUTO_DETECT_DEVICES)` section.
Run the tests with `pytest`.

- Base class in `src/base/`: the class must inherit from `Operator<Op>` (e.g. `class Gemm : public Operator<Gemm>`). See `src/base/gemm.h`.
- Platform implementation in `src/<platform>/`: the class must inherit from the base (e.g. `class Blas : public Gemm`). See `src/cuda/gemm/blas.h` and `src/nvidia/gemm/cublas.h`.
- Tests in `tests/`:
  - Use `pytest.mark.parametrize` for parameterization. Dependent parameters go in one decorator (e.g. `@pytest.mark.parametrize("dtype, rtol, atol", ...)`); independent parameters use separate decorators, ordered by parameter declaration. `dtype` and `device` parameterization is included by default; override with an explicit `pytest.mark.parametrize` if needed.
  - Use `pytest.mark.auto_act_and_assert` for automatic execution and comparison: just return a `Payload`. This requires that `func` and `ref` share the same calling convention and that all checked values are return values.
  - See `tests/test_add.py` and `tests/test_gemm.py`.
Since `DataType` is an enum used to represent data types generically, we often need to map between `DataType` and native C++ types (e.g. `float`, `int32_t`).

- `TypeMap`: maps `DataType` to native types. Use the alias `TypeMapType` to get the type directly, e.g. `TypeMapType<dev, DataType::kFloat32>` is `float`. Note that the first template argument is a `Device::Type`, since data types like float16 and bfloat16 are not the same across platforms; a `Device::Type` is therefore required to specify which native type a `DataType` maps to.
- `DataTypeMap`: maps native types back to `DataType`. Use the alias `DataTypeMapValue`, e.g. `DataTypeMapValue<float>` is `DataType::kFloat32`.
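The trait pair described above can be pictured roughly as follows. This is a simplified, hypothetical sketch with stand-in enums, not the project's actual headers:

```cpp
#include <cstdint>
#include <type_traits>

// Stand-in enums (hypothetical, simplified).
enum class DataType { kFloat32, kFloat64, kInt32 };
struct Device {
  enum class Type { kCpu, kNvidia };
};

// `TypeMap`: (`Device::Type`, `DataType`) -> native type. The primary
// template is left undefined so an unsupported pair fails to compile.
// In practice, per-device specializations would differ for types like
// float16, which is why `Device::Type` is a template parameter.
template <Device::Type dev, DataType dtype>
struct TypeMap;

template <Device::Type dev>
struct TypeMap<dev, DataType::kFloat32> {
  using type = float;
};

template <Device::Type dev>
struct TypeMap<dev, DataType::kInt32> {
  using type = std::int32_t;
};

template <Device::Type dev, DataType dtype>
using TypeMapType = typename TypeMap<dev, dtype>::type;

// `DataTypeMap`: native type -> `DataType`.
template <typename T>
struct DataTypeMap;

template <>
struct DataTypeMap<float> {
  static constexpr DataType value = DataType::kFloat32;
};

template <typename T>
inline constexpr DataType DataTypeMapValue = DataTypeMap<T>::value;
```

With this sketch, `TypeMapType<Device::Type::kCpu, DataType::kFloat32>` is `float` and `DataTypeMapValue<float>` is `DataType::kFloat32`.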
`DispatchFunc` is the runtime dispatch interface defined in `dispatcher.h`. It supports arbitrary types, multi-dispatch, and mixed-type dispatch with any return type.

```cpp
DispatchFunc</* supported types */>(
    /* runtime value to dispatch on */,
    /* lambda with dispatched logic */,
    /* context string for error messages (recommended) */,
    /* forwarded args for the lambda (optional) */
);
```

For example, dispatching on a runtime `Device::Type`:

```cpp
DispatchFunc<Device::Type::kCpu, Device::Type::kNvidia>(
    Device::Type::kNvidia,
    [](auto tag) {
      constexpr Device::Type Dev = decltype(tag)::value;
    },
    "DeviceTest");
```

Dispatching on a `DataType` resolves the native type for you:

```cpp
DataType dtype = DataType::kFloat32;
DispatchFunc<Device::Type::kCpu, FloatTypes>(
    dtype,
    [](auto tag) {
      using T = typename decltype(tag)::type;
      // Use `T` as the resolved native type.
    },
    "DataType Dispatch");
```

Dispatching `DataType` is a little special:
- Due to the `TypeMap` behavior mentioned above, a `Device::Type` is needed as the first template argument.
- Since `DataType` is frequently used, the supported type list can use predefined shorthands from `data_type.h` (e.g. `FloatTypes = List<DataType::kFloat32, DataType::kFloat64>`). To combine shorthands, use `ConcatType` from `common/traits.h`:

```cpp
DispatchFunc<ConcatType<List<DataType::kFloat16>, FloatTypes>>(...);
```

For types other than `DataType` and `Device::Type`, pass the type as the first template argument:
```cpp
DispatchFunc<QuantMode, QuantMode::kNone, QuantMode::kWeightOnly>(
    QuantMode::kWeightOnly,
    [](auto tag) {
      constexpr QuantMode M = decltype(tag)::value;
    },
    "QuantDispatch");
```

This also works for native types like `int` (e.g. block sizes):
```cpp
DispatchFunc<int, 128, 256, 512, 1024>(
    runtime_block_size,
    [](auto tag) {
      constexpr int BlockSize = decltype(tag)::value;
    },
    "BlockSizeDispatch");
```

For multi-dispatch, use `List` boundaries to separate the supported set for each dispatched value, and pass the runtime values in an initializer list:
```cpp
DispatchFunc<List<Device::Type::kCpu, Device::Type::kNvidia>,
             List<Device::Type::kAscend, Device::Type::kMetax>>(
    {Device::Type::kNvidia, Device::Type::kMetax},
    [](auto tag1, auto tag2) {
      constexpr Device::Type D1 = decltype(tag1)::value;
      constexpr Device::Type D2 = decltype(tag2)::value;
    },
    "MultiDeviceTest");
```

Similarly, `DataType` requires a `Device::Type` at the front:
```cpp
DispatchFunc<Device::Type::kCpu, FloatTypes,
             List<DataType::kInt32, DataType::kInt64>>(
    {DataType::kFloat64, DataType::kInt32},
    [](auto tag1, auto tag2) {
      using T1 = typename decltype(tag1)::type;
      using T2 = typename decltype(tag2)::type;
    },
    "MultiDataTypeTest");
```

When dispatching values of different types simultaneously (e.g. `DataType` + `Device::Type`, or `DataType` + block size), cast the values to `int64_t` and use `ListGet<N>` from `common/traits.h`:
```cpp
DispatchFunc<FloatTypes, List<Device::Type::kCpu, Device::Type::kNvidia>>(
    {static_cast<int64_t>(DataType::kFloat32),
     static_cast<int64_t>(Device::Type::kNvidia)},
    [](auto list_tag) {
      constexpr DataType DT = static_cast<DataType>(ListGet<0>(list_tag));
      constexpr Device::Type Dev =
          static_cast<Device::Type>(ListGet<1>(list_tag));
      using T = TypeMapType<Device::Type::kCpu, DT>;
    },
    "MixedDispatch");
```

Note that in mixed multi-type dispatch, `DataType` is not treated specially; we therefore neither should nor can place a `Device::Type` at the front of the `DataType` list. Inside the lambda, we obtain the value as a `DataType` and then convert it to the native type if needed.
If `DT` is not used elsewhere within the lambda, you can inline its definition directly into the `using T = ...` statement, like this:

```cpp
using T = TypeMapType<Device::Type::kCpu, ListGet<0>(list_tag)>;
```

- `no such option: -C` during install: upgrade pip with `python -m pip install --upgrade pip`.
- Segmentation fault during tests: run `pytest -n 1`.
- `Unknown CMake command "pybind11_add_module"`: install pybind11 with `pip install pybind11[global]`. See the pybind11 installation guide.
- Auto-detection (`AUTO_DETECT_DEVICES`) fails: some machines may not expose devices at the expected paths (e.g. `/dev/nvidia*`). Use explicit CMake defines instead (e.g. `-C cmake.define.WITH_NVIDIA=ON`).
- `bash: pytest: command not found`: use `python -m pytest`.
- `CUBLAS_STATUS_INVALID_VALUE` in `cublasSgemmStridedBatched`: a PyTorch version issue; downgrade to `torch<=2.9.1`.