Draft
Conversation
…ld integration Add Ascend platform scaffolding: - `device_.h`: `DeviceEnabled<kAscend>` specialization - `data_type_.h`: `toAclDtype()`, `isIntegerDtype()` - `common.h`: `buildAclTensor()` with optional transpose - `workspace_pool_.h`: stream-keyed workspace allocator - `runtime_.h`: `Runtime<kAscend>` (Malloc, Free, Memcpy, Memset) - 5 new operator base classes (AddRmsNorm, FlashAttention, Matmul, ReshapeAndCache, RotaryEmbedding) Integrate into CMake build system, Python binding generation (stream + optional tensor support), and examples runtime API.
…emove missing include - Wrap `aclrtMemcpy` (5-arg) and `aclrtMemset` (4-arg) in lambdas to match the generic 4-arg / 3-arg calling convention used by examples. - Assert `aclrtMalloc` return value in `WorkspacePool::ensure()`. - Remove `ascend/gemm/kernel.h` include from `runtime_api.h` (file does not exist until the kernels commit).
- Add Ascend GEMM specialization using `aclnnAddmm`/`aclnnBaddbmm`. - Add `get_npu_stream()` helper and NPU device detection in test utils. - Add `skip_unsupported_dtype` fixture for Ascend in conftest. - Update `runtime_api.h` with Ascend backend entry.
The `aclrtMalloc` call was the sole expression inside `assert()`, so it was compiled away in release builds (NDEBUG). This left the workspace buffer null, causing `aclnnAddmm` to return ACLNN_ERR_PARAM_NULLPTR (161001) for any operation that requires workspace (e.g. alpha != 1.0).
`CudaCausalSoftmax` was missing `#include "cuda/runtime_utils.h"`, causing `RuntimeUtils` to be undefined. Drop `std::forward` from `Operator::make` nested lambda — NVCC instantiates the body during SFINAE invocability checks even inside `if constexpr` false branches, causing template resolution failures. All operator constructors take parameters by value, so lvalue pass has identical semantics.
Upgrade base image from `nvcr.io/nvidia/pytorch:24.10-py3` (CUDA 12.6) to `25.12-py3` (CUDA 13.1), aligning CI with the local dev environment. Restore `std::forward<Args>(args)...` in `Operator::make`, as the NVCC bug that required dropping it is fixed in the newer toolkit.
`Tensor::Size` (`unsigned long`) to `int64_t` narrowing is an error on MetaX's clang-based compiler (`-Wc++11-narrowing`).
- Add blank lines between struct/class members per style guide - Capitalize comments and use backtick syntax for code refs in `matmul.h` - Move `import re` to module level in `generate_wrappers.py` - Add blank lines before `for`/`return` per PEP 8 in `generate_wrappers.py` - Replace `-k npu` with `--devices ascend` in CI config
- Fix `ruff format` violations in `generate_wrappers.py` and `test_gemm.py`. - Fix `ruff isort` violation: move `import re` into stdlib group. - Add backticks around identifiers in comments (`numel()`, `operator()`, `make()`, `torch_npu`, `uint16`/`uint32`/`uint64`). - Add missing blank line after `if` block in `skip_unsupported_dtype`. - Remove `.worktrees/` from project `.gitignore` (belongs in global gitignore).
Add, RmsNorm, Swiglu, Matmul, CausalSoftmax, AddRmsNorm, ReshapeAndCache, RotaryEmbedding, FlashAttention.
Pass stream to all CANN ops in existing tests; add FlashAttention, ReshapeAndCache, RotaryEmbedding, and E2E LLaMA layer tests.
This reverts commit 26c2bdc.
…/Linear/Mul operators Descriptor caching (`AclTensorCache` + `aclSetRawTensorAddr`), executor caching (`aclSetAclOpExecutorRepeatable`), D2H sync elimination, `add_rms_norm` decomposition, and `WorkspacePool` thread-local fast path. Host dispatch dropped from ~255 us/call to 17-57 us/call for all cacheable operators. New operators: Cast (`aclnnCast`), Cat (`aclnnCat` with TensorList executor caching), Linear (`aclnnAddmm`/`aclnnBaddbmm`/ `aclnnMatmul`), Mul (`aclnnMul`). Full regression: 2040 passed, 0 failed.
Use `unique_ptr<WorkspaceArena>` in the arena map so that thread-local cached pointers remain valid across `unordered_map` rehashes. Remove unused `detail::reshapeView` helper from FlashAttention.
…tion Normalize negative `dim` in the base class constructor (e.g. -1 → last dimension). Add comment in the Ascend kernel explaining why `aclSetRawTensorAddr` on TensorList-contained descriptors is sufficient without `aclSetInputTensorAddr`. Add negative-dim test case.
Introduce a Python DSL for declarative operator registration and automated CUDA-like backend wrapper generation. Key components: - `dsl/decorators.py`: `@manual_op` and `@infini_op` decorators - `dsl/compiler/codegen.py`: generates `Operator<Op, kBackend>` wrapper files from shared `Cuda*<Runtime<...>>` templates - `dsl/ops/*.py`: all 14 existing operators registered as `@manual_op` - `dsl/__main__.py`: CLI with `--verify` mode to diff against existing hand-written wrappers Verify mode confirms 14/14 existing wrapper files match generated output byte-for-byte. Also identifies 2 missing wrappers (moore/causal_softmax, moore/rms_norm) that could be auto-generated. `generate_wrappers.py` is preserved — the DSL compiler handles wrapper generation only; binding generation remains in the existing script.
…then-transform Add reusable kernel templates parameterized on `Device::Type` and user-provided functors, enabling cross-platform code sharing across CUDA-like backends and CPU.
…, and C++ codegen Implements the full DSL compiler pipeline: - `dag.py`: compute DAG representation with node kinds and topological sort - `parser.py`: AST parser that translates `@infini_op` function bodies into DAGs - `patterns.py`: pattern matcher mapping DAGs to C++ template bricks - `infini_codegen.py`: C++ code generator emitting CUDA and CPU kernel files - `primitives.py`: DSL type annotations (`Tensor`, `Scalar`) and primitive functions - Example `@infini_op` definitions for `AddDsl` and `RmsNormDsl` - 10 unit tests covering parser, pattern matching, and codegen
…and dispatcher fallback - Add `implementation_index` support using the Gemm (cuBLAS/cuBLASLt) pattern: DSL-generated kernels register as `Operator<Op, kDev, Impl::kDsl>` alongside hand-written `Operator<Op, kDev>` implementations. - Introduce `src/impl.h` with global `Impl::kDefault`/`Impl::kDsl` constants and operator-specific `GemmImpl::kCublas`/`GemmImpl::kCublasLt`. - Add per-operator `registry.h` files declaring `ActiveImplementationsImpl` with named constants for Add, RmsNorm, Mul, Swiglu, and Gemm. - Add dispatcher fallback in `DispatchImplementation`: when the requested `implementation_index` is not in the active list, fall back to the first available implementation instead of aborting. - Add per-operator Python string `implementation` parameter (e.g., `implementation="dsl"`, `implementation="cublaslt"`) via `impl_names.json` generated by the DSL compiler and consumed by `generate_wrappers.py`. - Migrate Mul and Swiglu to `@infini_op` with `impl_index=1`. - Standardize Swiglu base class: rename `gate_*` fields to `other_*` for consistency with `BinaryElementwiseBrick` interface. - All 4272 tests pass (0 failures). Pre-existing CUDA crashes for operators without NVIDIA implementations (Cast, Cat, Linear, Matmul, AddRmsNorm) are unrelated.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add `UNARY_ELEMENTWISE` brick support to the DSL compiler's CUDA and CPU code generators, enabling single-input operators like Cast to be compiled from `@infini_op` Python definitions into C++ template code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rnels Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix CUDA unary kernel to use explicit template args on functor call
(`Op{}.template operator()<TIn, TOut>()`) for correct return type
deduction.
- Fix CPU unary codegen to use `Caster` instead of `static_cast` for
fp16/bf16 types that lack implicit conversions.
- Create `dsl/ops/cast_dsl.py` registering Cast at `impl_index=1`.
- Generate CUDA/CPU/nvidia kernel files and registries for Cast.
- Add `tests/test_cast_dsl.py` with 40 test cases (fp32<->fp16,
bf16<->fp32, contiguous and non-contiguous tensors).
- Add `tests/benchmark_dsl.py` with DSL vs hand-written performance
comparison (all within 0.95x-1.01x, well within 80-120% target).
- Cat: custom CUDA concat kernel with multi-input indexing and device metadata management. Supports arbitrary dimension and variable inputs. - Linear: cuBLAS GEMM delegation + optional bias-add kernel. Reuses existing BLAS infrastructure. - Matmul: cuBLASLt primary (impl_index=0) + cuBLAS fallback (impl_index=1). Fixed alpha=1, beta=0. Heuristic algorithm selection for optimal perf. - Fix CPU Linear to work with new GEMM-style base class members. - Add bindings override mechanism for operators with complex signatures (std::vector<Tensor>). Tests: Cat 30, Linear 72, Matmul 80 passed on CUDA.
Implement the fused Add + RmsNorm operator (residual = x1 + x2, y = rms_norm(residual) * weight) for NVIDIA GPUs, following vLLM's design with CUB block reduction for variance computation.
Implement the ReshapeAndCache operator that writes key/value tensors into a paged KV cache using slot mapping, following the vLLM design pattern. Includes base class, CUDA kernel, NVIDIA wrapper, and tests.
# Conflicts: # src/base/add_rms_norm.h # tests/test_add_rms_norm.py
# Conflicts: # src/base/reshape_and_cache.h # tests/test_reshape_and_cache.py
…DA kernels - AddRmsNorm: fused add + rms_norm kernel using CUB block reduction. One block per row, two-pass (add+accumulate, then normalize+scale). - ReshapeAndCache: KV cache write kernel with slot_mapping. Each block handles one token, writing key/value into paged cache layout. - RotaryEmbedding: rotary position embeddings supporting both NeoX (split-half) and GPT-J (interleaved) styles. In-place on query/key. Tests: AddRmsNorm 36, ReshapeAndCache 15, RotaryEmbedding 12 passed on CUDA.
- Remove unused BLOCK_SIZE template param from CatKernel and BiasAddKernel. - Fix BiasAddKernel<T, 0> instantiation (was semantically wrong). - Delete unnecessary nvidia/rotary_embedding/registry.h (single-impl operators use the default List<0>). - Fix duplicate try/except in test_linear.py reference function.
- Add FlashInfer as git submodule at third_party/flashinfer/. - Create CudaFlashAttention wrapping FlashInfer's SinglePrefillWithKVCacheDispatched for single-sequence attention. - Support causal and non-causal masks, head sizes 64/128/256. - Runtime head_dim dispatch to compile-time template parameters. - Add FlashInfer + CUTLASS include paths to CMakeLists.txt. - Tests: 6 CUDA fp16 tests pass (causal/non-causal, MHA/GQA). bf16 has a launch failure on this GPU — FlashInfer compatibility issue, not an InfiniOps bug.
Without explicit CMAKE_CUDA_ARCHITECTURES, CMake may default to a lower architecture (e.g., SM75) even on newer GPUs. This caused FlashInfer's bf16 prefill kernel to fail at runtime on A100 (SM80), since bf16 tensor core operations require SM80+. Now auto-detects the GPU's compute capability via nvidia-smi during CMake configure and sets CMAKE_CUDA_ARCHITECTURES accordingly. Root cause verified: CMAKE_CUDA_ARCHITECTURES was 75, FlashInfer's prefill.cuh explicitly asserts "do not support bf16 on sm75".
- benchmark_all.py: 85 test cases covering all 14 CUDA operators (Add, Mul, Cast, Swiglu, RmsNorm, CausalSoftmax, AddRmsNorm, Cat, Gemm, Matmul, Linear, RotaryEmbedding, ReshapeAndCache, FlashAttention) - Baseline report on A100-SXM4-80GB: Gemm/Matmul at 235-249 TFLOPS (75-80% peak), FlashAttention at 286 TFLOPS (92% peak) - Identified optimization priorities: Gemm cuBLAS→cuBLASLt, Linear BLAS upgrade, CausalSoftmax fused kernel
- Replace Linear's cuBLAS (BlasGemmStridedBatchedEx) with cuBLASLt heuristic algorithm selection. Measured 0.210ms → 0.187ms on (1024,4096,4096) fp16 on A100. - Keep Gemm default as cuBLAS (index 0) for test stability. cuBLASLt available at implementation="cublaslt" (2.9x faster on 1024³, but TF32 precision differs from cuBLAS reference). - Add cuBLASLt recommendation comment in Gemm registry.h.
…g generation `_get_all_ops` now accepts an optional `output_dir` parameter and searches both `src/` and the output directory for `Operator<>` specializations. This supports the migration of auto-generated wrapper files from `src/<platform>/` to `generated/<platform>/`.
Move all auto-generated wrapper, DSL, and registry files from `src/` to `generated/`. `src/<platform>/` now only contains platform adapter files (device_.h, runtime_.h, etc.) and hand-written multi-impl operators (Gemm, Matmul). - Add `cuda` backend entries to manual_op definitions for operators that have CUDA kernels (Cat, Linear, AddRmsNorm, FlashAttention, ReshapeAndCache, RotaryEmbedding). - Fix registry generation to omit `Impl::kDefault` when no hand-written implementation exists for a device (prevents segfault on dispatch). - Add `generated/` to CMake include paths for both infiniops and ops targets. - Remove registry.h includes from hand-written CPU files. - Update bindings generator to scan `generated/` for operator specializations. New platform onboarding: provide 4 adapter files + CMake flag → build. New operator onboarding: base class + CUDA kernel + DSL registration → build. All wrappers auto-generated. Tests: all 14 operators pass on CUDA (1734 passed, 1 pre-existing Gemm bf16 precision failure).
Extend codegen to support BLAS-style wrappers: when a @manual_op has `"blas": True` in its cuda backend entry, the compiler generates `Operator<Op, kDev> : public BlasOp<Blas<kDev>>` wrappers for all CUDA-like platforms, instead of the standard `Runtime<kDev>` wrapper. - Delete hand-written `src/nvidia/gemm/cublas.h` (now auto-generated). - Remove explicit nvidia/metax/iluvatar/moore entries from Gemm's DSL definition — codegen derives them from the shared cuda entry. - Fix `generate_blas_wrapper` include guard naming and registry include. - Update `examples/runtime_api.h` to use generated path.
Auto-select between prefill and decode based on query sequence length: - seq_len > 1 → SinglePrefillWithKVCacheDispatched (existing) - seq_len == 1 → SingleDecodeWithKVCacheDispatched (new) Decode path uses FlashInfer's optimized decode kernel with NHD layout. Verified: max diff < 0.0001 vs PyTorch SDPA reference on fp16/bf16, MHA and GQA (32/8 heads), KV lengths up to 256.
Extend CudaFlashAttention to handle batch prefill (packed sequences with cu_seqlens) and paged decode (block_table-based KV cache) by looping over sequences and calling FlashInfer's single-sequence kernels. This is functionally correct; a future optimization can switch to FlashInfer's native batch kernels with scheduler workspace.
… kernels Batch prefill now uses `BatchPrefillWithRaggedKVCacheDispatched` with the `PrefillPlan` scheduler (split-KV disabled), and paged decode uses `BatchDecodeWithPagedKVCacheDispatched` with the `DecodePlan` scheduler. This eliminates serial kernel launches and host-device synchronization per sequence, enabling the GPU to process all sequences in a single kernel launch.
…e in FlashAttention Eliminate 11 cudaMalloc/cudaFree calls per FlashAttention invocation (batch prefill: 4 device + 1 pinned; paged decode: 6 device + 1 pinned) by using pre-allocated memory: - Override `workspace_size_in_bytes()` to request 264 MB device workspace (128 MB int + 128 MB float + 8 MB scratch for metadata arrays). - Allocate a fallback `default_workspace_` in the constructor, following the Cambricon pattern, so callers that do not set handle workspace still work correctly. - Allocate pinned host staging buffer once in the constructor instead of per-call cudaMallocHost/cudaFreeHost. - Partition the device workspace via pointer arithmetic with overflow assertions in both LaunchBatchPrefill and LaunchPagedDecode.
…ensors Add BinaryElementwiseVecKernel with 128-bit coalesced load/store and grid-stride loop. When all three tensors are contiguous, the brick dispatches the vectorized path instead of the scalar per-element kernel. Measured on A100 with Add (4096,4096) fp16: - Before (scalar): 570 GB/s (29% HBM bandwidth) - After (vectorized): 1646 GB/s (82% HBM bandwidth) - PyTorch reference: 1650 GB/s The improvement applies to DSL-generated operators (Add, Mul, Swiglu at impl_index=1). Hand-written CudaAdd still uses its own kernel and does not benefit — a follow-up should either vectorize it or switch the default to the DSL implementation.
Replace hand-written per-element kernels with BinaryElementwiseBrick, which automatically dispatches vectorized 128-bit load/store for contiguous tensors. Measured on A100 (4096² fp16): - Add: 0.164ms → 0.077ms (2.1x faster, 1315 GB/s) - Swiglu: ~0.164ms → 0.062ms (~2.6x faster, 1612 GB/s)
…iguous tensors Add UnaryElementwiseVecKernel with grid-stride loop for contiguous path. Improves GPU occupancy and memory access coalescing. Cast fp32→fp16 (4096² on A100): 0.161ms → 0.092ms (1.75x, 1094 GB/s).
Record profiling-driven optimization results on A100: - Round 1: Vectorized binary elementwise brick (Add DSL: 612→1646 GB/s) - Round 2: Refactor CudaAdd/CudaSwiglu to use brick (Add: 2.1x, Swiglu: 2.6x) - Round 3: Grid-stride loop for unary elementwise (Cast: 1.75x) - Round 4: RmsNorm analysis (3.3x slower than PyTorch, deferred) - Round 5: Full post-optimization benchmark Key results: Mul/Swiglu match PyTorch, FlashAttention 12% faster, Matmul 2x faster (cuBLASLt). Remaining gaps in Add (20%), Cast (22%), RmsNorm (3.3x).
Add 128-bit vectorized input load and output store to UnaryElementwiseVecKernel for contiguous tensors. Cast fp32→fp16 (4096² on A100): 0.092ms → 0.078ms (+17%, 1285 GB/s). Still 22% gap vs PyTorch (1645 GB/s) — likely needs output-type-based vectorization strategy.
Cache x values in shared memory during the reduce phase, reuse them in the transform phase. Eliminates the second global memory read. RmsNorm (32,32,4096) fp16 on A100 (CUDA event timing): - Before: ~35 us - After: 27.1 us - PyTorch: 22.2 us (from 2.27x gap to 1.22x gap) RmsNorm (128,1,8192): InfiniOps 10.0 us vs PyTorch 11.3 us (1.14x faster).
Cache residual (x1+x2) in shared memory during reduce phase, reuse in transform phase to avoid re-reading x_out from global memory. AddRmsNorm (32,32,4096) fp16 on A100: 42.6 us → 41.3 us (3% improvement). Limited gain because this operator has 4 global memory accesses (read x1, x2; write x_out, y_out) and shared memory only eliminates 1 re-read.
Vectorized uint4 global load + smem cache was slower (30.2 us) than plain smem cache (27.1 us) due to reinterpret_cast overhead and potential bank conflicts. Revert to the simpler shared memory caching approach.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.