
Feat/cross platform dsl #50

Draft
zhangyue207 wants to merge 66 commits into master from feat/cross-platform-dsl

Conversation

@zhangyue207
Collaborator

No description provided.

zhangyue and others added 30 commits April 11, 2026 00:36
…ld integration

Add Ascend platform scaffolding:
- `device_.h`: `DeviceEnabled<kAscend>` specialization
- `data_type_.h`: `toAclDtype()`, `isIntegerDtype()`
- `common.h`: `buildAclTensor()` with optional transpose
- `workspace_pool_.h`: stream-keyed workspace allocator
- `runtime_.h`: `Runtime<kAscend>` (Malloc, Free, Memcpy, Memset)
- 5 new operator base classes (AddRmsNorm, FlashAttention, Matmul,
  ReshapeAndCache, RotaryEmbedding)

Integrate into CMake build system, Python binding generation (stream +
optional tensor support), and examples runtime API.
…emove missing include

- Wrap `aclrtMemcpy` (5-arg) and `aclrtMemset` (4-arg) in lambdas to
  match the generic 4-arg / 3-arg calling convention used by examples.
- Assert `aclrtMalloc` return value in `WorkspacePool::ensure()`.
- Remove `ascend/gemm/kernel.h` include from `runtime_api.h` (file
  does not exist until the kernels commit).
- Add Ascend GEMM specialization using `aclnnAddmm`/`aclnnBaddbmm`.
- Add `get_npu_stream()` helper and NPU device detection in test utils.
- Add `skip_unsupported_dtype` fixture for Ascend in conftest.
- Update `runtime_api.h` with Ascend backend entry.
The `aclrtMalloc` call was the sole expression inside `assert()`, so it
was compiled away in release builds (NDEBUG). This left the workspace
buffer null, causing `aclnnAddmm` to return ACLNN_ERR_PARAM_NULLPTR
(161001) for any operation that requires workspace (e.g. alpha != 1.0).
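The NDEBUG pitfall described above can be sketched in a few lines (`fakeMalloc` and `ensureFixed` are illustrative names, not the PR's API): any side-effecting call placed as the sole expression inside `assert()` is removed entirely in release builds, so the fix is to run the call unconditionally and assert only on its return code.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstddef>

// Stand-in for aclrtMalloc: a side-effecting call that must not live
// inside assert(), because NDEBUG strips the entire expression.
int fakeMalloc(void** ptr, std::size_t size) {
    *ptr = std::malloc(size);
    return *ptr ? 0 : 1;
}

void* ensureFixed(std::size_t size) {
    void* buf = nullptr;
    const int rc = fakeMalloc(&buf, size);  // allocation always executes
    assert(rc == 0);                        // only the check is debug-only
    (void)rc;                               // silence unused warning under NDEBUG
    return buf;
}
```

In the buggy form, `assert(fakeMalloc(&buf, size) == 0);` compiles to nothing under `-DNDEBUG`, leaving `buf` null exactly as the commit message describes.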
`CudaCausalSoftmax` was missing `#include "cuda/runtime_utils.h"`,
causing `RuntimeUtils` to be undefined. Drop `std::forward` from
`Operator::make` nested lambda — NVCC instantiates the body during
SFINAE invocability checks even inside `if constexpr` false branches,
causing template resolution failures. All operator constructors take
parameters by value, so lvalue pass has identical semantics.
Upgrade base image from `nvcr.io/nvidia/pytorch:24.10-py3` (CUDA 12.6)
to `25.12-py3` (CUDA 13.1), aligning CI with the local dev environment.
Restore `std::forward<Args>(args)...` in `Operator::make`, as the NVCC
bug that required dropping it is fixed in the newer toolkit.
`Tensor::Size` (`unsigned long`) to `int64_t` narrowing is an error on
MetaX's clang-based compiler (`-Wc++11-narrowing`).
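A minimal reproduction of the narrowing error and its fix (the `toDim` name is this sketch's, not the PR's): brace-initialization from a non-constant `unsigned long` into `int64_t` is a hard error under clang's `-Wc++11-narrowing`, while an explicit cast is accepted by every compiler.

```cpp
#include <cstdint>

// Brace-init narrows unsigned long -> int64_t, which clang rejects
// under -Wc++11-narrowing; the explicit static_cast is portable.
int64_t toDim(unsigned long size) {
    // int64_t dim{size};  // error: non-constant narrowing conversion
    int64_t dim{static_cast<int64_t>(size)};
    return dim;
}
```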
- Add blank lines between struct/class members per style guide
- Capitalize comments and use backtick syntax for code refs in `matmul.h`
- Move `import re` to module level in `generate_wrappers.py`
- Add blank lines before `for`/`return` per PEP 8 in `generate_wrappers.py`
- Replace `-k npu` with `--devices ascend` in CI config
- Fix `ruff format` violations in `generate_wrappers.py` and `test_gemm.py`.
- Fix `ruff isort` violation: move `import re` into stdlib group.
- Add backticks around identifiers in comments (`numel()`, `operator()`,
  `make()`, `torch_npu`, `uint16`/`uint32`/`uint64`).
- Add missing blank line after `if` block in `skip_unsupported_dtype`.
- Remove `.worktrees/` from project `.gitignore` (belongs in global gitignore).
Add, RmsNorm, Swiglu, Matmul, CausalSoftmax, AddRmsNorm,
ReshapeAndCache, RotaryEmbedding, FlashAttention.
Pass stream to all CANN ops in existing tests; add FlashAttention,
ReshapeAndCache, RotaryEmbedding, and E2E LLaMA layer tests.
…/Linear/Mul operators

Descriptor caching (`AclTensorCache` + `aclSetRawTensorAddr`), executor caching
(`aclSetAclOpExecutorRepeatable`), D2H sync elimination, `add_rms_norm` decomposition,
and `WorkspacePool` thread-local fast path. Host dispatch dropped from ~255 us/call to
17-57 us/call for all cacheable operators. New operators: Cast (`aclnnCast`), Cat
(`aclnnCat` with TensorList executor caching), Linear (`aclnnAddmm`/`aclnnBaddbmm`/
`aclnnMatmul`), Mul (`aclnnMul`). Full regression: 2040 passed, 0 failed.
Use `unique_ptr<WorkspaceArena>` in the arena map so that thread-local
cached pointers remain valid across `unordered_map` rehashes.  Remove
unused `detail::reshapeView` helper from FlashAttention.
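The pointer-stability property relied on above can be sketched as follows (names are illustrative): because the map stores `unique_ptr`, each arena lives at a fixed heap address that is independent of the table's internal storage, so a cached raw pointer (e.g. a `thread_local` fast path) remains valid however the map reorganizes as it grows.

```cpp
#include <cassert>
#include <memory>
#include <unordered_map>

struct Arena { int id; };

Arena* demoStability() {
    static std::unordered_map<int, std::unique_ptr<Arena>> arenas;
    arenas[0] = std::make_unique<Arena>(Arena{0});
    Arena* cached = arenas[0].get();     // e.g. a thread_local cached pointer

    for (int i = 1; i < 1024; ++i)       // grow the table through rehashes
        arenas[i] = std::make_unique<Arena>(Arena{i});

    assert(cached == arenas[0].get());   // the pointee address never moved
    return cached;
}
```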
…tion

Normalize negative `dim` in the base class constructor (e.g. -1 → last
dimension).  Add comment in the Ascend kernel explaining why
`aclSetRawTensorAddr` on TensorList-contained descriptors is sufficient
without `aclSetInputTensorAddr`.  Add negative-dim test case.
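The negative-dim normalization can be expressed as a one-liner (`normalizeDim` is a hypothetical helper name; the PR does this in the base class constructor):

```cpp
#include <cassert>
#include <cstdint>

// Normalize a possibly-negative, Python-style dim to [0, rank).
int64_t normalizeDim(int64_t dim, int64_t rank) {
    if (dim < 0) dim += rank;  // e.g. -1 -> rank - 1 (last dimension)
    assert(dim >= 0 && dim < rank);
    return dim;
}
```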
Introduce a Python DSL for declarative operator registration and
automated CUDA-like backend wrapper generation.

Key components:
- `dsl/decorators.py`: `@manual_op` and `@infini_op` decorators
- `dsl/compiler/codegen.py`: generates `Operator<Op, kBackend>` wrapper
  files from shared `Cuda*<Runtime<...>>` templates
- `dsl/ops/*.py`: all 14 existing operators registered as `@manual_op`
- `dsl/__main__.py`: CLI with `--verify` mode to diff against existing
  hand-written wrappers

Verify mode confirms 14/14 existing wrapper files match generated output
byte-for-byte. Also identifies 2 missing wrappers (moore/causal_softmax,
moore/rms_norm) that could be auto-generated.

`generate_wrappers.py` is preserved — the DSL compiler handles wrapper
generation only; binding generation remains in the existing script.
…then-transform

Add reusable kernel templates parameterized on `Device::Type` and user-provided
functors, enabling cross-platform code sharing across CUDA-like backends and CPU.
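A minimal sketch of such a template (names are illustrative, not the PR's API): the kernel body is parameterized on a device tag and a user-provided functor, so CPU and CUDA-like backends share the same logic and differ only in how the loop is driven.

```cpp
enum class DeviceType { kCpu, kCuda };

// User-provided functor: the per-element operation.
struct AddFunctor {
    float operator()(float a, float b) const { return a + b; }
};

// Shared kernel template. The CPU path is a plain loop; a GPU
// specialization would launch a kernel with the same Op inlined.
template <DeviceType kDev, typename Op>
void binaryElementwise(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; ++i) out[i] = Op{}(a[i], b[i]);
}
```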
…, and C++ codegen

Implements the full DSL compiler pipeline:
- `dag.py`: compute DAG representation with node kinds and topological sort
- `parser.py`: AST parser that translates `@infini_op` function bodies into DAGs
- `patterns.py`: pattern matcher mapping DAGs to C++ template bricks
- `infini_codegen.py`: C++ code generator emitting CUDA and CPU kernel files
- `primitives.py`: DSL type annotations (`Tensor`, `Scalar`) and primitive functions
- Example `@infini_op` definitions for `AddDsl` and `RmsNormDsl`
- 10 unit tests covering parser, pattern matching, and codegen
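The topological sort step in the pipeline can be illustrated with Kahn's algorithm (a sketch; the PR's `dag.py` is Python and its exact representation is not shown here): nodes with no remaining inputs are emitted first, giving a valid evaluation order for pattern matching and codegen.

```cpp
#include <cassert>
#include <queue>
#include <utility>
#include <vector>

// Kahn's algorithm over a DAG given as (from, to) edges.
std::vector<int> topoSort(int n, const std::vector<std::pair<int, int>>& edges) {
    std::vector<std::vector<int>> adj(n);
    std::vector<int> indeg(n, 0);
    for (auto [u, v] : edges) { adj[u].push_back(v); ++indeg[v]; }

    std::queue<int> ready;  // nodes whose inputs are all emitted
    for (int i = 0; i < n; ++i)
        if (indeg[i] == 0) ready.push(i);

    std::vector<int> order;
    while (!ready.empty()) {
        int u = ready.front(); ready.pop();
        order.push_back(u);
        for (int v : adj[u])
            if (--indeg[v] == 0) ready.push(v);
    }
    assert(static_cast<int>(order.size()) == n);  // fails on a cycle
    return order;
}
```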
…and dispatcher fallback

- Add `implementation_index` support using the Gemm (cuBLAS/cuBLASLt)
  pattern: DSL-generated kernels register as `Operator<Op, kDev, Impl::kDsl>`
  alongside hand-written `Operator<Op, kDev>` implementations.

- Introduce `src/impl.h` with global `Impl::kDefault`/`Impl::kDsl` constants
  and operator-specific `GemmImpl::kCublas`/`GemmImpl::kCublasLt`.

- Add per-operator `registry.h` files declaring `ActiveImplementationsImpl`
  with named constants for Add, RmsNorm, Mul, Swiglu, and Gemm.

- Add dispatcher fallback in `DispatchImplementation`: when the requested
  `implementation_index` is not in the active list, fall back to the first
  available implementation instead of aborting.

- Add per-operator Python string `implementation` parameter (e.g.,
  `implementation="dsl"`, `implementation="cublaslt"`) via `impl_names.json`
  generated by the DSL compiler and consumed by `generate_wrappers.py`.

- Migrate Mul and Swiglu to `@infini_op` with `impl_index=1`.

- Standardize Swiglu base class: rename `gate_*` fields to `other_*` for
  consistency with `BinaryElementwiseBrick` interface.

- All 4272 tests pass (0 failures). Pre-existing CUDA crashes for operators
  without NVIDIA implementations (Cast, Cat, Linear, Matmul, AddRmsNorm)
  are unrelated.
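The fallback policy described above reduces to a small lookup (a sketch under assumed names; the PR's dispatcher is `DispatchImplementation`): if the requested implementation index is not in the active list, return the first registered one rather than aborting.

```cpp
#include <vector>

// Resolve a requested implementation index against the active list,
// falling back to the first available implementation when absent.
int resolveImpl(int requested, const std::vector<int>& active) {
    for (int impl : active)
        if (impl == requested) return impl;
    return active.front();  // fall back instead of aborting
}
```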
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add `UNARY_ELEMENTWISE` brick support to the DSL compiler's CUDA and
CPU code generators, enabling single-input operators like Cast to be
compiled from `@infini_op` Python definitions into C++ template code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rnels

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix CUDA unary kernel to use explicit template args on functor call
  (`Op{}.template operator()<TIn, TOut>()`) for correct return type
  deduction.
- Fix CPU unary codegen to use `Caster` instead of `static_cast` for
  fp16/bf16 types that lack implicit conversions.
- Create `dsl/ops/cast_dsl.py` registering Cast at `impl_index=1`.
- Generate CUDA/CPU/nvidia kernel files and registries for Cast.
- Add `tests/test_cast_dsl.py` with 40 test cases (fp32<->fp16,
  bf16<->fp32, contiguous and non-contiguous tensors).
- Add `tests/benchmark_dsl.py` with DSL vs hand-written performance
  comparison (all within 0.95x-1.01x, well within 80-120% target).
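The `Caster`-vs-`static_cast` distinction can be sketched like this (the `Bf16` type and the truncation below are this sketch's stand-ins for the real fp16/bf16 storage types): the generic path uses `static_cast`, and a specialization supplies the bit-level conversion that types without implicit conversions need.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative 16-bit storage type with no implicit conversion from float.
struct Bf16 { std::uint16_t bits; };

// Default: ordinary types convert via static_cast.
template <typename TOut, typename TIn>
struct Caster {
    TOut operator()(TIn x) const { return static_cast<TOut>(x); }
};

// Specialization performs what static_cast cannot: bf16 is the top
// 16 bits of the fp32 pattern (round-to-zero truncation here).
template <>
struct Caster<Bf16, float> {
    Bf16 operator()(float x) const {
        std::uint32_t u;
        std::memcpy(&u, &x, sizeof u);
        return Bf16{static_cast<std::uint16_t>(u >> 16)};
    }
};
```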
- Cat: custom CUDA concat kernel with multi-input indexing and device
  metadata management. Supports arbitrary dimension and variable inputs.
- Linear: cuBLAS GEMM delegation + optional bias-add kernel. Reuses
  existing BLAS infrastructure.
- Matmul: cuBLASLt primary (impl_index=0) + cuBLAS fallback (impl_index=1).
  Fixed alpha=1, beta=0. Heuristic algorithm selection for optimal perf.
- Fix CPU Linear to work with new GEMM-style base class members.
- Add bindings override mechanism for operators with complex signatures
  (std::vector<Tensor>).

Tests: Cat 30, Linear 72, Matmul 80 passed on CUDA.
Implement the fused Add + RmsNorm operator (residual = x1 + x2,
y = rms_norm(residual) * weight) for NVIDIA GPUs, following vLLM's
design with CUB block reduction for variance computation.
Implement the ReshapeAndCache operator that writes key/value tensors
into a paged KV cache using slot mapping, following the vLLM design
pattern. Includes base class, CUDA kernel, NVIDIA wrapper, and tests.
# Conflicts:
#	src/base/add_rms_norm.h
#	tests/test_add_rms_norm.py
# Conflicts:
#	src/base/reshape_and_cache.h
#	tests/test_reshape_and_cache.py
…DA kernels

- AddRmsNorm: fused add + rms_norm kernel using CUB block reduction.
  One block per row, two-pass (add+accumulate, then normalize+scale).
- ReshapeAndCache: KV cache write kernel with slot_mapping. Each block
  handles one token, writing key/value into paged cache layout.
- RotaryEmbedding: rotary position embeddings supporting both NeoX
  (split-half) and GPT-J (interleaved) styles. In-place on query/key.

Tests: AddRmsNorm 36, ReshapeAndCache 15, RotaryEmbedding 12 passed on CUDA.
- Remove unused BLOCK_SIZE template param from CatKernel and BiasAddKernel.
- Fix BiasAddKernel<T, 0> instantiation (was semantically wrong).
- Delete unnecessary nvidia/rotary_embedding/registry.h (single-impl
  operators use the default List<0>).
- Fix duplicate try/except in test_linear.py reference function.
- Add FlashInfer as git submodule at third_party/flashinfer/.
- Create CudaFlashAttention wrapping FlashInfer's
  SinglePrefillWithKVCacheDispatched for single-sequence attention.
- Support causal and non-causal masks, head sizes 64/128/256.
- Runtime head_dim dispatch to compile-time template parameters.
- Add FlashInfer + CUTLASS include paths to CMakeLists.txt.
- Tests: 6 CUDA fp16 tests pass (causal/non-causal, MHA/GQA).
  bf16 has a launch failure on this GPU — FlashInfer compatibility
  issue, not an InfiniOps bug.
Without explicit CMAKE_CUDA_ARCHITECTURES, CMake may default to a lower
architecture (e.g., SM75) even on newer GPUs.  This caused FlashInfer's
bf16 prefill kernel to fail at runtime on A100 (SM80), since bf16 tensor
core operations require SM80+.

Now auto-detects the GPU's compute capability via nvidia-smi during CMake
configure and sets CMAKE_CUDA_ARCHITECTURES accordingly.

Root cause verified: CMAKE_CUDA_ARCHITECTURES was 75, FlashInfer's
prefill.cuh explicitly asserts "do not support bf16 on sm75".
- benchmark_all.py: 85 test cases covering all 14 CUDA operators
  (Add, Mul, Cast, Swiglu, RmsNorm, CausalSoftmax, AddRmsNorm, Cat,
  Gemm, Matmul, Linear, RotaryEmbedding, ReshapeAndCache, FlashAttention)
- Baseline report on A100-SXM4-80GB: Gemm/Matmul at 235-249 TFLOPS
  (75-80% peak), FlashAttention at 286 TFLOPS (92% peak)
- Identified optimization priorities: Gemm cuBLAS→cuBLASLt, Linear
  BLAS upgrade, CausalSoftmax fused kernel
- Replace Linear's cuBLAS (BlasGemmStridedBatchedEx) with cuBLASLt
  heuristic algorithm selection. Measured 0.210ms → 0.187ms on
  (1024,4096,4096) fp16 on A100.
- Keep Gemm default as cuBLAS (index 0) for test stability.
  cuBLASLt available at implementation="cublaslt" (2.9x faster on
  1024³, but TF32 precision differs from cuBLAS reference).
- Add cuBLASLt recommendation comment in Gemm registry.h.
…g generation

`_get_all_ops` now accepts an optional `output_dir` parameter and searches
both `src/` and the output directory for `Operator<>` specializations. This
supports the migration of auto-generated wrapper files from `src/<platform>/`
to `generated/<platform>/`.
Move all auto-generated wrapper, DSL, and registry files from `src/` to
`generated/`.  `src/<platform>/` now only contains platform adapter files
(device_.h, runtime_.h, etc.) and hand-written multi-impl operators
(Gemm, Matmul).

- Add `cuda` backend entries to manual_op definitions for operators
  that have CUDA kernels (Cat, Linear, AddRmsNorm, FlashAttention,
  ReshapeAndCache, RotaryEmbedding).
- Fix registry generation to omit `Impl::kDefault` when no hand-written
  implementation exists for a device (prevents segfault on dispatch).
- Add `generated/` to CMake include paths for both infiniops and ops
  targets.
- Remove registry.h includes from hand-written CPU files.
- Update bindings generator to scan `generated/` for operator
  specializations.

New platform onboarding: provide 4 adapter files + CMake flag → build.
New operator onboarding: base class + CUDA kernel + DSL registration → build.
All wrappers auto-generated.

Tests: all 14 operators pass on CUDA (1734 passed, 1 pre-existing Gemm
bf16 precision failure).
Extend codegen to support BLAS-style wrappers: when a @manual_op has
`"blas": True` in its cuda backend entry, the compiler generates
`Operator<Op, kDev> : public BlasOp<Blas<kDev>>` wrappers for all
CUDA-like platforms, instead of the standard `Runtime<kDev>` wrapper.

- Delete hand-written `src/nvidia/gemm/cublas.h` (now auto-generated).
- Remove explicit nvidia/metax/iluvatar/moore entries from Gemm's
  DSL definition — codegen derives them from the shared cuda entry.
- Fix `generate_blas_wrapper` include guard naming and registry include.
- Update `examples/runtime_api.h` to use generated path.
Auto-select between prefill and decode based on query sequence length:
- seq_len > 1 → SinglePrefillWithKVCacheDispatched (existing)
- seq_len == 1 → SingleDecodeWithKVCacheDispatched (new)

Decode path uses FlashInfer's optimized decode kernel with NHD layout.
Verified: max diff < 0.0001 vs PyTorch SDPA reference on fp16/bf16,
MHA and GQA (32/8 heads), KV lengths up to 256.
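The selection rule is a single branch (sketched with illustrative names):

```cpp
enum class Path { kPrefill, kDecode };

// seq_len > 1 -> prefill kernel; seq_len == 1 -> decode kernel.
Path selectAttentionPath(long seqLen) {
    return seqLen > 1 ? Path::kPrefill : Path::kDecode;
}
```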
Extend CudaFlashAttention to handle batch prefill (packed sequences
with cu_seqlens) and paged decode (block_table-based KV cache) by
looping over sequences and calling FlashInfer's single-sequence
kernels. This is functionally correct; a future optimization can
switch to FlashInfer's native batch kernels with scheduler workspace.
… kernels

Batch prefill now uses `BatchPrefillWithRaggedKVCacheDispatched` with the
`PrefillPlan` scheduler (split-KV disabled), and paged decode uses
`BatchDecodeWithPagedKVCacheDispatched` with the `DecodePlan` scheduler.
This eliminates serial kernel launches and host-device synchronization
per sequence, enabling the GPU to process all sequences in a single
kernel launch.
…e in FlashAttention

Eliminate 11 cudaMalloc/cudaFree calls per FlashAttention invocation
(batch prefill: 4 device + 1 pinned; paged decode: 6 device + 1 pinned)
by using pre-allocated memory:

- Override `workspace_size_in_bytes()` to request 264 MB device workspace
  (128 MB int + 128 MB float + 8 MB scratch for metadata arrays).
- Allocate a fallback `default_workspace_` in the constructor, following
  the Cambricon pattern, so callers that do not set handle workspace
  still work correctly.
- Allocate pinned host staging buffer once in the constructor instead of
  per-call cudaMallocHost/cudaFreeHost.
- Partition the device workspace via pointer arithmetic with overflow
  assertions in both LaunchBatchPrefill and LaunchPagedDecode.
…ensors

Add BinaryElementwiseVecKernel with 128-bit coalesced load/store and
grid-stride loop.  When all three tensors are contiguous, the brick
dispatches the vectorized path instead of the scalar per-element kernel.

Measured on A100 with Add (4096,4096) fp16:
- Before (scalar):     570 GB/s (29% HBM bandwidth)
- After (vectorized): 1646 GB/s (82% HBM bandwidth)
- PyTorch reference:  1650 GB/s

The improvement applies to DSL-generated operators (Add, Mul, Swiglu at
impl_index=1).  Hand-written CudaAdd still uses its own kernel and does
not benefit — a follow-up should either vectorize it or switch the
default to the DSL implementation.
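The dispatch condition for the vectorized path can be sketched host-side (`canVectorize` is a hypothetical helper; the real check lives in the brick): take the 128-bit (4 x fp32) route only when every tensor is contiguous, the element count divides evenly, and all pointers are 16-byte aligned, otherwise fall back to the scalar per-element kernel.

```cpp
#include <cstddef>
#include <cstdint>

bool canVectorize(const void* a, const void* b, const void* out,
                  std::size_t numel, bool allContiguous) {
    auto aligned16 = [](const void* p) {
        return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
    };
    return allContiguous && numel % 4 == 0 &&
           aligned16(a) && aligned16(b) && aligned16(out);
}
```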
Replace hand-written per-element kernels with BinaryElementwiseBrick,
which automatically dispatches vectorized 128-bit load/store for
contiguous tensors.

Measured on A100 (4096² fp16):
- Add:    0.164ms → 0.077ms (2.1x faster, 1315 GB/s)
- Swiglu: ~0.164ms → 0.062ms (~2.6x faster, 1612 GB/s)
…iguous tensors

Add UnaryElementwiseVecKernel with grid-stride loop for contiguous path.
Improves GPU occupancy and memory access coalescing.

Cast fp32→fp16 (4096² on A100): 0.161ms → 0.092ms (1.75x, 1094 GB/s).
Record profiling-driven optimization results on A100:
- Round 1: Vectorized binary elementwise brick (Add DSL: 612→1646 GB/s)
- Round 2: Refactor CudaAdd/CudaSwiglu to use brick (Add: 2.1x, Swiglu: 2.6x)
- Round 3: Grid-stride loop for unary elementwise (Cast: 1.75x)
- Round 4: RmsNorm analysis (3.3x slower than PyTorch, deferred)
- Round 5: Full post-optimization benchmark

Key results: Mul/Swiglu match PyTorch, FlashAttention 12% faster,
Matmul 2x faster (cuBLASLt). Remaining gaps in Add (20%), Cast (22%),
RmsNorm (3.3x).
Add 128-bit vectorized input load and output store to
UnaryElementwiseVecKernel for contiguous tensors.

Cast fp32→fp16 (4096² on A100): 0.092ms → 0.078ms (+17%, 1285 GB/s).
Still 22% gap vs PyTorch (1645 GB/s) — likely needs output-type-based
vectorization strategy.
Cache x values in shared memory during the reduce phase, reuse them
in the transform phase.  Eliminates the second global memory read.

RmsNorm (32,32,4096) fp16 on A100 (CUDA event timing):
- Before: ~35 us
- After:  27.1 us
- PyTorch: 22.2 us (from 2.27x gap to 1.22x gap)

RmsNorm (128,1,8192): InfiniOps 10.0 us vs PyTorch 11.3 us (1.14x faster).
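The reduce-then-transform structure with a single read of `x` can be shown with a scalar reference (on the GPU the local cache is shared memory; this CPU sketch only illustrates the data flow):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// RmsNorm reference: read x once (the "shared memory cache"), reduce,
// then reuse the cached values in the transform phase instead of
// issuing a second global read.
std::vector<float> rmsNorm(const std::vector<float>& x,
                           const std::vector<float>& w, float eps = 1e-6f) {
    float sumSq = 0.f;
    for (float v : x) sumSq += v * v;                    // reduce phase
    const float inv = 1.f / std::sqrt(sumSq / x.size() + eps);

    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)           // transform phase
        y[i] = x[i] * inv * w[i];
    return y;
}
```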
Cache residual (x1+x2) in shared memory during reduce phase, reuse in
transform phase to avoid re-reading x_out from global memory.

AddRmsNorm (32,32,4096) fp16 on A100: 42.6 us → 41.3 us (3% improvement).
Limited gain because this operator has 4 global memory accesses (read x1,
x2; write x_out, y_out) and shared memory only eliminates 1 re-read.
Vectorized uint4 global load + smem cache was slower (30.2 us) than
plain smem cache (27.1 us) due to reinterpret_cast overhead and
potential bank conflicts.  Revert to the simpler shared memory
caching approach.
@zhangyue207 zhangyue207 marked this pull request as draft April 12, 2026 18:46