
Feat/cross platform dsl #50

Draft
zhangyue207 wants to merge 66 commits into master from feat/cross-platform-dsl

Conversation

@zhangyue207
Collaborator

No description provided.

zhangyue and others added 30 commits April 11, 2026 00:36
…ld integration

Add Ascend platform scaffolding:
- `device_.h`: `DeviceEnabled<kAscend>` specialization
- `data_type_.h`: `toAclDtype()`, `isIntegerDtype()`
- `common.h`: `buildAclTensor()` with optional transpose
- `workspace_pool_.h`: stream-keyed workspace allocator
- `runtime_.h`: `Runtime<kAscend>` (Malloc, Free, Memcpy, Memset)
- 5 new operator base classes (AddRmsNorm, FlashAttention, Matmul,
  ReshapeAndCache, RotaryEmbedding)

Integrate into CMake build system, Python binding generation (stream +
optional tensor support), and examples runtime API.
…emove missing include

- Wrap `aclrtMemcpy` (5-arg) and `aclrtMemset` (4-arg) in lambdas to
  match the generic 4-arg / 3-arg calling convention used by examples.
- Assert `aclrtMalloc` return value in `WorkspacePool::ensure()`.
- Remove `ascend/gemm/kernel.h` include from `runtime_api.h` (file
  does not exist until the kernels commit).
- Add Ascend GEMM specialization using `aclnnAddmm`/`aclnnBaddbmm`.
- Add `get_npu_stream()` helper and NPU device detection in test utils.
- Add `skip_unsupported_dtype` fixture for Ascend in conftest.
- Update `runtime_api.h` with Ascend backend entry.
The `aclrtMalloc` call was the sole expression inside `assert()`, so it
was compiled away in release builds (NDEBUG). This left the workspace
buffer null, causing `aclnnAddmm` to return ACLNN_ERR_PARAM_NULLPTR
(161001) for any operation that requires workspace (e.g. alpha != 1.0).
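The NDEBUG pitfall described above can be sketched in a few lines (`fakeMalloc` and `ensureFixed` are illustrative names, not the PR's API): any side-effecting call placed as the sole expression inside `assert()` is removed entirely in release builds, so the fix is to run the call unconditionally and assert only on its return code.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstddef>

// Stand-in for aclrtMalloc: a side-effecting call that must not live
// inside assert(), because NDEBUG strips the entire expression.
int fakeMalloc(void** ptr, std::size_t size) {
    *ptr = std::malloc(size);
    return *ptr ? 0 : 1;
}

void* ensureFixed(std::size_t size) {
    void* buf = nullptr;
    const int rc = fakeMalloc(&buf, size);  // allocation always executes
    assert(rc == 0);                        // only the check is debug-only
    (void)rc;                               // silence unused warning under NDEBUG
    return buf;
}
```

In the buggy form, `assert(fakeMalloc(&buf, size) == 0);` compiles to nothing under `-DNDEBUG`, leaving `buf` null exactly as the commit message describes.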
`CudaCausalSoftmax` was missing `#include "cuda/runtime_utils.h"`,
causing `RuntimeUtils` to be undefined. Drop `std::forward` from
`Operator::make` nested lambda — NVCC instantiates the body during
SFINAE invocability checks even inside `if constexpr` false branches,
causing template resolution failures. All operator constructors take
parameters by value, so lvalue pass has identical semantics.
Upgrade base image from `nvcr.io/nvidia/pytorch:24.10-py3` (CUDA 12.6)
to `25.12-py3` (CUDA 13.1), aligning CI with the local dev environment.
Restore `std::forward<Args>(args)...` in `Operator::make`, as the NVCC
bug that required dropping it is fixed in the newer toolkit.
`Tensor::Size` (`unsigned long`) to `int64_t` narrowing is an error on
MetaX's clang-based compiler (`-Wc++11-narrowing`).
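A minimal reproduction of the narrowing error and its fix (the `toDim` name is this sketch's, not the PR's): brace-initialization from a non-constant `unsigned long` into `int64_t` is a hard error under clang's `-Wc++11-narrowing`, while an explicit cast is accepted by every compiler.

```cpp
#include <cstdint>

// Brace-init narrows unsigned long -> int64_t, which clang rejects
// under -Wc++11-narrowing; the explicit static_cast is portable.
int64_t toDim(unsigned long size) {
    // int64_t dim{size};  // error: non-constant narrowing conversion
    int64_t dim{static_cast<int64_t>(size)};
    return dim;
}
```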
- Add blank lines between struct/class members per style guide
- Capitalize comments and use backtick syntax for code refs in `matmul.h`
- Move `import re` to module level in `generate_wrappers.py`
- Add blank lines before `for`/`return` per PEP 8 in `generate_wrappers.py`
- Replace `-k npu` with `--devices ascend` in CI config
- Fix `ruff format` violations in `generate_wrappers.py` and `test_gemm.py`.
- Fix `ruff isort` violation: move `import re` into stdlib group.
- Add backticks around identifiers in comments (`numel()`, `operator()`,
  `make()`, `torch_npu`, `uint16`/`uint32`/`uint64`).
- Add missing blank line after `if` block in `skip_unsupported_dtype`.
- Remove `.worktrees/` from project `.gitignore` (belongs in global gitignore).
Add, RmsNorm, Swiglu, Matmul, CausalSoftmax, AddRmsNorm,
ReshapeAndCache, RotaryEmbedding, FlashAttention.
Pass stream to all CANN ops in existing tests; add FlashAttention,
ReshapeAndCache, RotaryEmbedding, and E2E LLaMA layer tests.
…/Linear/Mul operators

Descriptor caching (`AclTensorCache` + `aclSetRawTensorAddr`), executor caching
(`aclSetAclOpExecutorRepeatable`), D2H sync elimination, `add_rms_norm` decomposition,
and `WorkspacePool` thread-local fast path. Host dispatch dropped from ~255 us/call to
17-57 us/call for all cacheable operators. New operators: Cast (`aclnnCast`), Cat
(`aclnnCat` with TensorList executor caching), Linear (`aclnnAddmm`/`aclnnBaddbmm`/
`aclnnMatmul`), Mul (`aclnnMul`). Full regression: 2040 passed, 0 failed.
Use `unique_ptr<WorkspaceArena>` in the arena map so that thread-local
cached pointers remain valid across `unordered_map` rehashes.  Remove
unused `detail::reshapeView` helper from FlashAttention.
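The pointer-stability property relied on above can be sketched as follows (names are illustrative): because the map stores `unique_ptr`, each arena lives at a fixed heap address that is independent of the table's internal storage, so a cached raw pointer (e.g. a `thread_local` fast path) remains valid however the map reorganizes as it grows.

```cpp
#include <cassert>
#include <memory>
#include <unordered_map>

struct Arena { int id; };

Arena* demoStability() {
    static std::unordered_map<int, std::unique_ptr<Arena>> arenas;
    arenas[0] = std::make_unique<Arena>(Arena{0});
    Arena* cached = arenas[0].get();     // e.g. a thread_local cached pointer

    for (int i = 1; i < 1024; ++i)       // grow the table through rehashes
        arenas[i] = std::make_unique<Arena>(Arena{i});

    assert(cached == arenas[0].get());   // the pointee address never moved
    return cached;
}
```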
…tion

Normalize negative `dim` in the base class constructor (e.g. -1 → last
dimension).  Add comment in the Ascend kernel explaining why
`aclSetRawTensorAddr` on TensorList-contained descriptors is sufficient
without `aclSetInputTensorAddr`.  Add negative-dim test case.
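The negative-dim normalization can be expressed as a one-liner (`normalizeDim` is a hypothetical helper name; the PR does this in the base class constructor):

```cpp
#include <cassert>
#include <cstdint>

// Normalize a possibly-negative, Python-style dim to [0, rank).
int64_t normalizeDim(int64_t dim, int64_t rank) {
    if (dim < 0) dim += rank;  // e.g. -1 -> rank - 1 (last dimension)
    assert(dim >= 0 && dim < rank);
    return dim;
}
```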
Introduce a Python DSL for declarative operator registration and
automated CUDA-like backend wrapper generation.

Key components:
- `dsl/decorators.py`: `@manual_op` and `@infini_op` decorators
- `dsl/compiler/codegen.py`: generates `Operator<Op, kBackend>` wrapper
  files from shared `Cuda*<Runtime<...>>` templates
- `dsl/ops/*.py`: all 14 existing operators registered as `@manual_op`
- `dsl/__main__.py`: CLI with `--verify` mode to diff against existing
  hand-written wrappers

Verify mode confirms 14/14 existing wrapper files match generated output
byte-for-byte. Also identifies 2 missing wrappers (moore/causal_softmax,
moore/rms_norm) that could be auto-generated.

`generate_wrappers.py` is preserved — the DSL compiler handles wrapper
generation only; binding generation remains in the existing script.
…then-transform

Add reusable kernel templates parameterized on `Device::Type` and user-provided
functors, enabling cross-platform code sharing across CUDA-like backends and CPU.
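A minimal sketch of such a template (names are illustrative, not the PR's API): the kernel body is parameterized on a device tag and a user-provided functor, so CPU and CUDA-like backends share the same logic and differ only in how the loop is driven.

```cpp
enum class DeviceType { kCpu, kCuda };

// User-provided functor: the per-element operation.
struct AddFunctor {
    float operator()(float a, float b) const { return a + b; }
};

// Shared kernel template. The CPU path is a plain loop; a GPU
// specialization would launch a kernel with the same Op inlined.
template <DeviceType kDev, typename Op>
void binaryElementwise(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; ++i) out[i] = Op{}(a[i], b[i]);
}
```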
…, and C++ codegen

Implements the full DSL compiler pipeline:
- `dag.py`: compute DAG representation with node kinds and topological sort
- `parser.py`: AST parser that translates `@infini_op` function bodies into DAGs
- `patterns.py`: pattern matcher mapping DAGs to C++ template bricks
- `infini_codegen.py`: C++ code generator emitting CUDA and CPU kernel files
- `primitives.py`: DSL type annotations (`Tensor`, `Scalar`) and primitive functions
- Example `@infini_op` definitions for `AddDsl` and `RmsNormDsl`
- 10 unit tests covering parser, pattern matching, and codegen
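The topological sort step in the pipeline can be illustrated with Kahn's algorithm (a sketch; the PR's `dag.py` is Python and its exact representation is not shown here): nodes with no remaining inputs are emitted first, giving a valid evaluation order for pattern matching and codegen.

```cpp
#include <cassert>
#include <queue>
#include <utility>
#include <vector>

// Kahn's algorithm over a DAG given as (from, to) edges.
std::vector<int> topoSort(int n, const std::vector<std::pair<int, int>>& edges) {
    std::vector<std::vector<int>> adj(n);
    std::vector<int> indeg(n, 0);
    for (auto [u, v] : edges) { adj[u].push_back(v); ++indeg[v]; }

    std::queue<int> ready;  // nodes whose inputs are all emitted
    for (int i = 0; i < n; ++i)
        if (indeg[i] == 0) ready.push(i);

    std::vector<int> order;
    while (!ready.empty()) {
        int u = ready.front(); ready.pop();
        order.push_back(u);
        for (int v : adj[u])
            if (--indeg[v] == 0) ready.push(v);
    }
    assert(static_cast<int>(order.size()) == n);  // fails on a cycle
    return order;
}
```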
…and dispatcher fallback

- Add `implementation_index` support using the Gemm (cuBLAS/cuBLASLt)
  pattern: DSL-generated kernels register as `Operator<Op, kDev, Impl::kDsl>`
  alongside hand-written `Operator<Op, kDev>` implementations.

- Introduce `src/impl.h` with global `Impl::kDefault`/`Impl::kDsl` constants
  and operator-specific `GemmImpl::kCublas`/`GemmImpl::kCublasLt`.

- Add per-operator `registry.h` files declaring `ActiveImplementationsImpl`
  with named constants for Add, RmsNorm, Mul, Swiglu, and Gemm.

- Add dispatcher fallback in `DispatchImplementation`: when the requested
  `implementation_index` is not in the active list, fall back to the first
  available implementation instead of aborting.

- Add per-operator Python string `implementation` parameter (e.g.,
  `implementation="dsl"`, `implementation="cublaslt"`) via `impl_names.json`
  generated by the DSL compiler and consumed by `generate_wrappers.py`.

- Migrate Mul and Swiglu to `@infini_op` with `impl_index=1`.

- Standardize Swiglu base class: rename `gate_*` fields to `other_*` for
  consistency with `BinaryElementwiseBrick` interface.

- All 4272 tests pass (0 failures). Pre-existing CUDA crashes for operators
  without NVIDIA implementations (Cast, Cat, Linear, Matmul, AddRmsNorm)
  are unrelated.
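The fallback policy described above reduces to a small lookup (a sketch under assumed names; the PR's dispatcher is `DispatchImplementation`): if the requested implementation index is not in the active list, return the first registered one rather than aborting.

```cpp
#include <vector>

// Resolve a requested implementation index against the active list,
// falling back to the first available implementation when absent.
int resolveImpl(int requested, const std::vector<int>& active) {
    for (int impl : active)
        if (impl == requested) return impl;
    return active.front();  // fall back instead of aborting
}
```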
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add `UNARY_ELEMENTWISE` brick support to the DSL compiler's CUDA and
CPU code generators, enabling single-input operators like Cast to be
compiled from `@infini_op` Python definitions into C++ template code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rnels

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix CUDA unary kernel to use explicit template args on functor call
  (`Op{}.template operator()<TIn, TOut>()`) for correct return type
  deduction.
- Fix CPU unary codegen to use `Caster` instead of `static_cast` for
  fp16/bf16 types that lack implicit conversions.
- Create `dsl/ops/cast_dsl.py` registering Cast at `impl_index=1`.
- Generate CUDA/CPU/nvidia kernel files and registries for Cast.
- Add `tests/test_cast_dsl.py` with 40 test cases (fp32<->fp16,
  bf16<->fp32, contiguous and non-contiguous tensors).
- Add `tests/benchmark_dsl.py` with DSL vs hand-written performance
  comparison (all within 0.95x-1.01x, well within 80-120% target).
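The `Caster`-vs-`static_cast` distinction can be sketched like this (the `Bf16` type and the truncation below are this sketch's stand-ins for the real fp16/bf16 storage types): the generic path uses `static_cast`, and a specialization supplies the bit-level conversion that types without implicit conversions need.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative 16-bit storage type with no implicit conversion from float.
struct Bf16 { std::uint16_t bits; };

// Default: ordinary types convert via static_cast.
template <typename TOut, typename TIn>
struct Caster {
    TOut operator()(TIn x) const { return static_cast<TOut>(x); }
};

// Specialization performs what static_cast cannot: bf16 is the top
// 16 bits of the fp32 pattern (round-to-zero truncation here).
template <>
struct Caster<Bf16, float> {
    Bf16 operator()(float x) const {
        std::uint32_t u;
        std::memcpy(&u, &x, sizeof u);
        return Bf16{static_cast<std::uint16_t>(u >> 16)};
    }
};
```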
- Cat: custom CUDA concat kernel with multi-input indexing and device
  metadata management. Supports arbitrary dimension and variable inputs.
- Linear: cuBLAS GEMM delegation + optional bias-add kernel. Reuses
  existing BLAS infrastructure.
- Matmul: cuBLASLt primary (impl_index=0) + cuBLAS fallback (impl_index=1).
  Fixed alpha=1, beta=0. Heuristic algorithm selection for optimal perf.
- Fix CPU Linear to work with new GEMM-style base class members.
- Add bindings override mechanism for operators with complex signatures
  (std::vector<Tensor>).

Tests: Cat 30, Linear 72, Matmul 80 passed on CUDA.
Implement the fused Add + RmsNorm operator (residual = x1 + x2,
y = rms_norm(residual) * weight) for NVIDIA GPUs, following vLLM's
design with CUB block reduction for variance computation.
Implement the ReshapeAndCache operator that writes key/value tensors
into a paged KV cache using slot mapping, following the vLLM design
pattern. Includes base class, CUDA kernel, NVIDIA wrapper, and tests.
# Conflicts:
#	src/base/add_rms_norm.h
#	tests/test_add_rms_norm.py
# Conflicts:
#	src/base/reshape_and_cache.h
#	tests/test_reshape_and_cache.py
…DA kernels

- AddRmsNorm: fused add + rms_norm kernel using CUB block reduction.
  One block per row, two-pass (add+accumulate, then normalize+scale).
- ReshapeAndCache: KV cache write kernel with slot_mapping. Each block
  handles one token, writing key/value into paged cache layout.
- RotaryEmbedding: rotary position embeddings supporting both NeoX
  (split-half) and GPT-J (interleaved) styles. In-place on query/key.

Tests: AddRmsNorm 36, ReshapeAndCache 15, RotaryEmbedding 12 passed on CUDA.
- Remove unused BLOCK_SIZE template param from CatKernel and BiasAddKernel.
- Fix BiasAddKernel<T, 0> instantiation (was semantically wrong).
- Delete unnecessary nvidia/rotary_embedding/registry.h (single-impl
  operators use the default List<0>).
- Fix duplicate try/except in test_linear.py reference function.
- Add FlashInfer as git submodule at third_party/flashinfer/.
- Create CudaFlashAttention wrapping FlashInfer's
  SinglePrefillWithKVCacheDispatched for single-sequence attention.
- Support causal and non-causal masks, head sizes 64/128/256.
- Runtime head_dim dispatch to compile-time template parameters.
- Add FlashInfer + CUTLASS include paths to CMakeLists.txt.
- Tests: 6 CUDA fp16 tests pass (causal/non-causal, MHA/GQA).
  bf16 has a launch failure on this GPU — FlashInfer compatibility
  issue, not an InfiniOps bug.
Without explicit CMAKE_CUDA_ARCHITECTURES, CMake may default to a lower
architecture (e.g., SM75) even on newer GPUs.  This caused FlashInfer's
bf16 prefill kernel to fail at runtime on A100 (SM80), since bf16 tensor
core operations require SM80+.

Now auto-detects the GPU's compute capability via nvidia-smi during CMake
configure and sets CMAKE_CUDA_ARCHITECTURES accordingly.

Root cause verified: CMAKE_CUDA_ARCHITECTURES was 75, FlashInfer's
prefill.cuh explicitly asserts "do not support bf16 on sm75".
- benchmark_all.py: 85 test cases covering all 14 CUDA operators
  (Add, Mul, Cast, Swiglu, RmsNorm, CausalSoftmax, AddRmsNorm, Cat,
  Gemm, Matmul, Linear, RotaryEmbedding, ReshapeAndCache, FlashAttention)
- Baseline report on A100-SXM4-80GB: Gemm/Matmul at 235-249 TFLOPS
  (75-80% peak), FlashAttention at 286 TFLOPS (92% peak)
- Identified optimization priorities: Gemm cuBLAS→cuBLASLt, Linear
  BLAS upgrade, CausalSoftmax fused kernel
- Replace Linear's cuBLAS (BlasGemmStridedBatchedEx) with cuBLASLt
  heuristic algorithm selection. Measured 0.210ms → 0.187ms on
  (1024,4096,4096) fp16 on A100.
- Keep Gemm default as cuBLAS (index 0) for test stability.
  cuBLASLt available at implementation="cublaslt" (2.9x faster on
  1024³, but TF32 precision differs from cuBLAS reference).
- Add cuBLASLt recommendation comment in Gemm registry.h.
…g generation

`_get_all_ops` now accepts an optional `output_dir` parameter and searches
both `src/` and the output directory for `Operator<>` specializations. This
supports the migration of auto-generated wrapper files from `src/<platform>/`
to `generated/<platform>/`.
Move all auto-generated wrapper, DSL, and registry files from `src/` to
`generated/`.  `src/<platform>/` now only contains platform adapter files
(device_.h, runtime_.h, etc.) and hand-written multi-impl operators
(Gemm, Matmul).

- Add `cuda` backend entries to manual_op definitions for operators
  that have CUDA kernels (Cat, Linear, AddRmsNorm, FlashAttention,
  ReshapeAndCache, RotaryEmbedding).
- Fix registry generation to omit `Impl::kDefault` when no hand-written
  implementation exists for a device (prevents segfault on dispatch).
- Add `generated/` to CMake include paths for both infiniops and ops
  targets.
- Remove registry.h includes from hand-written CPU files.
- Update bindings generator to scan `generated/` for operator
  specializations.

New platform onboarding: provide 4 adapter files + CMake flag → build.
New operator onboarding: base class + CUDA kernel + DSL registration → build.
All wrappers auto-generated.

Tests: all 14 operators pass on CUDA (1734 passed, 1 pre-existing Gemm
bf16 precision failure).
Extend codegen to support BLAS-style wrappers: when a @manual_op has
`"blas": True` in its cuda backend entry, the compiler generates
`Operator<Op, kDev> : public BlasOp<Blas<kDev>>` wrappers for all
CUDA-like platforms, instead of the standard `Runtime<kDev>` wrapper.

- Delete hand-written `src/nvidia/gemm/cublas.h` (now auto-generated).
- Remove explicit nvidia/metax/iluvatar/moore entries from Gemm's
  DSL definition — codegen derives them from the shared cuda entry.
- Fix `generate_blas_wrapper` include guard naming and registry include.
- Update `examples/runtime_api.h` to use generated path.
Auto-select between prefill and decode based on query sequence length:
- seq_len > 1 → SinglePrefillWithKVCacheDispatched (existing)
- seq_len == 1 → SingleDecodeWithKVCacheDispatched (new)

Decode path uses FlashInfer's optimized decode kernel with NHD layout.
Verified: max diff < 0.0001 vs PyTorch SDPA reference on fp16/bf16,
MHA and GQA (32/8 heads), KV lengths up to 256.
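The selection rule is a single branch (sketched with illustrative names):

```cpp
enum class Path { kPrefill, kDecode };

// seq_len > 1 -> prefill kernel; seq_len == 1 -> decode kernel.
Path selectAttentionPath(long seqLen) {
    return seqLen > 1 ? Path::kPrefill : Path::kDecode;
}
```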
Extend CudaFlashAttention to handle batch prefill (packed sequences
with cu_seqlens) and paged decode (block_table-based KV cache) by
looping over sequences and calling FlashInfer's single-sequence
kernels. This is functionally correct; a future optimization can
switch to FlashInfer's native batch kernels with scheduler workspace.
… kernels

Batch prefill now uses `BatchPrefillWithRaggedKVCacheDispatched` with the
`PrefillPlan` scheduler (split-KV disabled), and paged decode uses
`BatchDecodeWithPagedKVCacheDispatched` with the `DecodePlan` scheduler.
This eliminates serial kernel launches and host-device synchronization
per sequence, enabling the GPU to process all sequences in a single
kernel launch.
…e in FlashAttention

Eliminate 11 cudaMalloc/cudaFree calls per FlashAttention invocation
(batch prefill: 4 device + 1 pinned; paged decode: 6 device + 1 pinned)
by using pre-allocated memory:

- Override `workspace_size_in_bytes()` to request 264 MB device workspace
  (128 MB int + 128 MB float + 8 MB scratch for metadata arrays).
- Allocate a fallback `default_workspace_` in the constructor, following
  the Cambricon pattern, so callers that do not set handle workspace
  still work correctly.
- Allocate pinned host staging buffer once in the constructor instead of
  per-call cudaMallocHost/cudaFreeHost.
- Partition the device workspace via pointer arithmetic with overflow
  assertions in both LaunchBatchPrefill and LaunchPagedDecode.
…ensors

Add BinaryElementwiseVecKernel with 128-bit coalesced load/store and
grid-stride loop.  When all three tensors are contiguous, the brick
dispatches the vectorized path instead of the scalar per-element kernel.

Measured on A100 with Add (4096,4096) fp16:
- Before (scalar):     570 GB/s (29% HBM bandwidth)
- After (vectorized): 1646 GB/s (82% HBM bandwidth)
- PyTorch reference:  1650 GB/s

The improvement applies to DSL-generated operators (Add, Mul, Swiglu at
impl_index=1).  Hand-written CudaAdd still uses its own kernel and does
not benefit — a follow-up should either vectorize it or switch the
default to the DSL implementation.
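The dispatch condition for the vectorized path can be sketched host-side (`canVectorize` is a hypothetical helper; the real check lives in the brick): take the 128-bit (4 x fp32) route only when every tensor is contiguous, the element count divides evenly, and all pointers are 16-byte aligned, otherwise fall back to the scalar per-element kernel.

```cpp
#include <cstddef>
#include <cstdint>

bool canVectorize(const void* a, const void* b, const void* out,
                  std::size_t numel, bool allContiguous) {
    auto aligned16 = [](const void* p) {
        return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
    };
    return allContiguous && numel % 4 == 0 &&
           aligned16(a) && aligned16(b) && aligned16(out);
}
```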
Replace hand-written per-element kernels with BinaryElementwiseBrick,
which automatically dispatches vectorized 128-bit load/store for
contiguous tensors.

Measured on A100 (4096² fp16):
- Add:    0.164ms → 0.077ms (2.1x faster, 1315 GB/s)
- Swiglu: ~0.164ms → 0.062ms (~2.6x faster, 1612 GB/s)
…iguous tensors

Add UnaryElementwiseVecKernel with grid-stride loop for contiguous path.
Improves GPU occupancy and memory access coalescing.

Cast fp32→fp16 (4096² on A100): 0.161ms → 0.092ms (1.75x, 1094 GB/s).
Record profiling-driven optimization results on A100:
- Round 1: Vectorized binary elementwise brick (Add DSL: 612→1646 GB/s)
- Round 2: Refactor CudaAdd/CudaSwiglu to use brick (Add: 2.1x, Swiglu: 2.6x)
- Round 3: Grid-stride loop for unary elementwise (Cast: 1.75x)
- Round 4: RmsNorm analysis (3.3x slower than PyTorch, deferred)
- Round 5: Full post-optimization benchmark

Key results: Mul/Swiglu match PyTorch, FlashAttention 12% faster,
Matmul 2x faster (cuBLASLt). Remaining gaps in Add (20%), Cast (22%),
RmsNorm (3.3x).
Add 128-bit vectorized input load and output store to
UnaryElementwiseVecKernel for contiguous tensors.

Cast fp32→fp16 (4096² on A100): 0.092ms → 0.078ms (+17%, 1285 GB/s).
Still 22% gap vs PyTorch (1645 GB/s) — likely needs output-type-based
vectorization strategy.
Cache x values in shared memory during the reduce phase, reuse them
in the transform phase.  Eliminates the second global memory read.

RmsNorm (32,32,4096) fp16 on A100 (CUDA event timing):
- Before: ~35 us
- After:  27.1 us
- PyTorch: 22.2 us (from 2.27x gap to 1.22x gap)

RmsNorm (128,1,8192): InfiniOps 10.0 us vs PyTorch 11.3 us (1.14x faster).
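The reduce-then-transform structure with a single read of `x` can be shown with a scalar reference (on the GPU the local cache is shared memory; this CPU sketch only illustrates the data flow):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// RmsNorm reference: read x once (the "shared memory cache"), reduce,
// then reuse the cached values in the transform phase instead of
// issuing a second global read.
std::vector<float> rmsNorm(const std::vector<float>& x,
                           const std::vector<float>& w, float eps = 1e-6f) {
    float sumSq = 0.f;
    for (float v : x) sumSq += v * v;                    // reduce phase
    const float inv = 1.f / std::sqrt(sumSq / x.size() + eps);

    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)           // transform phase
        y[i] = x[i] * inv * w[i];
    return y;
}
```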
Cache residual (x1+x2) in shared memory during reduce phase, reuse in
transform phase to avoid re-reading x_out from global memory.

AddRmsNorm (32,32,4096) fp16 on A100: 42.6 us → 41.3 us (3% improvement).
Limited gain because this operator has 4 global memory accesses (read x1,
x2; write x_out, y_out) and shared memory only eliminates 1 re-read.
Vectorized uint4 global load + smem cache was slower (30.2 us) than
plain smem cache (27.1 us) due to reinterpret_cast overhead and
potential bank conflicts.  Revert to the simpler shared memory
caching approach.
@zhangyue207 zhangyue207 marked this pull request as draft April 12, 2026 18:46