
feat(ascend): add 9 Ascend operator kernels #47

Open
zhangyue207 wants to merge 25 commits into master from feat/ascend-operators

Conversation

@zhangyue207
Collaborator

Add, RmsNorm, Swiglu, Matmul, CausalSoftmax, AddRmsNorm, ReshapeAndCache, RotaryEmbedding, FlashAttention.

@zhangyue207 zhangyue207 force-pushed the feat/ascend-framework branch 2 times, most recently from bf9e4b1 to 7398f9f Compare April 13, 2026 13:41
Base automatically changed from feat/ascend-framework to master April 14, 2026 03:55
zhangyue added 11 commits April 15, 2026 13:34
- Add AclTensorCache for descriptor reuse across operator calls
- Rename ToAclDtype/IsIntegerDtype to toAclDtype/isIntegerDtype (camelCase)
- Extend WorkspacePool with multi-slot support and capture-mode assertion
- Optimize Gemm kernel with executor/scalar caching
- Add CacheKey hash support for operator instance caching
- Fix generate_wrappers.py argument ordering and format
- Rename skip_unsupported_dtypes fixture, add get_npu_stream utility
Add base classes: Cast, Cat, Linear, Matmul (replaces MatMul), Mul,
PagedAttention, SiluAndMul.

Rename AddRmsNorm params to match CANN convention (x1/x2/gamma/y_out/x_out).
Remove verbose doc comments from FlashAttention, ReshapeAndCache,
RotaryEmbedding base classes (implementation details belong in kernels).
Add ACLNN-based implementations for: Add, Cast, Cat, CausalSoftmax,
FlashAttention, Linear, Matmul, Mul, RmsNorm, RotaryEmbedding,
ReshapeAndCache (+ v2), Swiglu, SiluAndMul.

All kernels use AclTensorCache for descriptor reuse and
WorkspacePool for device memory management. Executor instances
are cached with aclSetAclOpExecutorRepeatable for repeat dispatch.
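The descriptor-reuse idea can be sketched in a few lines. This is an illustrative Python model, not the actual CANN API: the real `AclTensorCache` wraps `aclTensor` descriptors in C++, and `create` here stands in for an `aclCreateTensor` call. The key deliberately contains only tensor metadata, never data pointers.

```python
class AclTensorCache:
    """Illustrative descriptor cache: reuse a descriptor whenever a
    tensor's metadata (shape/dtype/strides/device) matches a prior call."""

    def __init__(self):
        self._cache = {}
        self.misses = 0

    @staticmethod
    def key(shape, dtype, strides, device):
        # Data pointers are deliberately excluded from the key, so one
        # descriptor serves every tensor with identical metadata.
        return (tuple(shape), dtype, tuple(strides), device)

    def get(self, shape, dtype, strides, device, create):
        k = self.key(shape, dtype, strides, device)
        if k not in self._cache:
            self.misses += 1
            self._cache[k] = create()  # stands in for aclCreateTensor(...)
        return self._cache[k]

cache = AclTensorCache()
d1 = cache.get((4, 8), "float16", (8, 1), 0, create=lambda: object())
d2 = cache.get((4, 8), "float16", (8, 1), 0, create=lambda: object())
```

Because the key ignores data pointers, `d1` and `d2` are the same cached descriptor even though the two calls could carry different device buffers; this same property is what later causes the stale-data bug described further down.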
Add alternative implementations with registries:
- AddRmsNorm: decomposed (0), fused aclnnAddRmsNorm (1), custom AscendC (2)
- RmsNorm: ACLNN (0), custom AscendC (1)
- RotaryEmbedding: ACLNN (0), ATB Rope (1)
- ReshapeAndCache: ACLNN (0), ScatterPaKvCache (1), ATB (2)
- Swiglu: decomposed (0), fused aclnnSwiGlu (1)
- SiluAndMul: fused aclnnSwiGlu (0), registry (1)
- PagedAttention: ATB (0)
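The registry pattern above maps an integer `implementation_index` to a concrete kernel. A minimal Python sketch, using the decomposed AddRmsNorm path (index 0) as the registered implementation; the reference math (explicit add, then RMSNorm) and the `eps` default are illustrative, while the real index-0 path dispatches to device kernels:

```python
import math

REGISTRY = {}  # (op name, implementation_index) -> callable

def register(name, index):
    def deco(fn):
        REGISTRY[(name, index)] = fn
        return fn
    return deco

def rms_norm(x, gamma, eps=1e-6):
    # Reference RMSNorm on one row: x / sqrt(mean(x^2) + eps) * gamma
    ms = sum(v * v for v in x) / len(x)
    inv = 1.0 / math.sqrt(ms + eps)
    return [v * inv * g for v, g in zip(x, gamma)]

@register("add_rms_norm", 0)
def add_rms_norm_decomposed(x1, x2, gamma):
    # Decomposed path: explicit add, then RMSNorm. Returns (y, x_out),
    # matching the CANN x1/x2/gamma/y_out/x_out convention.
    x = [a + b for a, b in zip(x1, x2)]
    return rms_norm(x, gamma), x

y, x_out = REGISTRY[("add_rms_norm", 0)]([1.0, 2.0], [3.0, 2.0], [1.0, 1.0])
```

Selecting index 1 or 2 would look up `("add_rms_norm", 1)` etc., so tests can parametrize over `implementation_index` without touching kernel code.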
Standalone AscendC kernel project with CMake build system.
Includes op_host tiling, op_kernel device code, precision tests,
and msprof benchmarks for both operators.
Add new tests: Cast, Cat, E2E Layer, FlashAttention, Linear, Matmul,
Mul, PagedAttention, ReshapeAndCache, RotaryEmbedding, SiluAndMul.
Update existing tests with NPU stream handling and Ascend-specific
parametrization.
- C1: auto-format all C++ files with clang-format (25 files)
- C4: lowercase assert messages, remove trailing periods (10 messages)
- G4: backtick-fence identifiers in comments (causal_softmax)
- P5: add blank lines before return statements (generate_wrappers.py)
- C4: lowercase assert message starts (workspace_pool_, rms_norm, rotary_embedding)
- C4: remove trailing period from workspace_pool_ assert
- C9: add blank line between SlotKey struct members
- G4: backtick-fence identifiers in comments across 12 files
- G4: backtick-fence identifiers in assert messages (flash_attention, rotary_embedding)
- P1: remove duplicate `import re` in generate_wrappers.py
- P4: add blank lines around control flow in test_flash_attention.py
- C4: lowercase "rope" in ATB assert messages
- G4: backtick-fence `VariantPack`, `rotaryCoeff`, `sparseMode`, `hostData`
- G4: backtick-fence identifiers in Python test comments
- P4: add blank line before `if` in test_rms_norm_precision.py
@zhangyue207 zhangyue207 force-pushed the feat/ascend-operators branch from 3f43d57 to be48553 Compare April 15, 2026 07:06
zhangyue added 14 commits April 15, 2026 15:08
… loading

- Delete `test_rms_norm_precision.py` (duplicate of `tests/test_rms_norm.py`)
- Delete `run_rms_norm_precision_report.py` (another copy with hardcoded path)
- Unify `test_add_rms_norm.py` to use `import ascend_kernel` instead of
  ctypes manual loading
New operators and features:
- ApplyRotaryPosEmb: pre-gathered cos/sin operator with ATB backend
- TopkToppSampling: ATB-based fused sampling operator
- SiluAndMul: standalone operator backed by aclnnSwiGlu
- ATB PagedAttention: graph-safe decode attention

Enhancements:
- WorkspacePool: multi-slot support and capture-mode assertion
- Migrate temp buffers to WorkspacePool slots (Swiglu, CausalSoftmax,
  RmsNorm, AddRmsNorm)
- RotaryEmbedding: accept 2D [T, N*D] input, fix ATB cos/sin gathering
- ReshapeAndCache: handle int64 slot_mapping in ATB kernel
- Swiglu: add fused aclnnSwiGlu implementation (index=1)
- Parametrize rms_norm and reshape_and_cache tests by implementation_index
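The multi-slot WorkspacePool with its capture-mode assertion can be modeled as below. This is a sketch under assumptions: `bytearray` stands in for `aclrtMalloc`, and the grow-only slot policy is inferred from the commit description rather than taken from the actual implementation.

```python
class WorkspacePool:
    """Illustrative multi-slot workspace pool: each slot's buffer grows
    monotonically and is reused across calls; growing is forbidden while
    a graph capture is in progress (the capture-mode assertion)."""

    def __init__(self):
        self._slots = {}      # slot id -> (size, buffer)
        self.capturing = False

    def get(self, slot, nbytes):
        size, buf = self._slots.get(slot, (0, None))
        if nbytes > size:
            # Growing needs a fresh device allocation, which is not
            # graph-capture safe, hence the assertion.
            assert not self.capturing, "workspace grow during capture"
            buf = bytearray(nbytes)  # stands in for aclrtMalloc
            self._slots[slot] = (nbytes, buf)
        return buf

pool = WorkspacePool()
a = pool.get(0, 256)
b = pool.get(0, 128)  # fits in the existing slot: same buffer reused
```

Separate slots let one operator hold several live temp buffers (e.g. Swiglu's gate and up halves) without them aliasing each other.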
… data

The operator cache keys ignore data pointers (compare only shape/dtype/
device/strides).  When RotaryEmbedding was cached from one test and
reused by another with a different cos_sin_cache tensor (same shape,
different random data), the IndexSelect gathered from the old tables,
producing garbage output.

Track the cos_sin_cache data pointer and re-upload the expanded cos/sin
tables when it changes.  In production this is a single pointer
comparison per call (no-op); the cos_sin_cache weight tensor has a
stable address.

Fixes 6 rotary_embedding_2d test failures (head_size=64, fp16, both
CANN and ATB paths) that only reproduced when test_apply_rotary_pos_emb
ran first.
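The fix amounts to tracking one pointer next to the metadata-keyed cache. A hedged sketch (the `RotaryTables` name and `upload` callback are illustrative; the real code re-expands and copies the cos/sin tables on device):

```python
class RotaryTables:
    """Illustrative fix: the operator cache key ignores data pointers, so
    track the cos_sin_cache pointer separately and re-upload the expanded
    tables whenever it changes. In steady state (stable weight address)
    this is a single pointer comparison per call."""

    def __init__(self):
        self._last_ptr = None
        self.uploads = 0

    def ensure(self, cos_sin_ptr, upload):
        if cos_sin_ptr != self._last_ptr:
            upload()  # stands in for the expand + device copy
            self.uploads += 1
            self._last_ptr = cos_sin_ptr

t = RotaryTables()
t.ensure(0x1000, upload=lambda: None)
t.ensure(0x1000, upload=lambda: None)  # same pointer: no re-upload
t.ensure(0x2000, upload=lambda: None)  # new tensor: re-upload
```

In the failing tests, two tensors shared shape/dtype/strides but held different random data, so only the pointer distinguished them.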
Replace per-operator stale-cache workaround with Operator::clear_cache()
generation counter.  pytest autouse fixture clears caches between test
modules.  Skip aclnnScatterPaKvCache (impl_index=1) on 910B hardware.

Synced from feat/ascend-operators commits c68633f, 57f96bf.
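The generation-counter mechanism can be sketched as follows. This is an illustrative model: the real `Operator::clear_cache()` lives in C++, and the `cached`/`build` shape here is assumed for demonstration.

```python
class Operator:
    """Illustrative generation counter: clear_cache() bumps a global
    generation; each cached entry remembers the generation it was built
    in and is rebuilt if the generation has moved on since."""

    generation = 0

    @classmethod
    def clear_cache(cls):
        cls.generation += 1

    def __init__(self):
        self._entry = None
        self._entry_gen = -1
        self.builds = 0

    def cached(self, build):
        if self._entry is None or self._entry_gen != Operator.generation:
            self._entry = build()
            self._entry_gen = Operator.generation
            self.builds += 1
        return self._entry

op = Operator()
op.cached(lambda: "plan")
op.cached(lambda: "plan")   # cache hit, nothing rebuilt
Operator.clear_cache()      # e.g. the pytest autouse fixture firing
op.cached(lambda: "plan")   # rebuilt in the new generation
```

Unlike the per-operator workaround, one counter bump invalidates every operator's cache lazily, on its next use, without walking or freeing anything eagerly.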
ATB Rope with rotaryCoeff=2 supports bf16 on 910B. Remove the
fp16-only skip guard — all 6 previously skipped bf16 test cases pass.
Extend PagedAttention base class and ATB kernel with optional
seq_lens_host / block_table_host params that skip aclrtMemcpy
D2H copies when caller provides CPU-pinned host tensors.

Add unit tests for host-tensor PA and FA paged decode with CPU
cu_seqlens_kv.
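The optional host-tensor fast path reduces to a None-check before the copy. A sketch under assumptions (function name and `d2h_copy` callback are illustrative; the real slow path issues an `aclrtMemcpy` D2H):

```python
def get_seq_lens(seq_lens_device, seq_lens_host=None, d2h_copy=None):
    """Illustrative host-tensor fast path: when the caller already holds
    the lengths in (pinned) host memory, use them directly; otherwise
    fall back to a device-to-host copy."""
    if seq_lens_host is not None:
        return seq_lens_host, False          # no copy performed
    return d2h_copy(seq_lens_device), True   # D2H copy on the slow path

lens, copied = get_seq_lens([3, 5], seq_lens_host=[3, 5])
lens2, copied2 = get_seq_lens([3, 5], d2h_copy=lambda d: list(d))
```

Skipping the synchronous D2H copy matters most in decode loops, where the copy would otherwise stall the stream once per step.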
`aclDestroyAclOpExecutor` internally frees `aclTensor` descriptors it holds.
Add `AclTensorCache::release()` and `destroy()` methods, guard all destructors
with `isAclRuntimeAlive()`, and remove redundant `aclDestroyTensor` calls for
executor-owned tensors. Verified: CANN reference-counts tensors, so
destroy-tensor-then-destroy-executor order is safe.
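The release-versus-destroy split can be modeled as below. This is only a sketch of the ownership protocol described above: string "descriptors" stand in for `aclTensor` handles, and the `runtime_alive` flag mirrors the `isAclRuntimeAlive()` guard.

```python
class AclTensorCache:
    """Illustrative ownership fix: release() forgets descriptors without
    freeing them (the executor that holds them frees them itself when it
    is destroyed), while destroy() frees whatever the cache still owns,
    unless the ACL runtime has already been torn down."""

    def __init__(self):
        self._owned = []
        self.freed = 0

    def add(self, desc):
        self._owned.append(desc)

    def release(self):
        # Drop ownership only: executor-owned tensors must not be freed
        # here, because aclDestroyAclOpExecutor frees them itself.
        self._owned.clear()

    def destroy(self, runtime_alive=True):
        if not runtime_alive:  # never call free after runtime teardown
            self._owned.clear()
            return
        self.freed += len(self._owned)  # stands in for aclDestroyTensor
        self._owned.clear()

c = AclTensorCache()
c.add("t0"); c.add("t1")
c.release()   # handed over to the executor: nothing freed here
c.add("t2")
c.destroy()   # only the still-owned descriptor is freed
```

The double-free risk only exists because both the cache and the executor can end up pointing at the same descriptor; splitting "forget" from "free" makes the ownership transfer explicit.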