Support for the Ascend backend: add NPU runtime environment & JIT codegen #160
Open
PPPoint-t wants to merge 13 commits into
Rewrite problematic clamp on Ascend, add cache/config fallbacks
…is size-aware tuning
Replace the old name-based broadcast and loop/dot/where handling with a narrow SDPA key-boundary tail mask rewrite, including the stable_qk mask after exp2. Keep Ascend codegen compatibility fixes localized in Ascendifier: autotune key filtering, square block config rewrites, Ascend-safe config pruning, load fallback normalization, clamp lowering, and CANN libdevice routing. Clean up helper naming and rewrite state flow so the AST passes are easier to follow without changing the generated SDPA behavior.
Drop the injected Ascend autotune prune helper and its debug-only meta selection plumbing, while keeping autotune key filtering and square block config rewrites so axis-limit handling still runs. Format all modified Ascend backend files to maintain consistent code style.
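The clamp lowering mentioned in the commit messages is not shown in the hunks under review. As a rough illustration of what an AST-level clamp rewrite can look like, here is a minimal sketch. It assumes the lowering turns `clamp(x, lo, hi)` into `minimum(maximum(x, lo), hi)`; the class name `ClampLowering` is hypothetical and this is not the PR's actual `_rewrite_clamp_call`.

```python
import ast


class ClampLowering(ast.NodeTransformer):
    """Hypothetical pass: clamp(x, lo, hi) -> minimum(maximum(x, lo), hi)."""

    def visit_Call(self, node):
        self.generic_visit(node)  # rewrite nested calls first
        func = node.func
        is_clamp = isinstance(func, ast.Attribute) and func.attr == "clamp"
        if not is_clamp or len(node.args) != 3 or node.keywords:
            return node  # leave anything that is not a plain 3-argument clamp

        x, lo, hi = node.args

        def call(name, *args):
            # Build `<module>.<name>(*args)` on the same module as the clamp.
            return ast.Call(
                func=ast.Attribute(value=func.value, attr=name, ctx=ast.Load()),
                args=list(args),
                keywords=[],
            )

        return ast.copy_location(call("minimum", call("maximum", x, lo), hi), node)


source = "y = tl.clamp(x, 0.0, 1.0)"
tree = ast.fix_missing_locations(ClampLowering().visit(ast.parse(source)))
lowered = ast.unparse(tree)
print(lowered)
```

Calls that do not match the plain three-argument form are left untouched, which keeps the rewrite narrow.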
voltjia requested changes on Jun 24, 2026
output_dir=output_dir,
num_warps=num_warps,
num_stages=num_stages,
if caller in ("cuda", "ascend"):
)

raise ValueError(
    f"Unsupported caller '{caller}'. Expected one of: 'torch', 'cuda', 'ascend'."
Collaborator
Nothing is really wrong here, but the current convention in the NineToothed code is to use ` rather than ', so it would be best to change this.
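For illustration, the message with the backtick convention applied might look like the sketch below. The wrapper `check_caller` is a hypothetical name used only to make the snippet runnable; in the PR the `raise` sits inline in the build function.

```python
def check_caller(caller):
    # Hypothetical helper; only the quoting style of the message matters here.
    if caller not in ("torch", "cuda", "ascend"):
        raise ValueError(
            f"Unsupported caller `{caller}`. Expected one of: `torch`, `cuda`, `ascend`."
        )


try:
    check_caller("rocm")
except ValueError as error:
    message = str(error)
    print(message)
```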
:param application: The application of the tensors.
:param tensors: The tensors.
:param caller: Who will call the compute kernel.
:param caller: Kernel build route selector.
Collaborator
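The original design here was somewhat flawed. Organizationally, cuda and ascend are not at the same level; nvidia and ascend are, because there are many other backends as well. Treating every backend as a caller would introduce too many callers, so this should either become python and c++, or become a different parameter that selects jit or aot. It is a legacy issue. My suggestion: as far as I can tell, nothing outside this file actually uses "ascend", so let's leave this part unchanged for now and add a TODO instead. Later we can survey how many places in the NineToothed main repository's tests and in ntops already use caller, and how hard they would be to change, and handle that in a separate PR. In other words, let's treat torch as JIT and cuda as AOT for now, haha. Not ideal, but we have little choice.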
Comment on lines +26 to +43
def _resolve_cache_dir():
    for cache_dir in (
        pathlib.Path.home() / ".ninetoothed",
        pathlib.Path("/tmp/.ninetoothed"),
    ):
        try:
            cache_dir.mkdir(exist_ok=True)
            probe = cache_dir / ".write_probe"
            probe.write_text("", encoding="utf-8")
            probe.unlink()
            return cache_dir
        except OSError:
            continue

    raise OSError("Failed to find a writable cache directory for ninetoothed.")


CACHE_DIR = _resolve_cache_dir()
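The fallback behavior of this hunk can be checked in isolation. The sketch below uses the same probe logic but, as an assumption made only for testability, takes the candidate directories as a parameter instead of hard-coding them; `resolve_cache_dir` and the directory names are hypothetical.

```python
import pathlib
import tempfile


def resolve_cache_dir(candidates):
    """Return the first candidate that can be created and written to."""
    for cache_dir in candidates:
        try:
            cache_dir.mkdir(exist_ok=True)
            probe = cache_dir / ".write_probe"
            probe.write_text("", encoding="utf-8")
            probe.unlink()
            return cache_dir
        except OSError:
            continue  # not creatable or not writable, try the next one

    raise OSError("Failed to find a writable cache directory.")


with tempfile.TemporaryDirectory() as root:
    root = pathlib.Path(root)
    blocked = root / "blocked"
    blocked.write_text("")  # a regular file, so mkdir() raises FileExistsError
    usable = root / "usable"
    chosen_name = resolve_cache_dir([blocked, usable]).name
    print(chosen_name)
```

Because `FileExistsError` and `PermissionError` are both subclasses of `OSError`, the single `except OSError` covers the unusable-path cases exercised here.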
Comment on lines +51 to +59
properties = {}

try:
    device = triton.runtime.driver.active.get_current_device()
    properties = triton.runtime.driver.active.utils.get_device_properties(
        device
    )
except Exception:
    properties = {}
Collaborator
This part should go above the line self._min_num_elements = 1, so that it lines up with the original code. Also, the properties = {} currently on line 51 should now be redundant; if you confirm that it is, just remove it.
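To see why the leading assignment is redundant, note that both branches of the `try`/`except` already bind a value. A minimal sketch, with a hypothetical `query` callable standing in for the Triton driver calls (which are not invoked here) and a made-up property key:

```python
def query_device_properties(query):
    """Return device properties, or an empty dict if the query fails."""
    try:
        return query()
    except Exception:
        return {}  # the only fallback needed; no pre-initialization required


def broken_query():
    raise RuntimeError("no active driver")


fallback = query_device_properties(broken_query)
reported = query_device_properties(lambda: {"example_property": 1024})
print(fallback, reported)
```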
        ][:max_axes]

    @classmethod
    def _rewrite_autotune_keyword(cls, keyword, max_axes):
        keyword.value.elts = cls._filter_autotune_keys(keyword.value.elts, max_axes)

    @classmethod
    def _rewrite_autotune_call(cls, node, max_axes):
        return

    @classmethod
    def _rewrite_load_call(cls, node):
        keyword.value.value = 0.0

    @classmethod
    def _rewrite_clamp_call(cls, node):
        type(self)._rewrite_load_call(node)
        return type(self)._rewrite_clamp_call(node)

    def visit_Module(self, node):
Collaborator
Please arrange these visitors in the order in which the node types appear in the documentation: https://docs.python.org/3/library/ast.html.
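In the `ast` documentation, root nodes such as `Module` are listed before expressions such as `Call`, so `visit_Module` would come before `visit_Call`. A minimal sketch of that layout follows. The class `LoadFallback` and the condition for normalizing the `other` keyword are assumptions for illustration; the hunk only shows the assignment `keyword.value.value = 0.0`.

```python
import ast


class LoadFallback(ast.NodeTransformer):
    # Visitors are ordered as in the `ast` docs: root nodes, then expressions.

    def visit_Module(self, node):
        self.generic_visit(node)
        return ast.fix_missing_locations(node)

    def visit_Call(self, node):
        self.generic_visit(node)
        if isinstance(node.func, ast.Attribute) and node.func.attr == "load":
            for keyword in node.keywords:
                # Assumed rule: replace a constant `other` with 0.0.
                if keyword.arg == "other" and isinstance(keyword.value, ast.Constant):
                    keyword.value.value = 0.0
        return node


tree = LoadFallback().visit(ast.parse("v = tl.load(p, mask=m, other=None)"))
rewritten = ast.unparse(tree)
print(rewritten)
```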
Prepare the runtime environment needed for Ascend/NPU execution and add the JIT codegen path for Ascend/NPU kernels.
NINETOOTHED_CACHE_DIR CodeGenerator _npu kernel and launch symbols in JIT.__call__
To make reviewing manageable, the entire Ascend NPU backend feature is split into 4 stacked phases. Reviewers can use the links below to view the clean diff of each phase directly within this repository:
Note: These links are for architectural preview only. I will submit the official PRs to this repository one at a time, as each phase gets merged.