Support for the Ascend backend: add npu runtime environment & JIT codegen by PPPoint-t · Pull Request #160 · InfiniTensor/ninetoothed

PPPoint-t · 2026-06-08T13:04:30Z

Prepare the runtime environment needed for Ascend/NPU execution and add the JIT codegen path for Ascend/NPU kernels.

Resolve the ninetoothed cache directory from NINETOOTHED_CACHE_DIR
Fall back to writable cache locations when the home cache is unavailable
Make default config calculation tolerate missing CUDA-style device properties
Add NPU device discovery support in test utilities
Generate both CUDA and NPU variants from CodeGenerator
Guard generated source with runtime NPU availability checks
Select _npu kernel and launch symbols in JIT.__call__
Add initial Ascendifier rewrites for CANN libdevice, load fallback, clamp lowering, dtype compatibility, and autotune key filtering
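The first two items above (an environment override plus writable fallbacks) can be sketched roughly as follows. This is only an illustration of the idea, not the code in this PR: the helper name `resolve_cache_dir` and its `environ` and `fallbacks` parameters are hypothetical, and the only name taken from the description is the `NINETOOTHED_CACHE_DIR` variable.

```python
import os
import pathlib
import tempfile


def resolve_cache_dir(environ=None, fallbacks=None):
    """Return the first writable cache directory (hypothetical helper)."""
    environ = os.environ if environ is None else environ

    candidates = []
    override = environ.get("NINETOOTHED_CACHE_DIR")
    if override:
        # An explicit override is tried before any default location.
        candidates.append(pathlib.Path(override))

    if fallbacks is None:
        # Assumed defaults: the home cache first, then the system temp directory.
        fallbacks = (
            pathlib.Path.home() / ".ninetoothed",
            pathlib.Path(tempfile.gettempdir()) / ".ninetoothed",
        )
    candidates.extend(fallbacks)

    for cache_dir in candidates:
        try:
            cache_dir.mkdir(parents=True, exist_ok=True)
            # Creating and deleting a probe file checks that the directory is writable.
            probe = cache_dir / ".write_probe"
            probe.write_text("", encoding="utf-8")
            probe.unlink()
            return cache_dir
        except OSError:
            continue

    raise OSError("Failed to find a writable cache directory.")
```

Passing `environ` and `fallbacks` explicitly keeps the sketch easy to exercise without touching the real home directory.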

To make reviewing manageable, the entire Ascend NPU backend feature is split into 4 stacked phases. Reviewers can use the links below to view the clean diff of each phase directly within this repository:

Phase 1: Basic NPU Codegen Infrastructure -> This PR (Support for the Ascend backend: add npu runtime environment & JIT codegen #160)
Phase 2: SDPA AST Rewrites -> Preview in my fork: [Support for the Ascend backend: harden ascendifier sdpa AST rewrites PPPoint-t/ninetoothed#1]
Phase 3: Ascend AOT routing -> Preview in my fork: [Support for the Ascend backend: add ascend-aot backend routing PPPoint-t/ninetoothed#2]
Phase 4: AOT size-aware meta tuning -> Preview in my fork: [Support for the Ascend backend: harden ascend-aot size-aware meta tuning PPPoint-t/ninetoothed#3]

Note: These links are just for architectural preview. I will submit official PRs to this repository sequentially as each phase gets merged.

Replace the old name-based broadcast and loop/dot/where handling with a narrow SDPA key-boundary tail mask rewrite, including the stable_qk mask after exp2. Keep Ascend codegen compatibility fixes localized in Ascendifier: autotune key filtering, square block config rewrites, Ascend-safe config pruning, load fallback normalization, clamp lowering, and CANN libdevice routing. Clean up helper naming and rewrite state flow so the AST passes are easier to follow without changing the generated SDPA behavior.

Drop the injected Ascend autotune prune helper and its debug-only meta selection plumbing, while keeping autotune key filtering and square block config rewrites so axis-limit handling still runs. Format all modified Ascend backend files to maintain consistent code style.

voltjia · 2026-06-24T01:36:46Z

-        output_dir=output_dir,
-        num_warps=num_warps,
-        num_stages=num_stages,
+    if caller in ("cuda", "ascend"):


Let's use {...} here.
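As a small illustration of this suggestion, the membership test can use a set literal instead of a tuple. The function name `is_aot_caller` is hypothetical and only wraps the condition from the diff so that it can be run on its own.

```python
def is_aot_caller(caller):
    """Hypothetical wrapper around the membership test from the diff."""
    # A set literal states "one of these values" without implying an order.
    return caller in {"cuda", "ascend"}
```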

voltjia · 2026-06-24T01:38:12Z

+        )
+
+    raise ValueError(
+        f"Unsupported caller '{caller}'. Expected one of: 'torch', 'cuda', 'ascend'."


Nothing is really wrong here, but the current convention in the NineToothed codebase is to use ` rather than ', so it would be best to change it.

voltjia · 2026-06-24T02:41:09Z

    :param application: The application of the tensors.
    :param tensors: The tensors.
-    :param caller: Who will call the compute kernel.
+    :param caller: Kernel build route selector.


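The earlier design here is somewhat flawed. Organizationally, cuda and ascend are not at the same level; nvidia and ascend are, because there are many other backends. Going that way would introduce too many callers, though, so this should either become python and c++, or be a separate parameter that selects jit or aot. It is a legacy issue. My suggestion: as far as I can tell, nothing outside this file actually uses "ascend", so let's leave this part unchanged for now and add a TODO instead. Later we can survey how widely caller is already used in the tests of the main NineToothed repository and in ntops, and how easy it would be to change, and then handle it in a separate PR. In other words, let's just treat torch as JIT and cuda as AOT for now, hhh, for lack of a better option.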
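One way to read this suggestion is sketched below, assuming the interim mapping is exactly "torch means JIT, cuda means AOT". The function name `resolve_build_route`, the returned strings, and the TODO wording are all hypothetical; the error message uses backticks to match the convention mentioned in the previous comment.

```python
def resolve_build_route(caller):
    """Map a legacy `caller` value to a build route (hypothetical helper)."""
    # TODO: Revisit `caller`. Backends such as NVIDIA and Ascend are peers, so a
    # separate JIT/AOT selector may be a better fit than more `caller` values.
    if caller == "torch":
        return "jit"

    if caller == "cuda":
        return "aot"

    raise ValueError(
        f"Unsupported caller `{caller}`. Expected one of: `torch`, `cuda`."
    )
```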

voltjia · 2026-06-24T02:50:54Z

+def _resolve_cache_dir():
+    for cache_dir in (
+        pathlib.Path.home() / ".ninetoothed",
+        pathlib.Path("/tmp/.ninetoothed"),
+    ):
+        try:
+            cache_dir.mkdir(exist_ok=True)
+            probe = cache_dir / ".write_probe"
+            probe.write_text("", encoding="utf-8")
+            probe.unlink()
+            return cache_dir
+        except OSError:
+            continue
+
+    raise OSError("Failed to find a writable cache directory for ninetoothed.")
+
+
+CACHE_DIR = _resolve_cache_dir()


Is this change necessary? If it is not, let's leave it unchanged for now.

voltjia · 2026-06-24T02:53:11Z

+        properties = {}
+
+        try:
+            device = triton.runtime.driver.active.get_current_device()
+            properties = triton.runtime.driver.active.utils.get_device_properties(
+                device
+            )
+        except Exception:
+            properties = {}


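This part should be placed above the line self._min_num_elements = 1, matching where the original code was. Also, the properties = {} currently on line 51 should now be redundant; if that is confirmed, just remove it.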
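The redundancy can be seen in a standalone sketch: when the `except` branch already produces an empty mapping, no earlier initialization is needed. The helper names and the injected callables below are hypothetical stand-ins for the driver calls in the diff, and `multiprocessor_count` is used only as an assumed example of a CUDA-style key.

```python
def query_device_properties(get_current_device, get_device_properties):
    """Return device properties, or an empty dict when the query fails."""
    try:
        device = get_current_device()
        return get_device_properties(device)
    except Exception:
        # Drivers without CUDA-style properties end up here; every path
        # returns a value, so no `properties = {}` is needed beforehand.
        return {}


def default_num_programs(properties, fallback=1):
    """Read an assumed CUDA-style key, tolerating its absence."""
    return properties.get("multiprocessor_count", fallback)
```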

voltjia · 2026-06-24T07:05:30Z

+        ][:max_axes]
+
+    @classmethod
+    def _rewrite_autotune_keyword(cls, keyword, max_axes):


Could this be a staticmethod as well?

voltjia · 2026-06-24T07:05:42Z

+            keyword.value.elts = cls._filter_autotune_keys(keyword.value.elts, max_axes)
+
+    @classmethod
+    def _rewrite_autotune_call(cls, node, max_axes):


Could this be a staticmethod as well?

voltjia · 2026-06-24T07:05:55Z

+        return
+
+    @classmethod
+    def _rewrite_load_call(cls, node):


Could this be a staticmethod as well?

voltjia · 2026-06-24T07:05:59Z

+                keyword.value.value = 0.0
+
+    @classmethod
+    def _rewrite_clamp_call(cls, node):


Could this be a staticmethod as well?
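For the helpers that call other helpers through cls, switching to staticmethod means naming the class explicitly. The sketch below shows that shape under stated assumptions: the class name `Rewriter` is hypothetical, and the bodies are simplified guesses based only on the `][:max_axes]` and `keyword.value.elts = ...` fragments visible in the diff hunks.

```python
import ast


class Rewriter(ast.NodeTransformer):
    """Hypothetical stand-in for the class under review."""

    @staticmethod
    def _filter_autotune_keys(elts, max_axes):
        # Simplified assumption: keep at most `max_axes` key entries.
        return list(elts)[:max_axes]

    @staticmethod
    def _rewrite_autotune_keyword(keyword, max_axes):
        if keyword.arg != "key" or not isinstance(keyword.value, ast.List):
            return

        # Without `cls`, the sibling helper is reached through the class name.
        keyword.value.elts = Rewriter._filter_autotune_keys(
            keyword.value.elts, max_axes
        )
```

The trade-off is that a subclass overriding `_filter_autotune_keys` would no longer be picked up, which only matters if the class is meant to be subclassed.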

voltjia · 2026-06-24T07:06:46Z

+        type(self)._rewrite_load_call(node)
+        return type(self)._rewrite_clamp_call(node)
+
+    def visit_Module(self, node):


Please order these visitors in the order in which they appear in the documentation: https://docs.python.org/3/library/ast.html.
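A minimal sketch of that ordering, assuming it follows the section order of the ast documentation (root nodes, then expressions, then function and class definitions). The class name `OrderedRewriter` is hypothetical, and the `visit_Call` body is only an assumption loosely modeled on the `keyword.value.value = 0.0` line in the diff above, not the PR's actual rewrite.

```python
import ast


class OrderedRewriter(ast.NodeTransformer):
    """Hypothetical transformer with visitors in ast-documentation order."""

    def __init__(self):
        self.function_names = []

    # Root nodes.
    def visit_Module(self, node):
        self.generic_visit(node)
        return ast.fix_missing_locations(node)

    # Expressions.
    def visit_Call(self, node):
        self.generic_visit(node)
        if isinstance(node.func, ast.Attribute) and node.func.attr == "load":
            for keyword in node.keywords:
                value = keyword.value
                is_none = isinstance(value, ast.Constant) and value.value is None
                if keyword.arg == "other" and is_none:
                    # Assumed fallback: replace a missing `other` with 0.0.
                    value.value = 0.0
        return node

    # Function and class definitions.
    def visit_FunctionDef(self, node):
        self.function_names.append(node.name)
        self.generic_visit(node)
        return node
```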

PPPoint-t added 13 commits March 18, 2026 08:34

fix(_generate_autotune): limit autotune key length for Ascend backend

ba54121

feat(ascendifier): add AST passes for CANN libdevice routing and load…

57e4b1b

… mask fallback

fix(clamp): harden codegen and runtime compatibility

c4d38e7

Rewrite problematic clamp on Ascend, add cache/config fallbacks

feat(ascendifier): add SDPA broadcast and hazard rewrite helpers

571adf9

Merge remote-tracking branch 'upstream/master' into dev-ascend

f8395df

refactor(aot): split backend routing for cuda/ascend

04feccc

refactor(ascend-aot): reorganize ascendaotbackend and harden multi-ax…

dbc8854

…is size-aware tuning

Merge remote-tracking branch 'upstream/master' into dev-ascend

da7a36b

feat(ascendifier): prepare NPU runtime environment & add NPU JIT code…

ee4b6f7

…gen path

style: fix ruff import order

c742078

style: fix ruff formatting issues

2cab0c0

voltjia requested changes Jun 24, 2026

PPPoint-t wants to merge 13 commits into InfiniTensor:master from PPPoint-t:dev-ascend-add-npu-codegen
