Support for the Ascend backend: add npu runtime environment & JIT codegen #160

Open
PPPoint-t wants to merge 13 commits into InfiniTensor:master from PPPoint-t:dev-ascend-add-npu-codegen

PPPoint-t commented Jun 8, 2026

Prepare the runtime environment needed for Ascend/NPU execution and add the JIT codegen path for Ascend/NPU kernels.

  • Resolve the ninetoothed cache directory from NINETOOTHED_CACHE_DIR
  • Fall back to writable cache locations when the home cache is unavailable
  • Make default config calculation tolerate missing CUDA-style device properties
  • Add NPU device discovery support in test utilities
  • Generate both CUDA and NPU variants from CodeGenerator
  • Guard generated source with runtime NPU availability checks
  • Select _npu kernel and launch symbols in JIT.__call__
  • Add initial Ascendifier rewrites for CANN libdevice, load fallback, clamp lowering, dtype compatibility, and autotune key filtering
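
The first two bullets describe a cache directory lookup that prefers NINETOOTHED_CACHE_DIR and falls back to other writable locations. A minimal sketch of that idea follows; the function name `resolve_cache_dir`, its parameters, and the candidate list are assumptions for illustration, not the code in this PR.

```python
import os
import pathlib
import tempfile


def resolve_cache_dir(candidates, env=os.environ):
    """Return the first writable directory, preferring NINETOOTHED_CACHE_DIR.

    Hypothetical helper: the name and signature are not taken from the PR.
    """
    override = env.get("NINETOOTHED_CACHE_DIR")
    if override:
        candidates = [pathlib.Path(override), *candidates]

    for cache_dir in candidates:
        try:
            cache_dir.mkdir(parents=True, exist_ok=True)
            # Probe with a real write: an existing directory can still be read-only.
            probe = cache_dir / ".write_probe"
            probe.write_text("", encoding="utf-8")
            probe.unlink()
            return cache_dir
        except OSError:
            continue

    raise OSError("Failed to find a writable cache directory.")


with tempfile.TemporaryDirectory() as tmp:
    base = pathlib.Path(tmp)
    blocker = base / "blocker"
    blocker.write_text("", encoding="utf-8")  # a plain file, so mkdir below it fails

    # The first candidate is unusable, so the fallback is chosen.
    chosen = resolve_cache_dir([blocker / "cache", base / "fallback"], env={})

    # The environment variable takes priority over the candidate list.
    overridden = resolve_cache_dir(
        [base / "fallback"], env={"NINETOOTHED_CACHE_DIR": str(base / "custom")}
    )
```

Passing the environment and the candidates as arguments keeps the lookup testable without touching the real home directory.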

To make reviewing manageable, the entire Ascend NPU backend feature is split into 4 stacked phases. Reviewers can use the links below to view the clean diff of each phase directly within this repository:

Note: These links are just for architectural preview. I will submit official PRs to this repository sequentially as each phase gets merged.

Rewrite problematic clamp on Ascend, add cache/config fallbacks
Replace the old name-based broadcast and loop/dot/where handling with a narrow SDPA key-boundary tail mask rewrite, including the stable_qk mask after exp2.

Keep Ascend codegen compatibility fixes localized in Ascendifier: autotune key filtering, square block config rewrites, Ascend-safe config pruning, load fallback normalization, clamp lowering, and CANN libdevice routing.

Clean up helper naming and rewrite state flow so the AST passes are easier to follow without changing the generated SDPA behavior.
Drop the injected Ascend autotune prune helper and its debug-only meta selection plumbing, while keeping autotune key filtering and square block config rewrites so axis-limit handling still runs.
Format all modified Ascend backend files to maintain consistent code style.
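
The commit messages above mention clamp lowering without showing the rewrite. One way such an AST pass could look is sketched below, assuming the lowering replaces `clamp(x, lo, hi)` with `minimum(maximum(x, lo), hi)`; that choice and the class name `ClampLowering` are assumptions, and the actual Ascendifier rewrite may differ (for example in how NaN inputs are treated).

```python
import ast


class ClampLowering(ast.NodeTransformer):
    """Rewrite `<mod>.clamp(x, lo, hi)` into `<mod>.minimum(<mod>.maximum(x, lo), hi)`.

    Hypothetical pass for illustration only.
    """

    def visit_Call(self, node):
        self.generic_visit(node)  # rewrite nested calls first

        func = node.func
        is_clamp = isinstance(func, ast.Attribute) and func.attr == "clamp"
        if not is_clamp or len(node.args) != 3 or node.keywords:
            return node  # leave anything that is not a plain 3-argument clamp

        x, lo, hi = node.args
        inner = ast.Call(
            func=ast.Attribute(value=func.value, attr="maximum", ctx=ast.Load()),
            args=[x, lo],
            keywords=[],
        )
        outer = ast.Call(
            func=ast.Attribute(value=func.value, attr="minimum", ctx=ast.Load()),
            args=[inner, hi],
            keywords=[],
        )
        return ast.copy_location(outer, node)


tree = ast.parse("y = tl.clamp(x, 0.0, 6.0)")
tree = ast.fix_missing_locations(ClampLowering().visit(tree))
lowered = ast.unparse(tree)
```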
Comment thread src/ninetoothed/make.py
output_dir=output_dir,
num_warps=num_warps,
num_stages=num_stages,
if caller in ("cuda", "ascend"):

Collaborator

Let's use {...} here.
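
Read as a request to replace the tuple with a set literal (the comment itself only says {...}), the membership test could be sketched as follows. The helper name `takes_compiled_branch` is hypothetical and only wraps the condition so it can be run on its own.

```python
def takes_compiled_branch(caller):
    # Set literal instead of a tuple for the membership test.
    return caller in {"cuda", "ascend"}


cuda_result = takes_compiled_branch("cuda")
torch_result = takes_compiled_branch("torch")
```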

Comment thread src/ninetoothed/make.py
)

raise ValueError(
f"Unsupported caller '{caller}'. Expected one of: 'torch', 'cuda', 'ascend'."

Collaborator

Nothing is really wrong here, but the current convention in the NineToothed code is to use ` rather than ', so it would be best to change this.

Comment thread src/ninetoothed/make.py
:param application: The application of the tensors.
:param tensors: The tensors.
:param caller: Who will call the compute kernel.
:param caller: Kernel build route selector.

Collaborator

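The original design here was somewhat flawed. Organizationally, cuda and ascend are not at the same level; nvidia and ascend are, because there are many other backends. Following that would introduce too many callers, though, so this should either become python and c++, or be a separate parameter that selects jit or aot. It is a legacy issue. My suggestion: as far as I can tell, nothing outside this file actually uses "ascend", so let's not change this for now and add a TODO instead. Later we can survey how many places in the NineToothed main repository's tests and in ntops already use caller, and how easy they are to change, and handle it in a separate PR. In other words, let's just treat torch as JIT and cuda as AOT for now, haha, for lack of a better option.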

Comment on lines +26 to +43
def _resolve_cache_dir():
    for cache_dir in (
        pathlib.Path.home() / ".ninetoothed",
        pathlib.Path("/tmp/.ninetoothed"),
    ):
        try:
            cache_dir.mkdir(exist_ok=True)
            probe = cache_dir / ".write_probe"
            probe.write_text("", encoding="utf-8")
            probe.unlink()
            return cache_dir
        except OSError:
            continue

    raise OSError("Failed to find a writable cache directory for ninetoothed.")


CACHE_DIR = _resolve_cache_dir()

Collaborator

Is this change necessary? If not, let's leave it out for now.

Comment on lines +51 to +59
properties = {}

try:
    device = triton.runtime.driver.active.get_current_device()
    properties = triton.runtime.driver.active.utils.get_device_properties(
        device
    )
except Exception:
    properties = {}

Collaborator

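This part should go above the line self._min_num_elements = 1, matching where the original code was. Also, the properties = {} on what is currently line 51 appears to be redundant now; if it is confirmed to be redundant, just remove it.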
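
To illustrate why the leading assignment looks redundant, here is a small sketch in which `properties` is bound on both the success path and the failure path, so no initial value is needed. The function `load_device_properties`, its callable parameters, and the `max_shared_mem` key are stand-ins chosen so the example runs without Triton; they are not part of the PR.

```python
def load_device_properties(get_current_device, get_device_properties):
    """Return device properties, or an empty dict when the backend has none.

    Hypothetical helper: the two callables stand in for the Triton driver calls.
    """
    try:
        properties = get_device_properties(get_current_device())
    except Exception:
        # Backends without CUDA-style properties fall back to an empty mapping.
        properties = {}
    return properties


def _unsupported(device):
    raise RuntimeError("device properties are not available on this backend")


# Illustrative values only; the key name is an assumption.
cuda_like = load_device_properties(lambda: 0, lambda device: {"max_shared_mem": 101376})
npu_like = load_device_properties(lambda: 0, _unsupported)

# Later code can then read optional keys with a default.
shared_mem = npu_like.get("max_shared_mem", 0)
```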

][:max_axes]

@classmethod
def _rewrite_autotune_keyword(cls, keyword, max_axes):

Collaborator

Could this be a staticmethod as well?

keyword.value.elts = cls._filter_autotune_keys(keyword.value.elts, max_axes)

@classmethod
def _rewrite_autotune_call(cls, node, max_axes):

Collaborator

Could this be a staticmethod as well?

return

@classmethod
def _rewrite_load_call(cls, node):

Collaborator

Could this be a staticmethod as well?

keyword.value.value = 0.0

@classmethod
def _rewrite_clamp_call(cls, node):

Collaborator

Could this be a staticmethod as well?
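
The staticmethod suggestion repeated on these helpers applies whenever a helper only inspects and mutates the node it receives. A sketch of that shape is below. The body of `_rewrite_load_call` is a guess based on the single visible line (`keyword.value.value = 0.0`): it assumes the rewrite normalizes a constant `other=` keyword to 0.0, which the PR excerpt does not confirm, and the class name `Rewriter` is hypothetical.

```python
import ast


class Rewriter(ast.NodeTransformer):
    @staticmethod
    def _rewrite_load_call(node):
        # Neither `cls` nor `self` is needed: the helper only touches `node`.
        for keyword in node.keywords:
            if keyword.arg == "other" and isinstance(keyword.value, ast.Constant):
                keyword.value.value = 0.0  # assumed normalization, see lead-in

    def visit_Call(self, node):
        self.generic_visit(node)
        # Static methods are still reachable through the instance.
        self._rewrite_load_call(node)
        return node


tree = Rewriter().visit(ast.parse("v = load(ptr, mask=m, other=None)"))
rewritten = ast.unparse(tree)
```

A helper that calls another class-level helper, as `_rewrite_autotune_keyword` does through `cls._filter_autotune_keys`, would need to name the class explicitly if it became a static method.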

type(self)._rewrite_load_call(node)
return type(self)._rewrite_clamp_call(node)

def visit_Module(self, node):

Collaborator

Please order these visitors as they appear in the documentation: https://docs.python.org/3/library/ast.html.
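
As a sketch of the requested ordering: the `ast` documentation introduces root nodes first, then expressions, and function and class definitions later, so `visit_Module` would precede `visit_Call`, which would precede `visit_FunctionDef`. Which visitors the PR actually defines beyond `visit_Module` is not visible here, so the set below and the class name `Recorder` are assumptions.

```python
import ast


class Recorder(ast.NodeVisitor):
    """Visitor methods listed in the order the `ast` docs introduce the node classes."""

    def __init__(self):
        self.seen = []

    # Root nodes
    def visit_Module(self, node):
        self.seen.append("Module")
        self.generic_visit(node)

    # Expressions
    def visit_Call(self, node):
        self.seen.append("Call")
        self.generic_visit(node)

    # Function and class definitions
    def visit_FunctionDef(self, node):
        self.seen.append("FunctionDef")
        self.generic_visit(node)


recorder = Recorder()
recorder.visit(ast.parse("def kernel():\n    launch()\n"))
```

The order in which the methods are defined does not affect dispatch, since each node is routed by its class name, so reordering them is purely a readability change.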

2 participants