Commit c8a3ff2

Author: zhangyue
docs(paged_attention): explain why seq_lens_host / block_table_host exist
The rationale (CANN CPU-tensor contract + NPUGraph capturability) was only documented in the Ascend ATB kernel header. Surface it on the base class where the API contract lives, so any future backend implementor understands why the optional host tensors are part of the signature.
1 parent 1f4c15e commit c8a3ff2

1 file changed

Lines changed: 13 additions & 0 deletions

src/base/paged_attention.h

@@ -32,6 +32,19 @@ namespace infini::ops {
 //
 // Output layout:
 // output : [batch, num_heads, head_size]
+//
+// Optional host tensors: `seq_lens_host` and `block_table_host` are CPU
+// mirrors of `seq_lens` and `block_table`. They exist because CANN's
+// paged-attention APIs mandate CPU-resident metadata: aclnn declares
+// `qSeqLens` as a CPU tensor in its signature, and ATB
+// `PagedAttentionParam` reads `aclIntArray*` parameters from the
+// `hostData` field at `aclnnRunner::Setup()` time. Without caller-
+// provided host tensors, the kernel must synchronously D2H-copy both
+// each call, which (a) blocks the stream and (b) prevents NPUGraph
+// capture (sync copies are not capturable). When the caller already
+// has CPU-pinned copies (e.g. vLLM's `optimistic_seq_lens_cpu` and
+// `BlockTable.get_cpu_tensor()`), passing them through lets the kernel
+// skip both D2H copies and be captured into a full NPUGraph.
 class PagedAttention : public Operator<PagedAttention> {
 public:
 PagedAttention(const Tensor query, const Tensor key_cache,