Avoid per-step allocations in CUDA-graph decode (fix #175) by MrAnayDongre · Pull Request #176 · GeeeekExplorer/nano-vllm

MrAnayDongre · 2026-02-23T05:03:19Z

Fixes #175
This change removes per-step tensor allocations in the CUDA-graph decode path by:

Allocating persistent pinned CPU staging buffers once during capture_cudagraph().
Filling and padding staging buffers up to the selected bucket size during prepare_decode().
Copying staging buffers into captured CUDA graph inputs before graph.replay().
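
For readers unfamiliar with the pattern, the three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `DecodeStaging` class, the `GRAPH_BUCKETS` values, the `pick_bucket` helper, and the choice of `input_ids`/`positions` as the only staged tensors are all assumptions made for the example. It falls back to ordinary CPU tensors when CUDA is not available so it can run anywhere.

```python
import torch

# Hypothetical bucket sizes; the real ones are chosen during graph capture.
GRAPH_BUCKETS = [1, 2, 4, 8, 16]
USE_CUDA = torch.cuda.is_available()
DEVICE = "cuda" if USE_CUDA else "cpu"


def pick_bucket(batch_size: int) -> int:
    """Return the smallest bucket that can hold batch_size sequences."""
    return next(b for b in GRAPH_BUCKETS if b >= batch_size)


class DecodeStaging:
    """Illustrative holder for persistent staging and graph-input buffers."""

    def __init__(self, max_bs: int):
        # Step 1: allocate once. Buffers are pinned only when CUDA is present,
        # so host-to-device copies can be issued with non_blocking=True.
        self.input_ids = torch.zeros(max_bs, dtype=torch.int64, pin_memory=USE_CUDA)
        self.positions = torch.zeros(max_bs, dtype=torch.int64, pin_memory=USE_CUDA)
        # Stand-ins for the static tensors a captured CUDA graph would read.
        self.graph_input_ids = torch.zeros(max_bs, dtype=torch.int64, device=DEVICE)
        self.graph_positions = torch.zeros(max_bs, dtype=torch.int64, device=DEVICE)

    def stage(self, token_ids: list, positions: list) -> int:
        bs = len(token_ids)
        bucket = pick_bucket(bs)
        # Step 2: fill the live rows, then zero-pad up to the bucket size.
        # A small temporary CPU tensor is still created from the Python list;
        # the point is that no new pinned or device tensor is allocated.
        self.input_ids[:bs] = torch.tensor(token_ids, dtype=torch.int64)
        self.positions[:bs] = torch.tensor(positions, dtype=torch.int64)
        self.input_ids[bs:bucket].zero_()
        self.positions[bs:bucket].zero_()
        # Step 3: copy into the graph inputs in place; a real decode step
        # would call graph.replay() for this bucket right after.
        self.graph_input_ids[:bucket].copy_(self.input_ids[:bucket], non_blocking=True)
        self.graph_positions[:bucket].copy_(self.positions[:bucket], non_blocking=True)
        return bucket
```

Calling `stage()` on every decode step reuses the same host and device storage, so the addresses baked into a captured graph stay valid and only the buffer contents change between replays.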

ygch · 2026-02-25T06:24:26Z

Hello, I'd like to ask: how much did the decode speed change after this modification?

ygch · 2026-02-25T06:27:19Z

Hello, I would like to ask how much the decoding speed has changed after the modification.

Avoid per-step allocations in CUDA-graph decode

1c906a2
