Avoid per-step allocations in CUDA-graph decode (fix #175) #176

Open
MrAnayDongre wants to merge 1 commit into GeeeekExplorer:main from MrAnayDongre:fix-175-cudagraph-decode-alloc

@MrAnayDongre
Fixes #175
This change removes per-step tensor allocations in the CUDA-graph decode path by:

  • Allocating persistent pinned CPU staging buffers once during capture_cudagraph().
  • Filling and padding staging buffers up to the selected bucket size during prepare_decode().
  • Copying staging buffers into captured CUDA graph inputs before graph.replay().
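The staging logic in the steps above can be sketched without a GPU. This is a minimal illustration, not the PR's code: the bucket sizes, the padding values, and the names `DecodeStaging`, `pick_bucket`, and `fill` are all hypothetical, and plain `array` objects stand in for the pinned CPU tensors. It only shows that the buffers are allocated once, then overwritten and padded in place on every step.

```python
from array import array
from bisect import bisect_left

# Hypothetical bucket sizes; the real capture_cudagraph() chooses its own.
GRAPH_BATCH_SIZES = [1, 2, 4, 8, 16]
PAD_TOKEN = 0   # assumed padding value for token ids and positions
PAD_SLOT = -1   # assumed padding value for slot mapping


def pick_bucket(batch_size):
    """Return the smallest captured bucket that fits batch_size."""
    if not 1 <= batch_size <= GRAPH_BATCH_SIZES[-1]:
        raise ValueError(f"batch size {batch_size} has no captured bucket")
    return GRAPH_BATCH_SIZES[bisect_left(GRAPH_BATCH_SIZES, batch_size)]


class DecodeStaging:
    """Staging buffers allocated once and reused on every decode step."""

    def __init__(self, max_bs):
        # With PyTorch these would be pinned tensors, for example
        # torch.zeros(max_bs, dtype=torch.int64, pin_memory=True).
        self.input_ids = array("q", [PAD_TOKEN] * max_bs)
        self.positions = array("q", [0] * max_bs)
        self.slot_mapping = array("q", [PAD_SLOT] * max_bs)

    def fill(self, token_ids, positions, slots):
        """Write one step's inputs in place and pad up to the bucket size."""
        bs = len(token_ids)
        bucket = pick_bucket(bs)
        for i in range(bs):
            self.input_ids[i] = token_ids[i]
            self.positions[i] = positions[i]
            self.slot_mapping[i] = slots[i]
        # Overwrite the tail so values from a previous step cannot leak in.
        for i in range(bs, bucket):
            self.input_ids[i] = PAD_TOKEN
            self.positions[i] = 0
            self.slot_mapping[i] = PAD_SLOT
        return bucket


staging = DecodeStaging(GRAPH_BATCH_SIZES[-1])
bucket = staging.fill([11, 12, 13], [5, 9, 2], [40, 41, 42])
print(bucket, list(staging.input_ids[:bucket]))  # 4 [11, 12, 13, 0]
```

In a PyTorch version, the first `bucket` entries of each staging buffer would then be copied into the captured graph inputs, for instance with `Tensor.copy_(..., non_blocking=True)`, before `graph.replay()`.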

@ygch
ygch commented Feb 25, 2026

Hello, how much does the decode speed change after this modification?

@ygch
ygch commented Feb 25, 2026

Hello, I would like to ask how much the decoding speed has changed after the modification.

Development

Successfully merging this pull request may close these issues.

Perf: CUDA graph decode still allocates tensors per step

2 participants