[Others] [Hackathon 10th Spring No.49] [RFC] Parallel N-gram Matching GPU Kernel Design Document (V2.0) #1294

Closed
cloudforge1 wants to merge 1 commit into PaddlePaddle:master from CloudForge-Solutions:task/049-rfc-parallel-ngram-kernel

@cloudforge1
Contributor

Motivation

RFC V2.0 design document for Hackathon 10th Spring Task No.49: support speculative decoding in FastDeploy with a parallel N-gram matching GPU kernel.

This RFC supersedes the design section (§三, i.e. Section 3) of the prior RFC #1213 (NKNaN, 2026-02-07) by providing the complete parallel architecture specification that the original lacked. That section ran to 12 lines and contained no thread model, no parallelism strategy, and no performance analysis for extreme configurations.

What this RFC adds over #1213:

| Gap in #1213 | This RFC |
| --- | --- |
| No parallel strategy | Two-phase kernel: Phase 1 <<<bsz, 1024>>> + Phase 2 <<<1, 1024>>> |
| No thread model | 1024 threads/block parallel sliding-window search + atomicMin64 CAS |
| No threshold GPU-ization | CUB BlockScan prefix-sum, O(bsz) parallel |
| No extreme config analysis | Tested bsz=256, seq=128K: GPU 162µs vs CPU 275ms (1,700×) |
| No template specialization | parallel_ngram_search_specialized<1/2/3> with register caching |
| No early-exit mechanism | A2 early-exit: skip positions beyond current best match |
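
To make the "parallel sliding-window search + atomicMin64" and "early-exit" rows more concrete, here is a minimal CPU-side sketch in Python. It is not the kernel from PR #6960 or #7136. It assumes, without confirmation from the RFC text above, that the pattern is the last n tokens of the sequence, that the leftmost earlier occurrence wins (which is what a min-reduction over positions would give), and that each simulated thread scans positions `tid, tid + num_threads, ...`. The names `ngram_match_sequential`, `ngram_match_strided`, `draft_tokens`, and `NO_MATCH` are hypothetical and exist only for this illustration.

```python
from typing import List, Optional

# Sentinel larger than any valid position; stands in for the initial value
# of a shared 64-bit slot that threads would update with an atomic min.
NO_MATCH = 2**63 - 1


def ngram_match_sequential(tokens: List[int], n: int) -> Optional[int]:
    """Reference scan: leftmost earlier position where the last n tokens recur."""
    if n <= 0 or len(tokens) <= n:
        return None
    pattern = tokens[-n:]
    last_start = len(tokens) - n  # start of the suffix itself, excluded
    for pos in range(last_start):
        if tokens[pos:pos + n] == pattern:
            return pos
    return None


def ngram_match_strided(tokens: List[int], n: int,
                        num_threads: int = 1024) -> Optional[int]:
    """Emulates num_threads workers, each scanning a strided set of positions.

    On a GPU the outer loop would run concurrently; here it runs in order,
    which yields the same result because the reduction is a plain minimum.
    """
    if n <= 0 or len(tokens) <= n:
        return None
    pattern = tokens[-n:]
    last_start = len(tokens) - n
    best = NO_MATCH  # assumption: shared slot, updated via atomic min on GPU
    for tid in range(num_threads):
        for pos in range(tid, last_start, num_threads):
            if pos >= best:
                break  # early exit: this position cannot beat the current best
            if tokens[pos:pos + n] == pattern:
                best = min(best, pos)  # emulates the atomic min update
                break  # later positions for this worker are all larger
    return None if best == NO_MATCH else best


def draft_tokens(tokens: List[int], pos: int, n: int,
                 max_draft: int) -> List[int]:
    """Tokens that followed the matched n-gram, proposed as draft tokens."""
    return tokens[pos + n:pos + n + max_draft]


if __name__ == "__main__":
    seq = [5, 1, 2, 3, 9, 1, 2, 3, 7, 1, 2]
    hit = ngram_match_strided(seq, n=2, num_threads=4)
    print(hit, draft_tokens(seq, hit, n=2, max_draft=3))
```

Because every worker visits its positions in increasing order, its first hit is its own minimum, and the minimum across workers is the leftmost match overall. That is why the strided version agrees with the sequential reference regardless of the order in which workers run.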

Implementation PRs (already submitted, CI-verified on SM90 H20/H100):

  • PR #6960 — Standard two-phase parallel kernel
  • PR #7136 — Template-specialized optimization variant

Context: @freeliuzc clarified that the kernel must use parallel strategies for acceleration at bsz=256, seq_len=128K — requirements not present in the original task description or #1213. This RFC formally documents the parallel architecture that meets those requirements.

Modifications

  • Added RFC design document: rfcs/FastDeploy/20260404_parallel_ngram_match_gpu_kernel_for_fastdeploy.md

Usage or Command

N/A — documentation only.

Accuracy Tests

N/A — design document. Implementation accuracy verified in PRs #6960 and #7136 (11/11 tests passed, GPU output bit-identical to CPU reference).

Checklist

  • Add at least a tag in the PR title.
  • Format your code, run pre-commit before commit.
  • Add unit tests. N/A — RFC only.
  • Provide accuracy results. N/A — RFC only.
  • If submitting to release branch, cherry-pick from develop. N/A — targeting master.

@paddle-bot

paddle-bot Bot commented Apr 4, 2026

Your PR has been submitted. Thanks for your contribution!
Please check its format and content. For this, you can refer to Template and Demo.

@cloudforge1
Contributor Author

Closing: will resubmit as an inline edit to the existing merged RFC file, per Baidu convention.
