perf(inference): streaming TRT optimizations and optional bucketed Flow engines by BeckYang26 · Pull Request #1894 · FunAudioLLM/CosyVoice

BeckYang26 · 2026-05-22T06:29:33Z

Summary

This PR improves CosyVoice2/3 streaming TTS + vLLM + TensorRT Flow inference in two commits. Both are backward-compatible: no breaking API changes, no new environment variables in cosyvoice/, and CosyVoice1 / non-streaming paths are untouched.

Benchmark and test details: #1892

Commit 1 — `perf(inference): reduce streaming latency and optimize TRT inference`

Streaming (CosyVoice2/3):

- Replace the fixed `time.sleep(0.1)` polling with `threading.Condition` wake-ups so the main thread proceeds as soon as tokens arrive (removes up to ~100 ms of polling delay).
- Override `llm_job()` to call `notify()` after each token and on completion.
- Bug fix: use a local `hop_len` in the streaming loop instead of mutating `self.token_hop_len`, so the hop length resets correctly across repeated streaming requests.
- Add `first_token_hop_len` as a class constant (currently 25, aligned with the training `chunk_size`); the trigger threshold is unchanged from upstream.
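For reviewers who want a concrete picture of the two streaming changes above, here is a minimal, self-contained sketch. It is not the code in `cli/model.py`: the `StreamingSession` class, its fields, and the doubling of the hop length are hypothetical stand-ins chosen for illustration. It only shows the pattern of a producer calling `notify()` per token while the consumer waits on a `threading.Condition` and keeps its hop length in a local variable.

```python
import threading
import time


class StreamingSession:
    """Hypothetical stand-in for a streaming model; all names are illustrative."""

    token_hop_len = 25  # shared default; never mutated by a request

    def __init__(self):
        self.cond = threading.Condition()
        self.tokens = []
        self.done = False

    def llm_job(self, n_tokens, delay=0.001):
        # Producer: append one token at a time and wake the consumer each time.
        for i in range(n_tokens):
            time.sleep(delay)  # simulated per-token decode time
            with self.cond:
                self.tokens.append(i)
                self.cond.notify()
        with self.cond:
            self.done = True
            self.cond.notify()  # wake the consumer on completion as well

    def stream(self, n_tokens):
        # Consumer: yield a chunk as soon as hop_len new tokens are available.
        self.tokens, self.done = [], False
        hop_len = self.token_hop_len  # local copy, so every request starts fresh
        producer = threading.Thread(target=self.llm_job, args=(n_tokens,))
        producer.start()
        offset = 0
        while True:
            with self.cond:
                # The predicate is re-checked on every notify(); there is no
                # fixed sleep between a token arriving and the wake-up.
                self.cond.wait_for(
                    lambda: len(self.tokens) - offset >= hop_len or self.done
                )
                if len(self.tokens) - offset >= hop_len:
                    chunk = self.tokens[offset:offset + hop_len]
                    last = False
                else:
                    chunk = self.tokens[offset:]  # remainder after completion
                    last = True
            offset += len(chunk)
            if chunk:
                yield chunk
            if last:
                break
            hop_len *= 2  # assumed growth rule; only the local variable changes
        producer.join()
```

Because `hop_len` is local to `stream()`, a second call on the same object starts again from `token_hop_len`, which is the behaviour the bug fix is meant to restore.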

TensorRT (Flow DiT):

- Reuse one TRT execution context across all Euler steps in `solve_euler()` (previously acquired and released per step).
- Use CUDA stream `wait_stream()` instead of a device-wide `synchronize()` for finer-grained overlap with vLLM.
- Fix `TrtContextWrapper` to store a real `torch.cuda.Stream` object.
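The context-reuse change can be illustrated without TensorRT. The sketch below is a pure-Python analogy, assuming a hypothetical `ContextPool` and a toy `velocity` callback; it does not reproduce `solve_euler()` or `TrtContextWrapper`, and it leaves out the CUDA stream handling entirely. It only contrasts acquiring a pooled context once per Euler step with acquiring it once for the whole solve.

```python
import queue
from contextlib import contextmanager


class ContextPool:
    """Hypothetical pool of execution contexts that counts acquisitions."""

    def __init__(self, size=1):
        self._pool = queue.Queue()
        for i in range(size):
            self._pool.put(f"ctx{i}")
        self.acquire_count = 0

    @contextmanager
    def acquire(self):
        ctx = self._pool.get()  # blocks while every context is in use
        self.acquire_count += 1
        try:
            yield ctx
        finally:
            self._pool.put(ctx)


def euler_per_step(pool, x, t_span, velocity):
    # Old pattern: acquire and release a context on every Euler step.
    for t0, t1 in zip(t_span[:-1], t_span[1:]):
        with pool.acquire() as ctx:
            x = x + (t1 - t0) * velocity(ctx, x, t0)
    return x


def euler_reuse(pool, x, t_span, velocity):
    # New pattern: hold one context for the whole integration.
    with pool.acquire() as ctx:
        for t0, t1 in zip(t_span[:-1], t_span[1:]):
            x = x + (t1 - t0) * velocity(ctx, x, t0)
    return x
```

Both functions compute the same result; the second touches the pool once instead of once per step, which is the overhead the first bullet above removes.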

Files: cli/model.py, flow/flow_matching.py, utils/common.py

Commit 2 — `feat(inference): optional bucketed TensorRT engines for Flow DiT`

Motivation: The default single TRT plan is built for a maximum seq_len of 3000, so short streaming chunks still run on a large engine, increasing latency and VRAM use per context.

Changes:

- Add an opt-in `trt_bucket` parameter to `AutoModel` / `CosyVoice*` / `load_trt()` (default `False`).
- Four bucket engines (256 / 768 / 1536 / 3000 mel frames), auto-built under `{model_dir}/trt_bucket_plans/` when plans are missing.
- Introduce `TrtBucketedContextWrapper` for seq_len-aware routing via `acquire_estimator(seq_len=...)`.
- Prefer `flow.decoder.estimator.fp32.optimize.onnx`; fall back to the official `fp32.onnx`.
- Fix `get_trt_kwargs()` to use 6 ONNX inputs (`x`, `mask`, `mu`, `t`, `spks`, `cond`) with per-bucket profiles.
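The seq_len-aware routing can be pictured as a lookup for the smallest bucket that fits. The sketch below is an assumption about how such a selection might work, not the logic inside `TrtBucketedContextWrapper`: the `select_bucket` helper is hypothetical, and both the inclusive upper bounds and the error for lengths above 3000 are illustrative choices the PR text does not specify.

```python
import bisect

# Bucket sizes in mel frames, as listed in the PR description.
BUCKETS = (256, 768, 1536, 3000)


def select_bucket(seq_len, buckets=BUCKETS):
    """Return the smallest bucket whose size is >= seq_len (hypothetical helper)."""
    if seq_len < 1:
        raise ValueError("seq_len must be positive")
    idx = bisect.bisect_left(buckets, seq_len)
    if idx == len(buckets):
        # Assumed behaviour: reject lengths beyond the largest engine.
        raise ValueError(f"seq_len {seq_len} exceeds largest bucket {buckets[-1]}")
    return buckets[idx]
```

Under these assumptions a 100-frame streaming chunk would run on the 256 engine, while a 2000-frame utterance would be routed to the 3000 engine.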

Files: utils/file_utils.py, utils/common.py, cli/model.py, cli/cosyvoice.py, flow/flow_matching.py

Compatibility

| Setting | Behavior |
| --- | --- |
| Default (`trt_bucket=False`) | Same as upstream + Commit 1 streaming/TRT fixes |
| `trt_bucket=True` | Uses bucket plans under `trt_bucket_plans/`; does not read `*.mygpu.plan` |
| Existing `mygpu.plan` | Not auto-rebuilt; rebuild only when missing or empty |
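The "rebuild only when missing or empty" rule in the last row amounts to a small file check. The following is a minimal sketch of such a check; the helper name `plan_needs_build` is hypothetical and is not claimed to match the function used in `utils/file_utils.py` or `cli/model.py`.

```python
import os


def plan_needs_build(plan_path):
    """Return True when the plan file is missing or zero bytes long."""
    return not os.path.exists(plan_path) or os.path.getsize(plan_path) == 0
```

With a check of this shape, an existing non-empty plan is always reused.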

Test plan

- CosyVoice3 + `load_vllm=True` + `load_trt=True`, `stream=True`: first-packet latency vs upstream
- `stream=False` regression
- Multiple consecutive streaming requests: hop length resets correctly
- `trt_bucket=False`: single-plan path unchanged
- `trt_bucket=True` + `optimize.onnx`: auto-build 4 bucket plans on first run
- Short vs long utterances: smaller buckets used for short seq_len

Review guide

Use the Commits tab to review each commit independently:

1. Streaming + TRT hot-path optimizations (3 files)
2. Optional bucket TRT feature (5 files, opt-in)

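- Streaming inference uses a Condition instead of sleep polling; the first-packet hop_len remains 25
- Reuse the TRT context inside solve_euler and use wait_stream instead of a full sync
- Fix how the CUDA Stream object is created in TrtContextWrapper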

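- Add TrtBucketedContextWrapper, which routes by seq_len to four engine tiers (256/768/1536/3000)
- load_trt supports a trt_bucket parameter and auto-builds from optimize.onnx when plans are missing
- get_trt_kwargs is aligned with the six export_onnx inputs and supports per-bucket max_len profiles
- CosyVoice/CosyVoice2/3 gain a trt_bucket constructor parameter and prefer optimize.onnx
- flow_matching calls acquire_estimator(seq_len=...) to select the bucket at runtime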

BeckYang26 · 2026-05-22T06:32:04Z

        # FSQ silent and breath token
        self.silent_tokens = [1, 2, 28, 29, 55, 248, 494, 2241, 2242, 2322, 2323]
+        self.condition_dict = {}
+        self.first_token_hop_len = 25


first_token_hop_len could be set smaller here to make the first packet faster, but audio quality may suffer.

BeckYang26 added 2 commits May 21, 2026 05:04

perf(inference): reduce streaming first-packet latency and optimize TRT inference

b6ec1e5


BeckYang26 wants to merge 2 commits into FunAudioLLM:main from BeckYang26:perf/streaming-trt-inference
