
perf(inference): streaming TRT optimizations and optional bucketed Flow engines#1894

Open
BeckYang26 wants to merge 2 commits into FunAudioLLM:main from BeckYang26:perf/streaming-trt-inference

BeckYang26 commented May 22, 2026

Summary

This PR improves CosyVoice2/3 streaming TTS + vLLM + TensorRT Flow inference in two commits. Both are backward-compatible: no breaking API changes, no new environment variables in cosyvoice/, and CosyVoice1 / non-streaming paths are untouched.

Benchmark and test details: #1892

Commit 1 — perf(inference): reduce streaming latency and optimize TRT inference

Streaming (CosyVoice2/3):

  • Replace fixed time.sleep(0.1) polling with threading.Condition wake-ups so the main thread proceeds as soon as tokens arrive, removing up to ~100 ms of polling delay per wait.
  • Override llm_job() to notify() after each token and on completion.
  • Fix a bug: use local hop_len in the streaming loop instead of mutating self.token_hop_len, so repeated streaming requests reset hop correctly.
  • Add first_token_hop_len as a class constant (currently 25, aligned with training chunk_size); trigger threshold unchanged vs upstream.
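
The Condition-based wake-up described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `TokenBuffer`, `producer`, `wait_for`, and `stream_chunks` are hypothetical names, and the simulated decode delay stands in for the real LLM job.

```python
import threading
import time


class TokenBuffer:
    """Hypothetical stand-in for the per-request token state shared by two threads."""

    def __init__(self):
        self.tokens = []
        self.done = False
        self.cond = threading.Condition()

    def producer(self, n_tokens, delay=0.001):
        # Plays the role of llm_job(): append a token, then wake the consumer.
        for i in range(n_tokens):
            time.sleep(delay)  # simulated decode step
            with self.cond:
                self.tokens.append(i)
                self.cond.notify()
        with self.cond:
            self.done = True
            self.cond.notify()  # final wake-up on completion

    def wait_for(self, needed, timeout=5.0):
        # Block until `needed` tokens exist or generation finished; no fixed sleep.
        with self.cond:
            self.cond.wait_for(
                lambda: len(self.tokens) >= needed or self.done, timeout=timeout
            )
            return len(self.tokens), self.done


def stream_chunks(n_tokens=60, hop_len=25):
    """Consume tokens in hop_len-sized chunks as soon as they are available."""
    buf = TokenBuffer()
    worker = threading.Thread(target=buf.producer, args=(n_tokens,))
    worker.start()
    chunks, offset = [], 0
    while True:
        available, done = buf.wait_for(offset + hop_len)
        if available - offset >= hop_len:
            chunks.append(buf.tokens[offset:offset + hop_len])
            offset += hop_len
        elif done:
            if available > offset:
                chunks.append(buf.tokens[offset:available])  # trailing partial chunk
            break
    worker.join()
    return chunks
```

With a fixed `time.sleep(0.1)` loop, the consumer can sit idle for up to 100 ms after the token it needs has already arrived; with `wait_for` it resumes on the next `notify()`.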

TensorRT (Flow DiT):

  • Reuse one TRT execution context across all Euler steps in solve_euler() (was acquire/release per step).
  • Use CUDA stream wait_stream() instead of device-wide synchronize() for finer overlap with vLLM.
  • Fix TrtContextWrapper to store a real torch.cuda.Stream object.
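
The context-reuse item above can be illustrated with a toy pool. This sketch assumes a queue-backed pool and a placeholder velocity function; `ContextPool`, `velocity`, `euler_per_step`, and `euler_reused` are hypothetical names, and CUDA stream handling is not modeled here.

```python
import queue
from contextlib import contextmanager


class ContextPool:
    """Hypothetical pool of execution contexts that counts acquisitions."""

    def __init__(self, size=1):
        self._pool = queue.Queue()
        for i in range(size):
            self._pool.put(f"ctx-{i}")
        self.acquire_count = 0

    @contextmanager
    def acquire(self):
        ctx = self._pool.get()  # blocks if every context is in use
        self.acquire_count += 1
        try:
            yield ctx
        finally:
            self._pool.put(ctx)  # always return the context to the pool


def velocity(ctx, x):
    # Placeholder for the estimator forward pass; ctx is unused in this toy.
    return -x


def euler_per_step(pool, x, n_steps, dt):
    # Previous pattern: acquire and release a context on every Euler step.
    for _ in range(n_steps):
        with pool.acquire() as ctx:
            x = x + dt * velocity(ctx, x)
    return x


def euler_reused(pool, x, n_steps, dt):
    # Pattern described in this PR: hold one context for the whole solve.
    with pool.acquire() as ctx:
        for _ in range(n_steps):
            x = x + dt * velocity(ctx, x)
    return x
```

Both functions compute the same result; the reused variant touches the pool once per solve instead of once per step, at the cost of holding the context for the full duration of the solve.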

Files: cli/model.py, flow/flow_matching.py, utils/common.py


Commit 2 — feat(inference): optional bucketed TensorRT engines for Flow DiT

Motivation: The default single TRT plan is built for max seq_len=3000. Short streaming chunks therefore still run on a large engine, which increases latency and VRAM per context.

Changes:

  • Add opt-in trt_bucket parameter to AutoModel / CosyVoice* / load_trt() (default False).
  • Four bucket engines (256 / 768 / 1536 / 3000 mel frames), auto-built under {model_dir}/trt_bucket_plans/ when plans are missing.
  • Introduce TrtBucketedContextWrapper for seq_len-aware routing via acquire_estimator(seq_len=...).
  • Prefer flow.decoder.estimator.fp32.optimize.onnx; fall back to the official fp32.onnx.
  • Fix get_trt_kwargs() to use 6 ONNX inputs (x, mask, mu, t, spks, cond) with per-bucket profiles.
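
The seq_len-aware routing can be sketched as picking the smallest bucket that fits. This is an illustration only: `select_bucket` is a hypothetical helper, and both the inclusive boundaries and the error raised for lengths above 3000 are assumptions rather than behavior confirmed by this PR.

```python
import bisect

# Bucket limits in mel frames, as listed in this PR.
BUCKETS = (256, 768, 1536, 3000)


def select_bucket(seq_len, buckets=BUCKETS):
    """Return the smallest bucket limit that can hold seq_len frames."""
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    # bisect_left gives the index of the first limit >= seq_len.
    idx = bisect.bisect_left(buckets, seq_len)
    if idx == len(buckets):
        # Assumption: inputs longer than the largest bucket are rejected.
        raise ValueError(f"seq_len {seq_len} exceeds largest bucket {buckets[-1]}")
    return buckets[idx]
```

Under these assumptions a 100-frame streaming chunk routes to the 256 engine, while a 257-frame chunk moves up to 768.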

Files: utils/file_utils.py, utils/common.py, cli/model.py, cli/cosyvoice.py, flow/flow_matching.py


Compatibility

| Setting | Behavior |
| --- | --- |
| Default (trt_bucket=False) | Same as upstream + Commit 1 streaming/TRT fixes |
| trt_bucket=True | Uses bucket plans under trt_bucket_plans/; does not read *.mygpu.plan |
| Existing mygpu.plan | Not auto-rebuilt; rebuild only when missing or empty |
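
The "rebuild only when missing or empty" rule amounts to a small file check. A minimal sketch, assuming a hypothetical helper name `needs_build` (the actual check in this PR may differ):

```python
import os


def needs_build(plan_path):
    """True when the plan file is absent or zero bytes, so it must be (re)built."""
    return not os.path.isfile(plan_path) or os.path.getsize(plan_path) == 0
```

A non-empty plan file is left untouched by this check, which matches the stated behavior that existing plans are not auto-rebuilt.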

Test plan

  • CosyVoice3 + load_vllm=True + load_trt=True, stream=True: first-packet latency vs upstream
  • stream=False regression
  • Multiple consecutive streaming requests: hop length resets correctly
  • trt_bucket=False: single-plan path unchanged
  • trt_bucket=True + optimize.onnx: auto-build 4 bucket plans on first run
  • Short vs long utterances: smaller buckets used for short seq_len
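
The hop-length reset item in the test plan can be illustrated with a toy model of the bug. The class below is a hypothetical sketch: the cap of 100 and the doubling growth rule are assumptions chosen for illustration, not values taken from this PR.

```python
class StreamerSketch:
    token_hop_len = 25       # base hop length
    token_max_hop_len = 100  # hypothetical cap
    stream_scale_factor = 2  # hypothetical growth factor

    def hops_mutating(self, n_chunks):
        # Buggy pattern: growth is written back to self, so it leaks
        # into the next request.
        hops = []
        for _ in range(n_chunks):
            hops.append(self.token_hop_len)
            self.token_hop_len = min(
                self.token_max_hop_len,
                self.token_hop_len * self.stream_scale_factor,
            )
        return hops

    def hops_local(self, n_chunks):
        # Fixed pattern: each request starts from the base value in a local.
        hop_len = self.token_hop_len
        hops = []
        for _ in range(n_chunks):
            hops.append(hop_len)
            hop_len = min(self.token_max_hop_len, hop_len * self.stream_scale_factor)
        return hops
```

A regression check then calls the streaming path twice on the same object and compares the hop sequences: the local variant yields identical sequences, while the mutating variant starts the second request at the grown value.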

Review guide

Use the Commits tab to review each commit independently:

  1. Streaming + TRT hot-path optimizations (3 files)
  2. Optional bucket TRT feature (5 files, opt-in)

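- Streaming inference uses Condition instead of sleep polling; the first-packet hop_len stays at 25
- Reuse the TRT context inside solve_euler and replace the full sync with wait_stream
- Fix how the CUDA Stream object is created in TrtContextWrapper
- Add TrtBucketedContextWrapper, which routes by seq_len across four engine tiers (256/768/1536/3000)
- load_trt supports a trt_bucket parameter and builds plans automatically from optimize.onnx when they are missing
- get_trt_kwargs is aligned with the six inputs of export_onnx and supports bucketed profiles by max_len
- CosyVoice/CosyVoice2/3 gain a trt_bucket constructor parameter and prefer optimize.onnx
- flow_matching calls acquire_estimator(seq_len=...) to select the bucket at runtime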
Comment thread cosyvoice/cli/model.py
# FSQ silent and breath token
self.silent_tokens = [1, 2, 28, 29, 55, 248, 494, 2241, 2242, 2322, 2323]
self.condition_dict = {}
self.first_token_hop_len = 25
Author

first_token_hop_len could be set smaller here; the first packet would arrive sooner, but audio quality may suffer.
