Feat/acceleration #1524

Open
Apageoflove wants to merge 5 commits into zai-org:main from Apageoflove:feat/acceleration

@Apageoflove
Summary

This PR puts together a complete set of acceleration options for the inference speed problems reported in the #1373 discussion.

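Problems addressed:

  • INT4 quantization is slower than FP16 → bitsandbytes NF4 quantization
  • Low GPU utilization (20-30%) → vLLM / batch inference
  • Perceived slowness → streaming output (CLI/SSE/Gradio)
  • Cannot run on low VRAM → INT4 + CPU offload
  • Need maximum speed → TensorRT-LLM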
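
As a rough illustration of the batch-inference item above, here is a minimal sketch of grouping prompts so that each model call handles several at once. It does not reflect the scripts in this PR: `chunked`, `run_batched`, and `fake_generate_batch` are hypothetical names, and `generate_batch` stands in for whatever batched model call is actually used.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def run_batched(
    prompts: Sequence[str],
    generate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Call `generate_batch` once per chunk and return outputs in prompt order."""
    outputs: List[str] = []
    for batch in chunked(prompts, batch_size):
        results = generate_batch(batch)
        if len(results) != len(batch):
            raise RuntimeError("generate_batch must return one output per prompt")
        outputs.extend(results)
    return outputs


# Hypothetical stand-in for a real batched model call: upper-cases each prompt.
def fake_generate_batch(batch: List[str]) -> List[str]:
    return [p.upper() for p in batch]
```

In a real setup `generate_batch` would tokenize the chunk with padding and run one generation call for the whole batch. The best `batch_size` depends on available VRAM and sequence lengths, so it would need to be tuned per setup.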

New files:

  • acceleration/01_quantization_acceleration.py: quantization acceleration
  • acceleration/02_vllm_acceleration.py: vLLM acceleration
  • acceleration/03_streaming_optimization.py: streaming output
  • acceleration/04_batch_inference.py: batch inference
  • acceleration/05_low_vram_offload.py: low-VRAM setup
  • acceleration/06_tensorrt_llm_guide.py: TensorRT-LLM guide
  • acceleration/07_benchmark_suite.py: speed comparison benchmark
  • acceleration/README.md: usage instructions

Test plan

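  • Run the quantization acceleration script locally to verify the INT4 speedup
  • Run the streaming output script locally to verify the improvement in perceived latency
  • Run the benchmark script locally to compare the speed of each approach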
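
For the last two checks, one way to put numbers on "perceived latency" and "speed" is to record time to first token and overall tokens per second for any token stream. The sketch below is an assumption about how such a measurement could look, not the contents of `07_benchmark_suite.py`. `measure_stream` and `fake_stream` are hypothetical names, and the fake stream only simulates a model with `time.sleep`.

```python
import time
from typing import Callable, Dict, Iterable, Iterator


def measure_stream(make_stream: Callable[[], Iterable[str]]) -> Dict[str, float]:
    """Consume a token stream and report count, first-token latency, and throughput."""
    start = time.perf_counter()
    first_token_s = None
    tokens = 0
    for _ in make_stream():
        if first_token_s is None:
            # Latency the user actually feels when output is streamed.
            first_token_s = time.perf_counter() - start
        tokens += 1
    total_s = time.perf_counter() - start
    if first_token_s is None:
        # Empty stream: no token ever arrived.
        first_token_s = total_s
    return {
        "tokens": tokens,
        "first_token_s": first_token_s,
        "total_s": total_s,
        "tokens_per_s": tokens / total_s if total_s > 0 else 0.0,
    }


# Hypothetical stand-in for a streaming model: emits one token every `delay_s` seconds.
def fake_stream(n_tokens: int = 5, delay_s: float = 0.01) -> Iterator[str]:
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield f"tok{i}"


if __name__ == "__main__":
    print(measure_stream(lambda: fake_stream(20, 0.005)))
```

Streaming does not change `tokens_per_s`. It only moves the point at which the user sees output from `total_s` to `first_token_s`, which is why the two are reported separately.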
