Commit f80cc31

Merge branch 'release/2.2' into release/2.2
2 parents f6ccb6a + d43c2f2 commit f80cc31

81 files changed

Lines changed: 1832 additions & 416 deletions

File tree

README.md

Lines changed: 6 additions & 11 deletions
@@ -26,6 +26,8 @@ English | [简体中文](README_CN.md)
2626
# FastDeploy : Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
2727

2828
## News
29+
**[2025-09] 🔥 FastDeploy v2.2 is newly released!** It now offers compatibility with models in the HuggingFace ecosystem, has further optimized performance, and newly adds support for [baidu/ERNIE-21B-A3B-Thinking](https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-Thinking)!
30+
2931
**[2025-08] 🔥 Released FastDeploy v2.1:** This release introduces a brand-new KV Cache scheduling strategy and extends PD separation and CUDA Graph support to more models. It also enhances hardware support for platforms such as Kunlun and Hygon, and comprehensively optimizes the performance of both the service and the inference engine.
3032

3133
**[2025-07] The FastDeploy 2.0 Inference Deployment Challenge is now live!** Complete the inference deployment task for the ERNIE 4.5 series open-source models to win official FastDeploy 2.0 merch and generous prizes! 🎁 You're welcome to try it out and share your feedback! 📌[Sign up here](https://www.wjx.top/vm/meSsp3L.aspx#) 📌[Event details](https://github.com/PaddlePaddle/FastDeploy/discussions/2728)
@@ -57,8 +59,9 @@ FastDeploy supports inference deployment on **NVIDIA GPUs**, **Kunlunxin XPUs**,
5759
- [Iluvatar GPU](./docs/get_started/installation/iluvatar_gpu.md)
5860
- [Enflame GCU](./docs/get_started/installation/Enflame_gcu.md)
5961
- [Hygon DCU](./docs/get_started/installation/hygon_dcu.md)
62+
- [MetaX GPU](./docs/get_started/installation/metax_gpu.md.md)
6063

61-
**Note:** We are actively working on expanding hardware support. Additional hardware platforms including Ascend NPU and MetaX GPU are currently under development and testing. Stay tuned for updates!
64+
**Note:** We are actively working on expanding hardware support. Additional hardware platforms including Ascend NPU are currently under development and testing. Stay tuned for updates!
6265

6366
## Get Started
6467

@@ -68,20 +71,12 @@ Learn how to use FastDeploy through our documentation:
6871
- [ERNIE-4.5-VL Multimodal Model Deployment](./docs/get_started/ernie-4.5-vl.md)
6972
- [Offline Inference Development](./docs/offline_inference.md)
7073
- [Online Service Deployment](./docs/online_serving/README.md)
71-
- [Full Supported Models List](./docs/supported_models.md)
7274
- [Best Practices](./docs/best_practices/README.md)
7375

7476
## Supported Models
7577

76-
| Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching | MTP | CUDA Graph | Maximum Context Length |
77-
|:--- | :------- | :---------- | :-------- | :-------- | :----- | :----- | :----- |
78-
|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 ||||||128K |
79-
|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 |||||| 128K |
80-
|ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP || WIP || WIP |128K |
81-
|ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 ||| WIP || WIP |128K |
82-
|ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 ||||||128K |
83-
|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 ||||||128K |
84-
|ERNIE-4.5-0.3B | BF16/WINT8/FP8 |||||| 128K |
78+
Learn how to download models, enable the torch format, and more:
79+
- [Full Supported Models List](./docs/supported_models.md)
8580

8681
## Advanced Usage
8782

README_CN.md

Lines changed: 7 additions & 12 deletions
@@ -26,7 +26,9 @@
2626
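# FastDeploy: Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
2727

2828
## News
29-
**[2025-08] 🔥 FastDeploy v2.1 is newly released:** A brand-new KV Cache scheduling strategy, PD separation and CUDA Graph support for more models, enhanced support for more hardware such as Kunlun and Hygon, and comprehensive performance optimizations for the service and inference engine.
29+
**[2025-09] 🔥 FastDeploy v2.2 is newly released**: Compatibility with HuggingFace ecosystem models, further performance optimization, and new support for [baidu/ERNIE-21B-A3B-Thinking](https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-Thinking)!
30+
31+
**[2025-08] FastDeploy v2.1 released**: A brand-new KV Cache scheduling strategy, PD separation and CUDA Graph support for more models, enhanced support for more hardware such as Kunlun and Hygon, and comprehensive performance optimizations for the service and inference engine.
3032

3133
**[2025-07] The "FastDeploy 2.0 Inference Deployment Hands-on" event is now live!** Complete tasks such as inference deployment of the ERNIE 4.5 series open-source models to win official FastDeploy 2.0 merchandise, including bone china mugs, and generous prizes! 🎁 You are welcome to try it out and share your feedback. 📌[Sign up here](https://www.wjx.top/vm/meSsp3L.aspx#) 📌[Event details](https://github.com/PaddlePaddle/FastDeploy/discussions/2728)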
3234

@@ -55,8 +57,9 @@ FastDeploy supports **NVIDIA GPU**, **Kunlunxin XPU
5557
- [Iluvatar CoreX](./docs/zh/get_started/installation/iluvatar_gpu.md)
5658
- [Enflame S60](./docs/zh/get_started/installation/Enflame_gcu.md)
5759
- [Hygon DCU](./docs/zh/get_started/installation/hygon_dcu.md)
60+
- [MetaX GPU](./docs/zh/get_started/installation/metax_gpu.md.md)
5861

59-
**Note:** We are actively expanding hardware support. Other hardware platforms, including Ascend NPU and MetaX GPU, are currently under development and testing. Stay tuned for updates!
62+
**Note:** We are actively expanding hardware support. Other hardware platforms, including Ascend NPU, are currently under development and testing. Stay tuned for updates!
6063

6164
## Get Started
6265

@@ -66,20 +69,12 @@ FastDeploy supports **NVIDIA GPU**, **Kunlunxin XPU
6669
- [ERNIE-4.5-VL Deployment](./docs/zh/get_started/ernie-4.5-vl.md)
6770
- [Offline Inference](./docs/zh/offline_inference.md)
6871
- [Online Serving](./docs/zh/online_serving/README.md)
69-
- [Supported Models List](./docs/zh/supported_models.md)
7072
- [Best Practices](./docs/zh/best_practices/README.md)
7173

7274
## Supported Models
7375

74-
| Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching | MTP | CUDA Graph | Maximum Context Length |
75-
|:--- | :------- | :---------- | :-------- | :-------- | :----- | :----- | :----- |
76-
|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 ||||||128K |
77-
|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 |||||| 128K |
78-
|ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP || WIP || WIP |128K |
79-
|ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 ||| WIP || WIP |128K |
80-
|ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 ||||||128K |
81-
|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 ||||||128K |
82-
|ERNIE-4.5-0.3B | BF16/WINT8/FP8 |||||| 128K |
76+
See our documentation to learn how to download models, how the torch format is supported, and more:
77+
- [Supported Models List](./docs/zh/supported_models.md)
8378

8479
## Advanced Usage
8580

dockerfiles/Dockerfile.gpu

Lines changed: 3 additions & 3 deletions
@@ -1,6 +1,6 @@
1-
FROM ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.1.0
2-
ARG PADDLE_VERSION=3.1.1
3-
ARG FD_VERSION=2.1.0
1+
FROM ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.2.0
2+
ARG PADDLE_VERSION=3.2.0
3+
ARG FD_VERSION=2.2.0
44

55
ENV DEBIAN_FRONTEND=noninteractive
66

docs/assets/images/favicon.ico

4.19 KB
Binary file not shown.

docs/assets/images/logo.jpg

13.6 KB

docs/best_practices/ERNIE-4.5-0.3B-Paddle.md

Lines changed: 12 additions & 11 deletions
@@ -19,40 +19,38 @@ The minimum number of GPUs required to deploy `ERNIE-4.5-0.3B` on the following
1919
### 1.2 Install fastdeploy
2020
- Installation: For details, see [FastDeploy Installation](../get_started/installation/README.md).
2121

22-
- Model Download,For detail, please refer to [Supported Models](../supported_models.md). **Please note that models with Paddle suffix need to be used for Fastdeploy**
22+
- Model download: For details, see [Supported Models](../supported_models.md).
2323

2424
## 2.How to Use
2525
### 2.1 Basic: Launching the Service
2626
Start the service with the following command:
2727
```bash
28-
export ENABLE_V1_KVCACHE_SCHEDULER=1
2928
python -m fastdeploy.entrypoints.openai.api_server \
3029
--model baidu/ERNIE-4.5-0.3B-Paddle \
3130
--tensor-parallel-size 1 \
3231
--quantization wint4 \
3332
--max-model-len 32768 \
34-
--max-num-seqs 128
33+
--max-num-seqs 128 \
34+
--load_choices "default_v1"
3535
```
3636
- `--quantization`: Specifies the quantization strategy used by the model. Different strategies trade off performance and accuracy differently. Supported values are `wint8`, `wint4`, and `block_wise_fp8` (requires Hopper GPUs).
3737
- `--max-model-len`: Specifies the maximum number of tokens supported by the deployed service. A larger value allows a longer context but occupies more GPU memory, which may reduce concurrency.
38+
- `--load_choices`: Specifies the loader version. `"default_v1"` enables the v1 loader, which loads faster and uses less memory.
3839

3940
For more parameter meanings and default settings, see [FastDeploy Parameter Documentation](../parameters.md)
4041

4142
### 2.2 Advanced: How to get better performance
4243
#### 2.2.1 Correctly set parameters that match the application scenario
4344
Estimate the average input length, average output length, and maximum context length of your workload.
4445
- Set `--max-model-len` according to the maximum context length. For example, if the average input length is 1000 tokens and the output length is 30000 tokens, a value of 32768 is recommended.
45-
- **Enable the service management global block**
46-
47-
```
48-
export ENABLE_V1_KVCACHE_SCHEDULER=1
49-
```
5046
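
The sizing example above (1000 input tokens plus 30000 output tokens, giving 32768) can be sketched as a small helper. This is only an illustration: the function name is hypothetical, and rounding up to the next power of two is an assumption that happens to match the example, not a rule stated by FastDeploy.

```python
def suggest_max_model_len(max_input_tokens: int, max_output_tokens: int) -> int:
    """Suggest a --max-model-len value for the longest expected request.

    Assumption: round the combined length up to the next power of two.
    """
    total = max_input_tokens + max_output_tokens
    if total <= 0:
        raise ValueError("token counts must sum to a positive number")
    # (total - 1).bit_length() is the exponent of the smallest power of two >= total
    return 1 << (total - 1).bit_length()


print(suggest_max_model_len(1000, 30000))  # 32768
```

Larger values consume more GPU memory, so it may be worth picking the smallest value that still covers the longest request you expect.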

5147
#### 2.2.2 Prefix Caching
5248
**Idea:** Prefix Caching avoids repeated computation by caching the intermediate results (KV Cache) of the input sequence, which speeds up responses for requests that share the same prefix. For details, see [prefix-cache](../features/prefix_caching.md).
5349

5450
**How to enable:**
55-
Add the following lines to the startup parameters, where `--enable-prefix-caching` enables prefix caching, and `--swap-space` enables CPU cache in addition to GPU cache. The size is GB and should be adjusted according to the actual situation of the machine.
51+
Since version 2.2 (including the develop branch), Prefix Caching has been enabled by default.
52+
53+
For versions 2.1 and earlier, enable it manually by adding the following lines to the startup parameters. `--enable-prefix-caching` enables prefix caching, and `--swap-space` enables a CPU cache in addition to the GPU cache. The size is specified in GB and should be adjusted to the machine's actual resources; the recommended value is `(total machine memory - model size) * 20%`. If the service fails to start because other programs are occupying memory, try reducing the `--swap-space` value.
5654
```
5755
--enable-prefix-caching
5856
--swap-space 50
@@ -61,7 +59,10 @@ Add the following lines to the startup parameters, where `--enable-prefix-cachin
6159
#### 2.2.3 Chunked Prefill
6260
**Idea:** Chunked Prefill splits a prefill-stage request into small sub-chunks and executes them in batches mixed with decode requests. This better balances compute-intensive (Prefill) and memory-intensive (Decode) operations, improves GPU utilization, and reduces the computation and memory usage of any single Prefill, which lowers peak memory usage and helps avoid out-of-memory failures. For details, see [Chunked Prefill](../features/chunked_prefill.md).
6361

64-
**How to enable:** Add the following lines to the startup parameters
62+
**How to enable:**
63+
Since version 2.2 (including the develop branch), Chunked Prefill has been enabled by default.
64+
65+
For versions 2.1 and earlier, enable it manually by adding the following line to the startup parameters:
6566
```
6667
--enable-chunked-prefill
6768
```
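
The version-dependent defaults described for Prefix Caching and Chunked Prefill can be summarized in a short sketch. The helper name and the version tuple are hypothetical; only the three flags come from the text above, and treating `--swap-space` as still applicable on 2.2 is an assumption.

```python
from typing import List, Optional, Tuple


def extra_cache_flags(fd_version: Tuple[int, int],
                      swap_space_gb: Optional[int] = None) -> List[str]:
    """Return startup flags to append for Prefix Caching and Chunked Prefill."""
    flags: List[str] = []
    if fd_version < (2, 2):
        # 2.1 and earlier: both features must be switched on explicitly
        flags += ["--enable-prefix-caching", "--enable-chunked-prefill"]
    if swap_space_gb is not None:
        # Optional CPU cache on top of the GPU cache, size in GB
        flags += ["--swap-space", str(swap_space_gb)]
    return flags


print(extra_cache_flags((2, 1), swap_space_gb=50))
```

On 2.2 the function returns an empty list unless a swap-space size is given, since both features are already enabled by default there.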
@@ -79,7 +80,7 @@ Notes:
7980

8081
- Usually, no additional parameters need to be set. However, CUDAGraph incurs some additional memory overhead, which may need tuning in memory-constrained scenarios. For details, see the configuration parameter descriptions in [GraphOptimizationBackend](../features/graph_optimization.md).
8182

82-
#### 2.2.6 Rejection Sampling
83+
#### 2.2.5 Rejection Sampling
8384
**Idea:**
8485
Rejection sampling draws samples from a proposal distribution that is easy to sample from, which avoids explicit sorting and increases sampling speed. The improvement is significant for small models.
8586

docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md

Lines changed: 14 additions & 13 deletions
@@ -19,40 +19,38 @@ The minimum number of GPUs required to deploy `ERNIE-4.5-21B-A3B` on the followi
1919
### 1.2 Install fastdeploy and prepare the model
2020
- Installation: For details, see [FastDeploy Installation](../get_started/installation/README.md).
2121

22-
- Model Download,For detail, please refer to [Supported Models](../supported_models.md). **Please note that models with Paddle suffix need to be used for Fastdeploy**
22+
- Model download: For details, see [Supported Models](../supported_models.md).
2323

2424
## 2.How to Use
2525
### 2.1 Basic: Launching the Service
2626
Start the service with the following command:
2727
```bash
28-
export ENABLE_V1_KVCACHE_SCHEDULER=1
2928
python -m fastdeploy.entrypoints.openai.api_server \
3029
--model baidu/ERNIE-4.5-21B-A3B-Paddle \
3130
--tensor-parallel-size 1 \
3231
--quantization wint4 \
3332
--max-model-len 32768 \
34-
--max-num-seqs 128
33+
--max-num-seqs 128 \
34+
--load_choices "default_v1"
3535
```
3636
- `--quantization`: Specifies the quantization strategy used by the model. Different strategies trade off performance and accuracy differently. Supported values are `wint8`, `wint4`, and `block_wise_fp8` (requires Hopper GPUs).
3737
- `--max-model-len`: Specifies the maximum number of tokens supported by the deployed service. A larger value allows a longer context but occupies more GPU memory, which may reduce concurrency.
38+
- `--load_choices`: Specifies the loader version. `"default_v1"` enables the v1 loader, which loads faster and uses less memory.
3839

3940
For more parameter meanings and default settings, see [FastDeploy Parameter Documentation](../parameters.md)
4041

4142
### 2.2 Advanced: How to get better performance
4243
#### 2.2.1 Correctly set parameters that match the application scenario
4344
Estimate the average input length, average output length, and maximum context length of your workload.
4445
- Set `--max-model-len` according to the maximum context length. For example, if the average input length is 1000 tokens and the output length is 30000 tokens, a value of 32768 is recommended.
45-
- **Enable the service management global block**
46-
47-
```
48-
export ENABLE_V1_KVCACHE_SCHEDULER=1
49-
```
5046

5147
#### 2.2.2 Prefix Caching
5248
**Idea:** Prefix Caching avoids repeated computation by caching the intermediate results (KV Cache) of the input sequence, which speeds up responses for requests that share the same prefix. For details, see [prefix-cache](../features/prefix_caching.md).
5349

5450
**How to enable:**
55-
Add the following lines to the startup parameters, where `--enable-prefix-caching` enables prefix caching, and `--swap-space` enables CPU cache in addition to GPU cache. The size is GB and should be adjusted according to the actual situation of the machine. The recommended value is `(total machine memory - model size) * 20%`. If the service fails to start because other programs are occupying memory, try reducing the `--swap-space` value.
51+
Since version 2.2 (including the develop branch), Prefix Caching has been enabled by default.
52+
53+
For versions 2.1 and earlier, enable it manually by adding the following lines to the startup parameters. `--enable-prefix-caching` enables prefix caching, and `--swap-space` enables a CPU cache in addition to the GPU cache. The size is specified in GB and should be adjusted to the machine's actual resources; the recommended value is `(total machine memory - model size) * 20%`. If the service fails to start because other programs are occupying memory, try reducing the `--swap-space` value.
5654
```
5755
--enable-prefix-caching
5856
--swap-space 50
@@ -61,7 +59,10 @@ Add the following lines to the startup parameters, where `--enable-prefix-cachin
6159
#### 2.2.3 Chunked Prefill
6260
**Idea:** Chunked Prefill splits a prefill-stage request into small sub-chunks and executes them in batches mixed with decode requests. This better balances compute-intensive (Prefill) and memory-intensive (Decode) operations, improves GPU utilization, and reduces the computation and memory usage of any single Prefill, which lowers peak memory usage and helps avoid out-of-memory failures. For details, see [Chunked Prefill](../features/chunked_prefill.md).
6361

64-
**How to enable:** Add the following lines to the startup parameters
62+
**How to enable:**
63+
Since version 2.2 (including the develop branch), Chunked Prefill has been enabled by default.
64+
65+
For versions 2.1 and earlier, enable it manually by adding the following line to the startup parameters:
6566
```
6667
--enable-chunked-prefill
6768
```
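
The `--swap-space` sizing rule quoted in the Prefix Caching section, `(total machine memory - model size) * 20%`, can be worked through with a short sketch. The function name and the sample sizes (a 512 GB host and a 42 GB model) are made up for illustration and are not measurements of ERNIE-4.5-21B-A3B.

```python
def recommended_swap_space_gb(total_memory_gb: float, model_size_gb: float) -> int:
    """Apply the documented rule: (total machine memory - model size) * 20%."""
    free_gb = total_memory_gb - model_size_gb
    if free_gb <= 0:
        raise ValueError("model does not fit in host memory")
    # Round down so the suggestion never exceeds the 20% budget
    return int(free_gb * 20 / 100)


# Hypothetical host: 512 GB of RAM, 42 GB of model weights
print(recommended_swap_space_gb(512, 42))  # 94
```

If other programs already occupy host memory, a smaller value than the one computed here may be needed, as the text above notes.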
@@ -77,7 +78,9 @@ Add the following lines to the startup parameters
7778
```
7879
Notes:
7980
1. MTP currently cannot be used together with Prefix Caching, Chunked Prefill, or CUDAGraph.
80-
2. MTP currently does not support service management global blocks, i.e. do not run with `export ENABLE_V1_KVCACHE_SCHEDULER=1`
81+
    - Use `export FD_DISABLE_CHUNKED_PREFILL=1` to disable Chunked Prefill.
82+
    - When `speculative-config` is set, Prefix Caching is disabled automatically.
83+
2. MTP currently does not support service management global blocks. When `speculative-config` is set, service management global blocks are disabled automatically.
8184
3. MTP currently does not support rejection sampling, so do not run with `export FD_SAMPLING_CLASS=rejection`.
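
The environment-related notes above can be checked before launching a service with MTP. The following is a minimal sketch, assuming the two environment variables behave as described in the notes; the function name is hypothetical and it is not part of FastDeploy.

```python
import os
from typing import List, Mapping


def mtp_env_conflicts(env: Mapping[str, str]) -> List[str]:
    """List environment settings that conflict with the MTP notes."""
    problems: List[str] = []
    if env.get("FD_SAMPLING_CLASS") == "rejection":
        # Note 3: MTP does not support rejection sampling
        problems.append("unset FD_SAMPLING_CLASS=rejection")
    if env.get("FD_DISABLE_CHUNKED_PREFILL") != "1":
        # Note 1: Chunked Prefill must be disabled explicitly
        problems.append("export FD_DISABLE_CHUNKED_PREFILL=1")
    return problems


for problem in mtp_env_conflicts(os.environ):
    print("MTP pre-flight:", problem)
```

Prefix Caching and service management global blocks are not checked here because, per the notes, setting `speculative-config` disables them automatically.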
8285

8386
#### 2.2.5 CUDAGraph
@@ -110,7 +113,6 @@ export FD_SAMPLING_CLASS=rejection
110113
# prefill
111114
export CUDA_VISIBLE_DEVICES=0,1,2,3
112115
export INFERENCE_MSG_QUEUE_ID=1315
113-
export FLAGS_max_partition_size=2048
114116
export FD_ATTENTION_BACKEND=FLASH_ATTN
115117
export FD_LOG_DIR="prefill_log"
116118
@@ -130,7 +132,6 @@ python -m fastdeploy.entrypoints.openai.api_server --model baidu/ERNIE-4.5-21B-A
130132
# decode
131133
export CUDA_VISIBLE_DEVICES=4,5,6,7
132134
export INFERENCE_MSG_QUEUE_ID=1215
133-
export FLAGS_max_partition_size=2048
134135
export FD_LOG_DIR="decode_log"
135136
136137
quant_type=block_wise_fp8
