README.md: 6 additions & 11 deletions
@@ -26,6 +26,8 @@ English | [简体中文](README_CN.md)
# FastDeploy : Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
## News
+**[2025-09] 🔥 FastDeploy v2.2 is newly released!** It now offers compatibility with models in the HuggingFace ecosystem, further optimized performance, and new support for [baidu/ERNIE-4.5-21B-A3B-Thinking](https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-Thinking)!
+

**[2025-08] 🔥 Released FastDeploy v2.1:** A brand-new KV Cache scheduling strategy has been introduced, along with expanded support for PD separation and CUDA Graph across more models. Enhanced hardware support has been added for platforms like Kunlun and Hygon, together with comprehensive optimizations that improve the performance of both the service and the inference engine.

**[2025-07] The FastDeploy 2.0 Inference Deployment Challenge is now live!** Complete the inference deployment task for the ERNIE 4.5 series open-source models to win official FastDeploy 2.0 merch and generous prizes! 🎁 You're welcome to try it out and share your feedback! 📌[Sign up here](https://www.wjx.top/vm/meSsp3L.aspx#) 📌[Event details](https://github.com/PaddlePaddle/FastDeploy/discussions/2728)
-**Note:** We are actively working on expanding hardware support. Additional hardware platforms including Ascend NPU and MetaX GPU are currently under development and testing. Stay tuned for updates!
+**Note:** We are actively working on expanding hardware support. Additional hardware platforms including Ascend NPU are currently under development and testing. Stay tuned for updates!
## Get Started
@@ -68,20 +71,12 @@ Learn how to use FastDeploy through our documentation:
- [ERNIE-4.5-VL Multimodal Model Deployment](./docs/get_started/ernie-4.5-vl.md)
docs/best_practices/ERNIE-4.5-0.3B-Paddle.md: 12 additions & 11 deletions
@@ -19,40 +19,38 @@ The minimum number of GPUs required to deploy `ERNIE-4.5-0.3B` on the following
### 1.2 Install fastdeploy
- Installation: For details, please refer to [FastDeploy Installation](../get_started/installation/README.md).

-- Model Download: For details, please refer to [Supported Models](../supported_models.md). **Please note that models with the Paddle suffix need to be used for FastDeploy**:
+- Model Download: For details, please refer to [Supported Models](../supported_models.md).
- `--quantization`: Indicates the quantization strategy used by the model. Different quantization strategies yield different performance and accuracy. It can be one of `wint8` / `wint4` / `block_wise_fp8` (requires Hopper).
- `--max-model-len`: Indicates the maximum number of tokens supported by the currently deployed service. The larger the value, the longer the context the model can support, but the more GPU memory is occupied, which may reduce concurrency.
+- `--load_choices`: Indicates the version of the loader. "default_v1" enables the v1 loader, which loads faster and uses less memory.

For more parameter meanings and default settings, see [FastDeploy Parameter Documentation](../parameters.md).
### 2.2 Advanced: How to get better performance
#### 2.2.1 Correctly set parameters that match the application scenario
Evaluate average input length, average output length, and maximum context length
- Set max-model-len according to the maximum context length. For example, if the average input length is 1000 and the output length is 30000, then it is recommended to set it to 32768
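
As a rough illustration of this sizing rule, the sketch below adds the expected input and output lengths and rounds the sum up to the next power of two, which reproduces the 32768 in the example. The power-of-two rounding and the helper name `suggest_max_model_len` are assumptions made for illustration, not documented FastDeploy behavior.

```python
def suggest_max_model_len(max_input_tokens: int, max_output_tokens: int) -> int:
    """Round the worst-case context length up to the next power of two (an assumed convention)."""
    needed = max_input_tokens + max_output_tokens
    if needed <= 0:
        raise ValueError("token counts must sum to a positive number")
    # (needed - 1).bit_length() is the exponent of the smallest power of two >= needed.
    return 1 << (needed - 1).bit_length()


print(suggest_max_model_len(1000, 30000))  # 32768
```

The result would then be passed as the value of `--max-model-len`.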
-- **Enable the service management global block**
-
-```
-export ENABLE_V1_KVCACHE_SCHEDULER=1
-```

#### 2.2.2 Prefix Caching
**Idea:** The core idea of Prefix Caching is to avoid repeated computation by caching the intermediate results (KV Cache) of the input sequence, thereby speeding up responses for multiple requests that share the same prefix. For details, refer to [prefix-cache](../features/prefix_caching.md).
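
To make the idea concrete, here is a toy sketch of prefix reuse: a token sequence is split into fixed-size blocks, each block is keyed by a hash of the whole prefix up to and including it, and a new request only computes the blocks it does not share with earlier requests. The class `ToyPrefixCache`, the block size, and the string stand-in for KV data are all hypothetical; this is not FastDeploy's cache manager.

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # tokens per cache block (illustrative value)


class ToyPrefixCache:
    """Maps a hash of each block-aligned prefix to a placeholder for its KV data."""

    def __init__(self) -> None:
        self._blocks: Dict[str, str] = {}

    @staticmethod
    def _key(prefix: List[int]) -> str:
        return hashlib.sha256(repr(prefix).encode()).hexdigest()

    def prefill(self, tokens: List[int]) -> Tuple[int, int]:
        """Return (tokens reused from cache, tokens computed) for one request."""
        reused = computed = 0
        full = len(tokens) - len(tokens) % BLOCK_SIZE
        for end in range(BLOCK_SIZE, full + 1, BLOCK_SIZE):
            key = self._key(tokens[:end])
            if key in self._blocks:
                reused += BLOCK_SIZE
            else:
                # Stand-in for storing the real KV tensors of this block.
                self._blocks[key] = f"kv[{end - BLOCK_SIZE}:{end}]"
                computed += BLOCK_SIZE
        computed += len(tokens) - full  # a trailing partial block is always recomputed
        return reused, computed


cache = ToyPrefixCache()
system_prompt = list(range(8))
print(cache.prefill(system_prompt + [100, 101, 102]))  # (0, 11)
print(cache.prefill(system_prompt + [200, 201, 202]))  # (8, 3)
```

The second request shares the 8-token prefix with the first, so only its 3 new tokens are computed.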

**How to enable:**
-Add the following lines to the startup parameters, where `--enable-prefix-caching` enables prefix caching, and `--swap-space` enables CPU cache in addition to GPU cache. The size is in GB and should be adjusted according to the actual situation of the machine.
+Since version 2.2 (including the develop branch), Prefix Caching has been enabled by default.
+
+For versions 2.1 and earlier, you need to enable it manually by adding the following lines to the startup parameters, where `--enable-prefix-caching` enables prefix caching, and `--swap-space` enables CPU cache in addition to GPU cache. The size is in GB and should be adjusted to the machine's actual resources. The recommended value is `(total machine memory - model size) * 20%`. If the service fails to start because other programs are occupying memory, try reducing the `--swap-space` value.
```
--enable-prefix-caching
--swap-space 50
@@ -61,7 +59,10 @@ Add the following lines to the startup parameters, where `--enable-prefix-cachin
#### 2.2.3 Chunked Prefill
**Idea:** This strategy splits a prefill-stage request into small sub-chunks and executes them in batches mixed with decode requests. This better balances compute-intensive (Prefill) and memory-intensive (Decode) operations, improves GPU utilization, and reduces the computation and memory footprint of any single prefill step, which lowers peak memory usage and helps avoid running out of memory. For details, please refer to [Chunked Prefill](../features/chunked_prefill.md).

-**How to enable:** Add the following lines to the startup parameters
+**How to enable:**
+Since version 2.2 (including the develop branch), Chunked Prefill has been enabled by default.
+
+For versions 2.1 and earlier, you need to enable it manually by adding:

```
--enable-chunked-prefill
```
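
The scheduling idea behind Chunked Prefill can be pictured with a toy loop. It is only a sketch and not FastDeploy's scheduler: each step has a fixed token budget, every running decode request takes one token, and the remaining budget is filled with the next slice of a pending prompt. The function name `schedule_steps` and the numbers used are hypothetical.

```python
from typing import List, Tuple


def schedule_steps(prompt_len: int, num_decoding: int, token_budget: int) -> List[Tuple[int, int]]:
    """Return one (decode_tokens, prefill_tokens) pair per step until the prompt is consumed."""
    if token_budget <= num_decoding:
        raise ValueError("budget must leave room for at least one prefill token")
    steps = []
    remaining = prompt_len
    while remaining > 0:
        # Decode requests go first (one token each); the prompt gets the leftover budget.
        chunk = min(remaining, token_budget - num_decoding)
        steps.append((num_decoding, chunk))
        remaining -= chunk
    return steps


# A 1000-token prompt shares each 384-token step with 128 in-flight decode requests.
for decode_tokens, prefill_tokens in schedule_steps(1000, 128, 384):
    print(decode_tokens, prefill_tokens)
```

No step processes more tokens than the budget, which is what bounds the peak memory of a single prefill.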
@@ -79,7 +80,7 @@ Notes:

- Usually, no additional parameters need to be set, but CUDAGraph introduces some additional memory overhead, which may need to be tuned in memory-constrained scenarios. For detailed parameter adjustments, please refer to [GraphOptimizationBackend](../features/graph_optimization.md).

-#### 2.2.6 Rejection Sampling
+#### 2.2.5 Rejection Sampling
**Idea:**
Rejection sampling generates samples from a proposal distribution that is easy to sample from, avoiding explicit sorting and thereby speeding up sampling. The improvement is most significant for small models.
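
For intuition, the snippet below shows textbook rejection sampling over a small token distribution: draw a candidate from a uniform proposal, then accept it with probability proportional to its target probability. No sorting is involved. This is a generic illustration with made-up probabilities, not the sampling kernel FastDeploy uses.

```python
import random
from collections import Counter
from typing import List


def rejection_sample(probs: List[float], rng: random.Random) -> int:
    """Draw an index distributed as `probs` using a uniform proposal; no sorting needed."""
    peak = max(probs)  # envelope: accept candidate i with probability probs[i] / peak
    while True:
        candidate = rng.randrange(len(probs))        # easy-to-sample proposal
        if rng.random() * peak < probs[candidate]:   # accept, otherwise retry
            return candidate


rng = random.Random(0)
target = [0.5, 0.3, 0.15, 0.05]
counts = Counter(rejection_sample(target, rng) for _ in range(20_000))
# Empirical frequencies should land close to `target`.
print([round(counts[i] / 20_000, 2) for i in range(len(target))])
```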
docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md: 14 additions & 13 deletions
@@ -19,40 +19,38 @@ The minimum number of GPUs required to deploy `ERNIE-4.5-21B-A3B` on the followi
### 1.2 Install fastdeploy and prepare the model
- Installation: For details, please refer to [FastDeploy Installation](../get_started/installation/README.md).

-- Model Download: For details, please refer to [Supported Models](../supported_models.md). **Please note that models with the Paddle suffix need to be used for FastDeploy**:
+- Model Download: For details, please refer to [Supported Models](../supported_models.md).
- `--quantization`: Indicates the quantization strategy used by the model. Different quantization strategies yield different performance and accuracy. It can be one of `wint8` / `wint4` / `block_wise_fp8` (requires Hopper).
- `--max-model-len`: Indicates the maximum number of tokens supported by the currently deployed service. The larger the value, the longer the context the model can support, but the more GPU memory is occupied, which may reduce concurrency.
+- `--load_choices`: Indicates the version of the loader. "default_v1" enables the v1 loader, which loads faster and uses less memory.

For more parameter meanings and default settings, see [FastDeploy Parameter Documentation](../parameters.md).
### 2.2 Advanced: How to get better performance
#### 2.2.1 Correctly set parameters that match the application scenario
Evaluate average input length, average output length, and maximum context length
- Set max-model-len according to the maximum context length. For example, if the average input length is 1000 and the output length is 30000, then it is recommended to set it to 32768
-- **Enable the service management global block**
-
-```
-export ENABLE_V1_KVCACHE_SCHEDULER=1
-```

#### 2.2.2 Prefix Caching
**Idea:** The core idea of Prefix Caching is to avoid repeated computation by caching the intermediate results (KV Cache) of the input sequence, thereby speeding up responses for multiple requests that share the same prefix. For details, refer to [prefix-cache](../features/prefix_caching.md).

**How to enable:**
-Add the following lines to the startup parameters, where `--enable-prefix-caching` enables prefix caching, and `--swap-space` enables CPU cache in addition to GPU cache. The size is in GB and should be adjusted to the machine's actual resources. The recommended value is `(total machine memory - model size) * 20%`. If the service fails to start because other programs are occupying memory, try reducing the `--swap-space` value.
+Since version 2.2 (including the develop branch), Prefix Caching has been enabled by default.
+
+For versions 2.1 and earlier, you need to enable it manually by adding the following lines to the startup parameters, where `--enable-prefix-caching` enables prefix caching, and `--swap-space` enables CPU cache in addition to GPU cache. The size is in GB and should be adjusted to the machine's actual resources. The recommended value is `(total machine memory - model size) * 20%`. If the service fails to start because other programs are occupying memory, try reducing the `--swap-space` value.
```
--enable-prefix-caching
--swap-space 50
@@ -61,7 +59,10 @@ Add the following lines to the startup parameters, where `--enable-prefix-cachin
#### 2.2.3 Chunked Prefill
**Idea:** This strategy splits a prefill-stage request into small sub-chunks and executes them in batches mixed with decode requests. This better balances compute-intensive (Prefill) and memory-intensive (Decode) operations, improves GPU utilization, and reduces the computation and memory footprint of any single prefill step, which lowers peak memory usage and helps avoid running out of memory. For details, please refer to [Chunked Prefill](../features/chunked_prefill.md).

-**How to enable:** Add the following lines to the startup parameters
+**How to enable:**
+Since version 2.2 (including the develop branch), Chunked Prefill has been enabled by default.
+
+For versions 2.1 and earlier, you need to enable it manually by adding:

```
--enable-chunked-prefill
```
@@ -77,7 +78,9 @@ Add the following lines to the startup parameters
```
Notes:
1. MTP currently does not support simultaneous use with Prefix Caching, Chunked Prefill, and CUDAGraph.
-2. MTP currently does not support service management global blocks, i.e. do not run with `export ENABLE_V1_KVCACHE_SCHEDULER=1`
+    - Use `export FD_DISABLE_CHUNKED_PREFILL=1` to disable Chunked Prefill.
+    - When setting `speculative-config`, Prefix Caching will be automatically disabled.
+2. MTP currently does not support service management global blocks. When `speculative-config` is set, service management global blocks are automatically disabled.
3. MTP currently does not support rejection sampling, i.e. do not run with `export FD_SAMPLING_CLASS=rejection`
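
These notes can be condensed into a small environment check before launching a service with MTP. The sketch below uses only the variable names that appear in the notes; the helper `mtp_safe_env` is hypothetical, and whether a launcher needs such a step at all is an assumption.

```python
import os
from typing import Dict, Mapping


def mtp_safe_env(base: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of `base` adjusted for the MTP restrictions listed in the notes."""
    env = dict(base)
    env["FD_DISABLE_CHUNKED_PREFILL"] = "1"   # note 1: turn Chunked Prefill off
    if env.get("FD_SAMPLING_CLASS") == "rejection":
        del env["FD_SAMPLING_CLASS"]          # note 3: rejection sampling is unsupported
    return env


env = mtp_safe_env(os.environ)
print(env["FD_DISABLE_CHUNKED_PREFILL"], "FD_SAMPLING_CLASS" in env)
```

Prefix Caching and service management global blocks need no handling here because, per the notes, setting `speculative-config` disables them automatically.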