Commit 5c48af8

[Feat] Basic scripts for deployment best practices (#556)
# Purpose

What this PR does / why we need it: provide basic scripts and corresponding documentation for deployment best practices.

# Usage example

## 1. Start the Ray server on the master node.

![Starting the Ray server on the master node](https://github.com/user-attachments/assets/0f354633-4510-4ec6-917e-a7080b474d1b)

## 2. Start the Ray server on the first worker node.

![Starting the Ray server on the first worker node](https://github.com/user-attachments/assets/8a9f48ed-de9c-44cf-b759-82c54b87e105)

## 3. Start the vLLM server on the master node.

![Starting the vLLM server on the master node](https://github.com/user-attachments/assets/c8d17cc4-d99a-4c6e-8a50-c2f3a2060176)
1 parent ac75dcf commit 5c48af8

6 files changed

Lines changed: 574 additions & 0 deletions

Lines changed: 82 additions & 0 deletions (deployment guide, Markdown)
@@ -0,0 +1,82 @@
# Single-Machine Deployment (CUDA or Ascend)

This scenario applies to a single physical server and uses two files:

- `vllm/config.properties`
- `vllm/run_vllm.sh`

Modify the parameters in `config.properties` according to your actual requirements (e.g., model, memory).

**Note:** `Multi-node Configuration`, `Ray Configuration`, and `Ascend Multi-node Data Parallelism` **can be ignored**, as they are only used in multi-machine inference scenarios.

After completing the configuration, launch the service with:

```bash
bash run_vllm.sh
```
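Once the server is up, you can send a quick test request to vLLM's OpenAI-compatible endpoint. This is a minimal sketch, not part of the PR's scripts; it assumes the default `server_port=7850` and the `model` path from `config.properties`, so adjust both to your setup:

```bash
# Smoke test against the running server (model name defaults to the model path
# when served_model_name is not set in config.properties)
curl http://localhost:7850/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/home/models/QwQ-32B",
        "prompt": "Hello, my name is",
        "max_tokens": 32
      }'
```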
# Multi-Machine Deployment (CUDA)

In multi-node CUDA deployments, vLLM relies on Ray as its distributed backend. Therefore, in addition to `vllm/config.properties` and `vllm/run_vllm.sh`, you must also use `vllm/start_ray.sh` to start the Ray cluster. For a two-node deployment, follow these steps:

Step 1: Modify `config.properties`

- Set `master_ip` to the IP address of the head node
- Set `worker_ip` to the IP address of the worker node
- Set `node_num` to 2
- Set `distributed_executor_backend` to `ray`
- `Ascend Multi-node Data Parallelism` **can be ignored**, as it is only used in Ascend multi-machine data-parallel inference scenarios.
- Adjust other vLLM parameters as needed (an illustrative snippet follows this list)
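For illustration, the multi-node entries for a two-node CUDA setup might look like the following. The IP addresses are placeholders, not values from the PR:

```properties
# Multi-node Configuration (placeholder IPs)
master_ip=192.168.1.10
worker_ip=192.168.1.11

# Ray Configuration
node_num=2

# Common vLLM Configuration
distributed_executor_backend=ray
```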
Step 2: Start the Ray cluster

- On the head node:

```bash
NODE=0 bash start_ray.sh
```

- On the worker node:

```bash
NODE=1 bash start_ray.sh
```
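Before launching the service, you can confirm that both nodes joined the cluster. This uses the standard Ray CLI rather than anything added by this PR:

```bash
# Run on either node; the resource summary should list 2 nodes
ray status
```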
Step 3: Launch the vLLM service

Run the following command on **either node**:

```bash
bash run_vllm.sh
```

**Scaling Note:** To deploy across more machines, set `node_num` to the actual number of nodes and ensure that each worker node’s `worker_ip` is configured to its own IP address.
# Multi-Machine Deployment (Ascend)

Ascend multi-node deployments differ based on whether **Data Parallelism (DP)** is enabled.

## Case 1: DP = 1 (No Data Parallelism)

This case follows the same procedure as CUDA multi-machine deployment and requires the following files:

- `vllm/config.properties`
- `vllm/run_vllm.sh`
- `vllm/start_ray.sh`

Follow the exact steps described in the **Multi-Machine Deployment (CUDA)** section above.

## Case 2: DP > 1 (Data Parallelism Enabled)

This scenario requires the following files:

- `vllm/config.properties`
- `vllm/run_vllm_dp.sh`

For a two-node deployment, follow these steps:

Step 1: Modify `config.properties`

- Set `master_ip` and `worker_ip`
- Set `dp_size_local` to the number of DP ranks per node
- `Ray Configuration` can be ignored, as it is not used in this scenario.
- Adjust other vLLM parameters as needed (an illustrative snippet follows this list)
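For illustration, a two-node data-parallel setup might carry entries like the following. The IPs and sizes are placeholders; this sketch assumes `dp_size` is the total DP degree across the cluster and `dp_size_local` the per-node share, with `tp_size * dp_size_local` equal to the number of devices per node:

```properties
# Multi-node Configuration (placeholder IPs)
master_ip=192.168.1.10
worker_ip=192.168.1.11

# Ascend Multi-node Data Parallelism
dp_size_local=2

# Common vLLM Configuration (assumes 8 NPUs per node: tp_size * dp_size_local = 8)
tp_size=4
dp_size=4
```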
Step 2: Launch the vLLM service

- On the head node:

```bash
NODE=0 bash run_vllm_dp.sh
```

- On the worker node:

```bash
NODE=1 bash run_vllm_dp.sh
```

**Scaling Note:** When deploying across more nodes, ensure that each worker node’s `worker_ip` is correctly set to its local IP address.
Lines changed: 79 additions & 0 deletions (`vllm/common.sh`, shared helpers sourced by the launch scripts)
@@ -0,0 +1,79 @@
#!/bin/bash

# Load key=value pairs (and `export` lines) from config.properties into the
# current shell. The path can be overridden with the CONFIG_FILE env var.
load_config() {
    local config_file="${CONFIG_FILE:-$(dirname "${BASH_SOURCE[0]}")/config.properties}"

    if [[ ! -f "$config_file" ]]; then
        echo "ERROR: Config file '$config_file' not found!" >&2
        exit 1
    fi

    while IFS= read -r line; do
        # Trim surrounding whitespace; skip blank lines and comments.
        line=$(echo "$line" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')
        [[ -z "$line" || "$line" == \#* ]] && continue

        if [[ "$line" == export\ * ]]; then
            rest="${line#export }"
            eval "export $rest"
        else
            if [[ "$line" == *=* ]]; then
                key="${line%%=*}"
                value="${line#*=}"
                key=$(echo "$key" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')
                value=$(echo "$value" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')
                eval "$key=\$value"
            else
                echo "WARNING: Invalid config line (no '=' found): $line" >&2
            fi
        fi
    done < "$config_file"
}

# Ensure `ifconfig` is available, installing net-tools via the detected
# package manager if necessary.
ensure_ifconfig_installed() {
    if command -v ifconfig >/dev/null 2>&1; then
        return 0
    fi

    echo "'ifconfig' not found. Attempting to install net-tools..."

    if command -v apt-get >/dev/null 2>&1; then
        echo "Detected apt-get (Debian/Ubuntu). Installing net-tools..."
        sudo apt-get update && sudo apt-get install -y net-tools
    elif command -v yum >/dev/null 2>&1; then
        echo "Detected yum (RHEL/CentOS). Installing net-tools..."
        sudo yum install -y net-tools
    elif command -v dnf >/dev/null 2>&1; then
        echo "Detected dnf (Fedora). Installing net-tools..."
        sudo dnf install -y net-tools
    else
        echo "ERROR: No supported package manager (apt/yum/dnf) found."
        echo "Please install 'net-tools' manually; 'ifconfig' is required to get network interface information."
        exit 1
    fi

    if ! command -v ifconfig >/dev/null 2>&1; then
        echo "ERROR: Failed to install net-tools. Please install 'net-tools' manually; 'ifconfig' is required to get network interface information."
        exit 1
    fi

    echo "✅ ifconfig is now available."
}

# Print the name of the network interface that holds the given IPv4 address.
get_interface_by_ip() {
    local target_ip="$1"
    ifconfig | awk -v target="$target_ip" '
        /^[[:alnum:]]/ {
            iface = $1
            sub(/:$/, "", iface)       # strip trailing colon from interface name
        }
        /inet / {
            for (i = 1; i <= NF; i++) {
                gsub(/addr:/, "", $i)  # handle the older "inet addr:x.x.x.x" output format
                if ($i == target) {
                    print iface
                    exit
                }
            }
        }
    '
}
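A minimal usage sketch of these helpers, assuming `common.sh` and `config.properties` sit in the same directory (as `run_vllm.sh` below expects when it sources this file):

```bash
#!/bin/bash
source "$(dirname "$0")/common.sh"

load_config                      # populates shell variables such as $master_ip
ensure_ifconfig_installed        # installs net-tools if ifconfig is missing
iface=$(get_interface_by_ip "$master_ip")
echo "Interface holding $master_ip: ${iface:-<not found>}"
```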
Lines changed: 91 additions & 0 deletions (`vllm/config.properties`)
@@ -0,0 +1,91 @@
#****************************************
# Devices Visible Configuration         *
#****************************************
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7


#****************************************
# Multi-node Configuration              *
#****************************************
master_ip=<MASTER IP>
worker_ip=<WORKER IP>


#****************************************
# Ray Configuration                     *
#****************************************
# Number of nodes in multi-node inference
node_num=<NUMBER OF NODES>


#****************************************
# Ascend Multi-node Data Parallelism    *
#****************************************
export HCCL_OP_EXPANSION_MODE="AIV"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export HCCL_BUFFSIZE=200
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_MLAPO=1
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
dp_rpc_port=13389
dp_size_local=<NUMBER OF DP PER NODE>


#****************************************
# Common vLLM Configuration             *
#****************************************
# For multi-node and multi-npu inference
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
export VLLM_ALLREDUCE_USE_SYMM_MEM=0
# For multi-node and multi-gpu inference
export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1
# Run deepseek v3.1+ on CUDA
export VLLM_USE_DEEP_GEMM=0
export VLLM_LOGGING_LEVEL=INFO
model=/home/models/QwQ-32B
# served_model_name=QwQ-32B
server_host=0.0.0.0
server_port=7850
tp_size=4
dp_size=1
pp_size=1
seed=1024
enable_expert_parallel=false
enable_prefix_caching=false
max_model_len=20000
# max_num_batched_tokens=2048
# max_num_seqs=20
# block_size=128
gpu_memory_utilization=0.87
# NONE | PIECEWISE | FULL | FULL_DECODE_ONLY | FULL_AND_PIECEWISE
graph_mode=FULL_DECODE_ONLY
quantization=NONE
# mp | ray ; mp for single-node inference, ray for multi-node inference
distributed_executor_backend=mp
# async_scheduling=false

# speculative decoding configuration
enable_speculative_decoding=false
speculative_decode_model=NONE
speculative_decode_method=deepseek_mtp
num_speculative_tokens=1


#****************************************
# extra vLLM Configuration for Ascend   *
#****************************************
enable_ascend_scheduler=false
# enable_torchair_graph=false


#****************************************
# UCM Configuration                     *
#****************************************
# set true to enable UCM
ucm_enable=false
ucm_config_yaml_path=/vllm-workspace/unified-cache-management/examples/ucm_config_example.yaml
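This file is parsed line by line by `load_config` in `common.sh`: plain `key=value` entries become shell variables, and `export ...` lines become environment variables. Since `load_config` honors the `CONFIG_FILE` environment variable, an alternative copy can be supplied at launch time:

```bash
# Point the launcher at a custom properties file instead of the one
# next to the scripts
CONFIG_FILE=/path/to/my.properties bash run_vllm.sh
```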
Lines changed: 112 additions & 0 deletions (`vllm/run_vllm.sh`)
@@ -0,0 +1,112 @@
#!/bin/bash
echo "$CONFIG_FILE"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$SCRIPT_DIR/common.sh"

start_server() {
    [[ -z "$model" ]] && { echo "ERROR: model not set in config.properties" >&2; exit 1; }

    if [[ "$ucm_enable" == "true" ]]; then
        [[ -z "$ucm_config_yaml_path" ]] && {
            echo "ERROR: ucm_config_yaml_path not set but ucm_enable=true" >&2
            exit 1
        }
        LOG_FILE="vllm_ucm.log"
    else
        LOG_FILE="vllm.log"
    fi

    echo ""
    echo "===== vllm server configuration ====="
    echo "model = $model"
    echo "served_model_name = ${served_model_name:-<default>}"
    echo "tp_size = $tp_size"
    echo "dp_size = $dp_size"
    echo "pp_size = $pp_size"
    echo "enable_expert_parallel = $enable_expert_parallel"
    echo "max_model_len = $max_model_len"
    echo "max_num_batched_tokens = $max_num_batched_tokens"
    echo "max_num_seqs = $max_num_seqs"
    echo "block_size = $block_size"
    echo "gpu_memory_utilization = $gpu_memory_utilization"
    echo "quantization = $quantization"
    echo "server_host = $server_host"
    echo "server_port = $server_port"
    echo "distributed_backend = $distributed_executor_backend"
    echo "enable_prefix_caching = $enable_prefix_caching"
    echo "async_scheduling = $async_scheduling"
    echo "graph_mode = $graph_mode"
    if [[ "$ucm_enable" == "true" ]]; then
        echo "ucm_config_file = $ucm_config_yaml_path"
    fi
    echo "log_file = $LOG_FILE"
    echo "====================================="
    echo ""

    # --- Required arguments assembled from config.properties ---
    CMD=(
        vllm serve "$model"
        --max-model-len "$max_model_len"
        --tensor-parallel-size "$tp_size"
        --data-parallel-size "$dp_size"
        --pipeline-parallel-size "$pp_size"
        --gpu-memory-utilization "$gpu_memory_utilization"
        --trust-remote-code
        --host "$server_host"
        --port "$server_port"
        --distributed-executor-backend "$distributed_executor_backend"
    )

    # --- Optional numeric/string params ---
    if [[ -n "$block_size" ]]; then CMD+=("--block-size" "$block_size"); fi
    if [[ -n "$max_num_batched_tokens" ]]; then CMD+=("--max-num-batched-tokens" "$max_num_batched_tokens"); fi
    if [[ -n "$max_num_seqs" ]]; then CMD+=("--max-num-seqs" "$max_num_seqs"); fi
    if [[ -n "$seed" ]]; then CMD+=("--seed" "$seed"); fi
    if [[ -n "$served_model_name" ]]; then CMD+=("--served-model-name" "$served_model_name"); fi
    if [[ -n "$quantization" ]] && [[ "$quantization" != "NONE" ]]; then CMD+=("--quantization" "$quantization"); fi
    if [[ -n "$graph_mode" ]]; then
        COMPILATION_CONFIG='{"cudagraph_mode":"'"$graph_mode"'"}'
        CMD+=("--compilation-config" "$COMPILATION_CONFIG")
    fi

    # --- Boolean flags ---
    if [[ "$async_scheduling" == "true" ]]; then CMD+=("--async-scheduling"); fi
    if [[ "$enable_expert_parallel" == "true" ]]; then CMD+=("--enable-expert-parallel"); fi
    if [[ "$enable_prefix_caching" == "false" ]]; then CMD+=("--no-enable-prefix-caching"); fi

    # --- Advanced configs (JSON) ---
    # num_speculative_tokens is emitted as a JSON number, not a string.
    if [[ "$enable_speculative_decoding" == "true" ]]; then
        SPECULATIVE_CONFIG='{"model":"'"$speculative_decode_model"'", "num_speculative_tokens": '"$num_speculative_tokens"', "method":"'"$speculative_decode_method"'"}'
        CMD+=("--speculative-config" "$SPECULATIVE_CONFIG")
    fi

    # Build --additional-config incrementally so it is only passed when at
    # least one Ascend-specific option is set.
    ADDITIONAL_CONFIG="{"
    SEP=""
    if [[ -n "$enable_ascend_scheduler" ]]; then
        ADDITIONAL_CONFIG+="${SEP}\"ascend_scheduler_config\":{\"enabled\":$enable_ascend_scheduler}"
        SEP=","
    fi
    if [[ -n "$enable_torchair_graph" ]]; then
        ADDITIONAL_CONFIG+="${SEP}\"torchair_graph_config\":{\"enabled\":$enable_torchair_graph}"
        SEP=","
    fi
    ADDITIONAL_CONFIG+="}"
    if [[ "$ADDITIONAL_CONFIG" != "{}" ]]; then CMD+=("--additional-config" "$ADDITIONAL_CONFIG"); fi

    if [[ "$ucm_enable" == "true" ]]; then
        KV_CONFIG_JSON="{
            \"kv_connector\":\"UCMConnector\",
            \"kv_connector_module_path\":\"ucm.integration.vllm.ucm_connector\",
            \"kv_role\":\"kv_both\",
            \"kv_connector_extra_config\":{\"UCM_CONFIG_FILE\":\"$ucm_config_yaml_path\"}
        }"
        CMD+=("--kv-transfer-config" "$KV_CONFIG_JSON")
    fi

    echo "Executing command: ${CMD[*]}"
    echo ""

    "${CMD[@]}" 2>&1 | tee "$LOG_FILE"
}

load_config
start_server
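With the defaults in `config.properties` above (optional values left commented out), the command this script assembles would come out roughly as follows. This is an illustration derived by tracing the script, not output captured from the PR:

```bash
vllm serve /home/models/QwQ-32B \
  --max-model-len 20000 \
  --tensor-parallel-size 4 \
  --data-parallel-size 1 \
  --pipeline-parallel-size 1 \
  --gpu-memory-utilization 0.87 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 7850 \
  --distributed-executor-backend mp \
  --seed 1024 \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
  --no-enable-prefix-caching \
  --additional-config '{"ascend_scheduler_config":{"enabled":false}}'
```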
