Skip to content

Commit 7a9b69e

Browse files
author
ruicen
committed
delete tokenizer folder;refactor the deployment test script to take external parameters
1 parent 917de3a commit 7a9b69e

11 files changed

Lines changed: 488 additions & 96 deletions

File tree

README.md

Lines changed: 8 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -154,7 +154,7 @@ python3 tools/spec_benchmark.py \
154154
测试`transformers`加载量化模型离线推理:
155155

156156
```shell
157-
python deploy/offline.py $MODEL_PATH
157+
python deploy/offline.py $MODEL_PATH "Hello, my name is"
158158
```
159159

160160
其中 `MODEL_PATH` 为量化产出模型路径。
@@ -168,32 +168,35 @@ python deploy/offline.py $MODEL_PATH
168168
[vLLM](https://github.com/vllm-project/vllm) 服务启动脚本,建议版本`vllm>=0.8.5.post1`,部署MOE INT8量化模型需要`vllm>=0.9.2`
169169

170170
```shell
171-
bash deploy/run_vllm.sh $MODEL_PATH
171+
bash deploy/run_vllm.sh --model-path $MODEL_PATH --port 8080 -d 0,1,2,3 -t 4 -p 1 -g 0.8 --max-model-len 4096
172172
```
173+
其中`-d`为可见设备,`-t`为张量并行度,`-p`为流水线并行度,`-g`为显存使用率。
173174

174175
**SGLang**
175176

176177
[SGLang](https://github.com/sgl-project/sglang) 服务启动脚本,建议版本 `sglang>=0.4.6.post1`
177178

178179
```shell
179-
bash deploy/run_sglang.sh $MODEL_PATH
180+
bash deploy/run_sglang.sh --model-path $MODEL_PATH --port 8080 -d 0,1,2,3 -t 4 -g 0.8
180181
```
181182

182183
#### 3. 服务调用
183184

184185
通过 [OpenAI 格式](https://platform.openai.com/docs/api-reference/introduction) 接口发起请求:
185186

186187
```shell
187-
bash deploy/openai.sh $MODEL_PATH
188+
bash deploy/openai.sh -m $MODEL_PATH -p "Hello, my name is" --port 8080 --max-tokens 4096 --temperature 0.7 --top-p 0.8 --top-k 20 --repetition-penalty 1.05 --system-prompt "You are a helpful assistant."
188189
```
190+
其中`-p`为输入prompt
189191

190192
#### 4. 效果验证
191193

192194
使用 [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) 评估量化模型精度,建议版本`lm-eval>=0.4.8`
193195

194196
```shell
195-
bash deploy/lm_eval.sh $MODEL_PATH
197+
bash deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --tasks ceval-valid,mmlu,gsm8k,humaneval -n 0 $MODEL_PATH
196198
```
199+
其中`RESULT_PATH`为测试结果保存目录,`-b`为batch size大小,`--tasks`为评测任务,`-n`为few-shot数量
197200

198201
详细操作指南请参阅[部署文档](https://angelslim.readthedocs.io/zh-cn/latest/deployment/deploy.html)
199202

README_en.md

Lines changed: 8 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -154,7 +154,7 @@ If you need to load a quantized model via `transformers`, please set the `deploy
154154
To test offline inference with a quantized model loaded via `transformers`, run the following command:
155155

156156
```shell
157-
python deploy/offline.py $MODEL_PATH
157+
python deploy/offline.py $MODEL_PATH "Hello, my name is"
158158
```
159159

160160
Where `MODEL_PATH` is the path to the quantized model output.
@@ -169,33 +169,36 @@ Use the following script to launch a [vLLM](https://github.com/vllm-project/vllm
169169

170170

171171
```shell
172-
bash deploy/run_vllm.sh $MODEL_PATH
172+
bash deploy/run_vllm.sh --model-path $MODEL_PATH --port 8080 -d 0,1,2,3 -t 4 -p 1 -g 0.8 --max-model-len 4096
173173
```
174+
Where `-d` is the visible devices, `-t` is tensor parallel size, `-p` is pipeline parallel size, and `-g` is the GPU memory utilization.
174175

175176
**SGLang**
176177

177178

178179
Use the following script to launch a [SGLang](https://github.com/sgl-project/sglang) server, recommended version `sglang>=0.4.6.post1`.
179180

180181
```shell
181-
bash deploy/run_sglang.sh $MODEL_PATH
182+
bash deploy/run_sglang.sh --model-path $MODEL_PATH --port 8080 -d 0,1,2,3 -t 4 -g 0.8
182183
```
183184

184185
#### 3. Service Invocation
185186

186187
Invoke requests via [OpenAI's API format](https://platform.openai.com/docs/api-reference/introduction):
187188

188189
```shell
189-
bash deploy/openai.sh $MODEL_PATH
190+
bash deploy/openai.sh -m $MODEL_PATH -p "Hello, my name is" --port 8080 --max-tokens 4096 --temperature 0.7 --top-p 0.8 --top-k 20 --repetition-penalty 1.05 --system-prompt "You are a helpful assistant."
190191
```
192+
where `-p` is the input prompt.
191193

192194
#### 4. Performance Evaluation
193195

194196
Evaluate the performance of quantized model using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), recommended version`lm-eval>=0.4.8`:
195197

196198
```shell
197-
bash deploy/lm_eval.sh $MODEL_PATH
199+
bash deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --tasks ceval-valid,mmlu,gsm8k,humaneval -n 0 $MODEL_PATH
198200
```
201+
where `RESULT_PATH` is the directory for saving test results, `-b` is batch size, `--tasks` specifies the evaluation tasks, and `-n` is the number of few-shot examples.
199202

200203
For more detaileds, please refer to the [Deployment Documentation](https://angelslim.readthedocs.io/zh-cn/latest/deployment/deploy.html).
201204

angelslim/models/llm/kimi_k2.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -16,11 +16,11 @@
1616
from transformers import AutoModelForCausalLM
1717
from transformers.models.deepseek_v3 import DeepseekV3Config
1818

19-
from ...tokenizer import TikTokenTokenizer
2019
from ...utils import print_info
2120
from ..model_factory import SlimModelFactory
2221
from .deepseek import DeepSeek
2322
from .modeling_deepseek import DeepseekV3ForCausalLM
23+
from .tiktoken_tokenizer import TikTokenTokenizer
2424

2525

2626
@SlimModelFactory.register
File renamed without changes.

angelslim/tokenizer/__init__.py

Lines changed: 0 additions & 15 deletions
This file was deleted.

scripts/deploy/lm_eval.sh

Lines changed: 124 additions & 38 deletions
Original file line numberDiff line numberDiff line change
@@ -1,63 +1,149 @@
11
#!/bin/bash
22

3-
# Set environment variables
4-
export CUDA_VISIBLE_DEVICES=0,1,2,3
5-
export PYTHON_MULTIPROCESSING_METHOD=spawn
6-
export VLLM_WORKER_MULTIPROC_METHOD=spawn
7-
export HF_ALLOW_CODE_EVAL=1
3+
usage() {
4+
cat << EOF
5+
Usage: $0 [OPTIONS] <model_path1> <model_path2> ...
6+
7+
Options:
8+
-d, --devices DEVICES CUDA devices to use (default: 0,1,2,3)
9+
-t, --tensor-parallel SIZE Tensor parallel size (default: 4)
10+
-g, --gpu-memory UTILIZATION GPU memory utilization (default: 0.9)
11+
-r, --result-dir DIR Base result directory (default: ./results)
12+
-b, --batch-size SIZE Batch size for auto tasks (default: auto)
13+
--tasks TASK1,TASK2,... Comma-separated list of tasks to evaluate (default: ceval-valid,mmlu,gsm8k,humaneval)
14+
-n, --num-fewshot NUM Number of few-shot examples (default: 0)
15+
-h, --help Show this help message
16+
17+
Examples:
18+
bash $0 -d 0,1 -t 2 --gpu-memory 0.8 /path/to/model1 /path/to/model2
19+
bash $0 --tasks ceval-valid,mmlu,gsm8k,humaneval /path/to/model1
20+
EOF
21+
}
822

23+
CUDA_VISIBLE_DEVICES="0,1,2,3"
924
INFERENCE_TP_SIZE=4
25+
GPU_MEMORY_UTILIZATION=0.9
26+
RESULT_BASE_DIR="./results"
27+
BATCH_SIZE="auto"
28+
TASKS=("ceval-valid" "mmlu" "gsm8k" "humaneval")
29+
NUM_FEWSHOT=0
30+
31+
POSITIONAL_ARGS=()
32+
33+
while [[ $# -gt 0 ]]; do
34+
case $1 in
35+
-d|--devices)
36+
CUDA_VISIBLE_DEVICES="$2"
37+
shift 2
38+
;;
39+
-t|--tensor-parallel)
40+
INFERENCE_TP_SIZE="$2"
41+
shift 2
42+
;;
43+
-g|--gpu-memory)
44+
GPU_MEMORY_UTILIZATION="$2"
45+
shift 2
46+
;;
47+
-r|--result-dir)
48+
RESULT_BASE_DIR="$2"
49+
shift 2
50+
;;
51+
-b|--batch-size)
52+
BATCH_SIZE="$2"
53+
shift 2
54+
;;
55+
--tasks)
56+
IFS=',' read -ra TASKS <<< "$2"
57+
shift 2
58+
;;
59+
-n|--num-fewshot)
60+
NUM_FEWSHOT="$2"
61+
shift 2
62+
;;
63+
-h|--help)
64+
usage
65+
exit 0
66+
;;
67+
-*|--*)
68+
echo "Error: Unknown option: $1"
69+
usage
70+
exit 1
71+
;;
72+
*)
73+
POSITIONAL_ARGS+=("$1")
74+
shift
75+
;;
76+
esac
77+
done
78+
79+
set -- "${POSITIONAL_ARGS[@]}"
1080

1181
# Check if model paths are provided
1282
if [ $# -eq 0 ]; then
1383
echo "Usage: $0 <model_path1> <model_path2> ..."
1484
exit 1
1585
fi
1686

87+
# Set environment variables
88+
export CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES
89+
export PYTHON_MULTIPROCESSING_METHOD=spawn
90+
export VLLM_WORKER_MULTIPROC_METHOD=spawn
91+
export HF_ALLOW_CODE_EVAL=1
92+
93+
echo "======================================================"
94+
echo " Model Evaluation Configuration"
95+
echo "======================================================"
96+
echo "CUDA Visible Devices: $CUDA_VISIBLE_DEVICES"
97+
echo "Tensor Parallel Size: $INFERENCE_TP_SIZE"
98+
echo "GPU Memory Utilization: $GPU_MEMORY_UTILIZATION"
99+
echo "Result Base Directory: $RESULT_BASE_DIR"
100+
echo "Batch Size: $BATCH_SIZE"
101+
echo "Number of Few-shot: $NUM_FEWSHOT"
102+
echo "Tasks to Evaluate: ${TASKS[*]}"
103+
echo "Number of Models: $#"
104+
echo "Model Paths:"
105+
for model_path in "$@"; do
106+
echo " - $model_path"
107+
done
108+
echo "======================================================"
109+
echo
110+
17111
# Iterate over all provided model paths
18112
for MODEL_PATH in "$@"; do
19113
# Extract model name from path (last directory name)
20114
MODEL_NAME=$(basename "$MODEL_PATH")
21115
echo "======================================================"
22116
echo "Evaluating model: $MODEL_NAME"
23117
echo "Model path: $MODEL_PATH"
24-
echo "======================================================"
25118

26119
# Create dedicated result directory for the model
27-
RESULT_PATH="./results/$MODEL_NAME"
120+
RESULT_PATH="$RESULT_BASE_DIR/$MODEL_NAME"
28121
mkdir -p "$RESULT_PATH"
29122

30-
# Evaluate ceval, mmlu, gsm8k
31-
lm_eval --model vllm \
32-
--model_args pretrained=$MODEL_PATH,add_bos_token=True,gpu_memory_utilization=0.9,tensor_parallel_size=$INFERENCE_TP_SIZE \
33-
--tasks ceval-valid \
34-
--num_fewshot 5 \
35-
--batch_size auto \
36-
--output_path "$RESULT_PATH/ceval_results.json" 2>&1 | tee "$RESULT_PATH/ceval.log"
37-
38-
lm_eval --model vllm \
39-
--model_args pretrained=$MODEL_PATH,add_bos_token=True,gpu_memory_utilization=0.9,tensor_parallel_size=$INFERENCE_TP_SIZE \
40-
--tasks mmlu \
41-
--num_fewshot 4 \
42-
--batch_size 1 \
43-
--output_path "$RESULT_PATH/mmlu_results.json" 2>&1 | tee "$RESULT_PATH/mmlu.log"
44-
45-
lm_eval --model vllm \
46-
--model_args pretrained=$MODEL_PATH,add_bos_token=True,gpu_memory_utilization=0.9,tensor_parallel_size=$INFERENCE_TP_SIZE \
47-
--tasks gsm8k \
48-
--num_fewshot 5 \
49-
--batch_size auto \
50-
--output_path "$RESULT_PATH/gsm8k_results.json" 2>&1 | tee "$RESULT_PATH/gsm8k.log"
51-
52-
# Evaluate humaneval
53-
lm_eval --model vllm \
54-
--model_args pretrained=$MODEL_PATH,add_bos_token=True,gpu_memory_utilization=0.9,tensor_parallel_size=$INFERENCE_TP_SIZE \
55-
--tasks humaneval \
56-
--num_fewshot 0 \
57-
--batch_size auto \
58-
--confirm_run_unsafe_code \
59-
--output_path "$RESULT_PATH/humaneval_results.json" 2>&1 | tee "$RESULT_PATH/humaneval.log"
60-
123+
for TASK in "${TASKS[@]}"; do
124+
echo "=============================================="
125+
echo "Evaluating task: $TASK"
126+
echo "Number of few-shot: $NUM_FEWSHOT"
127+
echo "=============================================="
128+
if [[ "$TASK" == *"humaneval"* ]]; then
129+
# Evaluate humaneval
130+
lm_eval --model vllm \
131+
--model_args pretrained=$MODEL_PATH,add_bos_token=True,gpu_memory_utilization=$GPU_MEMORY_UTILIZATION,tensor_parallel_size=$INFERENCE_TP_SIZE \
132+
--tasks $TASK \
133+
--num_fewshot $NUM_FEWSHOT \
134+
--batch_size $BATCH_SIZE \
135+
--confirm_run_unsafe_code \
136+
--output_path "$RESULT_PATH/$TASK.json" 2>&1 | tee "$RESULT_PATH/$TASK.log"
137+
else
138+
lm_eval --model vllm \
139+
--model_args pretrained=$MODEL_PATH,add_bos_token=True,gpu_memory_utilization=$GPU_MEMORY_UTILIZATION,tensor_parallel_size=$INFERENCE_TP_SIZE \
140+
--tasks $TASK \
141+
--num_fewshot $NUM_FEWSHOT \
142+
--batch_size $BATCH_SIZE \
143+
--output_path "$RESULT_PATH/$TASK.json" 2>&1 | tee "$RESULT_PATH/$TASK.log"
144+
fi
145+
done
146+
61147
echo "Evaluation completed for $MODEL_NAME"
62148
echo "Results saved to: $RESULT_PATH"
63149
done

0 commit comments

Comments
 (0)