Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 8 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -154,7 +154,7 @@ python3 tools/spec_benchmark.py \
测试`transformers`加载量化模型离线推理:

```shell
python deploy/offline.py $MODEL_PATH
python scripts/deploy/offline.py $MODEL_PATH "Hello, my name is"
```

其中 `MODEL_PATH` 为量化产出模型路径。
Expand All @@ -168,32 +168,35 @@ python deploy/offline.py $MODEL_PATH
[vLLM](https://github.com/vllm-project/vllm) 服务启动脚本,建议版本`vllm>=0.8.5.post1`,部署MOE INT8量化模型需要`vllm>=0.9.2`。

```shell
bash deploy/run_vllm.sh $MODEL_PATH
bash scripts/deploy/run_vllm.sh --model-path $MODEL_PATH --port 8080 -d 0,1,2,3 -t 4 -p 1 -g 0.8 --max-model-len 4096
```
其中`-d`为可见设备,`-t`为张量并行度,`-p`为流水线并行度,`-g`为显存使用率。

**SGLang**

[SGLang](https://github.com/sgl-project/sglang) 服务启动脚本,建议版本 `sglang>=0.4.6.post1`:

```shell
bash deploy/run_sglang.sh $MODEL_PATH
bash scripts/deploy/run_sglang.sh --model-path $MODEL_PATH --port 8080 -d 0,1,2,3 -t 4 -g 0.8
```

#### 3. 服务调用

通过 [OpenAI 格式](https://platform.openai.com/docs/api-reference/introduction) 接口发起请求:

```shell
bash deploy/openai.sh $MODEL_PATH
bash scripts/deploy/openai.sh -m $MODEL_PATH -p "Hello, my name is" --port 8080 --max-tokens 4096 --temperature 0.7 --top-p 0.8 --top-k 20 --repetition-penalty 1.05 --system-prompt "You are a helpful assistant."
```
其中`-p`为输入prompt

#### 4. 效果验证

使用 [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) 评估量化模型精度,建议版本`lm-eval>=0.4.8`:

```shell
bash deploy/lm_eval.sh $MODEL_PATH
bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --tasks ceval-valid,mmlu,gsm8k,humaneval -n 0 $MODEL_PATH
```
其中`RESULT_PATH`为测试结果保存目录,`-b`为batch size大小,`--tasks`为评测任务,`-n`为few-shot数量

详细操作指南请参阅[部署文档](https://angelslim.readthedocs.io/zh-cn/latest/deployment/deploy.html)。

Expand Down
13 changes: 8 additions & 5 deletions README_en.md
Original file line number Diff line number Diff line change
Expand Up @@ -154,7 +154,7 @@ If you need to load a quantized model via `transformers`, please set the `deploy
To test offline inference with a quantized model loaded via `transformers`, run the following command:

```shell
python deploy/offline.py $MODEL_PATH
python scripts/deploy/offline.py $MODEL_PATH "Hello, my name is"
```

Where `MODEL_PATH` is the path to the quantized model output.
Expand All @@ -169,33 +169,36 @@ Use the following script to launch a [vLLM](https://github.com/vllm-project/vllm


```shell
bash deploy/run_vllm.sh $MODEL_PATH
bash scripts/deploy/run_vllm.sh --model-path $MODEL_PATH --port 8080 -d 0,1,2,3 -t 4 -p 1 -g 0.8 --max-model-len 4096
```
Where `-d` is the visible devices, `-t` is tensor parallel size, `-p` is pipeline parallel size, and `-g` is the GPU memory utilization.

**SGLang**


Use the following script to launch a [SGLang](https://github.com/sgl-project/sglang) server, recommended version `sglang>=0.4.6.post1`.

```shell
bash deploy/run_sglang.sh $MODEL_PATH
bash scripts/deploy/run_sglang.sh --model-path $MODEL_PATH --port 8080 -d 0,1,2,3 -t 4 -g 0.8
```

#### 3. Service Invocation

Invoke requests via [OpenAI's API format](https://platform.openai.com/docs/api-reference/introduction):

```shell
bash deploy/openai.sh $MODEL_PATH
bash scripts/deploy/openai.sh -m $MODEL_PATH -p "Hello, my name is" --port 8080 --max-tokens 4096 --temperature 0.7 --top-p 0.8 --top-k 20 --repetition-penalty 1.05 --system-prompt "You are a helpful assistant."
```
where `-p` is the input prompt.

#### 4. Performance Evaluation

Evaluate the performance of quantized model using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), recommended version`lm-eval>=0.4.8`:

```shell
bash deploy/lm_eval.sh $MODEL_PATH
bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --tasks ceval-valid,mmlu,gsm8k,humaneval -n 0 $MODEL_PATH
```
where `RESULT_PATH` is the directory for saving test results, `-b` is batch size, `--tasks` specifies the evaluation tasks, and `-n` is the number of few-shot examples.

For more detaileds, please refer to the [Deployment Documentation](https://angelslim.readthedocs.io/zh-cn/latest/deployment/deploy.html).

Expand Down
2 changes: 1 addition & 1 deletion angelslim/models/llm/kimi_k2.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,11 +16,11 @@
from transformers import AutoModelForCausalLM
from transformers.models.deepseek_v3 import DeepseekV3Config

from ...tokenizer import TikTokenTokenizer
from ...utils import print_info
from ..model_factory import SlimModelFactory
from .deepseek import DeepSeek
from .modeling_deepseek import DeepseekV3ForCausalLM
from .tiktoken_tokenizer import TikTokenTokenizer


@SlimModelFactory.register
Expand Down
15 changes: 0 additions & 15 deletions angelslim/tokenizer/__init__.py

This file was deleted.

162 changes: 124 additions & 38 deletions scripts/deploy/lm_eval.sh
Original file line number Diff line number Diff line change
@@ -1,63 +1,149 @@
#!/bin/bash

# Set environment variables
export CUDA_VISIBLE_DEVICES=0,1,2,3
export PYTHON_MULTIPROCESSING_METHOD=spawn
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export HF_ALLOW_CODE_EVAL=1
usage() {
cat << EOF
Usage: $0 [OPTIONS] <model_path1> <model_path2> ...

Options:
-d, --devices DEVICES CUDA devices to use (default: 0,1,2,3)
-t, --tensor-parallel SIZE Tensor parallel size (default: 4)
-g, --gpu-memory UTILIZATION GPU memory utilization (default: 0.9)
-r, --result-dir DIR Base result directory (default: ./results)
-b, --batch-size SIZE Batch size for auto tasks (default: auto)
--tasks TASK1,TASK2,... Comma-separated list of tasks to evaluate (default: ceval-valid,mmlu,gsm8k,humaneval)
-n, --num-fewshot NUM Number of few-shot examples (default: 0)
-h, --help Show this help message

Examples:
bash $0 -d 0,1 -t 2 --gpu-memory 0.8 /path/to/model1 /path/to/model2
bash $0 --tasks ceval-valid,mmlu,gsm8k,humaneval /path/to/model1
EOF
}

CUDA_VISIBLE_DEVICES="0,1,2,3"
INFERENCE_TP_SIZE=4
GPU_MEMORY_UTILIZATION=0.9
RESULT_BASE_DIR="./results"
BATCH_SIZE="auto"
TASKS=("ceval-valid" "mmlu" "gsm8k" "humaneval")
NUM_FEWSHOT=0

POSITIONAL_ARGS=()

while [[ $# -gt 0 ]]; do
case $1 in
-d|--devices)
CUDA_VISIBLE_DEVICES="$2"
shift 2
;;
-t|--tensor-parallel)
INFERENCE_TP_SIZE="$2"
shift 2
;;
-g|--gpu-memory)
GPU_MEMORY_UTILIZATION="$2"
shift 2
;;
-r|--result-dir)
RESULT_BASE_DIR="$2"
shift 2
;;
-b|--batch-size)
BATCH_SIZE="$2"
shift 2
;;
--tasks)
IFS=',' read -ra TASKS <<< "$2"
shift 2
;;
-n|--num-fewshot)
NUM_FEWSHOT="$2"
shift 2
;;
-h|--help)
usage
exit 0
;;
-*|--*)
echo "Error: Unknown option: $1"
usage
exit 1
;;
*)
POSITIONAL_ARGS+=("$1")
shift
;;
esac
done

set -- "${POSITIONAL_ARGS[@]}"

# Check if model paths are provided
if [ $# -eq 0 ]; then
echo "Usage: $0 <model_path1> <model_path2> ..."
exit 1
fi

# Set environment variables
export CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES
export PYTHON_MULTIPROCESSING_METHOD=spawn
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export HF_ALLOW_CODE_EVAL=1

echo "======================================================"
echo " Model Evaluation Configuration"
echo "======================================================"
echo "CUDA Visible Devices: $CUDA_VISIBLE_DEVICES"
echo "Tensor Parallel Size: $INFERENCE_TP_SIZE"
echo "GPU Memory Utilization: $GPU_MEMORY_UTILIZATION"
echo "Result Base Directory: $RESULT_BASE_DIR"
echo "Batch Size: $BATCH_SIZE"
echo "Number of Few-shot: $NUM_FEWSHOT"
echo "Tasks to Evaluate: ${TASKS[*]}"
echo "Number of Models: $#"
echo "Model Paths:"
for model_path in "$@"; do
echo " - $model_path"
done
echo "======================================================"
echo

# Iterate over all provided model paths
for MODEL_PATH in "$@"; do
# Extract model name from path (last directory name)
MODEL_NAME=$(basename "$MODEL_PATH")
echo "======================================================"
echo "Evaluating model: $MODEL_NAME"
echo "Model path: $MODEL_PATH"
echo "======================================================"

# Create dedicated result directory for the model
RESULT_PATH="./results/$MODEL_NAME"
RESULT_PATH="$RESULT_BASE_DIR/$MODEL_NAME"
mkdir -p "$RESULT_PATH"

# Evaluate ceval, mmlu, gsm8k
lm_eval --model vllm \
--model_args pretrained=$MODEL_PATH,add_bos_token=True,gpu_memory_utilization=0.9,tensor_parallel_size=$INFERENCE_TP_SIZE \
--tasks ceval-valid \
--num_fewshot 5 \
--batch_size auto \
--output_path "$RESULT_PATH/ceval_results.json" 2>&1 | tee "$RESULT_PATH/ceval.log"

lm_eval --model vllm \
--model_args pretrained=$MODEL_PATH,add_bos_token=True,gpu_memory_utilization=0.9,tensor_parallel_size=$INFERENCE_TP_SIZE \
--tasks mmlu \
--num_fewshot 4 \
--batch_size 1 \
--output_path "$RESULT_PATH/mmlu_results.json" 2>&1 | tee "$RESULT_PATH/mmlu.log"

lm_eval --model vllm \
--model_args pretrained=$MODEL_PATH,add_bos_token=True,gpu_memory_utilization=0.9,tensor_parallel_size=$INFERENCE_TP_SIZE \
--tasks gsm8k \
--num_fewshot 5 \
--batch_size auto \
--output_path "$RESULT_PATH/gsm8k_results.json" 2>&1 | tee "$RESULT_PATH/gsm8k.log"

# Evaluate humaneval
lm_eval --model vllm \
--model_args pretrained=$MODEL_PATH,add_bos_token=True,gpu_memory_utilization=0.9,tensor_parallel_size=$INFERENCE_TP_SIZE \
--tasks humaneval \
--num_fewshot 0 \
--batch_size auto \
--confirm_run_unsafe_code \
--output_path "$RESULT_PATH/humaneval_results.json" 2>&1 | tee "$RESULT_PATH/humaneval.log"

for TASK in "${TASKS[@]}"; do
echo "=============================================="
echo "Evaluating task: $TASK"
echo "Number of few-shot: $NUM_FEWSHOT"
echo "=============================================="
if [[ "$TASK" == *"humaneval"* ]]; then
# Evaluate humaneval
lm_eval --model vllm \
--model_args pretrained=$MODEL_PATH,add_bos_token=True,gpu_memory_utilization=$GPU_MEMORY_UTILIZATION,tensor_parallel_size=$INFERENCE_TP_SIZE \
--tasks $TASK \
--num_fewshot $NUM_FEWSHOT \
--batch_size $BATCH_SIZE \
--confirm_run_unsafe_code \
--output_path "$RESULT_PATH/$TASK.json" 2>&1 | tee "$RESULT_PATH/$TASK.log"
else
lm_eval --model vllm \
--model_args pretrained=$MODEL_PATH,add_bos_token=True,gpu_memory_utilization=$GPU_MEMORY_UTILIZATION,tensor_parallel_size=$INFERENCE_TP_SIZE \
--tasks $TASK \
--num_fewshot $NUM_FEWSHOT \
--batch_size $BATCH_SIZE \
--output_path "$RESULT_PATH/$TASK.json" 2>&1 | tee "$RESULT_PATH/$TASK.log"
fi
done

echo "Evaluation completed for $MODEL_NAME"
echo "Results saved to: $RESULT_PATH"
done
Expand Down
Loading