Tencent · ali-88123 · Oct 31, 2025
diff --git a/README.md b/README.md
@@ -154,7 +154,7 @@ python3 tools/spec_benchmark.py \
 Test offline inference with a quantized model loaded via `transformers`:
 
 ```shell
-python deploy/offline.py $MODEL_PATH
+python scripts/deploy/offline.py $MODEL_PATH "Hello, my name is"
 ```
 
 Where `MODEL_PATH` is the path to the quantized model output.
@@ -168,32 +168,35 @@ python deploy/offline.py $MODEL_PATH
 Launch script for a [vLLM](https://github.com/vllm-project/vllm) server; recommended version `vllm>=0.8.5.post1`. Deploying MoE INT8 quantized models requires `vllm>=0.9.2`.
 
 ```shell
-bash deploy/run_vllm.sh $MODEL_PATH
+bash scripts/deploy/run_vllm.sh --model-path $MODEL_PATH --port 8080 -d 0,1,2,3 -t 4 -p 1 -g 0.8 --max-model-len 4096
 ```
+Where `-d` sets the visible devices, `-t` the tensor parallel size, `-p` the pipeline parallel size, and `-g` the GPU memory utilization.
 
 **SGLang**
 
 Launch script for an [SGLang](https://github.com/sgl-project/sglang) server; recommended version `sglang>=0.4.6.post1`:
 
 ```shell
-bash deploy/run_sglang.sh $MODEL_PATH
+bash scripts/deploy/run_sglang.sh --model-path $MODEL_PATH --port 8080 -d 0,1,2,3 -t 4 -g 0.8
 ```
 
 #### 3. Service Invocation
 
 Send requests through the [OpenAI-format](https://platform.openai.com/docs/api-reference/introduction) API:
 
 ```shell
-bash deploy/openai.sh $MODEL_PATH
+bash scripts/deploy/openai.sh -m $MODEL_PATH -p "Hello, my name is" --port 8080 --max-tokens 4096 --temperature 0.7 --top-p 0.8 --top-k 20 --repetition-penalty 1.05 --system-prompt "You are a helpful assistant."
 ```
+Where `-p` is the input prompt.
 
 #### 4. Performance Evaluation
 
 Evaluate the accuracy of the quantized model with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness); recommended version `lm-eval>=0.4.8`:
 
 ```shell
-bash deploy/lm_eval.sh $MODEL_PATH
+bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --tasks ceval-valid,mmlu,gsm8k,humaneval -n 0 $MODEL_PATH
 ```
+Where `RESULT_PATH` is the directory for saving test results, `-b` is the batch size, `--tasks` lists the evaluation tasks, and `-n` is the number of few-shot examples.
 
 For a detailed guide, see the [Deployment Documentation](https://angelslim.readthedocs.io/zh-cn/latest/deployment/deploy.html).
 

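The `--tasks` value in the `lm_eval.sh` command above is a comma-separated list. In this PR's script it is split with `IFS=',' read -ra`, and each task is evaluated separately with `$RESULT_PATH/$TASK.json` passed as its `--output_path`. The sketch below only illustrates that splitting and path construction; `planned_outputs` is a hypothetical helper that is not part of the PR, and it does not invoke `lm_eval`.

```shell
#!/bin/bash
# Hypothetical helper (not part of the PR): print the --output_path value
# that lm_eval.sh would pass for each task in a comma-separated --tasks list.
planned_outputs() {
    local result_dir="$1" task_csv="$2"
    local -a tasks
    # Same splitting idiom the script uses for --tasks.
    IFS=',' read -ra tasks <<< "$task_csv"
    local task
    for task in "${tasks[@]}"; do
        printf '%s/%s.json\n' "$result_dir" "$task"
    done
}

planned_outputs "./results/my-model" "ceval-valid,mmlu"
# prints:
#   ./results/my-model/ceval-valid.json
#   ./results/my-model/mmlu.json
```

Here `./results/my-model` stands in for `$RESULT_BASE_DIR/$MODEL_NAME`; the model name is an assumption made for illustration.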
diff --git a/README_en.md b/README_en.md
@@ -154,7 +154,7 @@ If you need to load a quantized model via `transformers`, please set the `deploy
 To test offline inference with a quantized model loaded via `transformers`, run the following command:
 
 ```shell
-python deploy/offline.py $MODEL_PATH
+python scripts/deploy/offline.py $MODEL_PATH "Hello, my name is"
 ```
 
 Where `MODEL_PATH` is the path to the quantized model output.
@@ -169,33 +169,36 @@ Use the following script to launch a [vLLM](https://github.com/vllm-project/vllm
 
 
 ```shell
-bash deploy/run_vllm.sh $MODEL_PATH
+bash scripts/deploy/run_vllm.sh --model-path $MODEL_PATH --port 8080 -d 0,1,2,3 -t 4 -p 1 -g 0.8 --max-model-len 4096
 ```
+Where `-d` sets the visible devices, `-t` the tensor parallel size, `-p` the pipeline parallel size, and `-g` the GPU memory utilization.
 
 **SGLang**
 
 
 Use the following script to launch an [SGLang](https://github.com/sgl-project/sglang) server; recommended version `sglang>=0.4.6.post1`:
 
 ```shell
-bash deploy/run_sglang.sh $MODEL_PATH
+bash scripts/deploy/run_sglang.sh --model-path $MODEL_PATH --port 8080 -d 0,1,2,3 -t 4 -g 0.8
 ```
 
 #### 3. Service Invocation
 
 Invoke requests via [OpenAI's API format](https://platform.openai.com/docs/api-reference/introduction):
 
 ```shell
-bash deploy/openai.sh $MODEL_PATH
+bash scripts/deploy/openai.sh -m $MODEL_PATH -p "Hello, my name is" --port 8080 --max-tokens 4096 --temperature 0.7 --top-p 0.8 --top-k 20 --repetition-penalty 1.05 --system-prompt "You are a helpful assistant."
 ```
+Where `-p` is the input prompt.
 
 #### 4. Performance Evaluation
 
 Evaluate the performance of the quantized model using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness); recommended version `lm-eval>=0.4.8`:
 
 ```shell
-bash deploy/lm_eval.sh $MODEL_PATH
+bash scripts/deploy/lm_eval.sh -d 0,1 -t 2 -g 0.8 -r $RESULT_PATH -b "auto" --tasks ceval-valid,mmlu,gsm8k,humaneval -n 0 $MODEL_PATH
 ```
+Where `RESULT_PATH` is the directory for saving test results, `-b` is the batch size, `--tasks` specifies the evaluation tasks, and `-n` is the number of few-shot examples.
 
 For more details, please refer to the [Deployment Documentation](https://angelslim.readthedocs.io/zh-cn/latest/deployment/deploy.html).
 

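The `-d`, `-t`, and `-p` flags documented for `run_vllm.sh` depend on one another: vLLM needs tensor parallel size × pipeline parallel size GPUs in total, so the device list should contain at least that many entries. The sketch below shows one way to check this before launching. `check_parallelism` is a hypothetical helper, and whether the PR's scripts perform such a check is not shown in this diff.

```shell
#!/bin/bash
# Hypothetical pre-flight check (not part of the PR): verify that the device
# list passed via -d has enough entries for the requested -t and -p values.
check_parallelism() {
    local devices="$1" tp="$2" pp="${3:-1}"
    local -a ids
    IFS=',' read -ra ids <<< "$devices"
    local needed=$(( tp * pp ))
    if (( ${#ids[@]} < needed )); then
        echo "Error: -t $tp and -p $pp need $needed GPUs, but only ${#ids[@]} are visible" >&2
        return 1
    fi
    return 0
}

# Matches the README example: -d 0,1,2,3 -t 4 -p 1
check_parallelism "0,1,2,3" 4 1 && echo "device count is sufficient"
```

With the README's example values, four devices cover the 4 × 1 requirement, so the check passes.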
diff --git a/angelslim/models/llm/kimi_k2.py b/angelslim/models/llm/kimi_k2.py
@@ -16,11 +16,11 @@
 from transformers import AutoModelForCausalLM
 from transformers.models.deepseek_v3 import DeepseekV3Config
 
-from ...tokenizer import TikTokenTokenizer
 from ...utils import print_info
 from ..model_factory import SlimModelFactory
 from .deepseek import DeepSeek
 from .modeling_deepseek import DeepseekV3ForCausalLM
+from .tiktoken_tokenizer import TikTokenTokenizer
 
 
 @SlimModelFactory.register

diff --git a/angelslim/tokenizer/kimi_k2.py b/angelslim/models/llm/tiktoken_tokenizer.py
rename from angelslim/tokenizer/kimi_k2.py
rename to angelslim/models/llm/tiktoken_tokenizer.py
diff --git a/angelslim/tokenizer/__init__.py b/angelslim/tokenizer/__init__.py
diff --git a/scripts/deploy/lm_eval.sh b/scripts/deploy/lm_eval.sh
@@ -1,63 +1,149 @@
 #!/bin/bash
 
-# Set environment variables
-export CUDA_VISIBLE_DEVICES=0,1,2,3
-export PYTHON_MULTIPROCESSING_METHOD=spawn
-export VLLM_WORKER_MULTIPROC_METHOD=spawn
-export HF_ALLOW_CODE_EVAL=1
+usage() {
+    cat << EOF
+Usage: $0 [OPTIONS] <model_path1> <model_path2> ...
+
+Options:
+  -d, --devices DEVICES          CUDA devices to use (default: 0,1,2,3)
+  -t, --tensor-parallel SIZE     Tensor parallel size (default: 4)
+  -g, --gpu-memory UTILIZATION   GPU memory utilization (default: 0.9)
+  -r, --result-dir DIR           Base result directory (default: ./results)
+  -b, --batch-size SIZE          Batch size for auto tasks (default: auto)
+  --tasks TASK1,TASK2,...        Comma-separated list of tasks to evaluate (default: ceval-valid,mmlu,gsm8k,humaneval)
+  -n, --num-fewshot NUM          Number of few-shot examples (default: 0)
+  -h, --help                     Show this help message
+
+Examples:
+  bash $0 -d 0,1 -t 2 --gpu-memory 0.8 /path/to/model1 /path/to/model2
+  bash $0 --tasks ceval-valid,mmlu,gsm8k,humaneval /path/to/model1
+EOF
+}
 
+CUDA_VISIBLE_DEVICES="0,1,2,3"
 INFERENCE_TP_SIZE=4
+GPU_MEMORY_UTILIZATION=0.9
+RESULT_BASE_DIR="./results"
+BATCH_SIZE="auto"
+TASKS=("ceval-valid" "mmlu" "gsm8k" "humaneval")
+NUM_FEWSHOT=0
+
+POSITIONAL_ARGS=()
+
+while [[ $# -gt 0 ]]; do
+    case $1 in
+        -d|--devices)
+            CUDA_VISIBLE_DEVICES="$2"
+            shift 2
+            ;;
+        -t|--tensor-parallel)
+            INFERENCE_TP_SIZE="$2"
+            shift 2
+            ;;
+        -g|--gpu-memory)
+            GPU_MEMORY_UTILIZATION="$2"
+            shift 2
+            ;;
+        -r|--result-dir)
+            RESULT_BASE_DIR="$2"
+            shift 2
+            ;;
+        -b|--batch-size)
+            BATCH_SIZE="$2"
+            shift 2
+            ;;
+        --tasks)
+            IFS=',' read -ra TASKS <<< "$2"
+            shift 2
+            ;;
+        -n|--num-fewshot)
+            NUM_FEWSHOT="$2"
+            shift 2
+            ;;
+        -h|--help)
+            usage
+            exit 0
+            ;;
+        -*|--*)
+            echo "Error: Unknown option: $1"
+            usage
+            exit 1
+            ;;
+        *)
+            POSITIONAL_ARGS+=("$1")
+            shift
+            ;;
+    esac
+done
+
+set -- "${POSITIONAL_ARGS[@]}"
 
 # Check if model paths are provided
 if [ $# -eq 0 ]; then
     echo "Usage: $0 <model_path1> <model_path2> ..."
     exit 1
 fi
 
+# Set environment variables
+export CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES
+export PYTHON_MULTIPROCESSING_METHOD=spawn
+export VLLM_WORKER_MULTIPROC_METHOD=spawn
+export HF_ALLOW_CODE_EVAL=1
+
+echo "======================================================"
+echo "           Model Evaluation Configuration"
+echo "======================================================"
+echo "CUDA Visible Devices:      $CUDA_VISIBLE_DEVICES"
+echo "Tensor Parallel Size:      $INFERENCE_TP_SIZE"
+echo "GPU Memory Utilization:    $GPU_MEMORY_UTILIZATION"
+echo "Result Base Directory:     $RESULT_BASE_DIR"
+echo "Batch Size:                $BATCH_SIZE"
+echo "Number of Few-shot:        $NUM_FEWSHOT"
+echo "Tasks to Evaluate:         ${TASKS[*]}"
+echo "Number of Models:          $#"
+echo "Model Paths:"
+for model_path in "$@"; do
+    echo "  - $model_path"
+done
+echo "======================================================"
+echo
+
 # Iterate over all provided model paths
 for MODEL_PATH in "$@"; do
     # Extract model name from path (last directory name)
     MODEL_NAME=$(basename "$MODEL_PATH")
     echo "======================================================"
     echo "Evaluating model: $MODEL_NAME"
     echo "Model path: $MODEL_PATH"
-    echo "======================================================"
 
     # Create dedicated result directory for the model
-    RESULT_PATH="./results/$MODEL_NAME"
+    RESULT_PATH="$RESULT_BASE_DIR/$MODEL_NAME"
     mkdir -p "$RESULT_PATH"
 
-    # Evaluate ceval, mmlu, gsm8k
-    lm_eval --model vllm \
-        --model_args pretrained=$MODEL_PATH,add_bos_token=True,gpu_memory_utilization=0.9,tensor_parallel_size=$INFERENCE_TP_SIZE \
-        --tasks ceval-valid \
-        --num_fewshot 5 \
-        --batch_size auto \
-        --output_path "$RESULT_PATH/ceval_results.json" 2>&1 | tee "$RESULT_PATH/ceval.log"
-
-    lm_eval --model vllm \
-        --model_args pretrained=$MODEL_PATH,add_bos_token=True,gpu_memory_utilization=0.9,tensor_parallel_size=$INFERENCE_TP_SIZE \
-        --tasks mmlu \
-        --num_fewshot 4 \
-        --batch_size 1 \
-        --output_path "$RESULT_PATH/mmlu_results.json" 2>&1 | tee "$RESULT_PATH/mmlu.log"
-
-    lm_eval --model vllm \
-        --model_args pretrained=$MODEL_PATH,add_bos_token=True,gpu_memory_utilization=0.9,tensor_parallel_size=$INFERENCE_TP_SIZE \
-        --tasks gsm8k \
-        --num_fewshot 5 \
-        --batch_size auto \
-        --output_path "$RESULT_PATH/gsm8k_results.json" 2>&1 | tee "$RESULT_PATH/gsm8k.log"
-
-    # Evaluate humaneval
-    lm_eval --model vllm \
-        --model_args pretrained=$MODEL_PATH,add_bos_token=True,gpu_memory_utilization=0.9,tensor_parallel_size=$INFERENCE_TP_SIZE \
-        --tasks humaneval \
-        --num_fewshot 0 \
-        --batch_size auto \
-        --confirm_run_unsafe_code \
-        --output_path "$RESULT_PATH/humaneval_results.json" 2>&1 | tee "$RESULT_PATH/humaneval.log"
-
+    for TASK in "${TASKS[@]}"; do
+        echo "=============================================="
+        echo "Evaluating task: $TASK"
+        echo "Number of few-shot: $NUM_FEWSHOT"
+        echo "=============================================="
+        if [[ "$TASK" == *"humaneval"* ]]; then
+            # Evaluate humaneval
+            lm_eval --model vllm \
+                --model_args pretrained=$MODEL_PATH,add_bos_token=True,gpu_memory_utilization=$GPU_MEMORY_UTILIZATION,tensor_parallel_size=$INFERENCE_TP_SIZE \
+                --tasks $TASK \
+                --num_fewshot $NUM_FEWSHOT \
+                --batch_size $BATCH_SIZE \
+                --confirm_run_unsafe_code \
+                --output_path "$RESULT_PATH/$TASK.json" 2>&1 | tee "$RESULT_PATH/$TASK.log"
+        else
+            lm_eval --model vllm \
+                --model_args pretrained=$MODEL_PATH,add_bos_token=True,gpu_memory_utilization=$GPU_MEMORY_UTILIZATION,tensor_parallel_size=$INFERENCE_TP_SIZE \
+                --tasks $TASK \
+                --num_fewshot $NUM_FEWSHOT \
+                --batch_size $BATCH_SIZE \
+                --output_path "$RESULT_PATH/$TASK.json" 2>&1 | tee "$RESULT_PATH/$TASK.log"
+        fi
+    done
+
     echo "Evaluation completed for $MODEL_NAME"
     echo "Results saved to: $RESULT_PATH"
 done
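
One edge case in the option loop above: each value-taking branch calls `shift 2` without checking that a value follows. In bash, `shift 2` leaves the positional parameters unchanged when only one remains, so an invocation that ends in a bare option (for example, a trailing `-d`) would keep matching the same branch and never exit the loop. The sketch below shows one possible guard. `require_value` and `parse_devices` are hypothetical names, and only the `-d` branch is reproduced.

```shell
#!/bin/bash
# Hypothetical guard (not part of the PR): fail fast when an option that
# expects a value is the last argument on the command line.
require_value() {
    local opt="$1" remaining="$2"
    if (( remaining < 2 )); then
        echo "Error: option $opt requires a value" >&2
        return 1
    fi
}

# Reduced version of the script's loop, handling only -d/--devices.
parse_devices() {
    local devices="0,1,2,3"   # same default as lm_eval.sh
    while [[ $# -gt 0 ]]; do
        case "$1" in
            -d|--devices)
                require_value "$1" "$#" || return 1
                devices="$2"
                shift 2
                ;;
            *)
                shift
                ;;
        esac
    done
    echo "$devices"
}

parse_devices -d 0,1 /path/to/model   # prints 0,1
```

The same `require_value "$1" "$#" || exit 1` line could precede each `shift 2` in the full script, so that a missing value produces an error message instead of a hang.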