diff --git a/.claude/skills/docker-build/SKILL.md b/.claude/skills/docker-build/SKILL.md
index 37d0867469..1a22d656f5 100644
--- a/.claude/skills/docker-build/SKILL.md
+++ b/.claude/skills/docker-build/SKILL.md
@@ -1,7 +1,6 @@
 ---
 name: docker-build
 description: Build an LMDeploy Docker image and push it to the inner registry.
-disable-model-invocation: true
 ---
 
 # Docker Build & Push
@@ -26,7 +25,7 @@ If any are missing, stop and tell the user to set them before proceeding.
 BRANCH=$(git branch --show-current | sed 's/[^a-zA-Z0-9._-]/-/g')
 SHA=$(git rev-parse --short=7 HEAD)
 TAG="${BRANCH}-${SHA}"
-IMAGE="${LMDEPLOY_REGISTRY}/lmdeploy:${TAG}"
+IMAGE="${LMDEPLOY_REGISTRY}/ailab-puyu-puyu_gpu/lmdeploy-dev:lmdeploy-${TAG}"
 ```
 
 Print the computed image name so the user can confirm.
diff --git a/.claude/skills/submit-llm-eval/SKILL.md b/.claude/skills/submit-llm-eval/SKILL.md
new file mode 100644
index 0000000000..f61e10259b
--- /dev/null
+++ b/.claude/skills/submit-llm-eval/SKILL.md
@@ -0,0 +1,204 @@
+---
+name: submit-eval
+description: Use when submitting a model eval task to the auto-eval platform
+disable-model-invocation: true
+---
+
+# Submit Eval Task
+
+Submit a model evaluation task to the auto-eval platform.
+
+## Prerequisites
+
+Read `~/.eval/config` and verify these required keys are present:
+
+```
+AUTO_EVAL_TOKEN
+FEISHU_EVAL_WEBHOOK
+USER
+OPENAI_API_BASE
+AUTO_EVAL_API_URL
+```
+
+If any are missing, stop and tell the user to populate `~/.eval/config`.
+
+Also verify `~/.eval/model.yaml` exists. If missing, stop and tell the user to create it.
+
+## 1. Gather inputs
+
+Read `~/.eval/model.yaml` and present the list of available model keys to the user. Ask them to select one or more models. The model key (e.g. `Qwen3.5-35B-A3B`) is the `model_abbr`. User input is matched case-insensitively — `qwen3.5-35b-a3b` matches `Qwen3.5-35B-A3B`. The original casing from the YAML key is used in the payload.
+
+Then ask for:
+
+- **backend** — `pytorch` or `turbomind`. This determines the Dockerfile used for building and is passed as `--backend` in `infer_extra_params`.
+- **instances** — number of inference instances (integer, used to compute `end_num`)
+- **datasets** — comma-separated dataset keys (looked up in `~/.eval/config`)
+- **image** (optional) — Docker image for the eval container
+
+If the user selected multiple models, repeat steps 3-10 for each model.
+
+## 2. Look up model config
+
+For each selected model, read its entry from `~/.eval/model.yaml`. Extract `model_path` and all other fields.
+
+All fields except `model_path` are passed as CLI flags to `infer_extra_params`, mapping each key to `--{key} {value}`. The `--backend` flag comes from the user's backend input (step 1), not from model.yaml. For example:
+
+```yaml
+tp: 2
+reasoning_parser: qwen-qwq
+tool_call_parser: qwen
+```
+
+with `backend=turbomind` produces:
+
+```
+--tp 2 --backend turbomind --reasoning-parser qwen-qwq --tool-call-parser qwen
+```
+
+Keys with underscores are converted to hyphens for the CLI flag name.
+
+## 3. Resolve image
+
+If the user provided an `image`, use it. Otherwise, invoke the `docker-build` skill to build and push an image from the current branch. Use the `image` variable it produces.
+
+- If backend is `turbomind`, tell `docker-build` to use the full build mode (`docker/Dockerfile`).
+- If backend is `pytorch`, tell `docker-build` to use the patch build mode (`docker/Dockerfile_patch`).
+
+## 4. Resolve datasets
+
+Parse the comma-separated `datasets` input. For each key, look up its value in `~/.eval/config`. If a key is not found, stop and list available dataset keys.
+
+Combine the values into `subdataset`: `[*val1, *val2, ...]`
+
+## 5. Compute derived fields
+
+Compute the timestamp-padded model name:
+
+```bash
+TIMESTAMP=$(date +%Y%m%d-%H%M%S)
+MODEL_ABBR_PADDED="${model_abbr}-${TIMESTAMP}"
+```
+
+Compute `infer_extra_params` from all model.yaml fields except `model_path` (see step 2 for the mapping rule).
+
+Compute resource fields:
+
+- `gpu_num` = `tp`
+- `cpu` = `16 * tp`
+- `memory` = `"128000 * tp"` (as string)
+- `end_num` = `instances + 1`
+- `tokenizer_path` = `model_path`
+- `output_dir` = `"./{USER}/${MODEL_ABBR_PADDED}"`
+
+## 6. Assemble model_infer_config
+
+Build as a Python dict string with these fields:
+
+```python
+{
+    'type': 'OpenAISDKStreaming',
+    'key': 'sk-admin',
+    'openai_api_base': ['{OPENAI_API_BASE}'],
+    'query_per_second': 8,
+    'batch_size': 32,
+    'max_workers': 8,
+    'temperature': 1,
+    'tokenizer_path': '{model_path}',
+    'retry': 50,
+    'max_out_len': 128000,
+    'max_seq_len': 128000,
+    'extra_body': {
+        'top_k': 20,
+        'repetition_penalty': 1.0,
+        'top_p': 0.95,
+    },
+    'verbose': True,
+}
+```
+
+If the model has a `reasoning_parser` field, add to `extra_body`:
+
+```python
+'chat_template_kwargs': {'enable_thinking': True},
+```
+
+And add at the top level:
+
+```python
+'pred_postprocessor': {'type': 'extract-non-reasoning-content'},
+```
+
+## 7. Compute model_infer_config_base64
+
+```bash
+echo -n '{model_infer_config}' | base64 -w 0
+```
+
+## 8. Assemble infer_backend_config
+
+```json
+{
+  "end_num": {instances + 1},
+  "gpu_num": {tp},
+  "memory": "{128000 * tp}",
+  "cpu": {16 * tp},
+  "parallelism": "TP",
+  "oc_cpu": "1",
+  "oc_mem": 4000,
+  "model": "{MODEL_ABBR_PADDED}",
+  "model_path": "{model_path}",
+  "image": "{image}",
+  "infer_engine": "lmdeploy",
+  "infer_extra_params": "{infer_extra_params}",
+  "delete": "false",
+  "start_infer": "true",
+  "node_num": 1
+}
+```
+
+## 9. Assemble full payload
+
+Build the JSON body:
+
+```json
+{
+  "job_name": "api_eval_v4",
+  "param": {
+    "cluster": "yidian",
+    "workspace_id": "evalservice_gpu",
+    "model_abbr": "{MODEL_ABBR_PADDED}",
+    "user": "{USER}",
+    "model_infer_config": "{model_infer_config as string}",
+    "llm_judger_config": "",
+    "infer_worker_nums": 8,
+    "eval_nums": "15",
+    "eval_type": "chat_objective",
+    "auto_eval_version": "ld_0122_oc_0524d49_v2",
+    "ocp_version": "fullbench_v2_0",
+    "subdataset": "{subdataset}",
+    "fast_infer": "true",
+    "output_dir": "{output_dir}",
+    "eval_only": "false",
+    "cli_extra": "",
+    "dataset_max_out_len": "128000",
+    "feishu_token": "{FEISHU_EVAL_WEBHOOK}",
+    "model_infer_config_base64": "{model_infer_config_base64}",
+    "infer_backend_config": {infer_backend_config}
+  }
+}
+```
+
+## 10. Submit
+
+Execute the curl command:
+
+```bash
+curl -s -X POST "${AUTO_EVAL_API_URL}" \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer ${AUTO_EVAL_TOKEN}" \
+  -d @- <<'JSON'
+{payload}
+JSON
+```
+
+Report the HTTP status and response body to the user.
diff --git a/.gitignore b/.gitignore
index 7b4051fcdd..ff0779dc17 100644
--- a/.gitignore
+++ b/.gitignore
@@ -4,13 +4,10 @@ __pycache__/
 *$py.class
 .vscode/
 .idea/
+.cursor/
 
 # C extensions
 *.so
-# skills
-.cursor/
-!.claude/skills/docker-build/
-!.claude/skills/docker-build/SKILL.md
 
 # Distribution / packaging
 .Python
@@ -51,6 +48,7 @@ htmlcov/
 .cache
 *build*/
 !builder/
+!.claude/skills/docker-build/
 lmdeploy/lib/
 lmdeploy/bin/
 dist/
diff --git a/docs/superpowers/specs/2026-04-29-submit-eval-skill-design.md b/docs/superpowers/specs/2026-04-29-submit-eval-skill-design.md
new file mode 100644
index 0000000000..e23b1cd336
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-29-submit-eval-skill-design.md
@@ -0,0 +1,149 @@
+# submit-eval Skill Design
+
+## Overview
+
+A Claude Code skill that submits a model evaluation task to the auto-eval platform. The API URL is read from `~/.eval/config`. It reads model and dataset config from user-maintained files, computes derived fields, optionally builds a Docker image via the `docker-build` skill, and submits the request via curl.
+
+## User Inputs
+
+| Input | Required | Example | Notes |
+| ------------ | -------- | --------------------------------------- | --------------------------------------------------------------------------------- |
+| `model_abbr` | Yes | `Qwen3.5-35B` | Key in `~/.eval/models/model.yaml`; padded with `-yyyymmdd-hhmmss` at submit time |
+| `instances` | Yes | `3` | Number of inference instances; used to compute `end_num` |
+| `datasets` | Yes | `mmlu_pro, ifeval, aime2026` | Comma-separated keys (looked up in `~/.eval/config`) |
+| `image` | No | `/lmdeploy:main-abc1234` | Docker image. If omitted, triggers `docker-build` skill |
+
+## Config Files
+
+### `~/.eval/config` (KEY=VALUE)
+
+```
+AUTO_EVAL_TOKEN=
+FEISHU_EVAL_WEBHOOK=
+USER=lvhan
+OPENAI_API_BASE=
+AUTO_EVAL_API_URL=
+mmlu_pro=*mmlu_pro_datasets
+ifeval=*ifeval_datasets
+aime2026=*aime2026_datasets
+```
+
+### `~/.eval/models/model.yaml` (YAML)
+
+```yaml
+Qwen3.5-35B:
+  model_path: /mnt/huggingface/hub/models--Qwen--Qwen3.5-35B-A3B/snapshots/b1fc3d59ae0ab1e4279e04a8dd0fc4dc361fc2b6
+  tp: 2
+  backend: turbomind
+  reasoning_parser: qwen-qwq
+  # tool_call_parser: ... (optional)
+
+Qwen3-32B:
+  model_path: /mnt/shared-storage-gpfs2/.../snapshots/...
+  tp: 2
+  backend: turbomind
+  reasoning_parser: qwen-qwq
+```
+
+## Hardcoded Defaults
+
+| Field | Default |
+| ------------------- | ----------------------- |
+| `job_name` | `api_eval_v4` |
+| `cluster` | `yidian` |
+| `workspace_id` | `evalservice_gpu` |
+| `eval_type` | `chat_objective` |
+| `auto_eval_version` | `ld_0122_oc_0524d49_v2` |
+| `ocp_version` | `fullbench_v2_0` |
+| `fast_infer` | `true` |
+| `eval_only` | `false` |
+| `parallelism` | `TP` |
+| `infer_engine` | `lmdeploy` |
+| `delete` | `false` |
+| `start_infer` | `true` |
+| `node_num` | `1` |
+| `oc_cpu` | `1` |
+| `oc_mem` | `4000` |
+| `infer_worker_nums` | `8` |
+| `eval_nums` | `15` |
+| `llm_judger_config` | `""` |
+| `cli_extra` | `""` |
+
+## Derived Fields
+
+| Field | Derivation |
+| --------------------------- | ------------------------------------------------------------------------------------------------------------------------ |
+| `model_abbr` (padded) | `{model_abbr}-{yyyymmdd-hhmmss}` using current timestamp at submit time |
+| `subdataset` | From `datasets` input: look up each key in `~/.eval/config`, combine values as `[*val1, *val2, ...]` |
+| `infer_extra_params` | `--tp {tp} --backend {backend} --reasoning-parser {reasoning_parser}` + optional `--tool-call-parser {tool_call_parser}` |
+| `gpu_num` | = `tp` |
+| `cpu` | = `16 * tp` |
+| `memory` | = `"128000 * tp"` (as string, e.g. tp=2 → `"256000"`) |
+| `end_num` | = `instances + 1` |
+| `tokenizer_path` | = `model_path` |
+| `output_dir` | `./{user}/{model_abbr_padded}` |
+| `model_infer_config` | Assembled dict string (see below) |
+| `model_infer_config_base64` | `echo -n '{model_infer_config}' \| base64` |
+| `infer_backend_config` | Assembled dict (see below) |
+
+### model_infer_config structure
+
+Assembled as a Python dict string with:
+
+- `type`: `OpenAISDKStreaming`
+- `key`: `sk-admin`
+- `openai_api_base`: `['{OPENAI_API_BASE}']` (from `~/.eval/config`)
+- `query_per_second`: `8`
+- `batch_size`: `32`
+- `max_workers`: `8`
+- `temperature`: `1`
+- `tokenizer_path`: `{model_path}`
+- `retry`: `50`
+- `max_out_len`: `128000`
+- `max_seq_len`: `128000`
+- `extra_body`: `{top_k: 20, repetition_penalty: 1.0, top_p: 0.95, chat_template_kwargs: {enable_thinking: True}}`
+  - If `reasoning_parser` is set, include `chat_template_kwargs.enable_thinking: True`; otherwise omit
+- `pred_postprocessor`: `{type: 'extract-non-reasoning-content'}` (only if `reasoning_parser` is set)
+- `verbose`: `True`
+
+### infer_backend_config structure
+
+Assembled as a dict with:
+
+- `end_num`: `{instances + 1}`
+- `gpu_num`: `{tp}`
+- `memory`: `"{128000 * tp}"`
+- `cpu`: `{16 * tp}`
+- `parallelism`: `TP`
+- `oc_cpu`: `1`
+- `oc_mem`: `4000`
+- `model`: `{model_abbr_padded}`
+- `model_path`: `{model_path}`
+- `image`: `{image}`
+- `infer_engine`: `lmdeploy`
+- `infer_extra_params`: `{infer_extra_params}`
+- `delete`: `false`
+- `start_infer`: `true`
+- `node_num`: `1`
+
+## Flow
+
+1. **Read `~/.eval/config`** — verify `AUTO_EVAL_TOKEN`, `FEISHU_EVAL_WEBHOOK`, `USER`, `OPENAI_API_BASE`, `AUTO_EVAL_API_URL` are present. Stop if missing.
+2. **Gather inputs** — ask user for `model_abbr`, `instances`, `datasets`, and optionally `image`.
+3. **Look up model** — read `~/.eval/models/model.yaml`, find the entry for `model_abbr`. Stop if not found.
+4. **Resolve image** — if user provided `image`, use it; else invoke the `docker-build` skill to build and push an image from the current branch.
+5. **Resolve datasets** — parse comma-separated `datasets` input, look up each key in `~/.eval/config`, combine into `subdataset` value.
+6. **Compute derived fields** — pad `model_abbr` with timestamp, assemble `infer_extra_params`, `gpu_num`, `cpu`, `memory`, `end_num`, `model_infer_config`, `model_infer_config_base64`, `infer_backend_config`, `output_dir`.
+7. **Assemble payload** — build the full JSON body with defaults + user inputs + computed fields.
+8. **Submit** — execute `curl -X POST` to `AUTO_EVAL_API_URL` with the payload and `AUTO_EVAL_TOKEN` as Bearer token. Report the response.
+
+## Skill File Location
+
+`/workspace/lmdeploy/.claude/skills/submit-eval/SKILL.md`
+
+## Error Handling
+
+- Missing `~/.eval/config` or incomplete keys → stop and tell user to create/populate it
+- Model not found in `~/.eval/models/model.yaml` → stop and list available models
+- Dataset key not found in `~/.eval/config` → stop and list available dataset keys
+- curl failure → report the HTTP status and response body
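
The derivation rules in the patch above (the `--{key} {value}` flag mapping from step 2 of the skill and the resource fields from step 5) can be sketched in Python roughly as follows. This is an illustration only and not part of the patch: the function names `build_infer_extra_params` and `derive_fields`, the placeholder model path, skipping a `backend` key found in model.yaml, and emitting `--tp` first and `--backend` second (to mirror the example output in step 2) are all assumptions.

```python
from datetime import datetime


def build_infer_extra_params(model_cfg: dict, backend: str) -> str:
    """Map model.yaml fields to CLI flags, converting underscores to hyphens.

    Assumption: `model_path` is never a flag, and a `backend` key in
    model.yaml is ignored because the skill takes backend from user input.
    """
    fields = {k: v for k, v in model_cfg.items() if k not in ("model_path", "backend")}
    ordered = []
    # Assumption: put --tp first and --backend second, as in the skill's example.
    if "tp" in fields:
        ordered.append(("tp", fields.pop("tp")))
    ordered.append(("backend", backend))
    ordered.extend(fields.items())
    return " ".join(f"--{key.replace('_', '-')} {value}" for key, value in ordered)


def derive_fields(model_abbr: str, model_cfg: dict, instances: int, user: str, now=None) -> dict:
    """Compute the derived resource fields listed in step 5 (hypothetical helper)."""
    now = now or datetime.now()
    padded = f"{model_abbr}-{now.strftime('%Y%m%d-%H%M%S')}"
    tp = int(model_cfg["tp"])
    return {
        "model_abbr_padded": padded,
        "gpu_num": tp,
        "cpu": 16 * tp,
        # The product is rendered as a string, e.g. tp=2 gives "256000".
        "memory": str(128000 * tp),
        "end_num": instances + 1,
        "tokenizer_path": model_cfg["model_path"],
        "output_dir": f"./{user}/{padded}",
    }


if __name__ == "__main__":
    # "/path/to/model" is a placeholder, not a path from the patch.
    cfg = {
        "model_path": "/path/to/model",
        "tp": 2,
        "reasoning_parser": "qwen-qwq",
        "tool_call_parser": "qwen",
    }
    print(build_infer_extra_params(cfg, "turbomind"))
    print(derive_fields("Qwen3.5-35B", cfg, instances=3, user="lvhan"))
```

Passing `now` explicitly keeps the timestamp padding reproducible in tests; in normal use it defaults to the current time, matching the "at submit time" wording in the spec.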