---
name: serve-config-guide
description: Generate a source-backed starting `trtllm-serve --config` YAML for
  basic aggregate single-node PyTorch serving, aligned with checked-in TensorRT-LLM
  configs and deployment docs. Preserves explicit latency / balanced / throughput
  objectives. Excludes disaggregated, multi-node, and non-MTP speculative configs.
---

# Serve Config Guide

**Scope:** aggregate/IFB (in-flight batching) colocated prefill+decode, single node, PyTorch backend, non-speculative by default; for DeepSeek-R1, MTP is the standard mode (all checked-in DeepSeek-R1 configs include it).

**Input:** model, GPU, ISL (input sequence length), OSL (output sequence length), concurrency, TP, performance objective (`Min Latency` | `Balanced` | `Max Throughput` | unspecified).
**Output:** repo-grounded starting YAML for `trtllm-serve --config`.

If the request is adjacent but out of scope, provide a best-effort answer using the nearest in-scope config as a starting point, clearly label inferred vs. verified fields, and point to the relevant feature doc in `docs/source/features/` (e.g., speculative-decoding, disagg-serving, parallel-strategy) or `examples/llm-api/`.
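For orientation, the output YAML typically has roughly this shape. The field names below are the scenario-dependent ones this guide adjusts; the values are illustrative placeholders, not recommendations — always start from a checked-in config:

```yaml
# Illustrative shape only — copy real values from a checked-in config.
max_batch_size: 256
max_num_tokens: 8192
max_seq_len: 9216
kv_cache_config:
  free_gpu_memory_fraction: 0.85
cuda_graph_config:
  max_batch_size: 256
```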

## Constraints

1. **Speculative exclusion:** Exclude configs containing `speculative_config` by default. Exception: exact checked-in DeepSeek-R1 MTP configs (models with `decoding_type: MTP` in `examples/configs/`). When including MTP, copy the full `speculative_config` block verbatim — never interpolate speculative fields.

2. **Objective preservation:** Preserve the user's stated objective through config selection. Use `database.py` profile labels (`Min Latency`, `Balanced`, `Max Throughput`; plus `Low Latency`/`High Throughput` in smaller sets) as selection aids. If a config is unlabeled, treat it as a default starting point — do not claim it matches a specific objective. If the only match conflicts with the stated objective, call out the mismatch.

3. **Source preference:** Prefer checked-in configs over interpolation. When docs and configs disagree, prefer the config for the exact scenario and note the mismatch. Mark any interpolation as unverified.
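As a reference for what Constraint 1 protects, the block in checked-in DeepSeek-R1 MTP configs has roughly this shape. The `num_nextn_predict_layers` field and its value are shown for illustration only — copy the exact block from the config you selected, never this sketch:

```yaml
# Copied verbatim from examples/configs/ — never hand-edit these fields.
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3  # illustrative; use the checked-in value
```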

## Response Format

For **exact matches**: `Config` → `Source` → `Launch command`

For **interpolated configs**: `Config` → `Source used as starting point` → `What to benchmark` (single list of knobs worth sweeping, not per-field unverified tags)

## Step 0: Lock Objective and Decode Mode

Identify the user's objective (`Min Latency` | `Balanced` | `Max Throughput` | unspecified) and decode mode (non-speculative or DeepSeek-R1 MTP per **Constraint 1**). Preserve both through the remaining steps.

## Step 1: Exact Database Match

Search `examples/configs/database/lookup.yaml` for an exact `(model, gpu, isl, osl, concurrency, num_gpus)` match. Use `database.py` as a loader/helper.

- Apply **speculative exclusion**.
- When multiple recipes exist at different concurrency points, use profile labels to match the user's objective per **objective preservation**.
- Prefer an exact match that also matches the stated objective over manual tuning.
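The exact-match-plus-exclusion logic above can be sketched as follows. The recipe schema (dictionary keys) is an assumption for illustration — `database.py` defines the real loader and schema:

```python
# Sketch of Step 1: exact-tuple lookup with speculative exclusion.
# The recipe dict keys below are assumed for illustration.
def exact_matches(recipes, model, gpu, isl, osl, concurrency, num_gpus):
    """Return recipes matching the full lookup tuple, dropping
    speculative configs (Constraint 1) unless they are MTP."""
    out = []
    for r in recipes:
        key = (r["model"], r["gpu"], r["isl"], r["osl"],
               r["concurrency"], r["num_gpus"])
        if key != (model, gpu, isl, osl, concurrency, num_gpus):
            continue
        spec = r.get("config", {}).get("speculative_config")
        if spec and spec.get("decoding_type") != "MTP":
            continue  # speculative exclusion
        out.append(r)
    return out

recipes = [
    {"model": "deepseek-r1", "gpu": "h200", "isl": 1024, "osl": 1024,
     "concurrency": 256, "num_gpus": 8, "profile": "Max Throughput",
     "config": {"speculative_config": {"decoding_type": "MTP"}}},
    {"model": "deepseek-r1", "gpu": "h200", "isl": 1024, "osl": 1024,
     "concurrency": 256, "num_gpus": 8, "profile": "Min Latency",
     "config": {"speculative_config": {"decoding_type": "Eagle"}}},
]
hits = exact_matches(recipes, "deepseek-r1", "h200", 1024, 1024, 256, 8)
print([r["profile"] for r in hits])  # -> ['Max Throughput']
```

When several recipes survive this filter, profile labels break the tie per **objective preservation**.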

## Step 2: Nearest Checked-In Config

If no exact match, widen the search to also include `examples/configs/curated/lookup.yaml`.

Apply the same constraints as Step 1. Additionally:
- A partial match from `database/` is preferred over a partial match from `curated/` for the same model (database configs are benchmark-tuned).
- Exclude disaggregated-only or prefill-only entries (e.g., `qwen3-disagg-prefill.yaml`).
- For curated configs, only treat intent as explicit when the repo labels it (e.g., `*-latency.yaml`, `*-throughput.yaml`, or guide text).
- If no in-scope config matches the stated objective, pick the nearest same-model starting point and call out the mismatch.
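The preference order among partial matches can be sketched as a sort key — database source outranks curated, and a profile label matching the stated objective outranks one that does not. The entry schema here is an assumption for illustration:

```python
# Sketch of Step 2's preference order among same-model partial matches.
# Entry keys are assumed for illustration. Lower tuple = preferred.
def rank(entry, objective):
    source_rank = 0 if entry["source"] == "database" else 1
    objective_rank = 0 if (objective and entry.get("profile") == objective) else 1
    return (source_rank, objective_rank)

candidates = [
    {"name": "qwen3-curated-latency", "source": "curated",
     "profile": "Min Latency"},
    {"name": "qwen3-db-balanced", "source": "database",
     "profile": "Balanced"},
]
best = min(candidates, key=lambda e: rank(e, "Min Latency"))
print(best["name"])  # -> qwen3-db-balanced
```

Note the outcome: the database entry wins on source rank even though its label conflicts with the stated objective — that conflict is exactly the mismatch the last bullet says to call out.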

## Step 3: Read Model Docs

Search `docs/source/deployment-guide/` and `examples/models/core/` for the model's deployment guide and README. Read both before adjusting knobs.

**Excluded sources:** Do NOT use `docs/source/legacy/` tuning values or benchmark numbers — those were measured on the TensorRT engine-building backend and do not transfer to PyTorch backend serving.

**DeepSeek-V3 caveat:** For DeepSeek-V3/V3.2-Exp, use `examples/models/core/deepseek_v3/README.md`, not the R1 deployment guide.

## Step 4: Adjust Source-Backed Fields

Commonly scenario-dependent fields (adjust only these, guided by the checked-in source):

`max_batch_size`, `max_num_tokens`, `max_seq_len`, `enable_attention_dp`, `attention_dp_config.*`, `kv_cache_config.free_gpu_memory_fraction`, `moe_expert_parallel_size` (MoE), `moe_config.backend` (when guide specifies), `stream_interval`, `num_postprocess_workers`, `cuda_graph_config.max_batch_size`/`batch_sizes`, and MTP-specific fields when using DeepSeek-R1 MTP configs.

Do not assume other fields are constant across models/GPUs. For tuning notes, read `references/knob-heuristics.md`.
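For example, adapting a throughput-oriented starting config to a latency objective at a shorter ISL might touch only fields from the list above, leaving everything else as copied from the source. All values here are illustrative, not recommendations:

```yaml
# Only scenario-dependent fields changed; the rest stays verbatim
# from the checked-in source config. Values are illustrative.
max_batch_size: 128            # lowered for the latency objective
max_num_tokens: 4096           # sized to ISL + chat template overhead
max_seq_len: 5120              # covers ISL + OSL
kv_cache_config:
  free_gpu_memory_fraction: 0.85
stream_interval: 4
```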

## Validation Checklist

- [ ] `trust_remote_code: true` called out as trust boundary when present
- [ ] `max_num_tokens` >= ISL + chat template overhead (requests rejected if violated)
- [ ] If interpolated: single "What to benchmark" section listing knobs to sweep, not per-field unverified tags
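The `max_num_tokens` item lends itself to a quick pre-flight check. The overhead constant below is an assumed placeholder — measure your model's actual chat-template token overhead:

```python
# Pre-flight check for the max_num_tokens checklist item.
CHAT_TEMPLATE_OVERHEAD = 64  # assumed placeholder; measure your template

def validate_max_num_tokens(max_num_tokens, isl,
                            overhead=CHAT_TEMPLATE_OVERHEAD):
    """Requests whose tokenized prompt exceeds max_num_tokens are
    rejected, so the budget must cover ISL plus template tokens."""
    needed = isl + overhead
    if max_num_tokens < needed:
        raise ValueError(
            f"max_num_tokens={max_num_tokens} < ISL+overhead={needed}; "
            "requests at this ISL would be rejected")
    return True

print(validate_max_num_tokens(8192, isl=4096))  # -> True
```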