
Commit 9d08467

Merge branch 'NVIDIA:main' into fix/ministral-loading
2 parents de09a58 + 3ac2704

693 files changed: +82931 −37751 lines changed
Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
---
name: serve-config-guide
description: Generate a source-backed starting `trtllm-serve --config` YAML for
  basic aggregate single-node PyTorch serving, aligned with checked-in TensorRT-LLM
  configs and deployment docs. Preserves explicit latency / balanced / throughput
  objectives. Excludes disaggregated, multi-node, and non-MTP speculative configs.
---

# Serve Config Guide

**Scope:** aggregate/IFB (in-flight batching) colocated prefill+decode, single node, PyTorch backend, non-speculative by default; DeepSeek-R1 MTP is the standard mode (all checked-in configs include it).

**Input:** model, GPU, ISL (input sequence length), OSL (output sequence length), concurrency, TP, performance objective (`Min Latency` | `Balanced` | `Max Throughput` | unspecified).

**Output:** repo-grounded starting YAML for `trtllm-serve --config`.

If the request is adjacent but out of scope, provide a best-effort answer using the nearest in-scope config as a starting point, clearly label inferred vs. verified fields, and point to the relevant feature doc in `docs/source/features/` (e.g., speculative-decoding, disagg-serving, parallel-strategy) or `examples/llm-api/`.

## Constraints

1. **Speculative exclusion:** Exclude configs containing `speculative_config` by default. Exception: exact checked-in DeepSeek-R1 MTP configs (models with `decoding_type: MTP` in `examples/configs/`). When including MTP, copy the full `speculative_config` block verbatim — never interpolate speculative fields.

2. **Objective preservation:** Preserve the user's stated objective through config selection. Use `database.py` profile labels (`Min Latency`, `Balanced`, `Max Throughput`; plus `Low Latency`/`High Throughput` in smaller sets) as selection aids. If a config is unlabeled, treat it as a default starting point — do not claim it matches a specific objective. If the only match conflicts with the stated objective, call out the mismatch.

3. **Source preference:** Prefer checked-in configs over interpolation. When docs and configs disagree, prefer the config for the exact scenario and note the mismatch. Mark any interpolation as unverified.

## Response Format

For **exact matches**: `Config` → `Source` → `Launch command`

For **interpolated configs**: `Config` → `Source used as starting point` → `What to benchmark` (single list of knobs worth sweeping, not per-field unverified tags)

## Step 0: Lock Objective and Decode Mode

Identify the user's objective (`Min Latency` | `Balanced` | `Max Throughput` | unspecified) and decode mode (non-speculative or DeepSeek-R1 MTP per **Constraint 1**). Preserve both through the remaining steps.

## Step 1: Exact Database Match

Search `examples/configs/database/lookup.yaml` for an exact `(model, gpu, isl, osl, concurrency, num_gpus)` match. Use `database.py` as a loader/helper.

- Apply **speculative exclusion**.
- When multiple recipes exist at different concurrency points, use profile labels to match the user's objective per **objective preservation**.
- Prefer an exact match that also matches the stated objective over manual tuning.
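The Step 1 selection can be sketched in code. This is a hypothetical illustration only — `database.py` is the repo's real loader, and the entry schema here (`model`/`gpu`/`isl`/`osl`/`concurrency`/`num_gpus`/`profile`/`has_speculative` keys) and the sample entries are assumptions, not the repo's actual lookup.yaml format.

```python
def pick_exact_match(entries, model, gpu, isl, osl, concurrency, num_gpus,
                     objective=None, allow_mtp=False):
    """Return one entry matching the full scenario tuple, or None.

    Applies the speculative exclusion (Constraint 1) and, when several
    recipes match, prefers one whose profile label equals the stated
    objective (Constraint 2).
    """
    key = (model, gpu, isl, osl, concurrency, num_gpus)
    matches = [
        e for e in entries
        if (e["model"], e["gpu"], e["isl"], e["osl"],
            e["concurrency"], e["num_gpus"]) == key
        # Exclude speculative configs unless they are checked-in MTP recipes.
        and (not e.get("has_speculative")
             or (allow_mtp and e.get("decoding_type") == "MTP"))
    ]
    if objective:
        labeled = [e for e in matches if e.get("profile") == objective]
        if labeled:
            return labeled[0]
    # Unlabeled fallback: a default starting point, not an objective claim.
    return matches[0] if matches else None

# Hypothetical sample entries (two concurrency-64 recipes, different profiles).
entries = [
    {"model": "m", "gpu": "H100", "isl": 1024, "osl": 1024, "concurrency": 64,
     "num_gpus": 8, "profile": "Max Throughput", "config": "a.yaml"},
    {"model": "m", "gpu": "H100", "isl": 1024, "osl": 1024, "concurrency": 64,
     "num_gpus": 8, "profile": "Min Latency", "config": "b.yaml"},
]
best = pick_exact_match(entries, "m", "H100", 1024, 1024, 64, 8,
                        objective="Min Latency")
```

When the objective is unspecified, the first scenario match is returned as a default starting point, mirroring Constraint 2.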
## Step 2: Nearest Checked-In Config

If no exact match, widen the search to also include `examples/configs/curated/lookup.yaml`.

Apply the same constraints as Step 1. Additionally:
- A partial match from `database/` is preferred over a partial match from `curated/` for the same model (database configs are benchmark-tuned).
- Exclude disaggregated-only or prefill-only entries (e.g., `qwen3-disagg-prefill.yaml`).
- For curated configs, only treat intent as explicit when the repo labels it (e.g., `*-latency.yaml`, `*-throughput.yaml`, or guide text).
- If no in-scope config matches the stated objective, pick the nearest same-model starting point and call out the mismatch.

## Step 3: Read Model Docs

Search `docs/source/deployment-guide/` and `examples/models/core/` for the model's deployment guide and README. Read both before adjusting knobs.

**Excluded sources:** Do NOT use `docs/source/legacy/` tuning values or benchmark numbers — those were measured on the TensorRT engine-building backend and do not transfer to PyTorch backend serving.

**DeepSeek-V3 caveat:** For DeepSeek-V3/V3.2-Exp, use `examples/models/core/deepseek_v3/README.md`, not the R1 deployment guide.

## Step 4: Adjust Source-Backed Fields

Commonly scenario-dependent fields (adjust only these, guided by the checked-in source):

`max_batch_size`, `max_num_tokens`, `max_seq_len`, `enable_attention_dp`, `attention_dp_config.*`, `kv_cache_config.free_gpu_memory_fraction`, `moe_expert_parallel_size` (MoE), `moe_config.backend` (when guide specifies), `stream_interval`, `num_postprocess_workers`, `cuda_graph_config.max_batch_size`/`batch_sizes`, and MTP-specific fields when using DeepSeek-R1 MTP configs.

Do not assume other fields are constant across models/GPUs. For tuning notes, read `references/knob-heuristics.md`.
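The fields above might look like this in a starting config. This is an illustrative shape only, not a checked-in recipe — every value below is a placeholder to be replaced from the nearest `examples/configs/` source:

```yaml
# Illustrative placeholders — copy real values from the nearest checked-in config.
max_batch_size: 256
max_num_tokens: 8192
max_seq_len: 9216                  # ISL + OSL + chat template overhead
enable_attention_dp: false
kv_cache_config:
  free_gpu_memory_fraction: 0.85
cuda_graph_config:
  max_batch_size: 256
stream_interval: 4
```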
## Validation Checklist

- [ ] `trust_remote_code: true` called out as trust boundary when present
- [ ] `max_num_tokens` >= ISL + chat template overhead (requests rejected if violated)
- [ ] If interpolated: single "What to benchmark" section listing knobs to sweep, not per-field unverified tags
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
# Source-Backed Tuning Notes

Read an exact or nearby checked-in config and the model's deployment guide **before** using these notes. These are not universal thresholds.

## Commonly Tuned Fields

| Field | Guidance |
|---|---|
| `max_batch_size` | Scheduler ceiling, not a memory reservation and NOT proportional to concurrency — actual batch size adapts at runtime. Copy from the nearest checked-in source config; do not invent a value from concurrency. Prefer keeping the source value unless OOM occurs. MoE models generally cap lower than dense. |
| `max_num_tokens` | Scheduler token budget. When chunked prefill is **disabled** (default): must exceed ISL plus chat template overhead; sweet spot is ISL to 2× ISL. When chunked prefill is **enabled**: acts as the chunk size — see `enable_chunked_prefill` section below. General default is 8192. Tune together with `max_batch_size`. |
| `max_seq_len` | Global hard cap on total tokens per request (prompt + output). Set to `ISL + OSL + chat_template_overhead`. Chat templates and benchmarking preambles add tokens beyond raw ISL — overhead varies by model (checked-in configs show 20–200 tokens). Setting too tight rejects or truncates requests; setting too loose wastes KV cache per request. Copy from nearest checked-in config when available. |
| `enable_attention_dp` | High-throughput knob. MoE+GQA models benefit at lower concurrency thresholds than MoE+MLA or Dense+GQA. Memory overhead: small for MLA (compressed attention), substantial for GQA (full replication). Can trigger OOM when combined with aggressive KV cache fraction. Follow the exact model guide/config. |
| `kv_cache_config.free_gpu_memory_fraction` | OOM lever. MLA models (compressed KV) tolerate higher fractions; GQA models need more headroom. Lower when ADP enabled to account for replicated attention overhead. Large MoE models with ADP may need notably conservative fractions. Guides often adjust `max_batch_size` or `max_seq_len` first. |
| `moe_expert_parallel_size` / `moe_config.backend` | MoE only. Copy both from checked-in source — EP does not necessarily equal TP. If no backend source exists, mark as unverified; benchmark CUTLASS vs TRTLLM. |
| `cuda_graph_config.max_batch_size` / `batch_sizes` | Caps which decode batch sizes get CUDA graphs captured; batches above this fall back to eager execution (no error, just slower). **Default to `max_batch_size`** (safe, covers all batch sizes). Only lower when memory is tight — e.g., DeepSeek-R1 conc=1 uses `cuda_graph_config.max_batch_size: 1` with server `max_batch_size: 512` to avoid wasting graph memory on unreachable sizes. Also capped by `max_num_tokens / (1 + max_total_draft_tokens)` at runtime. |

## KV Cache Estimation

Use these formulas to sanity-check whether a concurrency target fits in GPU memory. Read the required values from the model's HuggingFace config (`config.json`).

**Per-token KV cache size:**

- **GQA (standard grouped-query attention):**
  `kv_per_token = 2 × num_attention_layers × (num_key_value_heads / TP) × head_dim × dtype_bytes`
  When `enable_attention_dp` is enabled, KV cache is fully replicated per rank (not TP-sharded); use divisor 1 instead of TP.
- **MLA (multi-latent attention, e.g. DeepSeek-V2/V3):**
  `kv_per_token = num_attention_layers × (kv_lora_rank + qk_rope_head_dim) × dtype_bytes`

Where `dtype_bytes` is 2 for BF16/FP16, 1 for FP8/INT8.

**Approximate max concurrent requests (upper bound):**

```
max_requests ≈ floor((GPU_HBM × 0.90 − model_weights_bytes / TP) / (kv_per_token × (ISL + OSL)))
```

The 0.90 factor reserves ~10% of HBM for CUDA context, driver, and runtime overhead. Result is per-GPU.

**HF config fields to read:** `num_attention_layers` (equals `num_hidden_layers` for standard transformers; differs for hybrid models like Nemotron-H), `num_key_value_heads`, `head_dim` (or `hidden_size / num_attention_heads`), `kv_lora_rank`, `qk_rope_head_dim`.

**Caveats:** This estimate ignores activation memory, CUDA graph workspace, MoE expert workspace, and attention data parallelism (ADP) overhead. Always prefer checked-in config values over formula-derived estimates. Mark any formula-derived number as unverified.
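The formulas above transcribe directly into a sanity-check script. The example numbers (32 layers, 8 KV heads, head dim 128, 80 GB HBM, 16 GB of weights, ISL 2048 / OSL 256) are hypothetical placeholders, not values from any checked-in config; read real ones from the model's `config.json`.

```python
import math

def kv_per_token_gqa(layers, kv_heads, head_dim, dtype_bytes, tp, adp=False):
    # With attention DP the KV cache is replicated per rank, not TP-sharded.
    divisor = 1 if adp else tp
    return 2 * layers * (kv_heads / divisor) * head_dim * dtype_bytes

def kv_per_token_mla(layers, kv_lora_rank, qk_rope_head_dim, dtype_bytes):
    return layers * (kv_lora_rank + qk_rope_head_dim) * dtype_bytes

def max_requests(hbm_bytes, weights_bytes, tp, kv_per_token, isl, osl):
    # ~10% of HBM reserved for CUDA context / driver / runtime; bound is per-GPU.
    free = hbm_bytes * 0.90 - weights_bytes / tp
    return math.floor(free / (kv_per_token * (isl + osl)))

# GQA example: 32 layers, 8 KV heads, head_dim 128, BF16, TP=1
kv = kv_per_token_gqa(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2, tp=1)

# MLA example with DeepSeek-V3-style values (verify against config.json):
kv_mla = kv_per_token_mla(layers=61, kv_lora_rank=512, qk_rope_head_dim=64,
                          dtype_bytes=1)

bound = max_requests(hbm_bytes=80e9, weights_bytes=16e9, tp=1,
                     kv_per_token=kv, isl=2048, osl=256)
```

Per the caveats, treat `bound` as an optimistic upper bound and mark it unverified; it ignores activation, graph, and expert workspace memory.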
## Chunked Prefill

Chunked prefill (`enable_chunked_prefill: true`) splits long prefill sequences into chunks so that decode batches sharing the same iteration are not starved. It is **disabled by default** and should be treated as an advanced latency optimization, not a default recommendation. See the `max_num_tokens` table entry above for how it changes token budget semantics.

**MLA models (DeepSeek-V2/V3/R1, Kimi-K2):**
- Chunked prefill IS supported for MLA — dedicated CUDA kernels exist with multi-round attention and softmax merging.
- **Hardware constraint:** only available on SM90 (Hopper) and SM100/SM103/SM120 (Blackwell+). The runtime automatically disables it with a warning on older GPUs.
- **Trade-off:** *"primarily designed to reduce TPOT [...] will also decrease overall throughput."*
- **Recommendation:** do not enable by default for MLA models. Consider it only for latency-sensitive workloads on Hopper or Blackwell GPUs where TPOT reduction outweighs the throughput cost.

**Non-MLA models (GQA):** more broadly supported across GPU generations. Still disabled by default; enable when long prefill sequences cause decode latency spikes.
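For workloads that do opt in, the toggle is small. The value below is a hypothetical placeholder; per the `max_num_tokens` table entry above, it becomes the chunk size once chunked prefill is enabled:

```yaml
# Hypothetical fragment — only enable after weighing the TPOT/throughput trade-off above.
enable_chunked_prefill: true
max_num_tokens: 2048   # with chunked prefill enabled, this is the chunk size
```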

.github/CODEOWNERS

Lines changed: 1 addition & 0 deletions
@@ -165,6 +165,7 @@ docs/source/performance/perf-benchmarking.md @NVIDIA/trtllm-bench-reviewers
 /tensorrt_llm/executor @NVIDIA/trt-llm-llmapi-devs
 /tensorrt_llm/serve @NVIDIA/trt-llm-llmapi-devs
 /tensorrt_llm/commands @NVIDIA/trt-llm-llmapi-devs
+/tensorrt_llm/visual_gen @NVIDIA/trt-llm-llmapi-devs

 ## TensorRT-LLM LLM Disaggregated
 /examples/disaggregated @NVIDIA/trt-llm-disagg-devs @NVIDIA/trt-llm-doc-owners

.github/workflows/blossom-ci.yml

Lines changed: 4 additions & 0 deletions
@@ -191,6 +191,7 @@ jobs:
 "litaotju",
 "liyuhannnnn",
 "lkomali",
+"longcheng-nv",
 "longlee0622",
 "lowsfer",
 "lucaslie",
@@ -293,6 +294,7 @@ jobs:
 "tcherckez-nvidia",
 "thorjohnsen",
 "tianyuxbear",
+"tianyuz-nv",
 "tiffany940107",
 "tijyojwad",
 "timlee0212",
@@ -332,11 +334,13 @@ jobs:
 "xueweilnvidia",
 "xupinjie",
 "xuwchen",
+"xwang233",
 "xxi-nv",
 "yali-arch",
 "yechank-nvidia",
 "yibinl-nvidia",
 "yifeizhang-c",
+"YihuiLu512",
 "yihwang-nv",
 "yijingl-nvidia",
 "yilin-void",

.github/workflows/model-registry-check.yml

Lines changed: 0 additions & 40 deletions
This file was deleted.

.gitignore

Lines changed: 0 additions & 1 deletion
@@ -62,7 +62,6 @@ tensorrt_llm/scripts
 *docs/source/_cpp_gen*
 docs/source/**/*.rst
 !docs/source/examples/index.rst
-!docs/source/deployment-guide/config_table.rst
 !docs/source/_includes/note_sections.rst
 *.swp

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -1447,7 +1447,7 @@ repos:
 additional_dependencies:
 - tomli
 # add ignore words list
-args: ["-L", "Mor,ans,thirdparty,subtiles,PARD,pard,therefrom", "--skip", "ATTRIBUTIONS-*.md,*.svg", "--skip", "security_scanning/*", "--skip", "tensorrt_llm/_torch/visual_gen/jit_kernels/*"]
+args: ["-L", "Mor,ans,thirdparty,subtiles,PARD,pard,indx,therefrom", "--skip", "ATTRIBUTIONS-*.md,*.svg", "--skip", "security_scanning/*", "--skip", "tensorrt_llm/_torch/visual_gen/jit_kernels/*"]
 exclude: 'scripts/attribution/data/cas/.*$'
 - repo: https://github.com/astral-sh/ruff-pre-commit
 rev: v0.9.4

3rdparty/CMakeLists.txt

Lines changed: 5 additions & 6 deletions
@@ -55,16 +55,15 @@ foreach(DEP_IDX RANGE ${DEP_COUNT_MINUS_ONE})
 endif()

 if(DEP_PATCH_FILE AND NOT DEP_PATCH_FILE STREQUAL "")
+set(_patch_file "${CMAKE_CURRENT_SOURCE_DIR}/${DEP_PATCH_FILE}")
 list(
 APPEND
 FETCH_ARGS
 PATCH_COMMAND
-patch
--p1
---forward
---batch
--i
-"${CMAKE_CURRENT_SOURCE_DIR}/${DEP_PATCH_FILE}")
+bash
+-c
+"patch -p1 --forward --batch --dry-run -i '${_patch_file}' && patch -p1 --forward --batch -i '${_patch_file}' || echo 'Patch already applied, skipping.'"
+)
 endif()

 FetchContent_Declare(${FETCH_ARGS})

AGENTS.md

Lines changed: 2 additions & 2 deletions
@@ -55,8 +55,6 @@ See [architecture diagram](.github/tava_architecture_diagram.md) for the full Me
 | **AutoDeploy** | Beta | `_torch/auto_deploy/` shim | `_torch/auto_deploy/shim/ad_executor.py` → adapts `PyExecutor` → graph transforms + torch.export |
 | **TensorRT** | Legacy | `TrtLlmArgs` | `builder.py` → `trtllm.Executor` → TensorRT Engine |

-> **Note:** The `LLM(backend="...")` parameter still works but is **deprecated**. Prefer using `TorchLlmArgs` or `TrtLlmArgs` directly.
-
 ### Shared C++ Core (via Nanobind)

 Both PyTorch and TensorRT backends share these C++ components:
@@ -85,6 +83,7 @@ HuggingFace Model → LLM API → Executor (PyTorch/AutoDeploy/TensorRT)
 | `tensorrt_llm/executor/executor.py` | Execution abstraction (`GenerationExecutor`) |
 | `tensorrt_llm/models/automodel.py` | Auto-discovery and model registry |
 | `tensorrt_llm/_torch/models/` | PyTorch backend model implementations (distinct from `models/` used by TensorRT backend) |
+| `tensorrt_llm/_torch/modules/fused_moe/MOE_DEVELOPER_GUIDE.md` | MoE architecture, backends, communication, development patterns — **read before modifying MoE code** |
 | `CODING_GUIDELINES.md` | C++ and Python coding standards (referenced throughout, must read before contributing) |

 ## Design Patterns
@@ -107,6 +106,7 @@ HuggingFace Model → LLM API → Executor (PyTorch/AutoDeploy/TensorRT)
 - **Avoid broad exception handling** — catch specific exceptions, not bare `except:` (see `CODING_GUIDELINES.md`).
 - **One concern per PR** — avoid scope creep. If a PR touches unrelated areas, split it.
 - **User-facing configuration classes** - when editing or defining any user-facing configuration classes (particularly `BaseLlmArgs` or any class used in its fields), you **MUST** follow the Pydantic guidelines in `CODING_GUIDELINES.md`.
+- **TensorRT backend is legacy** — `TrtLlmArgs` / `backend="tensorrt"` and all exclusive tooling (`trtllm-build`, `trtllm-refit`, `convert_checkpoint.py`, `ModelRunner*`) are legacy. Bug fixes OK; new features target PyTorch or AutoDeploy.

 ## Development Workflow
ATTRIBUTIONS-Python.md

Lines changed: 3 additions & 3 deletions
@@ -62375,7 +62375,7 @@ Copyright 2018- The Hugging Face team. All rights reserved.
 - `Homepage`: https://github.com/huggingface/transformers


-## triton (3.5.1)
+## triton (3.6.0)

 ### Licenses
 License: `MIT License`
@@ -62413,7 +62413,7 @@ License: `MIT License`
 - `Homepage`: https://github.com/triton-lang/triton/


-## triton-kernels (3.5.1)
+## triton-kernels (3.6.0)

 ### Licenses
 License: `MIT License`
@@ -62444,7 +62444,7 @@ SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 ```

 ### URLs
-- `Source`: https://github.com/triton-lang/triton/tree/v3.5.1/python/triton_kernels
+- `Source`: https://github.com/triton-lang/triton/tree/v3.6.0/python/triton_kernels


 ## tritonclient (2.63.0)
