
Commit 9d08467

Merge branch 'NVIDIA:main' into fix/ministral-loading
2 parents de09a58 + 3ac2704

693 files changed: +82931 −37751 lines changed
Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
---
name: serve-config-guide
description: Generate a source-backed starting `trtllm-serve --config` YAML for
  basic aggregate single-node PyTorch serving, aligned with checked-in TensorRT-LLM
  configs and deployment docs. Preserves explicit latency / balanced / throughput
  objectives. Excludes disaggregated, multi-node, and non-MTP speculative configs.
---

# Serve Config Guide

**Scope:** aggregate/IFB (in-flight batching) colocated prefill+decode, single node, PyTorch backend, non-speculative by default; DeepSeek-R1 MTP is the standard mode (all checked-in configs include it).

**Input:** model, GPU, ISL (input sequence length), OSL (output sequence length), concurrency, TP, performance objective (`Min Latency` | `Balanced` | `Max Throughput` | unspecified).

**Output:** repo-grounded starting YAML for `trtllm-serve --config`.

If the request is adjacent but out of scope, provide a best-effort answer using the nearest in-scope config as a starting point, clearly label inferred vs. verified fields, and point to the relevant feature doc in `docs/source/features/` (e.g., speculative-decoding, disagg-serving, parallel-strategy) or `examples/llm-api/`.

## Constraints

1. **Speculative exclusion:** Exclude configs containing `speculative_config` by default. Exception: exact checked-in DeepSeek-R1 MTP configs (models with `decoding_type: MTP` in `examples/configs/`). When including MTP, copy the full `speculative_config` block verbatim — never interpolate speculative fields.

2. **Objective preservation:** Preserve the user's stated objective through config selection. Use `database.py` profile labels (`Min Latency`, `Balanced`, `Max Throughput`; plus `Low Latency`/`High Throughput` in smaller sets) as selection aids. If a config is unlabeled, treat it as a default starting point — do not claim it matches a specific objective. If the only match conflicts with the stated objective, call out the mismatch.

3. **Source preference:** Prefer checked-in configs over interpolation. When docs and configs disagree, prefer the config for the exact scenario and note the mismatch. Mark any interpolation as unverified.

## Response Format

For **exact matches**: `Config` → `Source` → `Launch command`

For **interpolated configs**: `Config` → `Source used as starting point` → `What to benchmark` (single list of knobs worth sweeping, not per-field unverified tags)

## Step 0: Lock Objective and Decode Mode

Identify the user's objective (`Min Latency` | `Balanced` | `Max Throughput` | unspecified) and decode mode (non-speculative or DeepSeek-R1 MTP per **Constraint 1**). Preserve both through the remaining steps.

## Step 1: Exact Database Match

Search `examples/configs/database/lookup.yaml` for an exact `(model, gpu, isl, osl, concurrency, num_gpus)` match. Use `database.py` as a loader/helper.

- Apply **speculative exclusion**.
- When multiple recipes exist at different concurrency points, use profile labels to match the user's objective per **objective preservation**.
- Prefer an exact match that also matches the stated objective over manual tuning.
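The Step 1 selection can be sketched in code. This is a hypothetical illustration only — `database.py` is the repo's real loader, and the entry schema here (`model`/`gpu`/`isl`/`osl`/`concurrency`/`num_gpus`/`profile`/`has_speculative` keys) and the sample entries are assumptions, not the repo's actual lookup.yaml format.

```python
def pick_exact_match(entries, model, gpu, isl, osl, concurrency, num_gpus,
                     objective=None, allow_mtp=False):
    """Return one entry matching the full scenario tuple, or None.

    Applies the speculative exclusion (Constraint 1) and, when several
    recipes match, prefers one whose profile label equals the stated
    objective (Constraint 2).
    """
    key = (model, gpu, isl, osl, concurrency, num_gpus)
    matches = [
        e for e in entries
        if (e["model"], e["gpu"], e["isl"], e["osl"],
            e["concurrency"], e["num_gpus"]) == key
        # Exclude speculative configs unless they are checked-in MTP recipes.
        and (not e.get("has_speculative")
             or (allow_mtp and e.get("decoding_type") == "MTP"))
    ]
    if objective:
        labeled = [e for e in matches if e.get("profile") == objective]
        if labeled:
            return labeled[0]
    # Unlabeled fallback: a default starting point, not an objective claim.
    return matches[0] if matches else None

# Hypothetical sample entries (two concurrency-64 recipes, different profiles).
entries = [
    {"model": "m", "gpu": "H100", "isl": 1024, "osl": 1024, "concurrency": 64,
     "num_gpus": 8, "profile": "Max Throughput", "config": "a.yaml"},
    {"model": "m", "gpu": "H100", "isl": 1024, "osl": 1024, "concurrency": 64,
     "num_gpus": 8, "profile": "Min Latency", "config": "b.yaml"},
]
best = pick_exact_match(entries, "m", "H100", 1024, 1024, 64, 8,
                        objective="Min Latency")
```

When the objective is unspecified, the first scenario match is returned as a default starting point, mirroring Constraint 2.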
## Step 2: Nearest Checked-In Config

If no exact match, widen the search to also include `examples/configs/curated/lookup.yaml`.

Apply the same constraints as Step 1. Additionally:
- A partial match from `database/` is preferred over a partial match from `curated/` for the same model (database configs are benchmark-tuned).
- Exclude disaggregated-only or prefill-only entries (e.g., `qwen3-disagg-prefill.yaml`).
- For curated configs, only treat intent as explicit when the repo labels it (e.g., `*-latency.yaml`, `*-throughput.yaml`, or guide text).
- If no in-scope config matches the stated objective, pick the nearest same-model starting point and call out the mismatch.

## Step 3: Read Model Docs

Search `docs/source/deployment-guide/` and `examples/models/core/` for the model's deployment guide and README. Read both before adjusting knobs.

**Excluded sources:** Do NOT use `docs/source/legacy/` tuning values or benchmark numbers — those were measured on the TensorRT engine-building backend and do not transfer to PyTorch backend serving.

**DeepSeek-V3 caveat:** For DeepSeek-V3/V3.2-Exp, use `examples/models/core/deepseek_v3/README.md`, not the R1 deployment guide.

## Step 4: Adjust Source-Backed Fields

Commonly scenario-dependent fields (adjust only these, guided by the checked-in source):

`max_batch_size`, `max_num_tokens`, `max_seq_len`, `enable_attention_dp`, `attention_dp_config.*`, `kv_cache_config.free_gpu_memory_fraction`, `moe_expert_parallel_size` (MoE), `moe_config.backend` (when guide specifies), `stream_interval`, `num_postprocess_workers`, `cuda_graph_config.max_batch_size`/`batch_sizes`, and MTP-specific fields when using DeepSeek-R1 MTP configs.

Do not assume other fields are constant across models/GPUs. For tuning notes, read `references/knob-heuristics.md`.
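The fields above might look like this in a starting config. This is an illustrative shape only, not a checked-in recipe — every value below is a placeholder to be replaced from the nearest `examples/configs/` source:

```yaml
# Illustrative placeholders — copy real values from the nearest checked-in config.
max_batch_size: 256
max_num_tokens: 8192
max_seq_len: 9216                  # ISL + OSL + chat template overhead
enable_attention_dp: false
kv_cache_config:
  free_gpu_memory_fraction: 0.85
cuda_graph_config:
  max_batch_size: 256
stream_interval: 4
```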
## Validation Checklist

- [ ] `trust_remote_code: true` called out as trust boundary when present
- [ ] `max_num_tokens` >= ISL + chat template overhead (requests rejected if violated)
- [ ] If interpolated: single "What to benchmark" section listing knobs to sweep, not per-field unverified tags
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
# Source-Backed Tuning Notes

Read an exact or nearby checked-in config and the model's deployment guide **before** using these notes. These are not universal thresholds.

## Commonly Tuned Fields

| Field | Guidance |
|---|---|
| `max_batch_size` | Scheduler ceiling, not a memory reservation and NOT proportional to concurrency — actual batch size adapts at runtime. Copy from the nearest checked-in source config; do not invent a value from concurrency. Prefer keeping the source value unless OOM occurs. MoE models generally cap lower than dense. |
| `max_num_tokens` | Scheduler token budget. When chunked prefill is **disabled** (default): must exceed ISL plus chat template overhead; sweet spot is ISL to 2× ISL. When chunked prefill is **enabled**: acts as the chunk size — see `enable_chunked_prefill` section below. General default is 8192. Tune together with `max_batch_size`. |
| `max_seq_len` | Global hard cap on total tokens per request (prompt + output). Set to `ISL + OSL + chat_template_overhead`. Chat templates and benchmarking preambles add tokens beyond raw ISL — overhead varies by model (checked-in configs show 20–200 tokens). Setting too tight rejects or truncates requests; setting too loose wastes KV cache per request. Copy from nearest checked-in config when available. |
| `enable_attention_dp` | High-throughput knob. MoE+GQA models benefit at lower concurrency thresholds than MoE+MLA or Dense+GQA. Memory overhead: small for MLA (compressed attention), substantial for GQA (full replication). Can trigger OOM when combined with aggressive KV cache fraction. Follow the exact model guide/config. |
| `kv_cache_config.free_gpu_memory_fraction` | OOM lever. MLA models (compressed KV) tolerate higher fractions; GQA models need more headroom. Lower when ADP enabled to account for replicated attention overhead. Large MoE models with ADP may need notably conservative fractions. Guides often adjust `max_batch_size` or `max_seq_len` first. |
| `moe_expert_parallel_size` / `moe_config.backend` | MoE only. Copy both from checked-in source — EP does not necessarily equal TP. If no backend source exists, mark as unverified; benchmark CUTLASS vs TRTLLM. |
| `cuda_graph_config.max_batch_size` / `batch_sizes` | Caps which decode batch sizes get CUDA graphs captured; batches above this fall back to eager execution (no error, just slower). **Default to `max_batch_size`** (safe, covers all batch sizes). Only lower when memory is tight — e.g., DeepSeek-R1 conc=1 uses `cuda_graph_config.max_batch_size: 1` with server `max_batch_size: 512` to avoid wasting graph memory on unreachable sizes. Also capped by `max_num_tokens / (1 + max_total_draft_tokens)` at runtime. |

## KV Cache Estimation

Use these formulas to sanity-check whether a concurrency target fits in GPU memory. Read the required values from the model's HuggingFace config (`config.json`).

**Per-token KV cache size:**

- **GQA (standard grouped-query attention):**
  `kv_per_token = 2 × num_attention_layers × (num_key_value_heads / TP) × head_dim × dtype_bytes`
  When `enable_attention_dp` is enabled, KV cache is fully replicated per rank (not TP-sharded); use divisor 1 instead of TP.
- **MLA (multi-latent attention, e.g. DeepSeek-V2/V3):**
  `kv_per_token = num_attention_layers × (kv_lora_rank + qk_rope_head_dim) × dtype_bytes`

Where `dtype_bytes` is 2 for BF16/FP16, 1 for FP8/INT8.

**Approximate max concurrent requests (upper bound):**

```
max_requests ≈ floor((GPU_HBM × 0.90 − model_weights_bytes / TP) / (kv_per_token × (ISL + OSL)))
```

The 0.90 factor reserves ~10% of HBM for CUDA context, driver, and runtime overhead. Result is per-GPU.

**HF config fields to read:** `num_attention_layers` (equals `num_hidden_layers` for standard transformers; differs for hybrid models like Nemotron-H), `num_key_value_heads`, `head_dim` (or `hidden_size / num_attention_heads`), `kv_lora_rank`, `qk_rope_head_dim`.

**Caveats:** This estimate ignores activation memory, CUDA graph workspace, MoE expert workspace, and attention data parallelism (ADP) overhead. Always prefer checked-in config values over formula-derived estimates. Mark any formula-derived number as unverified.
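The formulas above transcribe directly into a sanity-check script. The example numbers (32 layers, 8 KV heads, head dim 128, 80 GB HBM, 16 GB of weights, ISL 2048 / OSL 256) are hypothetical placeholders, not values from any checked-in config; read real ones from the model's `config.json`.

```python
import math

def kv_per_token_gqa(layers, kv_heads, head_dim, dtype_bytes, tp, adp=False):
    # With attention DP the KV cache is replicated per rank, not TP-sharded.
    divisor = 1 if adp else tp
    return 2 * layers * (kv_heads / divisor) * head_dim * dtype_bytes

def kv_per_token_mla(layers, kv_lora_rank, qk_rope_head_dim, dtype_bytes):
    return layers * (kv_lora_rank + qk_rope_head_dim) * dtype_bytes

def max_requests(hbm_bytes, weights_bytes, tp, kv_per_token, isl, osl):
    # ~10% of HBM reserved for CUDA context / driver / runtime; bound is per-GPU.
    free = hbm_bytes * 0.90 - weights_bytes / tp
    return math.floor(free / (kv_per_token * (isl + osl)))

# GQA example: 32 layers, 8 KV heads, head_dim 128, BF16, TP=1
kv = kv_per_token_gqa(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2, tp=1)

# MLA example with DeepSeek-V3-style values (verify against config.json):
kv_mla = kv_per_token_mla(layers=61, kv_lora_rank=512, qk_rope_head_dim=64,
                          dtype_bytes=1)

bound = max_requests(hbm_bytes=80e9, weights_bytes=16e9, tp=1,
                     kv_per_token=kv, isl=2048, osl=256)
```

Per the caveats, treat `bound` as an optimistic upper bound and mark it unverified; it ignores activation, graph, and expert workspace memory.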
## Chunked Prefill

Chunked prefill (`enable_chunked_prefill: true`) splits long prefill sequences into chunks so that decode batches sharing the same iteration are not starved. It is **disabled by default** and should be treated as an advanced latency optimization, not a default recommendation. See the `max_num_tokens` table entry above for how it changes token budget semantics.

**MLA models (DeepSeek-V2/V3/R1, Kimi-K2):**
- Chunked prefill IS supported for MLA — dedicated CUDA kernels exist with multi-round attention and softmax merging.
- **Hardware constraint:** only available on SM90 (Hopper) and SM100/SM103/SM120 (Blackwell+). The runtime automatically disables it with a warning on older GPUs.
- **Trade-off:** *"primarily designed to reduce TPOT [...] will also decrease overall throughput."*
- **Recommendation:** do not enable by default for MLA models. Consider it only for latency-sensitive workloads on Hopper or Blackwell GPUs where TPOT reduction outweighs the throughput cost.

**Non-MLA models (GQA):** more broadly supported across GPU generations. Still disabled by default; enable when long prefill sequences cause decode latency spikes.
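For workloads that do opt in, the toggle is small. The value below is a hypothetical placeholder; per the `max_num_tokens` table entry above, it becomes the chunk size once chunked prefill is enabled:

```yaml
# Hypothetical fragment — only enable after weighing the TPOT/throughput trade-off above.
enable_chunked_prefill: true
max_num_tokens: 2048   # with chunked prefill enabled, this is the chunk size
```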

.github/CODEOWNERS

Lines changed: 1 addition & 0 deletions
@@ -165,6 +165,7 @@ docs/source/performance/perf-benchmarking.md @NVIDIA/trtllm-bench-reviewers
 /tensorrt_llm/executor @NVIDIA/trt-llm-llmapi-devs
 /tensorrt_llm/serve @NVIDIA/trt-llm-llmapi-devs
 /tensorrt_llm/commands @NVIDIA/trt-llm-llmapi-devs
+/tensorrt_llm/visual_gen @NVIDIA/trt-llm-llmapi-devs

 ## TensorRT-LLM LLM Disaggregated
 /examples/disaggregated @NVIDIA/trt-llm-disagg-devs @NVIDIA/trt-llm-doc-owners

.github/workflows/blossom-ci.yml

Lines changed: 4 additions & 0 deletions
@@ -191,6 +191,7 @@ jobs:
 "litaotju",
 "liyuhannnnn",
 "lkomali",
+"longcheng-nv",
 "longlee0622",
 "lowsfer",
 "lucaslie",
@@ -293,6 +294,7 @@ jobs:
 "tcherckez-nvidia",
 "thorjohnsen",
 "tianyuxbear",
+"tianyuz-nv",
 "tiffany940107",
 "tijyojwad",
 "timlee0212",
@@ -332,11 +334,13 @@ jobs:
 "xueweilnvidia",
 "xupinjie",
 "xuwchen",
+"xwang233",
 "xxi-nv",
 "yali-arch",
 "yechank-nvidia",
 "yibinl-nvidia",
 "yifeizhang-c",
+"YihuiLu512",
 "yihwang-nv",
 "yijingl-nvidia",
 "yilin-void",

.github/workflows/model-registry-check.yml

Lines changed: 0 additions & 40 deletions
This file was deleted.

.gitignore

Lines changed: 0 additions & 1 deletion
@@ -62,7 +62,6 @@ tensorrt_llm/scripts
 *docs/source/_cpp_gen*
 docs/source/**/*.rst
 !docs/source/examples/index.rst
-!docs/source/deployment-guide/config_table.rst
 !docs/source/_includes/note_sections.rst
 *.swp

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -1447,7 +1447,7 @@ repos:
 additional_dependencies:
 - tomli
 # add ignore words list
-args: ["-L", "Mor,ans,thirdparty,subtiles,PARD,pard,therefrom", "--skip", "ATTRIBUTIONS-*.md,*.svg", "--skip", "security_scanning/*", "--skip", "tensorrt_llm/_torch/visual_gen/jit_kernels/*"]
+args: ["-L", "Mor,ans,thirdparty,subtiles,PARD,pard,indx,therefrom", "--skip", "ATTRIBUTIONS-*.md,*.svg", "--skip", "security_scanning/*", "--skip", "tensorrt_llm/_torch/visual_gen/jit_kernels/*"]
 exclude: 'scripts/attribution/data/cas/.*$'
 - repo: https://github.com/astral-sh/ruff-pre-commit
 rev: v0.9.4

3rdparty/CMakeLists.txt

Lines changed: 5 additions & 6 deletions
@@ -55,16 +55,15 @@ foreach(DEP_IDX RANGE ${DEP_COUNT_MINUS_ONE})
 endif()

 if(DEP_PATCH_FILE AND NOT DEP_PATCH_FILE STREQUAL "")
+set(_patch_file "${CMAKE_CURRENT_SOURCE_DIR}/${DEP_PATCH_FILE}")
 list(
 APPEND
 FETCH_ARGS
 PATCH_COMMAND
-patch
--p1
---forward
---batch
--i
-"${CMAKE_CURRENT_SOURCE_DIR}/${DEP_PATCH_FILE}")
+bash
+-c
+"patch -p1 --forward --batch --dry-run -i '${_patch_file}' && patch -p1 --forward --batch -i '${_patch_file}' || echo 'Patch already applied, skipping.'"
+)
 endif()

 FetchContent_Declare(${FETCH_ARGS})

AGENTS.md

Lines changed: 2 additions & 2 deletions
@@ -55,8 +55,6 @@ See [architecture diagram](.github/tava_architecture_diagram.md) for the full Me
 | **AutoDeploy** | Beta | `_torch/auto_deploy/` shim | `_torch/auto_deploy/shim/ad_executor.py` → adapts `PyExecutor` → graph transforms + torch.export |
 | **TensorRT** | Legacy | `TrtLlmArgs` | `builder.py` → `trtllm.Executor` → TensorRT Engine |

-> **Note:** The `LLM(backend="...")` parameter still works but is **deprecated**. Prefer using `TorchLlmArgs` or `TrtLlmArgs` directly.
-
 ### Shared C++ Core (via Nanobind)

 Both PyTorch and TensorRT backends share these C++ components:
@@ -85,6 +83,7 @@ HuggingFace Model → LLM API → Executor (PyTorch/AutoDeploy/TensorRT)
 | `tensorrt_llm/executor/executor.py` | Execution abstraction (`GenerationExecutor`) |
 | `tensorrt_llm/models/automodel.py` | Auto-discovery and model registry |
 | `tensorrt_llm/_torch/models/` | PyTorch backend model implementations (distinct from `models/` used by TensorRT backend) |
+| `tensorrt_llm/_torch/modules/fused_moe/MOE_DEVELOPER_GUIDE.md` | MoE architecture, backends, communication, development patterns — **read before modifying MoE code** |
 | `CODING_GUIDELINES.md` | C++ and Python coding standards (referenced throughout, must read before contributing) |

 ## Design Patterns
@@ -107,6 +106,7 @@ HuggingFace Model → LLM API → Executor (PyTorch/AutoDeploy/TensorRT)
 - **Avoid broad exception handling** — catch specific exceptions, not bare `except:` (see `CODING_GUIDELINES.md`).
 - **One concern per PR** — avoid scope creep. If a PR touches unrelated areas, split it.
 - **User-facing configuration classes** - when editing or defining any user-facing configuration classes (particularly `BaseLlmArgs` or any class used in its fields), you **MUST** follow the Pydantic guidelines in `CODING_GUIDELINES.md`.
+- **TensorRT backend is legacy** — `TrtLlmArgs` / `backend="tensorrt"` and all exclusive tooling (`trtllm-build`, `trtllm-refit`, `convert_checkpoint.py`, `ModelRunner*`) are legacy. Bug fixes OK; new features target PyTorch or AutoDeploy.

 ## Development Workflow
ATTRIBUTIONS-Python.md

Lines changed: 3 additions & 3 deletions
@@ -62375,7 +62375,7 @@ Copyright 2018- The Hugging Face team. All rights reserved.
 - `Homepage`: https://github.com/huggingface/transformers


-## triton (3.5.1)
+## triton (3.6.0)

 ### Licenses
 License: `MIT License`
@@ -62413,7 +62413,7 @@ License: `MIT License`
 - `Homepage`: https://github.com/triton-lang/triton/


-## triton-kernels (3.5.1)
+## triton-kernels (3.6.0)

 ### Licenses
 License: `MIT License`
@@ -62444,7 +62444,7 @@ SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 ```

 ### URLs
-- `Source`: https://github.com/triton-lang/triton/tree/v3.5.1/python/triton_kernels
+- `Source`: https://github.com/triton-lang/triton/tree/v3.6.0/python/triton_kernels


 ## tritonclient (2.63.0)
