Commit 990110a

Merge branch 'main' into test/unwaive-nvbug-6224637
2 parents ea2788c + b8d17d7 commit 990110a

50 files changed

Lines changed: 1609 additions & 240 deletions

Some content is hidden

docs/source/_static/config_db.json

Lines changed: 16 additions & 0 deletions
@@ -12,6 +12,18 @@
 "model_url": "https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
 "scenario": "Max Throughput"
 },
+{
+"command": "trtllm-serve nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 --config ${TRTLLM_DIR}/examples/configs/curated/nemotron-3-ultra-throughput.yaml",
+"config_filename": "nemotron-3-ultra-throughput.yaml",
+"config_github_url": "https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/nemotron-3-ultra-throughput.yaml",
+"config_path": "examples/configs/curated/nemotron-3-ultra-throughput.yaml",
+"config_raw_url": "https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/configs/curated/nemotron-3-ultra-throughput.yaml",
+"gpu_compatibility": "B200, B300, GB200, GB300, H100, H200",
+"model": "nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4",
+"model_display_name": "Nemotron v3 Ultra (NVFP4)",
+"model_url": "https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4",
+"scenario": "Max Throughput"
+},
 {
 "command": "trtllm-serve Qwen/Qwen3-Next-80B-A3B-Thinking --config ${TRTLLM_DIR}/examples/configs/curated/qwen3-next.yaml",
 "config_filename": "qwen3-next.yaml",
@@ -3516,6 +3528,10 @@
 "display_name": "Nemotron v3 Super (NVFP4)",
 "url": "https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4"
 },
+"nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4": {
+"display_name": "Nemotron v3 Ultra (NVFP4)",
+"url": "https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4"
+},
 "openai/gpt-oss-120b": {
 "display_name": "gpt-oss-120b",
 "url": "https://huggingface.co/openai/gpt-oss-120b"

docs/source/commands/trtllm-serve/trtllm-serve.rst

Lines changed: 1 addition & 1 deletion
@@ -317,7 +317,7 @@ Example output:
 Configuring with YAML Files
 ----------------------------

-You can configure various options of ``trtllm-serve`` using YAML files by setting the ``--config`` option to the path of a YAML file. The arguments in the file override the corresponding command line arguments.
+You can configure various options of ``trtllm-serve`` using YAML files by setting the ``--config`` option to the path of a YAML file. Explicit CLI flags take precedence over values in the YAML; un-set CLI flags fall back to the YAML.

 .. include:: ../../_includes/note_sections.rst
 :start-after: .. start-note-config-flag-alias
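Editorial sketch (not part of the commit): the precedence order stated in the changed line, explicit CLI flag first, then YAML value, then default, can be illustrated in a few lines of Python. This is only an illustration of the rule, not the TensorRT LLM implementation; `resolve_options`, the `UNSET` sentinel, and the default values are hypothetical.

```python
from typing import Any, Dict

UNSET = object()  # hypothetical sentinel: marks a CLI flag the user did not pass


def resolve_options(cli: Dict[str, Any], yaml_cfg: Dict[str, Any],
                    defaults: Dict[str, Any]) -> Dict[str, Any]:
    """Merge options so explicit CLI flags win over YAML, and YAML wins over defaults."""
    resolved = dict(defaults)
    resolved.update(yaml_cfg)  # YAML values override the defaults
    # Only CLI flags that were explicitly set override the YAML values.
    resolved.update({k: v for k, v in cli.items() if v is not UNSET})
    return resolved


if __name__ == "__main__":
    # Illustrative values only; they are not the real defaults of any tool.
    defaults = {"port": 8000, "max_batch_size": 64, "tensor_parallel_size": 1}
    yaml_cfg = {"max_batch_size": 256, "tensor_parallel_size": 4}
    cli = {"port": UNSET, "max_batch_size": 128, "tensor_parallel_size": UNSET}
    print(resolve_options(cli, yaml_cfg, defaults))
```

In this sketch `max_batch_size` comes from the CLI, `tensor_parallel_size` from the YAML, and `port` from the defaults, which mirrors the fallback chain described in the release note.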

docs/source/deployment-guide/deployment-guide-for-nemotron-3-super-on-trtllm.md renamed to docs/source/deployment-guide/deployment-guide-for-nemotron-3-on-trtllm.md

Lines changed: 80 additions & 15 deletions
@@ -1,8 +1,13 @@
-# Deployment Guide for Nemotron v3 Super on TensorRT LLM - Blackwell & Hopper Hardware
+# Deployment Guide for Nemotron v3 (Ultra & Super) on TensorRT LLM - Blackwell & Hopper Hardware

 ## Introduction

-This deployment guide provides step-by-step instructions for running the NVIDIA Nemotron v3 Super 120B-A12B model using TensorRT LLM. Nemotron v3 Super is a hybrid architecture model combining Mixture-of-Experts (MoE) with SSM (Mamba) and attention layers, delivering 120B total parameters with only 12B active parameters per token for efficient inference. This guide covers model access, environment setup, server configuration, and inference validation.
+This deployment guide provides step-by-step instructions for running the NVIDIA Nemotron v3 family of models using TensorRT LLM. It covers two models:
+
+* **Nemotron v3 Ultra (550B-A55B)** — 550B total parameters with 55B active per token.
+* **Nemotron v3 Super (120B-A12B)** — 120B total parameters with 12B active per token.
+
+Both models share a hybrid architecture (`NemotronHForCausalLM`) that interleaves Mamba-2 (SSM), Mixture-of-Experts (MoE), and attention layers for efficient inference. Nemotron v3 Ultra additionally uses a Latent Mixture-of-Experts (LatentMoE) design and ships with built-in Multi-Token Prediction (MTP) layers. On TensorRT LLM, Nemotron v3 Ultra supports MTP, prefix caching (KV cache reuse), and disaggregated serving. This guide covers model access, environment setup, server configuration, and inference validation for both models.

 ## Prerequisites

@@ -14,6 +19,13 @@ This deployment guide provides step-by-step instructions for running the NVIDIA

 ## Models

+### Nemotron v3 Ultra (550B-A55B)
+
+* [NVIDIA-Nemotron-3-Ultra-550B-A55B-Base-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-Base-BF16)
+* [NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4)
+
+### Nemotron v3 Super (120B-A12B)
+
 * [NVIDIA-Nemotron-3-Super-120B-A12B-Base-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-Base-BF16)
 * [NVIDIA-Nemotron-3-Super-120B-A12B-FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8)
 * [NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4)
@@ -22,7 +34,27 @@ All models are available under the [nvidia/nvidia-nemotron-v3](https://huggingfa

 ## GPU Requirements

-Nemotron v3 Super 120B-A12B has 120B total parameters. The minimum GPU memory required depends on the precision:
+The minimum GPU memory required depends on the model size and precision.
+
+### Nemotron v3 Ultra (550B-A55B)
+
+The NVFP4 checkpoint is the recommended (and minimum-footprint) deployment precision for Ultra. The published minimum GPU requirements for the NVFP4 checkpoint are:
+
+| Platform | Minimum GPUs |
+|----------|--------------|
+| B200 | 4x B200 |
+| B300 | 4x B300 |
+| GB200 | 4x GB200 |
+| GB300 | 4x GB300 |
+| H100 | 8x H100 \* |
+
+The NVFP4 checkpoint uses an FP8 KV cache. On Blackwell (B200/B300) and Grace Blackwell (GB200/GB300), a single node of 4 GPUs fits the NVFP4 weights plus the KV cache with headroom.
+
+\* The same NVFP4 checkpoint can also be served on Hopper. Because Hopper lacks a native NVFP4 tensor-core GEMM, NVFP4 weights are run through a W4A16 fallback path that dequantizes them on the fly; this requires a minimum of 8x H100 (fewer may suffice on the higher-memory H200) and delivers somewhat lower throughput than Blackwell. No checkpoint conversion or command change is needed — the runtime selects the fallback automatically.
+
+The `Base-BF16` checkpoint is the pre-training checkpoint and is primarily intended for research and fine-tuning rather than serving.
+
+### Nemotron v3 Super (120B-A12B)

 | Checkpoint | Minimum GPUs (H100/H200 80GB) | Minimum GPUs (B200/GB200 192GB) |
 |------------|-------------------------------|---------------------------------|
@@ -61,12 +93,36 @@ We maintain YAML configuration files with recommended performance settings in th

 ```shell
 TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
+```
+
+Select the config file that matches the model you are deploying:
+
+```shell
+# Nemotron v3 Ultra
+EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/nemotron-3-ultra-throughput.yaml
+
+# Nemotron v3 Super
 EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/nemotron-3-super-throughput.yaml
 ```

-Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
+Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdowns below.

-````{admonition} Show code
+````{admonition} Show Nemotron v3 Ultra config
+:class: dropdown
+
+```{literalinclude} ../../../examples/configs/curated/nemotron-3-ultra-throughput.yaml
+---
+language: shell
+prepend: |
+EXTRA_LLM_API_FILE=/tmp/config.yml
+
+cat << EOF > ${EXTRA_LLM_API_FILE}
+append: EOF
+---
+```
+````
+
+````{admonition} Show Nemotron v3 Super config
 :class: dropdown

 ```{literalinclude} ../../../examples/configs/curated/nemotron-3-super-throughput.yaml
@@ -81,16 +137,25 @@ append: EOF
 ```
 ````

+The Ultra config is a starting point tuned for max throughput on 4x B200; adjust the parallelism, batch sizes, and KV cache fraction to match your hardware and traffic pattern.
+
 ### Launch the TensorRT LLM Server

-Below are example commands to launch the TensorRT LLM server with the Nemotron v3 Super model from within the container.
+Below are example commands to launch the TensorRT LLM server from within the container. Make sure `EXTRA_LLM_API_FILE` points to the config that matches your model (see above).
+
+**Nemotron v3 Ultra — NVFP4 model (recommended):**
+
+```shell
+trtllm-serve nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 --host 0.0.0.0 --port 8000 --reasoning_parser nemotron-v3 --tool_parser qwen3_coder --config ${EXTRA_LLM_API_FILE}
+```

-**NVFP4 model (recommended, lowest memory footprint):**
+**Nemotron v3 Super — NVFP4 model (recommended, lowest memory footprint):**

 ```shell
 trtllm-serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 --host 0.0.0.0 --port 8000 --reasoning_parser nano-v3 --tool_parser qwen3_coder --config ${EXTRA_LLM_API_FILE}
 ```

+The `nemotron-v3` and `nano-v3` reasoning parsers are aliases for the same Nemotron v3 parser and are interchangeable. Reasoning can be toggled per request by passing `enable_thinking` through `chat_template_kwargs` in the request body, for example `{"chat_template_kwargs": {"enable_thinking": true}}` (set it to `false` to disable reasoning).

 After the server is set up, the client can now send prompt requests to the server and receive results.
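Editorial sketch (not part of the commit): the per-request reasoning toggle described in the hunk above can be shown by building the request body in Python. The `chat_template_kwargs` / `enable_thinking` fields and the model name come from the guide text; the helper name `build_chat_request` and the prompt are hypothetical, and the sketch only constructs JSON without contacting a server.

```python
import json
from typing import Any, Dict


def build_chat_request(model: str, prompt: str, enable_thinking: bool) -> Dict[str, Any]:
    """Build a /v1/chat/completions body that toggles reasoning for this one request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Per the guide text: true enables reasoning, false disables it.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


if __name__ == "__main__":
    body = build_chat_request(
        "nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4",
        "Summarize the benefits of mixture-of-experts models.",  # placeholder prompt
        enable_thinking=False,
    )
    # The printed JSON could be passed to `curl -d` against the server launched above.
    print(json.dumps(body, indent=2))
```

Keeping the toggle in the request body rather than in the launch command means one running server can serve both reasoning and non-reasoning requests.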

@@ -102,7 +167,7 @@ These options provide control over TensorRT LLM's behavior and are set within th

 #### `tensor_parallel_size`

-* **Description:** Sets the **tensor-parallel size**. This should typically match the number of GPUs you intend to use for a single model instance. For BF16, use 4 or more GPUs on H100/H200. For NVFP4, 2 GPUs on H100/H200 may suffice.
+* **Description:** Sets the **tensor-parallel size**. This should typically match the number of GPUs you intend to use for a single model instance. For Super BF16, use 4 or more GPUs on H100/H200; for Super NVFP4, 2 GPUs on H100/H200 may suffice. For Ultra NVFP4, use 4 GPUs (single node on B200).

 #### `moe_expert_parallel_size`

@@ -158,11 +223,11 @@ curl -s -o /dev/null -w "Status: %{http_code}\n" "http://localhost:8000/health"

 When the `Status: 200` code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.

-After the TensorRT LLM server is set up and shows Application startup complete, you can send requests to the server.
+After the TensorRT LLM server is set up and shows Application startup complete, you can send requests to the server. The example below uses Nemotron v3 Ultra; replace the `model` field with the model you launched (for example `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4`).

 ```shell
 curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
-"model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
+"model": "nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4",
 "messages": [
 {
 "role": "user",
@@ -182,7 +247,7 @@ Here is an example response:
 "id": "chatcmpl-abc123def456",
 "object": "chat.completion",
 "created": 1759022940,
-"model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
+"model": "nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4",
 "choices": [
 {
 "index": 0,
@@ -209,7 +274,7 @@ Here is an example response:
 * For performance issues, check GPU utilization with `nvidia-smi` while the server is running.
 * If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed.
 * For connection issues, make sure the server port (`8000` in this guide) is not being used by another application.
-* Nemotron v3 Super is a hybrid SSM/attention model with MoE — ensure you have sufficient GPU memory for the full 120B parameter weights even though only 12B parameters are active per token.
+* Nemotron v3 is a hybrid SSM/attention model with MoE — ensure you have sufficient GPU memory for the full parameter weights even though only a fraction of parameters are active per token (12B for Super, 55B for Ultra).

 ## Benchmarking Performance

@@ -220,14 +285,14 @@ cat <<'EOF' > bench.sh
 #!/usr/bin/env bash
 set -euo pipefail

-# Adjust the model name based on which Nemotron v3 Super variant you're benchmarking
-MODEL_NAME="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4"
+# Adjust the model name based on which Nemotron v3 variant you're benchmarking
+MODEL_NAME="nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4"

 concurrency_list="1 2 4 8 16 32 64 128"
 multi_round=5
 isl=1024
 osl=1024
-result_dir=/tmp/nemotron_super_output
+result_dir=/tmp/nemotron_v3_output

 for concurrency in ${concurrency_list}; do
 num_prompts=$((concurrency * multi_round))

docs/source/deployment-guide/index.rst

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ The deployment guides below provide more detailed instructions for serving speci
 :maxdepth: 1
 :name: Deployment Guides

-deployment-guide-for-nemotron-3-super-on-trtllm.md
+deployment-guide-for-nemotron-3-on-trtllm.md
 deployment-guide-for-deepseek-r1-on-trtllm.md
 deployment-guide-for-llama3.3-70b-on-trtllm.md
 deployment-guide-for-llama4-scout-on-trtllm.md

docs/source/models/supported-models.md

Lines changed: 2 additions & 1 deletion
@@ -36,7 +36,7 @@ The following is a table of supported models for the PyTorch backend:
 | `MixtralForCausalLM` | Mixtral | `mistralai/Mixtral-8x7B-v0.1` |
 | `MllamaForConditionalGeneration` | Llama 3.2 | `meta-llama/Llama-3.2-11B-Vision` |
 | `NemotronForCausalLM` | Nemotron-3, Nemotron-4, Minitron | `nvidia/Minitron-8B-Base` |
-| `NemotronHForCausalLM` | Nemotron-3-Nano, Nemotron-3-Super | `nvidia/nvidia-nemotron-v3` |
+| `NemotronHForCausalLM` | Nemotron-3-Nano, Nemotron-3-Super, Nemotron-3-Ultra | `nvidia/nvidia-nemotron-v3` |
 | `NemotronNASForCausalLM` | NemotronNAS | `nvidia/Llama-3_3-Nemotron-Super-49B-v1` |
 | `Olmo3ForCausalLM` [^5] | OLMo 3, OLMo 3.1 | `allenai/Olmo-3.1-32B-Instruct` |
 | `OpenELMForCausalLM` [^5] | OpenELM | `apple/OpenELM-270M-Instruct` |
@@ -70,6 +70,7 @@ Note: Support for other models may vary. Features marked "N/A" are not applicabl
 | `Qwen3_5MoeForCausalLM` [^5] | Yes | Yes | Untested | Untested | Yes | No | No | No | No | Yes | Untested | Yes | N/A | Untested | Untested |
 | `Glm4MoeLiteForCausalLM` [^5] | Yes | Yes | Untested | Untested | Yes | No | No | No | No | Yes | Untested | Untested | N/A | Untested | Untested |
 | `NemotronHForCausalLM` (Super) | Yes | Yes | Untested | Untested | Yes | Yes | No | No | No | Yes | Yes | Untested | N/A | Untested | Untested |
+| `NemotronHForCausalLM` (Ultra) | Yes | Yes | Untested | Yes | Yes | Yes | No | No | No | Yes | Yes | Yes | N/A | Untested | Untested |
 | `Gemma4ForConditionalGeneration` | Untested | Yes | Untested | No | Yes | No | No | No | No | Yes | Untested | No | Yes | Untested | Untested |
 | `Step3p7ForConditionalGeneration`| Yes | Yes | Yes | Untested | Untested | Yes | No | No | No | Yes | Untested | Untested | Yes | Untested | Untested |

docs/source/release-notes.md

Lines changed: 2 additions & 0 deletions
@@ -26,6 +26,8 @@ All published functionality in the Release Notes has been fully tested and verif

 ### API Changes

+- `trtllm-serve`, `trtllm-eval`, `trtllm-bench`: explicit CLI flags now take precedence over values in `--config` / `--extra_llm_api_options` YAML files (was: YAML overrode CLI). Un-set CLI flags continue to fall back to the YAML, then to model-specific and built-in defaults.
+
 ### Fixed Issues

 ### Known Issues

examples/configs/curated/lookup.yaml

Lines changed: 5 additions & 0 deletions
@@ -4,6 +4,11 @@
 config_path: examples/configs/curated/nemotron-3-super-throughput.yaml
 scenario: Max Throughput
 gpu_compatibility: "B200, GB200"
+- model: nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4
+arch: NemotronHForCausalLM
+config_path: examples/configs/curated/nemotron-3-ultra-throughput.yaml
+scenario: Max Throughput
+gpu_compatibility: "B200, B300, GB200, GB300, H100, H200"
 - model: Qwen/Qwen3-Next-80B-A3B-Thinking
 arch: Qwen3NextForCausalLM
 config_path: examples/configs/curated/qwen3-next.yaml
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+max_batch_size: 256
+max_num_tokens: 2048
+tensor_parallel_size: 4
+moe_expert_parallel_size: 4
+trust_remote_code: true
+enable_attention_dp: true
+cuda_graph_config:
+enable_padding: true
+max_batch_size: 256
+kv_cache_config:
+free_gpu_memory_fraction: 0.8
+enable_block_reuse: false
+mamba_ssm_cache_dtype: float16
+mamba_ssm_philox_rounds: 5
+mamba_ssm_stochastic_rounding: true
+moe_config:
+backend: CUTEDSL
+num_postprocess_workers: 4
+stream_interval: 10
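Editorial sketch (not part of the commit): a back-of-envelope Python estimate of how `tensor_parallel_size: 4` and `free_gpu_memory_fraction: 0.8` from the config above relate to GPU memory. Several inputs are assumptions rather than facts from the commit: roughly 4.5 bits per parameter for NVFP4 (4-bit values plus per-block scales), weights split evenly across tensor-parallel ranks, the 192 GB B200 figure from the guide's Super table, and `free_gpu_memory_fraction` applying to the memory left after weights are loaded. Activations, CUDA graphs, Mamba state, and any unquantized layers are ignored, so real headroom will be smaller. `estimate_memory_gb` is a hypothetical helper.

```python
from typing import Dict


def estimate_memory_gb(total_params: float, bits_per_param: float,
                       tensor_parallel_size: int, gpu_memory_gb: float,
                       free_gpu_memory_fraction: float) -> Dict[str, float]:
    """Rough per-GPU estimate of the weight shard and the KV cache budget (decimal GB)."""
    weights_gb = total_params * bits_per_param / 8 / 1e9 / tensor_parallel_size
    free_gb = gpu_memory_gb - weights_gb
    return {
        "weights_gb_per_gpu": weights_gb,
        "free_gb_after_weights": free_gb,
        "kv_cache_budget_gb": free_gb * free_gpu_memory_fraction,
    }


if __name__ == "__main__":
    estimate = estimate_memory_gb(
        total_params=550e9,            # Ultra total parameter count from the guide
        bits_per_param=4.5,            # assumption: NVFP4 values plus scale overhead
        tensor_parallel_size=4,        # from the config above
        gpu_memory_gb=192,             # assumption: B200 figure from the guide's table
        free_gpu_memory_fraction=0.8,  # from the config above
    )
    for name, value in estimate.items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions the weights take well under half of each GPU, which is consistent with the guide's statement that four Blackwell GPUs fit the NVFP4 weights plus KV cache with headroom.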

examples/models/core/nemotron/README_nemotron_super_v3.md

Lines changed: 1 addition & 1 deletion
@@ -200,4 +200,4 @@ Key options:
 # Notes

 * prefix-cache is not supported for Nemotron Super V3 yet, so please set `enable_block_reuse: false` when launching a server.
-* For detailed deployment instructions, see the [deployment guide](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/deployment-guide/deployment-guide-for-nemotron-3-super-on-trtllm.md).
+* For detailed deployment instructions, see the [deployment guide](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/deployment-guide/deployment-guide-for-nemotron-3-on-trtllm.md).

examples/visual_gen/README.md

Lines changed: 2 additions & 1 deletion
@@ -18,11 +18,12 @@ for feature details.
 # Defaults
 python quickstart_example.py
 python models/wan_t2v.py
-python models/wan_i2v.py
+python models/ltx2.py

 # With engine config (quant, parallelism, etc.)
 python models/wan_t2v.py --visual_gen_args configs/wan2.2-t2v-fp4-1gpu.yaml
 python models/wan_i2v.py --visual_gen_args configs/wan2.2-i2v-fp4-1gpu.yaml --image /path/to/image.png
+python models/ltx2.py --visual_gen_args configs/ltx2-t2v-fp8-1-gpu.yaml
 ```

 Install deps from the repo root: `pip install -r requirements-dev.txt`.
