You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: docs/source/commands/trtllm-serve/trtllm-serve.rst
+1-1Lines changed: 1 addition & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -317,7 +317,7 @@ Example output:
317
317
Configuring with YAML Files
318
318
----------------------------
319
319
320
-
You can configure various options of ``trtllm-serve`` using YAML files by setting the ``--config`` option to the path of a YAML file. The arguments in the file override the corresponding command line arguments.
320
+
You can configure various options of ``trtllm-serve`` using YAML files by setting the ``--config`` option to the path of a YAML file. Explicit CLI flags take precedence over values in the YAML; un-set CLI flags fall back to the YAML.
Copy file name to clipboardExpand all lines: docs/source/deployment-guide/deployment-guide-for-nemotron-3-on-trtllm.md
+80-15Lines changed: 80 additions & 15 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -1,8 +1,13 @@
1
-
# Deployment Guide for Nemotron v3 Super on TensorRT LLM - Blackwell & Hopper Hardware
1
+
# Deployment Guide for Nemotron v3 (Ultra & Super) on TensorRT LLM - Blackwell & Hopper Hardware
2
2
3
3
## Introduction
4
4
5
-
This deployment guide provides step-by-step instructions for running the NVIDIA Nemotron v3 Super 120B-A12B model using TensorRT LLM. Nemotron v3 Super is a hybrid architecture model combining Mixture-of-Experts (MoE) with SSM (Mamba) and attention layers, delivering 120B total parameters with only 12B active parameters per token for efficient inference. This guide covers model access, environment setup, server configuration, and inference validation.
5
+
This deployment guide provides step-by-step instructions for running the NVIDIA Nemotron v3 family of models using TensorRT LLM. It covers two models:
6
+
7
+
***Nemotron v3 Ultra (550B-A55B)** — 550B total parameters with 55B active per token.
8
+
***Nemotron v3 Super (120B-A12B)** — 120B total parameters with 12B active per token.
9
+
10
+
Both models share a hybrid architecture (`NemotronHForCausalLM`) that interleaves Mamba-2 (SSM), Mixture-of-Experts (MoE), and attention layers for efficient inference. Nemotron v3 Ultra additionally uses a Latent Mixture-of-Experts (LatentMoE) design and ships with built-in Multi-Token Prediction (MTP) layers. On TensorRT LLM, Nemotron v3 Ultra supports MTP, prefix caching (KV cache reuse), and disaggregated serving. This guide covers model access, environment setup, server configuration, and inference validation for both models.
6
11
7
12
## Prerequisites
8
13
@@ -14,6 +19,13 @@ This deployment guide provides step-by-step instructions for running the NVIDIA
@@ -22,7 +34,27 @@ All models are available under the [nvidia/nvidia-nemotron-v3](https://huggingfa
22
34
23
35
## GPU Requirements
24
36
25
-
Nemotron v3 Super 120B-A12B has 120B total parameters. The minimum GPU memory required depends on the precision:
37
+
The minimum GPU memory required depends on the model size and precision.
38
+
39
+
### Nemotron v3 Ultra (550B-A55B)
40
+
41
+
The NVFP4 checkpoint is the recommended (and minimum-footprint) deployment precision for Ultra. The published minimum GPU requirements for the NVFP4 checkpoint are:
42
+
43
+
| Platform | Minimum GPUs |
44
+
|----------|--------------|
45
+
| B200 | 4x B200 |
46
+
| B300 | 4x B300 |
47
+
| GB200 | 4x GB200 |
48
+
| GB300 | 4x GB300 |
49
+
| H100 | 8x H100 \*|
50
+
51
+
The NVFP4 checkpoint uses an FP8 KV cache. On Blackwell (B200/B300) and Grace Blackwell (GB200/GB300), a single node of 4 GPUs fits the NVFP4 weights plus the KV cache with headroom.
52
+
53
+
\* The same NVFP4 checkpoint can also be served on Hopper. Because Hopper lacks a native NVFP4 tensor-core GEMM, NVFP4 weights are run through a W4A16 fallback path that dequantizes them on the fly; this requires a minimum of 8x H100 (fewer may suffice on the higher-memory H200) and delivers somewhat lower throughput than Blackwell. No checkpoint conversion or command change is needed — the runtime selects the fallback automatically.
54
+
55
+
The `Base-BF16` checkpoint is the pre-training checkpoint and is primarily intended for research and fine-tuning rather than serving.
The Ultra config is a starting point tuned for max throughput on 4x B200; adjust the parallelism, batch sizes, and KV cache fraction to match your hardware and traffic pattern.
141
+
84
142
### Launch the TensorRT LLM Server
85
143
86
-
Below are example commands to launch the TensorRT LLM server with the Nemotron v3 Super model from within the container.
144
+
Below are example commands to launch the TensorRT LLM server from within the container. Make sure `EXTRA_LLM_API_FILE` points to the config that matches your model (see above).
145
+
146
+
**Nemotron v3 Ultra — NVFP4 model (recommended):**
The `nemotron-v3` and `nano-v3` reasoning parsers are aliases for the same Nemotron v3 parser and are interchangeable. Reasoning can be toggled per request by passing `enable_thinking` through `chat_template_kwargs` in the request body, for example `{"chat_template_kwargs": {"enable_thinking": true}}` (set it to `false` to disable reasoning).
94
159
95
160
After the server is set up, the client can now send prompt requests to the server and receive results.
96
161
@@ -102,7 +167,7 @@ These options provide control over TensorRT LLM's behavior and are set within th
102
167
103
168
#### `tensor_parallel_size`
104
169
105
-
***Description:** Sets the **tensor-parallel size**. This should typically match the number of GPUs you intend to use for a single model instance. For BF16, use 4 or more GPUs on H100/H200. For NVFP4, 2 GPUs on H100/H200 may suffice.
170
+
***Description:** Sets the **tensor-parallel size**. This should typically match the number of GPUs you intend to use for a single model instance. For Super BF16, use 4 or more GPUs on H100/H200; for Super NVFP4, 2 GPUs on H100/H200 may suffice. For Ultra NVFP4, use 4 GPUs (single node on B200).
When the `Status: 200` code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.
160
225
161
-
After the TensorRT LLM server is set up and shows Application startup complete, you can send requests to the server.
226
+
After the TensorRT LLM server is set up and shows Application startup complete, you can send requests to the server. The example below uses Nemotron v3 Ultra; replace the `model` field with the model you launched (for example `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4`).
* For performance issues, check GPU utilization with `nvidia-smi` while the server is running.
210
275
* If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed.
211
276
* For connection issues, make sure the server port (`8000` in this guide) is not being used by another application.
212
-
* Nemotron v3 Super is a hybrid SSM/attention model with MoE — ensure you have sufficient GPU memory for the full 120B parameter weights even though only 12B parameters are active per token.
277
+
* Nemotron v3 is a hybrid SSM/attention model with MoE — ensure you have sufficient GPU memory for the full parameter weights even though only a fraction of parameters are active per token (12B for Super, 55B for Ultra).
213
278
214
279
## Benchmarking Performance
215
280
@@ -220,14 +285,14 @@ cat <<'EOF' > bench.sh
220
285
#!/usr/bin/env bash
221
286
set -euo pipefail
222
287
223
-
# Adjust the model name based on which Nemotron v3 Super variant you're benchmarking
Copy file name to clipboardExpand all lines: docs/source/release-notes.md
+2Lines changed: 2 additions & 0 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -26,6 +26,8 @@ All published functionality in the Release Notes has been fully tested and verif
26
26
27
27
### API Changes
28
28
29
+
-`trtllm-serve`, `trtllm-eval`, `trtllm-bench`: explicit CLI flags now take precedence over values in `--config` / `--extra_llm_api_options` YAML files (was: YAML overrode CLI). Un-set CLI flags continue to fall back to the YAML, then to model-specific and built-in defaults.
Copy file name to clipboardExpand all lines: examples/models/core/nemotron/README_nemotron_super_v3.md
+1-1Lines changed: 1 addition & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -200,4 +200,4 @@ Key options:
200
200
# Notes
201
201
202
202
* prefix-cache is not supported for Nemotron Super V3 yet, so please set `enable_block_reuse: false` when launching a server.
203
-
* For detailed deployment instructions, see the [deployment guide](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/deployment-guide/deployment-guide-for-nemotron-3-super-on-trtllm.md).
203
+
* For detailed deployment instructions, see the [deployment guide](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/deployment-guide/deployment-guide-for-nemotron-3-on-trtllm.md).
0 commit comments