NVIDIA
diff --git a/docs/source/_static/config_db.json b/docs/source/_static/config_db.json
Lines changed: 16 additions & 0 deletions
diff --git a/docs/source/commands/trtllm-serve/trtllm-serve.rst b/docs/source/commands/trtllm-serve/trtllm-serve.rst
Lines changed: 1 addition & 1 deletion
diff --git a/docs/source/deployment-guide/deployment-guide-for-nemotron-3-super-on-trtllm.md b/docs/source/deployment-guide/deployment-guide-for-nemotron-3-on-trtllm.md
rename from docs/source/deployment-guide/deployment-guide-for-nemotron-3-super-on-trtllm.md
rename to docs/source/deployment-guide/deployment-guide-for-nemotron-3-on-trtllm.md
Lines changed: 80 additions & 15 deletions
diff --git a/docs/source/deployment-guide/index.rst b/docs/source/deployment-guide/index.rst
Lines changed: 1 addition & 1 deletion
diff --git a/docs/source/models/supported-models.md b/docs/source/models/supported-models.md
Lines changed: 2 additions & 1 deletion
diff --git a/docs/source/release-notes.md b/docs/source/release-notes.md
Lines changed: 2 additions & 0 deletions
diff --git a/examples/configs/curated/lookup.yaml b/examples/configs/curated/lookup.yaml
Lines changed: 5 additions & 0 deletions
diff --git a/examples/configs/curated/nemotron-3-ultra-throughput.yaml b/examples/configs/curated/nemotron-3-ultra-throughput.yaml
Lines changed: 19 additions & 0 deletions
diff --git a/examples/models/core/nemotron/README_nemotron_super_v3.md b/examples/models/core/nemotron/README_nemotron_super_v3.md
Lines changed: 1 addition & 1 deletion
diff --git a/examples/visual_gen/README.md b/examples/visual_gen/README.md
Lines changed: 2 additions & 1 deletion
@@ -12,6 +12,18 @@
       "model_url": "https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
       "scenario": "Max Throughput"
     },
+    {
+      "command": "trtllm-serve nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 --config ${TRTLLM_DIR}/examples/configs/curated/nemotron-3-ultra-throughput.yaml",
+      "config_filename": "nemotron-3-ultra-throughput.yaml",
+      "config_github_url": "https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/nemotron-3-ultra-throughput.yaml",
+      "config_path": "examples/configs/curated/nemotron-3-ultra-throughput.yaml",
+      "config_raw_url": "https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/configs/curated/nemotron-3-ultra-throughput.yaml",
+      "gpu_compatibility": "B200, B300, GB200, GB300, H100, H200",
+      "model": "nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4",
+      "model_display_name": "Nemotron v3 Ultra (NVFP4)",
+      "model_url": "https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4",
+      "scenario": "Max Throughput"
+    },
     {
       "command": "trtllm-serve Qwen/Qwen3-Next-80B-A3B-Thinking --config ${TRTLLM_DIR}/examples/configs/curated/qwen3-next.yaml",
       "config_filename": "qwen3-next.yaml",
@@ -3516,6 +3528,10 @@
       "display_name": "Nemotron v3 Super (NVFP4)",
       "url": "https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4"
     },
+    "nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4": {
+      "display_name": "Nemotron v3 Ultra (NVFP4)",
+      "url": "https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4"
+    },
     "openai/gpt-oss-120b": {
       "display_name": "gpt-oss-120b",
       "url": "https://huggingface.co/openai/gpt-oss-120b"
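
The new `config_db.json` entry repeats the same path and model name across several fields, so a small consistency check can catch copy-paste slips. The sketch below is illustrative only: `check_entry` is a hypothetical helper, not part of the repository, and the URL prefixes are inferred from the entries visible in this diff.

```python
import posixpath

# URL prefixes inferred from the entries in this diff (assumption).
GITHUB_BLOB = "https://github.com/NVIDIA/TensorRT-LLM/blob/main/"
GITHUB_RAW = "https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/"
HF_BASE = "https://huggingface.co/"


def check_entry(entry: dict) -> list:
    """Return a list of inconsistencies found in one config_db entry."""
    problems = []
    path = entry["config_path"]
    if posixpath.basename(path) != entry["config_filename"]:
        problems.append("config_filename does not match basename of config_path")
    if entry["config_github_url"] != GITHUB_BLOB + path:
        problems.append("config_github_url does not point at config_path")
    if entry["config_raw_url"] != GITHUB_RAW + path:
        problems.append("config_raw_url does not point at config_path")
    if entry["model_url"] != HF_BASE + entry["model"]:
        problems.append("model_url does not match model")
    if entry["model"] not in entry["command"] or path not in entry["command"]:
        problems.append("command does not reference model and config_path")
    return problems


# The Ultra entry added in this change, copied from the hunk above.
ultra_entry = {
    "command": "trtllm-serve nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 --config ${TRTLLM_DIR}/examples/configs/curated/nemotron-3-ultra-throughput.yaml",
    "config_filename": "nemotron-3-ultra-throughput.yaml",
    "config_github_url": "https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/nemotron-3-ultra-throughput.yaml",
    "config_path": "examples/configs/curated/nemotron-3-ultra-throughput.yaml",
    "config_raw_url": "https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/configs/curated/nemotron-3-ultra-throughput.yaml",
    "model": "nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4",
    "model_url": "https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4",
}

print(check_entry(ultra_entry))
```

An empty list means the fields agree with each other; it says nothing about whether the URLs actually resolve.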
 
@@ -317,7 +317,7 @@ Example output:
 Configuring with YAML Files
 ----------------------------
 
-You can configure various options of ``trtllm-serve`` using YAML files by setting the ``--config`` option to the path of a YAML file. The arguments in the file override the corresponding command line arguments.
+You can configure various options of ``trtllm-serve`` using YAML files by setting the ``--config`` option to the path of a YAML file. Explicit CLI flags take precedence over values in the YAML file; CLI flags that are not set fall back to the YAML values.
 
 .. include:: ../../_includes/note_sections.rst
    :start-after: .. start-note-config-flag-alias
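
The precedence rule in the reworded paragraph (explicit CLI flag, then YAML value, then default) can be sketched as a layered merge. This is a minimal illustration and not the `trtllm-serve` implementation: `resolve_options` is a hypothetical name, `None` stands in for "flag not passed", the option names are taken from the Ultra YAML in this change, and the default values are made up.

```python
def resolve_options(cli_args: dict, yaml_config: dict, defaults: dict) -> dict:
    """Merge option sources so that explicit CLI flags win over YAML,
    and YAML wins over defaults. A CLI value of None means 'not passed'."""
    merged = dict(defaults)        # lowest priority
    merged.update(yaml_config)     # YAML overrides defaults
    explicit = {k: v for k, v in cli_args.items() if v is not None}
    merged.update(explicit)        # explicit CLI flags override YAML
    return merged


# Hypothetical defaults, for illustration only.
defaults = {"tensor_parallel_size": 1, "max_batch_size": 8}
# Values as they appear in nemotron-3-ultra-throughput.yaml.
yaml_config = {"tensor_parallel_size": 4, "max_batch_size": 256, "max_num_tokens": 2048}
# The user passed only a batch-size flag on the command line.
cli_args = {"tensor_parallel_size": None, "max_batch_size": 128}

resolved = resolve_options(cli_args, yaml_config, defaults)
print(resolved)
```

Here `max_batch_size` comes from the command line, while `tensor_parallel_size` and `max_num_tokens` fall back to the YAML.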
 
@@ -1,8 +1,13 @@
-# Deployment Guide for Nemotron v3 Super on TensorRT LLM - Blackwell & Hopper Hardware
+# Deployment Guide for Nemotron v3 (Ultra & Super) on TensorRT LLM - Blackwell & Hopper Hardware
 
 ## Introduction
 
-This deployment guide provides step-by-step instructions for running the NVIDIA Nemotron v3 Super 120B-A12B model using TensorRT LLM. Nemotron v3 Super is a hybrid architecture model combining Mixture-of-Experts (MoE) with SSM (Mamba) and attention layers, delivering 120B total parameters with only 12B active parameters per token for efficient inference. This guide covers model access, environment setup, server configuration, and inference validation.
+This deployment guide provides step-by-step instructions for running the NVIDIA Nemotron v3 family of models using TensorRT LLM. It covers two models:
+
+* **Nemotron v3 Ultra (550B-A55B)** — 550B total parameters with 55B active per token.
+* **Nemotron v3 Super (120B-A12B)** — 120B total parameters with 12B active per token.
+
+Both models share a hybrid architecture (`NemotronHForCausalLM`) that interleaves Mamba-2 (SSM), Mixture-of-Experts (MoE), and attention layers for efficient inference. Nemotron v3 Ultra additionally uses a Latent Mixture-of-Experts (LatentMoE) design and ships with built-in Multi-Token Prediction (MTP) layers. On TensorRT LLM, Nemotron v3 Ultra supports MTP, prefix caching (KV cache reuse), and disaggregated serving. This guide covers model access, environment setup, server configuration, and inference validation for both models.
 
 ## Prerequisites
 
@@ -14,6 +19,13 @@ This deployment guide provides step-by-step instructions for running the NVIDIA
 
 ## Models
 
+### Nemotron v3 Ultra (550B-A55B)
+
+* [NVIDIA-Nemotron-3-Ultra-550B-A55B-Base-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-Base-BF16)
+* [NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4)
+
+### Nemotron v3 Super (120B-A12B)
+
 * [NVIDIA-Nemotron-3-Super-120B-A12B-Base-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-Base-BF16)
 * [NVIDIA-Nemotron-3-Super-120B-A12B-FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8)
 * [NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4)
@@ -22,7 +34,27 @@ All models are available under the [nvidia/nvidia-nemotron-v3](https://huggingfa
 
 ## GPU Requirements
 
-Nemotron v3 Super 120B-A12B has 120B total parameters. The minimum GPU memory required depends on the precision:
+The minimum GPU memory required depends on the model size and precision.
+
+### Nemotron v3 Ultra (550B-A55B)
+
+The NVFP4 checkpoint is the recommended (and minimum-footprint) deployment precision for Ultra. The published minimum GPU requirements for the NVFP4 checkpoint are:
+
+| Platform | Minimum GPUs |
+|----------|--------------|
+| B200     | 4x B200      |
+| B300     | 4x B300      |
+| GB200    | 4x GB200     |
+| GB300    | 4x GB300     |
+| H100     | 8x H100 \*   |
+
+The NVFP4 checkpoint uses an FP8 KV cache. On Blackwell (B200/B300) and Grace Blackwell (GB200/GB300), a single node of 4 GPUs fits the NVFP4 weights plus the KV cache with headroom.
+
+\* The same NVFP4 checkpoint can also be served on Hopper. Because Hopper lacks a native NVFP4 tensor-core GEMM, NVFP4 weights are run through a W4A16 fallback path that dequantizes them on the fly; this requires a minimum of 8x H100 (fewer may suffice on the higher-memory H200) and delivers somewhat lower throughput than Blackwell. No checkpoint conversion or command change is needed — the runtime selects the fallback automatically.
+
+The `Base-BF16` checkpoint is the pre-training checkpoint and is primarily intended for research and fine-tuning rather than serving.
+
+### Nemotron v3 Super (120B-A12B)
 
 | Checkpoint | Minimum GPUs (H100/H200 80GB) | Minimum GPUs (B200/GB200 192GB) |
 |------------|-------------------------------|---------------------------------|
@@ -61,12 +93,36 @@ We maintain YAML configuration files with recommended performance settings in th
 
 ```shell
 TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
+```
+
+Select the config file that matches the model you are deploying:
+
+```shell
+# Nemotron v3 Ultra
+EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/nemotron-3-ultra-throughput.yaml
+
+# Nemotron v3 Super
 EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/nemotron-3-super-throughput.yaml
 ```
 
-Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
+Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdowns below.
 
-````{admonition} Show code
+````{admonition} Show Nemotron v3 Ultra config
+:class: dropdown
+
+```{literalinclude} ../../../examples/configs/curated/nemotron-3-ultra-throughput.yaml
+---
+language: shell
+prepend: |
+  EXTRA_LLM_API_FILE=/tmp/config.yml
+
+  cat << EOF > ${EXTRA_LLM_API_FILE}
+append: EOF
+---
+```
+````
+
+````{admonition} Show Nemotron v3 Super config
 :class: dropdown
 
 ```{literalinclude} ../../../examples/configs/curated/nemotron-3-super-throughput.yaml
@@ -81,16 +137,25 @@ append: EOF
 ```
 ````
 
+The Ultra config is a starting point tuned for max throughput on 4x B200; adjust the parallelism, batch sizes, and KV cache fraction to match your hardware and traffic pattern.
+
 ### Launch the TensorRT LLM Server
 
-Below are example commands to launch the TensorRT LLM server with the Nemotron v3 Super model from within the container.
+Below are example commands to launch the TensorRT LLM server from within the container. Make sure `EXTRA_LLM_API_FILE` points to the config that matches your model (see above).
+
+**Nemotron v3 Ultra — NVFP4 model (recommended):**
+
+```shell
+trtllm-serve nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 --host 0.0.0.0 --port 8000 --reasoning_parser nemotron-v3 --tool_parser qwen3_coder --config ${EXTRA_LLM_API_FILE}
+```
 
-**NVFP4 model (recommended, lowest memory footprint):**
+**Nemotron v3 Super — NVFP4 model (recommended, lowest memory footprint):**
 
 ```shell
 trtllm-serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 --host 0.0.0.0 --port 8000 --reasoning_parser nano-v3 --tool_parser qwen3_coder --config ${EXTRA_LLM_API_FILE}
 ```
 
+The `nemotron-v3` and `nano-v3` reasoning parsers are aliases for the same Nemotron v3 parser and are interchangeable. Reasoning can be toggled per request by passing `enable_thinking` through `chat_template_kwargs` in the request body, for example `{"chat_template_kwargs": {"enable_thinking": true}}` (set it to `false` to disable reasoning).
 
 After the server is set up, the client can now send prompt requests to the server and receive results.
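
As a rough client-side sketch of the per-request reasoning toggle described above, the snippet below builds a chat completion request with `chat_template_kwargs` using only the Python standard library. `build_chat_request` is a hypothetical helper, the prompt is a placeholder, and the host and port assume the launch commands in this guide.

```python
import json
import urllib.request


def build_chat_request(model, prompt, enable_thinking, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request to the OpenAI-compatible endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Toggle reasoning for this request only.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    "nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4",
    "Summarize what tensor parallelism does.",  # placeholder prompt
    enable_thinking=False,
)
print(req.full_url)
print(req.data.decode("utf-8"))

# To send it against a running server:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```

Building the request separately from sending it makes the payload easy to inspect before the server is up.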
 
@@ -102,7 +167,7 @@ These options provide control over TensorRT LLM's behavior and are set within th
 
 #### `tensor_parallel_size`
 
-* **Description:** Sets the **tensor-parallel size**. This should typically match the number of GPUs you intend to use for a single model instance. For BF16, use 4 or more GPUs on H100/H200. For NVFP4, 2 GPUs on H100/H200 may suffice.
+* **Description:** Sets the **tensor-parallel size**. This should typically match the number of GPUs you intend to use for a single model instance. For Super BF16, use 4 or more GPUs on H100/H200; for Super NVFP4, 2 GPUs on H100/H200 may suffice. For Ultra NVFP4, use 4 GPUs (single node on B200).
 
 #### `moe_expert_parallel_size`
 
@@ -158,11 +223,11 @@ curl -s -o /dev/null -w "Status: %{http_code}\n" "http://localhost:8000/health"
 
 When the `Status: 200` code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.
 
-After the TensorRT LLM server is set up and shows Application startup complete, you can send requests to the server.
+After the TensorRT LLM server is set up and logs `Application startup complete`, you can send requests to the server. The example below uses Nemotron v3 Ultra; replace the `model` field with the model you launched (for example `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4`).
 
 ```shell
 curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json"  -d '{
-    "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
+    "model": "nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4",
     "messages": [
         {
             "role": "user",
@@ -182,7 +247,7 @@ Here is an example response:
   "id": "chatcmpl-abc123def456",
   "object": "chat.completion",
   "created": 1759022940,
-  "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
+  "model": "nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4",
   "choices": [
     {
       "index": 0,
@@ -209,7 +274,7 @@ Here is an example response:
 * For performance issues, check GPU utilization with `nvidia-smi` while the server is running.
 * If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed.
 * For connection issues, make sure the server port (`8000` in this guide) is not being used by another application.
-* Nemotron v3 Super is a hybrid SSM/attention model with MoE — ensure you have sufficient GPU memory for the full 120B parameter weights even though only 12B parameters are active per token.
+* Nemotron v3 is a hybrid SSM/attention model with MoE — ensure you have sufficient GPU memory for the full parameter weights even though only a fraction of parameters are active per token (12B for Super, 55B for Ultra).
 
 ## Benchmarking Performance
 
@@ -220,14 +285,14 @@ cat <<'EOF' > bench.sh
 #!/usr/bin/env bash
 set -euo pipefail
 
-# Adjust the model name based on which Nemotron v3 Super variant you're benchmarking
-MODEL_NAME="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4"
+# Adjust the model name based on which Nemotron v3 variant you're benchmarking
+MODEL_NAME="nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4"
 
 concurrency_list="1 2 4 8 16 32 64 128"
 multi_round=5
 isl=1024
 osl=1024
-result_dir=/tmp/nemotron_super_output
+result_dir=/tmp/nemotron_v3_output
 
 for concurrency in ${concurrency_list}; do
     num_prompts=$((concurrency * multi_round))
 
@@ -28,7 +28,7 @@ The deployment guides below provide more detailed instructions for serving speci
    :maxdepth: 1
    :name: Deployment Guides
 
-   deployment-guide-for-nemotron-3-super-on-trtllm.md
+   deployment-guide-for-nemotron-3-on-trtllm.md
    deployment-guide-for-deepseek-r1-on-trtllm.md
    deployment-guide-for-llama3.3-70b-on-trtllm.md
    deployment-guide-for-llama4-scout-on-trtllm.md
 
@@ -36,7 +36,7 @@ The following is a table of supported models for the PyTorch backend:
 | `MixtralForCausalLM`                 | Mixtral                            | `mistralai/Mixtral-8x7B-v0.1`                |
 | `MllamaForConditionalGeneration`     | Llama 3.2                          | `meta-llama/Llama-3.2-11B-Vision`            |
 | `NemotronForCausalLM`                | Nemotron-3, Nemotron-4, Minitron   | `nvidia/Minitron-8B-Base`                    |
-| `NemotronHForCausalLM`               | Nemotron-3-Nano, Nemotron-3-Super  | `nvidia/nvidia-nemotron-v3`                  |
+| `NemotronHForCausalLM`               | Nemotron-3-Nano, Nemotron-3-Super, Nemotron-3-Ultra | `nvidia/nvidia-nemotron-v3`                  |
 | `NemotronNASForCausalLM`             | NemotronNAS                        | `nvidia/Llama-3_3-Nemotron-Super-49B-v1`     |
 | `Olmo3ForCausalLM` [^5]              | OLMo 3, OLMo 3.1                   | `allenai/Olmo-3.1-32B-Instruct`              |
 | `OpenELMForCausalLM` [^5]            | OpenELM                            | `apple/OpenELM-270M-Instruct`                |
@@ -70,6 +70,7 @@ Note: Support for other models may vary. Features marked "N/A" are not applicabl
 | `Qwen3_5MoeForCausalLM` [^5]     | Yes               | Yes        | Untested                   | Untested              | Yes             | No  | No               | No                | No     | Yes           | Untested         | Yes            | N/A                      | Untested              | Untested        |
 | `Glm4MoeLiteForCausalLM` [^5]    | Yes               | Yes        | Untested                   | Untested              | Yes             | No  | No               | No                | No     | Yes           | Untested         | Untested       | N/A                      | Untested              | Untested        |
 | `NemotronHForCausalLM` (Super)   | Yes               | Yes        | Untested                   | Untested              | Yes             | Yes | No               | No                | No     | Yes           | Yes              | Untested       | N/A                      | Untested              | Untested        |
+| `NemotronHForCausalLM` (Ultra)   | Yes               | Yes        | Untested                   | Yes                   | Yes             | Yes | No               | No                | No     | Yes           | Yes              | Yes            | N/A                      | Untested              | Untested        |
 | `Gemma4ForConditionalGeneration` | Untested          | Yes        | Untested                   | No                    | Yes             | No  | No               | No                | No     | Yes           | Untested         | No             | Yes                      | Untested              | Untested        |
 | `Step3p7ForConditionalGeneration`| Yes               | Yes        | Yes                        | Untested              | Untested        | Yes | No               | No                | No     | Yes           | Untested         | Untested       | Yes                      | Untested              | Untested        |
 
 
@@ -26,6 +26,8 @@ All published functionality in the Release Notes has been fully tested and verif
 
 ### API Changes
 
+- `trtllm-serve`, `trtllm-eval`, `trtllm-bench`: explicit CLI flags now take precedence over values in `--config` / `--extra_llm_api_options` YAML files (previously, YAML values overrode CLI flags). CLI flags that are not set continue to fall back to the YAML, then to model-specific and built-in defaults.
+
 ### Fixed Issues
 
 ### Known Issues
 
@@ -4,6 +4,11 @@
   config_path: examples/configs/curated/nemotron-3-super-throughput.yaml
   scenario: Max Throughput
   gpu_compatibility: "B200, GB200"
+- model: nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4
+  arch: NemotronHForCausalLM
+  config_path: examples/configs/curated/nemotron-3-ultra-throughput.yaml
+  scenario: Max Throughput
+  gpu_compatibility: "B200, B300, GB200, GB300, H100, H200"
 - model: Qwen/Qwen3-Next-80B-A3B-Thinking
   arch: Qwen3NextForCausalLM
   config_path: examples/configs/curated/qwen3-next.yaml
 
@@ -0,0 +1,19 @@
+max_batch_size: 256
+max_num_tokens: 2048
+tensor_parallel_size: 4
+moe_expert_parallel_size: 4
+trust_remote_code: true
+enable_attention_dp: true
+cuda_graph_config:
+  enable_padding: true
+  max_batch_size: 256
+kv_cache_config:
+  free_gpu_memory_fraction: 0.8
+  enable_block_reuse: false
+  mamba_ssm_cache_dtype: float16
+  mamba_ssm_philox_rounds: 5
+  mamba_ssm_stochastic_rounding: true
+moe_config:
+  backend: CUTEDSL
+num_postprocess_workers: 4
+stream_interval: 10
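
The `tensor_parallel_size: 4` and `free_gpu_memory_fraction: 0.8` settings above can be sanity-checked against the 4x B200 minimum with back-of-the-envelope arithmetic. The sketch below is a rough estimate under loudly stated assumptions, not a measurement: it assumes about 4.5 bits per weight for NVFP4 (4-bit values plus one FP8 scale per 16-element block), that every parameter is quantized and sharded evenly across GPUs, and 192 GB per B200 as in the deployment guide's table header. It ignores activations, CUDA graphs, Mamba state, and the attention weights that `enable_attention_dp` replicates on each GPU, so real headroom will be smaller.

```python
def estimate_budget(total_params, bits_per_param, num_gpus, gpu_mem_gb, kv_fraction):
    """Rough per-GPU memory budget in GB (1 GB = 1e9 bytes)."""
    weights_gb = total_params * (bits_per_param / 8) / 1e9
    weights_per_gpu_gb = weights_gb / num_gpus           # assumes even sharding
    free_after_weights_gb = gpu_mem_gb - weights_per_gpu_gb
    # free_gpu_memory_fraction applies to memory left after the model loads.
    kv_cache_gb = kv_fraction * free_after_weights_gb
    return {
        "weights_gb": weights_gb,
        "weights_per_gpu_gb": weights_per_gpu_gb,
        "free_after_weights_gb": free_after_weights_gb,
        "kv_cache_gb": kv_cache_gb,
    }


budget = estimate_budget(
    total_params=550e9,     # 550B total parameters
    bits_per_param=4.5,     # assumption: NVFP4 value + block scale overhead
    num_gpus=4,             # tensor_parallel_size in the config above
    gpu_mem_gb=192,         # B200 figure from the guide's table header
    kv_fraction=0.8,        # free_gpu_memory_fraction in the config above
)
for name, value in budget.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions the weights take well under half of each GPU, which is consistent with the guide's statement that a single 4-GPU Blackwell node fits the NVFP4 weights plus KV cache with headroom.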
@@ -200,4 +200,4 @@ Key options:
 # Notes
 
 * prefix-cache is not supported for Nemotron Super V3 yet, so please set `enable_block_reuse: false` when launching a server.
-* For detailed deployment instructions, see the [deployment guide](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/deployment-guide/deployment-guide-for-nemotron-3-super-on-trtllm.md).
+* For detailed deployment instructions, see the [deployment guide](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/deployment-guide/deployment-guide-for-nemotron-3-on-trtllm.md).
@@ -18,11 +18,12 @@ for feature details.
 # Defaults
 python quickstart_example.py
 python models/wan_t2v.py
-python models/wan_i2v.py
+python models/ltx2.py
 
 # With engine config (quant, parallelism, etc.)
 python models/wan_t2v.py --visual_gen_args configs/wan2.2-t2v-fp4-1gpu.yaml
 python models/wan_i2v.py --visual_gen_args configs/wan2.2-i2v-fp4-1gpu.yaml --image /path/to/image.png
+python models/ltx2.py --visual_gen_args configs/ltx2-t2v-fp8-1-gpu.yaml
 ```
 
 Install deps from the repo root: `pip install -r requirements-dev.txt`.