Commit 09ebc59

[TRTLLM-11548][doc] Add Qwen3.5 deployment guide doc (#15111)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
1 parent 5e3af40 commit 09ebc59

7 files changed

Lines changed: 97 additions & 29 deletions

docs/source/_static/config_db.json

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -36,6 +36,18 @@
3636
"model_url": "https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking",
3737
"scenario": "Max Throughput"
3838
},
39+
{
40+
"command": "trtllm-serve nvidia/Qwen3.5-397B-A17B-NVFP4 --config ${TRTLLM_DIR}/examples/configs/curated/qwen3.5.yaml",
41+
"config_filename": "qwen3.5.yaml",
42+
"config_github_url": "https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/configs/curated/qwen3.5.yaml",
43+
"config_path": "examples/configs/curated/qwen3.5.yaml",
44+
"config_raw_url": "https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/configs/curated/qwen3.5.yaml",
45+
"gpu_compatibility": "B200, B300, GB200, GB300",
46+
"model": "nvidia/Qwen3.5-397B-A17B-NVFP4",
47+
"model_display_name": "Qwen3.5-397B-A17B (NVFP4)",
48+
"model_url": "https://huggingface.co/nvidia/Qwen3.5-397B-A17B-NVFP4",
49+
"scenario": "Max Throughput"
50+
},
3951
{
4052
"command": "trtllm-serve Qwen/Qwen3-30B-A3B --config ${TRTLLM_DIR}/examples/configs/curated/qwen3.yaml",
4153
"config_filename": "qwen3.yaml",
@@ -3532,6 +3544,10 @@
35323544
"display_name": "Nemotron v3 Ultra (NVFP4)",
35333545
"url": "https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4"
35343546
},
3547+
"nvidia/Qwen3.5-397B-A17B-NVFP4": {
3548+
"display_name": "Qwen3.5-397B-A17B (NVFP4)",
3549+
"url": "https://huggingface.co/nvidia/Qwen3.5-397B-A17B-NVFP4"
3550+
},
35353551
"openai/gpt-oss-120b": {
35363552
"display_name": "gpt-oss-120b",
35373553
"url": "https://huggingface.co/openai/gpt-oss-120b"

docs/source/deployment-guide/deployment-guide-for-qwen3-next-on-trtllm.md renamed to docs/source/deployment-guide/deployment-guide-for-qwen3.5-on-trtllm.md

Lines changed: 54 additions & 26 deletions
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,8 @@
1-
# Deployment Guide for Qwen3 Next on TensorRT LLM - Blackwell & Hopper Hardware
1+
# Deployment Guide for Qwen3.5 on TensorRT LLM - Blackwell & Hopper Hardware
22

33
## Introduction
44

5-
This is a functional quick-start guide for running the Qwen3-Next model on TensorRT LLM. It focuses on a working setup with recommended defaults. Additional performance optimizations and support will be rolled out in future updates.
5+
This deployment guide provides step-by-step instructions for running the Qwen3.5-397B-A17B model using TensorRT LLM. It covers model access, environment setup, server configuration, and inference validation.
66

77
## Prerequisites
88

@@ -14,36 +14,63 @@ This is a functional quick-start guide for running the Qwen3-Next model on Tenso
1414

1515
## Models
1616

17-
* BF16 model: [Qwen3-Next-80B-A3B-Thinking](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking)
17+
* [nvidia/Qwen3.5-397B-A17B-NVFP4](https://huggingface.co/nvidia/Qwen3.5-397B-A17B-NVFP4)
18+
* [Qwen/Qwen3.5-397B-A17B](https://huggingface.co/Qwen/Qwen3.5-397B-A17B) (base, BF16)
19+
20+
## GPU Requirements
21+
22+
The NVFP4 checkpoint is the recommended (and minimum-footprint) deployment precision for Qwen3.5. It quantizes the linear layers in the MoE blocks to NVFP4 and uses an FP8 KV cache.
23+
24+
| Platform | Minimum GPUs |
25+
|----------|--------------|
26+
| B200 | 4x B200 |
27+
| B300 | 4x B300 |
28+
| GB200 | 4x GB200 |
29+
| GB300 | 4x GB300 |
30+
31+
The NVFP4 checkpoint has been validated on B200 with `tensor_parallel_size = 4`. A single node of 4 Blackwell GPUs fits the NVFP4 weights plus the KV cache with headroom.
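The headroom claim above can be sanity-checked with a rough back-of-the-envelope estimate. The sketch below is an illustration only: it assumes every one of the 397B parameters is stored at about 4.5 bits (4-bit values plus one 8-bit scale per 16-element block), whereas the guide states that only the MoE linear layers are quantized to NVFP4, so the real weight footprint is somewhat higher. The function name and the 4.5-bit figure are assumptions, not values taken from TensorRT LLM.

```python
def weight_gb_per_gpu(total_params: float, bits_per_weight: float, num_gpus: int) -> float:
    """Estimate the weight memory per GPU in GB (10^9 bytes), assuming an even shard."""
    total_bytes = total_params * bits_per_weight / 8
    return total_bytes / 1e9 / num_gpus


if __name__ == "__main__":
    # Assumption: all weights at ~4.5 bits (4-bit values + FP8 scale per 16 elements).
    per_gpu = weight_gb_per_gpu(total_params=397e9, bits_per_weight=4.5, num_gpus=4)
    print(f"Approximate weight memory per GPU: {per_gpu:.1f} GB")
```

Whatever remains on each GPU after the weights and runtime buffers are loaded is the pool that `kv_cache_config.free_gpu_memory_fraction` draws from.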
1832

1933
## Deployment Steps
2034

2135
### Run Docker Container
2236

23-
Build and run the docker container. See the [Docker guide](../../../docker/README.md) for details.
37+
Run the Docker container using the TensorRT LLM image from NVIDIA NGC.
38+
39+
```shell
40+
docker run --rm -it \
41+
--ipc=host \
42+
--gpus all \
43+
-p 8000:8000 \
44+
-v ~/.cache:/root/.cache:rw \
45+
--name tensorrt_llm \
46+
nvcr.io/nvidia/tensorrt-llm/release:x.y.z \
47+
/bin/bash
2448
```
25-
cd TensorRT-LLM
2649

27-
make -C docker release_build IMAGE_TAG=qwen3-next-local
50+
Note:
2851

29-
make -C docker release_run IMAGE_NAME=tensorrt_llm IMAGE_TAG=qwen3-next-local LOCAL_USER=1
30-
```
52+
* The command mounts your `~/.cache` directory into the container so that downloaded model checkpoints, which are saved to `~/.cache/huggingface/hub/` by default, persist across runs and do not need to be downloaded again each time you restart the container. If `~/.cache` does not exist, create it with `mkdir ~/.cache`.
53+
* You can mount additional directories and paths using the `-v <host_path>:<container_path>` flag if needed, such as mounting the downloaded weight paths.
54+
* The command also maps port `8000` from the container to your host so you can access the LLM API endpoint from your host.
55+
* See <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags> for all available containers. Containers built weekly from the main branch have an `rcN` suffix, while the monthly QA-tested releases have no `rcN` suffix. Use an `rc` release to get the latest model and feature support.
56+
57+
To use the latest main branch instead, build TensorRT LLM from source by following the steps at [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source.html).
3158

3259
### Recommended Performance Settings
3360

3461
We maintain YAML configuration files with recommended performance settings in the [`examples/configs`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/configs) directory. These config files are present in the TensorRT LLM container at the path `/app/tensorrt_llm/examples/configs`. You can use these out-of-the-box, or adjust them to your specific use case.
3562

3663
```shell
3764
TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
38-
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/qwen3-next.yaml
65+
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/qwen3.5.yaml
3966
```
4067

4168
Note: if you don't have access to the source code locally, you can manually create the YAML config file using the code in the dropdown below.
4269

4370
````{admonition} Show code
4471
:class: dropdown
4572
46-
```{literalinclude} ../../../examples/configs/curated/qwen3-next.yaml
73+
```{literalinclude} ../../../examples/configs/curated/qwen3.5.yaml
4774
---
4875
language: shell
4976
prepend: |
@@ -55,15 +82,18 @@ append: EOF
5582
```
5683
````
5784

85+
The config is a starting point tuned for max throughput on 4x B200; adjust the parallelism, batch sizes, and KV cache fraction to match your hardware and traffic pattern.
5886

5987
### Launch the TensorRT LLM Server
6088

61-
Below is an example command to launch the TensorRT LLM server with the Qwen3-Next model from within the container.
89+
Below is an example command to launch the TensorRT LLM server with the Qwen3.5 NVFP4 model from within the container.
6290

6391
```shell
64-
trtllm-serve Qwen/Qwen3-Next-80B-A3B-Thinking --host 0.0.0.0 --port 8000 --reasoning_parser deepseek-r1 --config ${EXTRA_LLM_API_FILE}
92+
trtllm-serve nvidia/Qwen3.5-397B-A17B-NVFP4 --host 0.0.0.0 --port 8000 --reasoning_parser qwen3_5 --tool_parser qwen3 --config ${EXTRA_LLM_API_FILE}
6593
```
6694

95+
Qwen3.5 uses the `qwen3_5` reasoning parser (its chat template pre-injects a `<think>` block, so reasoning starts at the beginning of the response). The `qwen3` tool parser handles the Qwen3 function-call format.
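To illustrate why a dedicated reasoning parser is needed, here is a minimal sketch of the splitting logic implied by the paragraph above: because the opening `<think>` tag is already part of the prompt, the generated text contains only the closing `</think>`, and everything before it is reasoning. This is not the TensorRT LLM implementation; `split_reasoning` is a hypothetical helper, and treating a response with no closing tag as pure reasoning is an assumption made for this example.

```python
def split_reasoning(text: str, end_tag: str = "</think>") -> tuple[str, str]:
    """Split generated text into (reasoning, answer) at the first closing think tag."""
    reasoning, sep, answer = text.partition(end_tag)
    if not sep:
        # Assumption for this sketch: no closing tag means the output was
        # cut off while the model was still reasoning.
        return text.strip(), ""
    return reasoning.strip(), answer.strip()


if __name__ == "__main__":
    raw = "The user asks about a location.\n</think>\n\nNew York is in the northeastern United States."
    reasoning, answer = split_reasoning(raw)
    print("reasoning:", reasoning)
    print("answer:", answer)
```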
96+
6797
After the server is set up, the client can now send prompt requests to the server and receive results.
6898

6999
### LLM API Options (YAML Configuration)
@@ -80,12 +110,15 @@ These options provide control over TensorRT LLM's behavior and are set within th
80110

81111
* **Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tensor_parallel_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models.
82112

113+
#### `enable_attention_dp`
114+
115+
* **Description:** Enables **attention data parallelism** for the attention and linear-attention layers while the MoE layers remain expert-parallel. This generally improves throughput at high concurrency and long context.
116+
83117
#### `kv_cache_config.free_gpu_memory_fraction`
84118

85119
* **Description:** A value between `0.0` and `1.0` that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.
86120
* **Recommendation:** If you experience OOM errors, try reducing this value to `0.7` or lower.
87121

88-
89122
#### `max_batch_size`
90123

91124
* **Description:** The maximum number of user requests that can be grouped into a single batch for processing. The actual max batch size that can be achieved depends on total sequence length (input + output).
@@ -147,7 +180,7 @@ After the TensorRT LLM server is set up and shows Application startup complete,
147180

148181
```shell
149182
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
150-
"model": "Qwen/Qwen3-Next-80B-A3B-Thinking",
183+
"model": "nvidia/Qwen3.5-397B-A17B-NVFP4",
151184
"messages": [
152185
{
153186
"role": "user",
@@ -159,21 +192,14 @@ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/jso
159192
}' -w "\n"
160193
```
161194

162-
Here is an example response:
163-
164-
```
165-
{"id":"chatcmpl-64ac201c77bf46a7a3a4eca7759b1fd8","object":"chat.completion","created":1759022940,"model":"Qwen/Qwen3-Next-80B-A3B-Thinking","choices":[{"index":0,"message":{"role":"assistant","content":"Okay, the user is asking \"Where is New York?\" Hmm, this seems straightforward but I need to be careful. New York could mean different things—maybe they're confused about the city versus the state. \n\nFirst thought: Are they a tourist planning a trip? Or maybe a student doing homework? Could even be someone国外 who's only heard \"New York\" in movies and isn't sure if it's a city or state. \n\nI should clarify both possibilities immediately. People often mix them up. Like, if someone says \"I'm going to New York\" they're probably talking about NYC, but technically New York State is bigger. \n\nLet me break it down: \n- New York City (NYC) is the famous one—Manhattan, skyscrapers, Times Square. \n- Then New York State (NY) is the whole state, which includes NYC but also upstate areas like Albany (the capital), Buffalo, and even the Adirondacks. \n\nWait, should I mention that NYC is in New York State? Yeah, that's crucial. Otherwise they might think it's two separate things. Also, where is the state located? Northeast US, borders other states like Pennsylvania, New Jersey... maybe name a few neighbors for context. \n\nOh! And the city vs state confusion is super common. Like, people say \"New York\" for the city but forget it's part of a larger state. Should I give examples? \n- \"If someone says 'hot dogs in New York' they mean NYC\" \n- But \"the New York State Police\" covers the whole state\" \n\n...Maybe add a fun fact? Like how NYC is just one city within the state? Or that the state capital isn't NYC but Albany? Yeah, that'll help clear confusion. \n\nAlso—should I ask if they meant the city or the state? Nah, better to cover both upfront. Keep it simple but precise. No jargon. 
They might not know terms like \"metropolitan area\" or \"boroughs,\" so explain \"boroughs\" as neighborhoods (Manhattan, Brooklyn etc.). \n\nCoordinates: 40.7°N, 74°W? Maybe overkill unless they're into geography. But for the state, roughly 42°N to 45°N latitude? Hmm, probably unnecessary. Stick to landmarks: borders Lake Erie, Atlantic coast... \n\nWait—user might be asking literally for GPS coordinates? Unlikely, but just in case I can add: \"Geographically, New York City is at approximately 40.7°N latitude and 74.0°W longitude...\" but only if the answer feels too dry otherwise. \n\nPriority: Clear, simple, correct. No assuming prior knowledge. Say \"New York can refer to...\" then split into two cases. End with \"Most people mean the city when they say 'New York' in casual talk.\" That’s the key takeaway. \n\n...Double-check: Is Albany really the capital? Yes. And NYC is the largest city but not capital. Got it. Also, New York State is the 27th largest by area? Maybe not relevant. Stick to basics. \n\nFinal structure: \n1. \"New York can mean two things: the city or the state.\" \n2. For city: \n - Where it is (Northeast US) \n - Inside New York State \n - Famous parts (Manhattan etc.) \n3. For state: \n - Larger area, includes city \n - Borders (PA, NJ, CT, MA, VT, Canada) \n - Capital is Albany (not NYC!) \n4. Clear summary: \"In everyday talk, 'New York' usually means NYC.\" \n\nYeah. That covers it without overwhelming them. No markdown, just plain English. Keep it warm—add a \"hope that helps!\" at the end.\n</think>\n\nThat's a common point of confusion! **\"New York\" can refer to two different things**, and it's important to distinguish between them:\n\n### 1. **New York City (NYC)** \n - **What it is**: The largest and most famous city in the United States, known for landmarks like Times Square, Central Park, the Statue of Liberty, and Wall Street. \n - **Where it is**: \n - Located in the **northeastern United States**. 
\n - Situated at the mouth of the **Hudson River**, where it meets the **Atlantic Ocean**. \n - Part of **New York State** (see below). \n - **Geographic details**: \n - Coordinates: Approximately **40.7° N latitude, 74.0° W longitude**. \n - Composed of **5 boroughs**: Manhattan (the \"city\" most people picture), Brooklyn, Queens, The Bronx, and Staten Island. \n - Panoramic view of NYC (including Brooklyn and New Jersey skyline):","reasoning_content":null,"reasoning":null,"tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null,"mm_embedding_handle":null,"disaggregated_params":null,"avg_decoded_tokens_per_iter":1.0}],"usage":{"prompt_tokens":15,"total_tokens":1039,"completion_tokens":1024},"prompt_token_ids":null}
166-
```
167-
168195
### Troubleshooting Tips
169196

170-
* If you encounter CUDA out-of-memory errors, try reducing `max_batch_size` or `max_seq_len`.
197+
* If you encounter CUDA out-of-memory errors, try reducing `max_batch_size`, `max_num_tokens`, or `kv_cache_config.free_gpu_memory_fraction`.
171198
* Ensure your model checkpoints are compatible with the expected format.
172-
* For performance issues, check GPU utilization with nvidia-smi while the server is running.
199+
* For performance issues, check GPU utilization with `nvidia-smi` while the server is running.
173200
* If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed.
174201
* For connection issues, make sure the server port (`8000` in this guide) is not being used by another application.
175-
* If you are using trtllm-serve and the thinking model of Qwen3-Next, make sure to add this server arg `--reasoning_parser deepseek-r1`.
176-
202+
* Reasoning is controlled with `--reasoning_parser qwen3_5`. To toggle thinking per request, pass `enable_thinking` through `chat_template_kwargs` in the request body, for example `{"chat_template_kwargs": {"enable_thinking": true}}` (set it to `false` to disable reasoning).
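As a sketch of the per-request toggle described in the last tip, the snippet below builds a chat completion request that carries `enable_thinking` inside `chat_template_kwargs` and posts it to the endpoint used in this guide. The helper names, the example prompt, and the `max_tokens` value are illustrative assumptions; only the `chat_template_kwargs` structure, the model name, and the port come from the guide.

```python
import json
import urllib.request


def build_chat_request(prompt: str, enable_thinking: bool,
                       model: str = "nvidia/Qwen3.5-397B-A17B-NVFP4",
                       max_tokens: int = 1024) -> dict:
    """Build an OpenAI-style chat completion body with the thinking toggle set."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # illustrative value, adjust as needed
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def send_chat_request(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to the chat completions endpoint and return the parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    payload = build_chat_request("Where is New York?", enable_thinking=False)
    print(json.dumps(payload, indent=2))
    # Uncomment once the server reports that application startup is complete:
    # print(json.dumps(send_chat_request(payload), indent=2))
```

Keeping payload construction separate from the network call makes it easy to inspect the exact body before sending it to a running server.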
177203

178204
## Benchmarking Performance
179205

@@ -184,16 +210,18 @@ cat <<'EOF' > bench.sh
184210
#!/usr/bin/env bash
185211
set -euo pipefail
186212
213+
MODEL_NAME="nvidia/Qwen3.5-397B-A17B-NVFP4"
214+
187215
concurrency_list="1 2 4 8 16 32 64 128 256"
188216
multi_round=5
189217
isl=1024
190218
osl=1024
191-
result_dir=/tmp/qwen3_output
219+
result_dir=/tmp/qwen3_5_output
192220
193221
for concurrency in ${concurrency_list}; do
194222
num_prompts=$((concurrency * multi_round))
195223
python -m tensorrt_llm.serve.scripts.benchmark_serving \
196-
--model Qwen/Qwen3-Next-80B-A3B-Thinking \
224+
--model "${MODEL_NAME}" \
197225
--backend openai \
198226
--dataset-name "random" \
199227
--random-input-len ${isl} \

docs/source/deployment-guide/index.rst

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -34,6 +34,6 @@ The deployment guides below provide more detailed instructions for serving speci
3434
deployment-guide-for-llama4-scout-on-trtllm.md
3535
deployment-guide-for-gpt-oss-on-trtllm.md
3636
deployment-guide-for-qwen3-on-trtllm.md
37-
deployment-guide-for-qwen3-next-on-trtllm.md
37+
deployment-guide-for-qwen3.5-on-trtllm.md
3838
deployment-guide-for-kimi-k2-thinking-on-trtllm.md
3939
deployment-guide-for-glm-5-on-trtllm.md

docs/source/models/supported-models.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -47,7 +47,7 @@ The following is a table of supported models for the PyTorch backend:
4747
| `Qwen3ForCausalLM` | Qwen3 | `Qwen/Qwen3-8B` |
4848
| `Qwen3MoeForCausalLM` | Qwen3MoE | `Qwen/Qwen3-30B-A3B` |
4949
| `Qwen3NextForCausalLM` | Qwen3Next | `Qwen/Qwen3-Next-80B-A3B-Thinking` |
50-
| `Qwen3_5MoeForCausalLM` [^5] | Qwen3.5-MoE | `Qwen/Qwen3.5-397B-A17B` |
50+
| `Qwen3_5MoeForCausalLM` | Qwen3.5-MoE | `Qwen/Qwen3.5-397B-A17B` |
5151
| `SeedOssForCausalLM` [^5] | Seed OSS, Seed-Coder | `ByteDance-Seed/Seed-OSS-36B-Instruct` |
5252
| `SkyworkR1V2ForConditionalGeneration` [^5] | Skywork R1V2, Skywork SWE | `Skywork/Skywork-R1V2-38B` |
5353
| `SmolLM3ForCausalLM` [^5] | SmolLM3 | `HuggingFaceTB/SmolLM3-3B` |
@@ -65,9 +65,9 @@ Note: Support for other models may vary. Features marked "N/A" are not applicabl
6565
| `Glm4MoeForCausalLM` | Yes | Yes | Yes | Untested | Yes | Yes | No | No | No | Yes | Yes | Untested | N/A | Yes | Yes |
6666
| `Qwen3MoeForCausalLM` | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | No | Yes | Yes | Yes | N/A | Yes | Yes |
6767
| `Qwen3NextForCausalLM` [^3] | Yes | Yes | Yes | Untested | Yes | No | No | No | No | Yes | Yes | No | No | Untested | Untested |
68+
| `Qwen3_5MoeForCausalLM` | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | Yes | Untested | Yes | N/A | Untested | Untested |
6869
| `Llama4ForConditionalGeneration` | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | No | Yes | Yes | Untested | N/A | Yes | Yes |
6970
| `GptOssForCausalLM` | Yes | Yes | Yes | Yes | Yes | No | Yes | No | Yes | Yes | Yes | Yes | N/A | Yes | Yes |
70-
| `Qwen3_5MoeForCausalLM` [^5] | Yes | Yes | Untested | Untested | Yes | No | No | No | No | Yes | Untested | Yes | N/A | Untested | Untested |
7171
| `Glm4MoeLiteForCausalLM` [^5] | Yes | Yes | Untested | Untested | Yes | No | No | No | No | Yes | Untested | Untested | N/A | Untested | Untested |
7272
| `NemotronHForCausalLM` | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | Yes | Yes | Yes | N/A | Untested | Untested |
7373
| `Gemma4ForConditionalGeneration` | Untested | Yes | Untested | No | Yes | No | No | No | No | Yes | Untested | No | Yes | Untested | Untested |
