You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: docs/source/blogs/tech_blog/blog02_DeepSeek_R1_MTP_Implementation_and_Optimization.md
+2-2Lines changed: 2 additions & 2 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -114,7 +114,7 @@ Run DeepSeek-V3/R1 models with MTP, use [examples/llm-api/quickstart_advanced.py
114
114
115
115
```bash
116
116
cd examples/llm-api
117
-
python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --spec_decode_algo MTP --spec_decode_nextn N
117
+
python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --spec_decode_algo MTP --spec_decode_max_draft_len N
118
118
```
119
119
120
120
To benchmark min-latency performance with MTP, you need to follow [this document](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/models/core/deepseek_v3/README.md#6-dataset-preparation) to prepare your dataset, then follow the steps below:
@@ -170,7 +170,7 @@ Run DeepSeek-R1 models with MTP Relaxed Acceptance, use [examples/llm-api/quicks
To benchmark min-latency performance with MTP Relaxed Acceptance, you need to follow [this document](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/models/core/deepseek_v3/README.md#6-dataset-preparation) to prepare your dataset, then follow the steps below:
Copy file name to clipboardExpand all lines: docs/source/blogs/tech_blog/blog03_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md
+1-1Lines changed: 1 addition & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -157,7 +157,7 @@ These optimizations target the overall execution flow, scheduling, and resource
157
157
158
158
There is a feature called CUDA Graph padding in TensorRT LLM, which is a good trade-off between the number of CUDA Graphs and the CUDA Graph hit ratio; it tries to pad a batch to the nearest one with a captured CUDA Graph. Normally you should enable the CUDA Graph padding feature to increase the CUDA Graph hit rate, but the padding itself has some overhead due to wasted tokens computation.
159
159
160
-
Users can opt-out the CUDA Graph padding feature to see the perf benefits, by setting the `cuda_graph_config:\n enable_padding: False`, see API here [Pytorch backend config](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/pyexecutor/config.py#L41)
160
+
Users can opt-out the CUDA Graph padding feature to see the perf benefits, by setting the `cuda_graph_config:\n enable_padding: False`, see API here [CudaGraphConfig](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/llmapi/llm_args.py#L102)
Copy file name to clipboardExpand all lines: docs/source/blogs/tech_blog/blog04_Scaling_Expert_Parallelism_in_TensorRT-LLM.md
+9-9Lines changed: 9 additions & 9 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -396,7 +396,7 @@ On servers without C2C but with PCIe, if cross-node communication is required, n
396
396
397
397
### Offline EP Load Balancer
398
398
399
-
Online EP balancer is more suitable for production deployment needs to react timely to online traffic changes. However, Offline EP Balancer provides a lightweight way for performance study/debugging and validation. You can refer to [this PR](https://github.com/NVIDIA/TensorRT-LLM/pull/4695) to learn more about the implementation of the Offline EP Load Balancer. Also there is a tool provided to collect statistics about the expert activation distribution which can be used as the input to deduce the EP balancing placement strategy. You can refer to [this](https://github.com/NVIDIA/TensorRT-LLM/tree/feat/large-ep/examples/ep_load_balancer#offline-ep-load-balancer) doc to learn more details as well as how how to run through the Offline EP Load Balancer in E2E approach.
399
+
Online EP balancer is more suitable for production deployment needs to react timely to online traffic changes. However, Offline EP Balancer provides a lightweight way for performance study/debugging and validation. You can refer to [this PR](https://github.com/NVIDIA/TensorRT-LLM/pull/4695) to learn more about the implementation of the Offline EP Load Balancer. Also there is a tool provided to collect statistics about the expert activation distribution which can be used as the input to deduce the EP balancing placement strategy. You can refer to [this](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep/ep_load_balancer#offline-ep-load-balancer) doc to learn more details as well as how to run through the Offline EP Load Balancer in E2E approach.
400
400
401
401
## E2E evaluation
402
402
@@ -519,7 +519,7 @@ The code and scripts required in the reproducing steps described in this section
519
519
520
520
### The effect of EP Load Balancer
521
521
522
-
Please, refer to the [EP Load Balancer example](https://github.com/NVIDIA/TensorRT-LLM/tree/feat/large-ep/examples/ep_load_balancer) for how to reproduce the results for the offline EP Load Balancer.
522
+
Please, refer to the [EP Load Balancer example](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep/ep_load_balancer) for how to reproduce the results for the offline EP Load Balancer.
523
523
524
524
##### Step 1: Run inference and collect statistics
Prepare a dataset following the [benchmarking documentation](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/developer-guide/perf-benchmarking.md#preparing-a-dataset) and save it as `./dataset.json`.
539
+
Prepare a dataset following the [benchmarking documentation](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-benchmarking.md#preparing-a-dataset) and save it as `./dataset.json`.
540
540
541
541
Run 32-way expert parallelism inference on the prepared dataset. Please refer to the [LLM API MGMN example](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llm-api/llm_mgmn_trtllm_bench.sh) for details on running `trtllm-bench` on Slurm.
After inference, review the dumped statistic files in `$EXPERT_STATISTIC_PATH`. Run the `examples/ep_load_balancer/report_load_statistics.py` script to show the standard deviation and imbalance ratio metrics:
562
+
After inference, review the dumped statistic files in `$EXPERT_STATISTIC_PATH`. Run the `examples/wide_ep/ep_load_balancer/report_load_statistics.py` script to show the standard deviation and imbalance ratio metrics:
@@ -582,10 +582,10 @@ average 1024.0 491.651199 1.564272
582
582
583
583
##### Step 2: Generate the EPLB configuration
584
584
585
-
Use the provided `examples/ep_load_balancer/generate_eplb_config.py` script to convert the collected statistics into an EPLB configuration file. Specify the target expert parallelism size (`--ep_size`) and the total number of slots (`--num_slots`) that will be used for deployment. For example, if we choose to maintain 8 expert slots per rank while increasing expert parallelism to 36 ways, there should be 32 redundant experts and 288 expert slots in total.
585
+
Use the provided `examples/wide_ep/ep_load_balancer/generate_eplb_config.py` script to convert the collected statistics into an EPLB configuration file. Specify the target expert parallelism size (`--ep_size`) and the total number of slots (`--num_slots`) that will be used for deployment. For example, if we choose to maintain 8 expert slots per rank while increasing expert parallelism to 36 ways, there should be 32 redundant experts and 288 expert slots in total.
Copy file name to clipboardExpand all lines: docs/source/blogs/tech_blog/blog08_Scaling_Expert_Parallelism_in_TensorRT-LLM_part2.md
+2-2Lines changed: 2 additions & 2 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -249,7 +249,7 @@ To address the increased host overhead when scaling parallelism in the system, w
249
249
250
250
TensorRT LLM is designed to be composed of both C++ and Python code, so that C++ can handle the most performance-sensitive parts while Python handles higher-level logic. As we try to put more logic into Python to make the program easier to read and debug, there are still frequent conversations through binding interfaces between C++ and Python. Besides, since most of the logic is implemented in Python, there are several layers of implementation that communicate with each other through inter-process communication overhead. Frequent binding calls and serialization/deserialization introduced by inter-process communication slow down the core library.
251
251
252
-
To improve program efficiency, we used environment variables introduced in the [performance analysis guidance](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/developer-guide/perf-analysis.md) to measure and profile CPU overhead, and improved performance by reducing and reusing different binding calls as much as possible, and delaying Python object deserialization to avoid duplicated serialization and reduce message size when doing inter-process communication. This optimization was added in [PR 5224](https://github.com/NVIDIA/TensorRT-LLM/pull/5224). We have also reduced Python garbage collection (GC) impacts in [PR 5141](https://github.com/NVIDIA/TensorRT-LLM/pull/5141).
252
+
To improve program efficiency, we used environment variables introduced in the [performance analysis guidance](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-analysis.md) to measure and profile CPU overhead, and improved performance by reducing and reusing different binding calls as much as possible, and delaying Python object deserialization to avoid duplicated serialization and reduce message size when doing inter-process communication. This optimization was added in [PR 5224](https://github.com/NVIDIA/TensorRT-LLM/pull/5224). We have also reduced Python garbage collection (GC) impacts in [PR 5141](https://github.com/NVIDIA/TensorRT-LLM/pull/5141).
253
253
254
254
To enable powerful NVTX markers for easier analysis of host overheads, TensorRT LLM provides several useful environment variables:
255
255
@@ -307,7 +307,7 @@ When enabling MTP, there is an extra performance boost compared to the baseline.
307
307
</div>
308
308
<p align="center"><sub><em>Figure 8: DeepSeek R1 throughput on ISL/OSL 8k/1k with MTP enabled.</em></sub></p>
309
309
310
-
To reproduce the numbers, refer to the [`examples/wide_ep/slurm_scripts`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep/slurm_scripts) directory. The scripts there demonstrate how to launch TensorRT LLM disaggregated serving with large-scale EP and other features enabled on a SLURM cluster.
310
+
To reproduce the numbers, refer to the [`examples/wide_ep`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep) directory for wide EP examples. Note: the `slurm_scripts` subdirectory is not yet available in the public repository; SLURM-based launch scripts for disaggregated serving with large-scale EP will be added in a future update.
Copy file name to clipboardExpand all lines: docs/source/blogs/tech_blog/blog09_Deploying_GPT_OSS_on_TRTLLM.md
+3-3Lines changed: 3 additions & 3 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -19,7 +19,7 @@ We have a forthcoming guide for getting great performance on H100, however this
19
19
20
20
## Launching the TensorRT LLM docker container
21
21
22
-
The container image that you will use will be pulled from NVIDIA's NGC. This container is multi-platform and will run on both x64 and arm64 architectures: `nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev`
22
+
The container image that you will use will be pulled from NVIDIA's NGC. This container is multi-platform and will run on both x64 and arm64 architectures: `nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc1`
23
23
24
24
Run the follow docker command to start the TensorRT LLM container in interactive mode:
25
25
@@ -171,7 +171,7 @@ Currently, the best throughput **19.5k tps/gpu** is achieved with DP4EP4 using 4
171
171
172
172
## Launch the TensorRT-LLM Server
173
173
174
-
We can use `trtllm-serve` to serve the model by translating the benchmark commands above. For low-latency configuration, run:
174
+
We can use `trtllm-serve` to serve the model by translating the benchmark commands above. For low-latency configuration, run:
175
175
**Note:** You can also point to a local path containing the model weights instead of the HF repo (e.g., `${local_model_path}`).
176
176
177
177
```bash
@@ -357,7 +357,7 @@ moe_config:
357
357
358
358
## Troubleshooting Tips
359
359
360
-
- If you encounter CUDA out-of-memory errors, try reducing `--max_batch_size`, `--max_num_tokens`, or `--kv_cache_free_gpu_memory_fraction`. See the [doc](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/performance/performance-tuning-guide/tuning-max-batch-size-and-max-num-tokens.md) for the explanation of these parameters.
360
+
- If you encounter CUDA out-of-memory errors, try reducing `--max_batch_size`, `--max_num_tokens`, or `--kv_cache_free_gpu_memory_fraction`. See the [doc](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/performance-tuning-guide/tuning-max-batch-size-and-max-num-tokens.md) for the explanation of these parameters.
361
361
- Add `print_iter_log: true` to extra LLM API options YAML file to inspect the per-iteration log.
362
362
- Check GPU utilization with `nvidia-smi` while the server is running to inspect GPU status and memory usage.
363
363
- If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed
Copy file name to clipboardExpand all lines: docs/source/blogs/tech_blog/blog11_GPT_OSS_Eagle3.md
+7-6Lines changed: 7 additions & 6 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -17,17 +17,18 @@ Expected directory layout on the host (example):
17
17
└─ eagle/ # Eagle3 speculative decoding assets
18
18
```
19
19
20
-
### Get the TensorRT LLM Container (1.1.0rc0)
20
+
### Get the TensorRT LLM Container
21
21
22
-
If required by your environment, log into NGC and pull the image:
22
+
If required by your environment, log into NGC and pull the image. Check the [NGC catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release) for the latest available release tag:
23
23
24
24
```bash
25
25
# Create an API key at https://ngc.nvidia.com (if you don't have one)
0 commit comments