Commit 90c1cb7

schetlur-nv and claude authored
[None][doc] fix outdated code references in tech blogs 2, 3, 4, 8, 9, 11 (#12338)
Signed-off-by: schetlur <schetlur@nvidia.com>
Signed-off-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 1d77ec5 commit 90c1cb7

6 files changed

Lines changed: 24 additions & 23 deletions

docs/source/blogs/tech_blog/blog02_DeepSeek_R1_MTP_Implementation_and_Optimization.md

Lines changed: 2 additions & 2 deletions
@@ -114,7 +114,7 @@ Run DeepSeek-V3/R1 models with MTP, use [examples/llm-api/quickstart_advanced.py

 ```bash
 cd examples/llm-api
-python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --spec_decode_algo MTP --spec_decode_nextn N
+python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --spec_decode_algo MTP --spec_decode_max_draft_len N
 ```

 To benchmark min-latency performance with MTP, you need to follow [this document](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/models/core/deepseek_v3/README.md#6-dataset-preparation) to prepare your dataset, then follow the steps below:
@@ -170,7 +170,7 @@ Run DeepSeek-R1 models with MTP Relaxed Acceptance, use [examples/llm-api/quicks

 ```bash
 cd examples/llm-api
-python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --spec_decode_algo MTP --spec_decode_nextn N --use_relaxed_acceptance_for_thinking --relaxed_topk 10 --relaxed_delta 0.6
+python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --spec_decode_algo MTP --spec_decode_max_draft_len N --use_relaxed_acceptance_for_thinking --relaxed_topk 10 --relaxed_delta 0.6
 ```

 To benchmark min-latency performance with MTP Relaxed Acceptance, you need to follow [this document](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/models/core/deepseek_v3/README.md#6-dataset-preparation) to prepare your dataset, then follow the steps below:

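The flag rename in the two hunks above (`--spec_decode_nextn` to `--spec_decode_max_draft_len`) can be applied to saved commands mechanically. The following is a minimal sketch, not part of TensorRT-LLM: the helper `update_flags` and the `RENAMED_FLAGS` table are hypothetical names, and the table contains only the one rename shown in this commit.

```python
import re

# Rename taken from the blog02 hunks in this commit. Extend the table only
# with renames you have verified against your installed version.
RENAMED_FLAGS = {
    "--spec_decode_nextn": "--spec_decode_max_draft_len",
}


def update_flags(command: str) -> str:
    """Rewrite outdated flag names in a shell command string."""
    for old, new in RENAMED_FLAGS.items():
        # Match the flag only as a whole token: it must start the string or
        # follow whitespace, and be followed by whitespace, '=' or the end.
        pattern = rf"(?<!\S){re.escape(old)}(?=\s|=|$)"
        command = re.sub(pattern, new, command)
    return command


old_cmd = "python quickstart_advanced.py --spec_decode_algo MTP --spec_decode_nextn 3"
new_cmd = update_flags(old_cmd)
```

The whole-token match keeps the helper from touching a longer flag that merely starts with the old name.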
docs/source/blogs/tech_blog/blog03_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md

Lines changed: 1 addition & 1 deletion
@@ -157,7 +157,7 @@ These optimizations target the overall execution flow, scheduling, and resource

 There is a feature called CUDA Graph padding in TensorRT LLM, which is a good trade-off between the number of CUDA Graphs and the CUDA Graph hit ratio; it tries to pad a batch to the nearest one with a captured CUDA Graph. Normally you should enable the CUDA Graph padding feature to increase the CUDA Graph hit rate, but the padding itself has some overhead due to wasted tokens computation.

-Users can opt-out the CUDA Graph padding feature to see the perf benefits, by setting the `cuda_graph_config:\n enable_padding: False`, see API here [Pytorch backend config](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/pyexecutor/config.py#L41)
+Users can opt-out the CUDA Graph padding feature to see the perf benefits, by setting the `cuda_graph_config:\n enable_padding: False`, see API here [CudaGraphConfig](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/llmapi/llm_args.py#L102)

 * Overlap Scheduler:

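The padding trade-off described in the hunk above can be illustrated with a small sketch. This is not TensorRT-LLM's implementation: the function `pad_to_captured` and the list of captured sizes are hypothetical, and the sketch assumes "nearest" means rounding up to the next captured batch size, since a graph captured for a smaller batch cannot hold a larger one.

```python
from bisect import bisect_left


def pad_to_captured(batch_size, captured_sizes):
    """Return the smallest captured batch size >= batch_size, or None if none fits."""
    sizes = sorted(captured_sizes)
    i = bisect_left(sizes, batch_size)
    return sizes[i] if i < len(sizes) else None


# Hypothetical set of batch sizes for which a CUDA Graph was captured.
captured = [1, 2, 4, 8, 16, 32]

actual = 11
padded = pad_to_captured(actual, captured)
# Padded slots are computed and then discarded: this is the overhead that
# padding trades for a higher graph hit rate.
wasted = padded - actual
```

With fewer captured sizes the hit rate stays high only through more padding, which is the overhead that disabling `enable_padding` avoids.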
docs/source/blogs/tech_blog/blog04_Scaling_Expert_Parallelism_in_TensorRT-LLM.md

Lines changed: 9 additions & 9 deletions
@@ -396,7 +396,7 @@ On servers without C2C but with PCIe, if cross-node communication is required, n

 ### Offline EP Load Balancer

-Online EP balancer is more suitable for production deployment needs to react timely to online traffic changes. However, Offline EP Balancer provides a lightweight way for performance study/debugging and validation. You can refer to [this PR](https://github.com/NVIDIA/TensorRT-LLM/pull/4695) to learn more about the implementation of the Offline EP Load Balancer. Also there is a tool provided to collect statistics about the expert activation distribution which can be used as the input to deduce the EP balancing placement strategy. You can refer to [this](https://github.com/NVIDIA/TensorRT-LLM/tree/feat/large-ep/examples/ep_load_balancer#offline-ep-load-balancer) doc to learn more details as well as how how to run through the Offline EP Load Balancer in E2E approach.
+Online EP balancer is more suitable for production deployment needs to react timely to online traffic changes. However, Offline EP Balancer provides a lightweight way for performance study/debugging and validation. You can refer to [this PR](https://github.com/NVIDIA/TensorRT-LLM/pull/4695) to learn more about the implementation of the Offline EP Load Balancer. Also there is a tool provided to collect statistics about the expert activation distribution which can be used as the input to deduce the EP balancing placement strategy. You can refer to [this](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep/ep_load_balancer#offline-ep-load-balancer) doc to learn more details as well as how to run through the Offline EP Load Balancer in E2E approach.

 ## E2E evaluation

@@ -519,7 +519,7 @@ The code and scripts required in the reproducing steps described in this section

 ### The effect of EP Load Balancer

-Please, refer to the [EP Load Balancer example](https://github.com/NVIDIA/TensorRT-LLM/tree/feat/large-ep/examples/ep_load_balancer) for how to reproduce the results for the offline EP Load Balancer.
+Please, refer to the [EP Load Balancer example](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep/ep_load_balancer) for how to reproduce the results for the offline EP Load Balancer.

 ##### Step 1: Run inference and collect statistics

@@ -536,7 +536,7 @@ export EXPERT_STATISTIC_PATH=./expert_statistic
 export EXPERT_STATISTIC_ITER_RANGE=100-200
 ```

-Prepare a dataset following the [benchmarking documentation](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/developer-guide/perf-benchmarking.md#preparing-a-dataset) and save it as `./dataset.json`.
+Prepare a dataset following the [benchmarking documentation](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-benchmarking.md#preparing-a-dataset) and save it as `./dataset.json`.

 Run 32-way expert parallelism inference on the prepared dataset. Please refer to the [LLM API MGMN example](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llm-api/llm_mgmn_trtllm_bench.sh) for details on running `trtllm-bench` on Slurm.

@@ -559,10 +559,10 @@ trtllm-bench --model ${MODEL_NAME} \
 --eos_id -1
 ```

-After inference, review the dumped statistic files in `$EXPERT_STATISTIC_PATH`. Run the `examples/ep_load_balancer/report_load_statistics.py` script to show the standard deviation and imbalance ratio metrics:
+After inference, review the dumped statistic files in `$EXPERT_STATISTIC_PATH`. Run the `examples/wide_ep/ep_load_balancer/report_load_statistics.py` script to show the standard deviation and imbalance ratio metrics:

 ```bash
-python examples/ep_load_balancer/report_load_statistics.py --expert_statistic_path $EXPERT_STATISTIC_PATH
+python examples/wide_ep/ep_load_balancer/report_load_statistics.py --expert_statistic_path $EXPERT_STATISTIC_PATH
 ```

 The output would look like:
@@ -582,10 +582,10 @@ average 1024.0 491.651199 1.564272

 ##### Step 2: Generate the EPLB configuration

-Use the provided `examples/ep_load_balancer/generate_eplb_config.py` script to convert the collected statistics into an EPLB configuration file. Specify the target expert parallelism size (`--ep_size`) and the total number of slots (`--num_slots`) that will be used for deployment. For example, if we choose to maintain 8 expert slots per rank while increasing expert parallelism to 36 ways, there should be 32 redundant experts and 288 expert slots in total.
+Use the provided `examples/wide_ep/ep_load_balancer/generate_eplb_config.py` script to convert the collected statistics into an EPLB configuration file. Specify the target expert parallelism size (`--ep_size`) and the total number of slots (`--num_slots`) that will be used for deployment. For example, if we choose to maintain 8 expert slots per rank while increasing expert parallelism to 36 ways, there should be 32 redundant experts and 288 expert slots in total.

 ```bash
-python examples/ep_load_balancer/generate_eplb_config.py \
+python examples/wide_ep/ep_load_balancer/generate_eplb_config.py \
 --ep_size 36 \
 --num_slots 288 \
 --expert_statistic_path $EXPERT_STATISTIC_PATH \
@@ -641,10 +641,10 @@ trtllm-bench --model ${MODEL_NAME} \
 --eos_id -1
 ```

-Run the `examples/ep_load_balancer/report_load_statistics.py` script again:
+Run the `examples/wide_ep/ep_load_balancer/report_load_statistics.py` script again:

 ```bash
-python examples/ep_load_balancer/report_load_statistics.py --expert_statistic_path $EXPERT_STATISTIC_PATH
+python examples/wide_ep/ep_load_balancer/report_load_statistics.py --expert_statistic_path $EXPERT_STATISTIC_PATH
 ```

 The output would look like:

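The slot arithmetic quoted in the Step 2 hunk of this file (8 slots per rank at 36-way EP gives 288 slots and 32 redundant experts) can be checked with a short sketch. The helper `eplb_slot_plan` is a hypothetical name, not part of the EPLB scripts, and the expert count of 256 is an assumption inferred from the figures in the text (288 − 32).

```python
def eplb_slot_plan(num_experts: int, ep_size: int, slots_per_rank: int) -> dict:
    """Compute the values that would be passed as --ep_size and --num_slots."""
    num_slots = ep_size * slots_per_rank
    if num_slots < num_experts:
        # Every expert needs at least one slot somewhere.
        raise ValueError("not enough slots to place every expert once")
    return {
        "num_slots": num_slots,
        # Slots beyond one per expert hold replicas of existing experts.
        "num_redundant_experts": num_slots - num_experts,
    }


# 256 experts is assumed here, inferred from 288 slots minus 32 redundant experts.
plan = eplb_slot_plan(num_experts=256, ep_size=36, slots_per_rank=8)
```

Computing the total this way makes it easy to try other `ep_size` values before passing `--num_slots` to `generate_eplb_config.py`.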
docs/source/blogs/tech_blog/blog08_Scaling_Expert_Parallelism_in_TensorRT-LLM_part2.md

Lines changed: 2 additions & 2 deletions
@@ -249,7 +249,7 @@ To address the increased host overhead when scaling parallelism in the system, w

 TensorRT LLM is designed to be composed of both C++ and Python code, so that C++ can handle the most performance-sensitive parts while Python handles higher-level logic. As we try to put more logic into Python to make the program easier to read and debug, there are still frequent conversations through binding interfaces between C++ and Python. Besides, since most of the logic is implemented in Python, there are several layers of implementation that communicate with each other through inter-process communication overhead. Frequent binding calls and serialization/deserialization introduced by inter-process communication slow down the core library.

-To improve program efficiency, we used environment variables introduced in the [performance analysis guidance](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/developer-guide/perf-analysis.md) to measure and profile CPU overhead, and improved performance by reducing and reusing different binding calls as much as possible, and delaying Python object deserialization to avoid duplicated serialization and reduce message size when doing inter-process communication. This optimization was added in [PR 5224](https://github.com/NVIDIA/TensorRT-LLM/pull/5224). We have also reduced Python garbage collection (GC) impacts in [PR 5141](https://github.com/NVIDIA/TensorRT-LLM/pull/5141).
+To improve program efficiency, we used environment variables introduced in the [performance analysis guidance](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-analysis.md) to measure and profile CPU overhead, and improved performance by reducing and reusing different binding calls as much as possible, and delaying Python object deserialization to avoid duplicated serialization and reduce message size when doing inter-process communication. This optimization was added in [PR 5224](https://github.com/NVIDIA/TensorRT-LLM/pull/5224). We have also reduced Python garbage collection (GC) impacts in [PR 5141](https://github.com/NVIDIA/TensorRT-LLM/pull/5141).

 To enable powerful NVTX markers for easier analysis of host overheads, TensorRT LLM provides several useful environment variables:

@@ -307,7 +307,7 @@ When enabling MTP, there is an extra performance boost compared to the baseline.
 </div>
 <p align="center"><sub><em>Figure 8: DeepSeek R1 throughput on ISL/OSL 8k/1k with MTP enabled.</em></sub></p>

-To reproduce the numbers, refer to the [`examples/wide_ep/slurm_scripts`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep/slurm_scripts) directory. The scripts there demonstrate how to launch TensorRT LLM disaggregated serving with large-scale EP and other features enabled on a SLURM cluster.
+To reproduce the numbers, refer to the [`examples/wide_ep`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep) directory for wide EP examples. Note: the `slurm_scripts` subdirectory is not yet available in the public repository; SLURM-based launch scripts for disaggregated serving with large-scale EP will be added in a future update.

 ## Future Work

docs/source/blogs/tech_blog/blog09_Deploying_GPT_OSS_on_TRTLLM.md

Lines changed: 3 additions & 3 deletions
@@ -19,7 +19,7 @@ We have a forthcoming guide for getting great performance on H100, however this

 ## Launching the TensorRT LLM docker container

-The container image that you will use will be pulled from NVIDIA's NGC. This container is multi-platform and will run on both x64 and arm64 architectures: `nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev`
+The container image that you will use will be pulled from NVIDIA's NGC. This container is multi-platform and will run on both x64 and arm64 architectures: `nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc1`

 Run the follow docker command to start the TensorRT LLM container in interactive mode:

@@ -171,7 +171,7 @@ Currently, the best throughput **19.5k tps/gpu** is achieved with DP4EP4 using 4

 ## Launch the TensorRT-LLM Server

-We can use `trtllm-serve` to serve the model by translating the benchmark commands above. For low-latency configuration, run:
+We can use `trtllm-serve` to serve the model by translating the benchmark commands above. For low-latency configuration, run:
 **Note:** You can also point to a local path containing the model weights instead of the HF repo (e.g., `${local_model_path}`).

 ```bash
@@ -357,7 +357,7 @@ moe_config:

 ## Troubleshooting Tips

-- If you encounter CUDA out-of-memory errors, try reducing `--max_batch_size`, `--max_num_tokens`, or `--kv_cache_free_gpu_memory_fraction`. See the [doc](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/performance/performance-tuning-guide/tuning-max-batch-size-and-max-num-tokens.md) for the explanation of these parameters.
+- If you encounter CUDA out-of-memory errors, try reducing `--max_batch_size`, `--max_num_tokens`, or `--kv_cache_free_gpu_memory_fraction`. See the [doc](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/performance-tuning-guide/tuning-max-batch-size-and-max-num-tokens.md) for the explanation of these parameters.
 - Add `print_iter_log: true` to extra LLM API options YAML file to inspect the per-iteration log.
 - Check GPU utilization with `nvidia-smi` while the server is running to inspect GPU status and memory usage.
 - If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed

docs/source/blogs/tech_blog/blog11_GPT_OSS_Eagle3.md

Lines changed: 7 additions & 6 deletions
@@ -17,17 +17,18 @@ Expected directory layout on the host (example):
 └─ eagle/ # Eagle3 speculative decoding assets
 ```

-### Get the TensorRT LLM Container (1.1.0rc0)
+### Get the TensorRT LLM Container

-If required by your environment, log into NGC and pull the image:
+If required by your environment, log into NGC and pull the image. Check the [NGC catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release) for the latest available release tag:

 ```bash
 # Create an API key at https://ngc.nvidia.com (if you don't have one)
 docker login nvcr.io
 # Username: $oauthtoken
 # Password: <your NGC API key>

-docker pull nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0
+# Replace <version> with the latest release tag from the NGC catalog
+docker pull nvcr.io/nvidia/tensorrt-llm/release:<version>
 ```

 ### Start the TensorRT LLM Container
@@ -41,7 +42,7 @@ docker run --rm --ipc=host -it \
 --gpus all \
 -p 8000:8000 \
 -v /path/to/models:/config/models:rw \
-nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 \
+nvcr.io/nvidia/tensorrt-llm/release:<version> \
 /bin/bash
 ```

@@ -89,7 +90,7 @@ speculative_config:
 speculative_model_dir: /config/models/eagle/
 cuda_graph_config:
 max_batch_size: 10
-use_torch_sampler: true
+sampler_type: TorchSampler
 moe_config:
 backend: TRTLLM
 EOF
@@ -98,7 +99,7 @@ EOF
 Notes:
 - Ensure your base model directory is `/config/models/gpt-oss-120b`.
 - Ensure your Eagle3 assets are present under `/config/models/eagle/`.
-- If you are running on Top of Tree, replace `use_torch_sampler: true` with `sampler_type: TorchSampler`.
+- On older releases (pre-1.1.0), replace `sampler_type: TorchSampler` with `use_torch_sampler: true`.

 ### Launch the Server (Eagle3 Speculative Decoding)

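The sampler option change in the hunks above (`use_torch_sampler: true` becoming `sampler_type: TorchSampler`) can be applied to an already-loaded options dictionary. This is a minimal sketch: `migrate_sampler_option` is a hypothetical helper, not a TensorRT-LLM API, and it rewrites only the `true` case shown in the diff because the commit does not say what a `false` value should map to.

```python
def migrate_sampler_option(config: dict) -> dict:
    """Return a copy of config with the old sampler key replaced by the new one."""
    migrated = dict(config)
    # Only the `true` case appears in the diff; other values are left untouched.
    if migrated.get("use_torch_sampler") is True:
        del migrated["use_torch_sampler"]
        # Keep an explicit sampler_type if the config already sets one.
        migrated.setdefault("sampler_type", "TorchSampler")
    return migrated


old_options = {
    "cuda_graph_config": {"max_batch_size": 10},
    "use_torch_sampler": True,
    "moe_config": {"backend": "TRTLLM"},
}
new_options = migrate_sampler_option(old_options)
```

The input dictionary is copied rather than modified, so the original options remain available for releases that still expect the old key.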