[None][doc] fix outdated code references in tech blogs 2, 3, 4, 8, 9, 11 (#12338)

schetlur-nv · claude · web-flow · commit 90c1cb75e511 · 2026-03-23T14:19:05.000-07:00
Signed-off-by: schetlur <schetlur@nvidia.com>
Signed-off-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
diff --git a/docs/source/blogs/tech_blog/blog02_DeepSeek_R1_MTP_Implementation_and_Optimization.md b/docs/source/blogs/tech_blog/blog02_DeepSeek_R1_MTP_Implementation_and_Optimization.md
@@ -114,7 +114,7 @@ Run DeepSeek-V3/R1 models with MTP, use [examples/llm-api/quickstart_advanced.py
 
 ```bash
 cd examples/llm-api
-python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --spec_decode_algo MTP --spec_decode_nextn N
+python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --spec_decode_algo MTP --spec_decode_max_draft_len N
 ```
 
 To benchmark min-latency performance with MTP, you need to follow [this document](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/models/core/deepseek_v3/README.md#6-dataset-preparation) to prepare your dataset, then follow the steps below:
@@ -170,7 +170,7 @@ Run DeepSeek-R1 models with MTP Relaxed Acceptance, use [examples/llm-api/quicks
 
 ```bash
 cd examples/llm-api
-python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --spec_decode_algo MTP --spec_decode_nextn N --use_relaxed_acceptance_for_thinking --relaxed_topk 10 --relaxed_delta 0.6
+python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --spec_decode_algo MTP --spec_decode_max_draft_len N --use_relaxed_acceptance_for_thinking --relaxed_topk 10 --relaxed_delta 0.6
 ```
 
 To benchmark min-latency performance with MTP Relaxed Acceptance, you need to follow [this document](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/models/core/deepseek_v3/README.md#6-dataset-preparation) to prepare your dataset, then follow the steps below:
diff --git a/docs/source/blogs/tech_blog/blog03_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md b/docs/source/blogs/tech_blog/blog03_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md
@@ -157,7 +157,7 @@ These optimizations target the overall execution flow, scheduling, and resource
 
     There is a feature called CUDA Graph padding in TensorRT LLM, which is a good trade-off between the number of CUDA Graphs and the CUDA Graph hit ratio; it tries to pad a batch to the nearest one with a captured CUDA Graph. Normally you should enable the CUDA Graph padding feature to increase the CUDA Graph hit rate, but the padding itself has some overhead due to wasted tokens computation.
 
-    Users can opt-out the CUDA Graph padding feature to see the perf benefits, by setting the `cuda_graph_config:\n  enable_padding: False`, see API here [Pytorch backend config](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/pyexecutor/config.py#L41)
+    Users can opt-out the CUDA Graph padding feature to see the perf benefits, by setting the `cuda_graph_config:\n  enable_padding: False`, see API here [CudaGraphConfig](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/llmapi/llm_args.py#L102)
 
 * Overlap Scheduler:
 
diff --git a/docs/source/blogs/tech_blog/blog04_Scaling_Expert_Parallelism_in_TensorRT-LLM.md b/docs/source/blogs/tech_blog/blog04_Scaling_Expert_Parallelism_in_TensorRT-LLM.md
@@ -396,7 +396,7 @@ On servers without C2C but with PCIe, if cross-node communication is required, n
 
 ### Offline EP Load Balancer
 
-Online EP balancer is more suitable for production deployment needs to react timely to online traffic changes. However, Offline EP Balancer provides a lightweight way for performance study/debugging and validation. You can refer to [this PR](https://github.com/NVIDIA/TensorRT-LLM/pull/4695) to learn more about the implementation of the Offline EP Load Balancer. Also there is a tool provided to collect statistics about the expert activation distribution which can be used as the input to deduce the EP balancing placement strategy. You can refer to [this](https://github.com/NVIDIA/TensorRT-LLM/tree/feat/large-ep/examples/ep_load_balancer#offline-ep-load-balancer) doc to learn more details as well as how how to run through the Offline EP Load Balancer in E2E approach.
+Online EP balancer is more suitable for production deployment needs to react timely to online traffic changes. However, Offline EP Balancer provides a lightweight way for performance study/debugging and validation. You can refer to [this PR](https://github.com/NVIDIA/TensorRT-LLM/pull/4695) to learn more about the implementation of the Offline EP Load Balancer. Also there is a tool provided to collect statistics about the expert activation distribution which can be used as the input to deduce the EP balancing placement strategy. You can refer to [this](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep/ep_load_balancer#offline-ep-load-balancer) doc to learn more details as well as how to run through the Offline EP Load Balancer in E2E approach.
 
 ## E2E evaluation
 
@@ -519,7 +519,7 @@ The code and scripts required in the reproducing steps described in this section
 
 ### The effect of EP Load Balancer
 
-Please, refer to the [EP Load Balancer example](https://github.com/NVIDIA/TensorRT-LLM/tree/feat/large-ep/examples/ep_load_balancer) for how to reproduce the results for the offline EP Load Balancer.
+Please, refer to the [EP Load Balancer example](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep/ep_load_balancer) for how to reproduce the results for the offline EP Load Balancer.
 
 ##### Step 1: Run inference and collect statistics
 
@@ -536,7 +536,7 @@ export EXPERT_STATISTIC_PATH=./expert_statistic
 export EXPERT_STATISTIC_ITER_RANGE=100-200
 ```
 
-Prepare a dataset following the [benchmarking documentation](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/developer-guide/perf-benchmarking.md#preparing-a-dataset) and save it as `./dataset.json`.
+Prepare a dataset following the [benchmarking documentation](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-benchmarking.md#preparing-a-dataset) and save it as `./dataset.json`.
 
 Run 32-way expert parallelism inference on the prepared dataset. Please refer to the [LLM API MGMN example](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llm-api/llm_mgmn_trtllm_bench.sh) for details on running `trtllm-bench` on Slurm.
 
@@ -559,10 +559,10 @@ trtllm-bench --model ${MODEL_NAME} \
     --eos_id -1
 ```
 
-After inference, review the dumped statistic files in `$EXPERT_STATISTIC_PATH`. Run the `examples/ep_load_balancer/report_load_statistics.py` script to show the standard deviation and imbalance ratio metrics:
+After inference, review the dumped statistic files in `$EXPERT_STATISTIC_PATH`. Run the `examples/wide_ep/ep_load_balancer/report_load_statistics.py` script to show the standard deviation and imbalance ratio metrics:
 
 ```bash
-python examples/ep_load_balancer/report_load_statistics.py --expert_statistic_path $EXPERT_STATISTIC_PATH
+python examples/wide_ep/ep_load_balancer/report_load_statistics.py --expert_statistic_path $EXPERT_STATISTIC_PATH
 ```
 
 The output would look like:
@@ -582,10 +582,10 @@ average  1024.0  491.651199         1.564272
 
 ##### Step 2: Generate the EPLB configuration
 
-Use the provided `examples/ep_load_balancer/generate_eplb_config.py` script to convert the collected statistics into an EPLB configuration file. Specify the target expert parallelism size (`--ep_size`) and the total number of slots (`--num_slots`) that will be used for deployment. For example, if we choose to maintain 8 expert slots per rank while increasing expert parallelism to 36 ways, there should be 32 redundant experts and 288 expert slots in total.
+Use the provided `examples/wide_ep/ep_load_balancer/generate_eplb_config.py` script to convert the collected statistics into an EPLB configuration file. Specify the target expert parallelism size (`--ep_size`) and the total number of slots (`--num_slots`) that will be used for deployment. For example, if we choose to maintain 8 expert slots per rank while increasing expert parallelism to 36 ways, there should be 32 redundant experts and 288 expert slots in total.
 
 ```bash
-python examples/ep_load_balancer/generate_eplb_config.py \
+python examples/wide_ep/ep_load_balancer/generate_eplb_config.py \
     --ep_size 36 \
     --num_slots 288 \
     --expert_statistic_path $EXPERT_STATISTIC_PATH \
@@ -641,10 +641,10 @@ trtllm-bench --model ${MODEL_NAME} \
     --eos_id -1
 ```
 
-Run the `examples/ep_load_balancer/report_load_statistics.py` script again:
+Run the `examples/wide_ep/ep_load_balancer/report_load_statistics.py` script again:
 
 ```bash
-python examples/ep_load_balancer/report_load_statistics.py --expert_statistic_path $EXPERT_STATISTIC_PATH
+python examples/wide_ep/ep_load_balancer/report_load_statistics.py --expert_statistic_path $EXPERT_STATISTIC_PATH
 ```
 
 The output would look like:
diff --git a/docs/source/blogs/tech_blog/blog08_Scaling_Expert_Parallelism_in_TensorRT-LLM_part2.md b/docs/source/blogs/tech_blog/blog08_Scaling_Expert_Parallelism_in_TensorRT-LLM_part2.md
@@ -249,7 +249,7 @@ To address the increased host overhead when scaling parallelism in the system, w
 
 TensorRT LLM is designed to be composed of both C++ and Python code, so that C++ can handle the most performance-sensitive parts while Python handles higher-level logic. As we try to put more logic into Python to make the program easier to read and debug, there are still frequent conversations through binding interfaces between C++ and Python. Besides, since most of the logic is implemented in Python, there are several layers of implementation that communicate with each other through inter-process communication overhead. Frequent binding calls and serialization/deserialization introduced by inter-process communication slow down the core library.
 
-To improve program efficiency, we used environment variables introduced in the [performance analysis guidance](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/developer-guide/perf-analysis.md) to measure and profile CPU overhead, and improved performance by reducing and reusing different binding calls as much as possible, and delaying Python object deserialization to avoid duplicated serialization and reduce message size when doing inter-process communication. This optimization was added in [PR 5224](https://github.com/NVIDIA/TensorRT-LLM/pull/5224). We have also reduced Python garbage collection (GC) impacts in [PR 5141](https://github.com/NVIDIA/TensorRT-LLM/pull/5141).
+To improve program efficiency, we used environment variables introduced in the [performance analysis guidance](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-analysis.md) to measure and profile CPU overhead, and improved performance by reducing and reusing different binding calls as much as possible, and delaying Python object deserialization to avoid duplicated serialization and reduce message size when doing inter-process communication. This optimization was added in [PR 5224](https://github.com/NVIDIA/TensorRT-LLM/pull/5224). We have also reduced Python garbage collection (GC) impacts in [PR 5141](https://github.com/NVIDIA/TensorRT-LLM/pull/5141).
 
 To enable powerful NVTX markers for easier analysis of host overheads, TensorRT LLM provides several useful environment variables:
 
@@ -307,7 +307,7 @@ When enabling MTP, there is an extra performance boost compared to the baseline.
 </div>
 <p align="center"><sub><em>Figure 8: DeepSeek R1 throughput on ISL/OSL 8k/1k with MTP enabled.</em></sub></p>
 
-To reproduce the numbers, refer to the [`examples/wide_ep/slurm_scripts`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep/slurm_scripts) directory. The scripts there demonstrate how to launch TensorRT LLM disaggregated serving with large-scale EP and other features enabled on a SLURM cluster.
+To reproduce the numbers, refer to the [`examples/wide_ep`](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/wide_ep) directory for wide EP examples. Note: the `slurm_scripts` subdirectory is not yet available in the public repository; SLURM-based launch scripts for disaggregated serving with large-scale EP will be added in a future update.
 
 ## Future Work
 
diff --git a/docs/source/blogs/tech_blog/blog09_Deploying_GPT_OSS_on_TRTLLM.md b/docs/source/blogs/tech_blog/blog09_Deploying_GPT_OSS_on_TRTLLM.md
@@ -19,7 +19,7 @@ We have a forthcoming guide for getting great performance on H100, however this
 
 ## Launching the TensorRT LLM docker container
 
-The container image that you will use will be pulled from NVIDIA's NGC. This container is multi-platform and will run on both x64 and arm64 architectures: `nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev`
+The container image that you will use will be pulled from NVIDIA's NGC. This container is multi-platform and will run on both x64 and arm64 architectures: `nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc1`
 
 Run the follow docker command to start the TensorRT LLM container in interactive mode:
 
@@ -171,7 +171,7 @@ Currently, the best throughput **19.5k tps/gpu** is achieved with DP4EP4 using 4
 
 ## Launch the TensorRT-LLM Server
 
-We can use `trtllm-serve` to serve the model by translating the benchmark commands above. For low-latency configuration, run:
+We can use `trtllm-serve` to serve the model by translating the benchmark commands above. For low-latency configuration, run:  
 **Note:** You can also point to a local path containing the model weights instead of the HF repo (e.g., `${local_model_path}`).
 
 ```bash
@@ -357,7 +357,7 @@ moe_config:
 
 ## Troubleshooting Tips
 
-- If you encounter CUDA out-of-memory errors, try reducing `--max_batch_size`, `--max_num_tokens`, or `--kv_cache_free_gpu_memory_fraction`. See the [doc](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/performance/performance-tuning-guide/tuning-max-batch-size-and-max-num-tokens.md) for the explanation of these parameters.
+- If you encounter CUDA out-of-memory errors, try reducing `--max_batch_size`, `--max_num_tokens`, or `--kv_cache_free_gpu_memory_fraction`. See the [doc](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/performance-tuning-guide/tuning-max-batch-size-and-max-num-tokens.md) for the explanation of these parameters.
 - Add `print_iter_log: true` to extra LLM API options YAML file to inspect the per-iteration log.
 - Check GPU utilization with `nvidia-smi` while the server is running to inspect GPU status and memory usage.
 - If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed
diff --git a/docs/source/blogs/tech_blog/blog11_GPT_OSS_Eagle3.md b/docs/source/blogs/tech_blog/blog11_GPT_OSS_Eagle3.md
@@ -17,17 +17,18 @@ Expected directory layout on the host (example):
   └─ eagle/         # Eagle3 speculative decoding assets
 ```
 
-### Get the TensorRT LLM Container (1.1.0rc0)
+### Get the TensorRT LLM Container
 
-If required by your environment, log into NGC and pull the image:
+If required by your environment, log into NGC and pull the image. Check the [NGC catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release) for the latest available release tag:
 
 ```bash
 # Create an API key at https://ngc.nvidia.com (if you don't have one)
 docker login nvcr.io
 # Username: $oauthtoken
 # Password: <your NGC API key>
 
-docker pull nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0
+# Replace <version> with the latest release tag from the NGC catalog
+docker pull nvcr.io/nvidia/tensorrt-llm/release:<version>
 ```
 
 ### Start the TensorRT LLM Container
@@ -41,7 +42,7 @@ docker run --rm --ipc=host -it \
   --gpus all \
   -p 8000:8000 \
   -v /path/to/models:/config/models:rw \
-  nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 \
+  nvcr.io/nvidia/tensorrt-llm/release:<version> \
   /bin/bash
 ```
 
@@ -89,7 +90,7 @@ speculative_config:
   speculative_model_dir: /config/models/eagle/
 cuda_graph_config:
   max_batch_size: 10
-use_torch_sampler: true
+sampler_type: TorchSampler
 moe_config:
   backend: TRTLLM
 EOF
@@ -98,7 +99,7 @@ EOF
 Notes:
 - Ensure your base model directory is `/config/models/gpt-oss-120b`.
 - Ensure your Eagle3 assets are present under `/config/models/eagle/`.
-- If you are running on Top of Tree, replace `use_torch_sampler: true` with `sampler_type: TorchSampler`.
+- On older releases (pre-1.1.0), replace `sampler_type: TorchSampler` with `use_torch_sampler: true`.
 
 ### Launch the Server (Eagle3 Speculative Decoding)