
Commit ce7162f

preminstrel authored and KnowingNothing committed

update docs

1 parent 50eefcf · commit ce7162f

4 files changed: 103 additions & 60 deletions


docs/e2e.md

Lines changed: 103 additions & 8 deletions
# End-to-End Integration for Triton-Distributed

This document provides an end-to-end (E2E) integration guide for Triton-Distributed. It showcases how to integrate Triton-Distributed's high-performance distributed kernels into a complete LLM, using Qwen3-32B as a reference example. The demo covers the tensor-parallel implementation and performance testing, from individual layers (Attention, MLP) to the entire model.

## Features

* **Two Strategies for Tensor Parallelism (TP)**:
  * Utilizes `AllGather-GEMM` and `GEMM-ReduceScatter` kernels. The input is sharded along the `batch` dimension, and communication is highly overlapped with computation.
  * Employs `GEMM + AllReduce`. The input is replicated across all devices.
* **Layer-wise Module Implementation**: Provides `TP_Attn` and `TP_MLP` modules that can easily replace the corresponding layers in existing models to enable distributed parallelism.
* **Full Model Integration**: Demonstrates how to seamlessly integrate the parallel modules into a dense model, using `Qwen3-32B` as an example. A complete inference `Engine` with CUDA Graph integration is also included.

-----
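The algebra behind these two strategies can be sketched without any GPUs. The following single-process NumPy simulation (illustrative shapes and names, not the Triton-Distributed API) shows that both paradigms reproduce a single-device MLP forward pass:

```python
# Minimal sketch: simulate the two TP strategies with NumPy on one process.
# world_size, M, K, N are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
world_size, M, K, N = 4, 8, 16, 32
x = rng.standard_normal((M, K))        # activations, sharded on M (batch)
W_up = rng.standard_normal((K, N))     # first GEMM, column-parallel
W_down = rng.standard_normal((N, K))   # second GEMM, row-parallel
relu = lambda t: np.maximum(t, 0.0)

# Reference: single-device MLP forward.
ref = relu(x @ W_up) @ W_down

# Strategy 1: AllGather-GEMM + GEMM-ReduceScatter.
x_shards = np.split(x, world_size, axis=0)           # each rank owns M/world rows
up_shards = np.split(W_up, world_size, axis=1)       # column shards
down_shards = np.split(W_down, world_size, axis=0)   # row shards
x_full = np.concatenate(x_shards, axis=0)            # AllGather along M
partials = [relu(x_full @ up_shards[r]) @ down_shards[r] for r in range(world_size)]
partial_sum = np.sum(partials, axis=0)               # reduce...
out_rs = np.split(partial_sum, world_size, axis=0)   # ...then scatter rows back
assert np.allclose(np.concatenate(out_rs, axis=0), ref)

# Strategy 2: GEMM + AllReduce (input replicated on every rank).
partials = [relu(x @ up_shards[r]) @ down_shards[r] for r in range(world_size)]
out_ar = np.sum(partials, axis=0)                    # AllReduce(sum)
assert np.allclose(out_ar, ref)
print("both TP strategies match the single-device reference")
```

Strategy 1 trades an AllGather before the first GEMM for a ReduceScatter after the second, which is what gives the kernels room to overlap communication with computation; Strategy 2 keeps the input replicated and pays a single AllReduce at the end.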
## Environment Setup

First, run the following script to install the necessary dependencies and configure your environment variables.

```bash
# Build the environment and install dependencies
bash ./scripts/build_e2e_env.sh
```
-----

## Running the Demos

We provide a set of test scripts for various use cases.

### 1. Layer-Level Benchmarks

These scripts are used to benchmark the performance of the `TP_Attn` and `TP_MLP` layers in isolation.

#### MLP Layer (`test_tp_mlp.py`)

**AG_GEMM + GEMM_RS Mode**:
This command benchmarks the performance of `ag_gemm` + `gemm_rs`. The input tensor `x`'s `M` dimension (`batch_size * seq_len`) is sharded across GPUs.
```bash
bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_tp_mlp.py --M 4096 --model Qwen/Qwen3-32B
```
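For a sense of scale, a back-of-envelope sketch of the M-dimension sharding in the command above (the 8-way tensor-parallel world size is an assumption for illustration):

```python
# Back-of-envelope sketch of the M-dimension sharding; 8 GPUs is assumed.
M = 4096                      # batch_size * seq_len, as passed via --M
world_size = 8
assert M % world_size == 0, "this scheme needs M to divide evenly across ranks"
rows_per_rank = M // world_size
print(f"each rank owns a ({rows_per_rank}, hidden_size) row-shard of x")
```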
**AllReduce Mode**:
Use the `--use_allreduce` flag to switch to the `GEMM + AllReduce` paradigm. In this mode, the input is replicated on all GPUs.

```bash
NVSHMEM_DISABLE_CUDA_VMM=0 bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_tp_mlp.py --M 128 --model Qwen/Qwen3-32B --use_allreduce --allreduce_method two_shot_multimem
```
#### Attention Layer (`test_tp_attn.py`)

The Attention layer benchmark is divided into `prefill` and `decode` modes.
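A small sketch of how the two modes differ in per-step workload, plugging in the `--bsz`/`--seq_len` values used by the benchmark commands in this section:

```python
# Sketch of the per-step workload in the two attention benchmark modes.
prefill_bsz, decode_bsz, seq_len = 32, 4096, 128

# prefill: all prompt tokens go through the layer at once -> M = bsz * seq_len
prefill_tokens = prefill_bsz * seq_len
# decode: one new token per sequence per step -> M = bsz
decode_tokens = decode_bsz * 1

# With these flag values both modes happen to push the same M = 4096 rows
# through the GEMMs, isolating the attention-pattern difference.
assert prefill_tokens == decode_tokens == 4096
```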
**AG_GEMM + GEMM_RS Mode**:

```bash
# prefill
bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_tp_attn.py --bsz 32 --seq_len 128 --model Qwen/Qwen3-32B --mode prefill

# decode
bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_tp_attn.py --bsz 4096 --seq_len 128 --model Qwen/Qwen3-32B --mode decode
```

**AllReduce Mode**:

```bash
# prefill
NVSHMEM_DISABLE_CUDA_VMM=0 bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_tp_attn.py --bsz 1 --seq_len 128 --model Qwen/Qwen3-32B --mode prefill --use_allreduce --allreduce_method two_shot_multimem

# decode
NVSHMEM_DISABLE_CUDA_VMM=0 bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_tp_attn.py --bsz 128 --seq_len 128 --model Qwen/Qwen3-32B --mode decode --use_allreduce --allreduce_method two_shot_multimem
```
### 2. Model-Level End-to-End Tests (`test_tp_e2e.py`)

This script tests a single forward pass of the complete Qwen3 model, which can be used for correctness validation or performance evaluation.

**Correctness Check (`--check`)**:
This mode compares the output of the Triton-Distributed implementation against the native PyTorch eager mode implementation to ensure numerical consistency.
```bash
# AG_GEMM + GEMM_RS Mode
bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_tp_e2e.py --bsz 8 --seq_len 256 --model Qwen/Qwen3-32B --check

# AllReduce Mode
NVSHMEM_DISABLE_CUDA_VMM=0 bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_tp_e2e.py --bsz 8 --seq_len 128 --model Qwen/Qwen3-32B --check --use_allreduce --allreduce_method two_shot_multimem
```
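A minimal sketch of what such a numerical-consistency check amounts to (illustrative only, not the actual `test_tp_e2e.py` logic; the tolerances are assumptions):

```python
# Sketch of a --check-style comparison: distributed output vs. an eager
# reference, under a floating-point tolerance rather than exact equality,
# because low-precision kernels accumulate in a different order.
import numpy as np

def check_close(dist_out, ref_out, atol=1e-2, rtol=1e-2):
    """Raise if outputs diverge beyond tolerance; tolerances are assumed."""
    if not np.allclose(dist_out, ref_out, atol=atol, rtol=rtol):
        max_err = np.abs(dist_out - ref_out).max()
        raise AssertionError(f"outputs diverge, max abs err = {max_err}")
    return True

rng = np.random.default_rng(0)
ref = rng.standard_normal((8, 128))
noisy = ref + 1e-4 * rng.standard_normal(ref.shape)  # stand-in for kernel output
assert check_close(noisy, ref)
```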
**Performance Benchmark (`--mode`)**:
This mode benchmarks the model's forward pass performance during the `prefill` and `decode` stages.
```bash
# AG_GEMM + GEMM_RS Mode
# Prefill
bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_tp_e2e.py --bsz 32 --seq_len 128 --model Qwen/Qwen3-32B --mode prefill

# Decode
bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_tp_e2e.py --bsz 4096 --seq_len 128 --model Qwen/Qwen3-32B --mode decode

# AllReduce Mode
# Prefill
NVSHMEM_DISABLE_CUDA_VMM=0 bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_tp_e2e.py --bsz 1 --seq_len 128 --model Qwen/Qwen3-32B --mode prefill --use_allreduce --allreduce_method two_shot_multimem

# Decode
NVSHMEM_DISABLE_CUDA_VMM=0 bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_tp_e2e.py --bsz 128 --seq_len 128 --model Qwen/Qwen3-32B --mode decode --use_allreduce --allreduce_method two_shot_multimem
```
### 3. Full Inference Pipeline (`test_e2e_inference.py`)

This script runs a complete generation task (including one prefill step and multiple decode steps) using the `Engine` class. It measures end-to-end throughput and latency.
```bash
# Baseline PyTorch Eager Mode
bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_e2e_inference.py --bsz 4096 --gen_len 128 --max_length 150

bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_e2e_inference.py --bsz 128 --gen_len 128 --max_length 150

# Triton-Distributed AG_GEMM + GEMM_RS Mode
bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_e2e_inference.py --bsz 4096 --gen_len 128 --max_length 150 --triton_dist

# Triton-Distributed AllReduce Mode
NVSHMEM_DISABLE_CUDA_VMM=0 bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_e2e_inference.py --bsz 128 --gen_len 128 --max_length 150 --triton_dist_AR
```
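The throughput/latency arithmetic such a run reports can be sketched as follows (the timings are made-up placeholders, and the real script's metrics may be defined differently):

```python
# Sketch of end-to-end generation metrics; timing values are hypothetical.
bsz, gen_len = 4096, 128
prefill_s, decode_s_per_step = 0.9, 0.05   # assumed measured times

decode_steps = gen_len - 1                 # first token comes from prefill
total_s = prefill_s + decode_steps * decode_s_per_step
generated_tokens = bsz * gen_len
throughput = generated_tokens / total_s    # tokens per second
per_step_latency_ms = decode_s_per_step * 1e3

print(f"throughput = {throughput:.0f} tok/s, "
      f"decode latency = {per_step_latency_ms:.1f} ms/step")
```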

scripts/build_e2e_env.sh

Lines changed: 0 additions & 4 deletions
```diff
@@ -25,10 +25,6 @@
 
 #!/bin/bash
 
-export http_proxy=http://sys-proxy-rd-relay.byted.org:3128
-export https_proxy=http://sys-proxy-rd-relay.byted.org:3128
-export no_proxy=code.byted.org
-
 # --- NVIDIA CUDA ---
 if command -v nvcc &> /dev/null; then
   echo "NVIDIA CUDA compiler (nvcc) found. Proceeding with CUDA-specific installations."
```

scripts/download_prebuild_ompi.sh

Lines changed: 0 additions & 37 deletions
This file was deleted.

sync-history.md

Lines changed: 0 additions & 11 deletions
```diff
@@ -10,16 +10,5 @@
 |2025-06-18| 481454316f7e4d7bc17f8419cf70c129ec6b85da| 2db9f9d2a5520d52b4ddb94f3baf1f7103022943 | 2db9f9d2a5520d52b4ddb94f3baf1f7103022943 | 708b5f14682da98a3623bbb6be46c21f2e3321a3|
 |2025-07-11| 1f43a82736cbad70446762bdf54b97561a40097a| b5a42c350a933f1312030ca83af62e2b93696ff2 | b5a42c350a933f1312030ca83af62e2b93696ff2 | 17e436b6bb4f76d6904d7ca3083600591e36e767|
 
-## Skipped commits in distributed-main
-### End-to-end related
-4e3b1c6f413698b322382ca20ee8d7a8c79eba62
-3c3e60e087213b94e84b519e2beafa1bb153d47f
-03091b1ce538e35355d4d52d010ad02d5e34ecba
-
 ### CI related
 6a9ded70d154ffb71a2578201008a3db0fca1684
-
-## Partially deleted (except CI-related)
-### End-to-end related
-8811f37781fde2cd8807366d408975244f65af5b
-e46ee8a803ed9361543039633914e43b6644727f
```
