# End-to-End Integration for Triton-Distributed

This document provides an end-to-end (E2E) integration for Triton-Distributed. It showcases how to integrate Triton-Distributed's high-performance distributed kernels into a complete LLM, using Qwen3-32B as a reference example. The demo covers the tensor-parallel implementation and performance testing from individual layers (Attention, MLP) up to the entire model.

## Features

  * **Two Strategies for Tensor Parallelism (TP)**:
      * `AllGather-GEMM` and `GEMM-ReduceScatter` kernels: the input is sharded along the `batch` dimension, and communication is highly overlapped with computation.
      * `GEMM + AllReduce`: the input is replicated across all devices.
  * **Layer-wise Module Implementation**: Provides `TP_Attn` and `TP_MLP` modules that can replace the corresponding layers in existing models to enable distributed parallelism (see the sketch after this list).
  * **Full Model Integration**: Demonstrates how to integrate the parallel modules into a dense model, using `Qwen3-32B` as an example. A complete inference `Engine` with CUDA Graph integration is also included.

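To make the two paradigms concrete, the sketch below reproduces them with plain PyTorch collectives on a simplified MLP (a single up-projection with ReLU stands in for the real gate/up/down projections). It is a conceptual reference under those assumptions, not the Triton-Distributed implementation: the library's `ag_gemm` / `gemm_rs` kernels fuse the communication into the GEMMs, and the actual `TP_Attn` / `TP_MLP` modules live under `python/triton_dist/`.

```python
# Conceptual reference for the two TP paradigms using plain PyTorch collectives.
# NOT the Triton-Distributed implementation: the real kernels fuse communication
# into the GEMMs (ag_gemm / gemm_rs); the actual modules live in python/triton_dist/.
import torch
import torch.distributed as dist

def tp_mlp_ag_rs(x_shard, w_up_shard, w_down_shard, group=None):
    """AllGather-GEMM + GEMM-ReduceScatter: x is sharded along M = batch*seq;
    the weights are sharded along the intermediate dimension."""
    world = dist.get_world_size(group)
    m, hidden = x_shard.shape
    # Gather the M-sharded activations so every rank sees the full input.
    x_full = torch.empty(world * m, hidden, dtype=x_shard.dtype, device=x_shard.device)
    dist.all_gather_into_tensor(x_full, x_shard, group=group)
    # Each rank computes a partial output from its weight shards.
    partial = torch.relu(x_full @ w_up_shard) @ w_down_shard
    # ReduceScatter sums the partials and hands back an M-sharded result.
    out_shard = torch.empty_like(x_shard)
    dist.reduce_scatter_tensor(out_shard, partial, group=group)
    return out_shard

def tp_mlp_allreduce(x_full, w_up_shard, w_down_shard, group=None):
    """GEMM + AllReduce: x is replicated; partial outputs are summed in place."""
    out = torch.relu(x_full @ w_up_shard) @ w_down_shard
    dist.all_reduce(out, group=group)
    return out
```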
-----

## Environment Setup
First, run the following script to install the necessary dependencies and set up your environment.

```bash
# Build the environment and install dependencies
bash ./scripts/build_e2e_env.sh
```
-----

## Running the Demos

We provide a set of test scripts for various use cases.

### 1. Layer-Level Benchmarks

These scripts benchmark the `TP_Attn` and `TP_MLP` layers in isolation.

#### MLP Layer (`test_tp_mlp.py`)

**AG_GEMM + GEMM_RS Mode**:
This command benchmarks the `ag_gemm` + `gemm_rs` path. The `M` dimension of the input tensor `x` (`batch_size * seq_len`) is sharded across GPUs.

```bash
bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_tp_mlp.py --M 4096 --model Qwen/Qwen3-32B
```
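For intuition on the sharded layout in the command above: with `--M 4096` and, for example, an 8-GPU launch (the actual world size depends on how `launch.sh` starts the job), each rank works on a 512-row shard of `x`:

```python
# Shard-size arithmetic for the command above.
# world_size = 8 is an assumption; it is determined by your launch configuration.
M = 4096                       # --M: batch_size * seq_len, the row count of x
world_size = 8
rows_per_rank = M // world_size
print(f"each rank holds a {rows_per_rank}-row shard of x")  # -> 512
```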
**AllReduce Mode**:
Use the `--use_allreduce` flag to switch to the `GEMM + AllReduce` paradigm. In this mode, the input is replicated on all GPUs.

```bash
NVSHMEM_DISABLE_CUDA_VMM=0 bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_tp_mlp.py --M 128 --model Qwen/Qwen3-32B --use_allreduce --allreduce_method two_shot_multimem
```
#### Attention Layer (`test_tp_attn.py`)

The Attention layer benchmark is divided into `prefill` (processing the full input sequence) and `decode` (generating one token per step) modes.

**AG_GEMM + GEMM_RS Mode**:

```bash
# prefill
bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_tp_attn.py --bsz 32 --seq_len 128 --model Qwen/Qwen3-32B --mode prefill

# decode
bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_tp_attn.py --bsz 4096 --seq_len 128 --model Qwen/Qwen3-32B --mode decode
```

**AllReduce Mode**:

```bash
# prefill
NVSHMEM_DISABLE_CUDA_VMM=0 bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_tp_attn.py --bsz 1 --seq_len 128 --model Qwen/Qwen3-32B --mode prefill --use_allreduce --allreduce_method two_shot_multimem

# decode
NVSHMEM_DISABLE_CUDA_VMM=0 bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_tp_attn.py --bsz 128 --seq_len 128 --model Qwen/Qwen3-32B --mode decode --use_allreduce --allreduce_method two_shot_multimem
```
### 2. Model-Level End-to-End Tests (`test_tp_e2e.py`)

This script tests a single forward pass of the complete Qwen3 model and can be used for correctness validation or performance evaluation.

**Correctness Check (`--check`)**:
This mode compares the output of the Triton-Distributed implementation against the native PyTorch eager implementation to ensure numerical consistency.

```bash
# AG_GEMM + GEMM_RS Mode
bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_tp_e2e.py --bsz 8 --seq_len 256 --model Qwen/Qwen3-32B --check

# AllReduce Mode
NVSHMEM_DISABLE_CUDA_VMM=0 bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_tp_e2e.py --bsz 8 --seq_len 128 --model Qwen/Qwen3-32B --check --use_allreduce --allreduce_method two_shot_multimem
```
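The comparison itself amounts to an elementwise closeness check between the two outputs; a minimal sketch of that idea is below (the tolerances are illustrative assumptions, not the values used by `test_tp_e2e.py`):

```python
# Minimal sketch of an eager-vs-distributed output comparison.
# rtol/atol are illustrative; the real check and tolerances live in test_tp_e2e.py.
import torch

def check_outputs(dist_out: torch.Tensor, ref_out: torch.Tensor) -> None:
    # Low-precision kernels accumulate differently, so exact equality is not expected.
    torch.testing.assert_close(dist_out.float(), ref_out.float(), rtol=1e-2, atol=1e-2)
    print("outputs match within tolerance")
```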
**Performance Benchmark (`--mode`)**:
This mode benchmarks the model's forward-pass performance during the `prefill` and `decode` stages.

```bash
# AG_GEMM + GEMM_RS Mode
# Prefill
bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_tp_e2e.py --bsz 32 --seq_len 128 --model Qwen/Qwen3-32B --mode prefill

# Decode
bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_tp_e2e.py --bsz 4096 --seq_len 128 --model Qwen/Qwen3-32B --mode decode

# AllReduce Mode
# Prefill
NVSHMEM_DISABLE_CUDA_VMM=0 bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_tp_e2e.py --bsz 1 --seq_len 128 --model Qwen/Qwen3-32B --mode prefill --use_allreduce --allreduce_method two_shot_multimem

# Decode
NVSHMEM_DISABLE_CUDA_VMM=0 bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_tp_e2e.py --bsz 128 --seq_len 128 --model Qwen/Qwen3-32B --mode decode --use_allreduce --allreduce_method two_shot_multimem
```
### 3. Full Inference Pipeline (`test_e2e_inference.py`)

This script runs a complete generation task (one prefill step followed by multiple decode steps) using the `Engine` class, and measures end-to-end throughput and latency.

```bash
# Baseline PyTorch Eager Mode
bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_e2e_inference.py --bsz 4096 --gen_len 128 --max_length 150

bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_e2e_inference.py --bsz 128 --gen_len 128 --max_length 150

# Triton-Distributed AG_GEMM + GEMM_RS Mode
bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_e2e_inference.py --bsz 4096 --gen_len 128 --max_length 150 --triton_dist

# Triton-Distributed AllReduce Mode
NVSHMEM_DISABLE_CUDA_VMM=0 bash ./scripts/launch.sh ./python/triton_dist/test/nvidia/test_e2e_inference.py --bsz 128 --gen_len 128 --max_length 150 --triton_dist_AR
```
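For interpreting the reported numbers, decode throughput is usually the total number of generated tokens divided by the decode wall-clock time; a back-of-the-envelope sketch is below (the exact metrics printed by `test_e2e_inference.py` may be defined differently):

```python
# Back-of-the-envelope throughput/latency arithmetic for a generation run.
# elapsed_s is a hypothetical wall-clock time; the script's own metrics may differ.
bsz, gen_len = 4096, 128                    # values from the command above
elapsed_s = 30.0                            # assumed decode-phase wall-clock time
throughput = bsz * gen_len / elapsed_s      # generated tokens per second
latency_ms = 1000.0 * elapsed_s / gen_len   # average milliseconds per decode step
print(f"{throughput:.0f} tok/s, {latency_ms:.2f} ms/step")
```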