## Installation

This benchmark is meant to be a lightweight layer on top of an existing vLLM/SGLang/TRTLLM installation. For example, no install
is required when running in one of the following Docker images: `vllm/vllm-openai:v0.19.0` (vLLM), `lmsysorg/sglang:v0.5.10.post1` (SGLang), or
`nvcr.io/nvidia/tensorrt-llm/release:1.3.0.rc10` (TRT-LLM).
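
For instance, an interactive shell inside the vLLM image can be opened with something like the sketch below; the `--entrypoint bash` override and the mount paths are assumptions to adapt to your checkout and GPU setup:

```bash
# Sketch: open a shell in the vLLM image with GPUs and this repo mounted.
# The entrypoint override and mount paths are assumptions for illustration.
docker run --rm -it --gpus all \
  --entrypoint bash \
  -v "$PWD":/workspace -w /workspace \
  vllm/vllm-openai:v0.19.0
```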

Next

```bash
python3 run.py \
  ... \
  --runtime_params runtime_args_long_context.yaml
```

## Running Sweeps

A sweep runs multiple dataset/concurrency combinations in a single invocation, which is useful for
throughput curves, ablations over concurrency levels, or multi-dataset evaluations.

### Sweep config format

Create a YAML file with a `runs` key (or a flat list; see the sketch after the example below). Each entry supports:

| Key | Type | Description |
|-----|------|-------------|
| `dataset` | string (required) | Dataset name (same choices as `--dataset`) |
| `dataset_path` | string | Path to dataset (can also be supplied via CLI) |
| `random_isl` | int | Input sequence length for the `random` dataset |
| `concurrency` | int or list | Concurrency level(s) to sweep over |
| `num_requests` | int or list | Requests per concurrency level (list must match length of `concurrency`) |
| `output_length` | int or list | Output token limit per concurrency level |
| `temperature` | float or list | Sampling temperature per concurrency level |
| `category` | string | Category filter (for datasets that support it) |

`concurrency` accepts a single integer or a list. When it is a list, `num_requests`,
`output_length`, and `temperature` can each be a matching-length list to set per-level
values, or a single scalar to apply the same value to all levels.

**Example `sweep_example.yaml`:**

```yaml
runs:
  - dataset: speed
    dataset_path: /data/speed/qualitative
    concurrency: [32]
    num_requests: 880
    output_length: 4096
  - dataset: speed
    dataset_path: /data/speed/throughput_1k
    concurrency: [1, 2, 4, 8, 16, 32, 64]
    num_requests: [8, 16, 32, 32, 64, 128, 256]
    output_length: 2048
```
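
The flat-list form mentioned above drops the top-level `runs:` key. A minimal sketch, assuming the entries carry the same fields as `runs` entries:

```yaml
# Hypothetical flat-list form: a top-level sequence with no `runs:` key,
# assuming entries use the same fields as `runs` entries.
- dataset: speed
  dataset_path: /data/speed/qualitative
  concurrency: [32]
  num_requests: 880
```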

### Running a sweep

Pass `--sweep_config` in place of the usual dataset flags:

```bash
python3 run.py \
  --model_dir meta-llama/Llama-3.3-70B-Instruct \
  --tokenizer meta-llama/Llama-3.3-70B-Instruct \
  --draft_model_dir yuhuili/EAGLE3-LLaMA3.3-Instruct-70B \
  --tp_size 8 \
  --ep_size 1 \
  --draft_length 3 \
  --engine TRTLLM \
  --sweep_config sweep_example.yaml \
  --sweep_output_root ./my_sweep_results
```

### Output structure

Each run is saved to its own subdirectory under the sweep output root:

```
my_sweep_results/
  000_speed_c32/   # first entry, concurrency=32
  001_speed_c1/    # second entry, concurrency=1
  001_speed_c2/    # second entry, concurrency=2
  ...
```

If `--sweep_output_root` is not set, outputs go to `./sweep_outputs/<timestamp>/`.

## Notes

The goal of this benchmark is to provide an easy way to configure, run, and compare speculative decoding implementations across frameworks in an apples-to-apples manner.
This benchmark sends requests in a single-threaded fashion, so running at large concurrency (>256) may introduce Python async scheduling delays that skew metrics.
If larger concurrency is needed, it is recommended to fully deploy the model using `vllm serve`, `python -m sglang.launch_server`, or `trtllm-serve` (for vLLM, SGLang, or TRTLLM respectively) and
use a more robust benchmarking client like NVIDIA AI Perf.
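
For example, the model from the sweep above could be stood up with vLLM roughly as follows. This is a sketch using standard `vllm serve` flags; the tensor-parallel size mirrors the `--tp_size 8` example and is an assumption to adjust for your hardware:

```bash
# Sketch: serve the model behind an OpenAI-compatible endpoint, then point an
# external benchmarking client at it. TP size mirrors the --tp_size 8 example.
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 8 \
  --port 8000
```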