## Installation

This benchmark is meant to be a lightweight layer on top of an existing vLLM/SGLang/TRTLLM installation. For example, no install
is required when running in one of the following Docker images: `vllm/vllm-openai:v0.19.0` (vLLM), `lmsysorg/sglang:v0.5.10.post1` (SGLang), or
`nvcr.io/nvidia/tensorrt-llm/release:1.3.0.rc10` (TRT-LLM).
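
For instance, an interactive shell inside the vLLM image can be opened with something like the sketch below; the `--entrypoint bash` override and the mount paths are assumptions to adapt to your checkout and GPU setup:

```bash
# Sketch: open a shell in the vLLM image with GPUs and this repo mounted.
# The entrypoint override and mount paths are assumptions for illustration.
docker run --rm -it --gpus all \
  --entrypoint bash \
  -v "$PWD":/workspace -w /workspace \
  vllm/vllm-openai:v0.19.0
```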

Next

```bash
python3 run.py \
  ... \
  --runtime_params runtime_args_long_context.yaml
```

## Running Sweeps

A sweep runs multiple dataset/concurrency combinations in a single invocation, which is useful for
throughput curves, ablations over concurrency levels, or multi-dataset evaluations.

### Sweep config format

Create a YAML file with a `runs` key (or a flat list; see the sketch after the example below). Each entry supports:

| Key | Type | Description |
|-----|------|-------------|
| `dataset` | string (required) | Dataset name (same choices as `--dataset`) |
| `dataset_path` | string | Path to dataset (can also be supplied via CLI) |
| `random_isl` | int | Input sequence length for the `random` dataset |
| `concurrency` | int or list | Concurrency level(s) to sweep over |
| `num_requests` | int or list | Requests per concurrency level (list must match length of `concurrency`) |
| `output_length` | int or list | Output token limit per concurrency level |
| `temperature` | float or list | Sampling temperature per concurrency level |
| `category` | string | Category filter (for datasets that support it) |

`concurrency` accepts a single integer or a list. When it is a list, `num_requests`,
`output_length`, and `temperature` can each be a matching-length list to set per-level
values, or a single scalar to apply the same value to all levels.

**Example `sweep_example.yaml`:**

```yaml
runs:
  - dataset: speed
    dataset_path: /data/speed/qualitative
    concurrency: [32]
    num_requests: 880
    output_length: 4096
  - dataset: speed
    dataset_path: /data/speed/throughput_1k
    concurrency: [1, 2, 4, 8, 16, 32, 64]
    num_requests: [8, 16, 32, 32, 64, 128, 256]
    output_length: 2048
```
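
The flat-list form mentioned above drops the top-level `runs:` key. A minimal sketch, assuming the entries carry the same fields as `runs` entries:

```yaml
# Hypothetical flat-list form: a top-level sequence with no `runs:` key,
# assuming entries use the same fields as `runs` entries.
- dataset: speed
  dataset_path: /data/speed/qualitative
  concurrency: [32]
  num_requests: 880
```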

### Running a sweep

Pass `--sweep_config` in place of the usual dataset flags:

```bash
python3 run.py \
  --model_dir meta-llama/Llama-3.3-70B-Instruct \
  --tokenizer meta-llama/Llama-3.3-70B-Instruct \
  --draft_model_dir yuhuili/EAGLE3-LLaMA3.3-Instruct-70B \
  --tp_size 8 \
  --ep_size 1 \
  --draft_length 3 \
  --engine TRTLLM \
  --sweep_config sweep_example.yaml \
  --sweep_output_root ./my_sweep_results
```

### Output structure

Each run is saved to its own subdirectory under the sweep output root:

```
my_sweep_results/
  000_speed_c32/   # first entry, concurrency=32
  001_speed_c1/    # second entry, concurrency=1
  001_speed_c2/    # second entry, concurrency=2
  ...
```

If `--sweep_output_root` is not set, outputs go to `./sweep_outputs/<timestamp>/`.

## Notes

The goal of this benchmark is to provide an easy way to configure, run, and compare speculative decoding implementations across frameworks in an apples-to-apples manner.
This benchmark sends requests in a single-threaded fashion, so running at large concurrency (>256) may introduce Python async scheduling delays that skew metrics.
If larger concurrency is needed, it is recommended to fully deploy the model using `vllm serve`, `python -m sglang.launch_server`, or `trtllm-serve` (for vLLM, SGLang, or TRTLLM respectively) and
use a more robust benchmarking client like NVIDIA AI Perf.
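
For example, the model from the sweep above could be stood up with vLLM roughly as follows. This is a sketch using standard `vllm serve` flags; the tensor-parallel size mirrors the `--tp_size 8` example and is an assumption to adjust for your hardware:

```bash
# Sketch: serve the model behind an OpenAI-compatible endpoint, then point an
# external benchmarking client at it. TP size mirrors the --tp_size 8 example.
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 8 \
  --port 8000
```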