Commit 7a43ece

[FEAT] Add replay from trace strategy (#620)
### Summary

- Add a new `replay` benchmarking strategy that reproduces real-world request patterns from trace log files (.jsonl)
- Enable time-based request rate replay with precise timestamp scheduling
- Support synthetic prompt generation that matches token counts from trace files
- Use the `max_requests` and `max_seconds` CLI options to limit the number of requests processed from a trace

### Motivation

This change addresses issue #597 by enabling users to benchmark their vLLM servers using real production traces. Instead of synthetic load patterns, users can now replay exact request arrival times and token distributions from their actual workloads for more realistic performance testing.

### Changes

- Add `TraceReplayStrategy` scheduler strategy for timestamp-based request dispatching
- Add `ReplayProfile` class for configuring trace-based benchmarking parameters
- Add `TraceSyntheticDatasetDeserializer` to generate prompts matching trace input/output lengths
- Add `TraceReader` utility for reading .jsonl trace files with `timestamp`, `input_length`, and `output_length` fields
- Update `Entrypoint` to handle replay profile and dataset configuration
- Use `max_requests` and `max_seconds` truncation support to limit trace replay length

### Testing

- `pytest tests/unit/scheduler/test_trace_replay.py` (pass)
- `pytest tests/unit/benchmark/test_replay_profile.py` (pass)
- `pytest tests/unit/data/deserializers/test_trace_synthetic.py` (pass)
- Added tests: scheduling accuracy, boundary conditions, malformed trace handling, empty trace cases, max_requests truncation
- Test it in practice quickly with [NB COLAB](https://colab.research.google.com/drive/1hOY9Kg5BVYz4BZzJLrcUKWiMDR2_lU7V?usp=sharing)

### Next Steps (this PR)

1. Apply reviewer feedback
2. Add E2E tests verifying end-to-end trace replay flow ✅
3. Add integration tests (if needed)
4. Add CLI usage examples in PR description and docs ✅

### Out of Scope (future PRs or not)

- Mooncake trace format support (token-level traces)
- Helper utilities for timestamp format conversions (Unix epoch, ISO8601, relative timestamps)
- Support for request payload traces (not just token counts)
- Trace file validation and schema verification tools
- Performance optimizations for large trace files (streaming, chunked processing)
- Metrics export formatted for trace analysis comparison
- Support for trace file compression formats (.gz, .bz2)

## Use of AI

- [x] Includes code generated by an AI application
2 parents 4fb9e8a + 3084876 commit 7a43ece

12 files changed

Lines changed: 1353 additions & 2 deletions

docs/getting-started/benchmark.md

Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -65,6 +65,7 @@ GuideLLM offers a wide range of configuration options to customize your benchmar
6565
| `--random-seed` | Random seed for reproducibility | `--random-seed 42` |
6666
| `--max-seconds` | Duration for each benchmark in seconds | `--max-seconds 30` |
6767
| `--max-requests` | Maximum number of requests for each benchmark | `--max-requests 1000` |
68+
| `--data-samples` | Maximum number of dataset rows to load | `--data-samples 1000` |
6869
| `--output-dir` | Directory path to save output files | `--output-dir results/` |
6970
| `--outputs` | Output formats to generate | `--outputs json csv html` |
7071

@@ -187,6 +188,36 @@ guidellm benchmark \
187188

188189
You can customize synthetic data generation with additional parameters such as standard deviation, minimum, and maximum values. See the [Datasets Synthetic data documentation](../guides/datasets.md#synthetic-data) for more details.
189190

191+
### Trace Replay Benchmarking (beta)
192+
193+
For realistic load testing, replay trace events using each row's timestamp and token lengths. Trace files must be JSONL and are loaded with the `trace_synthetic` data type. By default, each row uses `timestamp`, `input_length`, and `output_length` fields. Timestamps may be absolute or monotonic values; GuideLLM sorts them and converts them to offsets from the first event before scheduling:
194+
195+
```json
196+
{"timestamp": 1234500.0, "input_length": 256, "output_length": 128}
197+
{"timestamp": 1234500.5, "input_length": 512, "output_length": 64}
198+
```
199+
200+
In this example, the second request is scheduled 0.5 seconds after the first request.
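The sorting and offset conversion described above can be sketched in a few lines of Python. This is a minimal illustration and not GuideLLM's implementation: the helper name `relative_offsets` is hypothetical, and it assumes every row carries a numeric timestamp in seconds.

```python
import json


def relative_offsets(lines, timestamp_column="timestamp"):
    """Sort trace rows by timestamp and return (offsets, sorted_rows).

    Hypothetical helper for illustration only; assumes numeric timestamps.
    """
    rows = [json.loads(line) for line in lines if line.strip()]
    rows.sort(key=lambda row: float(row[timestamp_column]))
    if not rows:
        return [], []
    first = float(rows[0][timestamp_column])
    # Each offset is the distance in seconds from the earliest event.
    offsets = [float(row[timestamp_column]) - first for row in rows]
    return offsets, rows


# Rows are deliberately out of order to show the effect of sorting.
trace = [
    '{"timestamp": 1234500.5, "input_length": 512, "output_length": 64}',
    '{"timestamp": 1234500.0, "input_length": 256, "output_length": 128}',
]
offsets, rows = relative_offsets(trace)
print(offsets)  # [0.0, 0.5]
```

Returning the sorted rows together with the offsets keeps each offset paired with the token lengths from the same row.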
201+
202+
Run with the `replay` profile:
203+
204+
```bash
205+
guidellm benchmark \
206+
--target "http://localhost:8000" \
207+
--data path/to/trace.jsonl \
208+
--data-args type_=trace_synthetic \
209+
--profile replay \
210+
--rate 1.0
211+
```
212+
213+
The `--rate` parameter acts as a time scale for the intervals between trace events, not requests per second: `1.0` preserves the original timing, `2.0` doubles the intervals and runs twice as long, and `0.5` halves the intervals and runs twice as fast.
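To make the time-scale behavior concrete, the sketch below applies the formula from the `ReplayProfile` docstring in this commit (`start_time + time_scale * relative_timestamp[i]`) to a small list of offsets. The function name `scheduled_times` and the sample offsets are assumptions made for illustration.

```python
def scheduled_times(relative_timestamps, time_scale=1.0, start_time=0.0):
    """Return the dispatch time for each offset (hypothetical helper)."""
    if time_scale <= 0:
        raise ValueError("time_scale must be positive")
    return [start_time + time_scale * offset for offset in relative_timestamps]


offsets = [0.0, 0.5, 2.0]
original = scheduled_times(offsets, time_scale=1.0)  # [0.0, 0.5, 2.0]
slower = scheduled_times(offsets, time_scale=2.0)    # [0.0, 1.0, 4.0]
faster = scheduled_times(offsets, time_scale=0.5)    # [0.0, 0.25, 1.0]
print(original, slower, faster)
```

With a scale of `2.0` the last request is dispatched at 4.0 seconds instead of 2.0, so the replay takes twice as long; with `0.5` it finishes in half the time.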
214+
215+
GuideLLM orders trace rows by timestamp before scheduling and payload generation, so each scheduled event uses the token lengths from the same sorted row. Use `--data-samples` to limit how many trace rows are loaded and replayed. `--max-requests` remains a runtime completion constraint; it does not truncate the trace dataset.
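The difference between the two options can be sketched as follows. The logic mirrors what `ReplayProfile.resolve_args` does in this commit's `profiles.py` diff; the standalone function `plan_replay` is a hypothetical wrapper, and how the runtime later enforces the constraint is outside the sketch.

```python
def plan_replay(relative_timestamps, data_samples=-1, constraints=None):
    """Truncate offsets by data_samples and default max_requests.

    Hypothetical wrapper around the logic shown in ReplayProfile.resolve_args.
    """
    # data_samples limits how many trace rows are replayed at all.
    if isinstance(data_samples, int) and data_samples > 0:
        relative_timestamps = relative_timestamps[:data_samples]
    if not relative_timestamps:
        raise ValueError("No timestamps remain after applying data_samples.")

    constraints = dict(constraints or {})
    # Only fill in max_requests when the caller set no request-count limit.
    request_keys = ("max_number", "max_num", "max_requests", "max_req")
    if not any(key in constraints for key in request_keys):
        constraints["max_requests"] = len(relative_timestamps)
    return relative_timestamps, constraints


truncated = plan_replay([0.0, 0.5, 2.0], data_samples=2)
explicit = plan_replay([0.0, 0.5, 2.0], constraints={"max_requests": 1})
print(truncated)  # ([0.0, 0.5], {'max_requests': 2})
print(explicit)   # ([0.0, 0.5, 2.0], {'max_requests': 1})
```

In the second call all three timestamps stay in the schedule, because an explicit `max_requests` only constrains the run and does not shorten the trace.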
216+
217+
If your trace uses different column names, map them with `timestamp_column`, `prompt_tokens_column`, and `output_tokens_column` in `--data-args`.
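As an illustration of what the column mapping achieves, the sketch below reads rows that use custom field names and normalizes them to the default names. It is not the `trace_synthetic` deserializer: `read_trace_rows` and the sample field names `ts`, `in_tok`, and `out_tok` are hypothetical.

```python
import json


def read_trace_rows(
    lines,
    timestamp_column="timestamp",
    prompt_tokens_column="input_length",
    output_tokens_column="output_length",
):
    """Normalize trace rows to the default field names (hypothetical helper)."""
    rows = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        try:
            rows.append(
                {
                    "timestamp": float(record[timestamp_column]),
                    "input_length": int(record[prompt_tokens_column]),
                    "output_length": int(record[output_tokens_column]),
                }
            )
        except KeyError as err:
            raise ValueError(f"line {number}: missing column {err}") from err
    return rows


custom_trace = ['{"ts": 10.0, "in_tok": 256, "out_tok": 128}']
rows = read_trace_rows(
    custom_trace,
    timestamp_column="ts",
    prompt_tokens_column="in_tok",
    output_tokens_column="out_tok",
)
print(rows)  # [{'timestamp': 10.0, 'input_length': 256, 'output_length': 128}]
```

For a trace shaped like this one, the matching `--data-args` value would presumably name the same three columns alongside `type_=trace_synthetic`.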
218+
219+
For very small prompts (roughly under 15 tokens, depending on the tokenizer), GuideLLM may not have enough room to include the full per-row unique prefix. Different rows can then produce similar or identical prompts, which reduces cache resistance in replay benchmarks.
220+
190221
### Working with Real Data
191222

192223
While synthetic data is convenient for quick tests, you can benchmark with real-world data:

docs/guides/datasets.md

Lines changed: 49 additions & 0 deletions
@@ -13,6 +13,10 @@ The following arguments can be used to configure datasets and their processing:
1313
- `prompt_column`: Specifies the column name for the prompt. By default, GuideLLM will try the most common column names (e.g., `prompt`, `text`, `input`).
1414
- `prompt_tokens_count_column`: Specifies the column name for the prompt token count. These are used to set the request prompt token count for counting metrics. By default, GuideLLM assumes no token count is provided.
1515
- `output_tokens_count_column`: Specifies the column name for the output token count. These are used to set the request output token count for the request and counting metrics. By default, GuideLLM assumes no token count is provided.
16+
- `type_`: Selects a specialized dataset deserializer, such as `trace_synthetic` for trace replay files.
17+
- `timestamp_column`: Specifies the timestamp column for `trace_synthetic` data. The default is `timestamp`.
18+
- `prompt_tokens_column`: Specifies the prompt token length column for `trace_synthetic` data. The default is `input_length`.
19+
- `output_tokens_column`: Specifies the output token length column for `trace_synthetic` data. The default is `output_length`.
1620
- `split`: Specifies the dataset split to use (e.g., `train`, `val`, `test`). By default, GuideLLM will try the most common split names (e.g., `train`, `validation`, `test`) if the dataset has splits, otherwise it will use the entire dataset.
1721
- Any remaining arguments are passed directly into the dataset constructor as kwargs.
1822
- `--data-sampler`: Specifies the sampling strategy for datasets. By default, no sampling is applied. When set to `random`, it enables random shuffling of the dataset, which can be useful for creating diverse batches during benchmarking.
@@ -116,22 +120,64 @@ GuideLLM supports various file formats for datasets, including text, CSV, JSON,
116120
#### Supported Formats with Examples
117121

118122
- **Text files (`.txt`, `.text`)**: Where each line is a separate prompt to use.
123+
119124
```
120125
Hello, how are you?
121126
What is your name?
122127
```
128+
123129
- **CSV files (`.csv`)**: Where each row is a separate dataset entry and the first row contains the column names. The columns should include `prompt` or other common names for the prompt which will be used as the prompt column. Additional columns can be included based on the previously mentioned aliases for the `--data-column-mapper` argument.
130+
124131
```csv
125132
prompt,output_tokens_count,additional_column,additional_column2
126133
Hello, how are you?,5,foo,bar
127134
What is your name?,3,baz,qux
128135
```
136+
129137
- **JSON Lines files (`.jsonl`)**: Where each line is a separate JSON object. The objects should include `prompt` or other common names for the prompt which will be used as the prompt column. Additional fields can be included based on the previously mentioned aliases for the `--data-args` argument.
138+
130139
```json
131140
{"prompt": "Hello, how are you?", "output_tokens_count": 5, "additional_column": "foo", "additional_column2": "bar"}
132141
{"prompt": "What is your name?", "output_tokens_count": 3, "additional_column": "baz", "additional_column2": "qux"}
133142
```
143+
144+
- **Trace files (`.jsonl` with `trace_synthetic` type)**: Specialized JSONL files for replay benchmarking with `timestamp`, `input_length`, and `output_length` fields. Used with `--profile replay` to replay trace events using each row's timestamp and token lengths. Timestamps must be numbers expressed in seconds on a shared timeline with any consistent zero point; GuideLLM sorts them and converts them to offsets from the first event before scheduling. Date strings are not parsed yet, so provide timestamps as numbers. See [Trace Replay Benchmarking](../getting-started/benchmark.md#trace-replay-benchmarking).
145+
146+
```json
147+
{"timestamp": 1234500.0, "input_length": 256, "output_length": 128}
148+
{"timestamp": 1234500.5, "input_length": 512, "output_length": 64}
149+
```
150+
151+
In this example, the second request is scheduled 0.5 seconds after the first request. Trace rows are ordered by timestamp before GuideLLM schedules requests and generates synthetic payloads. This keeps each scheduled event aligned with the prompt and output token lengths from the same row.
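Because date strings are not parsed, a trace that stores ISO 8601 timestamps has to be converted to numeric seconds before replay. The sketch below shows one way to do that with the Python standard library. The helper names are hypothetical, and treating timestamps without a UTC offset as UTC is an assumption that may not fit every trace.

```python
import json
from datetime import datetime, timezone


def iso_to_epoch_seconds(value):
    """Convert an ISO 8601 string to Unix epoch seconds."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.timestamp()


def convert_trace_lines(lines, timestamp_column="timestamp"):
    """Return JSONL lines whose string timestamps are replaced by numbers."""
    converted = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if isinstance(record[timestamp_column], str):
            record[timestamp_column] = iso_to_epoch_seconds(record[timestamp_column])
        converted.append(json.dumps(record))
    return converted


raw = [
    '{"timestamp": "2024-01-01T00:00:00+00:00", "input_length": 256, "output_length": 128}',
    '{"timestamp": "2024-01-01T00:00:00.500000+00:00", "input_length": 512, "output_length": 64}',
]
numeric = [json.loads(line) for line in convert_trace_lines(raw)]
print(numeric[1]["timestamp"] - numeric[0]["timestamp"])  # 0.5
```

Rows that already hold numeric timestamps pass through unchanged, so the conversion can be applied to mixed files.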
152+
153+
Use `--data-args type_=trace_synthetic` to enable trace loading:
154+
155+
```bash
156+
guidellm benchmark \
157+
--target http://localhost:8000 \
158+
--profile replay \
159+
--rate 1.0 \
160+
--data path/to/trace.jsonl \
161+
--data-args type_=trace_synthetic
162+
```
163+
164+
If your trace uses different column names, configure them with `timestamp_column`, `prompt_tokens_column`, and `output_tokens_column`:
165+
166+
```bash
167+
guidellm benchmark \
168+
--target http://localhost:8000 \
169+
--profile replay \
170+
--rate 1.0 \
171+
--data replay.jsonl \
172+
--data-args type_=trace_synthetic,timestamp_column=timestamp,prompt_tokens_column=input_length,output_tokens_column=output_length
173+
```
174+
175+
For replay, `--rate` is a time scale for the intervals between trace events rather than requests per second. Use `--data-samples` to limit how many trace rows are loaded and replayed. Use `--max-requests` only as a runtime completion constraint; it does not limit the trace rows loaded from the file.
176+
177+
Very small `input_length` values (roughly under 15 tokens, depending on the tokenizer) may not leave enough room for the full per-row unique prefix in the synthetic prompt. This can make prompts more similar across rows and weaken cache resistance. See [Trace Replay Benchmarking](../getting-started/benchmark.md#trace-replay-benchmarking) for details.
178+
134179
- **JSON files (`.json`)**: Where the entire dataset is represented as a JSON array of objects nested under a specific key. To surface the correct key to use, a `--data-column-mapper` argument must be passed in of `"field": "NAME"` for where the array exists. The objects should include `prompt` or other common names for the prompt which will be used as the prompt column. Additional fields can be included based on the previously mentioned aliases for the `--data-column-mapper` argument.
180+
135181
```json
136182
{
137183
"version": "1.0",
@@ -141,8 +187,11 @@ GuideLLM supports various file formats for datasets, including text, CSV, JSON,
141187
]
142188
}
143189
```
190+
144191
- **Parquet files (`.parquet`)** Example: A binary columnar storage format for efficient data processing. For more information on the supported formats, see the Hugging Face dataset documentation linked in the [Notes](#notes) section.
192+
145193
- **Arrow files (`.arrow`)** Example: A cross-language development platform for in-memory data. For more information on the supported formats, see the Hugging Face dataset documentation linked in the [Notes](#notes) section.
194+
146195
- **HDF5 files (`.hdf5`)** Example: A hierarchical data format for storing large amounts of data. For more information on the supported formats, see the Hugging Face dataset documentation linked in the [Notes](#notes) section.
147196

148197
#### Example Commands

src/guidellm/benchmark/entrypoints.py

Lines changed: 9 additions & 0 deletions
@@ -350,6 +350,8 @@ async def resolve_profile(
350350
max_global_error_rate: float | None,
351351
over_saturation: dict[str, Any] | None = None,
352352
console: Console | None = None,
353+
data: list[Any] | None = None,
354+
**profile_kwargs: Any,
353355
) -> Profile:
354356
"""
355357
Resolve and configure a benchmark profile with rate and constraint settings.
@@ -371,6 +373,8 @@ async def resolve_profile(
371373
:param max_global_error_rate: Maximum global error rate threshold before stopping
372374
:param over_saturation: Over-saturation detection configuration (dict)
373375
:param console: Console instance for progress reporting, or None
376+
:param data: Optional list of data sources.
377+
:param profile_kwargs: Additional profile-specific arguments.
374378
:return: Configured Profile instance ready for benchmarking
375379
:raises ValueError: If constraints are provided with a pre-configured Profile
376380
"""
@@ -398,6 +402,8 @@ async def resolve_profile(
398402
random_seed=random_seed,
399403
rampup_duration=rampup,
400404
constraints={**constraints},
405+
data=data,
406+
**profile_kwargs,
401407
)
402408
elif constraints:
403409
raise ValueError(
@@ -529,6 +535,9 @@ async def benchmark_generative_text(
529535
max_global_error_rate=args.max_global_error_rate,
530536
over_saturation=args.over_saturation,
531537
console=console,
538+
data=args.data,
539+
data_args=args.data_args,
540+
data_samples=request_loader.info.get("data_samples", -1),
532541
)
533542
output_formats = await resolve_output_formats(
534543
outputs=args.outputs, output_dir=args.output_dir, console=console

src/guidellm/benchmark/profiles.py

Lines changed: 109 additions & 1 deletion
@@ -13,6 +13,7 @@
1313

1414
from abc import ABC, abstractmethod
1515
from collections.abc import Generator
16+
from pathlib import Path
1617
from typing import TYPE_CHECKING, Annotated, Any, ClassVar, Literal
1718

1819
import numpy as np
@@ -37,8 +38,10 @@
3738
SchedulingStrategy,
3839
SynchronousStrategy,
3940
ThroughputStrategy,
41+
TraceReplayStrategy,
4042
)
4143
from guidellm.schemas import PydanticClassRegistryMixin
44+
from guidellm.utils.trace_io import load_relative_timestamps
4245

4346
if TYPE_CHECKING:
4447
from guidellm.benchmark.schemas import Benchmark
@@ -48,13 +51,14 @@
4851
"ConcurrentProfile",
4952
"Profile",
5053
"ProfileType",
54+
"ReplayProfile",
5155
"SweepProfile",
5256
"SynchronousProfile",
5357
"ThroughputProfile",
5458
]
5559

5660
ProfileType = Annotated[
57-
Literal["synchronous", "concurrent", "throughput", "async", "sweep"],
61+
Literal["synchronous", "concurrent", "throughput", "async", "sweep", "replay"],
5862
"Profile type identifiers for polymorphic deserialization",
5963
]
6064

@@ -328,6 +332,110 @@ def next_strategy(
328332
return SynchronousStrategy()
329333

330334

+@Profile.register("replay")
+class ReplayProfile(Profile):
+    """
+    Replay a trace file:
+    schedule each request at start_time + time_scale * relative_timestamp[i].
+
+    For this profile, the ``rate`` argument is interpreted as time_scale (scale factor
+    applied to relative timestamps), not as requests per second.
+
+    When ``data_samples`` is set, the replayed timestamps are truncated to match
+    the sampled dataset size.
+    """
+
+    type_: Literal["replay"] = "replay"  # type: ignore[assignment]
+    relative_timestamps: list[float] = Field(
+        description="Request start times relative to first event (first = 0)",
+    )
+    time_scale: float = Field(
+        default=1.0,
+        gt=0,
+        description="Scale factor applied to relative timestamps",
+    )
+
+    @classmethod
+    def resolve_args(
+        cls,
+        rate_type: str,
+        rate: list[float] | None,
+        random_seed: int,
+        **kwargs: Any,
+    ) -> dict[str, Any]:
+        _ = (rate_type, random_seed)  # unused
+        data = kwargs.get("data")
+        if not data:
+            raise ValueError("Replay profile requires data (path to trace file)")
+        if len(data) != 1:
+            raise ValueError(
+                f"ReplayProfile requires exactly one data source, received {len(data)}"
+            )
+        if not data[0]:
+            raise ValueError("Replay profile requires data (path to trace file)")
+        path = Path(data[0]) if isinstance(data[0], str) else data[0]
+        if not path.exists():
+            raise ValueError(f"Replay trace file not found: {path}")
+
+        # For replay profile, rate is interpreted as time_scale (not requests per
+        # second)
+        time_scale = rate[0] if rate and len(rate) > 0 else 1.0
+
+        # Honor a custom timestamp column when configured via --data-args so the
+        # replay profile and trace_synthetic deserializer use the same field.
+        data_args = kwargs.get("data_args") or []
+        first_args = data_args[0] if data_args else {}
+        timestamp_column = "timestamp"
+        if isinstance(first_args, dict):
+            raw_timestamp_column = first_args.get("timestamp_column")
+            if isinstance(raw_timestamp_column, str) and raw_timestamp_column.strip():
+                timestamp_column = raw_timestamp_column
+
+        relative_timestamps = load_relative_timestamps(
+            path, timestamp_column=timestamp_column
+        )
+        data_samples = kwargs.get("data_samples", -1)
+        if isinstance(data_samples, int) and data_samples > 0:
+            relative_timestamps = relative_timestamps[:data_samples]
+
+        if not relative_timestamps:
+            raise ValueError(
+                "No timestamps remain after applying data_samples. "
+                "The trace is empty or all events were filtered out."
+            )
+
+        constraints = dict(kwargs.get("constraints") or {})
+        if not any(
+            key in constraints
+            for key in ("max_number", "max_num", "max_requests", "max_req")
+        ):
+            constraints["max_requests"] = len(relative_timestamps)
+
+        return {
+            "relative_timestamps": relative_timestamps,
+            "time_scale": time_scale,
+            "constraints": constraints,
+        }
+
+    @property
+    def strategy_types(self) -> list[str]:
+        return ["trace"]
+
+    def next_strategy(
+        self,
+        prev_strategy: SchedulingStrategy | None,
+        prev_benchmark: Benchmark | None,
+    ) -> TraceReplayStrategy | None:
+        _ = prev_benchmark
+        # Replay has a single strategy; return it once, then None
+        if prev_strategy is not None:
+            return None
+        return TraceReplayStrategy(
+            relative_timestamps=self.relative_timestamps,
+            time_scale=self.time_scale,
+        )
+
+
331439
@Profile.register("concurrent")
332440
class ConcurrentProfile(Profile):
333441
"""

src/guidellm/data/deserializers/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -25,6 +25,7 @@
2525
SyntheticTextDataset,
2626
SyntheticTextDatasetDeserializer,
2727
)
28+
from .trace_synthetic import TraceSyntheticDatasetDeserializer
2829

2930
__all__ = [
3031
"ArrowFileDatasetDeserializer",
@@ -46,4 +47,5 @@
4647
"SyntheticTextDatasetDeserializer",
4748
"TarFileDatasetDeserializer",
4849
"TextFileDatasetDeserializer",
50+
"TraceSyntheticDatasetDeserializer",
4951
]
