
[FEAT] Add replay from trace strategy#620

Merged
sjmonson merged 28 commits into vllm-project:main from VincentG1234:add-strategy-replay-from-trace on May 18, 2026
Conversation

@VincentG1234
Contributor

@VincentG1234 VincentG1234 commented Mar 4, 2026

Summary

  • Add a new replay benchmarking strategy that reproduces real-world request patterns from trace log files (.jsonl)
  • Enable time-based request rate replay with precise timestamp scheduling
  • Support synthetic prompt generation that matches token counts from trace files
  • Use the max_requests and max_seconds CLI options to limit the number of requests processed from a trace

Motivation

This change addresses issue #597 by enabling users to benchmark their vLLM servers using real production traces. Instead of synthetic load patterns, users can now replay exact request arrival times and token distributions from their actual workloads for more realistic performance testing.

Changes

  • Add TraceReplayStrategy scheduler strategy for timestamp-based request dispatching
  • Add ReplayProfile class for configuring trace-based benchmarking parameters
  • Add TraceSyntheticDatasetDeserializer to generate prompts matching trace input/output lengths
  • Add TraceReader utility for reading .jsonl trace files with timestamp, input_length, output_length fields
  • Update Entrypoint to handle replay profile and dataset configuration
  • Add max_requests and max_seconds truncation support to limit trace replay length
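
For reference, a minimal sketch of reading such a trace, assuming one JSON object per line with the three fields named above. `parse_trace_lines` and `REQUIRED_FIELDS` are hypothetical names for illustration and are not the PR's TraceReader API; the sample values are made up, and treating timestamps as seconds is an assumption.

```python
import json
from typing import Iterable

# Field names taken from the PR description; everything else here is illustrative.
REQUIRED_FIELDS = ("timestamp", "input_length", "output_length")


def parse_trace_lines(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL trace lines into rows, skipping blank lines.

    Hypothetical helper, not the PR's TraceReader.
    """
    rows = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if field not in row]
        if missing:
            raise ValueError(f"line {line_number}: missing fields {missing}")
        rows.append(
            {
                "timestamp": float(row["timestamp"]),
                "input_length": int(row["input_length"]),
                "output_length": int(row["output_length"]),
            }
        )
    return rows


# Made-up two-row trace; timestamps are assumed to be in seconds.
example_trace = """\
{"timestamp": 0.0, "input_length": 128, "output_length": 32}
{"timestamp": 0.5, "input_length": 512, "output_length": 64}
"""
trace_rows = parse_trace_lines(example_trace.splitlines())
```

Rejecting rows with missing fields up front keeps malformed traces from surfacing later as scheduling errors.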

Testing

  • pytest tests/unit/scheduler/test_trace_replay.py (pass)

  • pytest tests/unit/benchmark/test_replay_profile.py (pass)

  • pytest tests/unit/data/deserializers/test_trace_synthetic.py (pass)

  • Added tests: scheduling accuracy, boundary conditions, malformed trace handling, empty trace cases, max_requests truncation

  • Tested quickly in practice with a Colab notebook

Next Steps (this PR)

  1. Apply reviewer feedback
  2. Add E2E tests verifying end-to-end trace replay flow ✅
  3. Add integration tests (if needed)
  4. Add CLI usage examples in PR description and docs ✅

Out of Scope (future PRs or not)

  • Mooncake trace format support (token-level traces)
  • Helper utilities for timestamp format conversions (Unix epoch, ISO8601, relative timestamps)
  • Support for request payload traces (not just token counts)
  • Trace file validation and schema verification tools
  • Performance optimizations for large trace files (streaming, chunked processing)
  • Metrics export formatted for trace analysis comparison
  • Support for trace file compression formats (.gz, .bz2)

Use of AI

  • Includes code generated by an AI application

@VincentG1234 VincentG1234 force-pushed the add-strategy-replay-from-trace branch from 008633f to a66034b Compare March 4, 2026 13:32
@mergify
Contributor

mergify Bot commented Mar 18, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @VincentG1234.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Mar 18, 2026
@VincentG1234 VincentG1234 force-pushed the add-strategy-replay-from-trace branch from 7f893fb to 780be20 Compare March 18, 2026 13:12
@mergify mergify Bot removed the needs-rebase label Mar 18, 2026
@VincentG1234 VincentG1234 marked this pull request as ready for review March 18, 2026 13:13
@sjmonson sjmonson self-requested a review March 18, 2026 19:18
@dbutenhof dbutenhof added this to the v0.7.0 milestone Mar 20, 2026
@hgsmn

hgsmn commented Mar 30, 2026

It would be great to have an example of how to get the JSONL, because I can't find a way to produce it with LiteLLM, for example.

@VincentG1234
Contributor Author

Yeah that’s true, most frameworks won’t produce this exact JSONL directly.

That’s kind of intentional. The idea here is to define a minimal, framework-agnostic canonical replay format, not something tied to a specific tracing stack.

In practice, the required fields already exist almost everywhere (timestamp, input token count, output token count), just under slightly different names, so a small mapping step is usually enough.

I agree it’s not the best UX on its own, but it felt like the right minimal base for the feature. Then we can iterate on top of it with helpers / converters for common sources like LiteLLM or Langfuse. And we can extend it later (e.g. optional prompt field, multiple timestamp formats, richer metadata) without breaking the core idea.

But happy to adjust the direction if maintainers prefer something more opinionated or integrated from the start.
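
To make the "small mapping step" concrete, here is a rough sketch of converting records from another logging source into the replay JSONL. The source field names (`start_time`, `prompt_tokens`, `completion_tokens`) and the helpers `to_replay_row` and `convert_records` are assumptions for illustration only; they do not describe what LiteLLM or Langfuse actually export, so adjust `FIELD_MAP` to the real names in your logs.

```python
import json

# Hypothetical source field names mapped to the replay format's field names.
FIELD_MAP = {
    "start_time": "timestamp",
    "prompt_tokens": "input_length",
    "completion_tokens": "output_length",
}


def to_replay_row(record: dict) -> dict:
    """Keep only the three replay fields, renamed; extra fields are dropped."""
    return {target: record[source] for source, target in FIELD_MAP.items()}


def convert_records(records: list[dict]) -> str:
    """Return JSONL text with one replay row per input record."""
    return "\n".join(json.dumps(to_replay_row(record)) for record in records)


# Made-up records standing in for an export from some tracing tool.
source_records = [
    {"start_time": 10.0, "prompt_tokens": 200, "completion_tokens": 50, "model": "m"},
    {"start_time": 10.4, "prompt_tokens": 90, "completion_tokens": 20, "model": "m"},
]
jsonl_text = convert_records(source_records)
```

Writing `jsonl_text` to a file gives a trace in the shape this PR expects, provided the source timestamps are already in a format the replay accepts.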

Collaborator

@sjmonson sjmonson left a comment


Sorry for the silence on this. There are a few things in this PR that break other use cases. I am still working on a more complete review, but here are a few low-hanging problems.

Comment thread src/guidellm/utils/trace_io.py
Comment thread src/guidellm/scheduler/__init__.py Outdated
Comment thread src/guidellm/scheduler/__init__.py Outdated
Comment thread src/guidellm/scheduler/strategies.py Outdated
Comment thread src/guidellm/scheduler/strategies.py Outdated
Comment thread src/guidellm/benchmark/entrypoints.py Outdated
Comment thread src/guidellm/benchmark/entrypoints.py Outdated
Comment thread src/guidellm/benchmark/entrypoints.py Outdated
Comment thread src/guidellm/utils/trace_io.py
Comment thread src/guidellm/data/trace_io.py Outdated
@VincentG1234
Contributor Author

Thanks a lot for the detailed review, I really appreciate your time.

I’m fully aligned with your feedback, especially on the replay handling in the entrypoint, which is a key part of the PR. I agree that introducing a special case here is not ideal and should be avoided.

I’ll refactor this to make it cleaner and better aligned with the existing design.

@VincentG1234 VincentG1234 marked this pull request as draft April 20, 2026 07:49
@VincentG1234 VincentG1234 force-pushed the add-strategy-replay-from-trace branch from 780be20 to 7d76d5f Compare April 22, 2026 15:10
@VincentG1234 VincentG1234 marked this pull request as ready for review April 27, 2026 08:56
@VincentG1234
Contributor Author

Hi @sjmonson,

Thanks again for the feedback.

I’ve completed the refactor and addressed the main review points.

Key updates:

  • replay now uses the standard request loader flow (removed special-case path)
  • data_samples handles dataset truncation, max_requests stays runtime-only
  • shared trace loading moved to guidellm.utils.trace_io
  • simplified replay constraints
  • tests/docs cleaned up accordingly

Would appreciate another look when you have time.

Optional: I also put together a small Colab notebook to try the feature quickly if useful:
https://colab.research.google.com/drive/1hOY9Kg5BVYz4BZzJLrcUKWiMDR2_lU7V?usp=sharing

@sjmonson sjmonson self-requested a review April 28, 2026 14:09
Collaborator

@dbutenhof dbutenhof left a comment


A couple of comments mostly around documentation consistency...

Comment thread docs/getting-started/benchmark.md Outdated
Comment thread docs/guides/datasets.md Outdated
Comment thread docs/guides/datasets.md Outdated
Comment thread docs/guides/datasets.md Outdated
@VincentG1234 VincentG1234 force-pushed the add-strategy-replay-from-trace branch 2 times, most recently from e17eb3d to b6c56f3 Compare April 30, 2026 16:30
Comment thread src/guidellm/utils/trace_io.py Outdated
@sjmonson
Collaborator

sjmonson commented May 4, 2026

I tried running a test with this dataset and got a deadlock at the start of the benchmark. Here is a jsonl version for testing: data.jsonl.gz.

@VincentG1234
Contributor Author

VincentG1234 commented May 4, 2026

> I tried running a test with this dataset and got a deadlock at the start of the benchmark. Here is a jsonl version for testing: data.jsonl.gz.

Thanks for testing this and for sharing the dataset.

I can reproduce the issue on my side as well, including with a smaller subset of the trace. I’ll investigate it and work on a fix as soon as possible.

At first glance, this seems related to how replay handles large/bursty traces and high-token-count requests.

I’ll follow up once I have a clearer diagnosis and a fix.

@VincentG1234
Contributor Author

Hi @sjmonson
This commit fixes two issues found while replaying the shared JSONL trace.

First, synthetic prompt generation now builds one reusable base prompt and creates each request prompt by adding a unique prefix before slicing to the requested input length. This keeps prompts cache-resistant while avoiding the previous expensive per-request generation path.

Second, trace replay is temporarily limited to one process. With multiple processes, there is currently a race condition where some scheduled requests can be consumed out of order or never sent, which leaves the benchmark waiting forever. Capping replay to one process is a workaround, but it makes the benchmark complete reliably; the only expected limitation is for extreme traces where one scheduling process may become a bottleneck.

I tested this with the shared JSONL trace: the benchmark no longer hangs and starts correctly after roughly 50 seconds. On a representative subset, the previous prompt generation path was at least 10x slower...
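
As a rough illustration of the first fix (one reusable base prompt, a unique prefix per request, then slicing to the requested length), here is a sketch that works on token ids. It is not the code from the commit: `build_base_tokens`, `make_prompt_tokens`, and the choice of a single out-of-vocabulary id as the unique prefix are assumptions made for the example.

```python
import random


def build_base_tokens(length: int, vocab_size: int, seed: int = 0) -> list[int]:
    """Generate the shared base prompt once (hypothetical helper)."""
    rng = random.Random(seed)
    return [rng.randrange(vocab_size) for _ in range(length)]


def make_prompt_tokens(
    base: list[int], request_index: int, input_length: int, prefix_offset: int
) -> list[int]:
    """Prepend a per-request marker, then slice to the requested length.

    The marker is a single id at or above prefix_offset, assumed here to be
    outside the base vocabulary so it never collides with base tokens.
    """
    unique_prefix = [prefix_offset + request_index]
    return (unique_prefix + base)[:input_length]


# Made-up input lengths standing in for the input_length column of a trace.
input_lengths = [4, 6, 3]
base_tokens = build_base_tokens(max(input_lengths), vocab_size=1000)
prompts = [
    make_prompt_tokens(base_tokens, index, length, prefix_offset=1000)
    for index, length in enumerate(input_lengths)
]
```

Because every prompt starts with a different token, no two requests share a cacheable prefix, while the expensive generation step runs only once for the base.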

@sjmonson
Collaborator

sjmonson commented May 8, 2026

> First, synthetic prompt generation now builds one reusable base prompt and creates each request prompt by adding a unique prefix before slicing to the requested input length. This keeps prompts cache-resistant while avoiding the previous expensive per-request generation path.

Yeah... I don't love this idea, but fine for now. I see the dataset generation as temporary/fallback anyway, since what we want longer term is to base the tokens on the Mooncake token IDs.

> Second, trace replay is temporarily limited to one process. With multiple processes, there is currently a race condition where some scheduled requests can be consumed out of order or never sent, which leaves the benchmark waiting forever. Capping replay to one process is a workaround, but it makes the benchmark complete reliably; the only expected limitation is for extreme traces where one scheduling process may become a bottleneck.

Also works for now. Fixing this requires a way for the dataset to inform on request scheduling so we'll scope something out.

@VincentG1234
Contributor Author

VincentG1234 commented May 9, 2026

For multiprocessing, I think I may have found a fairly minimal approach that avoids the replay deadlock while keeping the implementation relatively clean, but I agree it’s probably better scoped for a follow-up PR.

For the Mooncake token-id direction, unless I'm missing something, I think the current prefix invalidation approach can remain compatible with a more structure-aware strategy later on.

Roughly, unrelated prompts would still receive different invalidating prefixes, while prompts sharing a common prefix could intentionally reuse the same initial invalidating block and only diverge later with unique suffix markers.

Example:

base prompt:
[A, B, C, D, E]

prompt 1:
[x, A, B, C]

prompt 2 (totally unrelated to prompt 1):
[y, A, B, C, D]

prompt 3 (shares the same prefix structure as prompt 2 up to block D):
[y, A, B, C, z, D, E]

This keeps prompts cache-resistant globally while still allowing controlled shared-prefix behavior between related requests. The same idea could likely be extended recursively for deeper shared-prefix structures.
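
A small sketch of how the example above could be produced, purely to illustrate the idea and not a proposed implementation: `build_prompt` is a hypothetical helper, `markers` maps a base position to the invalidating token inserted before it, and `base_length` counts base tokens only (markers are extra).

```python
def build_prompt(base: list[str], base_length: int, markers: dict[int, str]) -> list[str]:
    """Take the first base_length base tokens, inserting each marker
    immediately before the base token at its position."""
    prompt = []
    for position, token in enumerate(base[:base_length]):
        if position in markers:
            prompt.append(markers[position])
        prompt.append(token)
    return prompt


base = ["A", "B", "C", "D", "E"]

prompt_1 = build_prompt(base, 3, {0: "x"})          # [x, A, B, C]
prompt_2 = build_prompt(base, 4, {0: "y"})          # [y, A, B, C, D]
prompt_3 = build_prompt(base, 5, {0: "y", 3: "z"})  # [y, A, B, C, z, D, E]
```

Prompts 2 and 3 share the block `[y, A, B, C]` and diverge at the `z` marker, while prompt 1 differs from both at the very first token.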

@sjmonson
Collaborator

augment review

@augmentcode

augmentcode Bot commented May 11, 2026

🤖 Augment PR Summary

Summary: Adds a new trace-replay benchmarking mode to reproduce real-world request arrival patterns from JSONL traces.

Changes:

  • Introduces a ReplayProfile that loads trace timestamps, converts them to relative offsets, and interprets --rate as a time-scale factor.
  • Adds TraceReplayStrategy to schedule each request at start_time + time_scale * relative_timestamp[i] (currently constrained to a single process).
  • Adds TraceSyntheticDatasetDeserializer (type_=trace_synthetic) to generate synthetic prompts matching input_length/output_length per trace row.
  • Creates shared guidellm.utils.trace_io helpers to load/sort trace rows and compute relative timestamps.
  • Extends the benchmark entrypoint to pass data/data_args/data_samples into profile resolution for replay configuration.
  • Updates docs to describe --data-samples, trace JSONL format, and replay semantics for --rate.
  • Adds unit tests covering replay profile resolution, trace IO validation, deserializer behavior, and strategy scheduling/parking behavior.

Technical notes: Trace rows are timestamp-sorted before scheduling and prompt generation; --data-samples truncates how many trace rows are loaded, while --max-requests remains a runtime completion constraint.
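
A minimal sketch of the sort-then-offset step described in this summary, assuming rows are dicts with a numeric `timestamp`. `relative_timestamps` is a hypothetical name and does not refer to the actual helpers in guidellm.utils.trace_io.

```python
def relative_timestamps(rows: list[dict]) -> list[float]:
    """Sort rows by timestamp and return offsets from the earliest row."""
    ordered = sorted(rows, key=lambda row: row["timestamp"])
    if not ordered:
        return []
    first = ordered[0]["timestamp"]
    return [row["timestamp"] - first for row in ordered]


# Made-up rows, deliberately out of order.
rows = [
    {"timestamp": 102.0, "input_length": 64, "output_length": 16},
    {"timestamp": 100.0, "input_length": 128, "output_length": 32},
    {"timestamp": 100.5, "input_length": 256, "output_length": 8},
]
offsets = relative_timestamps(rows)  # [0.0, 0.5, 2.0]
```

Working with offsets rather than absolute timestamps means the replay depends only on the gaps between requests, not on when the trace was recorded.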



@augmentcode augmentcode Bot left a comment


Review completed. 4 suggestions posted.


Comment thread src/guidellm/benchmark/profiles.py
Comment thread src/guidellm/benchmark/profiles.py
Comment thread src/guidellm/scheduler/strategies.py
Comment thread src/guidellm/data/deserializers/trace_synthetic.py
Signed-off-by: Vincent Gimenes <vincent.gimenes@gmail.com>
- Relocate trace_io module from data/ to utils/
- Update imports in scheduler/strategies.py
- Update imports in benchmark/profiles.py
- Update imports in data/deserializers/trace_synthetic.py
- Update imports in tests/unit/scheduler/test_trace_replay.py

Signed-off-by: Vincent Gimenes <vincent.gimenes@gmail.com>
…les as sole trace row cap

Signed-off-by: Vincent Gimenes <vincent.gimenes@gmail.com>
…and docs

Signed-off-by: Vincent Gimenes <vincent.gimenes@gmail.com>
…and docs

Signed-off-by: Vincent Gimenes <vincent.gimenes@gmail.com>
…generation

Signed-off-by: Vincent Gimenes <vincent.gimenes@gmail.com>
@jaredoconnell jaredoconnell force-pushed the add-strategy-replay-from-trace branch from 8a9c700 to 3084876 Compare May 18, 2026 15:23
Collaborator

@sjmonson sjmonson left a comment


Sorry to go back on my approval, but I am still seeing issues with the data.jsonl above. After 15 requests it hangs. Can someone else validate that they can run that dataset for at least 100 requests or so?

guidellm benchmark \
    --data ./data.jsonl \
    --data-args type_=trace_synthetic \
    --profile replay \
    --rate 1000.0 \
    --request-format /v1/completions

@VincentG1234
Contributor Author

> Sorry to go back on my approval, but I am still seeing issues with the data.jsonl above. After 15 requests it hangs. Can someone else validate that they can run that dataset for at least 100 requests or so?
>
> guidellm benchmark \
>     --data ./data.jsonl \
>     --data-args type_=trace_synthetic \
>     --profile replay \
>     --rate 1000.0 \
>     --request-format /v1/completions

Hello @sjmonson
The issue is likely the --rate 1000.0.
In replay mode, the rate scales the original intervals, so 1000.0 makes them 1000x longer.

To accelerate the replay and reduce the intervals, it should be something like:

--rate 0.001

The current behavior is a bit counter-intuitive at first glance though. I can also improve the CLI/docs wording to make it clearer that the rate acts as a multiplier on the original intervals. We could also consider inverting the behavior in the future, since people tend to interpret it as a speed-up factor.
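
To illustrate why 1000.0 stalls and 0.001 accelerates, here is a small sketch of the scheduling rule as described in this thread (send time = start time + rate × relative offset). `scheduled_times` is a hypothetical function, not the TraceReplayStrategy code, and the offsets are made up.

```python
def scheduled_times(
    start_time: float, relative_offsets: list[float], time_scale: float
) -> list[float]:
    """Scale each relative offset by time_scale and shift by start_time."""
    return [start_time + time_scale * offset for offset in relative_offsets]


# Made-up trace: three requests spread over 4 seconds of original time.
offsets = [0.0, 2.0, 4.0]

original = scheduled_times(0.0, offsets, 1.0)       # unchanged spacing
slowed = scheduled_times(0.0, offsets, 1000.0)      # gaps become 1000x longer
accelerated = scheduled_times(0.0, offsets, 0.001)  # gaps become 1000x shorter
```

With `--rate 1000.0` the second request of this toy trace would be sent 2000 seconds after the first, which looks like a hang; with `--rate 0.001` the whole trace is dispatched within a few milliseconds.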

@sjmonson
Collaborator

> To accelerate the replay and reduce the intervals, it should be something like:
>
> --rate 0.001

Oh my bad, that's what I get for not rereading the documentation after coming back to this.

> The current behavior is a bit counter-intuitive at first glance though. I can also improve the CLI/docs wording to make it clearer that the rate acts as a multiplier on the original intervals. We could also consider inverting the behavior in the future, since people tend to interpret it as a speed-up factor.

Yeah, it's a little counterintuitive when the option is just --rate. We have some upcoming work in #724 that will let each profile use its own argument names, so that should help.

Collaborator

@sjmonson sjmonson left a comment


LGTM, Thanks for being such an attentive contributor!

@sjmonson sjmonson merged commit 7a43ece into vllm-project:main May 18, 2026
11 checks passed