[FEAT] Add replay from trace strategy #620
008633f to
a66034b
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
7f893fb to
780be20
Compare
|
It would be great to have an example of "How to get the JSONL", because I can't find a way to produce it from LiteLLM, for example. |
|
Yeah that’s true, most frameworks won’t produce this exact JSONL directly. That’s kind of intentional. The idea here is to define a minimal, framework-agnostic canonical replay format, not something tied to a specific tracing stack. In practice, the required fields already exist almost everywhere (timestamp, input token count, output token count), just under slightly different names, so a small mapping step is usually enough. I agree it’s not the best UX on its own, but it felt like the right minimal base for the feature. Then we can iterate on top of it with helpers / converters for common sources like LiteLLM or Langfuse. And we can extend it later (e.g. optional prompt field, multiple timestamp formats, richer metadata) without breaking the core idea. But happy to adjust the direction if maintainers prefer something more opinionated or integrated from the start. |
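As a rough illustration of the mapping step mentioned above, the sketch below converts generic request-log records into rows carrying the three trace fields named in the PR description (timestamp, input_length, output_length) and writes them as JSONL. The source field names (created_at, prompt_tokens, completion_tokens) and the helper names are hypothetical stand-ins for whatever a given tracing stack exports, and writing the timestamp as epoch seconds is an assumption, not something this thread confirms.

```python
import json
from datetime import datetime
from typing import Any, Iterable

# Hypothetical source field names; adjust to what your tracing stack exports.
FIELD_MAP = {
    "timestamp": "created_at",
    "input_length": "prompt_tokens",
    "output_length": "completion_tokens",
}


def to_trace_rows(records: Iterable[dict[str, Any]]) -> list[dict[str, Any]]:
    """Map source log records to rows with the three trace fields."""
    rows = []
    for record in records:
        raw_ts = record[FIELD_MAP["timestamp"]]
        # Assumption: ISO 8601 strings are converted to epoch seconds.
        if isinstance(raw_ts, str):
            raw_ts = datetime.fromisoformat(raw_ts).timestamp()
        rows.append(
            {
                "timestamp": float(raw_ts),
                "input_length": int(record[FIELD_MAP["input_length"]]),
                "output_length": int(record[FIELD_MAP["output_length"]]),
            }
        )
    return rows


def write_jsonl(rows: Iterable[dict[str, Any]], path: str) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")
```

A converter for a specific source such as LiteLLM or Langfuse would mostly amount to a different FIELD_MAP plus whatever timestamp parsing that source needs.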
sjmonson
left a comment
Sorry for the silence on this. There are a few things in this PR that break other use cases. I am still working on a more complete review, but here are a few low-hanging problems.
|
Thanks a lot for the detailed review, I really appreciate your time. I’m fully aligned with your feedback, especially on the replay handling in the entrypoint, which is a key part of the PR. I agree that introducing a special case here is not ideal and should be avoided. I’ll refactor this to make it cleaner and better aligned with the existing design. |
780be20 to
7d76d5f
Compare
|
Hi @sjmonson, Thanks again for the feedback. I’ve completed the refactor and addressed the main review points. Key updates:
Would appreciate another look when you have time. Optional: I also put together a small Colab notebook to try the feature quickly if useful: |
dbutenhof
left a comment
A couple of comments mostly around documentation consistency...
e17eb3d to
b6c56f3
Compare
|
I tried running a test with this dataset and got a deadlock at the start of the benchmark. Here is a jsonl version for testing: data.jsonl.gz. |
Thanks for testing this and for sharing the dataset. I can reproduce the issue on my side as well, including with a smaller subset of the trace. I’ll investigate it and work on a fix as soon as possible. At first glance, this seems related to how replay handles large/bursty traces and high-token-count requests. I’ll follow up once I have a clearer diagnosis and a fix. |
|
Hi @sjmonson,
First, synthetic prompt generation now builds one reusable base prompt and creates each request prompt by adding a unique prefix before slicing to the requested input length. This keeps prompts cache-resistant while avoiding the previous expensive per-request generation path.
Second, trace replay is temporarily limited to one process. With multiple processes, there is currently a race condition where some scheduled requests can be consumed out of order or never sent, which leaves the benchmark waiting forever. Capping replay to one process is a workaround, but it makes the benchmark complete reliably; the only expected limitation is for extreme traces where one scheduling process may become a bottleneck.
I tested this with the shared JSONL trace: the benchmark no longer hangs and starts correctly after roughly 50 seconds. On a representative subset, the previous prompt generation path was at least 10x slower... |
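A minimal sketch of the prompt-generation idea described above, not the PR's actual code: one base prompt is built once, and each request prepends a unique marker before slicing to the requested input length. Whitespace-separated words stand in for tokenizer tokens here, and the names build_base_tokens and make_prompt are hypothetical.

```python
def build_base_tokens(vocab: list[str], length: int) -> list[str]:
    """Build one reusable base prompt (done once, not per request)."""
    if not vocab:
        raise ValueError("vocab must not be empty")
    return [vocab[i % len(vocab)] for i in range(length)]


def make_prompt(base_tokens: list[str], request_index: int, input_length: int) -> str:
    """Prepend a unique marker, then slice to the requested length."""
    if input_length > len(base_tokens) + 1:
        raise ValueError("base prompt is too short for the requested length")
    # The unique marker comes first so that no two prompts share a
    # leading span that a prefix cache could reuse.
    unique_prefix = [f"req{request_index}"]
    return " ".join((unique_prefix + base_tokens)[:input_length])
```

The per-request work is a list concatenation and a slice, which is where the saving over generating each prompt from scratch would come from in this simplified picture.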
Yeah... I don't love this idea, but fine for now. I see the dataset generation as temporary/fallback anyways since what we want longer term is to base the tokens off of the mooncake token ids.
Also works for now. Fixing this requires a way for the dataset to inform on request scheduling so we'll scope something out. |
|
For multiprocessing, I think I may have found a fairly minimal approach that avoids the replay deadlock while keeping the implementation relatively clean, but I agree it’s probably better scoped for a follow-up PR.
For the Mooncake token-id direction, unless I am missing something, I think the current prefix invalidation approach can remain compatible with a more structure-aware strategy later on. Roughly, unrelated prompts would still receive different invalidating prefixes, while prompts sharing a common prefix could intentionally reuse the same initial invalidating block and only diverge later with unique suffix markers.
Example:
base prompt:
prompt 1:
prompt 2 (totally unrelated to prompt 1):
prompt 3 (shares the same prefix structure as prompt 2 up to block D):
This keeps prompts cache-resistant globally while still allowing controlled shared-prefix behavior between related requests. The same idea could likely be extended recursively for deeper shared-prefix structures. |
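To make the shared-prefix idea concrete, here is a hypothetical sketch (not taken from the PR) in which a prompt is a list of blocks. Related requests reuse the same leading invalidating block and the same first shared_blocks blocks of the base prompt, then diverge at a unique marker. The marker strings, the group_id and request_id parameters, and both function names are assumptions made for illustration.

```python
def make_prompt_blocks(
    base_blocks: list[str], group_id: str, request_id: str, shared_blocks: int
) -> list[str]:
    """Invalidating block, shared span, unique marker, then the remainder."""
    return (
        [f"<inv:{group_id}>"]
        + base_blocks[:shared_blocks]
        + [f"<uniq:{request_id}>"]
        + base_blocks[shared_blocks:]
    )


def common_prefix_len(left: list[str], right: list[str]) -> int:
    """Number of leading blocks two prompts have in common."""
    length = 0
    for a, b in zip(left, right):
        if a != b:
            break
        length += 1
    return length


base = ["A", "B", "C", "D", "E", "F"]
prompt_1 = make_prompt_blocks(base, "g1", "r1", shared_blocks=0)
prompt_2 = make_prompt_blocks(base, "g2", "r2", shared_blocks=4)
prompt_3 = make_prompt_blocks(base, "g2", "r3", shared_blocks=4)
```

In this sketch prompt_2 and prompt_3 agree on the invalidating block plus blocks A through D and differ from the unique marker onward, while prompt_1 shares nothing with either because its first block differs.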
|
augment review |
🤖 Augment PR Summary
Summary: Adds a new trace-replay benchmarking mode to reproduce real-world request arrival patterns from JSONL traces.
Changes:
Technical notes: Trace rows are timestamp-sorted before scheduling and prompt generation. |
Signed-off-by: Vincent Gimenes <vincent.gimenes@gmail.com>
- Relocate trace_io module from data/ to utils/
- Update imports in scheduler/strategies.py
- Update imports in benchmark/profiles.py
- Update imports in data/deserializers/trace_synthetic.py
- Update imports in tests/unit/scheduler/test_trace_replay.py
Signed-off-by: Vincent Gimenes <vincent.gimenes@gmail.com>
…les as sole trace row cap Signed-off-by: Vincent Gimenes <vincent.gimenes@gmail.com>
…and docs Signed-off-by: Vincent Gimenes <vincent.gimenes@gmail.com>
…and docs Signed-off-by: Vincent Gimenes <vincent.gimenes@gmail.com>
…generation Signed-off-by: Vincent Gimenes <vincent.gimenes@gmail.com>
8a9c700 to
3084876
Compare
Sorry to go back on my approval, but I am still seeing issues with the data.jsonl above. After 15 requests it hangs. Can someone else validate that they can run that dataset for at least 100 requests or so?
guidellm benchmark \
--data ./data.jsonl \
--data-args type_=trace_synthetic \
--profile replay \
--rate 1000.0 \
--request-format /v1/completions
Hello @sjmonson,
To accelerate the replay and reduce the intervals, it should be something like: --rate 0.001
The current behavior is a bit counter-intuitive at first glance though. I can also improve the CLI/docs wording to make it clearer that the rate acts as a multiplier on the original intervals. We could also consider inverting the behavior in the future, since people tend to interpret it as a speed-up factor. |
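The effect of the rate as an interval multiplier can be sketched as follows. This is an illustrative model of the behavior described in the comment above, not the scheduler's actual code; the function name scale_offsets and the exact formula are assumptions.

```python
def scale_offsets(timestamps: list[float], rate: float) -> list[float]:
    """Return send offsets in seconds from the first request.

    Each gap between trace timestamps is multiplied by rate, so values
    below 1.0 compress the trace and values above 1.0 stretch it.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    if not timestamps:
        return []
    ordered = sorted(timestamps)
    start = ordered[0]
    return [(ts - start) * rate for ts in ordered]


# Requests recorded 1 s and 3 s after the first one.
trace = [100.0, 101.0, 103.0]
stretched = scale_offsets(trace, 1000.0)  # gaps become 1000x longer
compressed = scale_offsets(trace, 0.5)    # gaps become half as long
```

Under this model, --rate 1000.0 turns a one-second gap into a 1000-second wait, which would look like a hang, whereas --rate 0.001 shrinks the same gap to about a millisecond.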
Oh my bad, that's what I get for not rereading the documentation after coming back to this.
Yeah, it's a little counterintuitive when the option is just |
sjmonson
left a comment
LGTM, Thanks for being such an attentive contributor!
Summary
- replay benchmarking strategy that reproduces real-world request patterns from trace log files (.jsonl)
- max_requests and max_seconds CLI options to limit the number of requests processed from a trace
Motivation
This change addresses issue #597 by enabling users to benchmark their vLLM servers using real production traces. Instead of synthetic load patterns, users can now replay exact request arrival times and token distributions from their actual workloads for more realistic performance testing.
Changes
- TraceReplayStrategy scheduler strategy for timestamp-based request dispatching
- ReplayProfile class for configuring trace-based benchmarking parameters
- TraceSyntheticDatasetDeserializer to generate prompts matching trace input/output lengths
- TraceReader utility for reading .jsonl trace files with timestamp, input_length, output_length fields
- Entrypoint to handle replay profile and dataset configuration
- max_requests and max_seconds truncation support to limit trace replay length
Testing
- pytest tests/unit/scheduler/test_trace_replay.py (pass)
- pytest tests/unit/benchmark/test_replay_profile.py (pass)
- pytest tests/unit/data/deserializers/test_trace_synthetic.py (pass)
Added tests: scheduling accuracy, boundary conditions, malformed trace handling, empty trace cases, max_requests truncation
To try it quickly in practice, use the Colab notebook.
Next Steps (this PR)
Out of Scope (future PRs or not)
Use of AI