Add guidellm-based performance benchmarking utilities by anmarques · Pull Request #434 · vllm-project/speculators

anmarques · 2026-04-17T22:23:43Z

Adds a complete perf-benchmark pipeline under examples/evaluate/ that:

Estimates output token distribution via guidellm throughput mode and derives per-subset max_tokens (first power-of-2 >= median)
Runs guidellm sweep benchmarks with the derived max_tokens cap
Parses sweep results into CSV (rps, latency, ITL, TTFT, output TPS)
Provides multi-version comparison plots (plot_perf_compare.py) and pairwise speedup visualizations with bwr colormap (plot_perf_speedup.py)
Supports configurable sampling params (--gen-kwargs) and data column mapping (--data-column-mapper) for different datasets

I have filled in:

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan/results, such as providing test command and pasting the results.
(Optional) The necessary documentation update.
I (a human) have written or reviewed the code in this pr to the best of my ability.

Summary by CodeRabbit

Release Notes

New Features
- Added comprehensive performance benchmarking suite with automated pipeline orchestration for profiling LLM throughput and latency metrics
- Added visualization tools for performance comparisons across versions and speedup analysis
- Added support for collecting and analyzing speculative decoding metrics from vLLM servers
Documentation
- Added detailed README documenting end-to-end benchmarking workflow, CLI options, CSV output schema, and usage examples

coderabbitai · 2026-04-17T22:23:59Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a5d774cb-910a-4ee1-bd87-8ea1b463348e

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

Walkthrough

Adds a complete end-to-end performance benchmarking pipeline for vLLM using GuideLLM: orchestration scripts, generation-length estimation, max-tokens computation, sweep runs, parsing/aggregation (including optional vLLM metrics), and plotting utilities producing CSVs and PNG visualizations.

Changes

Cohort / File(s)	Summary
Documentation & Dependencies `examples/evaluate/perf-benchmark/README.md`, `examples/evaluate/perf-benchmark/requirements.txt`	New README describing the benchmarking workflow, CLI options, outputs and plotting tools; new requirements file listing vllm, guidellm, matplotlib, numpy, scipy.
Top-level Orchestrator `examples/evaluate/perf-benchmark/run_perf_benchmark.sh`	New four-stage Bash orchestrator to run gen-length profiling, compute max_tokens, execute sweeps per subset, and aggregate results into a timestamped directory and CSV.
Generation-length Estimation `examples/evaluate/perf-benchmark/scripts/run_gen_len_estimation.sh`, `examples/evaluate/perf-benchmark/scripts/parse_gen_len.py`	Throughput-mode wrapper for GuideLLM and parser that extracts output token distributions, computes medians and next power-of-two `max_tokens`, and emits per-subset JSON and a consolidated `max_tokens.json`.
Sweep Runner `examples/evaluate/perf-benchmark/scripts/run_sweep.sh`	Bash wrapper to run GuideLLM sweep profiles per subset with injected `max_tokens` via backend args; handles CLI validation and output routing.
Sweep Parsing & Aggregation `examples/evaluate/perf-benchmark/scripts/parse_sweep_results.py`, `examples/evaluate/perf-benchmark/scripts/parse_sweep_with_metrics.py`	Parsers that load sweep JSONs, filter out throughput strategies, extract median metrics into CSV; extended parser can fetch/parse vLLM Prometheus metrics, compute speculative-decoding aggregates, optionally subtract a baseline, and append those columns to the CSV.
vLLM Metrics Utilities `examples/evaluate/perf-benchmark/scripts/fetch_and_save_vllm_metrics.sh`, `examples/evaluate/perf-benchmark/scripts/fetch_vllm_metrics.py`	Shell and Python utilities to fetch `/metrics` from vLLM, parse Prometheus text for counters and per-position vectors, and emit aggregated speculative-decoding metrics as JSON or text.
Shared Perf Utilities `examples/evaluate/perf-benchmark/scripts/perf_utils.py`	Shared Python utilities: metric metadata, CSV/JSON loaders, label/path parsing, isotonic smoothing (PAV), PCHIP interpolation, and pretty subset naming for plotting.
Plotting Tools `examples/evaluate/perf-benchmark/scripts/plot_perf_compare.py`, `examples/evaluate/perf-benchmark/scripts/plot_perf_speedup.py`	Matplotlib scripts to produce per-subset comparison plots across versions and speedup visualizations (interpolated curves, gradient fills, colorbars) using smoothed curves from perf_utils.

Sequence Diagram(s)

sequenceDiagram
    participant Orch as run_perf_benchmark.sh
    participant GenLen as run_gen_len_estimation.sh
    participant ParseGL as parse_gen_len.py
    participant Sweep as run_sweep.sh
    participant ParseSweep as parse_sweep_with_metrics.py
    participant GuideLLM as GuideLLM
    participant vLLM as vLLM(/metrics) optional

    Orch->>GenLen: Step 1: run throughput profile per subset
    GenLen->>GuideLLM: benchmark --profile throughput
    GuideLLM-->>GenLen: gen_len_*.json
    GenLen-->>Orch: per-subset gen_len JSONs

    Orch->>ParseGL: Step 2: compute max_tokens
    ParseGL->>ParseGL: extract medians -> next_power_of_two
    ParseGL-->>Orch: max_tokens.json

    Orch->>Sweep: Step 3: run sweeps with max_tokens
    Sweep->>GuideLLM: benchmark --profile sweep (per-subset)
    GuideLLM-->>Sweep: sweep_*.json
    Sweep-->>Orch: per-subset sweep JSONs

    Orch->>ParseSweep: Step 4: parse sweeps (+ optional vLLM metrics)
    ParseSweep->>vLLM: (optional) GET /metrics
    vLLM-->>ParseSweep: Prometheus text
    ParseSweep->>ParseSweep: extract medians, spec-decode aggregates
    ParseSweep-->>Orch: perf_results.csv (+ augmented CSVs)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested labels

two-reviews

Suggested reviewers

fynnsu
dsikka
shanjiaz

"I hopped through logs and JSON trees,
I nudged the tokens, sampled with my knees.
From medians to sweeps, I sorted and spun —
A carrot-sized CSV and plots for fun.
— 🐰"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 71.79% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Add guidellm-based performance benchmarking utilities' accurately and clearly describes the main change: introducing a complete performance benchmarking pipeline with utilities for estimation, sweeping, result parsing, and visualization using guidellm.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch anmarques/add-perf-benchmark-utilities

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

mergify · 2026-04-17T22:24:20Z

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (5)

examples/evaluate/perf-benchmark/run_perf_benchmark.sh (1)
168-169: Trim whitespace when splitting --subsets to prevent accidental invalid subset names.

A value like --subsets "HumanEval, qa" currently produces " qa" as a subset entry.
💡 Suggested fix
-IFS=',' read -ra SUBSET_ARRAY <<< "${SUBSETS}"
+IFS=',' read -ra RAW_SUBSETS <<< "${SUBSETS}"
+SUBSET_ARRAY=()
+for s in "${RAW_SUBSETS[@]}"; do
+    s="${s#"${s%%[![:space:]]*}"}"
+    s="${s%"${s##*[![:space:]]}"}"
+    [[ -n "${s}" ]] && SUBSET_ARRAY+=("${s}")
+done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/evaluate/perf-benchmark/run_perf_benchmark.sh` around lines 168 -
169, Split using IFS=',' read -ra SUBSET_ARRAY <<< "${SUBSETS}" currently leaves
leading/trailing spaces in entries (e.g., " qa"); after that split, iterate over
SUBSET_ARRAY and trim whitespace from each element (replace each item with its
trimmed version) so array entries contain no surrounding spaces before further
use; update the code around the IFS=',' read -ra SUBSET_ARRAY line to perform
this trim (use bash parameter expansion or a simple trim helper) on SUBSETS/
SUBSET_ARRAY.
examples/evaluate/perf-benchmark/scripts/plot_perf_compare.py (1)
76-76: Trim subset names when parsing --subsets to avoid accidental filter misses.

Line 76 currently keeps whitespace, so --subsets "HumanEval, qa" won’t match qa.
💡 Suggested fix
-    subset_filter = set(args.subsets.split(",")) if args.subsets else None
+    subset_filter = (
+        {s.strip() for s in args.subsets.split(",") if s.strip()}
+        if args.subsets
+        else None
+    )
As per coding guidelines examples/**/*.py: "Review for clarity, correctness, and educational value. Ensure examples are end-to-end runnable, configurations match current API, and comments explain speculative decoding-specific concepts for users new to the algorithm."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/evaluate/perf-benchmark/scripts/plot_perf_compare.py` at line 76,
The subset parsing keeps whitespace so values like " qa" won't match; update the
construction of subset_filter (from args.subsets) to split on "," then strip
whitespace from each token and ignore empty tokens before making the set (i.e.,
map str.strip and filter out falsy tokens when building subset_filter in the
code that reads args.subsets); ensure this logic is applied where subset_filter
is defined so comparisons against subset names (used elsewhere in the script)
succeed.
examples/evaluate/perf-benchmark/scripts/plot_perf_speedup.py (1)
144-148: Move import to module level.

The scipy.interpolate.PchipInterpolator import is placed inside the loop, causing it to be re-executed for each subset. Move it to the top of the file with other imports for idiomatic Python style and marginally better performance.
♻️ Proposed fix

Add to imports at top of file (after line 26):
from scipy.interpolate import PchipInterpolator
Then update lines 144-148:
             # Interpolate both onto common dense grid
-            from scipy.interpolate import PchipInterpolator
-
             b_interp = PchipInterpolator(b_x_smooth, b_y_smooth)
             t_interp = PchipInterpolator(t_x_smooth, t_y_smooth)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/evaluate/perf-benchmark/scripts/plot_perf_speedup.py` around lines
144 - 148, The PchipInterpolator import is inside the loop creating b_interp and
t_interp (using b_x_smooth, b_y_smooth, t_x_smooth, t_y_smooth); move the import
"from scipy.interpolate import PchipInterpolator" to the top-level imports with
the other module imports and remove the local import from the loop so the
PchipInterpolator symbol is imported once at module load and reused when
creating b_interp and t_interp.
examples/evaluate/perf-benchmark/scripts/perf_utils.py (2)
76-77: Consider specifying explicit encoding.

For cross-platform portability, it's good practice to specify encoding="utf-8" explicitly when opening text files.
♻️ Proposed fix
-    with open(filepath, newline="") as f:
+    with open(filepath, newline="", encoding="utf-8") as f:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/evaluate/perf-benchmark/scripts/perf_utils.py` around lines 76 - 77,
Open the CSV using an explicit UTF-8 encoding to ensure cross-platform
portability: update the open(...) call that currently uses filepath to include
encoding="utf-8" (the block that creates reader = csv.DictReader(f)) so the file
is opened with UTF-8 decoding when constructing reader.
225-226: Consider using itertools.pairwise() for clarity.

Per static analysis, itertools.pairwise() (Python 3.10+) is the idiomatic way to iterate successive pairs. While the current zip() is functionally correct here, pairwise() better expresses the intent.
♻️ Proposed fix

Add import at top:
from itertools import pairwise
Then update line 225:
-    for lo, hi in zip(edges[:-1], edges[1:]):
+    for lo, hi in pairwise(edges):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/evaluate/perf-benchmark/scripts/perf_utils.py` around lines 225 -
226, Replace the manual zip over edges with itertools.pairwise for clarity: add
"from itertools import pairwise" at the top and change the loop "for lo, hi in
zip(edges[:-1], edges[1:]):" to "for lo, hi in pairwise(edges):" (keep the
existing mask logic that uses x, lo, hi and the hi < edges[-1] conditional to
handle the inclusive upper bound). This updates the loop in perf_utils.py where
edges, x, and mask are used.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/evaluate/perf-benchmark/README.md`:
- Around line 38-58: The README's fenced code blocks are missing language tags
causing MD040 lint errors; update the usage block that begins with
"./run_perf_benchmark.sh --target URL [OPTIONS]" and the output tree block that
begins with "perf_results_YYYYMMDD_HHMMSS/" (and the similar block around the
later section that starts with the perf results path) to use explicit language
identifiers (e.g., replace ``` with ```text) so all three fenced blocks include
a language tag.

In `@examples/evaluate/perf-benchmark/requirements.txt`:
- Around line 2-3: Update the pinned vLLM version in the requirements entry from
vllm==0.12.0 to vllm==0.19.0 (or a later patch, e.g. vllm>=0.19.0) to remediate
multiple security advisories; locate the vllm line in the requirements file (the
line containing "vllm==0.12.0") and change it accordingly, then verify that
guidellm==0.6.0 remains compatible and run your dependency install/tests to
confirm no breakage.

In `@examples/evaluate/perf-benchmark/run_perf_benchmark.sh`:
- Line 230: The current MAX_TOKENS assignment injects the shell variable
${subset} directly into the Python snippet (unsafe); change the invocation used
to compute MAX_TOKENS so Python reads the JSON file and subset name from argv
instead of embedding ${subset} into source. Concretely, update the MAX_TOKENS
assignment (the shell variable set where MAX_TOKENS is computed) to call python
with the MAX_TOKENS_FILE and the subset as separate arguments (quote the shell
variables when passing them) and have Python access sys.argv to index the JSON
by the passed subset name; this avoids interpolation and prevents injection or
quoting errors.

In `@examples/evaluate/perf-benchmark/scripts/parse_gen_len.py`:
- Around line 54-56: The code computes median = statistics.median(output_tokens)
from output_tokens = [r["output_tokens"] for r in successful] but will raise
TypeError if any r["output_tokens"] is non-numeric; update the logic that builds
and consumes output_tokens (the variables output_tokens, successful, median, and
next_power_of_two) to filter or coerce only valid numeric values (e.g., skip
None/strings, attempt int() where appropriate) and handle errors when computing
the median by catching TypeError/ValueError, logging a warning about the
malformed entries, and continuing processing using the median of the remaining
valid tokens (or a sensible default) so a single bad file doesn't abort the run.

In `@examples/evaluate/perf-benchmark/scripts/parse_sweep_results.py`:
- Around line 80-84: The CSV extraction uses successful.get("median", "") for
every metric (in parse_sweep_results.py loop over METRICS_TO_EXTRACT), which
misses RPS when only "mean" is present; update the assignment for the
requests_per_second metric (identify by metric_key "requests_per_second" or the
RPS csv_key) to try successful.get("median") first and fall back to
successful.get("mean") before defaulting to "" so row[csv_key] =
successful.get("median") or successful.get("mean") or "".

In `@examples/evaluate/perf-benchmark/scripts/perf_utils.py`:
- Around line 119-122: The function _extract_subset_from_json blindly indexes
nested keys and arrays; add defensive checks to validate that data is a dict,
contains "args" -> "data_args" which is a non-empty list, and that the first
element has "data_files" (or handle missing/empty) before using Path(...).stem;
on invalid structure raise a clear ValueError (or return a sensible default)
with a message including the offending JSON snippet so callers can diagnose
malformed input. Ensure you reference and modify _extract_subset_from_json to
perform the type/length checks and to import/use Path consistently.

---

Nitpick comments:
In `@examples/evaluate/perf-benchmark/run_perf_benchmark.sh`:
- Around line 168-169: Split using IFS=',' read -ra SUBSET_ARRAY <<<
"${SUBSETS}" currently leaves leading/trailing spaces in entries (e.g., " qa");
after that split, iterate over SUBSET_ARRAY and trim whitespace from each
element (replace each item with its trimmed version) so array entries contain no
surrounding spaces before further use; update the code around the IFS=',' read
-ra SUBSET_ARRAY line to perform this trim (use bash parameter expansion or a
simple trim helper) on SUBSETS/ SUBSET_ARRAY.

In `@examples/evaluate/perf-benchmark/scripts/perf_utils.py`:
- Around line 76-77: Open the CSV using an explicit UTF-8 encoding to ensure
cross-platform portability: update the open(...) call that currently uses
filepath to include encoding="utf-8" (the block that creates reader =
csv.DictReader(f)) so the file is opened with UTF-8 decoding when constructing
reader.
- Around line 225-226: Replace the manual zip over edges with itertools.pairwise
for clarity: add "from itertools import pairwise" at the top and change the loop
"for lo, hi in zip(edges[:-1], edges[1:]):" to "for lo, hi in pairwise(edges):"
(keep the existing mask logic that uses x, lo, hi and the hi < edges[-1]
conditional to handle the inclusive upper bound). This updates the loop in
perf_utils.py where edges, x, and mask are used.

In `@examples/evaluate/perf-benchmark/scripts/plot_perf_compare.py`:
- Line 76: The subset parsing keeps whitespace so values like " qa" won't match;
update the construction of subset_filter (from args.subsets) to split on ","
then strip whitespace from each token and ignore empty tokens before making the
set (i.e., map str.strip and filter out falsy tokens when building subset_filter
in the code that reads args.subsets); ensure this logic is applied where
subset_filter is defined so comparisons against subset names (used elsewhere in
the script) succeed.

In `@examples/evaluate/perf-benchmark/scripts/plot_perf_speedup.py`:
- Around line 144-148: The PchipInterpolator import is inside the loop creating
b_interp and t_interp (using b_x_smooth, b_y_smooth, t_x_smooth, t_y_smooth);
move the import "from scipy.interpolate import PchipInterpolator" to the
top-level imports with the other module imports and remove the local import from
the loop so the PchipInterpolator symbol is imported once at module load and
reused when creating b_interp and t_interp.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 55e58599-958e-4bf5-a7cb-cb9bfe169bf7

📥 Commits

Reviewing files that changed from the base of the PR and between e7fc75e and 3ad36a7.

📒 Files selected for processing (10)

examples/evaluate/perf-benchmark/README.md
examples/evaluate/perf-benchmark/requirements.txt
examples/evaluate/perf-benchmark/run_perf_benchmark.sh
examples/evaluate/perf-benchmark/scripts/parse_gen_len.py
examples/evaluate/perf-benchmark/scripts/parse_sweep_results.py
examples/evaluate/perf-benchmark/scripts/perf_utils.py
examples/evaluate/perf-benchmark/scripts/plot_perf_compare.py
examples/evaluate/perf-benchmark/scripts/plot_perf_speedup.py
examples/evaluate/perf-benchmark/scripts/run_gen_len_estimation.sh
examples/evaluate/perf-benchmark/scripts/run_sweep.sh

dsikka

I think it would also be useful to consolidate these with the existing scripts so that we have one set of go-to benchmarking utils cc @rahul-tuli

mergify · 2026-04-28T21:53:22Z

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

mergify · 2026-04-30T16:12:27Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @anmarques.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Adds a complete perf-benchmark pipeline under examples/evaluate/ that: - Estimates output token distribution via guidellm throughput mode and derives per-subset max_tokens (first power-of-2 >= median) - Runs guidellm sweep benchmarks with the derived max_tokens cap - Parses sweep results into CSV (rps, latency, ITL, TTFT, output TPS) - Provides multi-version comparison plots (plot_perf_compare.py) and pairwise speedup visualizations with bwr colormap (plot_perf_speedup.py) - Supports configurable sampling params (--gen-kwargs) and data column mapping (--data-column-mapper) for different datasets Made-with: Cursor Signed-off-by: Alexandre Marques <almarque@redhat.com>

- Replace open() with Path.open() (PTH123) - Add noqa: T201 for print statements in CLI scripts - Break long lines to stay within 88-char limit (E501) - Fix import ordering for isort compliance (I001) - Use Path() instead of Path(".") (PTH201) - Move scipy import to module level (PLC0415) - Replace magic value with _MIN_POINTS_FOR_INTERP constant (PLR2004) - Add strict=False to zip() call (B905) - Remove commented-out code (ERA001) - Refactor main() into smaller helpers to reduce complexity (C901, PLR0915, PLR0912) Made-with: Cursor Signed-off-by: Alexandre Marques <almarque@redhat.com>

Made-with: Cursor Signed-off-by: Alexandre Marques <almarque@redhat.com>

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>

coderabbitai

Actionable comments posted: 6

♻️ Duplicate comments (6)

examples/evaluate/perf-benchmark/scripts/parse_gen_len.py (1)

54-56: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard malformed output_tokens before computing the median.

A single non-numeric output_tokens value will still raise TypeError here and abort the whole batch. Filter/coerce valid numeric values first, and make sure the caller treats that failure as a skipped file rather than a hard stop.

🛠️ Suggested fix

-    output_tokens = [r["output_tokens"] for r in successful]
-    median = statistics.median(output_tokens)
+    output_tokens = []
+    for req in successful:
+        value = req.get("output_tokens")
+        if isinstance(value, (int, float)):
+            output_tokens.append(value)
+    if not output_tokens:
+        raise ValueError(f"No numeric output_tokens found in {filepath}")
+    median = statistics.median(output_tokens)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@examples/evaluate/perf-benchmark/scripts/parse_gen_len.py` around lines 54 -
56, Filter and coerce numeric output_tokens from the successful list before
computing the median: replace the direct list comprehension that builds
output_tokens with logic that iterates over successful, attempts to coerce each
r["output_tokens"] to an int (or float) and only keeps valid numbers, logging or
marking any malformed entries as skipped; then compute median on the cleaned
list and call next_power_of_two(median) only if the cleaned list is non-empty
(otherwise treat the file/batch as skipped/failure rather than raising). Ensure
you reference the variables output_tokens, successful, median, max_tokens and
the next_power_of_two call when making these changes so the guard is applied in
the right spot.

examples/evaluate/perf-benchmark/README.md (1)

29-49: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language tags to the two raw fenced blocks.

The usage and output-tree fences still trip MD040. Tag them as text so the README lint passes.

✏️ Suggested fix

-```
+```text
 ./run_perf_benchmark.sh --target URL [OPTIONS]
 ...
-```
+```

-```
+```text
 perf_results_YYYYMMDD_HHMMSS/
 ...
-```
+```

Also applies to: 115-127

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@examples/evaluate/perf-benchmark/README.md` around lines 29 - 49, Update the
two raw fenced code blocks in README.md (the usage block starting with
"./run_perf_benchmark.sh --target URL [OPTIONS]" and the output-tree block
starting with "perf_results_YYYYMMDD_HHMMSS/") to include the language tag
"text" (i.e., change the opening fences from ``` to ```text) so the Markdown
linter MD040 stops flagging them; the same change should be applied to the
duplicate block around lines 115-127.

examples/evaluate/perf-benchmark/requirements.txt (1)

2-3: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Bump the pinned vLLM release.

vllm==0.12.0 is still pinned here, and that version is covered by multiple critical advisories. Please move this example to a patched release and re-verify the benchmark workflow.
🔧 Suggested fix
-vllm==0.12.0
+vllm>=0.19.0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/evaluate/perf-benchmark/requirements.txt` around lines 2 - 3, Update
the pinned vLLM version in requirements.txt: replace the "vllm==0.12.0" entry
with a patched release (e.g., the latest secure vllm release) and leave
"guidellm==0.6.0" unchanged; after updating, re-run the perf-benchmark workflow
and verify it completes successfully to ensure compatibility with the new vllm
version.

examples/evaluate/perf-benchmark/scripts/parse_sweep_results.py (1)

80-83: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep requests_per_second aligned with GuideLLM's JSON schema.

requests_per_second is the one metric that may only expose successful.mean here; using median unconditionally leaves blank RPS values in the CSV and can cascade into the plotting scripts. Please restore the mean fallback already used in perf_utils.py.

🛠️ Suggested fix

         for metric_key, csv_key in METRICS_TO_EXTRACT:
             metric_data = metrics.get(metric_key, {})
             successful = metric_data.get("successful", {})
-            row[csv_key] = successful.get("median", "")
+            if metric_key == "requests_per_second":
+                row[csv_key] = successful.get("median", successful.get("mean", ""))
+            else:
+                row[csv_key] = successful.get("median", "")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@examples/evaluate/perf-benchmark/scripts/parse_sweep_results.py` around lines
80 - 83, The CSV writer loop in parse_sweep_results.py iterates
METRICS_TO_EXTRACT and unconditionally writes successful.get("median"), which
leaves requests_per_second empty when only successful.mean exists; update the
assignment so that for the metric_key "requests_per_second" (or generally when
median is missing) you fall back to successful.get("mean", "")—i.e., in the loop
that references metric_key/csv_key and metric_data/successful, first try
successful.get("median") and if falsy use successful.get("mean") before writing
row[csv_key], matching the fallback logic used in perf_utils.py.

examples/evaluate/perf-benchmark/run_perf_benchmark.sh (1)

225-225: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid interpolating ${subset} into Python source

Line 225 embeds user-provided subset text directly into Python code; quote-containing subset names break parsing and can lead to code-injection-style behavior.

Suggested fix

-    MAX_TOKENS=$(python -c "import json; print(json.load(open('${SUBSET_MAX_TOKENS_FILE}'))['${subset}'])")
+    MAX_TOKENS=$(
+        python - "${SUBSET_MAX_TOKENS_FILE}" "${subset}" <<'PY'
+import json
+import sys
+
+path, subset = sys.argv[1], sys.argv[2]
+with open(path) as f:
+    print(json.load(f)[subset])
+PY
+    )

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@examples/evaluate/perf-benchmark/run_perf_benchmark.sh` at line 225, The
shell embeds the untrusted ${subset} into the Python source causing
quoting/injection risks; change the MAX_TOKENS command to pass the filename and
subset as separate argv parameters (or use jq) and have Python read
sys.argv[1]/sys.argv[2] before indexing. For example, replace the current line
that uses json.load(open('${SUBSET_MAX_TOKENS_FILE}'))['${subset}'] with a call
that passes "$SUBSET_MAX_TOKENS_FILE" and "$subset" to python -c and in the
Python snippet use sys.argv to load the file and index the dict (or use jq -r
--arg subset "$subset" '.[$subset]' "$SUBSET_MAX_TOKENS_FILE"). Ensure you
reference the existing SUBSET_MAX_TOKENS_FILE and subset variables when making
the change.

examples/evaluate/perf-benchmark/scripts/perf_utils.py (1)

127-130: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Harden subset extraction for malformed JSON structures

Direct nested indexing here raises unhelpful exceptions and can crash consumers when schema differs.

Suggested fix

 def _extract_subset_from_json(data: dict) -> str:
     """Extract subset name from the JSON content's data_args field."""
-    data_files = data["args"]["data_args"][0]["data_files"]
-    return Path(data_files).stem
+    try:
+        data_files = data["args"]["data_args"][0]["data_files"]
+        return Path(data_files).stem
+    except (KeyError, TypeError, IndexError) as e:
+        raise ValueError(
+            "Cannot extract subset from JSON: expected args.data_args[0].data_files"
+        ) from e

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@examples/evaluate/perf-benchmark/scripts/perf_utils.py` around lines 127 -
130, The _extract_subset_from_json function currently assumes
data["args"]["data_args"][0]["data_files"] exists and will raise unhelpful
exceptions for malformed inputs; update _extract_subset_from_json to defensively
validate types and presence of keys (data is dict, "args" in data and is dict,
"data_args" is a non-empty list, first element contains "data_files" as a
string), handle missing or unexpected shapes by returning a sensible default or
raising a clear ValueError with context, and only then use Path(...).stem to
extract the subset; reference the function name _extract_subset_from_json and
the "args"/"data_args"/"data_files" fields when locating the change.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/evaluate/perf-benchmark/run_perf_benchmark.sh`:
- Around line 226-227: The script currently builds JSON by concatenating strings
into ALL_MAX_TOKENS using constructs like ALL_MAX_TOKENS+=("\"${subset}\":
${MAX_TOKENS}"), which can produce invalid JSON when subset contains quotes or
backslashes; change the approach to collect structured key/value pairs (e.g.,
accumulate entries in a Bash array or temporary newline-separated file) and then
call a small Python one-liner or script to safely serialize the mapping to JSON
(use the same variable names subset and MAX_TOKENS and replace the unsafe
ALL_MAX_TOKENS concatenation and echo lines at both places like 226 and 295-297
with the safe serialization step).
- Around line 262-270: The shell wrapper run_perf_benchmark.sh is invoking
parse_sweep_with_metrics.py with the wrong CLI flag (--vllm-url) while the
parser expects --url; update the invocation in run_perf_benchmark.sh (the
PARSE_ARGS array and where it's passed to python) to use --url instead of
--vllm-url, or alternatively update the argument parser in
examples/evaluate/perf-benchmark/scripts/parse_sweep_with_metrics.py to accept
--vllm-url (e.g., add an alias for --url) so the flag names are consistent
between run_perf_benchmark.sh and parse_sweep_with_metrics.py.
- Around line 311-317: The script currently proceeds silently when no subset
CSVs exist; modify the PARTIAL_CSV_FILES handling so that if
${`#PARTIAL_CSV_FILES`[@]} is 0 you print an error (e.g., "[ERROR] No subset CSVs
produced, aborting") and exit with a non‑zero status (exit 1), otherwise perform
the existing combine logic that uses PARTIAL_CSV_FILES and CSV_FILE and echo the
info message; ensure this logic lives around the PARTIAL_CSV_FILES array check
so the script fails fast when no partial CSVs were produced.
- Around line 149-152: The script reads VLLM_URL directly which will cause an
"unbound variable" error when run with set -u; change the check and expansion to
use safe parameter expansion (e.g. use ${VLLM_URL:-} or test ${VLLM_URL+x}) so
the if condition becomes if [[ -z "${VLLM_URL:-}" ]] and then assign VLLM_URL
from TARGET as currently done (VLLM_URL="${TARGET%/}" and
VLLM_URL="${VLLM_URL%/v1}"), ensuring all references to VLLM_URL use the safe
expansion to avoid termination under set -u.

In `@examples/evaluate/perf-benchmark/scripts/perf_utils.py`:
- Around line 117-119: The JSON loader (_load_json) is using
requests_per_second.successful.mean while the CSV loader (_load_csv) uses
rps_median, causing inconsistent RPS values; update the code that sets rps in
_load_json to read
bench["metrics"]["requests_per_second"]["successful"]["median"] (instead of
"mean") so both loaders use the same RPS statistic (rps_median) when appending
points, ensuring comparable outputs between CSV and JSON sources.

In `@examples/evaluate/perf-benchmark/scripts/plot_perf_speedup.py`:
- Around line 158-167: The smoothed arrays returned by smooth_curve
(b_x_smooth/b_y_smooth and t_x_smooth/t_y_smooth) may contain fewer than two
unique points after binning/deduplication, which makes creating
PchipInterpolator fail; before instantiating PchipInterpolator on b_interp and
t_interp, add a guard that checks len(b_x_smooth) >= 2 and len(b_y_smooth) >= 2
and len(t_x_smooth) >= 2 and len(t_y_smooth) >= 2 (or equivalently check unique
point counts) and if any are too sparse, return None or skip processing this
subset so the plotting continues gracefully. Ensure you reference the variables
b_x_smooth, b_y_smooth, t_x_smooth, t_y_smooth and the PchipInterpolator
creation sites when making the change.

---

Duplicate comments:
In `@examples/evaluate/perf-benchmark/README.md`:
- Around line 29-49: Update the two raw fenced code blocks in README.md (the
usage block starting with "./run_perf_benchmark.sh --target URL [OPTIONS]" and
the output-tree block starting with "perf_results_YYYYMMDD_HHMMSS/") to include
the language tag "text" (i.e., change the opening fences from ``` to ```text) so
the Markdown linter MD040 stops flagging them; the same change should be applied
to the duplicate block around lines 115-127.

In `@examples/evaluate/perf-benchmark/requirements.txt`:
- Around line 2-3: Update the pinned vLLM version in requirements.txt: replace
the "vllm==0.12.0" entry with a patched release (e.g., the latest secure vllm
release) and leave "guidellm==0.6.0" unchanged; after updating, re-run the
perf-benchmark workflow and verify it completes successfully to ensure
compatibility with the new vllm version.

In `@examples/evaluate/perf-benchmark/run_perf_benchmark.sh`:
- Line 225: The shell embeds the untrusted ${subset} into the Python source
causing quoting/injection risks; change the MAX_TOKENS command to pass the
filename and subset as separate argv parameters (or use jq) and have Python read
sys.argv[1]/sys.argv[2] before indexing. For example, replace the current line
that uses json.load(open('${SUBSET_MAX_TOKENS_FILE}'))['${subset}'] with a call
that passes "$SUBSET_MAX_TOKENS_FILE" and "$subset" to python -c and in the
Python snippet use sys.argv to load the file and index the dict (or use jq -r
--arg subset "$subset" '.[$subset]' "$SUBSET_MAX_TOKENS_FILE"). Ensure you
reference the existing SUBSET_MAX_TOKENS_FILE and subset variables when making
the change.

In `@examples/evaluate/perf-benchmark/scripts/parse_gen_len.py`:
- Around line 54-56: Filter and coerce numeric output_tokens from the successful
list before computing the median: replace the direct list comprehension that
builds output_tokens with logic that iterates over successful, attempts to
coerce each r["output_tokens"] to an int (or float) and only keeps valid
numbers, logging or marking any malformed entries as skipped; then compute
median on the cleaned list and call next_power_of_two(median) only if the
cleaned list is non-empty (otherwise treat the file/batch as skipped/failure
rather than raising). Ensure you reference the variables output_tokens,
successful, median, max_tokens and the next_power_of_two call when making these
changes so the guard is applied in the right spot.

In `@examples/evaluate/perf-benchmark/scripts/parse_sweep_results.py`:
- Around line 80-83: The CSV writer loop in parse_sweep_results.py iterates
METRICS_TO_EXTRACT and unconditionally writes successful.get("median"), which
leaves requests_per_second empty when only successful.mean exists; update the
assignment so that for the metric_key "requests_per_second" (or generally when
median is missing) you fall back to successful.get("mean", "")—i.e., in the loop
that references metric_key/csv_key and metric_data/successful, first try
successful.get("median") and if falsy use successful.get("mean") before writing
row[csv_key], matching the fallback logic used in perf_utils.py.

In `@examples/evaluate/perf-benchmark/scripts/perf_utils.py`:
- Around line 127-130: The _extract_subset_from_json function currently assumes
data["args"]["data_args"][0]["data_files"] exists and will raise unhelpful
exceptions for malformed inputs; update _extract_subset_from_json to defensively
validate types and presence of keys (data is dict, "args" in data and is dict,
"data_args" is a non-empty list, first element contains "data_files" as a
string), handle missing or unexpected shapes by returning a sensible default or
raising a clear ValueError with context, and only then use Path(...).stem to
extract the subset; reference the function name _extract_subset_from_json and
the "args"/"data_args"/"data_files" fields when locating the change.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 99238499-2328-4687-9fcf-85ba8ea7517c

📥 Commits

Reviewing files that changed from the base of the PR and between 3ad36a7 and 0ca4ffc.

📒 Files selected for processing (13)

examples/evaluate/perf-benchmark/README.md
examples/evaluate/perf-benchmark/requirements.txt
examples/evaluate/perf-benchmark/run_perf_benchmark.sh
examples/evaluate/perf-benchmark/scripts/fetch_and_save_vllm_metrics.sh
examples/evaluate/perf-benchmark/scripts/fetch_vllm_metrics.py
examples/evaluate/perf-benchmark/scripts/parse_gen_len.py
examples/evaluate/perf-benchmark/scripts/parse_sweep_results.py
examples/evaluate/perf-benchmark/scripts/parse_sweep_with_metrics.py
examples/evaluate/perf-benchmark/scripts/perf_utils.py
examples/evaluate/perf-benchmark/scripts/plot_perf_compare.py
examples/evaluate/perf-benchmark/scripts/plot_perf_speedup.py
examples/evaluate/perf-benchmark/scripts/run_gen_len_estimation.sh
examples/evaluate/perf-benchmark/scripts/run_sweep.sh

✅ Files skipped from review due to trivial changes (2)

examples/evaluate/perf-benchmark/scripts/run_gen_len_estimation.sh
examples/evaluate/perf-benchmark/scripts/fetch_and_save_vllm_metrics.sh

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>

mergify · 2026-05-05T17:08:43Z

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>

rahul-tuli

Thanks @shanjiaz, lgtm!

fynnsu · 2026-05-04T19:47:58Z

+
+## Quick Start
+
+See the example script at [`examples/evaluate/example_humaneval_qwen3_8b_dflash.sh`](../../../examples/evaluate/example_humaneval_qwen3_8b_dflash.sh) for a complete end-to-end example that:


Bad link. Maybe just link to where the file will be once merged.

fynnsu · 2026-05-04T19:52:55Z

+
+For each subset, runs `guidellm benchmark --profile throughput` to generate responses for all prompts and record output token counts.
+
+Script: `run_perf_benchmark.sh`


I'm a little confused by these "Script" items in the pipeline steps. Aren't all of these steps run when you call run_perf_benchmark.sh?

fynnsu · 2026-05-04T19:56:23Z

+
+Parses all sweep JSONs and extracts median metrics for each sweep point. Throughput-mode entries are excluded (they represent max-load saturation).
+
+When `VLLM_URL` environment variable is set, the pipeline also:


Little bit confused by this VLLM_URL env variable. Aren't we providing --target ... to the run_perf_benchmark.sh ? Isn't that kind of the vllm url?

Do we need VLLM_URL to be passed as well?

anmarques · 2026-05-13T20:09:27Z

+    PARSE_ARGS=(--acceptance-only --current-metrics "${CURRENT_FILE}" --baseline-metrics "${BASELINE_FILE}")
+    [[ -n "${OUTPUT_DIR}" ]] && PARSE_ARGS+=(--output "${OUTPUT_DIR}/acceptance.csv")
+
+    python "${SCRIPT_DIR}/parse_sweep.py" "${PARSE_ARGS[@]}"


The parsing of acceptance rates must happen for each subset individually. This will process acceptance rates after the loop over subsets, which will report the overall acceptance rates.

Thanks for pointing it out! This call is just to create the final csv file. Around line 355 we capture baseline metrics before the current dataset sweep and find diff between two runs to calculate the final result. I have tested the full sweep too, the result seems comparable to when we run each dataset separately. Will keep this in mind as I keep working on refactoring.

anmarques

I think we should expose acceptance rates computation via a different end point, even if we reuse the codebase (it actually has a separate block within run_perf_benchmark so I think it makes sense to keep separate)

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>

mergify Bot added documentation Improvements or additions to documentation quality-failed labels Apr 17, 2026

coderabbitai Bot reviewed Apr 17, 2026

View reviewed changes

anmarques force-pushed the anmarques/add-perf-benchmark-utilities branch from 3ad36a7 to 9185764 Compare April 17, 2026 22:41

mergify Bot removed the quality-failed label Apr 17, 2026

dsikka reviewed Apr 20, 2026

View reviewed changes

rahul-tuli mentioned this pull request Apr 27, 2026

deprecate old eval system #479

Open

4 tasks

mergify Bot added quality-failed and removed quality-failed labels Apr 28, 2026

mergify Bot added the needs-rebase label Apr 30, 2026

anmarques and others added 5 commits April 30, 2026 13:12

Format perf-benchmark README.md with mdformat

96ae692

Made-with: Cursor Signed-off-by: Alexandre Marques <almarque@redhat.com>

Add per-position-request retrieval

fd4edb0

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>

Add unit and e2e test

9665c94

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>

shanjiaz force-pushed the anmarques/add-perf-benchmark-utilities branch from d5a4c01 to 9665c94 Compare April 30, 2026 17:24

mergify Bot removed the needs-rebase label Apr 30, 2026

remove test setup

0ca4ffc

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>

coderabbitai Bot reviewed Apr 30, 2026

View reviewed changes

shanjiaz added 2 commits April 30, 2026 22:21

refactor eval system

858bfa3

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>

Merge branch 'main' into anmarques/add-perf-benchmark-utilities

50197f4

mergify Bot added the quality-failed label May 5, 2026

quality

72d91c0

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>

mergify Bot removed the quality-failed label May 5, 2026

docstring

607e631

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>

add acceptance rate only support

51eace1

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>

rahul-tuli approved these changes May 8, 2026

View reviewed changes

fynnsu reviewed May 13, 2026

View reviewed changes

anmarques commented May 13, 2026

View reviewed changes

anmarques commented May 14, 2026

View reviewed changes

refactor eval system

d8471cf

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>


		## Quick Start

		See the example script at [`examples/evaluate/example_humaneval_qwen3_8b_dflash.sh`](../../../examples/evaluate/example_humaneval_qwen3_8b_dflash.sh) for a complete end-to-end example that:


		For each subset, runs `guidellm benchmark --profile throughput` to generate responses for all prompts and record output token counts.

		Script: `run_perf_benchmark.sh`


		Parses all sweep JSONs and extracts median metrics for each sweep point. Throughput-mode entries are excluded (they represent max-load saturation).

		When `VLLM_URL` environment variable is set, the pipeline also:

Conversation

anmarques commented Apr 17, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

mergify Bot commented Apr 17, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dsikka left a comment

Choose a reason for hiding this comment

Uh oh!

mergify Bot commented Apr 28, 2026

Uh oh!

mergify Bot commented Apr 30, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mergify Bot commented May 5, 2026

Uh oh!

rahul-tuli left a comment

Choose a reason for hiding this comment

Uh oh!

fynnsu May 4, 2026

Choose a reason for hiding this comment

Uh oh!

fynnsu May 4, 2026

Choose a reason for hiding this comment

Uh oh!

fynnsu May 4, 2026

Choose a reason for hiding this comment

Uh oh!

anmarques May 13, 2026

Choose a reason for hiding this comment

Uh oh!

shanjiaz May 14, 2026

Choose a reason for hiding this comment

Uh oh!

anmarques left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

anmarques commented Apr 17, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 17, 2026 •

edited

Loading