Add guidellm-based performance benchmarking utilities#434
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
WalkthroughAdds a complete end-to-end performance benchmarking pipeline for vLLM using GuideLLM: orchestration scripts, generation-length estimation, max-tokens computation, sweep runs, parsing/aggregation (including optional vLLM metrics), and plotting utilities producing CSVs and PNG visualizations. Changes
Sequence Diagram(s)sequenceDiagram
participant Orch as run_perf_benchmark.sh
participant GenLen as run_gen_len_estimation.sh
participant ParseGL as parse_gen_len.py
participant Sweep as run_sweep.sh
participant ParseSweep as parse_sweep_with_metrics.py
participant GuideLLM as GuideLLM
participant vLLM as vLLM(/metrics) optional
Orch->>GenLen: Step 1: run throughput profile per subset
GenLen->>GuideLLM: benchmark --profile throughput
GuideLLM-->>GenLen: gen_len_*.json
GenLen-->>Orch: per-subset gen_len JSONs
Orch->>ParseGL: Step 2: compute max_tokens
ParseGL->>ParseGL: extract medians -> next_power_of_two
ParseGL-->>Orch: max_tokens.json
Orch->>Sweep: Step 3: run sweeps with max_tokens
Sweep->>GuideLLM: benchmark --profile sweep (per-subset)
GuideLLM-->>Sweep: sweep_*.json
Sweep-->>Orch: per-subset sweep JSONs
Orch->>ParseSweep: Step 4: parse sweeps (+ optional vLLM metrics)
ParseSweep->>vLLM: (optional) GET /metrics
vLLM-->>ParseSweep: Prometheus text
ParseSweep->>ParseSweep: extract medians, spec-decode aggregates
ParseSweep-->>Orch: perf_results.csv (+ augmented CSVs)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
The quality checks have failed. Please run |
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (5)
examples/evaluate/perf-benchmark/run_perf_benchmark.sh (1)
168-169: Trim whitespace when splitting--subsetsto prevent accidental invalid subset names.A value like
--subsets "HumanEval, qa"currently produces" qa"as a subset entry.💡 Suggested fix
-IFS=',' read -ra SUBSET_ARRAY <<< "${SUBSETS}" +IFS=',' read -ra RAW_SUBSETS <<< "${SUBSETS}" +SUBSET_ARRAY=() +for s in "${RAW_SUBSETS[@]}"; do + s="${s#"${s%%[![:space:]]*}"}" + s="${s%"${s##*[![:space:]]}"}" + [[ -n "${s}" ]] && SUBSET_ARRAY+=("${s}") +done🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/evaluate/perf-benchmark/run_perf_benchmark.sh` around lines 168 - 169, Split using IFS=',' read -ra SUBSET_ARRAY <<< "${SUBSETS}" currently leaves leading/trailing spaces in entries (e.g., " qa"); after that split, iterate over SUBSET_ARRAY and trim whitespace from each element (replace each item with its trimmed version) so array entries contain no surrounding spaces before further use; update the code around the IFS=',' read -ra SUBSET_ARRAY line to perform this trim (use bash parameter expansion or a simple trim helper) on SUBSETS/ SUBSET_ARRAY.examples/evaluate/perf-benchmark/scripts/plot_perf_compare.py (1)
76-76: Trim subset names when parsing--subsetsto avoid accidental filter misses.Line 76 currently keeps whitespace, so
--subsets "HumanEval, qa"won’t matchqa.💡 Suggested fix
- subset_filter = set(args.subsets.split(",")) if args.subsets else None + subset_filter = ( + {s.strip() for s in args.subsets.split(",") if s.strip()} + if args.subsets + else None + )As per coding guidelines
examples/**/*.py: "Review for clarity, correctness, and educational value. Ensure examples are end-to-end runnable, configurations match current API, and comments explain speculative decoding-specific concepts for users new to the algorithm."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/evaluate/perf-benchmark/scripts/plot_perf_compare.py` at line 76, The subset parsing keeps whitespace so values like " qa" won't match; update the construction of subset_filter (from args.subsets) to split on "," then strip whitespace from each token and ignore empty tokens before making the set (i.e., map str.strip and filter out falsy tokens when building subset_filter in the code that reads args.subsets); ensure this logic is applied where subset_filter is defined so comparisons against subset names (used elsewhere in the script) succeed.examples/evaluate/perf-benchmark/scripts/plot_perf_speedup.py (1)
144-148: Move import to module level.The
scipy.interpolate.PchipInterpolatorimport is placed inside the loop, causing it to be re-executed for each subset. Move it to the top of the file with other imports for idiomatic Python style and marginally better performance.♻️ Proposed fix
Add to imports at top of file (after line 26):
from scipy.interpolate import PchipInterpolatorThen update lines 144-148:
# Interpolate both onto common dense grid - from scipy.interpolate import PchipInterpolator - b_interp = PchipInterpolator(b_x_smooth, b_y_smooth) t_interp = PchipInterpolator(t_x_smooth, t_y_smooth)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/evaluate/perf-benchmark/scripts/plot_perf_speedup.py` around lines 144 - 148, The PchipInterpolator import is inside the loop creating b_interp and t_interp (using b_x_smooth, b_y_smooth, t_x_smooth, t_y_smooth); move the import "from scipy.interpolate import PchipInterpolator" to the top-level imports with the other module imports and remove the local import from the loop so the PchipInterpolator symbol is imported once at module load and reused when creating b_interp and t_interp.examples/evaluate/perf-benchmark/scripts/perf_utils.py (2)
76-77: Consider specifying explicit encoding.For cross-platform portability, it's good practice to specify
encoding="utf-8"explicitly when opening text files.♻️ Proposed fix
- with open(filepath, newline="") as f: + with open(filepath, newline="", encoding="utf-8") as f:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/evaluate/perf-benchmark/scripts/perf_utils.py` around lines 76 - 77, Open the CSV using an explicit UTF-8 encoding to ensure cross-platform portability: update the open(...) call that currently uses filepath to include encoding="utf-8" (the block that creates reader = csv.DictReader(f)) so the file is opened with UTF-8 decoding when constructing reader.
225-226: Consider usingitertools.pairwise()for clarity.Per static analysis,
itertools.pairwise()(Python 3.10+) is the idiomatic way to iterate successive pairs. While the currentzip()is functionally correct here,pairwise()better expresses the intent.♻️ Proposed fix
Add import at top:
from itertools import pairwiseThen update line 225:
- for lo, hi in zip(edges[:-1], edges[1:]): + for lo, hi in pairwise(edges):🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/evaluate/perf-benchmark/scripts/perf_utils.py` around lines 225 - 226, Replace the manual zip over edges with itertools.pairwise for clarity: add "from itertools import pairwise" at the top and change the loop "for lo, hi in zip(edges[:-1], edges[1:]):" to "for lo, hi in pairwise(edges):" (keep the existing mask logic that uses x, lo, hi and the hi < edges[-1] conditional to handle the inclusive upper bound). This updates the loop in perf_utils.py where edges, x, and mask are used.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/evaluate/perf-benchmark/README.md`:
- Around line 38-58: The README's fenced code blocks are missing language tags
causing MD040 lint errors; update the usage block that begins with
"./run_perf_benchmark.sh --target URL [OPTIONS]" and the output tree block that
begins with "perf_results_YYYYMMDD_HHMMSS/" (and the similar block around the
later section that starts with the perf results path) to use explicit language
identifiers (e.g., replace ``` with ```text) so all three fenced blocks include
a language tag.
In `@examples/evaluate/perf-benchmark/requirements.txt`:
- Around line 2-3: Update the pinned vLLM version in the requirements entry from
vllm==0.12.0 to vllm==0.19.0 (or a later patch, e.g. vllm>=0.19.0) to remediate
multiple security advisories; locate the vllm line in the requirements file (the
line containing "vllm==0.12.0") and change it accordingly, then verify that
guidellm==0.6.0 remains compatible and run your dependency install/tests to
confirm no breakage.
In `@examples/evaluate/perf-benchmark/run_perf_benchmark.sh`:
- Line 230: The current MAX_TOKENS assignment injects the shell variable
${subset} directly into the Python snippet (unsafe); change the invocation used
to compute MAX_TOKENS so Python reads the JSON file and subset name from argv
instead of embedding ${subset} into source. Concretely, update the MAX_TOKENS
assignment (the shell variable set where MAX_TOKENS is computed) to call python
with the MAX_TOKENS_FILE and the subset as separate arguments (quote the shell
variables when passing them) and have Python access sys.argv to index the JSON
by the passed subset name; this avoids interpolation and prevents injection or
quoting errors.
In `@examples/evaluate/perf-benchmark/scripts/parse_gen_len.py`:
- Around line 54-56: The code computes median = statistics.median(output_tokens)
from output_tokens = [r["output_tokens"] for r in successful] but will raise
TypeError if any r["output_tokens"] is non-numeric; update the logic that builds
and consumes output_tokens (the variables output_tokens, successful, median, and
next_power_of_two) to filter or coerce only valid numeric values (e.g., skip
None/strings, attempt int() where appropriate) and handle errors when computing
the median by catching TypeError/ValueError, logging a warning about the
malformed entries, and continuing processing using the median of the remaining
valid tokens (or a sensible default) so a single bad file doesn't abort the run.
In `@examples/evaluate/perf-benchmark/scripts/parse_sweep_results.py`:
- Around line 80-84: The CSV extraction uses successful.get("median", "") for
every metric (in parse_sweep_results.py loop over METRICS_TO_EXTRACT), which
misses RPS when only "mean" is present; update the assignment for the
requests_per_second metric (identify by metric_key "requests_per_second" or the
RPS csv_key) to try successful.get("median") first and fall back to
successful.get("mean") before defaulting to "" so row[csv_key] =
successful.get("median") or successful.get("mean") or "".
In `@examples/evaluate/perf-benchmark/scripts/perf_utils.py`:
- Around line 119-122: The function _extract_subset_from_json blindly indexes
nested keys and arrays; add defensive checks to validate that data is a dict,
contains "args" -> "data_args" which is a non-empty list, and that the first
element has "data_files" (or handle missing/empty) before using Path(...).stem;
on invalid structure raise a clear ValueError (or return a sensible default)
with a message including the offending JSON snippet so callers can diagnose
malformed input. Ensure you reference and modify _extract_subset_from_json to
perform the type/length checks and to import/use Path consistently.
---
Nitpick comments:
In `@examples/evaluate/perf-benchmark/run_perf_benchmark.sh`:
- Around line 168-169: Split using IFS=',' read -ra SUBSET_ARRAY <<<
"${SUBSETS}" currently leaves leading/trailing spaces in entries (e.g., " qa");
after that split, iterate over SUBSET_ARRAY and trim whitespace from each
element (replace each item with its trimmed version) so array entries contain no
surrounding spaces before further use; update the code around the IFS=',' read
-ra SUBSET_ARRAY line to perform this trim (use bash parameter expansion or a
simple trim helper) on SUBSETS/ SUBSET_ARRAY.
In `@examples/evaluate/perf-benchmark/scripts/perf_utils.py`:
- Around line 76-77: Open the CSV using an explicit UTF-8 encoding to ensure
cross-platform portability: update the open(...) call that currently uses
filepath to include encoding="utf-8" (the block that creates reader =
csv.DictReader(f)) so the file is opened with UTF-8 decoding when constructing
reader.
- Around line 225-226: Replace the manual zip over edges with itertools.pairwise
for clarity: add "from itertools import pairwise" at the top and change the loop
"for lo, hi in zip(edges[:-1], edges[1:]):" to "for lo, hi in pairwise(edges):"
(keep the existing mask logic that uses x, lo, hi and the hi < edges[-1]
conditional to handle the inclusive upper bound). This updates the loop in
perf_utils.py where edges, x, and mask are used.
In `@examples/evaluate/perf-benchmark/scripts/plot_perf_compare.py`:
- Line 76: The subset parsing keeps whitespace so values like " qa" won't match;
update the construction of subset_filter (from args.subsets) to split on ","
then strip whitespace from each token and ignore empty tokens before making the
set (i.e., map str.strip and filter out falsy tokens when building subset_filter
in the code that reads args.subsets); ensure this logic is applied where
subset_filter is defined so comparisons against subset names (used elsewhere in
the script) succeed.
In `@examples/evaluate/perf-benchmark/scripts/plot_perf_speedup.py`:
- Around line 144-148: The PchipInterpolator import is inside the loop creating
b_interp and t_interp (using b_x_smooth, b_y_smooth, t_x_smooth, t_y_smooth);
move the import "from scipy.interpolate import PchipInterpolator" to the
top-level imports with the other module imports and remove the local import from
the loop so the PchipInterpolator symbol is imported once at module load and
reused when creating b_interp and t_interp.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 55e58599-958e-4bf5-a7cb-cb9bfe169bf7
📒 Files selected for processing (10)
examples/evaluate/perf-benchmark/README.mdexamples/evaluate/perf-benchmark/requirements.txtexamples/evaluate/perf-benchmark/run_perf_benchmark.shexamples/evaluate/perf-benchmark/scripts/parse_gen_len.pyexamples/evaluate/perf-benchmark/scripts/parse_sweep_results.pyexamples/evaluate/perf-benchmark/scripts/perf_utils.pyexamples/evaluate/perf-benchmark/scripts/plot_perf_compare.pyexamples/evaluate/perf-benchmark/scripts/plot_perf_speedup.pyexamples/evaluate/perf-benchmark/scripts/run_gen_len_estimation.shexamples/evaluate/perf-benchmark/scripts/run_sweep.sh
3ad36a7 to
9185764
Compare
dsikka
left a comment
There was a problem hiding this comment.
I think it would also be useful to consolidate these with the existing scripts so that we have one set of go-to benchmarking utils cc @rahul-tuli
|
The quality checks have failed. Please run |
|
This pull request has merge conflicts that must be resolved before it can be |
Adds a complete perf-benchmark pipeline under examples/evaluate/ that: - Estimates output token distribution via guidellm throughput mode and derives per-subset max_tokens (first power-of-2 >= median) - Runs guidellm sweep benchmarks with the derived max_tokens cap - Parses sweep results into CSV (rps, latency, ITL, TTFT, output TPS) - Provides multi-version comparison plots (plot_perf_compare.py) and pairwise speedup visualizations with bwr colormap (plot_perf_speedup.py) - Supports configurable sampling params (--gen-kwargs) and data column mapping (--data-column-mapper) for different datasets Made-with: Cursor Signed-off-by: Alexandre Marques <almarque@redhat.com>
- Replace open() with Path.open() (PTH123)
- Add noqa: T201 for print statements in CLI scripts
- Break long lines to stay within 88-char limit (E501)
- Fix import ordering for isort compliance (I001)
- Use Path() instead of Path(".") (PTH201)
- Move scipy import to module level (PLC0415)
- Replace magic value with _MIN_POINTS_FOR_INTERP constant (PLR2004)
- Add strict=False to zip() call (B905)
- Remove commented-out code (ERA001)
- Refactor main() into smaller helpers to reduce complexity
(C901, PLR0915, PLR0912)
Made-with: Cursor
Signed-off-by: Alexandre Marques <almarque@redhat.com>
Made-with: Cursor Signed-off-by: Alexandre Marques <almarque@redhat.com>
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>
d5a4c01 to
9665c94
Compare
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>
There was a problem hiding this comment.
Actionable comments posted: 6
♻️ Duplicate comments (6)
examples/evaluate/perf-benchmark/scripts/parse_gen_len.py (1)
54-56:⚠️ Potential issue | 🟠 Major | ⚡ Quick winGuard malformed
output_tokensbefore computing the median.A single non-numeric
output_tokensvalue will still raiseTypeErrorhere and abort the whole batch. Filter/coerce valid numeric values first, and make sure the caller treats that failure as a skipped file rather than a hard stop.🛠️ Suggested fix
- output_tokens = [r["output_tokens"] for r in successful] - median = statistics.median(output_tokens) + output_tokens = [] + for req in successful: + value = req.get("output_tokens") + if isinstance(value, (int, float)): + output_tokens.append(value) + if not output_tokens: + raise ValueError(f"No numeric output_tokens found in {filepath}") + median = statistics.median(output_tokens)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/evaluate/perf-benchmark/scripts/parse_gen_len.py` around lines 54 - 56, Filter and coerce numeric output_tokens from the successful list before computing the median: replace the direct list comprehension that builds output_tokens with logic that iterates over successful, attempts to coerce each r["output_tokens"] to an int (or float) and only keeps valid numbers, logging or marking any malformed entries as skipped; then compute median on the cleaned list and call next_power_of_two(median) only if the cleaned list is non-empty (otherwise treat the file/batch as skipped/failure rather than raising). Ensure you reference the variables output_tokens, successful, median, max_tokens and the next_power_of_two call when making these changes so the guard is applied in the right spot.examples/evaluate/perf-benchmark/README.md (1)
29-49:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd language tags to the two raw fenced blocks.
The usage and output-tree fences still trip MD040. Tag them as
textso the README lint passes.✏️ Suggested fix
-``` +```text ./run_perf_benchmark.sh --target URL [OPTIONS] ... -``` +``` -``` +```text perf_results_YYYYMMDD_HHMMSS/ ... -``` +```Also applies to: 115-127
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/evaluate/perf-benchmark/README.md` around lines 29 - 49, Update the two raw fenced code blocks in README.md (the usage block starting with "./run_perf_benchmark.sh --target URL [OPTIONS]" and the output-tree block starting with "perf_results_YYYYMMDD_HHMMSS/") to include the language tag "text" (i.e., change the opening fences from ``` to ```text) so the Markdown linter MD040 stops flagging them; the same change should be applied to the duplicate block around lines 115-127.examples/evaluate/perf-benchmark/requirements.txt (1)
2-3:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winBump the pinned vLLM release.
vllm==0.12.0is still pinned here, and that version is covered by multiple critical advisories. Please move this example to a patched release and re-verify the benchmark workflow.🔧 Suggested fix
-vllm==0.12.0 +vllm>=0.19.0🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/evaluate/perf-benchmark/requirements.txt` around lines 2 - 3, Update the pinned vLLM version in requirements.txt: replace the "vllm==0.12.0" entry with a patched release (e.g., the latest secure vllm release) and leave "guidellm==0.6.0" unchanged; after updating, re-run the perf-benchmark workflow and verify it completes successfully to ensure compatibility with the new vllm version.examples/evaluate/perf-benchmark/scripts/parse_sweep_results.py (1)
80-83:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep
requests_per_secondaligned with GuideLLM's JSON schema.
requests_per_secondis the one metric that may only exposesuccessful.meanhere; usingmedianunconditionally leaves blank RPS values in the CSV and can cascade into the plotting scripts. Please restore the mean fallback already used inperf_utils.py.🛠️ Suggested fix
for metric_key, csv_key in METRICS_TO_EXTRACT: metric_data = metrics.get(metric_key, {}) successful = metric_data.get("successful", {}) - row[csv_key] = successful.get("median", "") + if metric_key == "requests_per_second": + row[csv_key] = successful.get("median", successful.get("mean", "")) + else: + row[csv_key] = successful.get("median", "")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/evaluate/perf-benchmark/scripts/parse_sweep_results.py` around lines 80 - 83, The CSV writer loop in parse_sweep_results.py iterates METRICS_TO_EXTRACT and unconditionally writes successful.get("median"), which leaves requests_per_second empty when only successful.mean exists; update the assignment so that for the metric_key "requests_per_second" (or generally when median is missing) you fall back to successful.get("mean", "")—i.e., in the loop that references metric_key/csv_key and metric_data/successful, first try successful.get("median") and if falsy use successful.get("mean") before writing row[csv_key], matching the fallback logic used in perf_utils.py.examples/evaluate/perf-benchmark/run_perf_benchmark.sh (1)
225-225:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAvoid interpolating
${subset}into Python sourceLine 225 embeds user-provided subset text directly into Python code; quote-containing subset names break parsing and can lead to code-injection-style behavior.
Suggested fix
- MAX_TOKENS=$(python -c "import json; print(json.load(open('${SUBSET_MAX_TOKENS_FILE}'))['${subset}'])") + MAX_TOKENS=$( + python - "${SUBSET_MAX_TOKENS_FILE}" "${subset}" <<'PY' +import json +import sys + +path, subset = sys.argv[1], sys.argv[2] +with open(path) as f: + print(json.load(f)[subset]) +PY + )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/evaluate/perf-benchmark/run_perf_benchmark.sh` at line 225, The shell embeds the untrusted ${subset} into the Python source causing quoting/injection risks; change the MAX_TOKENS command to pass the filename and subset as separate argv parameters (or use jq) and have Python read sys.argv[1]/sys.argv[2] before indexing. For example, replace the current line that uses json.load(open('${SUBSET_MAX_TOKENS_FILE}'))['${subset}'] with a call that passes "$SUBSET_MAX_TOKENS_FILE" and "$subset" to python -c and in the Python snippet use sys.argv to load the file and index the dict (or use jq -r --arg subset "$subset" '.[$subset]' "$SUBSET_MAX_TOKENS_FILE"). Ensure you reference the existing SUBSET_MAX_TOKENS_FILE and subset variables when making the change.examples/evaluate/perf-benchmark/scripts/perf_utils.py (1)
127-130:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHarden subset extraction for malformed JSON structures
Direct nested indexing here raises unhelpful exceptions and can crash consumers when schema differs.
Suggested fix
def _extract_subset_from_json(data: dict) -> str: """Extract subset name from the JSON content's data_args field.""" - data_files = data["args"]["data_args"][0]["data_files"] - return Path(data_files).stem + try: + data_files = data["args"]["data_args"][0]["data_files"] + return Path(data_files).stem + except (KeyError, TypeError, IndexError) as e: + raise ValueError( + "Cannot extract subset from JSON: expected args.data_args[0].data_files" + ) from e🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/evaluate/perf-benchmark/scripts/perf_utils.py` around lines 127 - 130, The _extract_subset_from_json function currently assumes data["args"]["data_args"][0]["data_files"] exists and will raise unhelpful exceptions for malformed inputs; update _extract_subset_from_json to defensively validate types and presence of keys (data is dict, "args" in data and is dict, "data_args" is a non-empty list, first element contains "data_files" as a string), handle missing or unexpected shapes by returning a sensible default or raising a clear ValueError with context, and only then use Path(...).stem to extract the subset; reference the function name _extract_subset_from_json and the "args"/"data_args"/"data_files" fields when locating the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/evaluate/perf-benchmark/run_perf_benchmark.sh`:
- Around line 226-227: The script currently builds JSON by concatenating strings
into ALL_MAX_TOKENS using constructs like ALL_MAX_TOKENS+=("\"${subset}\":
${MAX_TOKENS}"), which can produce invalid JSON when subset contains quotes or
backslashes; change the approach to collect structured key/value pairs (e.g.,
accumulate entries in a Bash array or temporary newline-separated file) and then
call a small Python one-liner or script to safely serialize the mapping to JSON
(use the same variable names subset and MAX_TOKENS and replace the unsafe
ALL_MAX_TOKENS concatenation and echo lines at both places like 226 and 295-297
with the safe serialization step).
- Around line 262-270: The shell wrapper run_perf_benchmark.sh is invoking
parse_sweep_with_metrics.py with the wrong CLI flag (--vllm-url) while the
parser expects --url; update the invocation in run_perf_benchmark.sh (the
PARSE_ARGS array and where it's passed to python) to use --url instead of
--vllm-url, or alternatively update the argument parser in
examples/evaluate/perf-benchmark/scripts/parse_sweep_with_metrics.py to accept
--vllm-url (e.g., add an alias for --url) so the flag names are consistent
between run_perf_benchmark.sh and parse_sweep_with_metrics.py.
- Around line 311-317: The script currently proceeds silently when no subset
CSVs exist; modify the PARTIAL_CSV_FILES handling so that if
${`#PARTIAL_CSV_FILES`[@]} is 0 you print an error (e.g., "[ERROR] No subset CSVs
produced, aborting") and exit with a non‑zero status (exit 1), otherwise perform
the existing combine logic that uses PARTIAL_CSV_FILES and CSV_FILE and echo the
info message; ensure this logic lives around the PARTIAL_CSV_FILES array check
so the script fails fast when no partial CSVs were produced.
- Around line 149-152: The script reads VLLM_URL directly which will cause an
"unbound variable" error when run with set -u; change the check and expansion to
use safe parameter expansion (e.g. use ${VLLM_URL:-} or test ${VLLM_URL+x}) so
the if condition becomes if [[ -z "${VLLM_URL:-}" ]] and then assign VLLM_URL
from TARGET as currently done (VLLM_URL="${TARGET%/}" and
VLLM_URL="${VLLM_URL%/v1}"), ensuring all references to VLLM_URL use the safe
expansion to avoid termination under set -u.
In `@examples/evaluate/perf-benchmark/scripts/perf_utils.py`:
- Around line 117-119: The JSON loader (_load_json) is using
requests_per_second.successful.mean while the CSV loader (_load_csv) uses
rps_median, causing inconsistent RPS values; update the code that sets rps in
_load_json to read
bench["metrics"]["requests_per_second"]["successful"]["median"] (instead of
"mean") so both loaders use the same RPS statistic (rps_median) when appending
points, ensuring comparable outputs between CSV and JSON sources.
In `@examples/evaluate/perf-benchmark/scripts/plot_perf_speedup.py`:
- Around line 158-167: The smoothed arrays returned by smooth_curve
(b_x_smooth/b_y_smooth and t_x_smooth/t_y_smooth) may contain fewer than two
unique points after binning/deduplication, which makes creating
PchipInterpolator fail; before instantiating PchipInterpolator on b_interp and
t_interp, add a guard that checks len(b_x_smooth) >= 2 and len(b_y_smooth) >= 2
and len(t_x_smooth) >= 2 and len(t_y_smooth) >= 2 (or equivalently check unique
point counts) and if any are too sparse, return None or skip processing this
subset so the plotting continues gracefully. Ensure you reference the variables
b_x_smooth, b_y_smooth, t_x_smooth, t_y_smooth and the PchipInterpolator
creation sites when making the change.
---
Duplicate comments:
In `@examples/evaluate/perf-benchmark/README.md`:
- Around line 29-49: Update the two raw fenced code blocks in README.md (the
usage block starting with "./run_perf_benchmark.sh --target URL [OPTIONS]" and
the output-tree block starting with "perf_results_YYYYMMDD_HHMMSS/") to include
the language tag "text" (i.e., change the opening fences from ``` to ```text) so
the Markdown linter MD040 stops flagging them; the same change should be applied
to the duplicate block around lines 115-127.
In `@examples/evaluate/perf-benchmark/requirements.txt`:
- Around line 2-3: Update the pinned vLLM version in requirements.txt: replace
the "vllm==0.12.0" entry with a patched release (e.g., the latest secure vllm
release) and leave "guidellm==0.6.0" unchanged; after updating, re-run the
perf-benchmark workflow and verify it completes successfully to ensure
compatibility with the new vllm version.
In `@examples/evaluate/perf-benchmark/run_perf_benchmark.sh`:
- Line 225: The shell embeds the untrusted ${subset} into the Python source
causing quoting/injection risks; change the MAX_TOKENS command to pass the
filename and subset as separate argv parameters (or use jq) and have Python read
sys.argv[1]/sys.argv[2] before indexing. For example, replace the current line
that uses json.load(open('${SUBSET_MAX_TOKENS_FILE}'))['${subset}'] with a call
that passes "$SUBSET_MAX_TOKENS_FILE" and "$subset" to python -c and in the
Python snippet use sys.argv to load the file and index the dict (or use jq -r
--arg subset "$subset" '.[$subset]' "$SUBSET_MAX_TOKENS_FILE"). Ensure you
reference the existing SUBSET_MAX_TOKENS_FILE and subset variables when making
the change.
In `@examples/evaluate/perf-benchmark/scripts/parse_gen_len.py`:
- Around line 54-56: Filter and coerce numeric output_tokens from the successful
list before computing the median: replace the direct list comprehension that
builds output_tokens with logic that iterates over successful, attempts to
coerce each r["output_tokens"] to an int (or float) and only keeps valid
numbers, logging or marking any malformed entries as skipped; then compute
median on the cleaned list and call next_power_of_two(median) only if the
cleaned list is non-empty (otherwise treat the file/batch as skipped/failure
rather than raising). Ensure you reference the variables output_tokens,
successful, median, max_tokens and the next_power_of_two call when making these
changes so the guard is applied in the right spot.
In `@examples/evaluate/perf-benchmark/scripts/parse_sweep_results.py`:
- Around line 80-83: The CSV writer loop in parse_sweep_results.py iterates
METRICS_TO_EXTRACT and unconditionally writes successful.get("median"), which
leaves requests_per_second empty when only successful.mean exists; update the
assignment so that for the metric_key "requests_per_second" (or generally when
median is missing) you fall back to successful.get("mean", "")—i.e., in the loop
that references metric_key/csv_key and metric_data/successful, first try
successful.get("median") and if falsy use successful.get("mean") before writing
row[csv_key], matching the fallback logic used in perf_utils.py.
In `@examples/evaluate/perf-benchmark/scripts/perf_utils.py`:
- Around line 127-130: The _extract_subset_from_json function currently assumes
data["args"]["data_args"][0]["data_files"] exists and will raise unhelpful
exceptions for malformed inputs; update _extract_subset_from_json to defensively
validate types and presence of keys (data is dict, "args" in data and is dict,
"data_args" is a non-empty list, first element contains "data_files" as a
string), handle missing or unexpected shapes by returning a sensible default or
raising a clear ValueError with context, and only then use Path(...).stem to
extract the subset; reference the function name _extract_subset_from_json and
the "args"/"data_args"/"data_files" fields when locating the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 99238499-2328-4687-9fcf-85ba8ea7517c
📒 Files selected for processing (13)
examples/evaluate/perf-benchmark/README.mdexamples/evaluate/perf-benchmark/requirements.txtexamples/evaluate/perf-benchmark/run_perf_benchmark.shexamples/evaluate/perf-benchmark/scripts/fetch_and_save_vllm_metrics.shexamples/evaluate/perf-benchmark/scripts/fetch_vllm_metrics.pyexamples/evaluate/perf-benchmark/scripts/parse_gen_len.pyexamples/evaluate/perf-benchmark/scripts/parse_sweep_results.pyexamples/evaluate/perf-benchmark/scripts/parse_sweep_with_metrics.pyexamples/evaluate/perf-benchmark/scripts/perf_utils.pyexamples/evaluate/perf-benchmark/scripts/plot_perf_compare.pyexamples/evaluate/perf-benchmark/scripts/plot_perf_speedup.pyexamples/evaluate/perf-benchmark/scripts/run_gen_len_estimation.shexamples/evaluate/perf-benchmark/scripts/run_sweep.sh
✅ Files skipped from review due to trivial changes (2)
- examples/evaluate/perf-benchmark/scripts/run_gen_len_estimation.sh
- examples/evaluate/perf-benchmark/scripts/fetch_and_save_vllm_metrics.sh
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>
|
The quality checks have failed. Please run |
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>
rahul-tuli
left a comment
There was a problem hiding this comment.
Thanks @shanjiaz, lgtm!
|
|
||
| ## Quick Start | ||
|
|
||
| See the example script at [`examples/evaluate/example_humaneval_qwen3_8b_dflash.sh`](../../../examples/evaluate/example_humaneval_qwen3_8b_dflash.sh) for a complete end-to-end example that: |
There was a problem hiding this comment.
Bad link. Maybe just link to where the file will be once merged.
|
|
||
| For each subset, runs `guidellm benchmark --profile throughput` to generate responses for all prompts and record output token counts. | ||
|
|
||
| Script: `run_perf_benchmark.sh` |
There was a problem hiding this comment.
I'm a little confused by these "Script" items in the pipeline steps. Aren't all of these steps run when you call run_perf_benchmark.sh?
|
|
||
| Parses all sweep JSONs and extracts median metrics for each sweep point. Throughput-mode entries are excluded (they represent max-load saturation). | ||
|
|
||
| When `VLLM_URL` environment variable is set, the pipeline also: |
There was a problem hiding this comment.
Little bit confused by this VLLM_URL env variable. Aren't we providing --target ... to the run_perf_benchmark.sh ? Isn't that kind of the vllm url?
Do we need VLLM_URL to be passed as well?
| PARSE_ARGS=(--acceptance-only --current-metrics "${CURRENT_FILE}" --baseline-metrics "${BASELINE_FILE}") | ||
| [[ -n "${OUTPUT_DIR}" ]] && PARSE_ARGS+=(--output "${OUTPUT_DIR}/acceptance.csv") | ||
|
|
||
| python "${SCRIPT_DIR}/parse_sweep.py" "${PARSE_ARGS[@]}" |
There was a problem hiding this comment.
The parsing of acceptance rates must happen for each subset individually. This will process acceptance rates after the loop over subsets, which will report the overall acceptance rates.
There was a problem hiding this comment.
Thanks for pointing it out! This call is just to create the final csv file. Around line 355 we capture baseline metrics before the current dataset sweep and find diff between two runs to calculate the final result. I have tested the full sweep too, the result seems comparable to when we run each dataset separately. Will keep this in mind as I keep working on refactoring.
anmarques
left a comment
There was a problem hiding this comment.
I think we should expose acceptance rates computation via a different end point, even if we reuse the codebase (it actually has a separate block within run_perf_benchmark so I think it makes sense to keep separate)
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>
Adds a complete perf-benchmark pipeline under examples/evaluate/ that:
I have filled in:
Summary by CodeRabbit
Release Notes
New Features
Documentation