20 changes: 18 additions & 2 deletions llmdbenchmark/analysis/benchmark_report/README.md
@@ -2,7 +2,7 @@

A benchmark report is a standard data format describing the cluster configuration, workload, and results of a benchmark run. The report acts as a common API for different benchmarking experiments. Each supported harness in llm-d-benchmark creates a benchmark report upon completion of a run, in addition to saving results in its native format.

There are two versions of the benchmark report, `0.1` and `0.2`. Both reports are generated by the `llm-d-benchmark` harness pod, but new applications consuming benchmark data should use version `0.2` reports.
There are three versions of the benchmark report, `0.1`, `0.2`, and `0.2.1`. All reports are generated by the `llm-d-benchmark` harness pod, but new applications consuming benchmark data should use version `0.2.1` reports. Version `0.2.1` is an additive superset of `0.2`: every valid `0.2` report is also a valid `0.2.1` report.

## v0.2 Format Description

@@ -56,6 +56,22 @@ Session-level statistics for multi-turn workloads. Populated from `*_session_lif

A `session_performance` report is generated alongside the standard `request_performance` report for each stage that has a corresponding session lifecycle file. The `scenario.load.standardized.stage` field identifies which stage the report covers, and `scenario.load.standardized.multi_turn.enabled` is set to `true`.

## v0.2.1 Format Description

Version `0.2.1` is an additive minor revision of `0.2` that adds optional multi-modal payload statistics for image, video, and audio workloads. Every field introduced is optional, so any `0.2` report validates unchanged under `0.2.1` (enforced by `tests/test_benchmark_report_v0_2_1_compat.py`).

See [`br_v0_2_1_example.yaml`](br_v0_2_1_example.yaml) for a dummy example report, and [`br_v0_2_1_json_schema.json`](br_v0_2_1_json_schema.json) for its [JSON Schema](https://json-schema.org/draft/2020-12). All other fields and sections are identical to `0.2`.

The additions, all derived from what the client can determine from the payloads it sent, are:

- **`results.request_performance.aggregate.requests.request_size`** (`Statistics`): total encoded request size in bytes, capturing the large payloads typical of multi-modal requests.
- **`results.request_performance.aggregate.requests.multimodal`**: a per-modality block (`image`, `video`, `audio`), each a distribution set over the media instances sent. `image`/`video` carry `count`, `bytes`, `pixels`, and `aspect_ratio`; `video` adds `frames`; `audio` carries `count`, `bytes`, and `seconds`.
- **`results.request_performance.aggregate.throughput.{image,video,audio}_rate`** (`Statistics`): per-modality delivery rates (`images/s`, `videos/s`, `audios/s`).

New unit categories back these fields: `pixels` (quantity), `ratio` (for aspect ratio, distinct from a 0..1 portion), `bytes` (memory), and a media-throughput category (`images/s`, `videos/s`, `audios/s`) kept separate from the request-rate category so the existing `request_rate` guardrail is unaffected.

Server-side multi-modal metrics (vision token counts, encoding time, multimodal cache hit rates) are out of scope for this revision.
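
As a rough illustration of the client-side statistics listed above, the sketch below derives an `image`-style block from the payloads a client sent. It is not the harness implementation: the helper names `summarize` and `image_payload_stats`, the `(num_bytes, width, height)` input shape, the `mean`/`min`/`max` keys, and the choice to compute `count` per request are all assumptions, since the actual `Statistics` model is not shown here.

```python
from statistics import mean


def summarize(values):
    """Reduce a list of numbers to a few summary statistics.

    Assumption: the real `Statistics` model may carry different or
    additional fields (percentiles, units, and so on).
    """
    return {"mean": mean(values), "min": min(values), "max": max(values)}


def image_payload_stats(requests):
    """Build an `image`-style block from client-side payload facts.

    `requests` is a list of requests; each request is a list of
    (num_bytes, width, height) tuples, one per image sent (hypothetical shape).
    """
    images = [img for request in requests for img in request]
    return {
        # Assumption: `count` is the number of images per request.
        "count": summarize([len(request) for request in requests]),
        "bytes": summarize([num_bytes for num_bytes, _, _ in images]),
        "pixels": summarize([w * h for _, w, h in images]),
        # Width over height, so landscape images exceed 1.
        "aspect_ratio": summarize([w / h for _, w, h in images]),
    }


if __name__ == "__main__":
    sent = [
        [(1000, 640, 480)],
        [(2000, 1920, 1080), (3000, 1080, 1920)],
    ]
    print(image_payload_stats(sent))
```

Because the aspect ratio of a landscape image is greater than 1, this also shows why a separate `ratio` unit is needed alongside the 0..1 portion units.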

## v0.1 Format Description

A benchmark report describes the inference service configuration, workload, and aggregate results. Individual traces from single inference executions are not captured; rather, statistics from multiple traces of identical scenarios are combined to create a report.
@@ -199,7 +215,7 @@ python3 -m llmdbenchmark.analysis.benchmark_report.cli \

#### Parameters Reference
* `-w, --workload-generator`: Specifies the harness generator. Must be one of: `'guidellm'`, `'inferencemax'`, `'inference-perf'`, `'vllm-benchmark'`, `'nop'`.
* `-b, --br-version`: Target benchmark report version (defaults to `0.1`; use `0.2` for the standard version).
* `-b, --br-version`: Target benchmark report version (defaults to `0.1`; use `0.2` for the standard version, or `0.2.1` to additionally capture the multimodal payload statistics that `inference-perf` emits).
* `-f, --force`: Overwrites the output file if it already exists.
* `results_file` *(Positional)*: Path to the raw native results file to convert. For `inference-perf`, the filename must contain `"stage_"` (e.g., `stage_0_lifecycle_metrics.json`).
* `output_file` *(Positional, Optional)*: Destination for the converted report. If omitted, the YAML output is printed directly to `stdout`.
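
The parameters above can be combined into a conversion command. The following is a minimal sketch that only assembles the argument list and applies the two checks the reference describes. `build_convert_command` is a hypothetical helper, not part of the package, and placing the options before the positionals is an assumption about how the CLI parses its arguments.

```python
import os
import shlex

# Harness names accepted by -w/--workload-generator, per the reference above.
HARNESSES = {"guidellm", "inferencemax", "inference-perf", "vllm-benchmark", "nop"}


def build_convert_command(harness, results_file, output_file=None,
                          br_version="0.2.1", force=False):
    """Assemble the argv list for the converter CLI (hypothetical helper)."""
    if harness not in HARNESSES:
        raise ValueError(f"unsupported harness: {harness!r}")
    # inference-perf results must have "stage_" in the file name.
    if harness == "inference-perf" and "stage_" not in os.path.basename(results_file):
        raise ValueError('inference-perf results file name must contain "stage_"')

    cmd = ["python3", "-m", "llmdbenchmark.analysis.benchmark_report.cli",
           "-w", harness, "-b", br_version]
    if force:
        cmd.append("-f")
    cmd.append(results_file)
    # Omitting output_file leaves the YAML report on stdout.
    if output_file is not None:
        cmd.append(output_file)
    return cmd


if __name__ == "__main__":
    cmd = build_convert_command(
        "inference-perf", "stage_0_lifecycle_metrics.json", "report.yaml"
    )
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` as-is, which avoids shell quoting issues with unusual file names.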
2 changes: 2 additions & 0 deletions llmdbenchmark/analysis/benchmark_report/__init__.py
@@ -14,11 +14,13 @@
)
from .schema_v0_1 import BenchmarkReportV01
from .schema_v0_2 import BenchmarkReportV02
from .schema_v0_2_1 import BenchmarkReportV021

__all__ = [
    "BenchmarkReport",
    "BenchmarkReportV01",
    "BenchmarkReportV02",
    "BenchmarkReportV021",
    "get_nested",
    "import_benchmark_report",
    "import_yaml",
22 changes: 20 additions & 2 deletions llmdbenchmark/analysis/benchmark_report/base.py
@@ -92,13 +92,17 @@ class Units(StrEnum):

    # Quantity
    COUNT = auto()
    PIXELS = auto()
    # Portion
    PERCENT = auto()
    FRACTION = auto()
    # Ratio (unbounded; unlike a portion, may exceed 1, e.g. aspect ratio)
    RATIO = auto()
    # Time
    MS = auto()
    S = auto()
    # Memory
    BYTES = "bytes"
    MB = "MB"
    GB = "GB"
    TB = "TB"
@@ -121,15 +125,28 @@ class Units(StrEnum):
    TOKEN_PER_S = "tokens/s"
    # Request throughput
    QUERY_PER_S = "queries/s"
    # Media throughput (per-modality payload rates)
    IMAGE_PER_S = "images/s"
    VIDEO_PER_S = "videos/s"
    AUDIO_PER_S = "audios/s"
    # Power
    WATTS = "Watts"


# Lists of compatible units for a particular application
UNITS_QUANTITY = [Units.COUNT]
UNITS_QUANTITY = [Units.COUNT, Units.PIXELS]
UNITS_PORTION = [Units.PERCENT, Units.FRACTION]
UNITS_RATIO = [Units.RATIO]
UNITS_TIME = [Units.MS, Units.S]
UNITS_MEMORY = [Units.MB, Units.GB, Units.TB, Units.MIB, Units.GIB, Units.TIB]
UNITS_MEMORY = [
    Units.BYTES,
    Units.MB,
    Units.GB,
    Units.TB,
    Units.MIB,
    Units.GIB,
    Units.TIB,
]
UNITS_BANDWIDTH = [
    Units.MBIT_PER_S,
    Units.GBIT_PER_S,
@@ -141,6 +158,7 @@ class Units(StrEnum):
UNITS_GEN_LATENCY = [Units.MS_PER_TOKEN, Units.S_PER_TOKEN]
UNITS_GEN_THROUGHPUT = [Units.TOKEN_PER_S]
UNITS_REQUEST_THROUGHPUT = [Units.QUERY_PER_S]
UNITS_MEDIA_THROUGHPUT = [Units.IMAGE_PER_S, Units.VIDEO_PER_S, Units.AUDIO_PER_S]
UNITS_POWER = [Units.WATTS]
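
One way such compatibility lists can serve as guardrails is sketched below. This is not the validation code from `base.py`: `check_units` is a hypothetical helper, and the enum is a trimmed stand-in that uses `str, Enum` with explicit values (matching what `auto()` yields for a `StrEnum`) so that it runs on older Python versions.

```python
from enum import Enum


class Units(str, Enum):
    """Trimmed stand-in for the real enum, with values written out."""
    COUNT = "count"
    PIXELS = "pixels"
    QUERY_PER_S = "queries/s"
    IMAGE_PER_S = "images/s"
    VIDEO_PER_S = "videos/s"
    AUDIO_PER_S = "audios/s"


UNITS_QUANTITY = [Units.COUNT, Units.PIXELS]
UNITS_REQUEST_THROUGHPUT = [Units.QUERY_PER_S]
UNITS_MEDIA_THROUGHPUT = [Units.IMAGE_PER_S, Units.VIDEO_PER_S, Units.AUDIO_PER_S]


def check_units(value, allowed):
    """Return the matching Units member, or raise if it is not in `allowed`."""
    unit = Units(value)  # raises ValueError for unknown unit strings
    if unit not in allowed:
        names = ", ".join(u.value for u in allowed)
        raise ValueError(
            f"unit {unit.value!r} not allowed here; expected one of: {names}"
        )
    return unit


if __name__ == "__main__":
    # A media rate passes the media-throughput check...
    print(check_units("images/s", UNITS_MEDIA_THROUGHPUT))
    # ...but is rejected where a request rate is expected.
    try:
        check_units("images/s", UNITS_REQUEST_THROUGHPUT)
    except ValueError as err:
        print(err)
```

Keeping `UNITS_MEDIA_THROUGHPUT` separate means a field restricted to `UNITS_REQUEST_THROUGHPUT` continues to accept only `queries/s`.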

###############################################################################