llm-d · Bslabe123 · May 29, 2026
@@ -2,7 +2,7 @@
 
 A benchmarking report is a standard data format describing the cluster configuration, workload, and results of a benchmark run. The report acts as a common API for different benchmarking experiments. Each supported harness in llm-d-benchmark creates a benchmark report upon completion of a run, in addition to saving results in its native format.
 
-There are two versions of the benchmark report, `0.1` and `0.2`. Both reports are generated by the `llm-d-benchmark` harness pod, but new applications consuming benchmark data should use version `0.2` reports.
+There are three versions of the benchmark report: `0.1`, `0.2`, and `0.2.1`. All three are generated by the `llm-d-benchmark` harness pod, but new applications consuming benchmark data should use version `0.2.1` reports. Version `0.2.1` is an additive superset of `0.2`: every valid `0.2` report is also a valid `0.2.1` report.
 
 ## v0.2 Format Description
 
@@ -56,6 +56,22 @@ Session-level statistics for multi-turn workloads. Populated from `*_session_lif
 
 A `session_performance` report is generated alongside the standard `request_performance` report for each stage that has a corresponding session lifecycle file. The `scenario.load.standardized.stage` field identifies which stage the report covers, and `scenario.load.standardized.multi_turn.enabled` is set to `true`.
 
+## v0.2.1 Format Description
+
+Version `0.2.1` is an additive minor revision of `0.2` that adds optional multi-modal payload statistics for image, video, and audio workloads. Every field introduced is optional, so any `0.2` report validates unchanged under `0.2.1` (enforced by `tests/test_benchmark_report_v0_2_1_compat.py`).
+
+See [`br_v0_2_1_example.yaml`](br_v0_2_1_example.yaml) for a dummy example report, and [`br_v0_2_1_json_schema.json`](br_v0_2_1_json_schema.json) for its [JSON Schema](https://json-schema.org/draft/2020-12). All other fields and sections are identical to `0.2`.
+
+The additions, all derived from what the client can determine from the payloads it sent, are:
+
+- **`results.request_performance.aggregate.requests.request_size`** (`Statistics`): total encoded request size in bytes, capturing the large payloads typical of multi-modal requests.
+- **`results.request_performance.aggregate.requests.multimodal`**: a per-modality block (`image`, `video`, `audio`), each holding a set of distributions over the media instances sent. `image` and `video` both carry `count`, `bytes`, `pixels`, and `aspect_ratio`; `video` additionally carries `frames`; `audio` carries `count`, `bytes`, and `seconds`.
+- **`results.request_performance.aggregate.throughput.{image,video,audio}_rate`** (`Statistics`): per-modality delivery rates (`images/s`, `videos/s`, `audios/s`).
+
+New unit categories back these fields: `pixels` (quantity), `ratio` (for aspect ratio, distinct from a 0..1 portion), `bytes` (memory), and a media-throughput category (`images/s`, `videos/s`, `audios/s`) kept separate from the request-rate category so the existing `request_rate` guardrail is unaffected.
+
+Server-side multi-modal metrics (vision token counts, encoding time, multi-modal cache hit rates) are out of scope for this revision.
+
 ## v0.1 Format Description
 
 A benchmark report describes the inference service configuration, workload, and aggregate results. Individual traces from single inference executions are not captured, rather statistics from multiple traces of identical scenarios are combined to create a report.
@@ -199,7 +215,7 @@ python3 -m llmdbenchmark.analysis.benchmark_report.cli \
 
 #### Parameters Reference
 * `-w, --workload-generator`: Specifies the harness generator. Must be one of: `'guidellm'`, `'inferencemax'`, `'inference-perf'`, `'vllm-benchmark'`, `'nop'`.
-* `-b, --br-version`: Target benchmark report version (defaults to `0.1`; use `0.2` for the standard version).
+* `-b, --br-version`: Target benchmark report version (defaults to `0.1`; use `0.2` for the standard version, or `0.2.1` to additionally capture the multimodal payload statistics that `inference-perf` emits).
 * `-f, --force`: Overwrites the output file if it already exists.
 * `results_file` *(Positional)*: Path to the raw native results file to convert. (e.g. For `inference-perf`, this must contain `"stage_"` in its filename, e.g., `stage_0_lifecycle_metrics.json`).
 * `output_file` *(Positional, Optional)*: Destination for the converted report. If omitted, the YAML output is printed directly to `stdout`.
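
Because every `0.2.1` addition is optional, a consumer has to tolerate reports that omit them. The following is a minimal sketch of that pattern using plain dictionaries rather than the project's schema classes. The `lookup` helper is hypothetical (it is not the package's `get_nested`), and the `units`/`mean` keys inside each statistics block are assumed for illustration, not taken from the schema.

```python
from typing import Any, Optional


def lookup(report: dict, dotted_path: str) -> Optional[Any]:
    """Hypothetical helper: walk a dotted path through nested dicts.

    Returns None as soon as a key is missing, so optional 0.2.1 fields
    can be probed safely on a plain 0.2 report.
    """
    node: Any = report
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Field paths named in the v0.2.1 format description.
REQUEST_SIZE = "results.request_performance.aggregate.requests.request_size"
IMAGE_RATE = "results.request_performance.aggregate.throughput.image_rate"

# Illustrative fragments only; the "units" and "mean" keys are assumptions.
report_v0_2 = {
    "results": {
        "request_performance": {"aggregate": {"requests": {}, "throughput": {}}}
    }
}
report_v0_2_1 = {
    "results": {
        "request_performance": {
            "aggregate": {
                "requests": {"request_size": {"units": "bytes", "mean": 2048.0}},
                "throughput": {"image_rate": {"units": "images/s", "mean": 3.5}},
            }
        }
    }
}

for name, report in (("0.2", report_v0_2), ("0.2.1", report_v0_2_1)):
    size = lookup(report, REQUEST_SIZE)
    rate = lookup(report, IMAGE_RATE)
    print(name, "request_size:", size, "image_rate:", rate)
```

A consumer written this way reads both report versions through one code path and treats a missing multi-modal block as "not measured" rather than as an error.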

@@ -14,11 +14,13 @@
 )
 from .schema_v0_1 import BenchmarkReportV01
 from .schema_v0_2 import BenchmarkReportV02
+from .schema_v0_2_1 import BenchmarkReportV021
 
 __all__ = [
     "BenchmarkReport",
     "BenchmarkReportV01",
     "BenchmarkReportV02",
+    "BenchmarkReportV021",
     "get_nested",
     "import_benchmark_report",
     "import_yaml",

@@ -92,13 +92,17 @@ class Units(StrEnum):
 
     # Quantity
     COUNT = auto()
+    PIXELS = auto()
     # Portion
     PERCENT = auto()
     FRACTION = auto()
+    # Ratio (unbounded; unlike a portion, may exceed 1, e.g. aspect ratio)
+    RATIO = auto()
     # Time
     MS = auto()
     S = auto()
     # Memory
+    BYTES = "bytes"
     MB = "MB"
     GB = "GB"
     TB = "TB"
@@ -121,15 +125,28 @@ class Units(StrEnum):
     TOKEN_PER_S = "tokens/s"
     # Request throughput
     QUERY_PER_S = "queries/s"
+    # Media throughput (per-modality payload rates)
+    IMAGE_PER_S = "images/s"
+    VIDEO_PER_S = "videos/s"
+    AUDIO_PER_S = "audios/s"
     # Power
     WATTS = "Watts"
 
 
 # Lists of compatible units for a particular application
-UNITS_QUANTITY = [Units.COUNT]
+UNITS_QUANTITY = [Units.COUNT, Units.PIXELS]
 UNITS_PORTION = [Units.PERCENT, Units.FRACTION]
+UNITS_RATIO = [Units.RATIO]
 UNITS_TIME = [Units.MS, Units.S]
-UNITS_MEMORY = [Units.MB, Units.GB, Units.TB, Units.MIB, Units.GIB, Units.TIB]
+UNITS_MEMORY = [
+    Units.BYTES,
+    Units.MB,
+    Units.GB,
+    Units.TB,
+    Units.MIB,
+    Units.GIB,
+    Units.TIB,
+]
 UNITS_BANDWIDTH = [
     Units.MBIT_PER_S,
     Units.GBIT_PER_S,
@@ -141,6 +158,7 @@ class Units(StrEnum):
 UNITS_GEN_LATENCY = [Units.MS_PER_TOKEN, Units.S_PER_TOKEN]
 UNITS_GEN_THROUGHPUT = [Units.TOKEN_PER_S]
 UNITS_REQUEST_THROUGHPUT = [Units.QUERY_PER_S]
+UNITS_MEDIA_THROUGHPUT = [Units.IMAGE_PER_S, Units.VIDEO_PER_S, Units.AUDIO_PER_S]
 UNITS_POWER = [Units.WATTS]
 
 ###############################################################################
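
The reason for keeping `UNITS_MEDIA_THROUGHPUT` separate from `UNITS_REQUEST_THROUGHPUT` can be illustrated with a small stand-alone sketch. It copies only a subset of the unit values from the hunks above, uses a plain `str`-backed `Enum` as a stand-in for the `StrEnum` in the diff, and introduces a hypothetical `check_units` function; how the real schema validators consume these lists is not shown in this diff, so treat the check itself as an assumption.

```python
from enum import Enum
from typing import List


class Units(str, Enum):
    # Subset of the units in the diff, with explicit string values.
    QUERY_PER_S = "queries/s"
    IMAGE_PER_S = "images/s"
    VIDEO_PER_S = "videos/s"
    AUDIO_PER_S = "audios/s"


UNITS_REQUEST_THROUGHPUT = [Units.QUERY_PER_S]
UNITS_MEDIA_THROUGHPUT = [Units.IMAGE_PER_S, Units.VIDEO_PER_S, Units.AUDIO_PER_S]


def check_units(units: str, allowed: List[Units], field: str) -> Units:
    """Hypothetical guardrail: accept a unit only if it is in the field's category."""
    parsed = Units(units)  # raises ValueError for an unknown unit string
    if parsed not in allowed:
        allowed_values = [u.value for u in allowed]
        raise ValueError(f"{field}: unit {parsed.value!r} not in {allowed_values}")
    return parsed


# A media rate validates against the media category...
print(check_units("images/s", UNITS_MEDIA_THROUGHPUT, "image_rate"))

# ...but is still rejected where a request rate is expected.
try:
    check_units("images/s", UNITS_REQUEST_THROUGHPUT, "request_rate")
except ValueError as err:
    print("rejected:", err)
```

Had the media units been appended to `UNITS_REQUEST_THROUGHPUT` instead, the second call would succeed and a `request_rate` field could silently carry `images/s`.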