feat: synthetic image and video data generation for VLM benchmarking by zakariaelh · Pull Request #732 · vllm-project/guidellm

zakariaelh · 2026-05-15T14:31:42Z

Summary

Adds two new --data types, synthetic_image and synthetic_video, that let users benchmark vLLM-served VLMs (Gemma 4, Qwen3-VL, InternVL3.5, etc.) without bringing their own image or video dataset. Composes with the existing synthetic-text knobs and produces TTFT/ITL within 0.3% of real media at matched input shape on Gemma 4.

This closes the "Generation of synthetic multimodal datasets" item under Active Development in the README.

Details

Levers exposed

| Knob | Default | Purpose |
| --- | --- | --- |
| `width`, `height` (or `resolution` + `aspect_ratio`) | required | Vision-tower FLOPs |
| `frames`, `fps` (video) | required | Linear vision cost on most VLMs |
| `format` | `jpeg` / `mp4` | Decode cost + wire size |
| `jpeg_quality`, `video_bitrate` | 85 / libx264 default | Wire-size lever |
| `content` | `gradient` | Cache-bust default; opt-in `noise` for worst-case wire size |
| `text_tokens` (+ stdev/min/max) | required | Text-prefill cost (orthogonal to vision) |
| `output_tokens` | required | Decode cost |
| `images_per_request` | 1 | Multi-image-per-turn |
| `seed` | 0 | Reproducibility |
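As a rough illustration of the `resolution` + `aspect_ratio` sugar, the sketch below resolves a pair such as `720p` and `16:9` to a width and height. The helper name `resolve_dimensions`, the reading of `NNNp` as a height in pixels, the `16:9` default, and the rounding to an even width are all assumptions made for illustration; the PR's actual resolution logic may differ.

```python
from fractions import Fraction


def resolve_dimensions(resolution: str, aspect_ratio: str = "16:9") -> tuple[int, int]:
    """Hypothetical helper: map e.g. ("720p", "16:9") to (width, height)."""
    if not resolution.endswith("p") or not resolution[:-1].isdigit():
        raise ValueError(f"Unsupported resolution: {resolution!r}")
    height = int(resolution[:-1])

    num, sep, den = aspect_ratio.partition(":")
    if not sep or not num.isdigit() or not den.isdigit() or int(den) == 0:
        raise ValueError(f"Unsupported aspect ratio: {aspect_ratio!r}")
    ratio = Fraction(int(num), int(den))

    # Assumption: round the width to the nearest even pixel count, since many
    # video codecs require even dimensions.
    width = int(round(height * ratio / 2)) * 2
    return width, height
```

Under these assumptions, `resolve_dimensions("480p")` gives `(854, 480)`, which matches the explicit `width=854,height=480` used in the video example invocation.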

Example invocations

guidellm benchmark run --target http://localhost:8000 --model google/gemma-4-E4B-it \
  --profile constant --rate 2 --max-seconds 60 \
  --data "type=synthetic_image,resolution=720p,text_tokens=200,output_tokens=64"

guidellm benchmark run --target http://localhost:8000 --model google/gemma-4-E4B-it \
  --profile constant --rate 2 --max-seconds 60 \
  --data "type=synthetic_video,width=854,height=480,frames=6,fps=3,text_tokens=12,output_tokens=10"

guidellm benchmark run --target http://localhost:8000 --model google/gemma-4-E4B-it \
  --profile sweep --max-seconds 60 \
  --data "type=synthetic_image,width=1024,height=1024,format=png,content=noise,images_per_request=2,text_tokens=128,output_tokens=32,seed=17"
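The `--data` values above are comma-separated `key=value` pairs. A minimal sketch of how such a string could be split into a config dictionary follows. `parse_data_arg` is a hypothetical name, and GuideLLM's real parser may handle quoting, JSON input, and type coercion differently.

```python
def parse_data_arg(arg: str) -> dict[str, object]:
    """Hypothetical parser for 'key=value,key=value' strings (illustration only)."""
    config: dict[str, object] = {}
    for pair in arg.split(","):
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"Expected key=value, got {pair!r}")
        value = value.strip()
        # Assumption: purely numeric values become ints; everything else stays a string.
        config[key.strip()] = int(value) if value.isdigit() else value
    return config


config = parse_data_arg(
    "type=synthetic_image,resolution=720p,text_tokens=200,output_tokens=64"
)
```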

Test Plan

tox -e test-unit -- tests/unit/data/deserializers/test_synthetic_multimodal.py
- 43 unit tests covering decoded dimensions, byte counts, content modes, byte-uniqueness across 1000 gradient rows, reproducibility under matched seed, error handling on unsupported formats / content, deserializer dispatch, JSON config, multi-image emission
tox -e test-integration -- tests/integration/data/test_synthetic_multimodal_benchmark.py
- 2 integration tests that drive a real guidellm benchmark run invocation against the in-tree mock server, end-to-end through the data pipeline + chat-completions request handler

End-to-end validation against real vLLM serving google/gemma-4-E4B-it:

| Check | Result |
| --- | --- |
| Real-vLLM smoke (image + video, rate=2, 30s) | Zero errors |
| Resolution sweep TTFT_p50 (480p / 720p / 1080p) | 63.7 / 67.9 / 73.6 ms, monotonic |
| Frame sweep TTFT_p50 (2 / 6 / 12 frames @480p) | 94.3 / 210.7 / 376.1 ms, monotonic; vision tokens scale linearly (~75/frame) |
| Synthetic vs real fidelity at matched shape (854×480, 6f@3fps, 100s @ rate=2) | TTFT_p90 delta 0.3% · ITL_p50 delta 0.0% |
| Reproducibility (same seed, two runs) | Byte-identical sha256 per row |
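The "linear in frames" reading of the frame sweep can be checked with a few lines of arithmetic on the reported TTFT_p50 values. The sketch below only computes the per-frame slope between adjacent sweep points; the numbers are copied from the frame-sweep row, and the variable names are illustrative.

```python
# TTFT_p50 in milliseconds from the frame sweep at 480p.
frames = [2, 6, 12]
ttft_p50_ms = [94.3, 210.7, 376.1]

# Incremental TTFT cost per added frame between adjacent sweep points.
slopes = [
    (ttft_p50_ms[i + 1] - ttft_p50_ms[i]) / (frames[i + 1] - frames[i])
    for i in range(len(frames) - 1)
]

for (lo, hi), slope in zip(zip(frames, frames[1:]), slopes):
    print(f"{lo} -> {hi} frames: {slope:.1f} ms per added frame")
```

Both segments come out at roughly 28 to 29 ms per added frame, which is consistent with approximately linear scaling over this range.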

Full evaluation methodology and per-section results are in the linked status doc.

Related Issues

Resolves the "Generation of synthetic multimodal datasets" item listed under Active Development in README.md

"I certify that all code in this PR is my own, except as noted below."

Use of AI

Includes code generated or substantially modified by an AI agent
Includes tests generated or substantially modified by an AI agent

Code and tests were drafted by Claude under my direction, then validated against real Gemma 4 inference on vLLM. The validation caught two real bugs in the initial draft, both fixed in 4ffa586 / current 1822225:

1. `features()` in both deserializers declared text columns only, so `GenerativeColumnMapper` never saw `image` / `video` (`dataset.column_names` was text-only) and the request handler silently built text-only chat completions. TTFT was flat across all resolutions before the fix.
2. `MediaEncoder` still ran on synthetic rows and called `encode_image` with the already-encoded canonical dict, raising `Unsupported image type: <class 'dict'>` and dropping every row. Fixed by making `encode_image` / `encode_video` idempotent on the canonical dict shape.
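The idempotency fix can be pictured as an early return when the input already has the encoded shape. The sketch below is not GuideLLM's `encode_image`: the function names, the dictionary keys (`type`, `image`), and the bytes-only input handling are assumptions chosen for illustration.

```python
import base64
from typing import Any


def is_canonical_image(item: Any) -> bool:
    """Assumed canonical shape: a dict holding an already-encoded data URL."""
    return (
        isinstance(item, dict)
        and isinstance(item.get("image"), str)
        and item["image"].startswith("data:image/")
    )


def encode_image_sketch(item: Any, mime: str = "image/jpeg") -> dict[str, str]:
    """Illustrative idempotent encoder: re-applying it to its own output is a no-op."""
    if is_canonical_image(item):
        return item  # already encoded, pass through unchanged
    if isinstance(item, (bytes, bytearray)):
        payload = base64.b64encode(bytes(item)).decode("ascii")
        return {"type": "image_url", "image": f"data:{mime};base64,{payload}"}
    raise ValueError(f"Unsupported image type: {type(item)}")


once = encode_image_sketch(b"\xff\xd8\xff\xe0 fake jpeg bytes")
twice = encode_image_sketch(once)  # second pass returns the same dict
```

Without the early return, the second call would fall through to the `ValueError` branch, which mirrors the row-dropping failure described above.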

I have reviewed every line of the diff and am the submitter of record.

Pre-encoded data-URL output matching encode_image / encode_video shape. Per-row seeded gradient default with noise / solid / checkerboard opt-ins for images; gradient / noise for videos. Bit-exact mp4 encoding via imageio[ffmpeg] -fflags +bitexact so same seed produces byte-identical payloads.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
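To make the per-row seeding idea concrete, here is a standard-library sketch that derives gradient parameters from a `(seed, row)` pair and compares sha256 digests. It emits raw RGB bytes rather than JPEG, PNG, or mp4, and `synth_gradient_bytes` and `row_digest` are hypothetical names, so it illustrates the reproducibility and cache-busting properties only, not the PR's generator.

```python
import hashlib
import random


def synth_gradient_bytes(width: int, height: int, seed: int, row: int) -> bytes:
    """Hypothetical generator: raw RGB gradient with per-row seeded parameters."""
    rng = random.Random(f"{seed}:{row}")  # string seeds are stable across runs
    base = [rng.randrange(256) for _ in range(3)]     # starting colour per channel
    step_x = [rng.randrange(1, 8) for _ in range(3)]  # horizontal slope per channel
    step_y = [rng.randrange(1, 8) for _ in range(3)]  # vertical slope per channel

    pixels = bytearray()
    for y in range(height):
        for x in range(width):
            for c in range(3):
                pixels.append((base[c] + step_x[c] * x + step_y[c] * y) % 256)
    return bytes(pixels)


def row_digest(seed: int, row: int) -> str:
    """sha256 of a small 16x16 gradient for the given seed and row index."""
    return hashlib.sha256(synth_gradient_bytes(16, 16, seed, row)).hexdigest()


# Same seed and row index must reproduce the same payload.
assert row_digest(0, 5) == row_digest(0, 5)

# Collect digests across rows to check that payloads differ (cache-bust check).
digests = {row_digest(0, row) for row in range(1000)}
```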

SyntheticImageDatasetConfig and SyntheticVideoDatasetConfig live next to the existing text config. text_tokens is canonical; prompt_tokens is accepted as an alias. resolution / aspect_ratio sugar resolves to width/height. Each deserializer peeks at the input type and refuses to claim configs explicitly marked for another deserializer, so the registry dispatch is deterministic when distinctive fields overlap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
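The "peek at the type and refuse" rule might look roughly like the following. This is a sketch under stated assumptions: `claims_config`, its arguments, and the fallback to distinctive fields when no `type` is given are hypothetical and not taken from the PR's code.

```python
from typing import Optional


def claims_config(config: dict, own_type: str, distinctive_fields: set[str]) -> bool:
    """Hypothetical check: should the deserializer for `own_type` claim `config`?"""
    declared: Optional[str] = config.get("type")
    if declared is not None:
        # An explicit type always wins, even when distinctive fields overlap.
        return declared == own_type
    # Assumption: without an explicit type, fall back to distinctive fields.
    return bool(config.keys() & distinctive_fields)


video_config = {"type": "synthetic_video", "width": 854, "height": 480, "frames": 6}
image_claims = claims_config(video_config, "synthetic_image", {"width", "height"})
video_claims = claims_config(video_config, "synthetic_video", {"frames", "fps"})
```

Here the image deserializer declines the config even though `width` and `height` are present, because the config is explicitly marked `synthetic_video`.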

…video Unit tests cover synthesize_image / synthesize_video helpers (decoded dims, byte counts, reproducibility, per-row uniqueness, 1000-row cache-bust check) and the deserializers (pull 10 rows from a --data string, type-mismatch refusal, prompt_tokens alias, images_per_request). Integration test spins up the in-tree mock server and runs 'guidellm benchmark run' end-to-end with both synthetic_image and synthetic_video --data strings, asserting return code 0 and a non-empty benchmark report.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Move synthetic multimodal generation out of Active Development for images and video. Audio remains WIP. Add two short --data examples (one image, one video) plus a parameter rundown for the new types.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two bugs caught by Section 4 of the evaluation plan against real vLLM:

1. SyntheticImageDataset and SyntheticVideoDataset features() omitted the image/video columns from the typed schema, so dataset.column_names returned only text columns. GenerativeColumnMapper reads column_names first and never sees `image`/`video`, so the request handler builds a text-only chat completion and the image is silently dropped. TTFT was identical across 480p/720p/1080p before the fix.
2. MediaEncoder still runs on synthetic rows. It called encode_image with the already-encoded canonical dict, which raised "Unsupported image type: <class 'dict'>" and dropped every row. Made encode_image and encode_video idempotent on the canonical dict shape so re-application is a no-op.

After both fixes: resolution sweep TTFT 63.7 → 67.9 → 73.6ms (monotonic); frame sweep TTFT 94 → 211 → 376ms (monotonic, linear in frames); synth-vs-real fidelity 0.3% TTFT_p90 delta and 0.0% ITL_p50 delta.

Co-authored-by: Claude

guidellm's AGENTS.md requires every AI-written test function to carry `## WRITTEN BY AI ##` at the end of its docstring. Adds the marker to all 45 new tests in the multimodal suite.

Assisted-by: Claude (Anthropic)
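For reference, marker placement at the end of a test docstring could look like the sketch below. The test name and body are placeholders, and `has_ai_marker` is a hypothetical checker written for this illustration; it is not part of guidellm.

```python
MARKER = "## WRITTEN BY AI ##"


def test_example_is_deterministic():
    """Placeholder test used only to show where the marker goes.

    ## WRITTEN BY AI ##
    """
    assert sorted([3, 1, 2]) == [1, 2, 3]


def has_ai_marker(func) -> bool:
    """Return True when the function's docstring ends with the marker."""
    doc = (func.__doc__ or "").strip()
    return doc.endswith(MARKER)
```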

dbutenhof

First pass -- a few documentation comments. I think this is packing too much into README.md, and should be broken out. There are also several places in the guide pages that mention synthetic (text) data that should probably be generalized.

dbutenhof · 2026-05-18T14:41:15Z

 - `--data-samples`: Number of samples to use from the dataset - use `-1` (default) for all samples with dynamic generation, or specify a positive integer to limit sample count
 - `--processor`: Tokenizer or processor name used for generating synthetic data - if not provided and required for the dataset, automatically loads from the model; accepts HuggingFace model IDs or local paths

+### Synthetic Multimodal Data


This section should be moved and restructured.

It's too big & specific for the README.md file.

There's a section in docs/guides/datasets.md on "Synthetic Data": this should be expanded and restructured to reference and frame the potential for synthetic image/video as well as text. But the details you have here should probably be in a new docs/guides/multimodal/synthetic_vision.md (since you cover both images and video but not really "multimodal" in that there's no audio).

Might not be a bad idea to tweak the "Multimodal Benchmarking" section in docs/guides/multimodal/index.md as well to mention synthesized visual datasets with a link to the new details.

dbutenhof · 2026-05-18T14:48:56Z

+
+**Key parameters:**
+
+- `--data "type=synthetic_image,..."`: Knobs include `width`, `height`, the `resolution=720p` / `aspect_ratio=16:9` sugar, `format` (`jpeg` or `png`), `jpeg_quality`, `content` (`gradient` default, `noise`, `solid`, `checkerboard`), `images_per_request`, `text_tokens` (with the same `stdev`/`min`/`max` companions as the synthetic text mode), `output_tokens`, and `seed`. `prompt_tokens` is accepted as an alias for `text_tokens`.


"configuration options" instead of "knobs". I think you should also break this large list of options into a sub-list with a bit more context than just the name. I think I'd be tempted to take each of your "types" as a separate sub-section, to structure your new .md a bit more clearly. Something roughly like this:

    # Synthetic visual data
    ...
    ## Key parameters
    GuideLLM currently supports synthetic visual datasets for both image and video [...]
    ### Synthetic image
    Use `--data "type=synthetic_image"` with the following options:
    - width: control the width in pixels of the generated image
    - [...]
    ### Synthetic video
    [...]

zakariaelh and others added 8 commits May 15, 2026 10:27

tests: add WRITTEN BY AI marker per AGENTS.md

48928f8


Fix pre-existing lint and type-check failures

b58544a

Add coordinate warp to synthetic gradient generator

3f6618e

dbutenhof requested changes May 18, 2026

zakariaelh wants to merge 8 commits into vllm-project:main from zakariaelh:feat/synthetic-multimodal


3 participants

