feat: synthetic image and video data generation for VLM benchmarking #732

Open
zakariaelh wants to merge 8 commits into vllm-project:main from zakariaelh:feat/synthetic-multimodal

Conversation

@zakariaelh

Summary

Adds two new --data types, synthetic_image and synthetic_video, that let users benchmark vLLM-served VLMs (Gemma 4, Qwen3-VL, InternVL3.5, etc.) without bringing their own image or video dataset. Composes with the existing synthetic-text knobs and produces TTFT/ITL within 0.3% of real media at matched input shape on Gemma 4.

This closes the "Generation of synthetic multimodal datasets" item under Active Development in the README.

Details

  • SyntheticImageDatasetConfig + SyntheticImageDataset + SyntheticImageDatasetDeserializer registered as synthetic_image
  • SyntheticVideoDatasetConfig + SyntheticVideoDataset + SyntheticVideoDatasetDeserializer registered as synthetic_video
  • synthesize_image / synthesize_video helpers in guidellm.extras.vision, sharing the canonical encoded-dict contract with encode_image / encode_video
  • encode_image / encode_video now idempotent on the canonical dict (no-op if input already encoded)
  • Per-row seeded gradients via PCG64 + SeedSequence([seed, row_index]) (cross-platform deterministic, byte-different per row to defeat the mm-processor cache)
  • content modes: gradient (default), noise, solid, checkerboard
  • images_per_request > 1 emits image_0, image_1, ... matching the existing column-mapper defaults
  • pyproject.toml: imageio[ffmpeg] added to the vision extra
  • README usage examples
  • 45 unit + integration tests, all marked smoke/sanity/regression per AGENTS.md, all carrying ## WRITTEN BY AI ## markers
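The per-row seeding bullet above can be illustrated with a minimal NumPy sketch. It only shows the `PCG64` + `SeedSequence([seed, row_index])` pattern named in the list; `row_rng`, `gradient_image`, and `digest` are hypothetical names for illustration and are not the helpers in `guidellm.extras.vision`. Drawing random gradient endpoint colors is likewise an assumption about how a "gradient" row might be made byte-different.

```python
import hashlib

import numpy as np


def row_rng(seed: int, row_index: int) -> np.random.Generator:
    """Build a generator whose stream depends on both the seed and the row index."""
    seq = np.random.SeedSequence([seed, row_index])
    return np.random.Generator(np.random.PCG64(seq))


def gradient_image(seed: int, row_index: int, width: int, height: int) -> np.ndarray:
    """Return a (height, width, 3) uint8 horizontal gradient with per-row endpoint colors."""
    rng = row_rng(seed, row_index)
    start = rng.integers(0, 256, size=3).astype(np.float64)
    end = rng.integers(0, 256, size=3).astype(np.float64)
    # Interpolation weights along the width axis, shaped for broadcasting.
    t = np.linspace(0.0, 1.0, width).reshape(1, width, 1)
    row = start + (end - start) * t  # shape (1, width, 3)
    return np.broadcast_to(row, (height, width, 3)).round().astype(np.uint8)


def digest(image: np.ndarray) -> str:
    """Hash the raw pixel bytes so rows can be compared cheaply."""
    return hashlib.sha256(image.tobytes()).hexdigest()
```

Under these assumptions, the same `(seed, row_index)` pair always yields the same pixels, while changing either value yields a different stream, which is the property the cache-bust and reproducibility checks rely on.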

Levers exposed

| Knob | Default | Purpose |
| --- | --- | --- |
| width, height (or resolution + aspect_ratio) | required | Vision-tower FLOPs |
| frames, fps (video) | required | Linear vision cost on most VLMs |
| format | jpeg / mp4 | Decode cost + wire size |
| jpeg_quality, video_bitrate | 85 / libx264 default | Wire-size lever |
| content | gradient | Cache-bust default; opt-in noise for worst-case wire size |
| text_tokens (+ stdev/min/max) | required | Text-prefill cost (orthogonal to vision) |
| output_tokens | required | Decode cost |
| images_per_request | 1 | Multi-image-per-turn |
| seed | 0 | Reproducibility |
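As a rough illustration of the `resolution` + `aspect_ratio` sugar, the sketch below resolves a label such as `720p` into a width and height. The rule used here is an assumption: the number before `p` is taken as the height, and the width is rounded up to the next even integer. That rule reproduces the 854×480 shape used elsewhere in this PR, but the PR does not state how the real config resolves these values. `resolve_dimensions` is a hypothetical name.

```python
def resolve_dimensions(resolution: str, aspect_ratio: str = "16:9") -> tuple[int, int]:
    """Turn a label like "720p" plus "W:H" into (width, height). Assumed rule, see lead-in."""
    if not resolution.endswith("p") or not resolution[:-1].isdigit():
        raise ValueError(f"Unsupported resolution label: {resolution!r}")
    height = int(resolution[:-1])

    ratio_w, ratio_h = (int(part) for part in aspect_ratio.split(":"))
    if ratio_w <= 0 or ratio_h <= 0:
        raise ValueError(f"Unsupported aspect ratio: {aspect_ratio!r}")

    # Integer ceiling division avoids floating-point rounding surprises.
    width = -(-height * ratio_w // ratio_h)
    # Assumption: keep the width even, since many video codecs expect even dimensions.
    width += width % 2
    return width, height
```

For example, `resolve_dimensions("480p")` gives `(854, 480)` under this rule, matching the explicit `width=854,height=480` video invocation.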

Example invocations

guidellm benchmark run --target http://localhost:8000 --model google/gemma-4-E4B-it \
  --profile constant --rate 2 --max-seconds 60 \
  --data "type=synthetic_image,resolution=720p,text_tokens=200,output_tokens=64"

guidellm benchmark run --target http://localhost:8000 --model google/gemma-4-E4B-it \
  --profile constant --rate 2 --max-seconds 60 \
  --data "type=synthetic_video,width=854,height=480,frames=6,fps=3,text_tokens=12,output_tokens=10"

guidellm benchmark run --target http://localhost:8000 --model google/gemma-4-E4B-it \
  --profile sweep --max-seconds 60 \
  --data "type=synthetic_image,width=1024,height=1024,format=png,content=noise,images_per_request=2,text_tokens=128,output_tokens=32,seed=17"
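To make the shape of the `--data` strings above concrete, here is a minimal sketch of splitting a `key=value,key=value` spec into a dictionary. It is not GuideLLM's actual parser; `parse_data_spec` and its integer-coercion behavior are assumptions for illustration only.

```python
def _coerce(value: str) -> object:
    """Convert purely numeric values to int; leave everything else as a string."""
    try:
        return int(value)
    except ValueError:
        return value


def parse_data_spec(spec: str) -> dict[str, object]:
    """Split "k1=v1,k2=v2" into {"k1": v1, "k2": v2} (hypothetical helper)."""
    options: dict[str, object] = {}
    for item in spec.split(","):
        key, sep, value = item.partition("=")
        key = key.strip()
        if not sep or not key:
            raise ValueError(f"Expected key=value, got {item!r}")
        options[key] = _coerce(value.strip())
    return options


spec = (
    "type=synthetic_image,width=1024,height=1024,format=png,content=noise,"
    "images_per_request=2,text_tokens=128,output_tokens=32,seed=17"
)
options = parse_data_spec(spec)
```

Values such as `resolution=720p` stay as strings in this sketch because they are not purely numeric.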

Test Plan

  • tox -e test-unit -- tests/unit/data/deserializers/test_synthetic_multimodal.py
    • 43 unit tests covering decoded dimensions, byte counts, content modes, byte-uniqueness across 1000 gradient rows, reproducibility under matched seed, error handling on unsupported formats / content, deserializer dispatch, JSON config, multi-image emission
  • tox -e test-integration -- tests/integration/data/test_synthetic_multimodal_benchmark.py
    • 2 integration tests that drive a real guidellm benchmark run invocation against the in-tree mock server, end-to-end through the data pipeline + chat-completions request handler

End-to-end validation against real vLLM serving google/gemma-4-E4B-it:

| Check | Result |
| --- | --- |
| Real-vLLM smoke (image + video, rate=2, 30s) | Zero errors |
| Resolution sweep TTFT_p50 (480p / 720p / 1080p) | 63.7 / 67.9 / 73.6 ms (monotonic) |
| Frame sweep TTFT_p50 (2 / 6 / 12 frames @480p) | 94.3 / 210.7 / 376.1 ms (monotonic; vision tokens scale linearly, ~75/frame) |
| Synthetic vs real fidelity at matched shape (854×480, 6f@3fps, 100s @ rate=2) | TTFT_p90 delta 0.3%; ITL_p50 delta 0.0% |
| Reproducibility (same seed, two runs) | Byte-identical sha256 per row |
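The frame-sweep row can be sanity-checked with a small least-squares fit over the three reported TTFT_p50 points. This is only arithmetic on the numbers in the table, not part of the PR's evaluation tooling; `least_squares` is a hypothetical helper.

```python
# Reported frame-sweep points from the table above (frames, TTFT_p50 in ms).
frames = [2, 6, 12]
ttft_ms = [94.3, 210.7, 376.1]


def least_squares(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = least_squares(frames, ttft_ms)

# Deviation of each measured point from the fitted line.
residuals = [y - (intercept + slope * x) for x, y in zip(frames, ttft_ms)]

# Strictly increasing TTFT as frames increase.
is_monotonic = all(a < b for a, b in zip(ttft_ms, ttft_ms[1:]))
```

With these three points the fitted slope comes out at about 28 ms per added frame, and each point sits within a few milliseconds of the line, which is consistent with the "linear in frames" reading. Three points are a thin basis for a fit, so treat this as a plausibility check only.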

Full evaluation methodology and per-section results are in the linked status doc.

Related Issues

  • Resolves the "Generation of synthetic multimodal datasets" item listed under Active Development in README.md

  • "I certify that all code in this PR is my own, except as noted below."

Use of AI

  • Includes code generated or substantially modified by an AI agent
  • Includes tests generated or substantially modified by an AI agent

Code and tests were drafted by Claude under my direction, then validated against real Gemma 4 inference on vLLM. The validation caught two real bugs in the initial draft, both fixed in 4ffa586 / current 1822225:

  1. features() in both deserializers declared text columns only, so GenerativeColumnMapper never saw image / video (dataset.column_names was text-only) and the request handler silently built text-only chat completions. TTFT was flat across all resolutions before the fix.
  2. MediaEncoder still ran on synthetic rows and called encode_image with the already-encoded canonical dict, raising Unsupported image type: <class 'dict'> and dropping every row. Fixed by making encode_image / encode_video idempotent on the canonical dict shape.
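The second fix can be sketched as an encoder that returns its input unchanged when the input already has the encoded shape. The key names in `CANONICAL_KEYS`, the returned dict layout, and the function name `encode_image_sketch` are all hypothetical; the PR does not spell out the canonical dict contract, so this only shows the idempotence pattern, not the real `encode_image`.

```python
import base64

# Hypothetical key set standing in for the canonical encoded-dict contract.
CANONICAL_KEYS = {"type", "image"}


def encode_image_sketch(image: object) -> dict:
    """Encode raw bytes to a data-URL dict; pass already-encoded dicts through."""
    if isinstance(image, dict) and CANONICAL_KEYS <= image.keys():
        # Already encoded: re-application is a no-op.
        return image
    if isinstance(image, (bytes, bytearray)):
        payload = base64.b64encode(bytes(image)).decode("ascii")
        return {"type": "image_base64", "image": f"data:image/jpeg;base64,{payload}"}
    # Mirrors the error text quoted in the PR for unsupported inputs.
    raise ValueError(f"Unsupported image type: {type(image)}")
```

With this guard, a pipeline stage that runs on every row can safely see rows that an earlier stage has already encoded, instead of raising and dropping them.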

I have reviewed every line of the diff and am the submitter of record.

zakariaelh and others added 8 commits May 15, 2026 10:27
Pre-encoded data-URL output matching encode_image / encode_video shape.
Per-row seeded gradient default with noise / solid / checkerboard opt-ins
for images; gradient / noise for videos. Bit-exact mp4 encoding via
imageio[ffmpeg] -fflags +bitexact so same seed produces byte-identical
payloads.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SyntheticImageDatasetConfig and SyntheticVideoDatasetConfig live next to
the existing text config. text_tokens is canonical; prompt_tokens is
accepted as an alias. resolution / aspect_ratio sugar resolves to
width/height. Each deserializer peeks at the input type and refuses to
claim configs explicitly marked for another deserializer, so the registry
dispatch is deterministic when distinctive fields overlap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…video

Unit tests cover synthesize_image / synthesize_video helpers (decoded
dims, byte counts, reproducibility, per-row uniqueness, 1000-row
cache-bust check) and the deserializers (pull 10 rows from a --data
string, type-mismatch refusal, prompt_tokens alias, images_per_request).

Integration test spins up the in-tree mock server and runs
'guidellm benchmark run' end-to-end with both synthetic_image and
synthetic_video --data strings, asserting return code 0 and a
non-empty benchmark report.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move synthetic multimodal generation out of Active Development for
images and video. Audio remains WIP. Add two short --data examples
(one image, one video) plus a parameter rundown for the new types.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bugs caught by Section 4 of the evaluation plan against real vLLM:

1. SyntheticImageDataset and SyntheticVideoDataset features() omitted the
   image/video columns from the typed schema, so dataset.column_names
   returned only text columns. GenerativeColumnMapper reads column_names
   first and never sees `image`/`video`, so the request handler builds a
   text-only chat completion and the image is silently dropped. TTFT was
   identical across 480p/720p/1080p before the fix.

2. MediaEncoder still runs on synthetic rows. It called encode_image with
   the already-encoded canonical dict, which raised "Unsupported image
   type: <class 'dict'>" and dropped every row. Made encode_image and
   encode_video idempotent on the canonical dict shape so re-application
   is a no-op.

After both fixes: resolution sweep TTFT 63.7 → 67.9 → 73.6ms (monotonic);
frame sweep TTFT 94 → 211 → 376ms (monotonic, linear in frames);
synth-vs-real fidelity 0.3% TTFT_p90 delta and 0.0% ITL_p50 delta.

Co-authored-by: Claude
guidellm's AGENTS.md requires every AI-written test function to carry
`## WRITTEN BY AI ##` at the end of its docstring. Adds the marker to
all 45 new tests in the multimodal suite.

Assisted-by: Claude (Anthropic)
Collaborator

@dbutenhof left a comment

First pass -- a few documentation comments. I think this is packing too much into README.md, and should be broken out. There are also several places in the guide pages that mention synthetic (text) data that should probably be generalized.

Comment thread README.md
- `--data-samples`: Number of samples to use from the dataset - use `-1` (default) for all samples with dynamic generation, or specify a positive integer to limit sample count
- `--processor`: Tokenizer or processor name used for generating synthetic data - if not provided and required for the dataset, automatically loads from the model; accepts HuggingFace model IDs or local paths

### Synthetic Multimodal Data
Collaborator

This section should be moved and restructured.

It's too big & specific for the README.md file.

There's a section in docs/guides/datasets.md on "Synthetic Data": this should be expanded and restructured to reference and frame the potential for synthetic image/video as well as text. But the details you have here should probably be in a new docs/guides/multimodal/synthetic_vision.md (since you cover both images and video but not really "multimodal" in that there's no audio).

Might not be a bad idea to tweak the "Multimodal Benchmarking" section in docs/guides/multimodal/index.md as well to mention synthesized visual datasets with a link to the new details.

Comment thread README.md

**Key parameters:**

- `--data "type=synthetic_image,..."`: Knobs include `width`, `height`, the `resolution=720p` / `aspect_ratio=16:9` sugar, `format` (`jpeg` or `png`), `jpeg_quality`, `content` (`gradient` default, `noise`, `solid`, `checkerboard`), `images_per_request`, `text_tokens` (with the same `stdev`/`min`/`max` companions as the synthetic text mode), `output_tokens`, and `seed`. `prompt_tokens` is accepted as an alias for `text_tokens`.
Collaborator

"configuration options" instead of "knobs". I think you should also break this large list of options into a sub-list with a bit more context than just the name. I think I'd be tempted to take each of your "types" as a separate sub-section, to structure your new .md a bit more clearly. Something roughly like this:

# Synthetic visual data
 ...
## Key parameters
GuideLLM currently supports synthetic visual datasets for both image and video [...]
### Synthetic image
Use `--data "type=synthetic_image"` with the following options:
- width: control the width in pixels of the generated image
- [...]
### Synthetic video
[...]
