feat: synthetic image and video data generation for VLM benchmarking #732
zakariaelh wants to merge 8 commits into
Pre-encoded data-URL output matching encode_image / encode_video shape. Per-row seeded gradient default with noise / solid / checkerboard opt-ins for images; gradient / noise for videos. Bit-exact mp4 encoding via imageio[ffmpeg] -fflags +bitexact so same seed produces byte-identical payloads. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
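The per-row seeding and data-URL output described in this commit can be sketched with the standard library alone. This is an illustration under stated assumptions, not the PR's code: `synth_gradient_png` and `to_data_url` are hypothetical names, `random.Random` seeded with a `"seed:row"` string stands in for the `SeedSequence([seed, row_index])` approach the PR description mentions, and PNG is used instead of the JPEG default because it can be written without third-party packages.

```python
import base64
import random
import struct
import zlib


def _chunk(tag: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, tag, data, CRC over tag + data."""
    crc = zlib.crc32(tag + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)


def synth_gradient_png(width: int, height: int, seed: int, row_index: int) -> bytes:
    """Hypothetical helper: a horizontal RGB gradient whose colors depend on
    (seed, row_index), so each dataset row yields different bytes."""
    # Stand-in for per-row seeding; the PR describes SeedSequence([seed, row_index]).
    rng = random.Random(f"{seed}:{row_index}")
    start = [rng.randrange(256) for _ in range(3)]
    end = [rng.randrange(256) for _ in range(3)]

    row = bytearray()
    for x in range(width):
        t = x / max(width - 1, 1)
        row.extend(round(s + (e - s) * t) for s, e in zip(start, end))

    # Each scanline is prefixed with filter type 0 (no filtering).
    raw = b"".join(b"\x00" + bytes(row) for _ in range(height))
    # IHDR: 8-bit depth, color type 2 (truecolor RGB), no interlacing.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(raw))
        + _chunk(b"IEND", b"")
    )


def to_data_url(png: bytes) -> str:
    """Wrap PNG bytes in a base64 data URL."""
    return "data:image/png;base64," + base64.b64encode(png).decode("ascii")


if __name__ == "__main__":
    first = synth_gradient_png(64, 36, seed=42, row_index=0)
    again = synth_gradient_png(64, 36, seed=42, row_index=0)
    other = synth_gradient_png(64, 36, seed=42, row_index=1)
    print("reproducible:", first == again)
    print("unique per row:", first != other)
    print(to_data_url(first)[:48] + "...")
```

Mixing the row index into the seed is what keeps payloads reproducible across runs while still differing from row to row.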
SyntheticImageDatasetConfig and SyntheticVideoDatasetConfig live next to the existing text config. text_tokens is canonical; prompt_tokens is accepted as an alias. resolution / aspect_ratio sugar resolves to width/height. Each deserializer peeks at the input type and refuses to claim configs explicitly marked for another deserializer, so the registry dispatch is deterministic when distinctive fields overlap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
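A minimal sketch of the config handling described in this commit, for illustration only. The function names are hypothetical, and two behaviors are assumptions rather than facts from the PR: that `resolution=720p` means a height of 720 pixels with the width derived from `aspect_ratio`, and that `aspect_ratio` falls back to `16:9` when omitted.

```python
def parse_data_string(spec: str) -> dict[str, str]:
    """Split a 'key=value,key=value' string into a dict (no quoting handled)."""
    return dict(item.split("=", 1) for item in spec.split(",") if item)


def resolve_image_config(raw: dict[str, str]) -> dict[str, object]:
    """Hypothetical resolver for a synthetic_image config."""
    cfg: dict[str, object] = dict(raw)

    # Refuse configs explicitly marked for another deserializer.
    declared = cfg.get("type")
    if declared is not None and declared != "synthetic_image":
        raise ValueError(f"config is marked for {declared!r}, not 'synthetic_image'")

    # text_tokens is canonical; prompt_tokens is accepted as an alias.
    if "prompt_tokens" in cfg:
        alias = cfg.pop("prompt_tokens")
        cfg.setdefault("text_tokens", alias)

    # Assumed sugar: '720p' is the height, width follows from the aspect ratio.
    if "resolution" in cfg:
        height = int(str(cfg.pop("resolution")).rstrip("p"))
        ratio_w, ratio_h = (int(p) for p in str(cfg.pop("aspect_ratio", "16:9")).split(":"))
        cfg.setdefault("width", round(height * ratio_w / ratio_h))
        cfg.setdefault("height", height)

    for key in ("width", "height", "text_tokens"):
        if key in cfg:
            cfg[key] = int(cfg[key])
    return cfg


if __name__ == "__main__":
    spec = "type=synthetic_image,resolution=720p,aspect_ratio=16:9,prompt_tokens=128"
    print(resolve_image_config(parse_data_string(spec)))
```

Checking the `type` field before anything else is one way to keep dispatch deterministic when two config types share field names such as `width` or `seed`.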
…video Unit tests cover synthesize_image / synthesize_video helpers (decoded dims, byte counts, reproducibility, per-row uniqueness, 1000-row cache-bust check) and the deserializers (pull 10 rows from a --data string, type-mismatch refusal, prompt_tokens alias, images_per_request). Integration test spins up the in-tree mock server and runs 'guidellm benchmark run' end-to-end with both synthetic_image and synthetic_video --data strings, asserting return code 0 and a non-empty benchmark report. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move synthetic multimodal generation out of Active Development for images and video. Audio remains WIP. Add two short --data examples (one image, one video) plus a parameter rundown for the new types. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bugs caught by Section 4 of the evaluation plan against real vLLM:
1. SyntheticImageDataset and SyntheticVideoDataset features() omitted the image/video columns from the typed schema, so dataset.column_names returned only text columns. GenerativeColumnMapper reads column_names first and never sees `image`/`video`, so the request handler builds a text-only chat completion and the image is silently dropped. TTFT was identical across 480p/720p/1080p before the fix.
2. MediaEncoder still runs on synthetic rows. It called encode_image with the already-encoded canonical dict, which raised "Unsupported image type: <class 'dict'>" and dropped every row. Made encode_image and encode_video idempotent on the canonical dict shape so re-application is a no-op.
After both fixes: resolution sweep TTFT 63.7 → 67.9 → 73.6ms (monotonic); frame sweep TTFT 94 → 211 → 376ms (monotonic, linear in frames); synth-vs-real fidelity 0.3% TTFT_p90 delta and 0.0% ITL_p50 delta.
Co-authored-by: Claude
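The idempotency fix for the second bug can be sketched as follows. This is not the PR's implementation: `encode_image_sketch` is a hypothetical stand-in for `encode_image`, and the canonical dict keys used here (`type`, `image`) are assumed for illustration, since the real shape is not shown in this conversation.

```python
import base64
from typing import Any

# Assumed canonical keys; the real encoded-dict shape may differ.
_CANONICAL_KEYS = {"type", "image"}


def _is_canonical(value: Any) -> bool:
    """True if the value already looks like an encoded image dict."""
    return isinstance(value, dict) and _CANONICAL_KEYS <= value.keys()


def encode_image_sketch(image: Any) -> dict[str, str]:
    """Encode raw bytes into the assumed canonical dict; already-encoded
    input is returned unchanged, so re-application is a no-op."""
    if _is_canonical(image):
        return image
    if isinstance(image, (bytes, bytearray)):
        payload = base64.b64encode(bytes(image)).decode("ascii")
        return {"type": "image_base64", "image": f"data:image/jpeg;base64,{payload}"}
    raise ValueError(f"Unsupported image type: {type(image)}")


if __name__ == "__main__":
    once = encode_image_sketch(b"\xff\xd8\xff\xe0")
    twice = encode_image_sketch(once)
    print("second pass is a no-op:", twice is once)
```

With this guard in place, a pipeline stage that always runs an encoder over every row no longer rejects rows that arrive pre-encoded.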
guidellm's AGENTS.md requires every AI-written test function to carry `## WRITTEN BY AI ##` at the end of its docstring. Adds the marker to all 45 new tests in the multimodal suite. Assisted-by: Claude (Anthropic)
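As a rough illustration of the convention, the sketch below shows a test function whose docstring ends with the marker, plus a small check for it. The checker `docstring_ends_with_marker` and both test functions are hypothetical and are not part of guidellm.

```python
import inspect

AI_MARKER = "## WRITTEN BY AI ##"


def docstring_ends_with_marker(func) -> bool:
    """Return True if the function's docstring ends with the AI marker."""
    doc = inspect.getdoc(func) or ""
    return doc.rstrip().endswith(AI_MARKER)


def test_marked_example():
    """Illustrative test that carries the marker.

    ## WRITTEN BY AI ##
    """
    assert 1 + 1 == 2


def test_unmarked_example():
    """Illustrative test without the marker."""
    assert True


if __name__ == "__main__":
    for fn in (test_marked_example, test_unmarked_example):
        print(fn.__name__, docstring_ends_with_marker(fn))
```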
dbutenhof left a comment
First pass -- a few documentation comments. I think this is packing too much into README.md, and should be broken out. There are also several places in the guide pages that mention synthetic (text) data that should probably be generalized.
- `--data-samples`: Number of samples to use from the dataset - use `-1` (default) for all samples with dynamic generation, or specify a positive integer to limit sample count
- `--processor`: Tokenizer or processor name used for generating synthetic data - if not provided and required for the dataset, automatically loads from the model; accepts HuggingFace model IDs or local paths

### Synthetic Multimodal Data
This section should be moved and restructured.
It's too big & specific for the README.md file.
There's a section in docs/guides/datasets.md on "Synthetic Data": this should be expanded and restructured to reference and frame the potential for synthetic image/video as well as text. But the details you have here should probably be in a new docs/guides/multimodal/synthetic_vision.md (since you cover both images and video but not really "multimodal" in that there's no audio).
Might not be a bad idea to tweak the "Multimodal Benchmarking" section in docs/guides/multimodal/index.md as well to mention synthesized visual datasets with a link to the new details.
**Key parameters:**

- `--data "type=synthetic_image,..."`: Knobs include `width`, `height`, the `resolution=720p` / `aspect_ratio=16:9` sugar, `format` (`jpeg` or `png`), `jpeg_quality`, `content` (`gradient` default, `noise`, `solid`, `checkerboard`), `images_per_request`, `text_tokens` (with the same `stdev`/`min`/`max` companions as the synthetic text mode), `output_tokens`, and `seed`. `prompt_tokens` is accepted as an alias for `text_tokens`.
"configuration options" instead of "knobs". I think you should also break this large list of options into a sub-list with a bit more context than just the name. I think I'd be tempted to take each of your "types" as a separate sub-section, to structure your new .md a bit more clearly. Something roughly like this:
# Synthetic visual data
...
## Key parameters
GuideLLM currently supports synthetic visual datasets for both image and video [...]
### Synthetic image
Use `--data "type=synthetic_image"` with the following options:
- width: control the width in pixels of the generated image
- [...]
### Synthetic video
[...]
Summary
Adds two new `--data` types, `synthetic_image` and `synthetic_video`, that let users benchmark vLLM-served VLMs (Gemma 4, Qwen3-VL, InternVL3.5, etc.) without bringing their own image or video dataset. Composes with the existing synthetic-text knobs and produces TTFT/ITL within 0.3% of real media at matched input shape on Gemma 4.

This closes the "Generation of synthetic multimodal datasets" item under Active Development in the README.
Details
- `SyntheticImageDatasetConfig` + `SyntheticImageDataset` + `SyntheticImageDatasetDeserializer` registered as `synthetic_image`
- `SyntheticVideoDatasetConfig` + `SyntheticVideoDataset` + `SyntheticVideoDatasetDeserializer` registered as `synthetic_video`
- `synthesize_image` / `synthesize_video` helpers in `guidellm.extras.vision`, sharing the canonical encoded-dict contract with `encode_image` / `encode_video`
- `encode_image` / `encode_video` now idempotent on the canonical dict (no-op if input already encoded)
- `SeedSequence([seed, row_index])` (cross-platform deterministic, byte-different per row to defeat the mm-processor cache)
- `content` modes: `gradient` (default), `noise`, `solid`, `checkerboard`
- `images_per_request > 1` emits `image_0`, `image_1`, ... matching the existing column-mapper defaults
- `pyproject.toml`: `imageio[ffmpeg]` added to the `vision` extra
- `## WRITTEN BY AI ##` markers

Levers exposed
- `width`, `height` (or `resolution` + `aspect_ratio`)
- `frames`, `fps` (video)
- `format` (`jpeg` / `mp4`)
- `jpeg_quality`, `video_bitrate`
- `content` (`gradient`; `noise` for worst-case wire size)
- `text_tokens` (+ stdev/min/max)
- `output_tokens`
- `images_per_request`
- `seed`

Example invocations
Test Plan
- `tox -e test-unit -- tests/unit/data/deserializers/test_synthetic_multimodal.py`
- `tox -e test-integration -- tests/integration/data/test_synthetic_multimodal_benchmark.py`
- `guidellm benchmark run` invocation against the in-tree mock server, end-to-end through the data pipeline + chat-completions request handler

End-to-end validation against real vLLM serving
`google/gemma-4-E4B-it`:

Full evaluation methodology and per-section results are in the linked status doc.
Related Issues
`README.md`
Use of AI
Code and tests were drafted by Claude under my direction, then validated against real Gemma 4 inference on vLLM. The validation caught two real bugs in the initial draft, both fixed in `4ffa586` / current `1822225`:
- `features()` in both deserializers declared text columns only, so `GenerativeColumnMapper` never saw `image`/`video` (dataset.column_names was text-only) and the request handler silently built text-only chat completions. TTFT was flat across all resolutions before the fix.
- `MediaEncoder` still ran on synthetic rows and called `encode_image` with the already-encoded canonical dict, raising `Unsupported image type: <class 'dict'>` and dropping every row. Fixed by making `encode_image` / `encode_video` idempotent on the canonical dict shape.

I have reviewed every line of the diff and am the submitter of record.