Figure 1: Timeline of Multi-Image Generation Methods. The timeline presents the chronological development of methods organized by their respective release years.
This repository provides a comprehensive collection of resources related to multi-image generation, featuring:
- A curated list of methods organized by consistency dimensions
- Categorized datasets for multi-view, character, temporal, and semantic consistency research
- Benchmarks for evaluating multi-image generation quality across different consistency types
It is designed to help researchers and practitioners explore, compare, and build state-of-the-art multi-image generation systems.
- What is Multi-Image Generation?
- Methods
- Datasets
- Benchmarks
- Applications
- Contributing
- License
- Citation
Multi-Image Generation refers to the task of generating multiple images with inherent correlations and consistency constraints. Unlike traditional single-image generation, multi-image generation requires maintaining coherence across multiple outputs along one or more dimensions, such as geometric structure, identity attributes, temporal continuity, or semantic relationships. This repository collects methods organized by consistency dimensions, reflecting the primary type of coherence each approach aims to achieve.
Figure 2: Example of multi-view consistency. SyncDreamer generates multi-view consistent images from a single input.
Figure 3: Example of character consistency using StoryMaker. The first three rows show a day in the life of an office worker, and the last two rows are based on Before Sunrise.
Original Input → Step forward → Look up to sky → Zoom out
Figure 4: Example of temporal consistency. iMontage generates sequential images and maintains temporal consistency across the generated transitions.
Figure 5: Example of semantic consistency. Wan-2.7-Image transforms a single reference image into nine cohesive comic panels.
Multi-view consistency methods generate multiple images of the same 3D object or scene from different viewpoints while maintaining geometric coherence. This is inherently a multi-image task: it requires producing a set of views that correspond to the same underlying 3D structure, with cross-view constraints ensuring consistency across all generated perspectives.
Character consistency methods aim to generate images of one or more subjects while preserving their identity and distinguishing features, such as facial attributes. This is inherently a multi-image problem: the same subject must remain recognizable across different scenes or contexts. It is widely studied in applications such as storyboards and narratives.
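As a rough illustration of how "recognizable across images" might be quantified, one simple proxy is the mean pairwise cosine similarity between per-image identity embeddings. The sketch below is only that, a sketch: it is not the metric of any benchmark listed in this repository, the function names are hypothetical, and the toy vectors stand in for embeddings that would normally come from a face or subject encoder of your choice.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero vectors")
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over all unordered pairs of images."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy identity embeddings for three generated images (hypothetical values).
# The first two images agree on identity; the third has drifted.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
score = mean_pairwise_similarity(toy_embeddings)
print(f"mean pairwise identity similarity: {score:.3f}")
```

A score near 1 suggests the subject looks alike in every image under the chosen encoder, while a lower score flags identity drift in at least one image.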
Temporal consistency methods address a multi-image generation task in that they produce a sequence of images or video frames over time. Each image or frame is generated conditioned on the preceding ones, which requires smooth transitions and coherent motion so that the temporal and physical dynamics of the sequence are preserved.
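One common way to write down the conditioning described above is an autoregressive factorization. The notation is an assumption introduced here for illustration and is not taken from any specific method in this list: $x_t$ denotes the $t$-th image or frame, $T$ the sequence length, and $c$ an optional condition such as a text prompt.

```math
p(x_1, \dots, x_T \mid c) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1}, c)
```

Each factor models one frame given all earlier frames, which is where the requirement for smooth transitions enters.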
Semantic consistency is essential for multi-image generation. These methods ensure that multiple generated images maintain coherent layouts, logical semantic relationships, and overall scene structure. In tasks like controllable generation, iterative image editing, and multi-region editing, semantic consistency provides the structural constraints needed to prevent conflicting content across different outputs.
| Dataset | Samples | Paper | Venue | Date |
|---|---|---|---|---|
| Griffin | ~30,000 frames, 270,000 images | Griffin: Aerial-Ground Cooperative Detection and Tracking Dataset and Benchmark | AAAI | 2025-03 |
| MVImgNet2.0 | 520k | MVImgNet2.0: A Larger-scale Dataset of Multi-view Images | SIGGRAPH Asia | 2024-12 |
| OpenMaterial | 1,001 | OpenMaterial: A Large-scale Dataset of Complex Materials for 3D Reconstruction | arXiv | 2024-06 |
| Objaverse-XL | 10M+ | Objaverse-XL: A Universe of 10M+ 3D Objects | arXiv | 2023-07 |
| Objaverse | 800k+ | Objaverse: A Universe of Annotated 3D Objects | CVPR | 2022-11 |
| CO3D | 19K | Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction | ICCV | 2021-09 |
| Dataset | Samples | Paper | Venue | Date |
|---|---|---|---|---|
| GenVID | 80k | Artifact-Aware Evaluation for High-Quality Video Generation | arXiv | 2026-01 |
| SeqBench | 320 prompts, 2,560 videos | SeqBench: Benchmarking Sequential Narrative Generation in Text-to-Video Models | arXiv | 2025-10 |
| GeneVA | 5,452 prompts, 16,356 videos | GeneVA: A Dataset of Human Annotations for Generative Text to Video Artifacts | arXiv | 2025-09 |
| BrokenVideos | 3,254 | BrokenVideos: A Benchmark Dataset for Fine-Grained Artifact Localization in AI-Generated Videos | arXiv | 2025-06 |
| FlintstonesSV | 20k | FlintstonesSV++: Improving Story Narration using Visual Scene Graph | ECIR | 2025-04 |
| MovieBench | from 160 movies | MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation | CVPR | 2024-06 |
| PororoSV | 14k+ | StoryGAN: A Sequential Conditional GAN for Story Visualization | arXiv | 2018-12 |
| Dataset | Samples | Paper | Venue | Date |
|---|---|---|---|---|
| MICo-150K | 150k | MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition | CVPR | 2025-12 |
| Echo-4o | 180k | Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation | arXiv | 2025-08 |
| SynCD | 90k | Generating Multi-Image Synthetic Data for Text-to-Image Customization | ICCV | 2025-02 |
| LAION-SG | 482k | LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations | arXiv | 2024-12 |
| Name | Paper | Venue | Date | Code |
|---|---|---|---|---|
| Charge | Charge: A Comprehensive Novel View Synthesis Benchmark and Dataset to Bind Them All | arXiv | 2025-12 | - |
| MVGBench | MVGBench: A Comprehensive Benchmark for Multi-view Generation Models | ICCV | 2025-07 | GitHub |
| Griffin | Griffin: Aerial-Ground Cooperative Detection and Tracking Dataset and Benchmark | AAAI | 2025-03 | GitHub |
| MEt3R | MEt3R: Measuring Multi-View Consistency in Generated Images | CVPR | 2025-01 | GitHub |
| Robust Multi-View Depth | A Benchmark and a Baseline for Robust Multi-view Depth Estimation | 3DV | 2022-09 | GitHub |
| Name | Paper | Venue | Date | Code |
|---|---|---|---|---|
| SeqBench | SeqBench: Benchmarking Sequential Narrative Generation in Text-to-Video Models | arXiv | 2025-10 | GitHub |
| World Consistency Score | World Consistency Score: A Unified Metric for Video Generation Quality | ICCV | 2025-08 | GitHub |
| VBench++ | VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models | TPAMI | 2024-11 | GitHub |
| TC-Bench | TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation | arXiv | 2024-06 | GitHub |
| MovieBench | MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation | CVPR | 2024-06 | GitHub |
| VBench | VBench: Comprehensive Benchmark Suite for Video Generative Models | CVPR | 2023-11 | GitHub |
| Name | Demo |
|---|---|
| World Labs | Demo |
| Skybox AI | Demo |
| Luma | Demo |
| Canva AI | Demo |
| Kling AI | Demo |
| Runway | Demo |
| Multiverse | Demo |
| Rodin | Demo |
Contributions are welcome! Please feel free to submit a Pull Request. When adding new papers, please follow these rules:
- Ensure the paper is relevant to multi-image generation.
- Insert the new entry in reverse chronological order (newest first).
- Add links to the paper and code (if available).
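The ordering rule above can be checked mechanically before opening a Pull Request. The following is a minimal sketch, not part of this repository's tooling: the function names are hypothetical, it assumes each data row carries exactly one `YYYY-MM` cell, and the example rows are made up.

```python
import re

# Matches date cells such as "2025-03".
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}$")


def extract_dates(table_lines):
    """Return the YYYY-MM cell found in each data row of a markdown table."""
    dates = []
    for line in table_lines:
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        matches = [cell for cell in cells if DATE_PATTERN.match(cell)]
        if matches:  # header and separator rows carry no date cell
            dates.append(matches[0])
    return dates


def out_of_order_rows(table_lines):
    """Return (previous, current) date pairs that break newest-first order."""
    dates = extract_dates(table_lines)
    # YYYY-MM strings sort chronologically, so plain string comparison works.
    return [(prev, cur) for prev, cur in zip(dates, dates[1:]) if cur > prev]


# Hypothetical table: the last row is newer than the row above it.
example_table = [
    "| Dataset | Samples | Paper | Venue | Date |",
    "|---|---|---|---|---|",
    "| Example-A | 10k | Hypothetical Paper A | arXiv | 2025-12 |",
    "| Example-B | 20k | Hypothetical Paper B | arXiv | 2024-12 |",
    "| Example-C | 30k | Hypothetical Paper C | arXiv | 2025-02 |",
]
print(out_of_order_rows(example_table))
```

An empty list means the table is already sorted newest first; each reported pair points at a row that should move up.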
This project is released under the Apache License 2.0 (http://www.apache.org/licenses/LICENSE-2.0, SPDX-License-Identifier: Apache-2.0).
If you find this repo helpful for your research, please cite our paper:
A Survey on Multi-Image Generation: Advances, Challenges, and Future Directions
@article{chen2026survey,
title={A Survey on Multi-Image Generation: Advances, Challenges, and Future Directions},
author={Chen, Qirui and Wang, Guo-Hua and Chen, Jinyuan and Chen, Qing-Guo and Zhang, Jun and Luo, Weihua},
year={2026}
}






