
OCRGenBench: A Comprehensive Benchmark for Evaluating OCR Generative Capabilities


OCRGenBench is the most comprehensive benchmark to date for evaluating the OCR generative capabilities of generative models. It is the first to unify text-to-image (T2I) generation, text editing, and OCR-related image-to-image (I2I) translation, comprehensively reflecting a model's visual text synthesis abilities, which we refer to as OCR generative capabilities. The benchmark covers 5 common text categories and 33 OCR generative tasks, comprising 1,060 challenging, human-annotated samples with dense text, varied layouts, multiple aspect ratios, and bilingual content. We additionally introduce a unified evaluation metric, OCRGenScore, which assesses text accuracy, instruction following, visual quality, and structural consistency in visual text synthesis.

This repository was formerly known as Awesome Generative Models for OCR, which included only an empirical evaluation of the image generation capabilities of 7 models. We have since expanded it into a full benchmark and evaluation framework, enabling reproducible evaluation of any number of models.

πŸ“ƒ News

πŸ“Œ Pinned

  • πŸš€ [April 2026] Dataset and evaluation scripts are published.

  • πŸš€ [March 2026] Our online leaderboard is now live: OCRGenBench Leaderboard.

  • πŸŽ‰ [March 2026] We propose OCRGenBench, a comprehensive benchmark for evaluating multi-dimensional OCR generative capabilities (visual text synthesis capabilities).

  • πŸŽ‰ [August 2025] Evaluation of Qwen-Image has been added; results are updated in the paper.

  • πŸŽ‰ [July 2025] Our paper is available on arXiv! Citations and stars are welcome if you find our work useful. 😊

  • πŸ”₯ [June 2025] Expanded evaluation now includes a diverse set of closed-source and open-source models!

  • πŸ“’ [March 2025] Initial evaluation of GPT-4o's image generation capabilities is now available!

🍺 Environment

git clone https://github.com/NiceRingNode/Awesome-Generative-Models-for-OCR.git
cd Awesome-Generative-Models-for-OCR
conda create -n ocrgen python=3.11.8
conda activate ocrgen
pip install -r requirements.txt

πŸš€ Run Evaluation

The evaluation pipeline consists of two stages: generating result images and computing metrics.

βš’οΈ Data Preparation

Download the OCRGenBench benchmark from HuggingFace.

Generate the data specification file data/test_cases.json using:

python process.py

The resulting data/test_cases.json follows this format:

[
    {
        "id": 0,
        "field": "slide",
        "task": "T2I",
        "prompt": "...",
        "input_image_path_1": null,
        "input_image_path_2": null,
        "task_type": "generation",
        "output_path": "./output/holder/slide/T2I/1.png"
    },
    ...
]
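As a minimal sketch, a spec file in this format can be consumed as follows. The helper below is illustrative and not part of the repository; only the JSON schema above is taken from the source.

```python
import json
import os

def load_test_cases(path):
    """Load the data specification and pre-create output directories.

    Illustrative helper: the repository's own scripts may organize this
    differently; only the JSON field names come from test_cases.json.
    """
    with open(path, encoding="utf-8") as f:
        cases = json.load(f)
    # Make sure every output directory exists before inference writes to it.
    for case in cases:
        os.makedirs(os.path.dirname(case["output_path"]), exist_ok=True)
    # Split cases by task_type so T2I and editing tasks can be routed
    # to different model interfaces.
    generation = [c for c in cases if c["task_type"] == "generation"]
    editing = [c for c in cases if c["task_type"] != "generation"]
    return generation, editing
```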

πŸ–ΌοΈ Generating Result Images

Generating Images from Open-Source Models

  1. Download pretrained weights from HuggingFace using download.py. Model weights will be saved under models/.

     python download.py --repo huggingface/model_name

  2. Create a directory for your model and define a dedicated __init__.py. In this file, implement a T2I generation function xxx_generate_func and an image editing function xxx_edit_func. The following example uses Longcat-Image:

     mkdir longcat
     cd longcat
     touch __init__.py

  3. In __init__.py, implement the two interface functions:

     def longcat_generate_func(input_prompt=None):
         ...

     def longcat_edit_func(input_prompt=None, input_image_path1=None, input_image_path2=None):
         ...

  4. Register the functions in inference.py:

     generate_func, edit_func = None, None

     def get_model_functions(model_name):
         global generate_func, edit_func
         if model_name == 'longcat':
             from longcat import longcat_generate_func as generate_func, longcat_edit_func as edit_func
         elif model_name == 'your_model_name':
             ...

  5. Run inference:

     CUDA_VISIBLE_DEVICES=0 python inference.py --model longcat

Multiple GPUs can be used for inference if needed.
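Because each model exposes the same two interface functions under a fixed naming scheme (xxx_generate_func / xxx_edit_func), the registration step can also be sketched as a small lookup table. This is an illustrative alternative to the global-variable pattern shown above, not the repository's actual inference.py:

```python
from importlib import import_module

# Map each model name to the package that defines its two interface
# functions, <name>_generate_func and <name>_edit_func.
# (Illustrative registry; extend it with your own model packages.)
MODEL_MODULES = {
    "longcat": "longcat",
    # "your_model_name": "your_model_package",
}

def get_model_functions(model_name):
    """Resolve (generate_func, edit_func) for a registered model."""
    module = import_module(MODEL_MODULES[model_name])
    generate_func = getattr(module, f"{model_name}_generate_func")
    edit_func = getattr(module, f"{model_name}_edit_func")
    return generate_func, edit_func
```

New models then only need an entry in the table plus the two conventionally named functions, with no edits to the dispatch logic itself.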

Generating Images from Closed-Source Models

We currently support Nano Banana Pro, GPT-Image-1.5, and Seedream-4.5. To evaluate these models or their variants (e.g., Nano Banana 2, GPT-Image-1, Seedream-5.0), set the corresponding API credentials:

export OPENAI_BASE_URL='https://xxx'
export OPENAI_API_KEY='sk-xxx'

or

export GEMINI_BASE_URL='https://xxx'
export GEMINI_API_KEY='sk-xxx'

Then run:

python inference_api.py --model gemini-3-pro-image-preview

Other closed-source models can be evaluated by following the same procedure and modifying inference_api.py accordingly.
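The credential selection above can be sketched as a small helper that picks the right environment-variable pair from the model name. The prefix-based routing below is an assumption for illustration; inference_api.py may route models differently:

```python
import os

def resolve_api_credentials(model_name, env=None):
    """Pick the base-URL/key pair for a given model family.

    Gemini-style model names use the GEMINI_* variables; everything
    else falls back to OPENAI_*. (Routing rule assumed for illustration.)
    """
    env = os.environ if env is None else env
    prefix = "GEMINI" if model_name.startswith("gemini") else "OPENAI"
    base_url = env.get(f"{prefix}_BASE_URL")
    api_key = env.get(f"{prefix}_API_KEY")
    if not base_url or not api_key:
        raise RuntimeError(f"Set {prefix}_BASE_URL and {prefix}_API_KEY first")
    return base_url, api_key
```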

πŸ“Š Computing Metrics

Export your OpenAI credentials for VIEScore, an LLM-as-Judge metric backed by GPT-5:

export OPENAI_BASE_URL='https://xxx'
export OPENAI_API_KEY='sk-xxx'

Compute all metrics with:

python eval.py --model model_name  # e.g., gemini-3-pro-image-preview

[Todo] eval.py will soon support exporting .xlsx files for leaderboard submission.
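OCRGenScore combines the four dimensions listed above (text accuracy, instruction following, visual quality, structural consistency). As a rough sketch of the aggregation stage, the snippet below averages per-sample sub-scores into per-task scores; the equal weighting and record layout are assumptions for illustration, not the paper's exact formula:

```python
from statistics import mean

def aggregate_scores(records):
    """Average per-sample sub-scores into one score per task.

    Each record is assumed to carry the four evaluation dimensions;
    equal weighting is an illustrative assumption, consult the paper
    for the actual OCRGenScore definition.
    """
    per_task = {}
    for r in records:
        sample_score = mean([
            r["text_accuracy"],
            r["instruction_following"],
            r["visual_quality"],
            r["structural_consistency"],
        ])
        per_task.setdefault(r["task"], []).append(sample_score)
    return {task: mean(scores) for task, scores in per_task.items()}
```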

🌏 Data Categorization

OCRGenBench encompasses five major text scenarios and 33 OCR generative tasks, covering T2I generation, text editing, and OCR I2I translation. The full categorization is illustrated below:

Mindmap

🍨 Data Distribution

OCRGenBench comprises 1,060 high-quality, manually annotated samples. Their distribution is shown below:

Distribution

πŸŽ“ Leaderboard

Performance by task (main leaderboard)

Leaderboard Task 1

Leaderboard Task 2


View the full interactive leaderboard: OCRGenBench Leaderboard

πŸ“‹ Citation

If you find our work helpful, please consider citing our paper:

@article{zhang2025ocrgenbench,
  title={{OCRGenBench: A Comprehensive Benchmark for Evaluating OCR Generative Capabilities}},
  author={Zhang, Peirong and Xu, Haowei and Zhang, Jiaxin and Zheng, Xuhan and Xu, Guitao and Zhang, Yuyi and Liu, Junle and Yang, Zhenhua and Zhou, Wei and Jin, Lianwen},
  journal={arXiv preprint arXiv:2507.15085},
  year={2025}
}

πŸ“§ Contact

For questions or collaborations, please reach out to: eeprzhang@mail.scut.edu.cn

🌊 Acknowledgement

We gratefully acknowledge the following open-source projects used for metric computation: VIEScore and DocAligner-Distortion.

Copyright 2025–2026, Deep Learning and Vision Computing (DLVC) Lab, South China University of Technology.

⭐ Star History

Star History
