VLM fine-tuning for Document IE

⭐ Fine-tuning VLMs on synthetic data for Visual Form Document Information Extraction. ⭐

This repository was created for a project that evaluated the performance gain from fine-tuning on mostly synthetic data. It can now be used to run your own fine-tuning experiments. Here is how:

1. ⚙️ Setup

The code uses unsloth and vllm, which can have conflicting dependencies. It is recommended to use conda and uv and to install the exact pinned packages with the commands that follow. Optionally, use setup_symlinks.sh to create symlinks for the models/, outputs/, and .cache/ directories so that they can be kept in a different storage directory.

# 1. Setup
conda env create -f conda_vlm.yml
uv pip install -r requirements.lock

# 2. (Optional) configure storage
bash setup_symlinks.sh /path/to/storage

# 3. Run full pipeline
python main.py \
    train.model_for_training='unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit'

2. 🚀 Pipeline

The main entry point is main.py. It runs the whole pipeline of training, inference, and evaluation automatically. Start it with the required argument train.model_for_training:

python main.py \
    train.model_for_training='unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit'

The configuration is managed with hydra. To change settings for a run, override them on the command line or edit the files in config/. Each pipeline stage has its own config namespace (train, predict, evaluate), and overrides must be prefixed accordingly:

python main.py \
    +train.dataset=bigdataset \
    +predict.dataset=smalldataset
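
To make the namespacing concrete, the following minimal sketch shows how dotted overrides such as the ones above end up in a nested configuration. It is an illustration in plain Python, not hydra's implementation: the function `apply_overrides` and the default values are hypothetical, and the sketch ignores hydra's distinction between overriding an existing key and appending a new one with `+`.

```python
from typing import Any


def apply_overrides(config: dict[str, Any], overrides: list[str]) -> dict[str, Any]:
    """Apply 'a.b=value' style overrides to a nested dict (illustration only)."""
    for item in overrides:
        # Drop a leading '+' and split 'train.dataset=bigdataset' at the first '='.
        key, _, value = item.lstrip("+").partition("=")
        *parents, leaf = key.split(".")
        node = config
        for name in parents:
            # Walk into the stage namespace, creating it if needed.
            node = node.setdefault(name, {})
        node[leaf] = value
    return config


# Hypothetical defaults; the real values live in the repository's config files.
config = {"train": {"dataset": "default"}, "predict": {"dataset": "default"}}
apply_overrides(config, ["+train.dataset=bigdataset", "+predict.dataset=smalldataset"])
print(config["train"]["dataset"])    # bigdataset
print(config["predict"]["dataset"])  # smalldataset
```

An override without the stage prefix (for example `dataset=bigdataset`) would land at the top level in this sketch and would not reach either stage, which is why the prefix matters when calling main.py.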

Each pipeline step can be run alone by calling the corresponding file in src/.

2.1 📦 Outputs

models/     # trained models
outputs/    # logs, predictions, and evaluation results
.cache/     # intermediate data for unsloth

2.2 🏋️ Train

python src/train.py \
    model_for_training='models/<model_dir>' \
    dataset=dataset_name

For training, select a model and a dataset. Take care to select an unsloth hyperparameter set that fits your experiment and model.

2.3 🔮 Predict

python src/predict.py \
    model_for_inference='models/<model_dir>' \
    dataset=dataset_name

For inference, select a local model or a Hugging Face model and a dataset. Take care to select a vllm config that fits the model. The optional argument stop=foo stops processing the dataset after foo samples.

2.4 🧹 Postprocess

python src/postprocess.py \
    predictions_to_postprocess='outputs/<model_dir>/predictions.json'

Predictions can optionally be postprocessed. At the moment, the Postprocessor module is empty; implement your own postprocessor to use this step.
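
As a starting point, a postprocessor might normalize predictions per field type. The sketch below is only an assumption about what such a class could look like: the method name `process`, its signature, and the checkbox vocabulary are hypothetical and are not taken from the repository.

```python
class Postprocessor:
    """Hypothetical postprocessor; the interface is an assumption, not the repo's."""

    # Assumed set of strings a model might emit for a ticked checkbox.
    CHECKED = {"true", "yes", "x", "checked", "1"}

    def process(self, prediction: str, field_type: str) -> str:
        # Collapse runs of whitespace and trim both ends.
        text = " ".join(prediction.split())
        if field_type == "Checkbox":
            # Map free-form answers onto two canonical values (assumed format).
            return "True" if text.lower() in self.CHECKED else "False"
        if field_type == "Combfield":
            # Comb fields hold one character per box, so inner spaces are dropped.
            return text.replace(" ", "")
        return text


post = Postprocessor()
print(post.process("  A B 1 2 ", "Combfield"))  # AB12
print(post.process(" Yes ", "Checkbox"))        # True
```

Whether the ground truth of your dataset encodes checkboxes as `True`/`False` is something to check before adopting this normalization.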

2.5 📊 Eval

python src/evaluate.py \
    predictions_to_evaluate='outputs/<model_dir>/predictions.json' \
    skip_chatgpt=False

The evaluation step calculates metrics for every field type. For Freetext fields, it computes a Likert score with ChatGPT acting as an LLM-as-a-judge. For this to work, a .chatgpt-key file must exist in the repository root. Use the option skip_chatgpt=True to skip the Likert score.
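
The idea of scoring each field type separately can be sketched as follows. Exact-match accuracy is used here purely as a stand-in, since the metrics the repository actually computes are not listed in this README; the function name and the `prediction` key are hypothetical.

```python
from collections import defaultdict


def accuracy_per_field_type(rows: list[dict[str, str]]) -> dict[str, float]:
    """Exact-match accuracy grouped by field_type (stand-in metric)."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row["field_type"]] += 1
        # Compare after trimming surrounding whitespace only.
        if row["prediction"].strip() == row["groundtruth"].strip():
            hits[row["field_type"]] += 1
    return {field_type: hits[field_type] / totals[field_type] for field_type in totals}


rows = [
    {"field_type": "Checkbox", "groundtruth": "True", "prediction": "True"},
    {"field_type": "Checkbox", "groundtruth": "False", "prediction": "True"},
    {"field_type": "Combfield", "groundtruth": "12345", "prediction": "12345"},
]
scores = accuracy_per_field_type(rows)
print(scores)  # {'Checkbox': 0.5, 'Combfield': 1.0}
```

Exact match is a poor fit for Freetext answers, where wording can vary without changing the meaning, which is presumably why a judged Likert score is used for that type.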

2.6 📈 Comparing different runs

The script compare_runs.py is not part of the pipeline. It creates plots that compare more than one model. Pass it a list of the results_per_sample.csv files created during evaluation:

python compare_runs.py \
    results_to_compare='[outputs/<run_dir>/results/results_per_sample.csv, outputs/<run_dir>/results/results_per_sample.csv]'
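
Before plotting, a comparison of this kind usually reduces each run to a few summary numbers. The sketch below shows that aggregation step with the standard library only. The column names `field_id` and `score` are hypothetical, since the layout of results_per_sample.csv is not documented here, and inline strings stand in for the files.

```python
import csv
import io
from statistics import mean


def mean_score(csv_text: str, column: str = "score") -> float:
    """Average one numeric column of a per-sample results table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return mean(float(row[column]) for row in reader)


# Hypothetical per-sample results of two runs.
run_a = "field_id,score\nname,1.0\ndate,0.5\n"
run_b = "field_id,score\nname,1.0\ndate,1.0\n"

summary = {"run_a": mean_score(run_a), "run_b": mean_score(run_b)}
print(summary)  # {'run_a': 0.75, 'run_b': 1.0}
```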

3. 🗂️ Dataset requirements

This project was originally built with a proprietary dataset. To run the whole pipeline, your dataset must conform to certain criteria.

The pipeline expects a Hugging Face dataset with these columns:

Column	Type	Description
`image`	array	PIL image of the PDF page
`file`	string	the name of the file
`field_id`	string	a unique id for the current field
`groundtruth`	string	the value of the field
`field_type`	string	the type of the field

Additionally, every field must have one of three types: Freetext, Combfield, or Checkbox. Each type is scored separately during evaluation.
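
A quick way to check a dataset against these requirements is to validate a few rows before starting a run. The following sketch works on plain dictionaries so that it stays independent of any dataset library; `validate_row` is a hypothetical helper, not part of the repository.

```python
REQUIRED_COLUMNS = {"image", "file", "field_id", "groundtruth", "field_type"}
FIELD_TYPES = {"Freetext", "Combfield", "Checkbox"}


def validate_row(row: dict) -> list[str]:
    """Return a list of problems found in one dataset row (empty if none)."""
    problems = []
    missing = REQUIRED_COLUMNS - set(row)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if row.get("field_type") not in FIELD_TYPES:
        problems.append(f"unknown field_type: {row.get('field_type')!r}")
    return problems


good = {"image": object(), "file": "form_001.pdf", "field_id": "f_017",
        "groundtruth": "12345", "field_type": "Combfield"}
bad = {"file": "form_002.pdf", "field_id": "f_018",
       "groundtruth": "x", "field_type": "Signature"}

print(validate_row(good))  # []
print(validate_row(bad))   # ["missing columns: ['image']", "unknown field_type: 'Signature'"]
```

The field type comparison is case-sensitive in this sketch; whether the pipeline itself is case-sensitive is not stated above.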

You can add a JSON file resources/name_mapping.json that maps each field_id value in the dataset to a more descriptive name for the field, which can improve the extraction capabilities of the model. To mark files as machine-written, add a file-name pattern in config/file_names/base.yaml. This pattern is used during evaluation.
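
The mapping file could look like the example in the sketch below, assuming a flat JSON object from field_id to description. That layout, the example ids, and the helper `describe_field` are assumptions for illustration; check how the pipeline reads the file before relying on them.

```python
import json

# Hypothetical content of resources/name_mapping.json (flat id -> description).
mapping_json = """
{
  "f_017": "Date of birth of the applicant",
  "f_018": "Postal code"
}
"""
name_mapping = json.loads(mapping_json)


def describe_field(field_id: str) -> str:
    # Fall back to the raw id when no description is available.
    return name_mapping.get(field_id, field_id)


print(describe_field("f_017"))  # Date of birth of the applicant
print(describe_field("f_999"))  # f_999
```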
