⭐ Fine-tuning VLMs on synthetic data for Visual Form Document Information Extraction. ⭐
This repository was created for a project that evaluated the performance gain from fine-tuning with mostly synthetic data. It can now be used to run your own fine-tuning experiments, as described below.
The code uses unsloth and vllm, which can have conflicting dependencies. It is recommended to use conda and uv and to install the exact pinned packages, as in step 1 below.

Optionally, use setup_symlinks.sh to keep the models/, outputs/, and .cache/ directories in a different storage location and symlink them into the repository, as in step 2 below.
# 1. Setup
conda env create -f conda_vlm.yml
uv pip install -r requirements.lock
# 2. (Optional) configure storage
bash setup_symlinks.sh /path/to/storage
# 3. Run full pipeline
python main.py \
train.model_for_training='unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit'

The main entry point for the pipeline is main.py. It runs the whole pipeline of training, inference, and evaluation. Start it with the required argument train.model_for_training:

python main.py \
train.model_for_training='unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit'

The configuration is managed with hydra. To change settings for a run, override them on the command line or edit the files in config/. Each pipeline stage has its own config namespace (train, predict, evaluate), so overrides passed to main.py must be prefixed accordingly:

python main.py \
+train.dataset=bigdataset \
+predict.dataset=smalldataset

Each pipeline step can be run alone by calling the corresponding file in src/.
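
The commands in this README use a stage prefix when calling main.py (for example train.dataset=...) and no prefix when calling a stage script directly (for example dataset=... for src/train.py). The following plain-Python sketch only illustrates that naming scheme. It is not how hydra or main.py route the values, and the function name split_overrides is hypothetical.

```python
STAGES = ("train", "predict", "evaluate")


def split_overrides(overrides):
    """Group '<stage>.<key>=<value>' overrides by stage and drop the stage prefix."""
    per_stage = {stage: [] for stage in STAGES}
    for override in overrides:
        # A leading '+' is hydra syntax for adding a key; keep it unchanged.
        stripped = override.lstrip("+")
        plus = override[: len(override) - len(stripped)]
        key, equals, value = stripped.partition("=")
        stage, dot, rest = key.partition(".")
        if stage not in STAGES or not dot or not equals:
            raise ValueError(f"expected '<stage>.<key>=<value>', got {override!r}")
        per_stage[stage].append(f"{plus}{rest}={value}")
    return per_stage


grouped = split_overrides([
    "train.model_for_training=unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit",
    "+train.dataset=bigdataset",
    "+predict.dataset=smalldataset",
])
for stage, stage_overrides in grouped.items():
    print(stage, stage_overrides)
```

An override without a known stage prefix raises an error in this sketch, which mirrors the rule that main.py overrides must be prefixed.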
The pipeline uses the following directories:

models/ # trained models
outputs/ # logs, predictions, and evaluation results
.cache/ # intermediate data for unsloth

python src/train.py \
model_for_training='models/<model_dir>' \
dataset=dataset_name

For training, select a model and a dataset. Take care to select an unsloth hyperparameter set that fits your experiment and model.

python src/predict.py \
model_for_inference='models/<model_dir>' \
dataset=dataset_name

For inference, select a local model or a Hugging Face model and a dataset. Take care to select a vllm config that fits the model. The optional argument stop=foo stops inference after foo samples of the dataset.

python src/postprocess.py \
predictions_to_postprocess='outputs/<model_dir>/predictions.json'

Predictions can optionally be postprocessed. At the moment the Postprocessor module is empty; implement a postprocessor to use this step.
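
As a starting point, a postprocessor could normalise the raw model output per field type. The sketch below is purely illustrative: the interface the repository expects from Postprocessor is not documented here, so the class name ExamplePostprocessor, the method postprocess, its signature, and the checkbox labels "checked" and "unchecked" are all assumptions.

```python
class ExamplePostprocessor:
    """Hypothetical postprocessor that cleans one prediction string."""

    # Assumed spellings a model might use for a ticked checkbox.
    CHECKBOX_TRUE = {"yes", "true", "checked", "x"}

    def postprocess(self, prediction, field_type):
        # Collapse runs of whitespace and trim both ends.
        text = " ".join(prediction.split())
        if field_type == "Checkbox":
            # Map free-form answers onto two assumed canonical labels.
            return "checked" if text.lower() in self.CHECKBOX_TRUE else "unchecked"
        if field_type == "Combfield":
            # Comb fields hold one character per box, so drop the gaps.
            return text.replace(" ", "")
        return text


postprocessor = ExamplePostprocessor()
print(postprocessor.postprocess("  Yes ", "Checkbox"))
print(postprocessor.postprocess("1 2 3 4", "Combfield"))
```

Adapt the method to whatever structure predictions.json actually has before wiring it into src/postprocess.py.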
python src/evaluate.py \
predictions_to_evaluate='outputs/<model_dir>/predictions.json' \
skip_chatgpt=False

The evaluation step calculates metrics for every type of field. For freetext fields, a Likert score is calculated with ChatGPT as an LLM-as-a-judge. For this to work, a .chatgpt-key file must exist in the repository root. Use the option skip_chatgpt=True to skip the Likert score.
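
To illustrate what per-type metrics can look like, here is a minimal sketch that computes exact-match accuracy for each field type. Exact match is chosen only for illustration and may differ from the metrics this repository computes. The record key prediction and the function name are assumptions.

```python
from collections import defaultdict


def accuracy_per_field_type(records):
    """Return exact-match accuracy for each field_type found in records."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        field_type = record["field_type"]
        totals[field_type] += 1
        # Ignore surrounding whitespace when comparing the two strings.
        if record["prediction"].strip() == record["groundtruth"].strip():
            hits[field_type] += 1
    return {field_type: hits[field_type] / totals[field_type] for field_type in totals}


# Made-up records, only to show the expected shape.
records = [
    {"field_type": "Checkbox", "groundtruth": "checked", "prediction": "checked"},
    {"field_type": "Checkbox", "groundtruth": "unchecked", "prediction": "checked"},
    {"field_type": "Combfield", "groundtruth": "12345", "prediction": "12345 "},
]
scores = accuracy_per_field_type(records)
print(scores)
```

Freetext answers rarely match character for character, which is why a judged Likert score is a more suitable measure for that type.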
The script compare_runs.py is not part of the pipeline. It creates plots that compare more than one model. Pass it a list of the results_per_sample.csv files created during evaluation:

python compare_runs.py \
results_to_compare='[outputs/<run_dir>/results/results_per_sample.csv, outputs/<run_dir>/results/results_per_sample.csv]'

This project was originally built with a proprietary dataset. To run the whole pipeline, your dataset must conform to certain criteria.
The pipeline expects a Hugging Face dataset with these columns:
| Column | Type | Description |
|---|---|---|
| image | array | PIL image of the PDF page |
| file | string | the name of the file |
| field_id | string | a unique id for the current field |
| groundtruth | string | the value of the field |
| field_type | string | which type the field is |

Additionally, each field must be of one of three types: Freetext, Combfield, or Checkbox. Each type is evaluated separately.
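
Before starting a run, it can help to check a dataset against these requirements. The sketch below works on any iterable of row dictionaries, which is also what iterating over a Hugging Face dataset yields. The function name find_schema_problems and the sample rows are hypothetical.

```python
REQUIRED_COLUMNS = ("image", "file", "field_id", "groundtruth", "field_type")
FIELD_TYPES = ("Freetext", "Combfield", "Checkbox")


def find_schema_problems(rows):
    """Return a list of problems; an empty list means the rows look fine."""
    problems = []
    for index, row in enumerate(rows):
        missing = [column for column in REQUIRED_COLUMNS if column not in row]
        if missing:
            problems.append(f"row {index}: missing columns {missing}")
            continue
        if row["field_type"] not in FIELD_TYPES:
            problems.append(f"row {index}: unknown field_type {row['field_type']!r}")
    return problems


# Made-up rows; image is left as None to keep the example dependency-free.
sample_rows = [
    {"image": None, "file": "form_001.pdf", "field_id": "f1",
     "groundtruth": "Jane", "field_type": "Freetext"},
    {"image": None, "file": "form_001.pdf", "field_id": "f2",
     "groundtruth": "yes", "field_type": "Dropdown"},
    {"file": "form_002.pdf", "field_id": "f3",
     "groundtruth": "", "field_type": "Checkbox"},
]
problems = find_schema_problems(sample_rows)
for problem in problems:
    print(problem)
```

The check covers only column presence and the field type, not the image content or the uniqueness of field_id.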
You can add a JSON file resources/name_mapping.json that maps the field_id value in the dataset to a better description of the field to improve the extraction capabilities of the model.
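
A minimal sketch of such a mapping and a lookup, assuming the file is a flat JSON object from field_id to description. The field ids, the descriptions, and the fallback to the raw field_id are made up for illustration and are not taken from the repository.

```python
import json

# Hypothetical content of resources/name_mapping.json.
EXAMPLE_MAPPING_JSON = """
{
  "f_017": "Date of birth of the applicant",
  "f_018": "Applicant is self-employed (checkbox)"
}
"""


def describe_field(field_id, mapping):
    """Return the mapped description, or the raw field_id if none is defined."""
    return mapping.get(field_id, field_id)


name_mapping = json.loads(EXAMPLE_MAPPING_JSON)
print(describe_field("f_017", name_mapping))
print(describe_field("f_999", name_mapping))
```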
To identify files that were filled in with machine writing, add a matching file-name pattern in config/file_names/base.yaml. The pattern is used during evaluation.
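
The idea can be sketched as follows, assuming the pattern is a regular expression matched against the file column. The pattern itself, the file names, and the function name are hypothetical; the format that base.yaml actually expects may differ.

```python
import re

# Hypothetical pattern; the real one belongs in config/file_names/base.yaml.
MACHINE_WRITTEN_PATTERN = re.compile(r"_typed\.pdf$")


def is_machine_written(file_name):
    """Return True if the file name matches the machine-writing pattern."""
    return MACHINE_WRITTEN_PATTERN.search(file_name) is not None


for name in ["form_001_typed.pdf", "form_001_scan.pdf"]:
    print(name, is_machine_written(name))
```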