This script performs Document Question Answering (DocVQA) locally using the naver-clova-ix/donut-base-finetuned-docvqa model via the Hugging Face transformers library.
You provide an image of a document (e.g., receipt, invoice, form) and a question about its content. The Donut model analyzes the image, reading text and understanding layout without requiring a separate OCR step, to find the answer within the document image.
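The flow described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the script's actual code: the helper names `build_prompt` and `answer_question` are hypothetical, and the prompt template and generation arguments follow the usage commonly shown for this checkpoint in the Hugging Face documentation, so check them against the model card. The heavy imports are deferred into the function so that the prompt helper can be used on its own.

```python
import re

MODEL_ID = "naver-clova-ix/donut-base-finetuned-docvqa"


def build_prompt(question: str) -> str:
    """Wrap the question in the DocVQA task prompt that primes the decoder."""
    return f"<s_docvqa><s_question>{question}</s_question><s_answer>"


def answer_question(image_path: str, question: str) -> dict:
    """Run Donut on one document image and return the parsed output.

    Hypothetical helper; the real script may be organised differently.
    """
    # Deferred imports: large libraries that are only needed for inference.
    import torch
    from PIL import Image
    from transformers import DonutProcessor, VisionEncoderDecoderModel

    # Use the GPU when one is available, otherwise fall back to the CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    processor = DonutProcessor.from_pretrained(MODEL_ID)
    model = VisionEncoderDecoderModel.from_pretrained(MODEL_ID).to(device)
    model.eval()

    # The image is fed to the vision encoder directly; no separate OCR pass.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(image, return_tensors="pt").pixel_values
    decoder_input_ids = processor.tokenizer(
        build_prompt(question), add_special_tokens=False, return_tensors="pt"
    ).input_ids

    with torch.no_grad():
        outputs = model.generate(
            pixel_values.to(device),
            decoder_input_ids=decoder_input_ids.to(device),
            max_length=model.decoder.config.max_position_embeddings,
            pad_token_id=processor.tokenizer.pad_token_id,
            eos_token_id=processor.tokenizer.eos_token_id,
            use_cache=True,
            bad_words_ids=[[processor.tokenizer.unk_token_id]],
            return_dict_in_generate=True,
        )

    # Strip special tokens and the leading task token, then parse to a dict.
    sequence = processor.batch_decode(outputs.sequences)[0]
    sequence = sequence.replace(processor.tokenizer.eos_token, "")
    sequence = sequence.replace(processor.tokenizer.pad_token, "")
    sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()
    return processor.token2json(sequence)
```

Calling `answer_question("my_document.png", "What is the total amount?")` downloads the checkpoint on first use, so the first run needs network access and some disk space.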
The script is flexible about where the document image comes from:
- It prioritizes using a local document image file path specified within the script.
- If the specified file isn't found, it downloads a sample image (a simple receipt) for demonstration.
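The local-file-first behaviour with a sample fallback could look something like the sketch below. The function names `resolve_image` and `requests_downloader` are hypothetical, and the sample image URL is deliberately left as a parameter because the source does not state it. The downloader is passed in as a callable so the fallback logic can be exercised without network access.

```python
from pathlib import Path
from typing import Callable, Tuple


def resolve_image(
    user_path: str,
    sample_path: str,
    download_sample: Callable[[Path], None],
) -> Tuple[Path, str]:
    """Return (image path, source label), preferring the user's local file."""
    user_file = Path(user_path)
    if user_file.is_file():
        return user_file, "local"

    # Fallback: reuse a previously downloaded sample, or fetch it now.
    sample_file = Path(sample_path)
    if not sample_file.is_file():
        download_sample(sample_file)
    return sample_file, "sample"


def requests_downloader(url: str) -> Callable[[Path], None]:
    """Build a downloader for `url` (the URL is an assumption supplied by the caller)."""

    def _download(dest: Path) -> None:
        import requests  # deferred so the fallback logic works without it

        response = requests.get(url, timeout=30)
        response.raise_for_status()
        dest.write_bytes(response.content)

    return _download
```

Returning the source label alongside the path makes it easy to print which image was actually used.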
- Performs Document Question Answering locally.
- Uses the `naver-clova-ix/donut-base-finetuned-docvqa` model (Donut architecture, OCR-free).
- Answers natural language questions based on document image content and layout.
- Handles user-specified local document image files with a fallback to a sample image.
- Leverages the Hugging Face `transformers` library.
- Optionally utilizes GPU for faster processing.
- DocVQA Model: `naver-clova-ix/donut-base-finetuned-docvqa`
Before running the script, ensure you have the following installed:
- Python: Python 3.8 or later recommended.
- System Dependencies: None specific beyond standard build tools.
- Python Libraries: Install using pip in a virtual environment. Donut requires `sentencepiece`. Standard vision libraries are also recommended. Install command: `pip install transformers torch Pillow torchvision timm requests sentencepiece`
  - `transformers`: The core Hugging Face library.
  - `torch`: The deep learning framework backend (PyTorch).
  - `Pillow`: For loading and handling images.
  - `torchvision`, `timm`: Often required or beneficial for underlying vision components.
  - `requests`: Used to download the sample document image if needed.
  - `sentencepiece`: Required by the Donut model's tokenizer.
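Before running the script, it can help to confirm that the libraries listed above are importable. The following is a small optional sketch using only the standard library; the `missing_packages` helper is hypothetical and not part of the script. Note that the `Pillow` package is imported under the module name `PIL`.

```python
import importlib.util
from typing import Dict, List

# pip package name -> importable module name (Pillow is imported as PIL).
REQUIRED: Dict[str, str] = {
    "transformers": "transformers",
    "torch": "torch",
    "Pillow": "PIL",
    "torchvision": "torchvision",
    "timm": "timm",
    "requests": "requests",
    "sentencepiece": "sentencepiece",
}


def missing_packages(required: Dict[str, str] = REQUIRED) -> List[str]:
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, try: pip install " + " ".join(missing))
    else:
        print("All required packages are importable.")
```

This only checks that each module can be located; it does not verify versions or GPU support.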
- Clone or Download: Get the `run_docvqa.py` script onto your local machine.
- Create Virtual Environment (Recommended): Run `python3 -m venv .venv`, then `source .venv/bin/activate` (use `.\.venv\Scripts\activate` on Windows).
- Install Python Libraries: Run the pip command from the Prerequisites section within your activated virtual environment.
- Configure Inputs (Document Image & Question):
  - Open the `run_docvqa.py` script in a text editor.
  - Document Image:
    - Locate the line: `user_doc_image_path = "my_document.png"`
    - Option A (Recommended): Change the path to the exact path of the document image file (e.g., receipt.jpg, form.png) you want to query.
    - Option B: Place your document image file in the same directory as the script and name it `my_document.png` (if it has a different extension such as `.jpg`, update the variable to match).
    - Fallback: If no file is found at `user_doc_image_path`, the script downloads and uses the sample receipt image (`docvqa_sample_receipt.jpg`).
  - Question:
    - Locate the line: `question = "What is the total amount?"`
    - Change the question text to be relevant to the content of the document image you are providing (or the sample receipt).
- Run the Script:
  - Open your terminal or command prompt.
  - Make sure your virtual environment is activated.
  - Navigate to the directory containing the script.
  - Execute the script using Python: `python run_docvqa.py`
The script will print status messages, the image source used, and the question asked. The final output will be the model's answer extracted from the document image: