Commit 2569f6f

chore: sync training sample inputs
1 parent 4c73fda commit 2569f6f

32,268 files changed

Lines changed: 176 additions & 812733 deletions

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -28,3 +28,6 @@ tesseract-artifacts/
 
 # Partial download artifacts from receipt dataset acquisition
 models/data/downloaded_*/**/archives/*.part-*
+
+# GitHub cannot store this archive because the repo/account LFS object cap is 2 GiB.
+models/.data/eng/images/huggingface.co/datasets/mahmoud2019/ReceiptQA/data.zip
Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,123 @@
---
license: mit
task_categories:
- question-answering
language:
- en
tags:
- finance
size_categories:
- 100K<n<1M
---
# ReceiptQA: A Comprehensive Dataset for Receipt Understanding and Question Answering

ReceiptQA is a large-scale dataset designed to advance research in receipt understanding through question-answering (QA) tasks. It offers a wide range of questions derived from real-world receipt images, covering challenges such as text extraction, layout understanding, and numerical reasoning, and serves as a benchmark for evaluating and improving receipt-based QA models.

## Dataset Overview
ReceiptQA pairs 3,500 receipt images with roughly 171,000 question-answer pairs, built using two complementary approaches:

1. **LLM-Generated Subset:** 70,000 QA pairs generated by GPT-4o and validated by human annotators for accuracy and relevance.
2. **Human-Created Subset:** roughly 101,000 QA pairs written manually, including both answerable and unanswerable questions for diverse evaluation.

### Key Features
- Covers five domains: Retail, Food Services, Supermarkets, Fashion, and Medical.
- Includes both straightforward and complex questions.
- Offers a comprehensive benchmark for receipt-specific QA tasks.

### Dataset Statistics
| Domain        | Receipts  | Human QA Pairs | LLM QA Pairs |
|---------------|-----------|----------------|--------------|
| Retail        | 800       | 23,200         | 16,000       |
| Food Services | 700       | 20,300         | 14,000       |
| Supermarkets  | 700       | 20,300         | 14,000       |
| Fashion       | 650       | 18,850         | 13,000       |
| Coffee Shop   | 650       | 18,850         | 13,000       |
| **Total**     | **3,500** | **101,935**    | **70,000**   |
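
The table's column totals and per-receipt ratios can be checked with a few lines of arithmetic. The sketch below copies the per-domain rows from the table; the function names are hypothetical and not part of the dataset. As listed, the per-domain Human QA Pairs values sum to 101,500 rather than the 101,935 shown in the Total row. The sketch only reports the arithmetic and does not assume which figure is authoritative.

```python
# Per-domain figures copied from the Dataset Statistics table.
# Tuple layout: (receipts, human QA pairs, LLM QA pairs).
stats = {
    "Retail": (800, 23_200, 16_000),
    "Food Services": (700, 20_300, 14_000),
    "Supermarkets": (700, 20_300, 14_000),
    "Fashion": (650, 18_850, 13_000),
    "Coffee Shop": (650, 18_850, 13_000),
}


def column_totals(rows):
    """Sum each column across all domains."""
    receipts = sum(r for r, _, _ in rows.values())
    human = sum(h for _, h, _ in rows.values())
    llm = sum(m for _, _, m in rows.values())
    return receipts, human, llm


def pairs_per_receipt(rows):
    """Average number of (human, LLM) QA pairs per receipt, by domain."""
    return {domain: (h / r, m / r) for domain, (r, h, m) in rows.items()}


print(column_totals(stats))
for domain, (human_avg, llm_avg) in pairs_per_receipt(stats).items():
    print(f"{domain}: {human_avg:g} human, {llm_avg:g} LLM pairs per receipt")
```

Every listed row works out to 29 human-written and 20 LLM-generated pairs per receipt.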

### Example of Data

Here is a sample of question-answer pairs from the ReceiptQA dataset:

```json
[
  {
    "question": "What is the total amount for this receipt?",
    "answer": "559.99 L.E"
  },
  {
    "question": "What is the name of item 1?",
    "answer": "Pullover PU-SOK1175"
  },
  {
    "question": "What is the transaction number?",
    "answer": "29786"
  },
  {
    "question": "How many items were purchased?",
    "answer": "2"
  }
]
```
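
A minimal sketch of reading such records in Python follows. It assumes the labels are stored as a JSON array of objects with `question` and `answer` keys, which the sample suggests but the README does not state. `QAPair` and `parse_qa_pairs` are hypothetical names introduced for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str


def parse_qa_pairs(raw):
    """Parse a JSON array of {"question", "answer"} objects (assumed layout)."""
    pairs = []
    for index, record in enumerate(json.loads(raw)):
        missing = {"question", "answer"} - record.keys()
        if missing:
            raise ValueError(f"record {index} is missing {sorted(missing)}")
        # Answers are kept as strings, matching the sample ("2", "29786").
        pairs.append(QAPair(str(record["question"]), str(record["answer"])))
    return pairs


SAMPLE = """[
  {"question": "What is the total amount for this receipt?", "answer": "559.99 L.E"},
  {"question": "How many items were purchased?", "answer": "2"}
]"""

for pair in parse_qa_pairs(SAMPLE):
    print(f"{pair.question} -> {pair.answer}")
```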
## Requirements
```bash
# Install required libraries for inference
pip install torch==1.10.0
pip install transformers==4.5.0
pip install datasets==2.3.0
pip install Pillow
```

## Download Links

### Full Dataset
- **Train Set:** [Images](https://huggingface.co/datasets/mahmoud2019/ReceiptQA/resolve/main/train_images.zip?download=true) | [Labels](https://huggingface.co/datasets/mahmoud2019/ReceiptQA/resolve/main/train_label.zip?download=true)
- **Validation Set:** [Images](https://huggingface.co/datasets/mahmoud2019/ReceiptQA/resolve/main/validation_images.zip?download=true) | [Labels](https://huggingface.co/datasets/mahmoud2019/ReceiptQA/resolve/main/validation_label.zip?download=true)
- **Test Set:** [Images](https://huggingface.co/datasets/mahmoud2019/ReceiptQA/resolve/main/test_images.zip?download=true) | [Labels](https://huggingface.co/datasets/mahmoud2019/ReceiptQA/resolve/main/test_label.zip?download=true)

## Using ReceiptQA
To use ReceiptQA for training or evaluation, follow these steps:

### Step 1: Clone the Repository
```bash
git clone https://github.com/MahmoudElsayedMahmoud/ReceiptQA-A-Comprehensive-Dataset-for-Receipt-Understanding-and-Question-Answering ReceiptQA
cd ReceiptQA
```

### Step 2: Download the Dataset
Download the dataset archives using the links provided above and place them in the `data/` directory.
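
Step 2 can also be scripted. The sketch below uses only the Python standard library and the archive URLs listed under Download Links. The helper names (`archive_url`, `download_and_extract`) and the choice to unpack each archive into its own subfolder of `data/` are assumptions made for illustration, not part of the dataset's tooling.

```python
import urllib.request
import zipfile
from pathlib import Path

# Base URL taken from the download links in this README.
BASE_URL = "https://huggingface.co/datasets/mahmoud2019/ReceiptQA/resolve/main"
SPLITS = ("train", "validation", "test")
KINDS = ("images", "label")


def archive_url(split, kind):
    """Build the URL of one archive, e.g. train_images.zip."""
    return f"{BASE_URL}/{split}_{kind}.zip?download=true"


def download_and_extract(split, kind, data_dir="data"):
    """Fetch one archive into data_dir (if absent) and unpack it."""
    target = Path(data_dir)
    target.mkdir(parents=True, exist_ok=True)
    archive = target / f"{split}_{kind}.zip"
    if not archive.exists():
        # Skipped when the archive was already downloaded by hand.
        urllib.request.urlretrieve(archive_url(split, kind), str(archive))
    out_dir = target / f"{split}_{kind}"
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(out_dir)
    return out_dir


# Example usage (downloads all six archives; requires network access):
# for split in SPLITS:
#     for kind in KINDS:
#         print(download_and_extract(split, kind))
```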

## Evaluation Metrics
ReceiptQA provides the following metrics for evaluating QA models:
- **Exact Match (EM):** Whether the predicted answer exactly matches the ground truth.
- **F1 Score:** The overlap between the predicted and ground-truth answers, combining precision and recall.
- **Precision:** The share of the predicted answer that also appears in the ground truth.
- **Recall:** The share of the ground-truth answer that is recovered by the prediction.
- **Answer Containment:** Whether the ground-truth answer is included in the predicted response.
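
The README does not specify how these metrics are computed. The sketch below is one common way to implement them, assuming case-insensitive comparison and whitespace tokenization in the style of SQuAD-like QA evaluation. The normalization rules and function names are assumptions and may differ from the authors' evaluation code.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace (assumed normalization)."""
    return " ".join(text.lower().split())


def exact_match(prediction, truth):
    """True when the normalized strings are identical."""
    return normalize(prediction) == normalize(truth)


def precision_recall_f1(prediction, truth):
    """Token-level precision, recall and F1 between two answers."""
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    # Multiset intersection counts each shared token at most as often
    # as it occurs in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def answer_containment(prediction, truth):
    """True when the normalized truth occurs as a substring of the prediction."""
    return normalize(truth) in normalize(prediction)


print(exact_match("559.99 l.e", "559.99 L.E"))
print(precision_recall_f1("The total is 559.99 L.E", "559.99 L.E"))
print(answer_containment("The total is 559.99 L.E", "559.99 L.E"))
```

Note that substring containment is lenient for short answers: a ground truth of `2` is contained in a prediction of `29786`. A token-level check may be preferable for numeric answers.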

## Models Benchmarked
ReceiptQA has been used to evaluate state-of-the-art models, including:
- **GPT-4**
- **Llama 3.2 (11B)**
- **Gemini 2.0**
- **Phi 3.5 Vision**
- **InternVL2 (4B/8B)**
- **LLaVA 7B**

## Citation
If you use ReceiptQA in your research, please cite our paper:
```
To be published soon.
```

## Contact
For questions or feedback, please contact:
- Mahmoud Abdalla: [mahmoudelsayed@chungbuk.ac.kr](mailto:mahmoudelsayed@chungbuk.ac.kr)
- GitHub Issues: [Submit an issue](https://github.com/MahmoudElsayedMahmoud/ReceiptQA-A-Comprehensive-Dataset-for-Receipt-Understanding-and-Question-Answering/issues)
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
---
title: MIT License
spdx-id: MIT
featured: true
hidden: false

description: A short and simple permissive license with conditions only requiring preservation of copyright and license notices. Licensed works, modifications, and larger works may be distributed under different terms and without source code.

how: Create a text file (typically named LICENSE or LICENSE.txt) in the root of your source code and copy the text of the license into the file. Replace [year] with the current year and [fullname] with the name (or names) of the copyright holders.

using:
  Babel: https://github.com/babel/babel/blob/master/LICENSE
  .NET: https://github.com/dotnet/runtime/blob/main/LICENSE.TXT
  Rails: https://github.com/rails/rails/blob/master/MIT-LICENSE

permissions:
  - commercial-use
  - modifications
  - distribution
  - private-use

conditions:
  - include-copyright

limitations:
  - liability
  - warranty

---
MIT License

Copyright (c) [year] [fullname]

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

models/training_samples/inputs/ara/mahmoud2019-receiptqa/images/test-images-zip/test_images/005a1eab-28e9-4094-9cd4-ca8fe4effe32.txt

Lines changed: 0 additions & 72 deletions
This file was deleted.

models/training_samples/inputs/ara/mahmoud2019-receiptqa/images/test-images-zip/test_images/0136afc5-d48d-4f76-9ede-c04e41f9f9e3.txt

Lines changed: 0 additions & 25 deletions
This file was deleted.

models/training_samples/inputs/ara/mahmoud2019-receiptqa/images/test-images-zip/test_images/01ac7d36-20c7-488f-a733-4f44ba860e30.txt

Lines changed: 0 additions & 59 deletions
This file was deleted.

models/training_samples/inputs/ara/mahmoud2019-receiptqa/images/test-images-zip/test_images/01c7dfaf-b2cd-4633-a434-2538fc31a3ca.txt

Lines changed: 0 additions & 30 deletions
This file was deleted.
