
Commit c793697

committed
The Parxyval quest begins!
0 parents  commit c793697

28 files changed

Lines changed: 5405 additions & 0 deletions

.env.example

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
## Configure Parxy drivers with environment variables

PARXY_LLAMAPARSE_API_KEY=

PARXY_LLMWHISPERER_API_KEY=
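
A minimal sketch of how these two variables might be read from Python, using only the standard library. The helper name `load_parxy_keys` is hypothetical, and how Parxy itself consumes the variables is not shown in this commit excerpt.

```python
import os


def load_parxy_keys(environ=None):
    """Collect the two API keys named in .env.example.

    Hypothetical helper for illustration; unset or empty values map to None.
    """
    env = os.environ if environ is None else environ
    names = ('PARXY_LLAMAPARSE_API_KEY', 'PARXY_LLMWHISPERER_API_KEY')
    return {name: (env.get(name) or None) for name in names}


if __name__ == '__main__':
    for name, value in load_parxy_keys().items():
        print(f"{name}: {'set' if value else 'missing'}")
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to test without touching the real environment.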

.github/CONTRIBUTING.md

Lines changed: 54 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,54 @@
# Contributing

Contributions are **welcome** and will be fully **credited**.

Please read and understand the contribution guide before creating an issue or pull request.

## Etiquette

This project is open source, and as such, the maintainers give their free time to build and maintain the source code held within. They make the code freely available in the hope that it will be of use to other developers. It would be extremely unfair for them to suffer abuse or anger for their hard work.

Please be considerate towards maintainers when raising issues or presenting pull requests. Let's show the
world that developers are civilized and selfless people.

It's the duty of the maintainer to ensure that all submissions to the project are of sufficient
quality to benefit the project. Many developers have different skillsets, strengths, and weaknesses. Respect the maintainer's decision, and do not be upset or abusive if your submission is not used.

## Viability

When requesting or submitting new features, first consider whether it might be useful to others. Open
source projects are used by many developers, who may have entirely different needs to your own. Think about
whether or not your feature is likely to be used by other users of the project.

## Procedure

> [!NOTE]
> Issue tracking is not currently enabled for this repository. We are organising it.

Before filing an issue:

- Attempt to replicate the problem, to ensure that it wasn't a coincidental incident.
- Check to make sure your feature suggestion isn't already present within the project.
- Check the pull requests tab to ensure that the bug doesn't have a fix in progress.
- Check the pull requests tab to ensure that the feature isn't already in progress.

Before submitting a pull request:

- Check the codebase to ensure that your feature doesn't already exist.
- Check the pull requests to ensure that another person hasn't already submitted the feature or fix.

## Requirements

If the project maintainer has any additional requirements, you will find them listed here.

- **Add tests!** - Your patch won't be accepted if it doesn't have tests.

- **Document any change in behaviour** - Make sure the `README.md` and any other relevant documentation are kept up-to-date.

- **Consider our release cycle** - We try to follow [SemVer v2.0.0](https://semver.org/). Randomly breaking public APIs is not an option.

- **One pull request per feature** - If you want to do more than one thing, send multiple pull requests.

- **Send coherent history** - Make sure each individual commit in your pull request is meaningful. If you had to make multiple intermediate commits while developing, please [squash them](https://www.git-scm.com/book/en/v2/Git-Tools-Rewriting-History#Changing-Multiple-Commit-Messages) before submitting.

**Happy coding**!

.github/SECURITY.md

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
# Security Policy

If you discover any security related issues, please email security@oneofftech.xyz instead of using the discussions or the issue tracker.

.gitignore

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,16 @@
# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info

# Virtual environments
.venv


.env

data/

.python-version

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
3.12

LICENCE

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,22 @@
1+
MIT License
2+
3+
Copyright (c) 2025-present OneOff-Tech (UG)
4+
Copyright (c) 2025-present Alessio Vertemati
5+
6+
Permission is hereby granted, free of charge, to any person obtaining a copy
7+
of this software and associated documentation files (the "Software"), to deal
8+
in the Software without restriction, including without limitation the rights
9+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
10+
copies of the Software, and to permit persons to whom the Software is
11+
furnished to do so, subject to the following conditions:
12+
13+
The above copyright notice and this permission notice shall be included in all
14+
copies or substantial portions of the Software.
15+
16+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
17+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
18+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
19+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
20+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
21+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
22+
SOFTWARE.

README.md

Lines changed: 164 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,164 @@
![pypi](https://img.shields.io/pypi/v/parxyval.svg)
![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json) [![uv](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json)](https://github.com/astral-sh/uv)

# Parxyval


Parxyval – The Developer's Knight of the Parsing Table.

An evaluation framework for document parsing, inspired by the quest for the Holy Grail. 🏰⚔️

In a world of imperfect parsers, Parxyval helps you measure, compare, and discover which tool truly preserves the meaning of your documents. Benchmark precision, recall, structure, and reliability across multiple parsing services, and let your pipeline find its Grail.


**Requirements**

- Python 3.12 or above.
- A Hugging Face account for downloading datasets that require a login.


**Next steps**

- [Getting started](#getting-started)
- [Available commands](#command-reference)
- [Benchmarks](#benchmarks)
- [Supported evaluations](#evaluations)



## Getting Started

Parxyval is an evaluation framework for unstructured document processing. It offers a CLI to benchmark PDF parsing solutions. Follow these steps to get started:

Parxyval is available on PyPI. You can try it using `uv`.

```bash
uvx parxyval --help
```



1. **Download the Dataset**
```bash
# Download sample documents from DocLayNet dataset
parxyval download --limit 100 --include-pdf
```

The ground truth is stored in `./data/doclaynet/json`, while PDF files are stored in `./data/doclaynet/pdf`.


2. **Parse Documents**
```bash
# Parse PDFs using your chosen driver (default: pymupdf)
parxyval parse --driver pymupdf

# You can customize the input and output locations:
# --input data/doclaynet/pdf --output data/doclaynet/processed
```

Parxyval supports all drivers available in [Parxy](https://github.com/OneOffTech/parxy).

PDF files are read from `./data/doclaynet/pdf` and the parser output is written to `./data/doclaynet/processed/{driver}`, e.g. `./data/doclaynet/processed/pymupdf`.

3. **Evaluate Results**

```bash
# Run evaluation with selected metrics
parxyval evaluate --metric sequence_matcher --metric bleu_score --input ./data/doclaynet/processed/pymupdf
```
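
The three steps above can also be scripted. The sketch below only builds the command lines as argument lists, using the flags and default paths documented in this README; the helper name `pipeline_commands` is hypothetical and not part of Parxyval.

```python
from pathlib import Path


def pipeline_commands(driver='pymupdf', limit=100, metrics=('sequence_matcher', 'bleu_score')):
    """Build the download, parse, and evaluate invocations as argument lists."""
    # Parser output is documented to land in data/doclaynet/processed/{driver}.
    processed = Path('data/doclaynet/processed') / driver
    download = ['parxyval', 'download', '--limit', str(limit), '--include-pdf']
    parse = ['parxyval', 'parse', '--driver', driver]
    evaluate = ['parxyval', 'evaluate']
    for metric in metrics:
        # --metric can be specified multiple times.
        evaluate += ['--metric', metric]
    evaluate += ['--input', processed.as_posix()]
    return [download, parse, evaluate]


if __name__ == '__main__':
    for command in pipeline_commands():
        print(' '.join(command))
    # To execute them in order, pass each list to subprocess.run(command, check=True).
```

Keeping the commands as lists rather than shell strings avoids quoting problems if a path contains spaces.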

### Command Reference

#### `parxyval download`
Download documents from the DocLayNet dataset.

Options:
- `--limit, -l`: Number of entries to download (default: 100)
- `--skip, -s`: Skip the specified number of entries
- `--output, -o`: Output folder path (default: data/doclaynet)
- `--include-pdf`: Download PDF files (default: False)

#### `parxyval parse`
Parse PDF documents using the specified driver.

Options:
- `--driver, -d`: Parser driver to use (default: pymupdf)
- `--limit, -l`: Maximum documents to process (default: 100)
- `--skip, -s`: Skip the specified number of documents
- `--input, -i`: Input folder with PDFs (default: data/doclaynet/pdf)
- `--output, -o`: Output folder for results (default: data/doclaynet/processed)

#### `parxyval evaluate`
Evaluate parsing results against ground truth.

Arguments:
- `driver`: Parser driver to evaluate (default: pymupdf)

Options:
- `--metric, -m`: Metrics to use (can be specified multiple times)
- `--golden, -g`: Ground truth folder (default: data/doclaynet/json)
- `--input, -i`: Parsed documents folder (default: data/doclaynet/processed/pymupdf)
- `--output, -o`: Results output folder (default: data/doclaynet/results)


## Benchmarks

Parxyval supports various benchmarks for the evaluation of document processing services.

- [DocLayNet](https://huggingface.co/datasets/ds4sd/DocLayNet-v1.2): Evaluate text and layout using the DocLayNet v1.2 dataset.


_Datasets we are evaluating for future support:_

- [DP-Bench: Document Parsing Benchmark](https://huggingface.co/datasets/upstage/dp-bench)
- [OmniDocBench](https://huggingface.co/datasets/opendatalab/OmniDocBench)

## Evaluations

Parxyval provides a comprehensive suite of text evaluation metrics to assess the quality of PDF parsing results. Each metric focuses on a different aspect of text similarity and accuracy:

### Text Similarity Metrics

- **Sequence Matcher**: Measures the similarity between two texts using Python's difflib sequence matcher. Ideal for detecting overall textual similarities and differences.

- **Jaccard Similarity**: Computes the similarity between page contents by measuring the intersection over union of their token sets. Perfect for assessing vocabulary overlap between parsed and reference texts.

- **Edit Distance**: Calculates the normalized Levenshtein distance between texts, measuring the minimum number of single-character edits required to change one text into another. Useful for identifying character-level parsing accuracy.
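
As a rough illustration of the three text similarity metrics, here is a standard-library sketch. The function names are hypothetical, and the whitespace tokenization and the normalization by the longer text's length are assumptions; Parxyval's own implementation may make different choices.

```python
from difflib import SequenceMatcher


def sequence_ratio(parsed, reference):
    """Similarity in [0, 1] from difflib's SequenceMatcher."""
    return SequenceMatcher(None, parsed, reference).ratio()


def jaccard(parsed, reference):
    """Intersection over union of whitespace-separated token sets (assumed tokenization)."""
    a, b = set(parsed.split()), set(reference.split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b), # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(parsed, reference):
    """Levenshtein distance divided by the longer length (assumed normalization)."""
    longest = max(len(parsed), len(reference))
    return levenshtein(parsed, reference) / longest if longest else 0.0


if __name__ == '__main__':
    print(sequence_ratio('hello world', 'hello word'))
    print(jaccard('a b c', 'b c d'))
    print(normalized_edit_distance('kitten', 'sitting'))
```

Note that the first two functions return a similarity (higher is better), while the normalized edit distance is a distance (lower is better).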

### Natural Language Processing Metrics

- **BLEU Score**: A precision-based metric that compares n-grams between the parsed and reference texts. Particularly effective for evaluating the preservation of word sequences and phrases.

- **METEOR Score**: An advanced metric that considers stemming, synonymy, and paraphrasing. Provides a more nuanced evaluation of semantic similarity between parsed and reference texts.
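
To make the n-gram comparison behind BLEU concrete, the sketch below computes clipped n-gram precision for a single n-gram order. This is only one building block, not full BLEU, which combines several n-gram orders and applies a brevity penalty. The function names and the whitespace tokenization are assumptions for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-token windows of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(parsed, reference, n=1):
    """Share of parsed n-grams found in the reference, with counts clipped.

    Clipping means a repeated n-gram is credited at most as many times
    as it occurs in the reference.
    """
    parsed_counts = Counter(ngrams(parsed.split(), n))
    reference_counts = Counter(ngrams(reference.split(), n))
    total = sum(parsed_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, reference_counts[gram]) for gram, count in parsed_counts.items())
    return matched / total


if __name__ == '__main__':
    print(clipped_precision('the cat sat', 'the cat sat down', n=2))
```

For a complete BLEU score, the `nltk` package provides `nltk.translate.bleu_score.sentence_bleu`; whether Parxyval calls it is not visible in this excerpt.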

### Information Retrieval Metrics

- **Precision**: Measures the accuracy of the parsed text by calculating the proportion of correctly parsed tokens relative to all tokens in the parsed text.

- **Recall**: Evaluates completeness by calculating the proportion of reference tokens that were correctly captured in the parsed text.

- **F1 Score**: The harmonic mean of precision and recall, providing a balanced measure of parsing accuracy.
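
A minimal token-level sketch of these three definitions follows. It assumes whitespace tokenization and counts repeated tokens (multiset overlap); both are illustrative choices, and the function name is hypothetical.

```python
from collections import Counter


def precision_recall_f1(parsed, reference):
    """Token-level precision, recall, and F1 between two texts."""
    parsed_counts = Counter(parsed.split())
    reference_counts = Counter(reference.split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((parsed_counts & reference_counts).values())
    parsed_total = sum(parsed_counts.values())
    reference_total = sum(reference_counts.values())
    precision = overlap / parsed_total if parsed_total else 0.0
    recall = overlap / reference_total if reference_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == '__main__':
    print(precision_recall_f1('a b c d', 'a b'))
```

In this example the parser emitted two extra tokens, so recall is perfect while precision drops to one half.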

All metrics are computed page-wise and then averaged across the entire document, ensuring a comprehensive evaluation of parsing quality at both local and global levels.
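
The page-wise averaging can be sketched as follows. The names `document_score` and `exact_match` are hypothetical, and how Parxyval treats documents whose page counts differ is not stated here, so the sketch simply pairs pages in order.

```python
from statistics import mean


def exact_match(parsed, reference):
    """Toy metric used only for this example: 1.0 when the texts are equal."""
    return 1.0 if parsed == reference else 0.0


def document_score(metric, parsed_pages, reference_pages):
    """Score each page pair with the metric, then average the page scores."""
    # zip() stops at the shorter list; unmatched pages are ignored in this sketch.
    page_scores = [metric(p, r) for p, r in zip(parsed_pages, reference_pages)]
    return mean(page_scores) if page_scores else 0.0


if __name__ == '__main__':
    print(document_score(exact_match, ['page one', 'page 2'], ['page one', 'page two']))
```

Any function that takes a parsed text and a reference text and returns a number can be passed as `metric`.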


## Security Vulnerabilities

Please review our [security policy](./.github/SECURITY.md) on how to report security vulnerabilities.


## Supporters

The project is provided and supported by OneOff-Tech (UG) and Alessio Vertemati.

<p align="left"><a href="https://oneofftech.de" target="_blank"><img src="https://raw.githubusercontent.com/OneOffTech/.github/main/art/oneofftech-logo.svg" width="200"></a></p>


## Licence and Copyright

Parxyval is licensed under the [MIT licence](./LICENCE).

- Copyright (c) 2025-present Alessio Vertemati, @avvertix
- Copyright (c) 2025-present Oneoff-tech UG, www.oneofftech.de
- All contributors

docker-compose.yml

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,12 @@
services:
  pdfact:
    image: "ghcr.io/data-house/pdfact:main"
    ports:
      - "4567:4567"
    networks:
      - parxyval

networks:
  parxyval:
    driver: bridge
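
The compose file publishes the `pdfact` service on port 4567. After starting it (for example with `docker compose up -d`), a quick reachability check might look like the sketch below. The helper name `is_port_open` is hypothetical, and the check only confirms that something accepts TCP connections on the port, not that pdfact is healthy.

```python
import socket


def is_port_open(host='localhost', port=4567, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == '__main__':
    state = 'reachable' if is_port_open() else 'not reachable'
    print(f'pdfact on localhost:4567 is {state}')
```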

pyproject.toml

Lines changed: 65 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,65 @@
[project]
name = "parxyval"
version = "0.1.0" # DO NOT EDIT, updated automatically
description = "An evaluation framework for document parsing."
license = "MIT"
keywords= ["parxy", "evaluation", "evaluation-framework", "convert", "document", "document-parsing", "parser-benchmark", "pdf", "docx", "html", "markdown", "layout model", "rag", "llm", "document-performance-monitoring", "document-ai","text-extraction","data-quality","rag","parsing-tools"]
readme = "README.md"
authors = [
    { name = "Alessio Vertemati", email = "alessio@oneofftech.xyz" }
]
classifiers = [
    "Operating System :: MacOS :: MacOS X",
    "Operating System :: POSIX :: Linux",
    "Operating System :: Microsoft :: Windows",
    "Development Status :: 4 - Beta",
    "Intended Audience :: Developers",
    "Intended Audience :: Science/Research",
    "Topic :: Scientific/Engineering :: Artificial Intelligence",
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
]
requires-python = ">=3.12,<4.0"
dependencies = [
    "datasets>=4.1.1",
    "nltk>=3.9.2",
    "pandas>=2.3.3",
    "parxy[all]>=0.10.0",
    "pydantic>=2.11.9",
    "pydantic-settings>=2.11.0",
    "pymupdf>=1.26.4",
    "python-dotenv>=1.1.1",
    "rich>=14.1.0",
    "tqdm>=4.67.1",
    "typer>=0.19.2",
]

[project.urls]
homepage = "https://github.com/OneOffTech/parxyval"
repository = "https://github.com/OneOffTech/parxyval"
issues = "https://github.com/OneOffTech/parxyval/issues"

[project.scripts]
parxyval = "parxyval.cli.main:app"

[build-system]
requires = ["uv_build>=0.8.11,<0.9.0"]
build-backend = "uv_build"

[dependency-groups]
dev = [
    "pytest>=8.4.2",
    "ruff>=0.13.2",
]

[tool.uv]
default-groups = "all"

[tool.pytest.ini_options]
addopts = [
    "--import-mode=importlib",
]

[tool.ruff.format]
quote-style = "single"
docstring-code-format = true

src/parxyval/__init__.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
