| 1 | + |
| 2 | + [](https://github.com/astral-sh/uv) |
| 3 | + |
| 4 | +# Parxyval |
| 5 | + |
| 6 | + |
| 7 | +Parxyval – The Developer's Knight of the Parsing Table. |
| 8 | + |
| 9 | +An evaluation framework for document parsing, inspired by the quest for the Holy Grail. 🏰⚔️ |
| 10 | + |
| 11 | +In a world of imperfect parsers, Parxyval helps you measure, compare, and discover which tool truly preserves the meaning of your documents. Benchmark precision, recall, structure, and reliability across multiple parsing services — and let your pipeline find its Grail. |
| 12 | + |
| 13 | + |
| 14 | +**Requirements** |
| 15 | + |
| 16 | +- Python 3.12 or above. |
| 17 | +- A Hugging Face account, needed to download datasets that require a login. |
| 18 | + |
| 19 | + |
| 20 | +**Next steps** |
| 21 | + |
| 22 | +- [Getting started](#getting-started) |
| 23 | + - [Available commands](#command-reference) |
| 24 | +- [Benchmarks](#benchmarks) |
| 25 | +- [Supported evaluations](#evaluations) |
| 26 | + |
| 27 | + |
| 28 | + |
| 29 | +## Getting Started |
| 30 | + |
| 31 | +Parxyval is an evaluation framework for unstructured document processing, with a CLI for benchmarking PDF parsing solutions. |
| 32 | + |
| 33 | +Parxyval is available on PyPI. You can try it using `uv`, then follow the steps below to get started. |
| 34 | + |
| 35 | +```bash |
| 36 | +uvx parxyval --help |
| 37 | +``` |
| 38 | + |
| 39 | + |
| 40 | + |
| 41 | +1. **Download the Dataset** |
| 42 | +```bash |
| 43 | +# Download sample documents from the DocLayNet dataset |
| 44 | +parxyval download --limit 100 --include-pdf |
| 45 | +``` |
| 46 | + |
| 47 | +The ground truth is stored in `./data/doclaynet/json`, while PDF files are stored in `./data/doclaynet/pdf`. |
| 48 | + |
| 49 | + |
| 50 | +2. **Parse Documents** |
| 51 | +```bash |
| 52 | +# Parse PDFs using your chosen driver (default: pymupdf) |
| 53 | +parxyval parse --driver pymupdf |
| 54 | + |
| 55 | +# You can customize the input and output locations: |
| 56 | +# --input data/doclaynet/pdf --output data/doclaynet/processed |
| 57 | +``` |
| 58 | + |
| 59 | +Parxyval supports all drivers available in [Parxy](https://github.com/OneOffTech/parxy). |
| 60 | + |
| 61 | +PDF files are read from `./data/doclaynet/pdf` and the parser output is written to `./data/doclaynet/processed/{driver}`, e.g. `./data/doclaynet/processed/pymupdf`. |
| 62 | + |
| 63 | +3. **Evaluate Results** |
| 64 | + |
| 65 | +```bash |
| 66 | +# Run evaluation with selected metrics |
| 67 | +parxyval evaluate --metric sequence_matcher --metric bleu_score --input ./data/doclaynet/processed/pymupdf |
| 68 | +``` |
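The three steps above can also be scripted. The following is a minimal sketch, assuming `parxyval` is installed and on your `PATH`; the helper names `build_pipeline` and `run_pipeline` are hypothetical and not part of Parxyval. It only uses the subcommands and flags shown in the steps above.

```python
import subprocess


def build_pipeline(driver="pymupdf", limit=100,
                   metrics=("sequence_matcher", "bleu_score")):
    """Return the download, parse and evaluate commands as argument lists.

    Hypothetical helper: it mirrors the CLI calls from the steps above.
    Whether `evaluate` needs extra arguments for non-default drivers is
    not covered by this sketch.
    """
    evaluate = ["parxyval", "evaluate"]
    for metric in metrics:
        # --metric can be specified multiple times
        evaluate += ["--metric", metric]
    evaluate += ["--input", f"./data/doclaynet/processed/{driver}"]

    return [
        ["parxyval", "download", "--limit", str(limit), "--include-pdf"],
        ["parxyval", "parse", "--driver", driver],
        evaluate,
    ]


def run_pipeline(commands):
    """Run each command in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


# Usage (requires parxyval to be installed):
# run_pipeline(build_pipeline(limit=10))
```

Building the commands separately from running them makes it easy to print or log them before anything is executed.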
| 69 | + |
| 70 | +### Command Reference |
| 71 | + |
| 72 | +#### `parxyval download` |
| 73 | +Download documents from the DocLayNet dataset. |
| 74 | + |
| 75 | +Options: |
| 76 | +- `--limit, -l`: Number of entries to download (default: 100) |
| 77 | +- `--skip, -s`: Skip specified number of entries |
| 78 | +- `--output, -o`: Output folder path (default: data/doclaynet) |
| 79 | +- `--include-pdf`: Download PDF files (default: False) |
| 80 | + |
| 81 | +#### `parxyval parse` |
| 82 | +Parse PDF documents using the specified driver. |
| 83 | + |
| 84 | +Options: |
| 85 | +- `--driver, -d`: Parser driver to use (default: pymupdf) |
| 86 | +- `--limit, -l`: Maximum documents to process (default: 100) |
| 87 | +- `--skip, -s`: Skip specified number of documents |
| 88 | +- `--input, -i`: Input folder with PDFs (default: data/doclaynet/pdf) |
| 89 | +- `--output, -o`: Output folder for results (default: data/doclaynet/processed) |
| 90 | + |
| 91 | +#### `parxyval evaluate` |
| 92 | +Evaluate parsing results against ground truth. |
| 93 | + |
| 94 | +Arguments: |
| 95 | +- `driver`: Parser driver to evaluate (default: pymupdf) |
| 96 | + |
| 97 | +Options: |
| 98 | +- `--metric, -m`: Metrics to use (can be specified multiple times) |
| 99 | +- `--golden, -g`: Ground truth folder (default: data/doclaynet/json) |
| 100 | +- `--input, -i`: Parsed documents folder (default: data/doclaynet/processed/pymupdf) |
| 101 | +- `--output, -o`: Results output folder (default: data/doclaynet/results) |
| 102 | + |
| 103 | + |
| 104 | +## Benchmarks |
| 105 | + |
| 106 | +Parxyval supports the following benchmarks for evaluating document processing services. |
| 107 | + |
| 108 | +- [DocLayNet](https://huggingface.co/datasets/ds4sd/DocLayNet-v1.2): Evaluate text and layout using the DocLayNet v1.2 dataset. |
| 109 | + |
| 110 | + |
| 111 | +_Datasets we are evaluating to support:_ |
| 112 | + |
| 113 | +- [DP-Bench: Document Parsing Benchmark](https://huggingface.co/datasets/upstage/dp-bench) |
| 114 | +- [OmniDocBench](https://huggingface.co/datasets/opendatalab/OmniDocBench) |
| 115 | + |
| 116 | +## Evaluations |
| 117 | + |
| 118 | +Parxyval provides a suite of text evaluation metrics to assess the quality of PDF parsing results. Each metric focuses on a different aspect of text similarity and accuracy: |
| 119 | + |
| 120 | +### Text Similarity Metrics |
| 121 | + |
| 122 | +- **Sequence Matcher**: Measures the similarity between two texts using Python's difflib sequence matcher. Ideal for detecting overall textual similarities and differences. |
| 123 | + |
| 124 | +- **Jaccard Similarity**: Computes the similarity between page contents by measuring the intersection over union of their token sets. Perfect for assessing vocabulary overlap between parsed and reference texts. |
| 125 | + |
| 126 | +- **Edit Distance**: Calculates the normalized Levenshtein distance between texts, measuring the minimum number of single-character edits required to change one text into another. Useful for identifying character-level parsing accuracy. |
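To make the three text similarity metrics concrete, here is a minimal sketch in plain Python. It is an illustration, not Parxyval's implementation: the function names, whitespace tokenization, and normalization by the longer text's length are assumptions made for this example.

```python
from difflib import SequenceMatcher


def sequence_similarity(parsed: str, reference: str) -> float:
    """Similarity ratio in [0, 1] computed by difflib's SequenceMatcher."""
    return SequenceMatcher(None, parsed, reference).ratio()


def jaccard_similarity(parsed: str, reference: str) -> float:
    """Intersection over union of the two token sets (whitespace tokens assumed)."""
    a, b = set(parsed.split()), set(reference.split())
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b), # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(parsed: str, reference: str) -> float:
    """Levenshtein distance divided by the longer length (assumed normalization)."""
    longest = max(len(parsed), len(reference))
    return levenshtein(parsed, reference) / longest if longest else 0.0
```

Note that the first two functions return a similarity (higher is better), while the normalized edit distance is a distance (lower is better).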
| 127 | + |
| 128 | +### Natural Language Processing Metrics |
| 129 | + |
| 130 | +- **BLEU Score**: A precision-based metric that compares n-grams between the parsed and reference texts. Particularly effective for evaluating the preservation of word sequences and phrases. |
| 131 | + |
| 132 | +- **METEOR Score**: An advanced metric that considers stemming, synonymy, and paraphrasing. Provides a more nuanced evaluation of semantic similarity between parsed and reference texts. |
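The n-gram comparison behind BLEU can be illustrated with clipped n-gram precision, its core building block. This is a simplified sketch, not Parxyval's implementation and not a full BLEU score, which combines several n-gram orders and applies a brevity penalty. The function names and whitespace tokenization are assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_ngram_precision(parsed: str, reference: str, n: int = 2) -> float:
    """Share of parsed n-grams that also occur in the reference.

    Each n-gram is credited at most as many times as it appears in the
    reference ("clipping"), so repeated words cannot inflate the score.
    """
    parsed_counts = Counter(ngrams(parsed.split(), n))
    reference_counts = Counter(ngrams(reference.split(), n))
    total = sum(parsed_counts.values())
    if total == 0:
        return 0.0
    matched = sum(
        min(count, reference_counts[gram])
        for gram, count in parsed_counts.items()
    )
    return matched / total
```

For example, with the parsed text `the cat sat on the mat` and the reference `the cat sat on a mat`, three of the five parsed bigrams appear in the reference, giving a bigram precision of 0.6.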
| 133 | + |
| 134 | +### Information Retrieval Metrics |
| 135 | + |
| 136 | +- **Precision**: Measures the accuracy of the parsed text by calculating the proportion of correctly parsed tokens relative to all tokens in the parsed text. |
| 137 | + |
| 138 | +- **Recall**: Evaluates completeness by calculating the proportion of reference tokens that were correctly captured in the parsed text. |
| 139 | + |
| 140 | +- **F1 Score**: The harmonic mean of precision and recall, providing a balanced measure of parsing accuracy. |
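The relationship between the three metrics above can be sketched as follows. This is an illustration under stated assumptions rather than Parxyval's implementation: tokens are split on whitespace and matched as multisets, and the function name `token_prf` is hypothetical.

```python
from collections import Counter


def token_prf(parsed: str, reference: str):
    """Return (precision, recall, f1) over whitespace tokens.

    A token counts as correct at most as many times as it occurs in the
    reference (multiset intersection).
    """
    parsed_counts = Counter(parsed.split())
    reference_counts = Counter(reference.split())
    overlap = sum((parsed_counts & reference_counts).values())

    parsed_total = sum(parsed_counts.values())
    reference_total = sum(reference_counts.values())

    precision = overlap / parsed_total if parsed_total else 0.0
    recall = overlap / reference_total if reference_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

If a parser emits every reference token plus one spurious token, for instance `invoice total 42 EUR extra` against `invoice total 42 EUR`, recall stays at 1.0 while precision drops to 0.8, and F1 falls between the two.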
| 141 | + |
| 142 | +All metrics are computed page by page and then averaged across the entire document, so each document score reflects the parsing quality of every page. |
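Page-wise averaging can be sketched like this. It assumes that parsed and reference pages are aligned by index and that every page carries equal weight; both are assumptions for illustration, and `document_score` and `exact_match` are hypothetical names, not Parxyval functions.

```python
from statistics import mean


def exact_match(parsed_page: str, reference_page: str) -> float:
    """Toy page-level metric used only to demonstrate the averaging."""
    return 1.0 if parsed_page == reference_page else 0.0


def document_score(parsed_pages, reference_pages, metric=exact_match) -> float:
    """Apply `metric` to each aligned page pair and return the unweighted mean."""
    if len(parsed_pages) != len(reference_pages):
        raise ValueError("parsed and reference documents differ in page count")
    if not parsed_pages:
        return 0.0
    return mean(
        metric(parsed, reference)
        for parsed, reference in zip(parsed_pages, reference_pages)
    )
```

Any function that takes a parsed page and a reference page and returns a number can be passed as `metric`.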
| 143 | + |
| 144 | + |
| 145 | +## Security Vulnerabilities |
| 146 | + |
| 147 | +Please review our [security policy](./.github/SECURITY.md) on how to report security vulnerabilities. |
| 148 | + |
| 149 | + |
| 150 | +## Supporters |
| 151 | + |
| 152 | +The project is provided and supported by OneOff-Tech (UG) and Alessio Vertemati. |
| 153 | + |
| 154 | +<p align="left"><a href="https://oneofftech.de" target="_blank"><img src="https://raw.githubusercontent.com/OneOffTech/.github/main/art/oneofftech-logo.svg" width="200"></a></p> |
| 155 | + |
| 156 | + |
| 157 | +## Licence and Copyright |
| 158 | + |
| 159 | +Parxyval is licensed under the [MIT licence](./LICENCE). |
| 160 | + |
| 161 | +- Copyright (c) 2025-present Alessio Vertemati, @avvertix |
| 162 | +- Copyright (c) 2025-present Oneoff-tech UG, www.oneofftech.de |
| 163 | +- All contributors |
| 164 | + |