README.md: 6 additions & 6 deletions
@@ -21,7 +21,7 @@ From parsers for extracting text, images, and tables, to automated PDF creation
 
 ## Parsers, OCR and extraction
 
-- [Parxy](https://github.com/OneOffTech/parxy) - A PDF parsers gateway. Access different parsers using a unified API.
+- [Parxy](https://github.com/OneOffTech/parxy) - A PDF parsers gateway to use different parsers using a unified API.
 - [Docling](https://github.com/docling-project/docling/) - Simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
 - [SmolDocling](https://huggingface.co/spaces/ds4sd/SmolDocling-256M-Demo) - A multimodal Image-Text-to-Text model designed for efficient document conversion. It retains Docling's most popular features while ensuring full compatibility with Docling through seamless support for DoclingDocuments.
 - [Filimoa/open-parse](https://github.com/Filimoa/open-parse/) - Improved file parsing for LLM's.
@@ -33,7 +33,7 @@ From parsers for extracting text, images, and tables, to automated PDF creation
 - [lumina-ai-inc/chunkr](https://github.com/lumina-ai-inc/chunkr) - Vision model based PDF chunking.
 - [lumina-ai-inc/PaddleOCR](https://github.com/lumina-ai-inc/PaddleOCR) - Multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices).
 - [allenai/olmocr](https://github.com/allenai/olmocr) - Toolkit for linearizing PDFs for LLM datasets/training.
-- [opendatalab/PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit) - A Comprehensive Toolkit for High-Quality PDF Content Extraction.
+- [opendatalab/PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit) - A comprehensive toolkit for high-quality PDF content extraction.
 - [smalot/pdfparser](https://github.com/smalot/pdfparser) - A standalone PHP library, provides various tools to extract data from a PDF file.
 - [Unstructured-IO/unstructured](https://github.com/Unstructured-IO/unstructured) - Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
 - [PyMuPDF4LLM](https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/) - Aimed to make it easier to extract PDF content in the format you need for LLM & RAG environments. It supports Markdown extraction as well as LlamaIndex document output.
@@ -49,7 +49,7 @@ From parsers for extracting text, images, and tables, to automated PDF creation
 - [Stirling-Tools/Stirling-PDF](https://github.com/Stirling-Tools/Stirling-PDF) - A locally hosted web-based PDF manipulation tool using Docker. It enables you to carry out various operations on PDF files, including splitting, merging, converting, reorganizing, adding images, rotating, compressing, and more. This locally hosted web application has evolved to encompass a comprehensive set of features, addressing all your PDF requirements.
 - [unjs/unpdf](https://github.com/unjs/unpdf) - Utilities to work with PDFs in Node.js, browser and workers.
 - [PdfRest](https://pdfrest.com/) - PDF Api to create, shrink and compress.
-- [Gotenberg](https://gotenberg.dev/) - A Docker-powered stateless API for creating PDF files from templates in various formats, e.g., HTML, Markdown, Word, Excel.
+- [Gotenberg](https://gotenberg.dev/) - A Docker-powered stateless API for creating PDF files from templates in various formats, e.g., html, markdown, word, excel.
 - [Smallpdf](https://smallpdf.com/) - Set of tools to extract and manipulate PDF content.
 - [typst/typst](https://github.com/typst/typst) - A new markup-based typesetting system that is powerful and easy to learn.
 - [Vexlio](https://vexlio.com/) - Tool to create diagrams and export in SVG or PDF.
@@ -68,10 +68,10 @@ From parsers for extracting text, images, and tables, to automated PDF creation
 
 ## Datasets
 
-- [tpn/pdfs](https://github.com/tpn/pdfs) - Technically-oriented PDF Collection (Papers, Specs, Decks, Manuals, etc).
+- [tpn/pdfs](https://github.com/tpn/pdfs) - Technically-oriented PDF collection (papers, specs, decks, manuals, etc).
 - [pdf-association/pdf-corpora](https://github.com/pdf-association/pdf-corpora) - An index of PDF-centric corpora.
-- [DS4SD/DocLayNet: DocLayNet](https://github.com/DS4SD/DocLayNet) - A Large Human-Annotated Dataset for Document-Layout Analysis.
-- [gipplab/pdf-benchmark](https://github.com/gipplab/pdf-benchmark) - A Benchmark of PDF Information Extraction Tools using a Multi-Task and Multi-Domain Evaluation Framework for Academic Documents.
+- [DS4SD/DocLayNet: DocLayNet](https://github.com/DS4SD/DocLayNet) - A large human-annotated dataset for document-layout analysis.
+- [gipplab/pdf-benchmark](https://github.com/gipplab/pdf-benchmark) - A benchmark of PDF information extraction tools using a multi-task and multi-domain evaluation framework for academic documents.
 - [DocBank Dataset](https://github.com/doc-analysis/DocBank) - DocBank is a new large-scale dataset that is constructed using a weak supervision approach. It enables models to integrate both the textual and layout information for downstream tasks. The current DocBank dataset totally includes 500K document pages, where 400K for training, 50K for validation and 50K for testing.