Commit 4c885df

Review capitalization of descriptions
1 parent 432c841 commit 4c885df

1 file changed

Lines changed: 6 additions & 6 deletions

README.md

@@ -21,7 +21,7 @@ From parsers for extracting text, images, and tables, to automated PDF creation
 
 ## Parsers, OCR and extraction
 
-- [Parxy](https://github.com/OneOffTech/parxy) - A PDF parsers gateway. Access different parsers using a unified API.
+- [Parxy](https://github.com/OneOffTech/parxy) - A PDF parsers gateway to use different parsers using a unified API.
 - [Docling](https://github.com/docling-project/docling/) - Simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
 - [SmolDocling](https://huggingface.co/spaces/ds4sd/SmolDocling-256M-Demo) - A multimodal Image-Text-to-Text model designed for efficient document conversion. It retains Docling's most popular features while ensuring full compatibility with Docling through seamless support for DoclingDocuments.
 - [Filimoa/open-parse](https://github.com/Filimoa/open-parse/) - Improved file parsing for LLM's.
@@ -33,7 +33,7 @@ From parsers for extracting text, images, and tables, to automated PDF creation
 - [lumina-ai-inc/chunkr](https://github.com/lumina-ai-inc/chunkr) - Vision model based PDF chunking.
 - [lumina-ai-inc/PaddleOCR](https://github.com/lumina-ai-inc/PaddleOCR) - Multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices).
 - [allenai/olmocr](https://github.com/allenai/olmocr) - Toolkit for linearizing PDFs for LLM datasets/training.
-- [opendatalab/PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit) - A Comprehensive Toolkit for High-Quality PDF Content Extraction.
+- [opendatalab/PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit) - A comprehensive toolkit for high-quality PDF content extraction.
 - [smalot/pdfparser](https://github.com/smalot/pdfparser) - A standalone PHP library, provides various tools to extract data from a PDF file.
 - [Unstructured-IO/unstructured](https://github.com/Unstructured-IO/unstructured) - Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
 - [PyMuPDF4LLM](https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/) - Aimed to make it easier to extract PDF content in the format you need for LLM & RAG environments. It supports Markdown extraction as well as LlamaIndex document output.
@@ -49,7 +49,7 @@ From parsers for extracting text, images, and tables, to automated PDF creation
 - [Stirling-Tools/Stirling-PDF](https://github.com/Stirling-Tools/Stirling-PDF) - A locally hosted web-based PDF manipulation tool using Docker. It enables you to carry out various operations on PDF files, including splitting, merging, converting, reorganizing, adding images, rotating, compressing, and more. This locally hosted web application has evolved to encompass a comprehensive set of features, addressing all your PDF requirements.
 - [unjs/unpdf](https://github.com/unjs/unpdf) - Utilities to work with PDFs in Node.js, browser and workers.
 - [PdfRest](https://pdfrest.com/) - PDF Api to create, shrink and compress.
-- [Gotenberg](https://gotenberg.dev/) - A Docker-powered stateless API for creating PDF files from templates in various formats, e.g., HTML, Markdown, Word, Excel.
+- [Gotenberg](https://gotenberg.dev/) - A Docker-powered stateless API for creating PDF files from templates in various formats, e.g., html, markdown, word, excel.
 - [Smallpdf](https://smallpdf.com/) - Set of tools to extract and manipulate PDF content.
 - [typst/typst](https://github.com/typst/typst) - A new markup-based typesetting system that is powerful and easy to learn.
 - [Vexlio](https://vexlio.com/) - Tool to create diagrams and export in SVG or PDF.
@@ -68,10 +68,10 @@ From parsers for extracting text, images, and tables, to automated PDF creation
 
 ## Datasets
 
-- [tpn/pdfs](https://github.com/tpn/pdfs) - Technically-oriented PDF Collection (Papers, Specs, Decks, Manuals, etc).
+- [tpn/pdfs](https://github.com/tpn/pdfs) - Technically-oriented PDF collection (papers, specs, decks, manuals, etc).
 - [pdf-association/pdf-corpora](https://github.com/pdf-association/pdf-corpora) - An index of PDF-centric corpora.
-- [DS4SD/DocLayNet: DocLayNet](https://github.com/DS4SD/DocLayNet) - A Large Human-Annotated Dataset for Document-Layout Analysis.
-- [gipplab/pdf-benchmark](https://github.com/gipplab/pdf-benchmark) - A Benchmark of PDF Information Extraction Tools using a Multi-Task and Multi-Domain Evaluation Framework for Academic Documents.
+- [DS4SD/DocLayNet: DocLayNet](https://github.com/DS4SD/DocLayNet) - A large human-annotated dataset for document-layout analysis.
+- [gipplab/pdf-benchmark](https://github.com/gipplab/pdf-benchmark) - A benchmark of PDF information extraction tools using a multi-task and multi-domain evaluation framework for academic documents.
 - [DocBank Dataset](https://github.com/doc-analysis/DocBank) - DocBank is a new large-scale dataset that is constructed using a weak supervision approach. It enables models to integrate both the textual and layout information for downstream tasks. The current DocBank dataset totally includes 500K document pages, where 400K for training, 50K for validation and 50K for testing.
 
 ## Contributing
