Commit add6257

Add script to parse PDFs to huggingface dataset (#55)
* Add script to parse PDF with OCR and save text chunks to HuggingFace dataset
* Remove redundant option alias for model in PDF to HuggingFace dataset script
1 parent 77a7dc7 commit add6257

3 files changed

Lines changed: 870 additions & 0 deletions

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -44,6 +44,7 @@ dev = [
     "nbqa>=1.9.1",
     "pip-audit>=2.7.3",
     "pre-commit>=4.1.0",
+    "pymupdf>=1.26.7",
     "pytest>=8.3.4",
     "pytest-asyncio>=1.2.0",
     "pytest-cov>=7.0.0",
