Bug Description
opendataloader-pdf inserts spurious spaces into Krutidev-encoded Hindi text and reorders some characters (for example, "yksd" comes out as "ykds"), which breaks character sequences and corrupts the extracted text.
Environment
- opendataloader-pdf version: 2.0.2
Steps to Reproduce
- Extract page 2 of the PDF with opendataloader-pdf (see the script under Repro and Logs)
- Extract the same page with pypdf and compare the two outputs
Repro and Logs
$ uv run python -c "
import tempfile
from pathlib import Path
import opendataloader_pdf

# Convert the page to Markdown in a throwaway directory
with tempfile.TemporaryDirectory() as tmpdir:
    opendataloader_pdf.convert(
        input_path=['page2.pdf'],
        output_dir=tmpdir,
        format='markdown',
        markdown_page_separator='<!-- PAGE %page-number% -->',
        quiet=False,
    )
    # Print the first generated Markdown file, if any
    md_files = list(Path(tmpdir).glob('*.md'))
    if md_files:
        content = md_files[0].read_text(encoding='utf-8')
        print(content)
" 2>/dev/null
Mar 20, 2026 1:29:22 PM org.opendataloader.pdf.processors.DocumentProcessor preprocessing
INFO: File name: /Users/amank/general/projects/decode_samvidhan/page2.pdf
Mar 20, 2026 1:29:22 PM org.opendataloader.pdf.processors.DocumentProcessor calculateDocumentInfo
INFO: Number of pages: 1
Mar 20, 2026 1:29:22 PM org.opendataloader.pdf.processors.DocumentProcessor calculateDocumentInfo
INFO: Author: null
Mar 20, 2026 1:29:22 PM org.opendataloader.pdf.processors.DocumentProcessor calculateDocumentInfo
INFO: Title: null
Mar 20, 2026 1:29:22 PM org.opendataloader.pdf.processors.DocumentProcessor calculateDocumentInfo
INFO: Creation date: null
Mar 20, 2026 1:29:22 PM org.opendataloader.pdf.processors.DocumentProcessor calculateDocumentInfo
INFO: Modification date: null
Mar 20, 2026 1:29:23 PM org.opendataloader.pdf.markdown.MarkdownGenerator writeToMarkdown
INFO: Created /var/folders/vn/wd_f5y9s75vdxl4_wd5cs2tw0000gn/T/tmpc5ztcqg2/page2.md
<!-- PAGE 1 -->
# izFke eqnz.k 1950 iqu% eqnz.k 1994 iqu% eqnz.k 2015
# ewY;% ``` 4000@&
## © 2015 yksd lHkk lfpoky;
ykds lHkk d s ifz Ø;k vkjS dk; Z lpa kyu fu;e (iUngz ok a lLa dj.k) d s fu;e 382 d svarxZr izdkf'kr rFkk tSudks vkVZ bafM;k] 13@10] MCY;w-bZ-,-] djksy ckx] ubZ fnYyh&110005 }kjk eqfnzrA
Expected Behavior
pypdf extracts correctly:
yksd lHkk ds izfØ;k vkSj dk;Z lapkyu fu;e (iUnzgoka laLdj.k) ds fu;e 382 ds varxZr izdkf'kr
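For anyone reproducing the comparison, a minimal sketch of the pypdf side is shown here. It assumes the same page2.pdf used in the repro script and that pypdf is installed; the helper name extract_with_pypdf is hypothetical and is not part of either library.

```python
from pathlib import Path


def extract_with_pypdf(pdf_path: str, page_index: int = 0) -> str:
    """Return the text of one page as extracted by pypdf.

    extract_with_pypdf is a hypothetical helper for this report.
    pypdf is imported inside the function so the snippet can be loaded
    even when pypdf is not installed.
    """
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return reader.pages[page_index].extract_text()


if __name__ == "__main__":
    # Assumption: page2.pdf is the single-page file from the repro script.
    if Path("page2.pdf").exists():
        print(extract_with_pypdf("page2.pdf"))
```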
Actual Behavior
opendataloader-pdf extracts with spurious spaces and character reordering:
ykds lHkk d s ifz Ø;k vkjS dk; Z lpa kyu fu;e (iUngz ok a lLa dj.k) d s fu;e 382 d svarxZr izdkf'kr
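To separate the two symptoms, the short check below may help. It is only a sketch using the standard library: it compares an expected and an actual string with whitespace removed, then compares their character counts. The function name classify_mismatch and its labels are made up for this report, and the sample strings are the first few words of the two extractions above.

```python
from collections import Counter


def strip_ws(text: str) -> str:
    """Remove all whitespace so only the glyph sequence is compared."""
    return "".join(text.split())


def classify_mismatch(expected: str, actual: str) -> str:
    """Label how `actual` differs from `expected` (hypothetical helper)."""
    if expected == actual:
        return "identical"
    e, a = strip_ws(expected), strip_ws(actual)
    if e == a:
        # Same characters in the same order; only spacing differs.
        return "whitespace-only"
    if Counter(e) == Counter(a):
        # Same characters, but their order changed.
        return "reordered"
    return "different"


if __name__ == "__main__":
    expected = "yksd lHkk ds izfØ;k vkSj"   # from the pypdf output
    actual = "ykds lHkk d s ifz Ø;k vkjS"   # from the opendataloader-pdf output
    print(classify_mismatch(expected, actual))
```

A "reordered" result would suggest the problem goes beyond word-spacing heuristics, since removing the extra spaces alone would not restore the original character order.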
Test File
Attached: buggy.pdf
The attached page contains the Krutidev-encoded Hindi text that triggers the bug.