Spurious spaces are getting inserted in PDF with Krutidev text #324

@amankhandelia

Bug Description

opendataloader-pdf inserts spurious spaces into Krutidev-encoded Hindi text and reorders some characters, which breaks character sequences and corrupts the extracted text.

Environment

  • opendataloader-pdf version: 2.0.2

Steps to Reproduce

  1. Extract page 2 from the PDF using opendataloader-pdf
  2. Compare with pypdf extraction
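
For step 2, a minimal sketch of the pypdf side of the comparison could look like the following. It assumes pypdf is installed and that `page2.pdf` is the same single-page file used in the repro; the helper name `extract_with_pypdf` is made up for illustration and is not part of either library.

```python
from pathlib import Path


def extract_with_pypdf(pdf_path: str, page_index: int = 0) -> str:
    """Return the text pypdf extracts from one page of pdf_path."""
    # Imported lazily so this module still loads where pypdf is not installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return reader.pages[page_index].extract_text()


if __name__ == "__main__":
    pdf = Path("page2.pdf")  # assumed: the single-page file from the repro
    if pdf.exists():
        print(extract_with_pypdf(str(pdf)))
```

The printed text can then be compared line by line with the Markdown that opendataloader-pdf writes.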

Repro and Logs

$ uv run python -c "
import tempfile
from pathlib import Path
import opendataloader_pdf

with tempfile.TemporaryDirectory() as tmpdir:
    opendataloader_pdf.convert(
        input_path=['page2.pdf'],
        output_dir=tmpdir,
        format='markdown',
        markdown_page_separator='<!-- PAGE %page-number% -->',
        quiet=False,
    )
    
    md_files = list(Path(tmpdir).glob('*.md'))
    if md_files:
        content = md_files[0].read_text(encoding='utf-8')
        print(content)
" 2>/dev/null

Mar 20, 2026 1:29:22 PM org.opendataloader.pdf.processors.DocumentProcessor preprocessing
INFO: File name: /Users/amank/general/projects/decode_samvidhan/page2.pdf
Mar 20, 2026 1:29:22 PM org.opendataloader.pdf.processors.DocumentProcessor calculateDocumentInfo
INFO: Number of pages: 1
Mar 20, 2026 1:29:22 PM org.opendataloader.pdf.processors.DocumentProcessor calculateDocumentInfo
INFO: Author: null
Mar 20, 2026 1:29:22 PM org.opendataloader.pdf.processors.DocumentProcessor calculateDocumentInfo
INFO: Title: null
Mar 20, 2026 1:29:22 PM org.opendataloader.pdf.processors.DocumentProcessor calculateDocumentInfo
INFO: Creation date: null
Mar 20, 2026 1:29:22 PM org.opendataloader.pdf.processors.DocumentProcessor calculateDocumentInfo
INFO: Modification date: null
Mar 20, 2026 1:29:23 PM org.opendataloader.pdf.markdown.MarkdownGenerator writeToMarkdown
INFO: Created /var/folders/vn/wd_f5y9s75vdxl4_wd5cs2tw0000gn/T/tmpc5ztcqg2/page2.md
<!-- PAGE 1 -->

# izFke eqnz.k 1950 iqu% eqnz.k 1994 iqu% eqnz.k 2015

# ewY;% ``` 4000@&

## © 2015 yksd lHkk lfpoky;

ykds lHkk d s ifz Ø;k vkjS dk; Z lpa kyu fu;e (iUngz ok a lLa dj.k) d s fu;e 382 d svarxZr izdkf'kr rFkk tSudks vkVZ bafM;k] 13@10] MCY;w-bZ-,-] djksy ckx] ubZ fnYyh&110005 }kjk eqfnzrA

Expected Behavior

pypdf extracts correctly:

yksd lHkk ds izfØ;k vkSj dk;Z lapkyu fu;e (iUnzgoka laLdj.k) ds fu;e 382 ds varxZr izdkf'kr

Actual Behavior

opendataloader-pdf extracts with spurious spaces and character reordering:

ykds lHkk d s ifz Ø;k vkjS dk; Z lpa kyu fu;e (iUngz ok a lLa dj.k) d s fu;e 382 d svarxZr izdkf'kr
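
To separate the two failure modes, the sketch below compares the expected and actual strings quoted above using only the standard library. The helper names are illustrative, not part of any library. It checks whether the outputs match once whitespace is removed, and whether they at least contain the same characters.

```python
from collections import Counter

# Strings copied verbatim from the Expected and Actual sections of this report.
EXPECTED = "yksd lHkk ds izfØ;k vkSj dk;Z lapkyu fu;e (iUnzgoka laLdj.k) ds fu;e 382 ds varxZr izdkf'kr"
ACTUAL = "ykds lHkk d s ifz Ø;k vkjS dk; Z lpa kyu fu;e (iUngz ok a lLa dj.k) d s fu;e 382 d svarxZr izdkf'kr"


def strip_whitespace(text: str) -> str:
    """Remove every whitespace character so only glyph order is compared."""
    return "".join(text.split())


def first_mismatch(a: str, b: str) -> int:
    """Return the index of the first differing character, or -1 if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return -1 if len(a) == len(b) else min(len(a), len(b))


expected_chars = strip_whitespace(EXPECTED)
actual_chars = strip_whitespace(ACTUAL)

# True would mean the bug is purely about inserted spaces.
only_spacing_differs = expected_chars == actual_chars
# True means the same characters are present, just in a different order.
same_character_counts = Counter(expected_chars) == Counter(actual_chars)
mismatch_at = first_mismatch(expected_chars, actual_chars)

print("only spacing differs:", only_spacing_differs)
print("same character counts:", same_character_counts)
print("first mismatch index (whitespace removed):", mismatch_at)
```

For these two strings the whitespace-stripped text does not match, while the character counts do. Collapsing whitespace is therefore not a sufficient workaround: adjacent characters are also swapped, for example `yksd` becomes `ykds`.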

Test File

Attached: buggy.pdf

This page contains the problematic Krutidev-encoded Hindi text that demonstrates the bug.
