fix: convert Tesseract language codes for PaddleOCR in OCRAgent.get_agent() #4329

Open
Mustafa-Shoukat1 wants to merge 1 commit into Unstructured-IO:main from Mustafa-Shoukat1:mustafa-shoukat/fix-paddleocr-language-mapping

Mustafa-Shoukat1 commented Apr 9, 2026

Fixes #3957 — PaddleOCR receives Tesseract language codes (e.g., eng) instead of its own codes (e.g., en) when using the ocr_only strategy.

Root Cause

All language codes are unconditionally converted to Tesseract format via prepare_languages_for_tesseract() early in the pipeline (pdf.py L344). The hi_res path correctly converts these back to PaddleOCR format in supplement_page_layout_with_ocr(), but the ocr_only path passes them directly through OCRAgent.get_agent() without conversion. The same issue also affects unstructured/metrics/table_structure.py which calls get_agent(language="eng").
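The mismatch between the two paths can be sketched as follows. This is a simplified model, not the library's code: the function names, the `+`-joined code format, the `"en"` fallback, and the four-entry mapping are all assumptions made for illustration.

```python
# Hypothetical, simplified model of the two partition paths described above.
# None of these names are the library's real functions.

# Illustrative subset of a Tesseract -> PaddleOCR mapping (assumed values).
TESSERACT_TO_PADDLE = {"eng": "en", "ara": "ar", "chi_sim": "ch", "deu": "german"}


def normalize_for_tesseract(languages):
    """Stand-in for the early normalization step: join codes Tesseract-style."""
    return "+".join(languages)


def hi_res_language(tesseract_codes):
    """hi_res path: converts back to a PaddleOCR code before running OCR."""
    first = tesseract_codes.split("+")[0]
    return TESSERACT_TO_PADDLE.get(first, "en")  # fallback is an assumption


def ocr_only_language_before_fix(tesseract_codes):
    """ocr_only path before the fix: the code is forwarded unchanged."""
    return tesseract_codes


codes = normalize_for_tesseract(["eng"])
print(hi_res_language(codes))               # PaddleOCR-style code
print(ocr_only_language_before_fix(codes))  # still a Tesseract-style code
```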

Fix

Added language conversion directly in OCRAgent.get_agent() — when PaddleOCR is detected as the configured agent, Tesseract language codes are converted via tesseract_to_paddle_language() before instantiation. This centralizes the fix so all get_agent() callers benefit.

The hi_res path is unaffected since it calls OCRAgent.get_instance() directly (bypassing get_agent()).
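A minimal sketch of the centralized conversion, assuming a simplified agent class. `OCRAgentSketch`, `to_paddle_language`, the `"paddle"` substring check, and the pass-through for unknown codes are hypothetical stand-ins; the actual change uses tesseract_to_paddle_language() inside OCRAgent.get_agent().

```python
# Hypothetical sketch of the approach; not the code in ocr_interface.py.

# Illustrative subset of a Tesseract -> PaddleOCR mapping (assumed values).
_TESSERACT_TO_PADDLE = {"eng": "en", "ara": "ar", "chi_sim": "ch", "deu": "german"}


def to_paddle_language(code: str) -> str:
    """Stand-in converter: unknown codes pass through (an assumption)."""
    return _TESSERACT_TO_PADDLE.get(code, code)


class OCRAgentSketch:
    def __init__(self, agent_module: str, language: str):
        self.agent_module = agent_module
        self.language = language

    @classmethod
    def get_instance(cls, agent_module: str, language: str) -> "OCRAgentSketch":
        # Direct instantiation: no conversion happens here, which is why a
        # caller that converts beforehand (like hi_res) is unaffected.
        return cls(agent_module, language)

    @classmethod
    def get_agent(cls, language: str, agent_module: str) -> "OCRAgentSketch":
        # Convert only when the configured agent is PaddleOCR.
        if "paddle" in agent_module.lower():
            language = to_paddle_language(language)
        return cls.get_instance(agent_module, language)


paddle_agent = OCRAgentSketch.get_agent("eng", "ocr_models.paddle_ocr.OCRAgentPaddle")
tesseract_agent = OCRAgentSketch.get_agent("eng", "ocr_models.tesseract_ocr.OCRAgentTesseract")
print(paddle_agent.language, tesseract_agent.language)
```

Putting the conversion in the factory method means a caller that only knows Tesseract-style codes does not need its own PaddleOCR special case.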

Changes

  • unstructured/partition/utils/ocr_models/ocr_interface.py: Added tesseract_to_paddle_language import and language conversion in get_agent() when PaddleOCR is configured
  • test_unstructured/partition/utils/ocr_models/test_ocr_interface.py: Added 2 test methods (1 direct + 4 parametrized cases) verifying PaddleOCR language conversion through get_agent()
  • CHANGELOG.md: Added fix entry under 0.22.18
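One way such a test could verify the conversion without loading a real OCR engine is to patch the instantiation step and inspect the language it receives. The sketch below uses a hypothetical `Agent` class and helper, not the PR's actual test code.

```python
from unittest import mock


class Agent:
    """Hypothetical stand-in for the OCR agent factory under test."""

    @classmethod
    def get_instance(cls, agent_module, language):
        raise NotImplementedError("a real agent would load an OCR engine here")

    @classmethod
    def get_agent(cls, language, agent_module):
        # Assumed behavior: convert 'eng' to 'en' only for a PaddleOCR agent.
        if "paddle" in agent_module.lower():
            language = {"eng": "en"}.get(language, language)
        return cls.get_instance(agent_module, language)


def check_language_forwarded(agent_module, given, expected):
    """Assert that get_agent() hands `expected` to get_instance()."""
    with mock.patch.object(Agent, "get_instance") as fake_get_instance:
        Agent.get_agent(given, agent_module)
        fake_get_instance.assert_called_once_with(agent_module, expected)


check_language_forwarded("paddle_ocr.OCRAgentPaddle", "eng", "en")
check_language_forwarded("tesseract_ocr.OCRAgentTesseract", "eng", "eng")
```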

Testing

  • All 15 tests in test_ocr_interface.py pass (including 6 new test cases)
  • All 57 tests in test_lang.py pass (unchanged)
  • Tested language conversions: eng → en, ara → ar, chi_sim → ch, deu → german
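The four conversions above could be expressed as a table-driven check along these lines. The mapping and `to_paddle` helper are hypothetical stand-ins for tesseract_to_paddle_language(), and the sketch uses `unittest` sub-tests rather than the parametrized cases in the PR.

```python
import unittest

# Hypothetical stand-in table; the real converter lives in the library.
_MAPPING = {"eng": "en", "ara": "ar", "chi_sim": "ch", "deu": "german"}


def to_paddle(code: str) -> str:
    """Stand-in converter: unknown codes pass through (an assumption)."""
    return _MAPPING.get(code, code)


# (Tesseract code, expected PaddleOCR code) pairs reported in this PR.
CASES = [("eng", "en"), ("ara", "ar"), ("chi_sim", "ch"), ("deu", "german")]


class TestLanguageConversion(unittest.TestCase):
    def test_reported_pairs(self):
        for tesseract_code, paddle_code in CASES:
            # subTest reports each pair separately if one of them fails.
            with self.subTest(tesseract_code=tesseract_code):
                self.assertEqual(to_paddle(tesseract_code), paddle_code)
```

Saved as a module, this can be run with `python -m unittest`.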

…gent()

When PaddleOCR is configured as the OCR agent, OCRAgent.get_agent() now
converts Tesseract language codes (e.g., 'eng') to PaddleOCR language
codes (e.g., 'en') before instantiating the agent.

Previously, Tesseract-format codes were passed directly to PaddleOCR in
the ocr_only strategy path and the table_structure module, causing
language detection failures.

The hi_res path already handled this conversion in ocr.py, but the
ocr_only path via get_agent() did not. This fix centralizes the
conversion in get_agent() so all callers benefit.

Closes Unstructured-IO#3957

Mustafa-Shoukat1 (Author) commented

Hi, friendly ping - CI is passing (2/2 checks). This fixes the language codes passed to PaddleOCR in the ocr_only path. Happy to address any feedback or make changes!

Mustafa-Shoukat1 (Author) commented

@badGarnet @cragwolfe - would you have a moment to review this? It fixes the PaddleOCR language code bug reported in #3957 (Tesseract codes like eng are passed to PaddleOCR instead of en). The fix is small (centralizes conversion in get_agent()), CI is passing (2/2), and includes 6 new test cases. Happy to address any feedback!

Development

Successfully merging this pull request may close these issues.

bug/PaddleOCR language specification issue
