fix: convert Tesseract language codes for PaddleOCR in OCRAgent.get_agent() #4329
…gent() When PaddleOCR is configured as the OCR agent, OCRAgent.get_agent() now converts Tesseract language codes (e.g., 'eng') to PaddleOCR language codes (e.g., 'en') before instantiating the agent. Previously, Tesseract-format codes were passed directly to PaddleOCR in the ocr_only strategy path and the table_structure module, causing language detection failures. The hi_res path already handled this conversion in ocr.py, but the ocr_only path via get_agent() did not. This fix centralizes the conversion in get_agent() so all callers benefit. Closes Unstructured-IO#3957
Hi, friendly ping - CI is passing (2/2 checks). This adds PaddleOCR as a lightweight alternative OCR option. Happy to address any feedback or make changes!
@badGarnet @cragwolfe - would you have a moment to review this? It fixes the PaddleOCR language code bug reported in #3957 (Tesseract codes like
Fixes #3957: PaddleOCR receives Tesseract language codes (e.g., `eng`) instead of its own codes (e.g., `en`) when using the `ocr_only` strategy.

Root Cause

All language codes are unconditionally converted to Tesseract format via `prepare_languages_for_tesseract()` early in the pipeline (`pdf.py` L344). The `hi_res` path correctly converts these back to PaddleOCR format in `supplement_page_layout_with_ocr()`, but the `ocr_only` path passes them directly through `OCRAgent.get_agent()` without conversion. The same issue also affects `unstructured/metrics/table_structure.py`, which calls `get_agent(language="eng")`.

Fix

Added language conversion directly in `OCRAgent.get_agent()`: when PaddleOCR is detected as the configured agent, Tesseract language codes are converted via `tesseract_to_paddle_language()` before instantiation. This centralizes the fix so all `get_agent()` callers benefit.

The `hi_res` path is unaffected since it calls `OCRAgent.get_instance()` directly (bypassing `get_agent()`).

Changes

- `unstructured/partition/utils/ocr_models/ocr_interface.py`: Added `tesseract_to_paddle_language` import and language conversion in `get_agent()` when PaddleOCR is configured
- `test_unstructured/partition/utils/ocr_models/test_ocr_interface.py`: Added 2 test methods (1 direct + 4 parametrized cases) verifying PaddleOCR language conversion through `get_agent()`
- `CHANGELOG.md`: Added fix entry under 0.22.18

Testing

- `test_ocr_interface.py` pass (including 6 new test cases)
- `test_lang.py` pass (unchanged)
- `eng` → `en`, `ara` → `ar`, `chi_sim` → `ch`, `deu` → `german`
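
For reviewers who want the shape of the change without opening the diff, here is a minimal, self-contained sketch of the idea. It is not the code from this PR: the mapping table contains only the four pairs listed above, the agent-name constants and `resolve_agent_language()` are hypothetical, and the local `tesseract_to_paddle_language()` is a stand-in for the library helper of the same name. Passing unmapped codes through unchanged is an assumption of this sketch, not a statement about the library's behavior.

```python
# Illustrative stand-in table: only the four pairs named in this PR description.
_TESSERACT_TO_PADDLE = {
    "eng": "en",
    "ara": "ar",
    "chi_sim": "ch",
    "deu": "german",
}

# Hypothetical agent identifiers used only by this sketch.
AGENT_PADDLE = "paddle"
AGENT_TESSERACT = "tesseract"


def tesseract_to_paddle_language(code: str) -> str:
    """Stand-in for the library helper of the same name.

    Assumption: codes missing from the table are returned unchanged.
    """
    return _TESSERACT_TO_PADDLE.get(code, code)


def resolve_agent_language(agent_name: str, language: str) -> str:
    """Return the language code to hand to the selected OCR agent.

    Only the PaddleOCR branch converts; every other agent keeps the
    Tesseract-format code it was given.
    """
    if agent_name == AGENT_PADDLE:
        return tesseract_to_paddle_language(language)
    return language


if __name__ == "__main__":
    for code in ("eng", "ara", "chi_sim", "deu"):
        print(code, "->", resolve_agent_language(AGENT_PADDLE, code))
```

Doing the conversion at the single point where the agent is chosen means callers such as the `ocr_only` path and `table_structure.py` can keep passing Tesseract-format codes without each adding its own conversion.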