fix: add 'gr' as alias for Greek language in Tesseract language codes by s0wa48 · Pull Request #4263 · Unstructured-IO/unstructured

s0wa48 · 2026-02-24T16:10:13Z

Summary

The issue reports that Greek-language PDFs are rendered in the wrong alphabet when using languages=["gr"].
"gr" is the ISO 3166-1 alpha-2 country code for Greece, not a language code (the ISO 639-1 code for Greek is "el"), but users commonly use it to refer to the Greek language.
However, the TESSERACT_LANGUAGES_AND_CODES dictionary in constants.py only recognized "greek" and "greek, modern" as valid language identifiers for Tesseract's "ell" language pack.
This fix adds "gr" as an additional alias for "ell" (the Tesseract code for modern Greek), so that users specifying languages=["gr"] will correctly use the Greek OCR model.
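
A minimal sketch of the intended behavior, assuming the dictionary maps lowercase language names to Tesseract codes. Only the three Greek entries named above are shown; the real dictionary in constants.py contains many more. The to_tesseract_code helper is hypothetical and exists only to illustrate the lookup; it is not the library's actual conversion function.

```python
from typing import Optional

# Illustrative excerpt only: the real TESSERACT_LANGUAGES_AND_CODES
# dictionary in constants.py has many more entries.
TESSERACT_LANGUAGES_AND_CODES = {
    "greek": "ell",
    "greek, modern": "ell",
    "gr": "ell",  # alias added by this PR
}


def to_tesseract_code(lang: str) -> Optional[str]:
    """Hypothetical helper: map a user-supplied identifier to a Tesseract code.

    Returns None when the identifier is not a known alias.
    """
    return TESSERACT_LANGUAGES_AND_CODES.get(lang.strip().lower())


if __name__ == "__main__":
    # With the alias in place, "gr" resolves to the Greek language pack.
    print(to_tesseract_code("gr"))
    # Unknown identifiers are not mapped in this sketch.
    print(to_tesseract_code("xx"))
```

Without the "gr" entry, the lookup in this sketch would return None for languages=["gr"], which corresponds to the Greek model not being selected.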

Fixes #2939

This PR was auto-generated by Gittensor bot using Claude AI to fix a reported issue.

s0wa48 · 2026-02-24T19:48:59Z

Closing this PR due to unresolvable merge conflicts. Feel free to reopen if conflicts are resolved manually.

s0wa48 added 2 commits February 24, 2026 17:10

fix: add 'gr' as alias for Greek language in Tesseract language codes

16f1de5

fix: address CI failures

d24b157

s0wa48 closed this Feb 24, 2026