fix: add 'gr' as alias for Greek language in Tesseract language codes #4263

Closed
s0wa48 wants to merge 2 commits into Unstructured-IO:main from s0wa48:fix/issue-2939-text-extraction-issue-greek-l

Conversation

s0wa48 commented Feb 24, 2026

Summary

  • The issue reports that Greek-language PDFs are rendered with an incorrect alphabet when using languages=["gr"]
  • "gr" is the ISO 3166-1 alpha-2 country code for Greece, not a language code (the ISO 639-1 language code for Greek is "el"), but users commonly use it to refer to the Greek language
  • However, the TESSERACT_LANGUAGES_AND_CODES dictionary in constants.py only recognized "greek" and "greek, modern" as valid language identifiers for Tesseract's "ell" language pack
  • This fix adds "gr" as an additional alias for "ell" (the Tesseract code for modern Greek), so that users specifying languages=["gr"] will correctly use the Greek OCR model
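
The alias idea above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `LANGUAGE_ALIASES` and `resolve_tesseract_code` are hypothetical names, and the real TESSERACT_LANGUAGES_AND_CODES dictionary in constants.py is larger and may be structured differently.

```python
from typing import Optional

# Hypothetical, simplified stand-in for the alias table described above.
# Only the three Greek identifiers mentioned in this PR are included.
LANGUAGE_ALIASES = {
    "greek": "ell",
    "greek, modern": "ell",
    "gr": "ell",  # the additional alias this PR proposes
}


def resolve_tesseract_code(language: str) -> Optional[str]:
    """Map a user-supplied language identifier to a Tesseract code.

    Returns None when the identifier is not a known alias. Normalising
    case and surrounding whitespace is an assumption of this sketch.
    """
    return LANGUAGE_ALIASES.get(language.strip().lower())


if __name__ == "__main__":
    print(resolve_tesseract_code("gr"))
    print(resolve_tesseract_code("klingon"))
```

With a lookup like this, languages=["gr"] would resolve to the same "ell" language pack as "greek", while unknown identifiers return None so the caller can decide how to fall back.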

Fixes #2939


This PR was auto-generated by Gittensor bot using Claude AI to fix a reported issue.

s0wa48 commented Feb 24, 2026

Closing this PR due to unresolvable merge conflicts. Feel free to reopen if conflicts are resolved manually.

s0wa48 closed this Feb 24, 2026

Development

Successfully merging this pull request may close these issues.

Text Extraction Issue: Greek Language PDFs Rendered with Incorrect Alphabet
