-# ImageOcr
+# Image.OCR (image_ocr)

 Idiomatic Elixir interface to the [Tesseract](https://github.com/tesseract-ocr/tesseract)
 OCR engine. Implemented as a NIF over the Tesseract 5.x C++ API; accepts
@@ -10,18 +10,61 @@ recognised text.
 * Tesseract **≥ 5.0** and Leptonica installed at build time, with both
   reachable via `pkg-config`.

+### macOS
+
+```bash
+brew install tesseract leptonica pkg-config
+```
+
+Xcode Command Line Tools (for `clang++`) must be installed:
+`xcode-select --install`.
+
+### Debian / Ubuntu
+
+```bash
+sudo apt-get install -y \
+  build-essential pkg-config \
+  libtesseract-dev libleptonica-dev tesseract-ocr
+```
+
+Ubuntu **24.04+** is required for Tesseract ≥ 5.0; 22.04 ships 4.x and
+will not build. On 22.04 either upgrade or install Tesseract 5 from a
+PPA / source.
+
+### Fedora / RHEL / CentOS Stream
+
+```bash
+sudo dnf install -y \
+  gcc-c++ pkgconf-pkg-config \
+  tesseract-devel leptonica-devel
+```
+
+### Arch / Manjaro
+
 ```bash
-# macOS
-brew install tesseract leptonica
+sudo pacman -S base-devel pkgconf tesseract leptonica
+```

-# Debian / Ubuntu
-apt-get install libtesseract-dev libleptonica-dev pkg-config
+### Alpine

-# Alpine
-apk add tesseract-ocr-dev leptonica-dev pkg-config
+```bash
+apk add build-base pkgconf tesseract-ocr-dev leptonica-dev
 ```

-* Elixir **≥ 1.20-rc** and OTP 27 or 28.
+### Windows
+
+Native Windows builds are not supported out of the box. Use **WSL2 with
+Ubuntu 24.04** and follow the Debian/Ubuntu instructions above; this is
+the path of least resistance and is what we test against.
+
+Building natively requires MSYS2 / MinGW-w64 with `g++`, `pkg-config`,
+`mingw-w64-x86_64-tesseract-ocr`, and `mingw-w64-x86_64-leptonica`
+available on `PATH`. Untested upstream; patches welcome.
+
+* Elixir **≥ 1.17** and OTP **≥ 26**.
+
+* A working C++17 compiler (`g++` or `clang++`) and `pkg-config` on the
+  build host. The NIF is built with `elixir_make` on first compile.

 ## Installation

@@ -46,8 +89,8 @@ so the package is usable out of the box.
 ## Quick start

 ```elixir
-{:ok, ocr} = ImageOcr.new() # defaults to language: "en"
-{:ok, text} = ImageOcr.read_text(ocr, "page.png")
+{:ok, ocr} = Image.OCR.new() # defaults to language: "en"
+{:ok, text} = Image.OCR.read_text(ocr, "page.png")
 ```

 `read_text/3` accepts:
@@ -58,29 +101,66 @@ so the package is usable out of the box.
   via `Vix.Vips.Image.new_from_buffer/1`

 For per-word output with confidence and bounding boxes, use
-`ImageOcr.recognize/3`:
+`Image.OCR.recognize/3`:

 ```elixir
-{:ok, words} = ImageOcr.recognize(ocr, image)
+{:ok, words} = Image.OCR.recognize(ocr, image)
 # => [%{text: "Hello", confidence: 96.4, bbox: {32, 18, 198, 64}}, …]
 ```

+## Locales
+
+The `:locale` option (and the mix-task language arguments) accepts:
+
+* **ISO 639-1** two-letter codes: `"en"`, `:en`, `"fr"`, `:de`, `"ja"`.
+
+* **BCP-47** tags for region- or script-specific variants: `"zh-Hans"`
+  (Simplified Chinese), `"zh-Hant"` (Traditional), `"sr-Latn"` (Serbian
+  in Latin script), `"az-Cyrl"`. The built-in table covers the common
+  cases.
+
+* **Any BCP-47 locale** (`"en-US"`, `"fr-CA"`, `"zh-Hans-CN"`,
+  `"sr-Latn-RS"`) when the optional [`:localize`](https://hex.pm/packages/localize)
+  dependency is installed. With Localize, the locale is parsed and the
+  language + script subtags are used to pick the right Tesseract trained
+  data; territory subtags are ignored (Tesseract doesn't differentiate by
+  territory).
+
+* **Tesseract codes** verbatim: `"frk"` (German Fraktur), `"osd"`
+  (orientation/script detection), `"script/Latin"`.
+
+* **`+`-joined combinations**: `"en+fr"`, `"chi_sim+eng"`, `"ja+en"`.
+
+`"zh"` on its own is rejected as ambiguous; use `"zh-Hans"` or
+`"zh-Hant"`. See `Image.OCR.Languages` for the full mapping table.
+
+To enable BCP-47 parsing add Localize to your project:
+
+```elixir
+def deps do
+  [
+    {:image_ocr, "~> 0.1.0"},
+    {:localize, "~> 0.25"}
+  ]
+end
+```
+
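The locale forms listed in the new Locales section can be illustrated with a small stand-alone sketch. It is not the code in `Image.OCR.Languages`: the module name `LocaleSketch`, the six-entry table, and the `{:error, {:ambiguous_locale, _}}` return shape are all assumptions made for illustration, and BCP-47 parsing via Localize is left out entirely. It only shows the general idea of splitting on `+`, mapping known codes, passing unknown ones through verbatim, and rejecting a bare `"zh"`.

```elixir
defmodule LocaleSketch do
  # Hypothetical, deliberately tiny table. The real mapping lives in
  # Image.OCR.Languages and may differ in both content and shape.
  @table %{
    "en" => "eng",
    "fr" => "fra",
    "de" => "deu",
    "ja" => "jpn",
    "zh-Hans" => "chi_sim",
    "zh-Hant" => "chi_tra"
  }

  @doc "Resolves an atom or `+`-joined string into a Tesseract language string."
  def resolve(locale) when is_atom(locale), do: resolve(Atom.to_string(locale))

  def resolve(locale) when is_binary(locale) do
    locale
    |> String.split("+")
    |> Enum.reduce_while({:ok, []}, fn part, {:ok, acc} ->
      case resolve_one(part) do
        {:ok, code} -> {:cont, {:ok, [code | acc]}}
        {:error, _reason} = error -> {:halt, error}
      end
    end)
    |> case do
      {:ok, codes} -> {:ok, codes |> Enum.reverse() |> Enum.join("+")}
      error -> error
    end
  end

  # A bare "zh" does not say which script is wanted, so it is rejected.
  defp resolve_one("zh"), do: {:error, {:ambiguous_locale, "zh"}}

  # Anything not in the table passes through unchanged, which is how
  # verbatim Tesseract codes such as "frk" or "script/Latin" survive.
  defp resolve_one(part), do: {:ok, Map.get(@table, part, part)}
end
```

Under these assumptions, `LocaleSketch.resolve("ja+en")` yields `{:ok, "jpn+eng"}`, while `LocaleSketch.resolve("zh")` yields an error tuple instead of silently picking a script.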
 ## Concurrency

-A single `ImageOcr` instance wraps one `tesseract::TessBaseAPI`, which is **not
+A single `Image.OCR` instance wraps one `tesseract::TessBaseAPI`, which is **not
 safe for concurrent use**. The NIF guards each instance with a mutex so
 accidental sharing degrades to serialisation rather than UB, but for real
 parallelism you want one instance per worker. The simplest way is the
 included pool:

 ```elixir
 children = [
-  {ImageOcr.Pool, name: MyOcr, language: "eng", pool_size: 4}
+  {Image.OCR.Pool, name: MyOcr, locale: "en", pool_size: 4}
 ]

 Supervisor.start_link(children, strategy: :one_for_one)

-{:ok, text} = ImageOcr.Pool.read_text(MyOcr, "page.png")
+{:ok, text} = Image.OCR.Pool.read_text(MyOcr, "page.png")
 ```

 `pool_size` defaults to `System.schedulers_online()`. Each worker holds the
@@ -95,7 +175,7 @@ schedulers regardless of pool size.

 The trained-data directory is resolved in this order:

-1. The `:datapath` option passed to `ImageOcr.new/1`.
+1. The `:datapath` option passed to `Image.OCR.new/1`.
 2. `Application.get_env(:image_ocr, :tessdata_path)`.
 3. The `TESSDATA_PREFIX` environment variable.
 4. The vendored fallback at `priv/tessdata/`.
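The four-step resolution order above can be sketched as a first-match lookup. This is a minimal illustration, not the library's implementation: the module name `DatapathSketch` is hypothetical, and the fallback is written as a relative `priv/tessdata` path for simplicity, whereas a real package would more likely locate its own `priv` directory at runtime.

```elixir
defmodule DatapathSketch do
  @doc "Returns the first configured trained-data directory, in priority order."
  def resolve(opts \\ []) do
    [
      # 1. Explicit option passed by the caller.
      fn -> Keyword.get(opts, :datapath) end,
      # 2. Application configuration.
      fn -> Application.get_env(:image_ocr, :tessdata_path) end,
      # 3. Environment variable.
      fn -> System.get_env("TESSDATA_PREFIX") end,
      # 4. Vendored fallback (assumed relative path, for illustration only).
      fn -> Path.join("priv", "tessdata") end
    ]
    # Each source is evaluated lazily; the first non-nil result wins.
    |> Enum.find_value(fn source -> source.() end)
  end
end
```

Wrapping each source in a function keeps the lookup lazy, so lower-priority sources are only consulted when the higher ones return `nil`. Note that this sketch treats an empty `TESSDATA_PREFIX` as a valid value; whether the library does the same is not stated in the README.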
@@ -112,26 +192,29 @@ config :image_ocr, tessdata_path: "/var/lib/image_ocr/tessdata"
 Manage trained-data files without leaving your project:

 ```bash
-# Install one or more languages
-mix image_ocr.tessdata.add fra deu
+# Install one or more languages (ISO 639-1 codes)
+mix image.ocr.tessdata.add fr de
+
+# BCP-47 for region/script-specific variants
+mix image.ocr.tessdata.add zh-Hans zh-Hant sr-Latn

-# Use a specific variant ("fast" / "best" / "legacy") or branch
-mix image_ocr.tessdata.add chi_sim --variant best
+# Pick a variant: fast (default, ~2-4 MB), best (~10-15 MB), legacy (largest)
+mix image.ocr.tessdata.add en --variant best

-# Write to a specific directory (overrides config)
-mix image_ocr.tessdata.add jpn --path /var/lib/tessdata
+# Write to a specific directory (overrides config and TESSDATA_PREFIX)
+mix image.ocr.tessdata.add ja --path /var/lib/tessdata

 # Refresh every installed language to its latest upstream commit
-mix image_ocr.tessdata.update
+mix image.ocr.tessdata.update

 # Show what's installed
-mix image_ocr.tessdata.list
+mix image.ocr.tessdata.list

 # Remove a language
-mix image_ocr.tessdata.remove deu
+mix image.ocr.tessdata.remove de
 ```

-The tasks read from and write to the same path that `ImageOcr.new/1` does, so
+The tasks read from and write to the same path that `Image.OCR.new/1` does, so
 there is one source of truth.

 ## Tesseract 4.x vs 5.x
@@ -143,6 +226,12 @@ SIMD use and float32 models. The C++ API surface we use is identical between
 4.x and 5.x, so 4.1+ would likely work, but we keep the support matrix
 tight.

+## Livebook
+
+An interactive demonstration is at [`notebooks/demo.livemd`](notebooks/demo.livemd).
+It covers one-shot OCR, reusable instances, per-word bounding boxes, the
+NimblePool, PSM/SetVariable tweaks, and uploading your own image.
+
 ## License

 Apache-2.0.