Commit a086f78

Rename ImageOcr → Image.OCR; add :locale option with optional Localize-backed BCP-47 support
1 parent 1b2bbc3 commit a086f78

24 files changed

Lines changed: 702 additions & 191 deletions

CHANGELOG.md

Lines changed: 75 additions & 11 deletions
@@ -2,15 +2,79 @@

 ## 0.1.0 (initial release)

-* NIF binding to Tesseract 5.x via `tesseract::TessBaseAPI`.
-* `ImageOcr.new/1`, `ImageOcr.read_text/3`, `ImageOcr.recognize/3`,
-  `ImageOcr.quick_read/2`, `ImageOcr.tesseract_version/0`.
-* `ImageOcr.Input` accepts `Vix.Vips.Image.t()`, file paths, and in-memory
-  encoded image binaries.
-* `ImageOcr.Pool`: `NimblePool`-backed pool of OCR instances for parallel
-  recognition.
-* `ImageOcr.Tessdata` for trained-data path resolution. Lookup order:
+### Core
+
+* NIF binding to Tesseract 5.x (`tesseract::TessBaseAPI`) built with
+  `elixir_make`. Refuses to build against Tesseract < 5.0.
+
+* All recognition entry points are dirty-CPU-bound NIFs and do not block
+  normal schedulers.
+
+* Per-instance `ErlNifMutex` so accidental concurrent use of a single
+  instance degrades to serialisation rather than undefined behaviour.
+
+### API
+
+* `Image.OCR.new/1`: build a reusable OCR instance with `:locale`,
+  `:datapath`, `:psm`, `:variables` options.
+
+* `Image.OCR.read_text/3`: recognise text into a UTF-8 string.
+
+* `Image.OCR.recognize/3`: per-word results with confidence and bounding
+  boxes.
+
+* `Image.OCR.quick_read/2`: one-shot convenience.
+
+* `Image.OCR.tesseract_version/0`: linked Tesseract library version.
+
+### Inputs
+
+* `Image.OCR.Input` accepts `Vix.Vips.Image.t()`, file paths, charlists,
+  and in-memory binaries of encoded image data (PNG, JPEG, TIFF, …).
+
+* Auto-normalisation of pixel format (cast to 8-bit, RGBA flatten, band
+  reduction) before recognition.
+
+### Concurrency
+
+* `Image.OCR.Pool`: `NimblePool`-backed pool of OCR instances. One
+  `TessBaseAPI*` per worker for true parallelism. Pool size defaults to
+  `System.schedulers_online()`.
+
+### Locales
+
+* The `:locale` option on `Image.OCR.new/1` accepts ISO 639-1 codes
+  (`"en"`, `:fr`), BCP-47 region/script tags (`"zh-Hans"`, `"sr-Latn"`),
+  Tesseract codes verbatim (`"frk"`, `"osd"`), and `+`-joined
+  combinations (`"en+fr"`).
+
+* `"zh"` is rejected as ambiguous: callers must use `"zh-Hans"` or
+  `"zh-Hant"`.
+
+* Optional [`:localize`](https://hex.pm/packages/localize) dependency
+  enables full BCP-47 parsing (`"en-US"`, `"fr-CA"`, `"zh-Hans-CN"`,
+  `"sr-Latn-RS"`). Compile-fenced: Localize is not required.
+
+### Trained data
+
+* `Image.OCR.Tessdata` resolves the trained-data directory in this order:
   explicit `:datapath` option → `:image_ocr, :tessdata_path` config →
-  `TESSDATA_PREFIX` env var → vendored `priv/tessdata/`.
-* Vendored English (`eng`) trained-data from `tessdata_fast`.
-* Mix tasks: `image_ocr.tessdata.{add,update,list,remove}`.
+  `TESSDATA_PREFIX` env → vendored `priv/tessdata/`.
+
+* English (`eng`) `tessdata_fast` is vendored so the package is usable
+  out of the box.
+
+* Mix tasks: `image.ocr.tessdata.{add,update,list,remove}` accept the
+  same language identifiers as the runtime API. `--variant` selects
+  `fast` (default), `best`, or `legacy`.
+
+### Tooling
+
+* CI matrix across Elixir 1.17 / 1.18 / 1.19 / 1.20-rc and OTP 26 / 27 /
+  28. Lint cell runs `mix format --check-formatted` and
+  `mix dialyzer`.
+
+* Dialyzer configured with `plt_add_apps: [:mix, :inets, :ssl,
+  :public_key, :ex_unit]`.
+
+* Demo Livebook at `notebooks/demo.livemd`.

README.md

Lines changed: 115 additions & 26 deletions
@@ -1,4 +1,4 @@
-# ImageOcr
+# Image.OCR (image_ocr)

 Idiomatic Elixir interface to the [Tesseract](https://github.com/tesseract-ocr/tesseract)
 OCR engine. Implemented as a NIF over the Tesseract 5.x C++ API; accepts
@@ -10,18 +10,61 @@ recognised text.
 * Tesseract **≥ 5.0** and Leptonica installed at build time, with both
   reachable via `pkg-config`.

+### macOS
+
+```bash
+brew install tesseract leptonica pkg-config
+```
+
+Xcode Command Line Tools (for `clang++`) must be installed:
+`xcode-select --install`.
+
+### Debian / Ubuntu
+
+```bash
+sudo apt-get install -y \
+  build-essential pkg-config \
+  libtesseract-dev libleptonica-dev tesseract-ocr
+```
+
+Ubuntu **24.04+** is required for Tesseract ≥ 5.0; 22.04 ships 4.x and
+will not build. On 22.04 either upgrade or install Tesseract 5 from a
+PPA / source.
+
+### Fedora / RHEL / CentOS Stream
+
+```bash
+sudo dnf install -y \
+  gcc-c++ pkgconf-pkg-config \
+  tesseract-devel leptonica-devel
+```
+
+### Arch / Manjaro
+
 ```bash
-# macOS
-brew install tesseract leptonica
+sudo pacman -S base-devel pkgconf tesseract leptonica
+```

-# Debian / Ubuntu
-apt-get install libtesseract-dev libleptonica-dev pkg-config
+### Alpine

-# Alpine
-apk add tesseract-ocr-dev leptonica-dev pkg-config
+```bash
+apk add build-base pkgconf tesseract-ocr-dev leptonica-dev
 ```

-* Elixir **≥ 1.20-rc** and OTP 27 or 28.
+### Windows
+
+Native Windows builds are not supported out of the box. Use **WSL2 with
+Ubuntu 24.04** and follow the Debian/Ubuntu instructions above; this is
+the path of least resistance and is what we test against.
+
+Building natively requires MSYS2 / MinGW-w64 with `g++`, `pkg-config`,
+`mingw-w64-x86_64-tesseract-ocr`, and `mingw-w64-x86_64-leptonica`
+available on `PATH`. Untested upstream; patches welcome.
+
+* Elixir **≥ 1.17** and OTP **≥ 26**.
+
+* A working C++17 compiler (`g++` or `clang++`) and `pkg-config` on the
+  build host. The NIF is built with `elixir_make` on first compile.

 ## Installation

@@ -46,8 +89,8 @@ so the package is usable out of the box.
 ## Quick start

 ```elixir
-{:ok, ocr} = ImageOcr.new() # defaults to language: "en"
-{:ok, text} = ImageOcr.read_text(ocr, "page.png")
+{:ok, ocr} = Image.OCR.new() # defaults to language: "en"
+{:ok, text} = Image.OCR.read_text(ocr, "page.png")
 ```

 `read_text/3` accepts:
@@ -58,29 +101,66 @@ so the package is usable out of the box.
 via `Vix.Vips.Image.new_from_buffer/1`

 For per-word output with confidence and bounding boxes, use
-`ImageOcr.recognize/3`:
+`Image.OCR.recognize/3`:

 ```elixir
-{:ok, words} = ImageOcr.recognize(ocr, image)
+{:ok, words} = Image.OCR.recognize(ocr, image)
 # => [%{text: "Hello", confidence: 96.4, bbox: {32, 18, 198, 64}}, …]
 ```

+## Locales
+
+The `:locale` option (and the mix-task language arguments) accept:
+
+* **ISO 639-1** two-letter codes: `"en"`, `:en`, `"fr"`, `:de`, `"ja"`.
+
+* **BCP-47** tags for region- or script-specific variants: `"zh-Hans"`
+  (Simplified Chinese), `"zh-Hant"` (Traditional), `"sr-Latn"` (Serbian
+  in Latin script), `"az-Cyrl"`. The built-in table covers the common
+  cases.
+
+* **Any BCP-47 locale** (`"en-US"`, `"fr-CA"`, `"zh-Hans-CN"`,
+  `"sr-Latn-RS"`) when the optional [`:localize`](https://hex.pm/packages/localize)
+  dependency is installed. With Localize, the locale is parsed and the
+  language + script subtags are used to pick the right Tesseract trained
+  data; territory subtags are ignored (Tesseract doesn't differentiate by
+  territory).
+
+* **Tesseract codes** verbatim: `"frk"` (German Fraktur), `"osd"`
+  (orientation/script detection), `"script/Latin"`.
+
+* **`+`-joined combinations**: `"en+fr"`, `"chi_sim+eng"`, `"ja+en"`.
+
+`"zh"` on its own is rejected as ambiguous; use `"zh-Hans"` or
+`"zh-Hant"`. See `Image.OCR.Languages` for the full mapping table.
+
+To enable BCP-47 parsing add Localize to your project:
+
+```elixir
+def deps do
+  [
+    {:image_ocr, "~> 0.1.0"},
+    {:localize, "~> 0.25"}
+  ]
+end
+```
+
 ## Concurrency

-A single `ImageOcr` instance wraps one `tesseract::TessBaseAPI`, which is **not
+A single `Image.OCR` instance wraps one `tesseract::TessBaseAPI`, which is **not
 safe for concurrent use**. The NIF guards each instance with a mutex so
 accidental sharing degrades to serialisation rather than UB, but for real
 parallelism you want one instance per worker. The simplest way is the
 included pool:

 ```elixir
 children = [
-  {ImageOcr.Pool, name: MyOcr, language: "eng", pool_size: 4}
+  {Image.OCR.Pool, name: MyOcr, locale: "en", pool_size: 4}
 ]

 Supervisor.start_link(children, strategy: :one_for_one)

-{:ok, text} = ImageOcr.Pool.read_text(MyOcr, "page.png")
+{:ok, text} = Image.OCR.Pool.read_text(MyOcr, "page.png")
 ```

 `pool_size` defaults to `System.schedulers_online()`. Each worker holds the
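
Note on the Locales section added to README.md in this commit: the rules it describes (ISO 639-1 codes, script-specific BCP-47 tags, verbatim Tesseract codes, `+`-joined combinations, and the rejection of a bare `"zh"`) can be illustrated with a minimal sketch. This is not the code in `Image.OCR.Languages`; the module name `LocaleSketch`, the tiny lookup table, the error tuple shape, and the pass-through of unknown identifiers are all assumptions made for illustration.

```elixir
defmodule LocaleSketch do
  # Hypothetical, deliberately tiny table. The README points to
  # Image.OCR.Languages for the real mapping.
  @table %{
    "en" => "eng",
    "fr" => "fra",
    "de" => "deu",
    "ja" => "jpn",
    "zh-Hans" => "chi_sim",
    "zh-Hant" => "chi_tra"
  }

  # Resolves a locale (string or atom, optionally "+"-joined) into a
  # Tesseract language string, or returns {:error, reason}.
  def resolve(locale) when is_atom(locale), do: resolve(Atom.to_string(locale))

  def resolve(locale) when is_binary(locale) do
    locale
    |> String.split("+")
    |> Enum.reduce_while({:ok, []}, fn part, {:ok, acc} ->
      case resolve_one(part) do
        {:ok, code} -> {:cont, {:ok, [code | acc]}}
        {:error, _reason} = error -> {:halt, error}
      end
    end)
    |> case do
      {:ok, codes} -> {:ok, codes |> Enum.reverse() |> Enum.join("+")}
      error -> error
    end
  end

  # A bare "zh" does not say which script is wanted, so it is rejected.
  defp resolve_one("zh"), do: {:error, {:ambiguous_locale, "zh"}}

  # Assumption: identifiers missing from the table pass through verbatim,
  # mirroring how Tesseract codes such as "frk" or "osd" are accepted as-is.
  defp resolve_one(part), do: {:ok, Map.get(@table, part, part)}
end
```

Under these assumptions, `LocaleSketch.resolve("en+fr")` yields `{:ok, "eng+fra"}`, and `LocaleSketch.resolve("zh")` yields an error tuple instead of guessing between Simplified and Traditional Chinese.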
@@ -95,7 +175,7 @@ schedulers regardless of pool size.

 The trained-data directory is resolved in this order:

-1. The `:datapath` option passed to `ImageOcr.new/1`.
+1. The `:datapath` option passed to `Image.OCR.new/1`.
 2. `Application.get_env(:image_ocr, :tessdata_path)`.
 3. The `TESSDATA_PREFIX` environment variable.
 4. The vendored fallback at `priv/tessdata/`.
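
Note on the four-step lookup order listed above: it amounts to "first usable candidate wins". The sketch below shows that idea as a pure function. It is not the code in `Image.OCR.Tessdata`; the module name `TessdataPathSketch`, the argument layout, and the choice to treat an empty string as unset are assumptions made for illustration.

```elixir
defmodule TessdataPathSketch do
  # Returns the first candidate that is a non-empty string, checked in
  # priority order: explicit option, application config, environment
  # variable, vendored fallback.
  def resolve(opts, app_config, env_value, vendored_dir) do
    candidates = [
      Keyword.get(opts, :datapath),
      app_config,
      env_value,
      vendored_dir
    ]

    # Assumption: nil and "" both count as "not set".
    Enum.find(candidates, fn path -> is_binary(path) and path != "" end)
  end
end

# Usage: the config key and environment variable names come from the README;
# the result depends on the local environment.
TessdataPathSketch.resolve(
  [],
  Application.get_env(:image_ocr, :tessdata_path),
  System.get_env("TESSDATA_PREFIX"),
  "priv/tessdata"
)
```

Passing the sources in as arguments keeps the function free of side effects, which makes the precedence easy to check in isolation.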
@@ -112,26 +192,29 @@ config :image_ocr, tessdata_path: "/var/lib/image_ocr/tessdata"
 Manage trained-data files without leaving your project:

 ```bash
-# Install one or more languages
-mix image_ocr.tessdata.add fra deu
+# Install one or more languages (ISO 639-1 codes)
+mix image.ocr.tessdata.add fr de
+
+# BCP-47 for region/script-specific variants
+mix image.ocr.tessdata.add zh-Hans zh-Hant sr-Latn

-# Use a specific variant ("fast" / "best" / "legacy") or branch
-mix image_ocr.tessdata.add chi_sim --variant best
+# Pick a variant: fast (default, ~2-4 MB), best (~10-15 MB), legacy (largest)
+mix image.ocr.tessdata.add en --variant best

-# Write to a specific directory (overrides config)
-mix image_ocr.tessdata.add jpn --path /var/lib/tessdata
+# Write to a specific directory (overrides config and TESSDATA_PREFIX)
+mix image.ocr.tessdata.add ja --path /var/lib/tessdata

 # Refresh every installed language to its latest upstream commit
-mix image_ocr.tessdata.update
+mix image.ocr.tessdata.update

 # Show what's installed
-mix image_ocr.tessdata.list
+mix image.ocr.tessdata.list

 # Remove a language
-mix image_ocr.tessdata.remove deu
+mix image.ocr.tessdata.remove de
 ```

-The tasks read from and write to the same path that `ImageOcr.new/1` does, so
+The tasks read from and write to the same path that `Image.OCR.new/1` does, so
 there is one source of truth.

 ## Tesseract 4.x vs 5.x
@@ -143,6 +226,12 @@ SIMD use and float32 models. The C++ API surface we use is identical between
 4.x and 5.x, so 4.1+ would likely work, but we keep the support matrix
 tight.

+## Livebook
+
+An interactive demonstration is at [`notebooks/demo.livemd`](notebooks/demo.livemd).
+It covers one-shot OCR, reusable instances, per-word bounding boxes, the
+NimblePool, PSM/SetVariable tweaks, and uploading your own image.
+
 ## License

 Apache-2.0.

c_src/image_ocr_nif.cc

Lines changed: 4 additions & 4 deletions
@@ -1,11 +1,11 @@
-// image_ocr NIF: thin wrapper around tesseract::TessBaseAPI.
+// Image.OCR NIF: thin wrapper around tesseract::TessBaseAPI.
 //
 // Concurrency model:
-// * Each %ImageOcr{} owns one TessBaseAPI*. TessBaseAPI is NOT safe for
+// * Each %Image.OCR{} owns one TessBaseAPI*. TessBaseAPI is NOT safe for
 // concurrent use on the same instance. We hold a per-resource ErlNifMutex
 // so accidental sharing degrades to serialization rather than UB.
 // * For real parallelism, callers create one instance per worker (see
-// ImageOcr.Pool). All recognition entry points are dirty-CPU NIFs.
+// Image.OCR.Pool). All recognition entry points are dirty-CPU NIFs.

 #include <cstring>
 #include <string>
@@ -287,4 +287,4 @@ ErlNifFunc nif_funcs[] = {

 } // namespace

-ERL_NIF_INIT(Elixir.ImageOcr.Nif, nif_funcs, load, nullptr, nullptr, nullptr)
+ERL_NIF_INIT(Elixir.Image.OCR.Nif, nif_funcs, load, nullptr, nullptr, nullptr)
