
Commit 61bf978

Initial commit
0 parents  commit 61bf978

30 files changed

Lines changed: 2497 additions & 0 deletions

.formatter.exs

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
# Used by "mix format"
[
  inputs: ["{mix,.formatter}.exs", "{config,lib,test}/**/*.{ex,exs}"]
]

.gitignore

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
# The directory Mix will write compiled artifacts to.
/_build/

# If you run "mix test --cover", coverage assets end up here.
/cover/

# The directory Mix downloads your dependencies sources to.
/deps/

# Where third-party dependencies like ExDoc output generated docs.
/doc/

# Temporary files, for example, from tests.
/tmp/

# If the VM crashes, it generates a dump, let's ignore it too.
erl_crash.dump

# Also ignore archive artifacts (built via "mix archive.build").
*.ez

# Ignore package tarball (built via "mix hex.build").
image_ocr-*.tar

# Built NIF (produced by elixir_make); should not be checked in.
/priv/*.so

CHANGELOG.md

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
# Changelog

## 0.1.0 (initial release)

* NIF binding to Tesseract 5.x via `tesseract::TessBaseAPI`.
* `ImageOcr.new/1`, `ImageOcr.read_text/3`, `ImageOcr.recognize/3`,
  `ImageOcr.quick_read/2`, `ImageOcr.tesseract_version/0`.
* `ImageOcr.Input` accepts `Vix.Vips.Image.t()`, file paths, and in-memory
  encoded image binaries.
* `ImageOcr.Pool`: `NimblePool`-backed pool of OCR instances for parallel
  recognition.
* `ImageOcr.Tessdata` for trained-data path resolution. Lookup order:
  explicit `:datapath` option → `:image_ocr, :tessdata_path` config →
  `TESSDATA_PREFIX` env var → vendored `priv/tessdata/`.
* Vendored English (`eng`) trained-data from `tessdata_fast`.
* Mix tasks: `image_ocr.tessdata.{add,update,list,remove}`.

LICENSE

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

Copyright 2026 Kip Cole

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Makefile

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
PRIV_DIR = $(MIX_APP_PATH)/priv
NIF_LIB = $(PRIV_DIR)/image_ocr_nif.so

ERTS_INCLUDE_DIR ?= $(shell erl -noshell -eval 'io:format("~ts/erts-~ts/include/", [code:root_dir(), erlang:system_info(version)]).' -s init stop)

PKG_CFLAGS = $(shell pkg-config --cflags tesseract lept)
PKG_LDLIBS = $(shell pkg-config --libs tesseract lept)

CXX ?= c++
CXXFLAGS ?= -O3 -fPIC -std=c++17 -Wall -Wextra
CPPFLAGS += -I"$(ERTS_INCLUDE_DIR)" $(PKG_CFLAGS)
LDFLAGS +=
LDLIBS += $(PKG_LDLIBS)

UNAME_S := $(shell uname -s)
ifeq ($(UNAME_S),Darwin)
  LDFLAGS += -dynamiclib -undefined dynamic_lookup -flat_namespace
else
  LDFLAGS += -shared
endif

SRC = c_src/image_ocr_nif.cc

all: check_version $(NIF_LIB)

check_version:
	@pkg-config --atleast-version=5.0.0 tesseract || \
	  (echo "ERROR: image_ocr requires tesseract >= 5.0.0 (found $$(pkg-config --modversion tesseract 2>/dev/null || echo none))"; exit 1)

$(NIF_LIB): $(SRC)
	@mkdir -p $(PRIV_DIR)
	$(CXX) $(CXXFLAGS) $(CPPFLAGS) $(LDFLAGS) $(SRC) -o $@ $(LDLIBS)

clean:
	rm -f $(NIF_LIB)

.PHONY: all clean check_version

README.md

Lines changed: 148 additions & 0 deletions
@@ -0,0 +1,148 @@
# ImageOcr

Idiomatic Elixir interface to the [Tesseract](https://github.com/tesseract-ocr/tesseract)
OCR engine. Implemented as a NIF over the Tesseract 5.x C++ API; accepts
`Vix.Vips.Image` structs, file paths, or in-memory image binaries and returns
recognised text.

## Requirements

* Tesseract **≥ 5.0** and Leptonica installed at build time, with both
  reachable via `pkg-config`.

```bash
# macOS
brew install tesseract leptonica

# Debian / Ubuntu
apt-get install libtesseract-dev libleptonica-dev pkg-config

# Alpine
apk add tesseract-ocr-dev leptonica-dev pkg-config
```

* Elixir **≥ 1.20-rc** and OTP 27 or 28.

## Installation

```elixir
def deps do
  [
    {:image_ocr, "~> 0.1.0"}
  ]
end
```

Build the NIF on first compile:

```bash
mix deps.get
mix compile
```

`image_ocr` ships the English (`eng`) `tessdata_fast` model in `priv/tessdata/`
so the package is usable out of the box.

## Quick start

```elixir
{:ok, ocr} = ImageOcr.new()                       # defaults to language: "eng"
{:ok, text} = ImageOcr.read_text(ocr, "page.png")
```

`read_text/3` accepts:

* a `Vix.Vips.Image.t()`: used directly
* a path to an image file: loaded via `Vix.Vips.Image.new_from_file/1`
* an in-memory binary of encoded image data (PNG, JPEG, TIFF, …): loaded
  via `Vix.Vips.Image.new_from_buffer/1`

For per-word output with confidence and bounding boxes, use
`ImageOcr.recognize/3`:

```elixir
{:ok, words} = ImageOcr.recognize(ocr, image)
# => [%{text: "Hello", confidence: 96.4, bbox: {32, 18, 198, 64}}, …]
```
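
The per-word maps are easy to post-process with plain `Enum` functions. The sketch below is only an illustration: `WordFilterSketch` is a hypothetical helper that is not part of `image_ocr`, the second sample word is made up, and it assumes `confidence` is a number where higher means more certain (as the `96.4` in the example suggests).

```elixir
defmodule WordFilterSketch do
  @moduledoc """
  Hypothetical helper, not part of image_ocr. It works on plain maps shaped
  like the `recognize/3` example output: `%{text: _, confidence: _, bbox: _}`.
  """

  @doc "Keeps the words whose confidence is at least `min_confidence`."
  def confident_words(words, min_confidence) when is_list(words) do
    Enum.filter(words, fn %{confidence: c} -> c >= min_confidence end)
  end

  @doc "Joins the text of the kept words with single spaces."
  def confident_text(words, min_confidence) do
    words
    |> confident_words(min_confidence)
    |> Enum.map_join(" ", & &1.text)
  end
end

# Sample data in the shape shown above; the second entry is invented.
words = [
  %{text: "Hello", confidence: 96.4, bbox: {32, 18, 198, 64}},
  %{text: "w0rld", confidence: 41.0, bbox: {210, 18, 380, 64}}
]

# Only "Hello" clears the 80.0 threshold.
IO.puts(WordFilterSketch.confident_text(words, 80.0))
```

The threshold is application-specific; the sketch leaves `bbox` untouched because the README does not state how the four numbers should be interpreted.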

## Concurrency

A single `ImageOcr` instance wraps one `tesseract::TessBaseAPI`, which is **not
safe for concurrent use**. The NIF guards each instance with a mutex so
accidental sharing degrades to serialisation rather than undefined behaviour
(UB), but for real parallelism you want one instance per worker. The simplest
way is the included pool:

```elixir
children = [
  {ImageOcr.Pool, name: MyOcr, language: "eng", pool_size: 4}
]

Supervisor.start_link(children, strategy: :one_for_one)

{:ok, text} = ImageOcr.Pool.read_text(MyOcr, "page.png")
```
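
With a pool running, many pages can be processed concurrently from the caller's side. The following is a minimal sketch, not part of `image_ocr`: `OcrFanoutSketch` and its `read_all/3` are hypothetical names, and the reader is passed in as a function so the example runs without the library. In a real project the reader might be `&ImageOcr.Pool.read_text(MyOcr, &1)`.

```elixir
defmodule OcrFanoutSketch do
  @moduledoc """
  Hypothetical helper, not part of image_ocr. `reader` is any one-argument
  function that returns `{:ok, text}` or `{:error, reason}` for an input.
  """

  @doc "Runs `reader` over `inputs` concurrently and returns `{input, result}` pairs in input order."
  def read_all(inputs, reader, max_concurrency)
      when is_function(reader, 1) and is_integer(max_concurrency) and max_concurrency > 0 do
    inputs
    |> Task.async_stream(
      fn input -> {input, reader.(input)} end,
      max_concurrency: max_concurrency,
      ordered: true,
      # OCR time varies a lot with image size, so no per-task timeout here.
      timeout: :infinity
    )
    # async_stream wraps each completed task's return value in {:ok, value}.
    |> Enum.map(fn {:ok, pair} -> pair end)
  end
end

# Stub reader standing in for something like &ImageOcr.Pool.read_text(MyOcr, &1).
stub_reader = fn path -> {:ok, "text of " <> path} end

results = OcrFanoutSketch.read_all(["a.png", "b.png"], stub_reader, 2)
IO.inspect(results)
```

Setting `max_concurrency` to the pool's `pool_size` is one reasonable choice: more concurrent callers than workers would presumably just wait for a worker to become free.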

`pool_size` defaults to `System.schedulers_online()`. Each worker holds the
loaded language model in memory, typically 2–50 MB depending on the language
and trained-data variant, so size the pool deliberately if you also load
multiple languages or run on small hosts.

Recognition runs on dirty CPU schedulers, so it does not block the normal
schedulers regardless of pool size.

## Trained-data (`tessdata`)

The trained-data directory is resolved in this order:

1. The `:datapath` option passed to `ImageOcr.new/1`.
2. `Application.get_env(:image_ocr, :tessdata_path)`.
3. The `TESSDATA_PREFIX` environment variable.
4. The vendored fallback at `priv/tessdata/`.

Configure a project-wide location once:

```elixir
# config/config.exs
config :image_ocr, tessdata_path: "/var/lib/image_ocr/tessdata"
```
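
The lookup order listed at the start of this section amounts to "first value that is set wins". The sketch below illustrates that precedence only; it is not the code of `ImageOcr.Tessdata`. `TessdataPathSketch`, its `resolve/2` function, and the `sources` map are hypothetical names, and the sketch assumes an unset source shows up as `nil`.

```elixir
defmodule TessdataPathSketch do
  @moduledoc """
  Hypothetical illustration of the documented lookup order. This is not the
  implementation used by ImageOcr.Tessdata.
  """

  @doc """
  Returns the first path that is set, checking in order: the `:datapath`
  option, the application config value, the environment value, and the
  vendored fallback. Returns `nil` if none is a binary.
  """
  def resolve(opts, sources) when is_list(opts) and is_map(sources) do
    candidates = [
      Keyword.get(opts, :datapath),
      sources[:app_config],
      sources[:env],
      sources[:vendored]
    ]

    Enum.find(candidates, &is_binary/1)
  end
end

# Gather the three ambient sources; the first two are nil when unset.
sources = %{
  app_config: Application.get_env(:image_ocr, :tessdata_path),
  env: System.get_env("TESSDATA_PREFIX"),
  vendored: "priv/tessdata"
}

# An explicit :datapath option takes precedence over everything else.
IO.puts(TessdataPathSketch.resolve([datapath: "/opt/tessdata"], sources))
```

Passing the sources in as a map keeps the function pure, which makes the precedence easy to check in isolation.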

### Mix tasks

Manage trained-data files without leaving your project:

```bash
# Install one or more languages
mix image_ocr.tessdata.add fra deu

# Use a specific variant ("fast" / "best" / "legacy") or branch
mix image_ocr.tessdata.add chi_sim --variant best

# Write to a specific directory (overrides config)
mix image_ocr.tessdata.add jpn --path /var/lib/tessdata

# Refresh every installed language to its latest upstream commit
mix image_ocr.tessdata.update

# Show what's installed
mix image_ocr.tessdata.list

# Remove a language
mix image_ocr.tessdata.remove deu
```

The tasks read from and write to the same path that `ImageOcr.new/1` does, so
there is one source of truth.

## Tesseract 4.x vs 5.x

`image_ocr` requires Tesseract 5.x (the Makefile checks for ≥ 5.0.0; the
current upstream series is 5.5) and refuses to build against older versions.
5.x is actively maintained, ships in current LTS distros, and runs noticeably
faster than 4.x on modern CPUs thanks to better SIMD use and float32 models.
The C++ API surface we use is identical between 4.x and 5.x, so 4.1+ would
likely work, but we keep the support matrix tight.

## License

Apache-2.0.
