# ImageOcr

Idiomatic Elixir interface to the [Tesseract](https://github.com/tesseract-ocr/tesseract)
OCR engine. Implemented as a NIF over the Tesseract 5.x C++ API; accepts
`Vix.Vips.Image` structs, file paths, or in-memory image binaries and returns
recognised text.

## Requirements

* Tesseract **≥ 5.0** and Leptonica installed at build time, with both
  reachable via `pkg-config`.

  ```bash
  # macOS
  brew install tesseract leptonica

  # Debian / Ubuntu
  apt-get install libtesseract-dev libleptonica-dev pkg-config

  # Alpine
  apk add tesseract-ocr-dev leptonica-dev pkg-config
  ```

* Elixir **≥ 1.20-rc** and OTP 27 or 28.

## Installation

Add `image_ocr` to the dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:image_ocr, "~> 0.1.0"}
  ]
end
```

The NIF is built on first compile:

```bash
mix deps.get
mix compile
```

`image_ocr` ships the English (`eng`) `tessdata_fast` model in `priv/tessdata/`,
so the package is usable out of the box.

## Quick start

```elixir
{:ok, ocr} = ImageOcr.new()   # defaults to language: "eng"
{:ok, text} = ImageOcr.read_text(ocr, "page.png")
```

`read_text/3` accepts:

* a `Vix.Vips.Image.t()`, used directly
* a path to an image file, loaded via `Vix.Vips.Image.new_from_file/1`
* an in-memory binary of encoded image data (PNG, JPEG, TIFF, …), loaded
  via `Vix.Vips.Image.new_from_buffer/1`

For per-word output with confidence and bounding boxes, use
`ImageOcr.recognize/3`:

```elixir
{:ok, words} = ImageOcr.recognize(ocr, image)
# => [%{text: "Hello", confidence: 96.4, bbox: {32, 18, 198, 64}}, …]
```

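A common follow-up is to drop low-confidence words before using the text. The sketch below assumes the word maps have the shape shown above and that `confidence` is on a 0–100 scale (suggested by the `96.4` example, but not confirmed here). `WordFilter` and the sample `words` list are hypothetical and not part of `image_ocr`.

```elixir
defmodule WordFilter do
  @moduledoc "Hypothetical helper for illustration; not part of image_ocr."

  # Keeps words whose confidence is at or above `min_confidence` and joins
  # their text with single spaces, preserving the original order.
  def confident_text(words, min_confidence \\ 80.0) do
    words
    |> Enum.filter(fn %{confidence: c} -> c >= min_confidence end)
    |> Enum.map_join(" ", fn %{text: t} -> t end)
  end
end

# Illustrative data in the shape returned by ImageOcr.recognize/3.
words = [
  %{text: "Hello", confidence: 96.4, bbox: {32, 18, 198, 64}},
  %{text: "w0rld", confidence: 41.2, bbox: {210, 18, 380, 64}}
]

WordFilter.confident_text(words)
# => "Hello"
```

The default threshold of `80.0` is arbitrary; a suitable cut-off depends on the scan quality and on how costly a misread word is for the application.
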
## Concurrency

A single `ImageOcr` instance wraps one `tesseract::TessBaseAPI`, which is **not
safe for concurrent use**. The NIF guards each instance with a mutex, so
accidental sharing degrades to serialisation rather than undefined behaviour,
but for real parallelism you want one instance per worker. The simplest way is
the included pool:

```elixir
children = [
  {ImageOcr.Pool, name: MyOcr, language: "eng", pool_size: 4}
]

Supervisor.start_link(children, strategy: :one_for_one)

{:ok, text} = ImageOcr.Pool.read_text(MyOcr, "page.png")
```

`pool_size` defaults to `System.schedulers_online()`. Each worker holds the
loaded language model in memory (typically 2–50 MB, depending on the language
and trained-data variant), so size the pool deliberately if you also load
multiple languages or run on small hosts.

Recognition runs on dirty CPU schedulers, so it does not block the normal
schedulers regardless of pool size.

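For batch jobs, one plausible pattern is to cap the number of in-flight requests at the pool size, so that callers do not queue behind busy workers. The following is a minimal sketch under that assumption. `BatchOcr` is a hypothetical helper, and the stub reader stands in for `&ImageOcr.Pool.read_text(MyOcr, &1)` so the example runs without Tesseract installed.

```elixir
defmodule BatchOcr do
  @moduledoc "Hypothetical helper for illustration; not part of image_ocr."

  # Calls `read_fun` for every path with at most `max_concurrency` calls in
  # flight. Returns `{path, result}` tuples in the same order as `paths`.
  def read_all(paths, read_fun, max_concurrency) do
    paths
    |> Task.async_stream(fn path -> {path, read_fun.(path)} end,
      max_concurrency: max_concurrency,
      timeout: :infinity
    )
    |> Enum.map(fn {:ok, pair} -> pair end)
  end
end

# Stand-in reader so the sketch is runnable on its own. In a real project this
# would be something like: &ImageOcr.Pool.read_text(MyOcr, &1)
stub_reader = fn path -> {:ok, "text of " <> path} end

results = BatchOcr.read_all(["a.png", "b.png"], stub_reader, 4)
# => [{"a.png", {:ok, "text of a.png"}}, {"b.png", {:ok, "text of b.png"}}]
```

`timeout: :infinity` is chosen here only because OCR on large pages can exceed the 5-second default of `Task.async_stream/3`; a finite timeout may be the better choice in production.
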
## Trained-data (`tessdata`)

The trained-data directory is resolved in this order:

1. The `:datapath` option passed to `ImageOcr.new/1`.
2. `Application.get_env(:image_ocr, :tessdata_path)`.
3. The `TESSDATA_PREFIX` environment variable.
4. The vendored fallback at `priv/tessdata/`.

Configure a project-wide location once:

```elixir
# config/config.exs
config :image_ocr, tessdata_path: "/var/lib/image_ocr/tessdata"
```

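The lookup order described in this section can be pictured as a chain of fallbacks. The sketch below is an illustration only, not the library's actual code: `TessdataPath` is a hypothetical module, and the relative `priv/tessdata` fallback is a simplification of however the vendored directory is really located.

```elixir
defmodule TessdataPath do
  @moduledoc "Hypothetical illustration of the lookup order; not image_ocr's code."

  # Returns the first location that is set, checking sources in priority order:
  # explicit option, application config, environment variable, vendored fallback.
  def resolve(opts \\ []) do
    Keyword.get(opts, :datapath) ||
      Application.get_env(:image_ocr, :tessdata_path) ||
      System.get_env("TESSDATA_PREFIX") ||
      vendored_fallback()
  end

  # Simplified stand-in. A real implementation would more likely build the path
  # with Application.app_dir(:image_ocr, "priv/tessdata").
  defp vendored_fallback, do: Path.join("priv", "tessdata")
end

TessdataPath.resolve(datapath: "/opt/tessdata")
# => "/opt/tessdata"
```

Because each step only runs when the previous one returns `nil`, an explicit `:datapath` always wins over application config and the environment.
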
### Mix tasks

Manage trained-data files without leaving your project:

```bash
# Install one or more languages
mix image_ocr.tessdata.add fra deu

# Use a specific variant ("fast" / "best" / "legacy") or branch
mix image_ocr.tessdata.add chi_sim --variant best

# Write to a specific directory (overrides config)
mix image_ocr.tessdata.add jpn --path /var/lib/tessdata

# Refresh every installed language to its latest upstream commit
mix image_ocr.tessdata.update

# Show what's installed
mix image_ocr.tessdata.list

# Remove a language
mix image_ocr.tessdata.remove deu
```

The tasks read from and write to the same path that `ImageOcr.new/1` does, so
there is one source of truth.

## Tesseract 4.x vs 5.x

`image_ocr` requires Tesseract 5.x (currently 5.5+) and refuses to build
against older versions. 5.x is actively maintained, ships in current LTS
distros, and runs noticeably faster than 4.x on modern CPUs thanks to better
SIMD use and float32 models. The C++ API surface we use is identical between
4.x and 5.x, so 4.1+ would likely work, but we keep the support matrix
tight.

## License

Apache-2.0.