Commit 223fca2

Merge pull request #8 from cpetersen/tesseract-rs
tesseract-rs
2 parents f3b9345 + 2865b36 commit 223fca2

30 files changed

Lines changed: 263 additions & 502 deletions

.github/workflows/ci.yml

Lines changed: 10 additions & 1 deletion
@@ -18,9 +18,18 @@ jobs:
           ruby-version: ruby
           bundler-cache: true
 
+      - name: Download tessdata for CI
+        run: |
+          mkdir -p tessdata
+          curl -L https://github.com/tesseract-ocr/tessdata_fast/raw/main/eng.traineddata -o tessdata/eng.traineddata
+          export TESSDATA_PREFIX=$PWD/tessdata
+
       - name: Compile native extension
-        run: bundle exec rake compile
+        run: |
+          export TESSDATA_PREFIX=$PWD/tessdata
+          bundle exec rake compile
 
       - name: Run specs
         run: |
+          export TESSDATA_PREFIX=$PWD/tessdata
           bundle exec rake spec
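
Note on the workflow change above: each `run` step in GitHub Actions starts its own shell, so the `export TESSDATA_PREFIX=...` in the download step does not carry over to later steps, which is presumably why the compile and spec steps repeat it. A minimal sketch of an alternative, assuming the documented `GITHUB_ENV` file mechanism: a `NAME=value` line appended to that file is made available to all subsequent steps of the job. The function name `persist_tessdata_prefix` and the temp-file fallback are illustrative and not part of this commit.

```shell
# Sketch: persist TESSDATA_PREFIX for the remaining steps of a job.
# GITHUB_ENV is provided by the Actions runner; the mktemp fallback exists
# only so the function can be tried outside of CI.
persist_tessdata_prefix() {
  dir="$1"
  env_file="${GITHUB_ENV:-$(mktemp)}"
  mkdir -p "$dir"
  # Each NAME=value line in this file becomes an environment variable
  # in subsequent steps of the same job.
  printf 'TESSDATA_PREFIX=%s\n' "$dir" >> "$env_file"
  # Print the file that was written, to make local experiments inspectable.
  printf '%s\n' "$env_file"
}
```

Calling `persist_tessdata_prefix "$PWD/tessdata"` once in the download step would make the per-step `export` lines unnecessary, at the cost of relying on runner-specific behavior.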

README.md

Lines changed: 24 additions & 8 deletions
@@ -37,13 +37,8 @@ gem install parsekit
 - Ruby >= 3.0.0
 - Rust toolchain (stable)
 - C compiler (for linking)
-- System libraries for document parsing:
-  - **macOS**: `brew install leptonica tesseract poppler`
-  - **Ubuntu/Debian**: `sudo apt-get install libleptonica-dev libtesseract-dev libpoppler-cpp-dev`
-  - **Fedora/RHEL**: `sudo dnf install leptonica-devel tesseract-devel poppler-cpp-devel`
-  - **Windows**: See [DEPENDENCIES.md](DEPENDENCIES.md) for MSYS2 instructions
 
-For detailed installation instructions and troubleshooting, see [DEPENDENCIES.md](DEPENDENCIES.md).
+That's it! ParseKit bundles all necessary libraries including Tesseract for OCR, so you don't need to install any system dependencies.
 
 ## Usage

@@ -127,7 +122,7 @@ excel_text = parser.parse_xlsx(excel_data)
 | Word | .docx | `parse_docx` | Office Open XML format |
 | Excel | .xlsx, .xls | `parse_xlsx` | Both modern and legacy formats |
 | PowerPoint | .pptx | - | **Not yet supported** - see [implementation plan](docs/PPTX_PLAN.md) |
-| Images | .png, .jpg, .jpeg, .tiff, .bmp | `ocr_image` | OCR via embedded Tesseract |
+| Images | .png, .jpg, .jpeg, .tiff, .bmp | `ocr_image` | OCR via bundled Tesseract |
 | JSON | .json | `parse_json` | Pretty-printed output |
 | XML/HTML | .xml, .html | `parse_xml` | Extracts text content |
 | Text | .txt, .csv, .md | `parse_text` | With encoding detection |
@@ -166,14 +161,35 @@ To run tests with coverage:
 rake dev:coverage
 ```
 
+### OCR Mode Configuration
+
+By default, ParseKit bundles Tesseract for zero-dependency OCR support. Advanced users who already have Tesseract installed system-wide and want faster gem installation can use system mode:
+
+**Using system Tesseract during installation:**
+```bash
+gem install parsekit -- --no-default-features
+```
+
+**For development with system Tesseract:**
+```bash
+rake compile CARGO_FEATURES=""  # Disables bundled-tesseract feature
+```
+
+**System Tesseract requirements:**
+- **macOS**: `brew install tesseract`
+- **Ubuntu/Debian**: `sudo apt-get install libtesseract-dev`
+- **Fedora/RHEL**: `sudo dnf install tesseract-devel`
+
+The bundled mode adds ~1-3 minutes to initial gem installation but provides a completely self-contained experience with no external dependencies.
+
 ## Architecture
 
 ParseKit uses a hybrid Ruby/Rust architecture:
 
 - **Ruby Layer**: Provides convenient API and format detection
 - **Rust Layer**: Implements high-performance parsing using:
   - MuPDF for PDF text extraction (statically linked)
-  - rusty-tesseract for OCR (with embedded Tesseract)
+  - tesseract-rs for OCR (with bundled Tesseract by default)
   - Pure Rust libraries for DOCX/XLSX parsing
   - Magnus for Ruby-Rust FFI bindings

docs/DEPENDENCIES.md

Lines changed: 57 additions & 85 deletions
@@ -1,94 +1,57 @@
 # System Dependencies
 
-The parsekit gem wraps the [parser-core](https://crates.io/crates/parser-core) Rust crate, which requires several system libraries for document parsing and OCR functionality.
+The parsekit gem bundles all necessary libraries, making installation simple with no system dependencies required.
 
-## Required Libraries
+## Zero Dependencies by Default
 
-### macOS
+As of version 0.2.0, ParseKit bundles:
+- **Tesseract OCR**: Statically linked, no system installation needed
+- **MuPDF**: Statically linked for PDF parsing
 
-Install using Homebrew:
+## Installation
 
-```bash
-brew install leptonica tesseract poppler
-```
+Simply install the gem:
 
-If you encounter pkg-config issues:
 ```bash
-brew install pkg-config
-export PKG_CONFIG_PATH="/opt/homebrew/lib/pkgconfig:$PKG_CONFIG_PATH"
+gem install parsekit
 ```
 
-### Ubuntu/Debian
+No additional system libraries are required!
 
-```bash
-sudo apt-get update
-sudo apt-get install -y \
-  libleptonica-dev \
-  libtesseract-dev \
-  libpoppler-cpp-dev \
-  tesseract-ocr \
-  pkg-config
-```
+## For Advanced Users: System Mode
 
-### Fedora/RHEL/CentOS
+If you already have Tesseract installed and want to use your system installation instead of the bundled version (for faster gem compilation during development), you can opt out of bundling:
 
-```bash
-sudo dnf install -y \
-  leptonica-devel \
-  tesseract-devel \
-  poppler-cpp-devel \
-  tesseract \
-  pkg-config
-```
+### Using System Tesseract
 
-### Alpine Linux
+Install system dependencies first:
 
+#### macOS
 ```bash
-apk add \
-  leptonica-dev \
-  tesseract-ocr-dev \
-  poppler-dev \
-  pkgconfig
+brew install tesseract
 ```
 
-### Windows
-
-On Windows, you'll need to:
-
-1. Install [MSYS2](https://www.msys2.org/)
-2. In MSYS2 terminal:
+#### Ubuntu/Debian
 ```bash
-pacman -S mingw-w64-x86_64-leptonica
-pacman -S mingw-w64-x86_64-tesseract-ocr
-pacman -S mingw-w64-x86_64-poppler
+sudo apt-get update
+sudo apt-get install -y libtesseract-dev tesseract-ocr
 ```
 
-## Troubleshooting
-
-### pkg-config not found
-
-If you get errors about pkg-config:
-
-1. **macOS**: `brew install pkg-config`
-2. **Linux**: Install pkg-config for your distribution
-3. Set `PKG_CONFIG_PATH` to include the directory with `.pc` files
-
-### Library not found
+#### Fedora/RHEL/CentOS
+```bash
+sudo dnf install -y tesseract-devel tesseract
+```
 
-If libraries are installed but not found:
+Then install the gem without bundled features:
 
 ```bash
-# Find where .pc files are located
-find /usr -name "lept.pc" 2>/dev/null
-find /opt -name "lept.pc" 2>/dev/null
-
-# Add to PKG_CONFIG_PATH
-export PKG_CONFIG_PATH="/path/to/pc/files:$PKG_CONFIG_PATH"
+gem install parsekit -- --no-default-features
 ```
 
-### Building without certain features
-
-Currently, all dependencies are required. Future versions may make OCR optional.
+For development:
+```bash
+rake compile CARGO_FEATURES=""  # Disables bundled-tesseract
+```
 
 ## Docker

@@ -97,20 +60,11 @@ For containerized environments, here's a sample Dockerfile:
 ```dockerfile
 FROM ruby:3.2
 
-# Install system dependencies
-RUN apt-get update && apt-get install -y \
-    libleptonica-dev \
-    libtesseract-dev \
-    libpoppler-cpp-dev \
-    tesseract-ocr \
-    pkg-config \
-    && rm -rf /var/lib/apt/lists/*
-
-# Install Rust
+# Install Rust (required for compilation)
 RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
 ENV PATH="/root/.cargo/bin:${PATH}"
 
-# Your application setup
+# No system dependencies needed with bundled mode!
 WORKDIR /app
 COPY Gemfile* ./
 RUN bundle install
@@ -119,15 +73,33 @@ COPY . .
 
 ## CI/CD
 
-For GitHub Actions, add this step before building:
+For GitHub Actions, no additional dependencies are needed:
 
 ```yaml
-- name: Install system dependencies
+- name: Setup Ruby
+  uses: ruby/setup-ruby@v1
+  with:
+    ruby-version: ruby
+    bundler-cache: true
+
+- name: Compile and test
   run: |
-    if [ "$RUNNER_OS" == "Linux" ]; then
-      sudo apt-get update
-      sudo apt-get install -y libleptonica-dev libtesseract-dev libpoppler-cpp-dev
-    elif [ "$RUNNER_OS" == "macOS" ]; then
-      brew install leptonica tesseract poppler
-    fi
-```
+    bundle exec rake compile
+    bundle exec rake spec
+```
+
+## Troubleshooting
+
+### Compilation takes too long
+
+The bundled mode compiles Tesseract from source, which can take 1-3 minutes on initial installation. This is a one-time cost. If you need faster rebuilds during development, consider using system mode.
+
+### Out of memory during compilation
+
+Bundling libraries requires more memory during compilation. If you encounter OOM errors:
+1. Increase available memory
+2. Or use system mode instead
+
+### Want to use a specific Tesseract version
+
+Use system mode and install your preferred Tesseract version through your package manager.
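
Note on the system-mode instructions above: before opting out of the bundled build, it can help to confirm that a system Tesseract is actually discoverable. The sketch below is a rough pre-flight check under one stated assumption: that the development packages listed in this diff register a pkg-config entry named `tesseract`. Whether the tesseract-rs build locates the library through pkg-config is not shown in this commit, so treat the check as a hint only. Both function names are hypothetical.

```shell
# Sketch: suggest an install command depending on whether a system
# Tesseract looks available.
# Assumption: the system packages expose a pkg-config entry "tesseract".
has_system_tesseract() {
  # Without pkg-config there is nothing to query, so report "not found".
  command -v pkg-config >/dev/null 2>&1 || return 1
  pkg-config --exists tesseract
}

suggest_install_command() {
  if has_system_tesseract; then
    # System mode, as documented in the diff above.
    echo "gem install parsekit -- --no-default-features"
  else
    # Default bundled mode.
    echo "gem install parsekit"
  fi
}
```

Running `suggest_install_command` prints one of the two install commands from the documentation; it does not install anything itself.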

ext/parsekit/Cargo.toml

Lines changed: 4 additions & 3 deletions
@@ -15,8 +15,8 @@ magnus = { version = "0.7", features = ["rb-sys"] }
 # Document parsing - testing embedded C libraries
 # MuPDF builds from source and statically links
 mupdf = { version = "0.5", default-features = false, features = [] }
-# OCR - Tesseract with image loading support
-rusty-tesseract = "1.1"  # Tesseract wrapper with image loading
+# OCR - Using tesseract-rs for both system and bundled modes
+tesseract-rs = "0.1"  # Tesseract with optional bundling
 image = "0.25"  # Image processing library (match rusty-tesseract's version)
 calamine = "0.26"  # Excel parsing
 docx-rs = "0.4"  # Word document parsing
@@ -26,7 +26,8 @@ regex = "1.10" # Text parsing
 encoding_rs = "0.8"  # Encoding detection
 
 [features]
-default = []
+default = ["bundled-tesseract"]
+bundled-tesseract = []
 
 [profile.release]
 opt-level = 3
