11 changes: 10 additions & 1 deletion .github/workflows/ci.yml
@@ -18,9 +18,18 @@ jobs:
ruby-version: ruby
bundler-cache: true

- name: Download tessdata for CI
run: |
mkdir -p tessdata
curl -L https://github.com/tesseract-ocr/tessdata_fast/raw/main/eng.traineddata -o tessdata/eng.traineddata
export TESSDATA_PREFIX=$PWD/tessdata

- name: Compile native extension
run: bundle exec rake compile
run: |
export TESSDATA_PREFIX=$PWD/tessdata
bundle exec rake compile

- name: Run specs
run: |
export TESSDATA_PREFIX=$PWD/tessdata
bundle exec rake spec
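
Note on the workflow change above: each `run` step in GitHub Actions starts its own shell, so the `export` at the end of the download step does not carry over to later steps, which is presumably why the compile and spec steps repeat it. One possible alternative, sketched here and not part of this change, is to append the variable to the file named by `GITHUB_ENV`, which the runner reads to populate the environment of subsequent steps. The helper name `persist_tessdata_prefix` is hypothetical.

```shell
# Sketch: make TESSDATA_PREFIX visible to every later step in the job.
# GITHUB_ENV is set by the Actions runner to a file path; NAME=value lines
# appended to that file become environment variables in subsequent steps.
# persist_tessdata_prefix is a hypothetical helper name, not part of the PR.
persist_tessdata_prefix() {
  tessdata_dir="$1"
  if [ -z "${GITHUB_ENV:-}" ]; then
    echo "GITHUB_ENV is not set; skipping (not running under GitHub Actions?)" >&2
    return 1
  fi
  printf 'TESSDATA_PREFIX=%s\n' "$tessdata_dir" >> "$GITHUB_ENV"
}

# Only call the helper when the runner has provided GITHUB_ENV.
if [ -n "${GITHUB_ENV:-}" ]; then
  persist_tessdata_prefix "$PWD/tessdata"
fi
```

With this in the download step, the compile and spec steps would not need their own `export` lines.
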
32 changes: 24 additions & 8 deletions README.md
@@ -37,13 +37,8 @@ gem install parsekit
- Ruby >= 3.0.0
- Rust toolchain (stable)
- C compiler (for linking)
- System libraries for document parsing:
- **macOS**: `brew install leptonica tesseract poppler`
- **Ubuntu/Debian**: `sudo apt-get install libleptonica-dev libtesseract-dev libpoppler-cpp-dev`
- **Fedora/RHEL**: `sudo dnf install leptonica-devel tesseract-devel poppler-cpp-devel`
- **Windows**: See [DEPENDENCIES.md](DEPENDENCIES.md) for MSYS2 instructions

For detailed installation instructions and troubleshooting, see [DEPENDENCIES.md](DEPENDENCIES.md).
That's it! ParseKit bundles all necessary libraries including Tesseract for OCR, so you don't need to install any system dependencies.

## Usage

@@ -127,7 +122,7 @@ excel_text = parser.parse_xlsx(excel_data)
| Word | .docx | `parse_docx` | Office Open XML format |
| Excel | .xlsx, .xls | `parse_xlsx` | Both modern and legacy formats |
| PowerPoint | .pptx | - | **Not yet supported** - see [implementation plan](docs/PPTX_PLAN.md) |
| Images | .png, .jpg, .jpeg, .tiff, .bmp | `ocr_image` | OCR via embedded Tesseract |
| Images | .png, .jpg, .jpeg, .tiff, .bmp | `ocr_image` | OCR via bundled Tesseract |
| JSON | .json | `parse_json` | Pretty-printed output |
| XML/HTML | .xml, .html | `parse_xml` | Extracts text content |
| Text | .txt, .csv, .md | `parse_text` | With encoding detection |
@@ -166,14 +161,35 @@ To run tests with coverage:
rake dev:coverage
```

### OCR Mode Configuration

By default, ParseKit bundles Tesseract for zero-dependency OCR support. Advanced users who already have Tesseract installed system-wide and want faster gem installation can use system mode:

**Using system Tesseract during installation:**
```bash
gem install parsekit -- --no-default-features
```

**For development with system Tesseract:**
```bash
rake compile CARGO_FEATURES="" # Disables bundled-tesseract feature
```

**System Tesseract requirements:**
- **macOS**: `brew install tesseract`
- **Ubuntu/Debian**: `sudo apt-get install libtesseract-dev`
- **Fedora/RHEL**: `sudo dnf install tesseract-devel`

The bundled mode adds ~1-3 minutes to initial gem installation but provides a completely self-contained experience with no external dependencies.
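
A minimal sketch of how an install script might choose between the two modes. It assumes that the system-mode build finds Tesseract through `pkg-config`, which this README does not confirm, and `choose_install_cmd` is a hypothetical helper name. The two `gem install` commands are the ones given in this README.

```shell
# Sketch: print the install command that matches the local setup.
# Assumption: system Tesseract development files are discoverable via
# pkg-config (the `tesseract` module). choose_install_cmd is hypothetical.
choose_install_cmd() {
  if command -v pkg-config >/dev/null 2>&1 && pkg-config --exists tesseract; then
    # System Tesseract detected: skip the bundled build.
    echo "gem install parsekit -- --no-default-features"
  else
    # Fall back to the default, bundled mode.
    echo "gem install parsekit"
  fi
}

choose_install_cmd
```

The function only prints the command, so the choice can be reviewed before anything is installed.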

## Architecture

ParseKit uses a hybrid Ruby/Rust architecture:

- **Ruby Layer**: Provides convenient API and format detection
- **Rust Layer**: Implements high-performance parsing using:
- MuPDF for PDF text extraction (statically linked)
- rusty-tesseract for OCR (with embedded Tesseract)
- tesseract-rs for OCR (with bundled Tesseract by default)
- Pure Rust libraries for DOCX/XLSX parsing
- Magnus for Ruby-Rust FFI bindings

142 changes: 57 additions & 85 deletions docs/DEPENDENCIES.md
@@ -1,94 +1,57 @@
# System Dependencies

The parsekit gem wraps the [parser-core](https://crates.io/crates/parser-core) Rust crate, which requires several system libraries for document parsing and OCR functionality.
The parsekit gem bundles all necessary libraries, making installation simple with no system dependencies required.

## Required Libraries
## Zero Dependencies by Default

### macOS
As of version 0.2.0, ParseKit bundles:
- **Tesseract OCR**: Statically linked, no system installation needed
- **MuPDF**: Statically linked for PDF parsing

Install using Homebrew:
## Installation

```bash
brew install leptonica tesseract poppler
```
Simply install the gem:

If you encounter pkg-config issues:
```bash
brew install pkg-config
export PKG_CONFIG_PATH="/opt/homebrew/lib/pkgconfig:$PKG_CONFIG_PATH"
gem install parsekit
```

### Ubuntu/Debian
No additional system libraries are required!

```bash
sudo apt-get update
sudo apt-get install -y \
libleptonica-dev \
libtesseract-dev \
libpoppler-cpp-dev \
tesseract-ocr \
pkg-config
```
## For Advanced Users: System Mode

### Fedora/RHEL/CentOS
If you already have Tesseract installed and want to use your system installation instead of the bundled version (for faster gem compilation during development), you can opt out of bundling:

```bash
sudo dnf install -y \
leptonica-devel \
tesseract-devel \
poppler-cpp-devel \
tesseract \
pkg-config
```
### Using System Tesseract

### Alpine Linux
Install system dependencies first:

#### macOS
```bash
apk add \
leptonica-dev \
tesseract-ocr-dev \
poppler-dev \
pkgconfig
brew install tesseract
```

### Windows

On Windows, you'll need to:

1. Install [MSYS2](https://www.msys2.org/)
2. In MSYS2 terminal:
#### Ubuntu/Debian
```bash
pacman -S mingw-w64-x86_64-leptonica
pacman -S mingw-w64-x86_64-tesseract-ocr
pacman -S mingw-w64-x86_64-poppler
sudo apt-get update
sudo apt-get install -y libtesseract-dev tesseract-ocr
```

## Troubleshooting

### pkg-config not found

If you get errors about pkg-config:

1. **macOS**: `brew install pkg-config`
2. **Linux**: Install pkg-config for your distribution
3. Set `PKG_CONFIG_PATH` to include the directory with `.pc` files

### Library not found
#### Fedora/RHEL/CentOS
```bash
sudo dnf install -y tesseract-devel tesseract
```

If libraries are installed but not found:
Then install the gem without bundled features:

```bash
# Find where .pc files are located
find /usr -name "lept.pc" 2>/dev/null
find /opt -name "lept.pc" 2>/dev/null

# Add to PKG_CONFIG_PATH
export PKG_CONFIG_PATH="/path/to/pc/files:$PKG_CONFIG_PATH"
gem install parsekit -- --no-default-features
```

### Building without certain features

Currently, all dependencies are required. Future versions may make OCR optional.
For development:
```bash
rake compile CARGO_FEATURES="" # Disables bundled-tesseract
```

## Docker

@@ -97,20 +60,11 @@ For containerized environments, here's a sample Dockerfile:
```dockerfile
FROM ruby:3.2

# Install system dependencies
RUN apt-get update && apt-get install -y \
libleptonica-dev \
libtesseract-dev \
libpoppler-cpp-dev \
tesseract-ocr \
pkg-config \
&& rm -rf /var/lib/apt/lists/*

# Install Rust
# Install Rust (required for compilation)
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"

# Your application setup
# No system dependencies needed with bundled mode!
WORKDIR /app
COPY Gemfile* ./
RUN bundle install
@@ -119,15 +73,33 @@ COPY . .

## CI/CD

For GitHub Actions, add this step before building:
For GitHub Actions, no additional dependencies are needed:

```yaml
- name: Install system dependencies
- name: Setup Ruby
uses: ruby/setup-ruby@v1
with:
ruby-version: ruby
bundler-cache: true

- name: Compile and test
run: |
if [ "$RUNNER_OS" == "Linux" ]; then
sudo apt-get update
sudo apt-get install -y libleptonica-dev libtesseract-dev libpoppler-cpp-dev
elif [ "$RUNNER_OS" == "macOS" ]; then
brew install leptonica tesseract poppler
fi
```
bundle exec rake compile
bundle exec rake spec
```

## Troubleshooting

### Compilation takes too long

The bundled mode compiles Tesseract from source, which can take 1-3 minutes on initial installation. This is a one-time cost. If you need faster rebuilds during development, consider using system mode.

### Out of memory during compilation

Bundling libraries requires more memory during compilation. If you encounter OOM errors:
1. Increase available memory
2. Or use system mode instead

### Want to use a specific Tesseract version

Use system mode and install your preferred Tesseract version through your package manager.
7 changes: 4 additions & 3 deletions ext/parsekit/Cargo.toml
@@ -15,8 +15,8 @@ magnus = { version = "0.7", features = ["rb-sys"] }
# Document parsing - testing embedded C libraries
# MuPDF builds from source and statically links
mupdf = { version = "0.5", default-features = false, features = [] }
# OCR - Tesseract with image loading support
rusty-tesseract = "1.1" # Tesseract wrapper with image loading
# OCR - Using tesseract-rs for both system and bundled modes
tesseract-rs = "0.1" # Tesseract with optional bundling
image = "0.25" # Image processing library (match rusty-tesseract's version)
calamine = "0.26" # Excel parsing
docx-rs = "0.4" # Word document parsing
@@ -26,7 +26,8 @@ regex = "1.10" # Text parsing
encoding_rs = "0.8" # Encoding detection

[features]
default = []
default = ["bundled-tesseract"]
bundled-tesseract = []

[profile.release]
opt-level = 3