| XML/HTML | .xml, .html | `parse_xml` | Extracts text content |
| Text | .txt, .csv, .md | `parse_text` | With encoding detection |
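
To illustrate how the rows above relate file extensions to parser methods, here is a minimal sketch of an extension lookup. It is not ParseKit's actual dispatch code: the constant `PARSER_BY_EXTENSION` and the helper `parser_method_for` are hypothetical names, and only the method names `parse_xml` and `parse_text` come from the table.

```ruby
# Hypothetical lookup built from the two table rows shown above.
# The real gem may dispatch differently; this only mirrors the table.
PARSER_BY_EXTENSION = {
  ".xml"  => :parse_xml,
  ".html" => :parse_xml,
  ".txt"  => :parse_text,
  ".csv"  => :parse_text,
  ".md"   => :parse_text
}.freeze

# Return the parser method name for a path, or nil if the extension
# is not covered by the rows above.
def parser_method_for(path)
  PARSER_BY_EXTENSION[File.extname(path).downcase]
end

puts parser_method_for("report.html").inspect
puts parser_method_for("notes.MD").inspect
```

The lookup is case-insensitive on the extension, and an unlisted extension yields `nil` rather than raising, so a caller can decide how to handle unsupported files.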
@@ -166,14 +161,35 @@ To run tests with coverage:
```
rake dev:coverage
```
### OCR Mode Configuration
By default, ParseKit bundles Tesseract for zero-dependency OCR support. Advanced users who already have Tesseract installed system-wide and want faster gem installation can use system mode:
The parsekit gem wraps the [parser-core](https://crates.io/crates/parser-core) Rust crate, which requires several system libraries for document parsing and OCR functionality.
The parsekit gem bundles all necessary libraries, so installation requires no system dependencies.
-## Required Libraries
## Zero Dependencies by Default
-### macOS
As of version 0.2.0, ParseKit bundles:
- **Tesseract OCR**: Statically linked, no system installation needed
If you already have Tesseract installed and want to use your system installation instead of the bundled version (for faster gem compilation during development), you can opt out of bundling:
The bundled mode compiles Tesseract from source, which can take 1-3 minutes on initial installation. This is a one-time cost. If you need faster rebuilds during development, consider using system mode.
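
Before considering system mode, it may help to confirm that a Tesseract executable is actually on your `PATH`. The following is a minimal sketch using only the Ruby standard library; the helper name `find_executable` is an assumption for illustration and is not part of ParseKit. It does not account for Windows executable suffixes such as `.exe`.

```ruby
# Search each PATH entry for an executable file with the given name.
# Returns the full path of the first match, or nil if none is found.
def find_executable(name, path_env = ENV.fetch("PATH", ""))
  path_env.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

tesseract = find_executable("tesseract")
if tesseract
  puts "System Tesseract found at #{tesseract}"
else
  puts "No system Tesseract on PATH; the bundled default needs no extra setup"
end
```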
### Out of memory during compilation
Bundling libraries requires more memory during compilation. If you encounter OOM errors:
1. Increase available memory
2. Or use system mode instead
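
To judge whether memory is likely to be the problem, you can check how much is available before installing. The sketch below is Linux-only and reads the `MemAvailable` field of `/proc/meminfo`; the helper name `mem_available_kib` is hypothetical, and the source does not state how much memory compilation needs, so no threshold is assumed here.

```ruby
# Parse the MemAvailable line of Linux's /proc/meminfo.
# Returns the value in KiB, or nil if the field is absent.
def mem_available_kib(meminfo_text)
  match = meminfo_text.match(/^MemAvailable:\s+(\d+)\s+kB/)
  match && match[1].to_i
end

# Only attempt the read where /proc/meminfo exists (Linux).
if File.readable?("/proc/meminfo")
  kib = mem_available_kib(File.read("/proc/meminfo"))
  puts "Available memory: #{kib / 1024} MiB" if kib
end
```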
### Want to use a specific Tesseract version
Use system mode and install your preferred Tesseract version through your package manager.
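
After installing Tesseract through your package manager, you may want to confirm which version the system binary reports. This is a minimal sketch that runs `tesseract --version` and reads the first line of its output; the helper names `parse_tesseract_version` and `system_tesseract_version` are hypothetical, and the sketch assumes the banner begins with `tesseract` followed by a version number.

```ruby
require "open3"

# Extract the version string from the first line of `tesseract --version`.
# Returns nil if the text does not look like Tesseract's banner.
def parse_tesseract_version(output)
  first_line = output.to_s.lines.first.to_s
  match = first_line.match(/\Atesseract\s+v?(\d+(?:\.\d+)*)/)
  match && match[1]
end

# Run the system binary, if present, and return its reported version.
def system_tesseract_version
  # capture2e merges stdout and stderr, since older releases print to stderr.
  output, status = Open3.capture2e("tesseract", "--version")
  status.success? ? parse_tesseract_version(output) : nil
rescue Errno::ENOENT
  nil # no tesseract executable on PATH
end

version = system_tesseract_version
puts(version ? "System Tesseract version: #{version}" : "No system Tesseract detected")
```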