README.md: 5 additions & 2 deletions
@@ -29,7 +29,7 @@ Therefore, this project was created because, while [`docling`](https://github.co
- Async OCR requests and batch PDF processing using the Z.AI API.
- Concurrent figure downloads for each PDF.
- - Fast processing: approximately 25 seconds per batch of 32 PDFs. Speed depends on the z.ai API availability. See the cost section for more details on spending.
+ - Fast processing with separate controls for total pipeline concurrency and OCR API concurrency.
> [!note]
> This tool was designed to be used with academic papers written in English. Parsing other PDFs, heavy in tables or figures, or in other languages rather than English has not been tested.
`--workers` controls how many PDFs are processed concurrently in batch mode. `--ocr-workers` controls concurrent OCR API calls. Effective OCR concurrency is `min(--workers, --ocr-workers)`.
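The interaction between the two limits can be illustrated with a minimal sketch. This is not the project's implementation: it uses plain `std` threads and a hand-rolled semaphore, and every name in it (`effective_ocr_concurrency`, `Semaphore`, `simulate_peak_ocr_calls`) is hypothetical. It only shows why the number of simultaneous OCR calls can never exceed `min(--workers, --ocr-workers)`: each worker handles one PDF at a time, and each OCR call must hold one of a fixed number of permits.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Effective OCR concurrency as stated above: min(--workers, --ocr-workers).
pub fn effective_ocr_concurrency(workers: usize, ocr_workers: usize) -> usize {
    workers.min(ocr_workers)
}

/// Minimal counting semaphore built on std only (hypothetical helper).
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Semaphore { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Blocks until a permit is free, then takes it.
    fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Returns a permit and wakes one waiting thread.
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Simulates a batch: `workers` threads pull PDFs from a shared counter, and
/// each takes an OCR permit before its simulated API call.
/// Returns the peak number of simultaneous OCR calls observed.
/// Assumes `ocr_workers >= 1`; with zero permits the workers would block forever.
pub fn simulate_peak_ocr_calls(pdfs: usize, workers: usize, ocr_workers: usize) -> usize {
    let ocr_gate = Arc::new(Semaphore::new(ocr_workers));
    let next_pdf = Arc::new(AtomicUsize::new(0));
    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let ocr_gate = Arc::clone(&ocr_gate);
            let next_pdf = Arc::clone(&next_pdf);
            let in_flight = Arc::clone(&in_flight);
            let peak = Arc::clone(&peak);
            thread::spawn(move || {
                // Each worker claims the next unprocessed PDF index.
                while next_pdf.fetch_add(1, Ordering::SeqCst) < pdfs {
                    ocr_gate.acquire();
                    let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                    peak.fetch_max(now, Ordering::SeqCst);
                    // Stand-in for the OCR API request.
                    thread::sleep(Duration::from_millis(5));
                    in_flight.fetch_sub(1, Ordering::SeqCst);
                    ocr_gate.release();
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}
```

With the documented defaults (`--workers 32`, `--ocr-workers 2`), `effective_ocr_concurrency(32, 2)` returns `2`. With `--workers 1`, the single worker is the bottleneck and raising `--ocr-workers` has no effect.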
## Installation
Install from crates.io:
@@ -87,6 +89,7 @@ Options:
--timeout <TIMEOUT> HTTP timeout in seconds for OCR requests and figure downloads. [default: 180]
--max-download-bytes <MAX_DOWNLOAD_BYTES> Maximum allowed size (bytes) for each downloaded figure file. [default: 20971520]
--workers <WORKERS> Maximum number of PDFs processed concurrently in batch mode. [default: 32]
+ --ocr-workers <OCR_WORKERS> Maximum number of concurrent OCR API calls in batch mode; effective OCR concurrency is min(--workers, --ocr-workers). [default: 2]
-v, --verbose Enable verbose progress messages on stderr.
--overwrite Replace existing managed output artifacts (index.md, figures/, and tables/ when enabled).
--normalize-tables Normalize OCR HTML tables into Markdown and store raw HTML under tables/.