Commit e100965

docs: document workers and ocr-workers semantics
1 parent 7ba02b2 commit e100965

1 file changed

Lines changed: 5 additions & 2 deletions

README.md

@@ -29,7 +29,7 @@ Therefore, this project was created because, while [`docling`](https://github.co
 
 - Async OCR requests and batch PDF processing using the Z.AI API.
 - Concurrent figure downloads for each PDF.
-- Fast processing: approximately 25 seconds per batch of 32 PDFs. Speed depends on the z.ai API availability. See the cost section for more details on spending.
+- Fast processing with separate controls for total pipeline concurrency and OCR API concurrency.
 
 > [!note]
 > This tool was designed to be used with academic papers written in English. Parsing other PDFs, heavy in tables or figures, or in other languages rather than English has not been tested.
@@ -45,9 +45,11 @@ paperdown --input path/to/paper.pdf
 My preferred method is batch directory processing:
 
 ```bash
-paperdown --input pdf/ --output md/ --workers 4 --overwrite
+paperdown --input pdf/ --output md/ --workers 32 --ocr-workers 2 --overwrite
 ```
 
+`--workers` controls how many PDFs are processed concurrently in batch mode. `--ocr-workers` controls concurrent OCR API calls. Effective OCR concurrency is `min(--workers, --ocr-workers)`.
+
 ## Installation
 
 Install from crates.io:
@@ -87,6 +89,7 @@ Options:
 --timeout <TIMEOUT> HTTP timeout in seconds for OCR requests and figure downloads. [default: 180]
 --max-download-bytes <MAX_DOWNLOAD_BYTES> Maximum allowed size (bytes) for each downloaded figure file. [default: 20971520]
 --workers <WORKERS> Maximum number of PDFs processed concurrently in batch mode. [default: 32]
+--ocr-workers <OCR_WORKERS> Maximum number of concurrent OCR API calls in batch mode; effective OCR concurrency is min(--workers, --ocr-workers). [default: 2]
 -v, --verbose Enable verbose progress messages on stderr.
 --overwrite Replace existing managed output artifacts (index.md, figures/, and tables/ when enabled).
 --normalize-tables Normalize OCR HTML tables into Markdown and store raw HTML under tables/.
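
The `min(--workers, --ocr-workers)` rule added in this commit can be illustrated with a small simulation. The sketch below is not taken from paperdown's source; it assumes the two limits behave like nested semaphores (an outer one per PDF, an inner one per OCR call), and the names `effective_ocr_concurrency`, `run_batch`, and `process_pdf` are hypothetical. The `asyncio.sleep` calls stand in for the real OCR request and figure downloads.

```python
import asyncio


def effective_ocr_concurrency(workers: int, ocr_workers: int) -> int:
    """Upper bound on simultaneous OCR calls, following the README wording."""
    return min(workers, ocr_workers)


async def run_batch(num_pdfs: int, workers: int, ocr_workers: int) -> int:
    """Simulate a batch and return the peak number of in-flight OCR calls."""
    # Assumption: each limit is modelled as a semaphore.
    pipeline_slots = asyncio.Semaphore(workers)  # PDFs processed at once
    ocr_slots = asyncio.Semaphore(ocr_workers)   # OCR API calls at once
    in_flight = 0
    peak = 0

    async def process_pdf() -> None:
        nonlocal in_flight, peak
        async with pipeline_slots:
            # An OCR call can only start from inside a pipeline slot,
            # which is why the smaller of the two limits wins.
            async with ocr_slots:
                in_flight += 1
                peak = max(peak, in_flight)
                await asyncio.sleep(0.01)  # stand-in for the OCR request
                in_flight -= 1
            await asyncio.sleep(0.01)  # stand-in for figure downloads

    await asyncio.gather(*(process_pdf() for _ in range(num_pdfs)))
    return peak


if __name__ == "__main__":
    print(effective_ocr_concurrency(32, 2))
    print(asyncio.run(run_batch(num_pdfs=8, workers=32, ocr_workers=2)))
    print(asyncio.run(run_batch(num_pdfs=8, workers=1, ocr_workers=2)))
```

Under these assumptions, raising `--ocr-workers` above `--workers` has no effect, because at most `--workers` PDFs are ever far enough along to issue an OCR call.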
