perf: concurrent OCR with aiopytesseract + drop unstructured-pytesseract by KRRT7 · Pull Request #4294 · Unstructured-IO/unstructured

KRRT7 · 2026-03-24T12:31:22Z

Summary

Concurrent OCR across pages using aiopytesseract (async subprocess execution via asyncio.gather) for all Tesseract OCR paths:
- hi_res strategy: full_page and individual_blocks modes
- ocr_only strategy: layout elements extraction (hocr + plaintext in parallel per page)
- Table OCR: pre-fetch all table crop tokens concurrently
Remove pytesseract dependency entirely — sync methods delegate to async aiopytesseract via asyncio.run(), aiopytesseract is the sole tesseract interface
OCRAgentTesseract async methods: get_layout_from_image_async, get_text_from_image_async, get_layout_elements_from_image_async — each accepts an optional asyncio.Semaphore for concurrency control
DRY: Extract compute_zoom() shared by sync and async paths (zoom logic in one place)
Concurrency controlled by OCR_CONCURRENCY env var (defaults to cpu_count)

Why

End-to-end profiling of partition_pdf(strategy="hi_res") on a 10-page PDF showed OCR at 45.6% of total time. Each page spawns 1-2 Tesseract subprocesses sequentially, but pages are independent. aiopytesseract uses asyncio.create_subprocess_exec instead of subprocess.Popen, enabling asyncio.gather across pages.
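
The pattern described above can be sketched as follows. This is a minimal illustration, not the PR's code: `run_page` and `ocr_all` are hypothetical names, and a short Python child process stands in for the Tesseract subprocess so the sketch runs without Tesseract installed.

```python
import asyncio
import sys


async def run_page(page_no: int, sem: asyncio.Semaphore) -> str:
    """Stand-in for one per-page OCR subprocess (hypothetical helper)."""
    async with sem:  # cap the number of subprocesses running at once
        proc = await asyncio.create_subprocess_exec(
            sys.executable,
            "-c",
            f"import time; time.sleep(0.2); print('page {page_no}')",
            stdout=asyncio.subprocess.PIPE,
        )
        out, _ = await proc.communicate()
        return out.decode().strip()


async def ocr_all(n_pages: int, concurrency: int) -> list[str]:
    sem = asyncio.Semaphore(concurrency)
    # gather returns results in argument order, not completion order
    return await asyncio.gather(*(run_page(i, sem) for i in range(n_pages)))


results = asyncio.run(ocr_all(4, concurrency=4))
```

Because the event loop only waits on the child processes, the operating system is free to schedule them on separate cores; the semaphore plays the role that OCR_CONCURRENCY plays in the PR.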

Benchmark

Isolated OCR benchmark (pre-rendered pages, no model inference competing for CPU):

PDF: loremipsum_multipage.pdf (10 pages)
CPUs: 8, OCR_CONCURRENCY: 8

Sequential median: 61.7s (6.17s/page)
Concurrent median: 28.4s (2.84s/page)
Speedup: 2.17x
Time saved: 33.2s (54%)

In the full partition_pdf(strategy="hi_res") pipeline, the speedup is smaller (~1.1x) because model inference also uses CPU. On production machines with more cores (16-64), the concurrent path will scale better since Tesseract subprocesses are independently schedulable.

Changes

| File | Change |
| --- | --- |
| `pyproject.toml` | Add `aiopytesseract>=1.1.0` as required dep, remove `pytesseract` from image extras |
| `tesseract_ocr.py` | Sync methods delegate to async via `asyncio.run()`, `compute_zoom()`, remove `pytesseract` import |
| `ocr.py` | `run_ocr_concurrent` dispatcher, `async_ocr_page` (full_page + individual_blocks), `async_table_extraction` |
| `pdf.py` | Concurrent `ocr_only` strategy via `get_layout_elements_from_image_async` |
| `strategies.py` | Update dependency check and messages from `pytesseract` to `aiopytesseract` |
| `scripts/collect_env.py` | Report `aiopytesseract` version instead of `pytesseract` |
| `test_*.py` (3 files) | Update mocks from `pytesseract` to `aiopytesseract` |

Test plan

partition_pdf(strategy="hi_res") produces identical elements with/without async
partition_pdf(strategy="ocr_only") produces identical elements
OCR_CONCURRENCY=1 falls back to effectively sequential execution
Non-tesseract agents (paddle, google_vision) still work via sync fallback
Table extraction with infer_table_structure=True produces same HTML

Replace the sequential per-page OCR loop in process_file_with_ocr with concurrent execution using asyncio.gather + aiopytesseract. Each page's Tesseract subprocess now runs in parallel instead of waiting for the previous page to finish.

Benchmark (loremipsum_multipage.pdf, 10 pages, CPU):

| Stage | Before | After | Change |
| --- | --- | --- | --- |
| OCR | 77.5s | 32.9s | -57% |
| Total | 169.8s | 115.9s | -32% |

Falls back to sequential processing when:
- OCR agent is not tesseract (e.g., PaddleOCR)
- OCR mode is individual_blocks (not full_page)
- aiopytesseract is not installed

Concurrency is controlled by OCR_CONCURRENCY env var (default: cpu_count).

Add get_text_from_image_async and get_layout_elements_from_image_async to OCRAgentTesseract. The latter runs hocr + plaintext extraction in parallel per page via asyncio.gather. Wire into _partition_pdf_or_image_with_ocr so all pages are OCR'd concurrently when tesseract + aiopytesseract are available. Removes the now-unused _partition_pdf_or_image_with_ocr_from_image function.

Make async_ocr_page handle both full_page and individual_blocks modes. In individual_blocks mode, gather all get_text_from_image_async calls concurrently instead of sequential pytesseract subprocesses. Add async_table_extraction that pre-fetches OCR tokens for all tables concurrently via get_layout_from_image_async, then runs table model predict sequentially (CPU-bound). Extend run_ocr_concurrent to use async for all tesseract OCR modes, not just full_page.

…pytesseract

Sync methods (get_text_from_image, get_layout_from_image, get_layout_elements_from_image) now call asyncio.run() on their async counterparts, eliminating the pytesseract Python package dependency. aiopytesseract (already a required dep) is the sole tesseract interface.

…sertions
- Change aiopytesseract dpi=300 to dpi=70 to match tesseract's default when reading temp PNGs without DPI metadata (the old unstructured-pytesseract behavior). This was the root cause of OCR accuracy regressions (NarrativeText vs Title, DC 20224 vs DC 224).
- Move aiopytesseract from core deps to image optional group (replacing unstructured-pytesseract) to preserve original dependency scope.
- Restore original strict test assertion for test_partition_image_hi_res_ocr_mode since DPI=70 preserves identical OCR output.

cragwolfe

hold up

PastelStorm

Findings

High: new asyncio.run() calls break callers that already have an active event loop.
In unstructured/partition/pdf.py, _partition_pdf_or_image_with_ocr() now does asyncio.run(gather_pages()). In unstructured/partition/utils/ocr_models/tesseract_ocr.py, the sync wrappers get_text_from_image(), get_layout_from_image(), and get_layout_elements_from_image() now also call asyncio.run(...). That will raise RuntimeError: asyncio.run() cannot be called from a running event loop in notebooks, async services, or any sync wrapper invoked from async code. This is especially notable because the same branch already introduced _run_coro() in unstructured/partition/pdf_image/ocr.py to avoid exactly that issue elsewhere, so the behavior is now inconsistent across entry points.
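
One common way to make a sync wrapper safe in both situations is sketched below. The helper name `run_coro_blocking` is hypothetical, and this is not necessarily how the branch's `_run_coro()` is written; it only illustrates the idea of falling back to a worker thread with its own event loop when a loop is already running.

```python
import asyncio
import concurrent.futures
from typing import Any, Coroutine, TypeVar

T = TypeVar("T")


def run_coro_blocking(coro: Coroutine[Any, Any, T]) -> T:
    """Run a coroutine to completion from sync code (hypothetical helper)."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run() is safe.
        return asyncio.run(coro)
    # A loop is already running here (notebook, async service): run the
    # coroutine on a fresh loop in a worker thread and wait for the result.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


async def _answer() -> int:
    return 42


async def _called_from_async_code() -> int:
    # A sync wrapper invoked from inside a running event loop.
    return run_coro_blocking(_answer())


from_sync = run_coro_blocking(_answer())
from_async = asyncio.run(_called_from_async_code())
```

Note that the threaded branch still blocks the caller's event loop until the result is ready; it avoids the RuntimeError but does not make the sync API non-blocking.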
High: the new async table-extraction path silently suppresses table-agent load failures.
There is now a semantic mismatch between the old and new paths in unstructured/partition/pdf_image/ocr.py. supplement_page_layout_with_ocr() still raises RuntimeError("Unable to load table extraction agent.") when tables.tables_agent is unavailable, but async_table_extraction() just returns. That means the concurrent Tesseract path can silently skip table extraction where the existing path would fail loudly, which is a user-visible behavior regression.
Medium: OCRLayoutDumper.add_ocred_page() can label pages in completion order instead of document order.
OCRLayoutDumper tracks page numbers with an internal incrementing counter rather than accepting the actual page number, and the new async OCR path calls add_ocred_page() from concurrently running page tasks. As a result, analysis output can mislabel OCR pages whenever page N finishes before page N-1. This does not affect the returned partition elements, but it does make the OCR analysis artifact incorrect.
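
A sketch of the explicit-page-number approach, under the assumption that the dumper API can be changed. `PageDumper` is a hypothetical stand-in, not the real `OCRLayoutDumper` interface.

```python
import asyncio


class PageDumper:
    """Hypothetical dumper that is told the page number instead of counting."""

    def __init__(self) -> None:
        self.pages: dict[int, str] = {}

    def add_ocred_page(self, page_number: int, layout: str) -> None:
        self.pages[page_number] = layout


async def ocr_page(page_number: int, delay: float, dumper: PageDumper) -> None:
    await asyncio.sleep(delay)  # simulated OCR time
    dumper.add_ocred_page(page_number, f"layout-{page_number}")


async def main() -> PageDumper:
    dumper = PageDumper()
    # Page 2 finishes before page 1, yet both are labeled correctly.
    await asyncio.gather(
        ocr_page(1, 0.05, dumper),
        ocr_page(2, 0.01, dumper),
    )
    return dumper


dumper = asyncio.run(main())
```

An alternative that leaves the dumper untouched would be to collect the results of `asyncio.gather` (which come back in argument order) and call `add_ocred_page()` in a plain loop afterwards.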
Medium: tables.tables_agent.predict() still blocks the event loop inside async_table_extraction().
The async table path prefetches OCR concurrently, but the actual TATR inference still runs synchronously in the coroutine via tables.tables_agent.predict(...). For table-heavy documents, this stalls other coroutines while each table prediction runs and cuts into the throughput gains this PR is trying to achieve.
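
A minimal sketch of one possible fix, assuming the model call is safe to run in a worker thread. `predict` is a stand-in for the blocking inference call, and predictions stay sequential here; only the event loop is freed.

```python
import asyncio
import time


def predict(table_id: int) -> str:
    """Stand-in for a blocking model call (hypothetical)."""
    time.sleep(0.1)
    return f"<table id={table_id}>"


async def extract_tables(table_ids: list[int]) -> list[str]:
    results = []
    for table_id in table_ids:
        # to_thread hands the blocking call to a worker thread so that
        # other coroutines keep running while the model works.
        results.append(await asyncio.to_thread(predict, table_id))
    return results


async def main() -> tuple[list[str], int]:
    ticks = 0

    async def ticker() -> None:
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    task = asyncio.create_task(ticker())
    html = await extract_tables([1, 2])
    task.cancel()
    return html, ticks


html, ticks = asyncio.run(main())
```

With a direct `predict(...)` call inside the coroutine, `ticks` would stay at zero for the whole extraction, which is the stall described above.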
Medium: PDF page image handles are leaked on OCR failure.
process_file_with_ocr() now opens all rendered PDF page images up front, stores them in page_args, then closes them only after _run_coro(...) returns successfully. If OCR raises, the close loop is never reached. That is a real resource leak and can also interfere with temp-file cleanup on platforms with stricter file locking semantics.
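
The cleanup could be made failure-safe along these lines. This is a sketch with hypothetical names (`process_pages`, `FakeImage`), not the PR's code; it registers each image's `close()` as soon as the image is opened.

```python
from contextlib import ExitStack
from typing import Callable, List


class FakeImage:
    """Stand-in for an open page image (hypothetical)."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.closed = False

    def close(self) -> None:
        self.closed = True


def process_pages(
    paths: List[str],
    open_image: Callable[[str], FakeImage],
    run_ocr: Callable[[List[FakeImage]], list],
) -> list:
    with ExitStack() as stack:
        images = []
        for path in paths:
            image = open_image(path)
            # Registered immediately, so it runs even if a later open or
            # the OCR step raises.
            stack.callback(image.close)
            images.append(image)
        return run_ocr(images)


opened: List[FakeImage] = []


def open_and_track(path: str) -> FakeImage:
    image = FakeImage(path)
    opened.append(image)
    return image


def failing_ocr(images: List[FakeImage]) -> list:
    raise RuntimeError("OCR failed")


try:
    process_pages(["p1.png", "p2.png"], open_and_track, failing_ocr)
except RuntimeError:
    pass

all_closed = all(image.closed for image in opened)
```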
Medium: OCR_CONCURRENCY is not validated and can either crash or deadlock the OCR path.
OCR_CONCURRENCY = int(os.environ.get(...)) will raise immediately on invalid values, and OCR_CONCURRENCY=0 creates a zero-capacity semaphore that blocks every guarded OCR call forever. Since this branch introduces the setting as a user-facing control, it should be validated and clamped. Also, unlike most of the rest of the codebase, this value is frozen at module import time rather than going through env_config.
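
Validation and clamping might look like the sketch below. `read_ocr_concurrency` is a hypothetical helper, and falling back to the CPU count on an unparsable value is an assumption; raising a clear error would be an equally reasonable policy.

```python
import os
from typing import Mapping, Optional


def read_ocr_concurrency(environ: Optional[Mapping[str, str]] = None) -> int:
    """Parse OCR_CONCURRENCY, never returning less than 1 (hypothetical)."""
    env = os.environ if environ is None else environ
    default = os.cpu_count() or 1
    raw = env.get("OCR_CONCURRENCY")
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        # Assumption: an invalid value falls back to the default.
        return default
    # A zero-capacity semaphore would block every OCR call forever.
    return max(1, value)


zero = read_ocr_concurrency({"OCR_CONCURRENCY": "0"})
negative = read_ocr_concurrency({"OCR_CONCURRENCY": "-3"})
explicit = read_ocr_concurrency({"OCR_CONCURRENCY": "4"})
invalid = read_ocr_concurrency({"OCR_CONCURRENCY": "many"})
```

Calling the helper at use time rather than at module import would also address the frozen-at-import concern.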
Medium: _partition_pdf_or_image_with_ocr() now buffers all page images before OCR, which is a scalability regression.
The new code in unstructured/partition/pdf.py first collects every rendered page into pages and only then starts OCR. The previous flow processed page-by-page. This increases peak memory substantially for large PDFs and can turn a performance optimization into an OOM/swap risk.
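
One way to keep concurrency without holding every page in memory is to consume the page iterator in bounded batches. The sketch below uses hypothetical names (`batched`, `ocr_in_batches`) and a trivial stand-in for the OCR call.

```python
import asyncio
from itertools import islice
from typing import Iterable, Iterator, List


def batched(items: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield lists of at most `size` items, pulling lazily from `items`."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


async def ocr(page: int) -> str:
    await asyncio.sleep(0)  # stand-in for the real OCR call
    return f"text:{page}"


async def ocr_in_batches(pages: Iterable[int], batch_size: int) -> List[str]:
    results: List[str] = []
    for batch in batched(pages, batch_size):
        # Only `batch_size` pages are materialized at any one time.
        results.extend(await asyncio.gather(*(ocr(page) for page in batch)))
    return results


texts = asyncio.run(ocr_in_batches(iter(range(5)), batch_size=2))
```

The trade-off is that each batch waits for its slowest page; a bounded queue with worker tasks would avoid that at the cost of more code.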
Medium: the new async path duplicates a large amount of the existing sync OCR logic, and the two paths have already drifted.
async_ocr_page() and async_table_extraction() largely reimplement behavior that already exists in supplement_page_layout_with_ocr(), supplement_element_with_table_extraction(), and get_table_tokens(). This is not just a style concern: the table-agent failure handling has already diverged between the sync and async implementations. I’d strongly prefer extracting shared logic so future fixes don’t have to be applied in two places.
Low: test coverage does not directly protect the new async machinery.
The branch does exercise the new path indirectly, but there are still no focused tests for _run_coro, concurrent page execution, async table extraction behavior, event-loop edge cases, or OCR_CONCURRENCY handling. Given that the primary change here is new async orchestration, those gaps make the higher-risk behavior hard to trust.
Low: some existing tests were weakened around the code that changed most.
test_unstructured/partition/pdf_image/test_ocr.py::test_get_ocr_layout_from_image_tesseract no longer exercises the Tesseract layout-from-image path at all; it now only tests parse_data(). That means regressions in the new async HOCR/zoom flow would not be caught there. Also, test_unstructured/partition/pdf_image/test_pdf.py::test_ocr_language_passes_through only proves that the first aiopytesseract.execute() call receives lang; on the OCR-only path there are two concurrent Tesseract calls, so the test can still pass even if one of them stops forwarding the language correctly.
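
A stricter language pass-through test could inspect every recorded call rather than only the first. In the sketch below, `ocr_only_page` and the keyword arguments of `execute` are hypothetical stand-ins for the real OCR-only path and are not taken from the aiopytesseract API.

```python
import asyncio
from unittest.mock import AsyncMock


async def ocr_only_page(execute, image: bytes, lang: str):
    """Stand-in for the OCR-only path: hOCR and plain text run in parallel."""
    return await asyncio.gather(
        execute(image, output="hocr", lang=lang),
        execute(image, output="txt", lang=lang),
    )


execute = AsyncMock(return_value=b"")
asyncio.run(ocr_only_page(execute, b"fake-png", "deu"))

# Collect the language seen by every call, not just call_args (the last one).
langs = [call.kwargs["lang"] for call in execute.call_args_list]
```

Asserting on the full `langs` list fails as soon as either of the two concurrent calls stops forwarding the language.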

Overall

The biggest issues to fix before merging are:

event-loop breakage from the new asyncio.run() usage,
the silent failure regression in async table extraction,
the incorrect page numbering in OCR analysis output,
and the leaked PDF image handles on OCR failure.

After that, I’d tackle the event-loop blocking TATR call and the sync/async duplication, because those both undercut the main goal of the PR.

KRRT7 added 15 commits March 24, 2026 05:18

chore: replace unstructured-pytesseract with standard pytesseract

6214852

The forked unstructured-pytesseract has only 15 trivial line differences from upstream pytesseract (type annotation style, unused extra_config param). Drop the fork and use the actively maintained upstream package.

refactor: move async OCR logic into OCRAgentTesseract

45c2a75

Add get_layout_from_image_async() to OCRAgentTesseract so the zoom logic lives in one place instead of being duplicated in ocr.py. Drop _ prefixes from module-level functions (run_ocr_concurrent, async_ocr_page, OCR_CONCURRENCY).

refactor: extract compute_zoom to DRY sync/async OCR paths

a1fb2e6

The zoom calculation logic was duplicated between get_layout_from_image and get_layout_from_image_async. Extract it into compute_zoom() so both methods share the same implementation.

chore: remove aiopytesseract import guards

ad7d590

aiopytesseract is a required dependency, so try/except ImportError fallbacks are unnecessary. Simplify to direct isinstance checks.

Add CHANGELOG entry for concurrent OCR with aiopytesseract

96cce72

Bump version to 0.22.2 for changelog CI check

fab6af6

format

50910b5

formatting.

bff579a

comp

56eb0c8

chore: update uv.lock for aiopytesseract in image group

8354cc0

cragwolfe approved these changes Mar 25, 2026

cragwolfe requested changes Mar 25, 2026

PastelStorm requested changes Mar 26, 2026

KRRT7 closed this Mar 26, 2026

KRRT7 wants to merge 15 commits into Unstructured-IO:main from KRRT7:perf/concurrent-ocr-aiopytesseract