perf: concurrent OCR with aiopytesseract + drop unstructured-pytesseract #4294
KRRT7 wants to merge 15 commits into Unstructured-IO:main
Replace the sequential per-page OCR loop in process_file_with_ocr with concurrent execution using asyncio.gather + aiopytesseract. Each page's Tesseract subprocess now runs in parallel instead of waiting for the previous page to finish.
Benchmark (loremipsum_multipage.pdf, 10 pages, CPU):
Stage   Before   After    Change
OCR     77.5s    32.9s    -57%
Total   169.8s   115.9s   -32%
Falls back to sequential processing when:
- OCR agent is not tesseract (e.g., PaddleOCR)
- OCR mode is individual_blocks (not full_page)
- aiopytesseract is not installed
Concurrency is controlled by OCR_CONCURRENCY env var (default: cpu_count).
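The pattern this commit describes (one coroutine per page, bounded by a semaphore, collected with asyncio.gather) can be sketched roughly as follows. This is an illustration, not the PR's code: fake_ocr_page and ocr_pages are hypothetical names, and asyncio.sleep stands in for the Tesseract subprocess call.

```python
import asyncio
import os


async def fake_ocr_page(page_number: int) -> str:
    """Hypothetical stand-in for one Tesseract subprocess call."""
    await asyncio.sleep(0.01)  # simulates waiting on the subprocess
    return f"text of page {page_number}"


async def ocr_pages(page_numbers: list[int], limit: int) -> list[str]:
    """OCR all pages concurrently, with at most `limit` pages in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(page_number: int) -> str:
        async with semaphore:
            return await fake_ocr_page(page_number)

    # gather returns results in argument order, regardless of completion order.
    return await asyncio.gather(*(guarded(n) for n in page_numbers))


def run_demo() -> list[str]:
    limit = os.cpu_count() or 1  # mirrors the described default of cpu_count
    return asyncio.run(ocr_pages([1, 2, 3, 4], limit))
```

With limit=1 the same code degrades to effectively sequential execution, since only one guarded call can hold the semaphore at a time.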
The forked unstructured-pytesseract has only 15 trivial line differences from upstream pytesseract (type annotation style, unused extra_config param). Drop the fork and use the actively maintained upstream package.
Add get_layout_from_image_async() to OCRAgentTesseract so the zoom logic lives in one place instead of being duplicated in ocr.py. Drop _ prefixes from module-level functions (run_ocr_concurrent, async_ocr_page, OCR_CONCURRENCY).
The zoom calculation logic was duplicated between get_layout_from_image and get_layout_from_image_async. Extract it into compute_zoom() so both methods share the same implementation.
Add get_text_from_image_async and get_layout_elements_from_image_async to OCRAgentTesseract. The latter runs hocr + plaintext extraction in parallel per page via asyncio.gather. Wire into _partition_pdf_or_image_with_ocr so all pages are OCR'd concurrently when tesseract + aiopytesseract are available. Removes the now-unused _partition_pdf_or_image_with_ocr_from_image function.
Make async_ocr_page handle both full_page and individual_blocks modes. In individual_blocks mode, gather all get_text_from_image_async calls concurrently instead of sequential pytesseract subprocesses. Add async_table_extraction that pre-fetches OCR tokens for all tables concurrently via get_layout_from_image_async, then runs table model predict sequentially (CPU-bound). Extend run_ocr_concurrent to use async for all tesseract OCR modes, not just full_page.
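The two-phase shape described for async_table_extraction (fetch OCR tokens for every table concurrently, then run the model one table at a time) might look roughly like this. fetch_table_tokens, predict_table and extract_tables are hypothetical stand-ins chosen for illustration; they are not the functions in this branch.

```python
import asyncio


async def fetch_table_tokens(table_id: int) -> list[str]:
    """Hypothetical stand-in for one async OCR call that returns a table's tokens."""
    await asyncio.sleep(0.01)  # simulates waiting on a Tesseract subprocess
    return [f"token-{table_id}-a", f"token-{table_id}-b"]


def predict_table(tokens: list[str]) -> str:
    """Hypothetical stand-in for the synchronous, CPU-bound table model."""
    return "<table>" + "".join(f"<td>{t}</td>" for t in tokens) + "</table>"


async def extract_tables(table_ids: list[int]) -> list[str]:
    # Phase 1: issue every OCR request at once; gather keeps input order.
    all_tokens = await asyncio.gather(*(fetch_table_tokens(t) for t in table_ids))
    # Phase 2: run the model one table at a time, in the coroutine itself.
    return [predict_table(tokens) for tokens in all_tokens]
```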
aiopytesseract is a required dependency, so try/except ImportError fallbacks are unnecessary. Simplify to direct isinstance checks.
…pytesseract
Sync methods (get_text_from_image, get_layout_from_image, get_layout_elements_from_image) now call asyncio.run() on their async counterparts, eliminating the pytesseract Python package dependency. aiopytesseract (already a required dep) is the sole tesseract interface.
…sertions
- Change aiopytesseract dpi=300 to dpi=70 to match tesseract's default when reading temp PNGs without DPI metadata (the old unstructured-pytesseract behavior). This was the root cause of OCR accuracy regressions (NarrativeText vs Title, DC 20224 vs DC 224).
- Move aiopytesseract from core deps to image optional group (replacing unstructured-pytesseract) to preserve original dependency scope.
- Restore original strict test assertion for test_partition_image_hi_res_ocr_mode since DPI=70 preserves identical OCR output.
PastelStorm left a comment
Findings
- High: new `asyncio.run()` calls break callers that already have an active event loop.
  In `unstructured/partition/pdf.py`, `_partition_pdf_or_image_with_ocr()` now does `asyncio.run(gather_pages())`. In `unstructured/partition/utils/ocr_models/tesseract_ocr.py`, the sync wrappers `get_text_from_image()`, `get_layout_from_image()`, and `get_layout_elements_from_image()` now also call `asyncio.run(...)`. That will raise `RuntimeError: asyncio.run() cannot be called from a running event loop` in notebooks, async services, or any sync wrapper invoked from async code. This is especially notable because the same branch already introduced `_run_coro()` in `unstructured/partition/pdf_image/ocr.py` to avoid exactly that issue elsewhere, so the behavior is now inconsistent across entry points.
- High: the new async table-extraction path silently suppresses table-agent load failures.
  There is now a semantic mismatch between the old and new paths in `unstructured/partition/pdf_image/ocr.py`. `supplement_page_layout_with_ocr()` still raises `RuntimeError("Unable to load table extraction agent.")` when `tables.tables_agent` is unavailable, but `async_table_extraction()` just returns. That means the concurrent Tesseract path can silently skip table extraction where the existing path would fail loudly, which is a user-visible behavior regression.
- Medium: `OCRLayoutDumper.add_ocred_page()` can label pages in completion order instead of document order.
  `OCRLayoutDumper` tracks page numbers with an internal incrementing counter rather than accepting the actual page number, and the new async OCR path calls `add_ocred_page()` from concurrently running page tasks. As a result, analysis output can mislabel OCR pages whenever page N finishes before page N-1. This does not affect the returned partition elements, but it does make the OCR analysis artifact incorrect.
- Medium: `tables.tables_agent.predict()` still blocks the event loop inside `async_table_extraction()`.
  The async table path prefetches OCR concurrently, but the actual TATR inference still runs synchronously in the coroutine via `tables.tables_agent.predict(...)`. For table-heavy documents, this stalls other coroutines while each table prediction runs and cuts into the throughput gains this PR is trying to achieve.
- Medium: PDF page image handles are leaked on OCR failure.
  `process_file_with_ocr()` now opens all rendered PDF page images up front, stores them in `page_args`, then closes them only after `_run_coro(...)` returns successfully. If OCR raises, the close loop is never reached. That is a real resource leak and can also interfere with temp-file cleanup on platforms with stricter file locking semantics.
- Medium: `OCR_CONCURRENCY` is not validated and can either crash or deadlock the OCR path.
  `OCR_CONCURRENCY = int(os.environ.get(...))` will raise immediately on invalid values, and `OCR_CONCURRENCY=0` creates a zero-capacity semaphore that blocks every guarded OCR call forever. Since this branch introduces the setting as a user-facing control, it should be validated and clamped. Also, unlike most of the rest of the codebase, this value is frozen at module import time rather than going through `env_config`.
- Medium: `_partition_pdf_or_image_with_ocr()` now buffers all page images before OCR, which is a scalability regression.
  The new code in `unstructured/partition/pdf.py` first collects every rendered page into `pages` and only then starts OCR. The previous flow processed page-by-page. This increases peak memory substantially for large PDFs and can turn a performance optimization into an OOM/swap risk.
- Medium: the new async path duplicates a large amount of the existing sync OCR logic, and the two paths have already drifted.
  `async_ocr_page()` and `async_table_extraction()` largely reimplement behavior that already exists in `supplement_page_layout_with_ocr()`, `supplement_element_with_table_extraction()`, and `get_table_tokens()`. This is not just a style concern: the table-agent failure handling has already diverged between the sync and async implementations. I’d strongly prefer extracting shared logic so future fixes don’t have to be applied in two places.
- Low: test coverage does not directly protect the new async machinery.
  The branch does exercise the new path indirectly, but there are still no focused tests for `_run_coro`, concurrent page execution, async table extraction behavior, event-loop edge cases, or `OCR_CONCURRENCY` handling. Given that the primary change here is new async orchestration, those gaps make the higher-risk behavior hard to trust.
- Low: some existing tests were weakened around the code that changed most.
  `test_unstructured/partition/pdf_image/test_ocr.py::test_get_ocr_layout_from_image_tesseract` no longer exercises the Tesseract layout-from-image path at all; it now only tests `parse_data()`. That means regressions in the new async HOCR/zoom flow would not be caught there. Also, `test_unstructured/partition/pdf_image/test_pdf.py::test_ocr_language_passes_through` only proves that the first `aiopytesseract.execute()` call receives `lang`; on the OCR-only path there are two concurrent Tesseract calls, so the test can still pass even if one of them stops forwarding the language correctly.
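To make the first finding concrete, here is a minimal sketch of a sync entry point that tolerates an already-running event loop. `run_coro_sync` is a hypothetical helper written for this comment; the branch's own `_run_coro()` may well be implemented differently.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Coroutine, TypeVar

T = TypeVar("T")


def run_coro_sync(coro: Coroutine[Any, Any, T]) -> T:
    """Run `coro` to completion from sync code (hypothetical helper)."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run() is safe.
        return asyncio.run(coro)
    # A loop is already running here; calling asyncio.run() directly would raise.
    # Run the coroutine on a fresh loop in a worker thread and wait for it.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


async def double(x: int) -> int:
    await asyncio.sleep(0)
    return 2 * x
```

Note the trade-off in this sketch: when called from inside a running loop, the caller's loop is blocked until the worker thread finishes. It avoids the `RuntimeError`, but it does not make the sync wrapper cooperative.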
Overall
The biggest issues to fix before merging are:
- event-loop breakage from the new `asyncio.run()` usage,
- the silent failure regression in async table extraction,
- the incorrect page numbering in OCR analysis output,
- and the leaked PDF image handles on OCR failure.
After that, I’d tackle the event-loop blocking TATR call and the sync/async duplication, because those both undercut the main goal of the PR.
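As a rough illustration of two of the fixes listed above (closing page images even when OCR fails, and carrying the real page number instead of relying on completion order), consider the sketch below. `FakePageImage`, `ocr_page` and `ocr_all` are hypothetical stand-ins, not code from this branch.

```python
import asyncio
from typing import Optional


class FakePageImage:
    """Hypothetical stand-in for an opened page image handle."""

    def __init__(self, page_number: int) -> None:
        self.page_number = page_number
        self.closed = False

    def close(self) -> None:
        self.closed = True


async def ocr_page(image: FakePageImage, fail_on: Optional[int]) -> tuple[int, str]:
    await asyncio.sleep(0)  # stand-in for the OCR subprocess
    if image.page_number == fail_on:
        raise RuntimeError(f"OCR failed on page {image.page_number}")
    # The explicit page number travels with the result.
    return image.page_number, f"page {image.page_number}"


async def ocr_all(
    images: list[FakePageImage], fail_on: Optional[int] = None
) -> list[tuple[int, str]]:
    try:
        # return_exceptions=True waits for every task to finish, so no task
        # is still using an image when the finally block closes it.
        results = await asyncio.gather(
            *(ocr_page(img, fail_on) for img in images), return_exceptions=True
        )
    finally:
        for img in images:
            img.close()
    for result in results:
        if isinstance(result, BaseException):
            raise result
    return list(results)
```

Plain `gather` without `return_exceptions=True` would propagate the first failure while sibling tasks are still running, which is why the sketch collects exceptions first and re-raises after cleanup.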
Summary
- `aiopytesseract` (async subprocess execution via `asyncio.gather`) for all Tesseract OCR paths:
  - `hi_res` strategy: full_page and individual_blocks modes
  - `ocr_only` strategy: layout elements extraction (hocr + plaintext in parallel per page)
- Drop the `pytesseract` dependency entirely: sync methods delegate to async `aiopytesseract` via `asyncio.run()`, and `aiopytesseract` is the sole tesseract interface
- `OCRAgentTesseract` async methods: `get_layout_from_image_async`, `get_text_from_image_async`, `get_layout_elements_from_image_async`; each accepts an optional `asyncio.Semaphore` for concurrency control
- `compute_zoom()` shared by sync and async paths (zoom logic in one place)
- `OCR_CONCURRENCY` env var (defaults to `cpu_count`)
Why
End-to-end profiling of `partition_pdf(strategy="hi_res")` on a 10-page PDF showed OCR at 45.6% of total time. Each page spawns 1-2 Tesseract subprocesses sequentially, but pages are independent. `aiopytesseract` uses `asyncio.create_subprocess_exec` instead of `subprocess.Popen`, enabling `asyncio.gather` across pages.
Benchmark
Isolated OCR benchmark (pre-rendered pages, no model inference competing for CPU):
In the full `partition_pdf(strategy="hi_res")` pipeline, the speedup is smaller (~1.1x) because model inference also uses CPU. On production machines with more cores (16-64), the concurrent path will scale better since Tesseract subprocesses are independently schedulable.
Changes
| File | Change |
| --- | --- |
| `pyproject.toml` | `aiopytesseract>=1.1.0` as required dep, remove `pytesseract` from image extras |
| `tesseract_ocr.py` | `asyncio.run()`, `compute_zoom()`, remove `pytesseract` import |
| `ocr.py` | `run_ocr_concurrent` dispatcher, `async_ocr_page` (full_page + individual_blocks), `async_table_extraction` |
| `pdf.py` | `ocr_only` strategy via `get_layout_elements_from_image_async` |
| `strategies.py` | `pytesseract` to `aiopytesseract` |
| `scripts/collect_env.py` | `aiopytesseract` version instead of `pytesseract` |
| `test_*.py` (3 files) | `pytesseract` to `aiopytesseract` |
Test plan
- `partition_pdf(strategy="hi_res")` produces identical elements with/without async
- `partition_pdf(strategy="ocr_only")` produces identical elements
- `OCR_CONCURRENCY=1` falls back to effectively sequential execution
- `infer_table_structure=True` produces same HTML
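For the `OCR_CONCURRENCY` setting, a defensive way to read the value might look like the sketch below. `read_ocr_concurrency` is a hypothetical helper, and falling back to the CPU count on unparsable input is an assumption made for illustration; it is not what this branch currently does.

```python
import os
from typing import Mapping, Optional


def read_ocr_concurrency(environ: Optional[Mapping[str, str]] = None) -> int:
    """Return a usable concurrency limit (always >= 1). Hypothetical helper."""
    env = os.environ if environ is None else environ
    default = os.cpu_count() or 1
    raw = env.get("OCR_CONCURRENCY")
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        # Assumption: fall back to the default instead of raising at import time.
        return default
    # Clamp to at least 1 so a semaphore built from this value can be acquired.
    return max(1, value)
```

Calling the helper at use time rather than at import time also means a changed environment variable is picked up without reloading the module.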