perf: concurrent OCR with aiopytesseract + drop unstructured-pytesseract #4294

Closed
KRRT7 wants to merge 15 commits into Unstructured-IO:main from KRRT7:perf/concurrent-ocr-aiopytesseract

Conversation

Collaborator

@KRRT7 KRRT7 commented Mar 24, 2026

Summary

  • Concurrent OCR across pages using aiopytesseract (async subprocess execution via asyncio.gather) for all Tesseract OCR paths:
    • hi_res strategy: full_page and individual_blocks modes
    • ocr_only strategy: layout elements extraction (hocr + plaintext in parallel per page)
    • Table OCR: pre-fetch all table crop tokens concurrently
  • Remove pytesseract dependency entirely: sync methods delegate to async aiopytesseract via asyncio.run(), so aiopytesseract is the sole Tesseract interface
  • OCRAgentTesseract async methods (get_layout_from_image_async, get_text_from_image_async, get_layout_elements_from_image_async), each accepting an optional asyncio.Semaphore for concurrency control
  • DRY: Extract compute_zoom() shared by sync and async paths (zoom logic in one place)
  • Concurrency controlled by OCR_CONCURRENCY env var (defaults to cpu_count)
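A minimal sketch of reading such a setting defensively, assuming the intent is "a positive integer, else the CPU count". The helper name read_ocr_concurrency and its fallback policy are illustrative assumptions, not code from this branch:

```python
import os


def read_ocr_concurrency(env=None):
    """Return a positive worker count from OCR_CONCURRENCY, else the CPU count.

    Hypothetical helper for illustration; the fallback policy is an assumption.
    """
    env = os.environ if env is None else env
    default = os.cpu_count() or 1
    raw = env.get("OCR_CONCURRENCY")
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        # Non-numeric input: fall back instead of crashing at import time.
        return default
    # Clamp to at least 1 so a semaphore built from this value can never be zero-capacity.
    return max(1, value)
```

Reading the value at call time rather than at module import would also let callers change it between runs.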

Why

End-to-end profiling of partition_pdf(strategy="hi_res") on a 10-page PDF showed OCR at 45.6% of total time. Each page spawns 1-2 Tesseract subprocesses sequentially, but pages are independent. aiopytesseract uses asyncio.create_subprocess_exec instead of subprocess.Popen, enabling asyncio.gather across pages.
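The pattern described above can be sketched as follows. This is an illustration only: a child Python process stands in for the Tesseract binary, and run_one, ocr_all_pages, and run_pages are hypothetical names rather than functions from this PR or from aiopytesseract.

```python
import asyncio
import sys


async def run_one(page_no, semaphore):
    # The semaphore caps how many subprocesses are alive at once.
    async with semaphore:
        # Stand-in for a Tesseract invocation: a child process that echoes the page number.
        proc = await asyncio.create_subprocess_exec(
            sys.executable, "-c", f"print('page {page_no}')",
            stdout=asyncio.subprocess.PIPE,
        )
        stdout, _ = await proc.communicate()
        return stdout.decode().strip()


async def ocr_all_pages(page_count, concurrency):
    semaphore = asyncio.Semaphore(concurrency)
    tasks = [run_one(n, semaphore) for n in range(1, page_count + 1)]
    # gather returns results in the order the awaitables were passed,
    # regardless of which subprocess finishes first.
    return await asyncio.gather(*tasks)


def run_pages(page_count, concurrency=2):
    return asyncio.run(ocr_all_pages(page_count, concurrency))
```

Because gather preserves input order, the returned list stays in document order even when later pages finish earlier.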

Benchmark

Isolated OCR benchmark (pre-rendered pages, no model inference competing for CPU):

PDF: loremipsum_multipage.pdf (10 pages)
CPUs: 8, OCR_CONCURRENCY: 8

Sequential median: 61.7s (6.17s/page)
Concurrent median: 28.4s (2.84s/page)
Speedup: 2.17x
Time saved: 33.2s (54%)

In the full partition_pdf(strategy="hi_res") pipeline, the speedup is smaller (~1.1x) because model inference also uses CPU. On production machines with more cores (16-64), the concurrent path is expected to scale better, since Tesseract subprocesses are independently schedulable.

Changes

| File | Change |
|---|---|
| pyproject.toml | Add aiopytesseract>=1.1.0 as required dep, remove pytesseract from image extras |
| tesseract_ocr.py | Sync methods delegate to async via asyncio.run(), compute_zoom(), remove pytesseract import |
| ocr.py | run_ocr_concurrent dispatcher, async_ocr_page (full_page + individual_blocks), async_table_extraction |
| pdf.py | Concurrent ocr_only strategy via get_layout_elements_from_image_async |
| strategies.py | Update dependency check and messages from pytesseract to aiopytesseract |
| scripts/collect_env.py | Report aiopytesseract version instead of pytesseract |
| test_*.py (3 files) | Update mocks from pytesseract to aiopytesseract |

Test plan

  • partition_pdf(strategy="hi_res") produces identical elements with/without async
  • partition_pdf(strategy="ocr_only") produces identical elements
  • OCR_CONCURRENCY=1 falls back to effectively sequential execution
  • Non-tesseract agents (paddle, google_vision) still work via sync fallback
  • Table extraction with infer_table_structure=True produces same HTML

KRRT7 added 15 commits March 24, 2026 05:18
Replace the sequential per-page OCR loop in process_file_with_ocr with
concurrent execution using asyncio.gather + aiopytesseract. Each page's
Tesseract subprocess now runs in parallel instead of waiting for the
previous page to finish.

Benchmark (loremipsum_multipage.pdf, 10 pages, CPU):
  Stage            Before      After     Change
  OCR              77.5s       32.9s     -57%
  Total           169.8s      115.9s     -32%

Falls back to sequential processing when:
- OCR agent is not tesseract (e.g., PaddleOCR)
- OCR mode is individual_blocks (not full_page)
- aiopytesseract is not installed

Concurrency is controlled by OCR_CONCURRENCY env var (default: cpu_count).

The forked unstructured-pytesseract has only 15 trivial line differences
from upstream pytesseract (type annotation style, unused extra_config
param). Drop the fork and use the actively maintained upstream package.

Add get_layout_from_image_async() to OCRAgentTesseract so the zoom
logic lives in one place instead of being duplicated in ocr.py.
Drop _ prefixes from module-level functions (run_ocr_concurrent,
async_ocr_page, OCR_CONCURRENCY).

The zoom calculation logic was duplicated between get_layout_from_image
and get_layout_from_image_async. Extract it into compute_zoom() so both
methods share the same implementation.

Add get_text_from_image_async and get_layout_elements_from_image_async
to OCRAgentTesseract. The latter runs hocr + plaintext extraction in
parallel per page via asyncio.gather.

Wire into _partition_pdf_or_image_with_ocr so all pages are OCR'd
concurrently when tesseract + aiopytesseract are available. Removes
the now-unused _partition_pdf_or_image_with_ocr_from_image function.

Make async_ocr_page handle both full_page and individual_blocks modes.
In individual_blocks mode, gather all get_text_from_image_async calls
concurrently instead of sequential pytesseract subprocesses.

Add async_table_extraction that pre-fetches OCR tokens for all tables
concurrently via get_layout_from_image_async, then runs table model
predict sequentially (CPU-bound).

Extend run_ocr_concurrent to use async for all tesseract OCR modes,
not just full_page.

aiopytesseract is a required dependency, so try/except ImportError
fallbacks are unnecessary. Simplify to direct isinstance checks.

…pytesseract

Sync methods (get_text_from_image, get_layout_from_image,
get_layout_elements_from_image) now call asyncio.run() on their async
counterparts, eliminating the pytesseract Python package dependency.
aiopytesseract (already a required dep) is the sole tesseract interface.

…sertions

- Change aiopytesseract dpi=300 to dpi=70 to match tesseract's default
  when reading temp PNGs without DPI metadata (the old
  unstructured-pytesseract behavior). This was the root cause of OCR
  accuracy regressions (NarrativeText vs Title, DC 20224 vs DC 224).
- Move aiopytesseract from core deps to image optional group (replacing
  unstructured-pytesseract) to preserve original dependency scope.
- Restore original strict test assertion for
  test_partition_image_hi_res_ocr_mode since DPI=70 preserves identical
  OCR output.

Contributor

@cragwolfe cragwolfe left a comment

hold up

Contributor

@PastelStorm PastelStorm left a comment

Findings

  1. High: new asyncio.run() calls break callers that already have an active event loop.
    In unstructured/partition/pdf.py, _partition_pdf_or_image_with_ocr() now does asyncio.run(gather_pages()). In unstructured/partition/utils/ocr_models/tesseract_ocr.py, the sync wrappers get_text_from_image(), get_layout_from_image(), and get_layout_elements_from_image() now also call asyncio.run(...). That will raise RuntimeError: asyncio.run() cannot be called from a running event loop in notebooks, async services, or any sync wrapper invoked from async code. This is especially notable because the same branch already introduced _run_coro() in unstructured/partition/pdf_image/ocr.py to avoid exactly that issue elsewhere, so the behavior is now inconsistent across entry points.

  2. High: the new async table-extraction path silently suppresses table-agent load failures.
    There is now a semantic mismatch between the old and new paths in unstructured/partition/pdf_image/ocr.py. supplement_page_layout_with_ocr() still raises RuntimeError("Unable to load table extraction agent.") when tables.tables_agent is unavailable, but async_table_extraction() just returns. That means the concurrent Tesseract path can silently skip table extraction where the existing path would fail loudly, which is a user-visible behavior regression.

  3. Medium: OCRLayoutDumper.add_ocred_page() can label pages in completion order instead of document order.
    OCRLayoutDumper tracks page numbers with an internal incrementing counter rather than accepting the actual page number, and the new async OCR path calls add_ocred_page() from concurrently running page tasks. As a result, analysis output can mislabel OCR pages whenever page N finishes before page N-1. This does not affect the returned partition elements, but it does make the OCR analysis artifact incorrect.

  4. Medium: tables.tables_agent.predict() still blocks the event loop inside async_table_extraction().
    The async table path prefetches OCR concurrently, but the actual TATR inference still runs synchronously in the coroutine via tables.tables_agent.predict(...). For table-heavy documents, this stalls other coroutines while each table prediction runs and cuts into the throughput gains this PR is trying to achieve.

  5. Medium: PDF page image handles are leaked on OCR failure.
    process_file_with_ocr() now opens all rendered PDF page images up front, stores them in page_args, then closes them only after _run_coro(...) returns successfully. If OCR raises, the close loop is never reached. That is a real resource leak and can also interfere with temp-file cleanup on platforms with stricter file locking semantics.

  6. Medium: OCR_CONCURRENCY is not validated and can either crash or deadlock the OCR path.
    OCR_CONCURRENCY = int(os.environ.get(...)) will raise immediately on invalid values, and OCR_CONCURRENCY=0 creates a zero-capacity semaphore that blocks every guarded OCR call forever. Since this branch introduces the setting as a user-facing control, it should be validated and clamped. Also, unlike most of the rest of the codebase, this value is frozen at module import time rather than going through env_config.

  7. Medium: _partition_pdf_or_image_with_ocr() now buffers all page images before OCR, which is a scalability regression.
    The new code in unstructured/partition/pdf.py first collects every rendered page into pages and only then starts OCR. The previous flow processed page-by-page. This increases peak memory substantially for large PDFs and can turn a performance optimization into an OOM/swap risk.

  8. Medium: the new async path duplicates a large amount of the existing sync OCR logic, and the two paths have already drifted.
    async_ocr_page() and async_table_extraction() largely reimplement behavior that already exists in supplement_page_layout_with_ocr(), supplement_element_with_table_extraction(), and get_table_tokens(). This is not just a style concern: the table-agent failure handling has already diverged between the sync and async implementations. I’d strongly prefer extracting shared logic so future fixes don’t have to be applied in two places.

  9. Low: test coverage does not directly protect the new async machinery.
    The branch does exercise the new path indirectly, but there are still no focused tests for _run_coro, concurrent page execution, async table extraction behavior, event-loop edge cases, or OCR_CONCURRENCY handling. Given that the primary change here is new async orchestration, those gaps make the higher-risk behavior hard to trust.

  10. Low: some existing tests were weakened around the code that changed most.
    test_unstructured/partition/pdf_image/test_ocr.py::test_get_ocr_layout_from_image_tesseract no longer exercises the Tesseract layout-from-image path at all; it now only tests parse_data(). That means regressions in the new async HOCR/zoom flow would not be caught there. Also, test_unstructured/partition/pdf_image/test_pdf.py::test_ocr_language_passes_through only proves that the first aiopytesseract.execute() call receives lang; on the OCR-only path there are two concurrent Tesseract calls, so the test can still pass even if one of them stops forwarding the language correctly.
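To make finding 1 concrete, here is one way a sync wrapper could tolerate an already running event loop. This is a sketch under assumptions: run_coro_sync is a hypothetical name, and it is not claimed to match the branch's _run_coro() implementation.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def run_coro_sync(coro):
    """Run a coroutine to completion from synchronous code (illustrative helper)."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run() is safe.
        return asyncio.run(coro)
    # A loop is already running here: run the coroutine on a fresh loop
    # in a worker thread and block until it finishes.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


async def double(x):
    await asyncio.sleep(0)
    return x * 2


async def call_from_async():
    # Simulates a sync wrapper being invoked while a loop is already running.
    return run_coro_sync(double(5))
```

Note the trade-off: when called from inside a running loop, this blocks that loop's thread until the coroutine completes. It avoids the RuntimeError but does not make the caller non-blocking.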

Overall

The biggest issues to fix before merging are:

  • event-loop breakage from the new asyncio.run() usage,
  • the silent failure regression in async table extraction,
  • the incorrect page numbering in OCR analysis output,
  • and the leaked PDF image handles on OCR failure.
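For the last item, a minimal sketch of closing every opened page image even when OCR raises, using contextlib.ExitStack. FakeImage, process_pages, and closed_flags_after_failure are hypothetical stand-ins for illustration, not names from the branch:

```python
from contextlib import ExitStack


class FakeImage:
    """Stand-in for a rendered page image exposing close() (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.closed = False

    def close(self):
        self.closed = True


def process_pages(names, ocr):
    # ExitStack runs every registered callback on exit, whether or not ocr() raises.
    with ExitStack() as stack:
        images = []
        for name in names:
            image = FakeImage(name)
            stack.callback(image.close)
            images.append(image)
        return ocr(images)


def closed_flags_after_failure():
    seen = []

    def failing_ocr(images):
        seen.extend(images)
        raise RuntimeError("simulated OCR failure")

    try:
        process_pages(["page-1", "page-2"], failing_ocr)
    except RuntimeError:
        pass
    return [image.closed for image in seen]
```

A plain try/finally around the OCR call would achieve the same thing. ExitStack also covers the case where opening a later image fails after earlier ones were already opened.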

After that, I’d tackle the event-loop blocking TATR call and the sync/async duplication, because those both undercut the main goal of the PR.
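One possible shape for unblocking the loop during table prediction is to push the synchronous call onto a worker thread with asyncio.to_thread (Python 3.9+). This is a sketch only: blocking_predict is a stand-in that sleeps, and whether threads help real inference depends on the model releasing the GIL during its heavy work, which is assumed here rather than verified.

```python
import asyncio
import time


def blocking_predict(table_id):
    # Stand-in for a synchronous, CPU-heavy model call (hypothetical).
    time.sleep(0.05)
    return f"<table id='{table_id}'/>"


async def predict_tables(table_ids):
    results = []
    for table_id in table_ids:
        # Predictions still run one at a time, but the event loop stays free
        # to service other coroutines while each one executes in a thread.
        results.append(await asyncio.to_thread(blocking_predict, table_id))
    return results


async def heartbeat(ticks):
    # Represents other coroutines that should keep making progress.
    count = 0
    for _ in range(ticks):
        await asyncio.sleep(0.01)
        count += 1
    return count


def run_demo():
    async def main():
        return await asyncio.gather(predict_tables([1, 2]), heartbeat(3))

    return asyncio.run(main())
```

Here the heartbeat coroutine keeps ticking while predictions run. With a direct blocking_predict(...) call inside the coroutine, it would stall for the duration of each prediction.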

@KRRT7 KRRT7 closed this Mar 26, 2026