fix(easyocr): unwrap DataParallel to prevent SIGABRT under concurrent load #1070

Open
anandray wants to merge 3 commits into develop from fix/easyocr-datparallel-crash

@anandray anandray commented Jun 2, 2026

Summary

EasyOCR initialises its detector and recogniser wrapped in torch.nn.DataParallel, which scatters every inference call across all visible GPUs via parallel_apply() worker threads. On an 8× H200 box this spawns up to 8 CUDA threads per inference request. Under the chaos test's OVERLOAD phase (32 concurrent workers), these threads collide and produce a fatal crash, a tcache_thread_shutdown() SIGABRT raised from a parallel_apply worker thread:

easyocr/recognition.py: recognizer_predict
easyocr/recognition.py: get_text
easyocr/easyocr.py: recognize / readtext
torch/nn/parallel/parallel_apply.py: parallel_apply   ← crash here
FATAL ERROR: Application has terminated unexpectedly
[14,250MB] Minidump created: /tmp/...dmp

Fix: After Reader() initialises, unwrap DataParallel for both detector and recogniser and move them to the single GPU the model server allocated. Each EasyOCR instance stays on its own device with no cross-GPU scatter.
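
The unwrapping step described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `unwrap_data_parallel` and `FakeReader` are hypothetical names, and `FakeReader` stands in for `easyocr.Reader` so the example runs without EasyOCR or a GPU. It relies only on `torch.nn.DataParallel` exposing the wrapped network as `.module`.

```python
import torch
from torch import nn


def unwrap_data_parallel(reader, device, attrs=("detector", "recognizer")):
    """Replace DataParallel-wrapped sub-models on `reader` with their inner module.

    Returns the names of the attributes that were unwrapped.
    """
    unwrapped = []
    for attr in attrs:
        module = getattr(reader, attr, None)
        if isinstance(module, nn.DataParallel):
            # .module is the original network that DataParallel wraps;
            # moving it to one device avoids any cross-GPU scatter.
            setattr(reader, attr, module.module.to(device))
            unwrapped.append(attr)
    return unwrapped


class FakeReader:
    """Hypothetical stand-in for easyocr.Reader (assumption, for illustration only)."""

    def __init__(self):
        self.detector = nn.DataParallel(nn.Linear(4, 2))
        self.recognizer = nn.DataParallel(nn.Linear(4, 2))


reader = FakeReader()
# "cpu" keeps the sketch runnable anywhere; the PR targets f"cuda:{gpu_index}".
done = unwrap_data_parallel(reader, torch.device("cpu"))
```

After the call, `reader.detector` and `reader.recognizer` are plain modules, so a forward pass runs in the calling thread instead of fanning out through `parallel_apply()`.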

Test plan

  • ./builder model_server:test — 35 passed, 11 deselected
  • ./builder model_server:test-chaos: 1 passed (17 min). The server previously crashed 9× per run; with this fix it stays up for the full chaos duration.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Improved OCR GPU handling to keep OCR processing components pinned to the selected GPU, preventing unintended replication across devices and improving stability and performance under concurrent workloads.
    • Added safer handling and informative logging when GPU pinning isn't applicable to avoid crashes or silent failures.

… load

EasyOCR initialises its detector and recogniser wrapped in
torch.nn.DataParallel, which scatters every inference call across ALL
visible GPUs via parallel_apply() worker threads.  On an 8× H200 box
this spawns up to 8 CUDA threads per request; under the 32-worker chaos
test's OVERLOAD phase they collide and produce a FATAL crash
(tcache/SIGABRT from a CUDA kernel thread).

After Reader() initialises, unwrap DataParallel for both sub-models and
move them to the single GPU that the model server allocated.  This keeps
each EasyOCR copy on its own device without cross-GPU scatter.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai Bot commented Jun 2, 2026

Walkthrough

Pins EasyOCR's internal detector and recognizer modules to a single CUDA device by unwrapping torch.nn.DataParallel and moving the underlying modules to cuda:{gpu_index} when GPU use is enabled and gpu_index >= 0.

Changes

EasyOCR GPU Pinning

| Layer / File(s) | Summary |
| --- | --- |
| DataParallel unwrapping for GPU device pinning (packages/ai/src/ai/common/models/ocr/easyocr.py) | After reader creation, inspects detector and recognizer; skips missing attributes, unwraps torch.nn.DataParallel by replacing with .module and moves the module to cuda:{gpu_index}, otherwise logs module type and inferred parameter device. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through code at break of dawn,
Unwrapped the wraps that wandered on,
Pinned detector, nudged recognizer near,
Now GPUs hum without a fear,
Quiet carrots, compute, and brawn.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: unwrapping DataParallel to fix a SIGABRT crash under concurrent load, which is the core objective of the PR. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@github-actions github-actions Bot added the module:ai AI/ML modules label Jun 2, 2026

github-actions Bot commented Jun 2, 2026

No description provided.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/ai/src/ai/common/models/ocr/easyocr.py`:
- Around line 148-152: Add info-level logging around the existing loop over
('detector','recognizer') to surface missing or non-DataParallel attributes: for
each attr, retrieve module = getattr(reader, attr, None); if module is None log
logger.info indicating the reader is missing that expected attribute (include
attr and reader class/name), else if not isinstance(module,
torch.nn.DataParallel) log logger.info that the attribute exists but is not
wrapped in DataParallel (include attr and type(module) and target/gpu_index
context); leave the current DataParallel unwrapping and logger.debug line
unchanged.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2c9e4c28-7723-4d89-b1b7-b7a62e7bc69e

📥 Commits

Reviewing files that changed from the base of the PR and between eae941c and 6559e59.

📒 Files selected for processing (1)
  • packages/ai/src/ai/common/models/ocr/easyocr.py

Comment thread packages/ai/src/ai/common/models/ocr/easyocr.py
…butes

Log at info level when detector or recognizer is absent from the reader
(EasyOCR API change) or present but not wrapped in DataParallel (future
version change), so device-pinning failures are diagnosable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
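
The diagnostic logging this commit describes might look roughly like the sketch below. It is an assumption-laden illustration rather than the committed code: `report_unpinned`, the logger name, and the returned status strings are all hypothetical, and a `SimpleNamespace` stands in for the EasyOCR reader.

```python
import logging
from types import SimpleNamespace

from torch import nn

logger = logging.getLogger("ocr_pinning_example")  # hypothetical logger name


def report_unpinned(reader, attrs=("detector", "recognizer")):
    """Log each sub-model that cannot be unwrapped and return a status per attribute."""
    findings = {}
    for attr in attrs:
        module = getattr(reader, attr, None)
        if module is None:
            # Attribute absent, e.g. after an EasyOCR API change.
            findings[attr] = "missing"
            logger.info("%s has no %r attribute; skipping device pinning",
                        type(reader).__name__, attr)
        elif not isinstance(module, nn.DataParallel):
            # Present but not wrapped, e.g. a future EasyOCR version.
            findings[attr] = "not_wrapped"
            logger.info("%r is %s, not DataParallel; leaving it as-is",
                        attr, type(module).__name__)
        else:
            findings[attr] = "wrapped"
    return findings


# Stand-in reader: one wrapped sub-model, one plain sub-model.
reader = SimpleNamespace(
    detector=nn.DataParallel(nn.Linear(2, 2)),
    recognizer=nn.Linear(2, 2),
)
findings = report_unpinned(reader)
```

Returning the findings as a dictionary is a choice made here only to keep the sketch easy to check; the commit message mentions logging alone.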
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/ai/src/ai/common/models/ocr/easyocr.py`:
- Around line 155-158: The non-DataParallel logging path in easyocr.py can raise
StopIteration when calling next(module.parameters()) for parameterless modules;
update the logic in the block that logs "EasyOCR {attr}: not wrapped in
DataParallel" to use a safe probe like next(module.parameters(), None) and if it
returns None set device='unknown' (or infer from module if possible) before
calling logger.info so parameterless modules do not throw during model loading;
adjust the code around the module/device determination where module and attr are
referenced.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0bd828ec-7ef7-4f17-a29e-92aedc66887e

📥 Commits

Reviewing files that changed from the base of the PR and between 6559e59 and e2fbdcc.

📒 Files selected for processing (1)
  • packages/ai/src/ai/common/models/ocr/easyocr.py

Comment thread packages/ai/src/ai/common/models/ocr/easyocr.py
…less modules

next(module.parameters()) raises StopIteration when the module has no
registered parameters — a valid case for some EasyOCR sub-modules.
Use the default-value form and guard device access on None.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
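
The difference between the two forms of `next()` can be seen in a short sketch. The helper name `infer_device` and the `"unknown"` fallback follow the review suggestion but are otherwise illustrative assumptions; `nn.ReLU` is used here simply as a module with no registered parameters.

```python
from torch import nn


def infer_device(module):
    """Return the device of the first parameter, or "unknown" if there is none."""
    # The two-argument form of next() returns the default instead of
    # raising StopIteration when the parameter iterator is empty.
    first_param = next(module.parameters(), None)
    return str(first_param.device) if first_param is not None else "unknown"


with_params = infer_device(nn.Linear(4, 2))  # has weight and bias
without_params = infer_device(nn.ReLU())     # no registered parameters

# The one-argument form raises on the same parameterless module.
try:
    next(nn.ReLU().parameters())
    raised = False
except StopIteration:
    raised = True
```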
@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/ai/src/ai/common/models/ocr/easyocr.py (1)

132-159: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Pass the allocated CUDA device into easyocr.Reader as well.

The loader unwraps reader.detector and reader.recognizer and moves them to cuda:{gpu_index}, but easyocr.Reader is constructed with gpu=use_gpu (a boolean), so EasyOCR keeps reader.device as the generic CUDA device (typically cuda, i.e. cuda:0). EasyOCR uses reader.device during inference to place tensors, which can cause device mismatches or silently target the wrong GPU when gpu_index != 0. EasyOCR supports passing a concrete device string to gpu (e.g., gpu='cuda:3').

🔧 Minimal fix
         try:
             reader = easyocr.Reader(
                 languages,
-                gpu=use_gpu,
+                gpu=torch_device if use_gpu else False,
                 verbose=False,
             )
         except Exception as e:
             logger.error(f'Failed to load EasyOCR: {e}')
             raise Exception(f'Failed to load EasyOCR: {e}')
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/ai/src/ai/common/models/ocr/easyocr.py` around lines 132 - 159, When
constructing easyocr.Reader, pass the concrete CUDA device string when a
specific GPU index is allocated instead of the boolean use_gpu; i.e., compute
gpu_arg = f'cuda:{gpu_index}' if use_gpu and gpu_index >= 0 else use_gpu and
pass that into easyocr.Reader(...) so reader.device is set to the same device
you later pin detector/recognizer to (symbols: easyocr.Reader, reader, use_gpu,
gpu_index, detector, recognizer, reader.device).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@packages/ai/src/ai/common/models/ocr/easyocr.py`:
- Around line 132-159: When constructing easyocr.Reader, pass the concrete CUDA
device string when a specific GPU index is allocated instead of the boolean
use_gpu; i.e., compute gpu_arg = f'cuda:{gpu_index}' if use_gpu and gpu_index >=
0 else use_gpu and pass that into easyocr.Reader(...) so reader.device is set to
the same device you later pin detector/recognizer to (symbols: easyocr.Reader,
reader, use_gpu, gpu_index, detector, recognizer, reader.device).
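
The `gpu_arg` expression from the prompt above can be isolated into a small helper for clarity. This is a sketch only: `resolve_gpu_arg` is a hypothetical name, and the claim that `easyocr.Reader` accepts a device string for `gpu` comes from the review comment, not from this example.

```python
from typing import Union


def resolve_gpu_arg(use_gpu: bool, gpu_index: int) -> Union[str, bool]:
    """Pick the value to pass as easyocr.Reader(gpu=...).

    A concrete device string is returned only when GPU use is enabled and a
    valid index was allocated; otherwise the boolean is passed through.
    """
    if use_gpu and gpu_index >= 0:
        return f"cuda:{gpu_index}"
    return use_gpu


pinned = resolve_gpu_arg(True, 3)        # specific GPU allocated
unindexed = resolve_gpu_arg(True, -1)    # GPU enabled, no index allocated
cpu_only = resolve_gpu_arg(False, 3)     # GPU disabled, index ignored
```

Passing the same device string to the reader and to the later pinning step keeps `reader.device` consistent with where the detector and recognizer actually live.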

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 731a4e10-a401-46c0-acf9-b78b411e5b4a

📥 Commits

Reviewing files that changed from the base of the PR and between e2fbdcc and 47925a4.

📒 Files selected for processing (1)
  • packages/ai/src/ai/common/models/ocr/easyocr.py
