fix(easyocr): unwrap DataParallel to prevent SIGABRT under concurrent load#1070
fix(easyocr): unwrap DataParallel to prevent SIGABRT under concurrent load#1070anandray wants to merge 3 commits into
Conversation
… load EasyOCR initialises its detector and recogniser wrapped in torch.nn.DataParallel, which scatters every inference call across ALL visible GPUs via parallel_apply() worker threads. On an 8× H200 box this spawns up to 8 CUDA threads per request; under the 32-worker chaos test's OVERLOAD phase they collide and produce a FATAL crash (tcache/SIGABRT from a CUDA kernel thread). After Reader() initialises, unwrap DataParallel for both sub-models and move them to the single GPU that the model server allocated. This keeps each EasyOCR copy on its own device without cross-GPU scatter. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
📝 WalkthroughWalkthroughPins EasyOCR's internal ChangesEasyOCR GPU Pinning
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
No description provided. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/ai/src/ai/common/models/ocr/easyocr.py`:
- Around line 148-152: Add info-level logging around the existing loop over
('detector','recognizer') to surface missing or non-DataParallel attributes: for
each attr, retrieve module = getattr(reader, attr, None); if module is None log
logger.info indicating the reader is missing that expected attribute (include
attr and reader class/name), else if not isinstance(module,
torch.nn.DataParallel) log logger.info that the attribute exists but is not
wrapped in DataParallel (include attr and type(module) and target/gpu_index
context); leave the current DataParallel unwrapping and logger.debug line
unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 2c9e4c28-7723-4d89-b1b7-b7a62e7bc69e
📒 Files selected for processing (1)
packages/ai/src/ai/common/models/ocr/easyocr.py
…butes Log at info level when detector or recognizer is absent from the reader (EasyOCR API change) or present but not wrapped in DataParallel (future version change), so device-pinning failures are diagnosable. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/ai/src/ai/common/models/ocr/easyocr.py`:
- Around line 155-158: The non-DataParallel logging path in easyocr.py can raise
StopIteration when calling next(module.parameters()) for parameterless modules;
update the logic in the block that logs "EasyOCR {attr}: not wrapped in
DataParallel" to use a safe probe like next(module.parameters(), None) and if it
returns None set device='unknown' (or infer from module if possible) before
calling logger.info so parameterless modules do not throw during model loading;
adjust the code around the module/device determination where module and attr are
referenced.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0bd828ec-7ef7-4f17-a29e-92aedc66887e
📒 Files selected for processing (1)
packages/ai/src/ai/common/models/ocr/easyocr.py
…less modules next(module.parameters()) raises StopIteration when the module has no registered parameters — a valid case for some EasyOCR sub-modules. Use the default-value form and guard device access on None. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/ai/src/ai/common/models/ocr/easyocr.py (1)
132-159:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPass the allocated CUDA device into
easyocr.Readeras well.The loader unpacks/moves
reader.detectorandreader.recognizertocuda:{gpu_index}, buteasyocr.Readeris constructed withgpu=use_gpu(boolean), so EasyOCR keepsreader.deviceas the generic CUDA device (typicallycuda/cuda:0). EasyOCR usesreader.deviceduring inference to place tensors, which can cause device mismatches or silently target the wrong GPU whengpu_index != 0. EasyOCR supports passing a concrete device string togpu(e.g.,gpu='cuda:3').🔧 Minimal fix
try: reader = easyocr.Reader( languages, - gpu=use_gpu, + gpu=torch_device if use_gpu else False, verbose=False, ) except Exception as e: logger.error(f'Failed to load EasyOCR: {e}') raise Exception(f'Failed to load EasyOCR: {e}')🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/ai/src/ai/common/models/ocr/easyocr.py` around lines 132 - 159, When constructing easyocr.Reader, pass the concrete CUDA device string when a specific GPU index is allocated instead of the boolean use_gpu; i.e., compute gpu_arg = f'cuda:{gpu_index}' if use_gpu and gpu_index >= 0 else use_gpu and pass that into easyocr.Reader(...) so reader.device is set to the same device you later pin detector/recognizer to (symbols: easyocr.Reader, reader, use_gpu, gpu_index, detector, recognizer, reader.device).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@packages/ai/src/ai/common/models/ocr/easyocr.py`:
- Around line 132-159: When constructing easyocr.Reader, pass the concrete CUDA
device string when a specific GPU index is allocated instead of the boolean
use_gpu; i.e., compute gpu_arg = f'cuda:{gpu_index}' if use_gpu and gpu_index >=
0 else use_gpu and pass that into easyocr.Reader(...) so reader.device is set to
the same device you later pin detector/recognizer to (symbols: easyocr.Reader,
reader, use_gpu, gpu_index, detector, recognizer, reader.device).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 731a4e10-a401-46c0-acf9-b78b411e5b4a
📒 Files selected for processing (1)
packages/ai/src/ai/common/models/ocr/easyocr.py
Summary
EasyOCR initialises its
detectorandrecogniserwrapped intorch.nn.DataParallel, which scatters every inference call across all visible GPUs viaparallel_apply()worker threads. On an 8× H200 box this spawns up to 8 CUDA threads per inference request. Under the chaos test's OVERLOAD phase (32 concurrent workers), these threads collide and produce a FATAL crash —tcache_thread_shutdown()SIGABRT from aparallel_applyworker thread:Fix: After
Reader()initialises, unwrapDataParallelfor bothdetectorandrecogniserand move them to the single GPU the model server allocated. Each EasyOCR instance stays on its own device with no cross-GPU scatter.Test plan
./builder model_server:test— 35 passed, 11 deselected./builder model_server:test-chaos— 1 passed (17 min), server previously crashed 9× per run; with this fix the server stays up for the full chaos duration🤖 Generated with Claude Code
Summary by CodeRabbit