refactor(hybrid): move OCR settings from CLI to server startup
Move OCR configuration (--ocr-lang, --force-ocr) from Java CLI runtime
options to Python hybrid server startup options. This improves performance
by creating a single DocumentConverter instance at server startup instead
of per-request converter creation with different language settings.
Changes:
- Add --ocr-lang and --force-ocr options to hybrid_server.py
- Deprecate --hybrid-ocr CLI option (prints warning, no-op)
- Remove forceOcr from DoclingFastServerClient and HybridConfig
- Remove OCR-related constants and methods from Config.java
- Update tests to reflect deprecated/removed functionality
- Regenerate Python/Node.js wrappers via npm run sync
Closes #161
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
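
The performance rationale in the commit message (one converter built at server startup rather than one per request) can be sketched roughly as follows. This is a minimal illustration, not the contents of hybrid_server.py: `OcrSettings`, `StubConverter`, `parse_args`, and `create_app` are hypothetical names, the stub stands in for the real DocumentConverter, and the comma-separated format for `--ocr-lang` is an assumption.

```python
import argparse
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class OcrSettings:
    """Hypothetical container for OCR options fixed at server startup."""
    langs: Optional[List[str]]
    force_full_page: bool


class StubConverter:
    """Stand-in for the real converter; counts how many times it is built."""
    instances = 0

    def __init__(self, settings: OcrSettings) -> None:
        StubConverter.instances += 1
        self.settings = settings

    def convert(self, pdf_name: str) -> str:
        mode = "force" if self.settings.force_full_page else "auto"
        return f"{pdf_name}: ocr={mode}, langs={self.settings.langs}"


def parse_args(argv: List[str]) -> OcrSettings:
    """Read the two server-side OCR options named in the commit message."""
    parser = argparse.ArgumentParser(prog="hybrid_server")
    # Assumption: languages are passed as a comma-separated list.
    parser.add_argument("--ocr-lang", default=None)
    parser.add_argument("--force-ocr", action="store_true")
    ns = parser.parse_args(argv)
    langs = ns.ocr_lang.split(",") if ns.ocr_lang else None
    return OcrSettings(langs=langs, force_full_page=ns.force_ocr)


def create_app(argv: List[str]) -> Callable[[str], str]:
    """Build the converter once and return a request handler that reuses it."""
    converter = StubConverter(parse_args(argv))  # constructed at startup only

    def handle_request(pdf_name: str) -> str:
        # No per-request converter creation: every call shares one instance.
        return converter.convert(pdf_name)

    return handle_request
```

Because the OCR settings no longer vary per request, the handler has no reason to rebuild the converter, which is the cost the commit removes.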
content/docs/cli-options-reference.mdx
Lines changed: 0 additions & 1 deletion
@@ -32,7 +32,6 @@ This page documents all available CLI options for opendataloader-pdf.
 |`--pages`| - |`string`| - | Pages to extract (e.g., "1,3,5-7"). Default: all pages |
 |`--hybrid`| - |`string`|`"off"`| Hybrid backend for AI processing. Values: off (default), docling (docling-fast is deprecated alias) |
 |`--hybrid-mode`| - |`string`|`"auto"`| Hybrid triage mode. Values: auto (default, dynamic triage), full (skip triage, all pages to backend) |
-|`--hybrid-ocr`| - |`string`|`"auto"`| Hybrid OCR mode for Docling backend. Values: auto (default, OCR only where needed), force (force full-page OCR) |
-    private static final String HYBRID_OCR_DESC = "Hybrid OCR mode for Docling backend. Values: auto (default, OCR only where needed), force (force full-page OCR)";
+    private static final String HYBRID_OCR_DESC = "[Deprecated] OCR settings are now configured on the hybrid server (--ocr-lang, --force-ocr)";
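
The deprecation described in the commit message (`--hybrid-ocr` prints a warning and is otherwise a no-op) can be illustrated with a small Python sketch. This is an analogy, not the Java CLI code from this commit: `parse_cli`, `main`, and the warning wording are hypothetical.

```python
import argparse
import sys
from typing import List, Tuple

# Hypothetical wording; the actual warning text is not shown in this diff.
DEPRECATION_MSG = (
    "Warning: --hybrid-ocr is deprecated and ignored; "
    "configure OCR on the hybrid server (--ocr-lang, --force-ocr)."
)


def parse_cli(argv: List[str]) -> Tuple[argparse.Namespace, List[str]]:
    """Parse options, still accepting the deprecated flag but discarding it."""
    parser = argparse.ArgumentParser(prog="opendataloader-pdf")
    parser.add_argument("--hybrid", default="off")
    # Still accepted so existing command lines do not fail with a usage error.
    parser.add_argument("--hybrid-ocr", default=None, help="[Deprecated]")
    ns = parser.parse_args(argv)

    warnings: List[str] = []
    if ns.hybrid_ocr is not None:
        warnings.append(DEPRECATION_MSG)
        ns.hybrid_ocr = None  # no-op: the value never reaches the backend
    return ns, warnings


def main(argv: List[str]) -> int:
    ns, warnings = parse_cli(argv)
    for message in warnings:
        print(message, file=sys.stderr)
    return 0
```

Keeping the option registered while ignoring its value lets old scripts keep running while the warning points users to the server-side flags.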
java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/HybridDocumentProcessorTest.java
Lines changed: 0 additions & 17 deletions
@@ -324,23 +324,6 @@ public void testHybridConfigModeFullMode() {
node/opendataloader-pdf/src/cli-options.generated.ts
Lines changed: 0 additions & 1 deletion
@@ -26,7 +26,6 @@ export function registerCliOptions(program: Command): void {
   program.option('--pages <value>', 'Pages to extract (e.g., "1,3,5-7"). Default: all pages');
   program.option('--hybrid <value>', 'Hybrid backend for AI processing. Values: off (default), docling (docling-fast is deprecated alias)');
   program.option('--hybrid-mode <value>', 'Hybrid triage mode. Values: auto (default, dynamic triage), full (skip triage, all pages to backend)');
-  program.option('--hybrid-ocr <value>', 'Hybrid OCR mode for Docling backend. Values: auto (default, OCR only where needed), force (force full-page OCR)');
   program.option('--hybrid-url <value>', 'Hybrid backend server URL (overrides default)');
   program.option('--hybrid-timeout <value>', 'Hybrid backend request timeout in milliseconds. Default: 30000');
   program.option('--hybrid-fallback', 'Fallback to Java processing on hybrid backend error. Default: true');