Skip to content

fix(hybrid): activate --hybrid-fallback on server-absent path (PDFDLOSP-21)#511

Merged
bundolee merged 2 commits into
opendataloader-project:mainfrom
hnc-jglee:fix/pdfdlosp-21-hybrid-fallback-on-server-absent
May 15, 2026
Merged

fix(hybrid): activate --hybrid-fallback on server-absent path (PDFDLOSP-21)#511
bundolee merged 2 commits into
opendataloader-project:mainfrom
hnc-jglee:fix/pdfdlosp-21-hybrid-fallback-on-server-absent

Conversation

@hnc-jglee

@hnc-jglee hnc-jglee commented May 15, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Phase 0 health check (HybridDocumentProcessor.processDocument) now respects --hybrid-fallback — connection refused / health check IOException routes through the Java-only path instead of aborting the run
  • Hybrid server unavailability message now points users to --hybrid-fallback as an alternative escape hatch (was only suggesting --hybrid removal)
  • Adds P0 regression coverage for the server-absent path so the call-site bug cannot regress

Root cause

HybridDocumentProcessor.java:187 invoked getClient(config).checkAvailability() outside the fallback try/catch at lines 240-252. When the hybrid server was absent, the resulting IOException propagated past every isFallbackToJava() check (lines 246/259/306) and reached CLIMain.processFile, so the option was effectively a no-op in the server-absent path. The timeout path remained healthy because chunk-level IOException (line 546) was caught inside the protected region.

This explains the H-6/H-7 equivalence reported in PDFDLOSP-21 (same 23 PDFs failed with identical exit codes regardless of --hybrid-timeout).

Changes

File Change
processors/HybridDocumentProcessor.java Wrap Phase 0 checkAvailability() in try/catch; on IOException with fallback enabled, route through new processAllPagesAsJavaFallback helper (filterAllPages → processJavaPath → mergeResults → postProcess)
hybrid/DoclingFastServerClient.java Add --hybrid-fallback hint to the unavailability message
hybrid/HancomClient.java Same hint
test/HybridBackendFailureIntegrationTest.java P0 cases: fallback ON / OFF / ON+timeout=1 against a closed port via reserveClosedPort() helper
cli/CLIMainTest.java Stack-trace exposure guard for hybrid-unavailable path (\tat grep, mirrors PDFDLOSP-14 pattern)

Test plan

  • `mvn -pl opendataloader-pdf-core -am -o -Dtest=HybridBackendFailureIntegrationTest test` — 4/4 PASS (1 existing + 3 new)
  • `mvn -pl opendataloader-pdf-core test -Dtest='Hybrid'` — 111/111 PASS, no regression
  • `mvn -pl opendataloader-pdf-cli test -Dtest=CLIMainTest` — 22/22 PASS
  • CLI smoke test (hybrid server stopped, `samples/pdf/lorem.pdf`):
    • `--hybrid docling-fast` (no fallback) → exit 1, stderr suggests both `--hybrid-fallback` and dropping `--hybrid`
    • `--hybrid docling-fast --hybrid-fallback` → exit 0, JSON generated (was: exit 1, no JSON)
    • `--hybrid docling-fast --hybrid-fallback --hybrid-timeout 1` → exit 0, JSON generated (H-6/H-7 equivalence restored)

Out of scope (separate PRs)

  • Stack-trace sanitization for the fallback-disabled path (PDFDLOSP-21 recommendation docs: Update project name to OpenDataLoader PDF #3, PDFDLOSP-9/12/14 family). Requires a `HybridUnavailableException` domain type to avoid brittle message-prefix matching in `CLIMain`. Filed as follow-up.
  • `--hybrid-timeout` default (currently 0 = infinite) — should be a finite value to prevent hangs after server `bad_alloc`. Behavior change, separate discussion.
  • Circuit breaker for batch processing (long-running automation).

Summary by CodeRabbit

  • New Features

    • Added an explicit runtime fallback to run Java-only processing when the hybrid backend is unavailable.
  • Bug Fixes

    • Improved user-facing error messages to mention the fallback hint.
    • Suppressed Java stack traces from standard output on hybrid connection failures.
    • When fallback is enabled, requests automatically route through Java-only processing instead of aborting.
  • Tests

    • Added regression and integration tests for hybrid absent/unreachable scenarios (with/without fallback and tight timeouts).

Review Change Stack

@coderabbitai

coderabbitai Bot commented May 15, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 514afa7b-3bc7-49c2-b519-857d0061a5c1

📥 Commits

Reviewing files that changed from the base of the PR and between 6c39533 and 7379db8.

📒 Files selected for processing (5)
  • java/opendataloader-pdf-cli/src/test/java/org/opendataloader/pdf/cli/CLIMainTest.java
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/hybrid/DoclingFastServerClient.java
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/hybrid/HancomClient.java
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/HybridDocumentProcessor.java
  • java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/HybridBackendFailureIntegrationTest.java

Walkthrough

Adds client messages that suggest --hybrid-fallback, implements a Java-only fallback when hybrid health checks fail and fallback is enabled, and adds integration and CLI tests that simulate an unreachable hybrid server via a reserved-then-closed ephemeral port.

Changes

Hybrid fallback and tests

Layer / File(s) Summary
Client availability messages
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/hybrid/DoclingFastServerClient.java, java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/hybrid/HancomClient.java
Extend IOException messages from hybrid client health checks to include a --hybrid-fallback hint for falling back to Java-only processing.
HybridDocumentProcessor fallback path
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/HybridDocumentProcessor.java
Wraps checkAvailability() in try/catch; on IOException and when config.getHybridConfig().isFallbackToJava() is true, logs a warning and routes requested pages through a new processAllPagesAsJavaFallback(...) helper that runs the Java processing path and returns the expected result shape.
Integration tests simulating unreachable hybrid
java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/HybridBackendFailureIntegrationTest.java
Adds tests that reserve-and-close an ephemeral port to force connection refusal and assert: (1) with fallback enabled, processing does not throw and writes non-empty JSON; (2) with fallback disabled, processing fails fast with an IOException mentioning --hybrid-fallback; (3) with fallback enabled and tiny timeout, processing does not throw. Adds reserveClosedPort() helper and import updates.
CLIMainTest regression
java/opendataloader-pdf-cli/src/test/java/org/opendataloader/pdf/cli/CLIMainTest.java
Adds testHybridUnavailableWithoutFallbackEmitsFriendlyMessageNoStackTrace which builds a minimal PDF, forces hybrid connection refusal via a closed ephemeral port, captures stdout and JUL logs, asserts non-zero exit code, ensures stdout contains no Java stack-trace lines, and verifies logs include “Hybrid server is not available” plus the --hybrid-fallback hint.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • bundolee
  • hyunhee-jo
  • MaximPlusov
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 46.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and accurately summarizes the main change: implementing the --hybrid-fallback option to handle server-absent scenarios in the hybrid backend system.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@java/opendataloader-pdf-cli/src/test/java/org/opendataloader/pdf/cli/CLIMainTest.java`:
- Around line 406-409: The test CLIMainTest currently checks absence of stack
traces using stdoutHolder[0]; extend it to also assert the user-facing fallback
guidance appears by verifying the captured output (out) contains the phrase
"backend unavailable" (or the exact message used by the CLI) and a hint about
using "--hybrid-fallback" (or the exact flag text), while keeping the existing
assertFalse that checks for "\tat " intact; update the assertions to use the
same variable out and make the checks case-sensitive to match the CLI message
text.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: af988605-0846-41d0-bba3-8edb3c5b1bcc

📥 Commits

Reviewing files that changed from the base of the PR and between 105bcea and 4f6f372.

📒 Files selected for processing (5)
  • java/opendataloader-pdf-cli/src/test/java/org/opendataloader/pdf/cli/CLIMainTest.java
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/hybrid/DoclingFastServerClient.java
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/hybrid/HancomClient.java
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/HybridDocumentProcessor.java
  • java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/HybridBackendFailureIntegrationTest.java

@codecov

codecov Bot commented May 15, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 95.45455% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...loader/pdf/processors/HybridDocumentProcessor.java 95.45% 0 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

hnc-jglee added a commit to hnc-jglee/opendataloader-pdf that referenced this pull request May 15, 2026
…LOSP-21)

CodeRabbit nit on PR opendataloader-project#511: the regression guarded against stack-trace
leakage but didn't lock in the friendly message contract this PR
promises. CLIMain emits the hybrid unavailability message via
LOGGER.SEVERE, which JUL routes to stderr — out of reach of the existing
stdout-only capture helper. Attach a LogRecord handler scoped to the
CLIMain logger so the test can assert that the captured SEVERE message
contains both 'Hybrid server is not available' and the '--hybrid-fallback'
hint, while keeping the existing stdout stack-trace guard intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hnc-jglee and others added 2 commits May 15, 2026 16:36
…SP-21)

Phase 0 health check in HybridDocumentProcessor invoked
checkAvailability() outside the fallback try/catch, so a connection
refused IOException propagated past every isFallbackToJava() check and
the option was a no-op in the server-absent path. Wrap the call and
route through a new processAllPagesAsJavaFallback helper when fallback
is enabled. Hint --hybrid-fallback in the unavailability message and
add regression coverage for fallback ON/OFF/timeout combos.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…LOSP-21)

CodeRabbit nit on PR opendataloader-project#511: the regression guarded against stack-trace
leakage but didn't lock in the friendly message contract this PR
promises. CLIMain emits the hybrid unavailability message via
LOGGER.SEVERE, which JUL routes to stderr — out of reach of the existing
stdout-only capture helper. Attach a LogRecord handler scoped to the
CLIMain logger so the test can assert that the captured SEVERE message
contains both 'Hybrid server is not available' and the '--hybrid-fallback'
hint, while keeping the existing stdout stack-trace guard intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hnc-jglee hnc-jglee force-pushed the fix/pdfdlosp-21-hybrid-fallback-on-server-absent branch from fcdeb3a to 7379db8 Compare May 15, 2026 07:37
@hnc-jglee hnc-jglee enabled auto-merge (rebase) May 15, 2026 07:43
@bundolee bundolee disabled auto-merge May 15, 2026 08:02
@hnc-jglee hnc-jglee enabled auto-merge (rebase) May 15, 2026 08:03
@bundolee bundolee disabled auto-merge May 15, 2026 08:04
@bundolee bundolee merged commit dc6410c into opendataloader-project:main May 15, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants