fix: add input validation with clear error messages by hyunhee-jo · Pull Request #4 · opendataloader-project/opendataloader-pdf-llamaindex

hyunhee-jo · 2026-04-10T04:59:03Z

Objective

Users get cryptic CalledProcessError or Java stack traces when passing non-existent file paths or running without Java installed. This makes debugging difficult, especially for first-time users.

Approach

Add early validation in lazy_load_data() before invoking the Java engine:

File existence check — raises FileNotFoundError with the specific missing path
Java availability check — raises RuntimeError with installation instructions (minimum version, download URL, verification command)

Design decisions:

Validation runs before tempfile.mkdtemp to avoid creating temp dirs that would need cleanup on validation failure
Java check caches only positive results; negative results re-probe so mid-session Java installs are recognised without a restart
Catches OSError (not just FileNotFoundError) to handle PermissionError on platforms where java exists but isn't executable

Evidence

Scenario	Before	After
Missing file path	`CalledProcessError` from Java subprocess	`FileNotFoundError: Input path does not exist: missing.pdf`
Java not installed	`FileNotFoundError: [Errno 2] No such file or directory: 'java'`	`RuntimeError: Java is not found on the system PATH. OpenDataLoader PDF requires Java 11+. Install Java from https://adoptium.net/...`
Java installed mid-session	N/A (no check existed)	Re-probed and detected on next call
All existing tests	30 passed	33 passed (+3 new validation tests)

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- PDF import now verifies Java is available and fails fast with clear install/verification guidance if not found.
- Input PDF paths are validated before conversion, raising descriptive errors for missing files.
Tests
- Added tests covering file-existence validation and Java-availability error handling.

Raise FileNotFoundError for missing input paths and RuntimeError when Java is not found on the system PATH, instead of letting CalledProcessError propagate from the subprocess. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…w Path.exists mock, catch OSError - Replace lru_cache with manual caching that only caches True (Java found); negative results are re-probed so mid-session Java installs are recognised. - Narrow the autouse Path.exists mock to .pdf suffix only, delegating all other calls to the real implementation. - Catch OSError (covers PermissionError) instead of just FileNotFoundError in _java_available(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-04-10T04:59:18Z

Walkthrough

Adds a Java availability probe and input-path existence validation to OpenDataLoaderPDFReader.lazy_load_data, raising clear errors when Java is missing or when resolved input paths do not exist.

Changes

Cohort / File(s)	Summary
Java availability & path validation `llama_index/readers/opendataloader_pdf/base.py`	Added internal `_java_available()` that runs `java -version` with a 10s timeout and caches positive results; updated `lazy_load_data()` to verify every resolved input path exists (raise `FileNotFoundError`) and to raise `RuntimeError` when Java is not on PATH with guidance to install/verify.
Tests: validation & java checks `tests/test_readers_opendataloader_pdf.py`	Added autouse pytest fixture to mock Java availability and fake `.pdf` existence for most tests; added `TestFileNotFound` asserting `FileNotFoundError` for missing inputs and `TestJavaCheck` asserting `RuntimeError` when `_java_available()` is False.

Sequence Diagram(s)

(Skipped — changes are localized and do not introduce a multi-component control flow that requires visualization.)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

feat: add OpenDataLoaderPDFReader with full extraction support #1: Modifies the same OpenDataLoaderPDFReader implementation (adds Java probe and input-path validation) and is directly related.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 28.57% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix: add input validation with clear error messages' accurately reflects the main changes: adding file existence and Java availability checks with explicit error messages before Java subprocess execution.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@llama_index/readers/opendataloader_pdf/base.py`:
- Around line 22-43: The current _java_available function wrongly caches any
java with returncode 0 as available; change it to run ["java","-version"],
capture output (stderr/stdout), parse the version string produced by java
-version to extract the major version number (handle formats like "1.8.0_xx" ->
major 8 and "11.0.x" -> major 11), and only set the module-level _java_found
True (and return True) if the major version is >= 11; keep the existing behavior
of not caching negatives (i.e. only cache successful checks), handle
subprocess.TimeoutExpired/OSError as False, and update the assignment to
_java_found in _java_available accordingly so downstream code requiring Java 11+
will not be cached as available when an older Java is present.

In `@tests/test_readers_opendataloader_pdf.py`:
- Around line 609-617: Update the assertion in test_no_java_raises_runtime_error
to verify the full actionable guidance in the RuntimeError from
OpenDataLoaderPDFReader.lazy_load_data: require the error message to include
both the "Java 11+" version requirement and a hint to run "java -version" (e.g.,
match a regex or exact substrings "Java 11+" and "java -version") so the test
fails if those helpful instructions are removed or changed.
- Around line 15-33: Add focused unit tests for the new _java_available()
semantics: stop using the global fixture stub for these tests and instead
directly call llama_index.readers.opendataloader_pdf.base._java_available while
monkeypatching the underlying Java-check implementation to first return True and
verify a subsequent call uses the cached positive result, then test a sequence
where the underlying check returns False and verify that False is not cached (a
later underlying True is observed), and finally simulate the underlying check
raising OSError (or a subclass) to confirm the widened OSError handling occurs
and that the exception behavior is not erroneously cached; place these tests
alongside the existing suite but ensure they do not rely on the module-wide
Path.exists/_java_available monkeypatch.
- Around line 592-600: Update the two tests so they assert the FileNotFoundError
message includes the actual missing path returned by
OpenDataLoaderPDFReader.lazy_load_data; specifically change the pytest.raises
match in test_single_missing_file and test_missing_file_in_list to include
"nonexistent.pdf" (e.g., use a regex matching the filename or full path) so the
exception message is locked to the offending path rather than just the generic
"does not exist".

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6574c428-767d-4796-9f02-f2148d0f245d

📥 Commits

Reviewing files that changed from the base of the PR and between a01da1f and 4e6dd13.

📒 Files selected for processing (2)

llama_index/readers/opendataloader_pdf/base.py
tests/test_readers_opendataloader_pdf.py

@patch

…patch - Add monkeypatch.setattr for _java_found in autouse fixture to prevent global state leakage between tests. - Remove redundant @patch("pathlib.Path.exists") from TestJavaCheck since the autouse fixture already handles .pdf suffix paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

♻️ Duplicate comments (3)

tests/test_readers_opendataloader_pdf.py (3)

15-36: ⚠️ Potential issue | 🟡 Minor

Global _java_available stubbing still hides core probe/cache behavior tests

The autouse fixture at Line 15 forces _java_available to True for most tests, so regressions in positive-only caching and widened OSError handling can still slip through. Add focused tests that call llama_index.readers.opendataloader_pdf.base._java_available directly with controlled subprocess outcomes.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/test_readers_opendataloader_pdf.py` around lines 15 - 36, The autouse
fixture _bypass_input_validation currently monkeypatches
llama_index.readers.opendataloader_pdf.base._java_available to always return
True, which hides tests for its probe/caching behavior; update the test suite by
removing or limiting that global stub and add focused tests that call
llama_index.readers.opendataloader_pdf.base._java_available directly while
monkeypatching subprocess outcomes (simulate success, failure, and OSError) to
verify positive-only caching and widened OSError handling; ensure these new
tests restore the original _java_available (or avoid the autouse patch for them)
so they exercise the real probe logic and cache behavior.

616-619: ⚠️ Potential issue | 🟡 Minor

Verify full Java remediation guidance in the RuntimeError message

Line 618 currently checks only "Java is not found". The test should also assert the actionable guidance ("Java 11+" and "java -version"), so those instructions can’t regress silently.

Suggested assertion upgrade

-        with pytest.raises(RuntimeError, match="Java is not found"):
-            list(reader.lazy_load_data(file_path="doc.pdf"))
+        with pytest.raises(RuntimeError) as exc_info:
+            list(reader.lazy_load_data(file_path="doc.pdf"))
+        message = str(exc_info.value)
+        assert "Java is not found" in message
+        assert "Java 11+" in message
+        assert "java -version" in message

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/test_readers_opendataloader_pdf.py` around lines 616 - 619, Update the
test_no_java_raises_runtime_error assertion to verify the full remediation
guidance in the RuntimeError from OpenDataLoaderPDFReader.lazy_load_data:
instead of only matching "Java is not found", assert the error message also
contains the actionable phrases "Java 11+" and "java -version" (e.g., update the
pytest.raises(match=...) to a regex or combined string that includes "Java is
not found", "Java 11+", and "java -version") so the test ensures the full
guidance is present.

595-603: ⚠️ Potential issue | 🟡 Minor

Assert the specific missing path in FileNotFoundError

Line 597 and Line 602 only match a generic phrase. These tests should lock in the offending filename/path (nonexistent.pdf) to preserve the PR’s UX contract.

Suggested test assertion tightening

-        with pytest.raises(FileNotFoundError, match="does not exist"):
+        with pytest.raises(
+            FileNotFoundError,
+            match=r"Input path does not exist: nonexistent\.pdf",
+        ):
             list(reader.lazy_load_data(file_path="nonexistent.pdf"))
@@
-        with pytest.raises(FileNotFoundError, match="does not exist"):
+        with pytest.raises(
+            FileNotFoundError,
+            match=r"Input path does not exist: nonexistent\.pdf",
+        ):
             list(reader.lazy_load_data(file_path=["nonexistent.pdf"]))

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/test_readers_opendataloader_pdf.py` around lines 595 - 603, Update the
two tests that assert FileNotFoundError to match the exact missing filename so
the error message includes the offending path: in tests test_single_missing_file
and test_missing_file_in_list, when calling
OpenDataLoaderPDFReader().lazy_load_data use pytest.raises(FileNotFoundError,
match="nonexistent.pdf") (or a regex that includes "nonexistent.pdf") so the
raised error message explicitly contains the missing file path; ensure both
usages referencing OpenDataLoaderPDFReader and lazy_load_data are updated.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@tests/test_readers_opendataloader_pdf.py`:
- Around line 15-36: The autouse fixture _bypass_input_validation currently
monkeypatches llama_index.readers.opendataloader_pdf.base._java_available to
always return True, which hides tests for its probe/caching behavior; update the
test suite by removing or limiting that global stub and add focused tests that
call llama_index.readers.opendataloader_pdf.base._java_available directly while
monkeypatching subprocess outcomes (simulate success, failure, and OSError) to
verify positive-only caching and widened OSError handling; ensure these new
tests restore the original _java_available (or avoid the autouse patch for them)
so they exercise the real probe logic and cache behavior.
- Around line 616-619: Update the test_no_java_raises_runtime_error assertion to
verify the full remediation guidance in the RuntimeError from
OpenDataLoaderPDFReader.lazy_load_data: instead of only matching "Java is not
found", assert the error message also contains the actionable phrases "Java 11+"
and "java -version" (e.g., update the pytest.raises(match=...) to a regex or
combined string that includes "Java is not found", "Java 11+", and "java
-version") so the test ensures the full guidance is present.
- Around line 595-603: Update the two tests that assert FileNotFoundError to
match the exact missing filename so the error message includes the offending
path: in tests test_single_missing_file and test_missing_file_in_list, when
calling OpenDataLoaderPDFReader().lazy_load_data use
pytest.raises(FileNotFoundError, match="nonexistent.pdf") (or a regex that
includes "nonexistent.pdf") so the raised error message explicitly contains the
missing file path; ensure both usages referencing OpenDataLoaderPDFReader and
lazy_load_data are updated.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 217d1e98-a3a1-4cb6-a678-a20f3e915826

📥 Commits

Reviewing files that changed from the base of the PR and between 4e6dd13 and 49b0494.

📒 Files selected for processing (1)

tests/test_readers_opendataloader_pdf.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

♻️ Duplicate comments (1)

tests/test_readers_opendataloader_pdf.py (1)

16-16: ⚠️ Potential issue | 🟡 Minor

Add missing return type annotations to private fixtures/helpers.

These functions are currently missing return type annotations and should be annotated to keep lint clean.

Suggested fix

 `@pytest.fixture`(autouse=True)
-def _bypass_input_validation(monkeypatch):
+def _bypass_input_validation(monkeypatch) -> None:
     """Bypass file-existence and Java checks in unit tests.
 
     Tests that verify validation behaviour override these explicitly.
     Only dummy PDF paths are faked; all other Path.exists calls delegate
     to the real implementation.
     """
     monkeypatch.setattr(
         "llama_index.readers.opendataloader_pdf.base._java_available",
         lambda: True,
     )
     monkeypatch.setattr(
         "llama_index.readers.opendataloader_pdf.base._java_found", None
     )
 
-    def _fake_exists(self):
+    def _fake_exists(self) -> bool:
         if self.suffix == ".pdf":
             return True
         return _original_path_exists(self)

     `@pytest.fixture`(autouse=True)
-    def _real_path_exists(self, monkeypatch):
+    def _real_path_exists(self, monkeypatch) -> None:
         """Restore real Path.exists so file-not-found is testable."""
         monkeypatch.setattr("pathlib.Path.exists", _original_path_exists)

Also applies to: 31-31, 591-591

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/test_readers_opendataloader_pdf.py` at line 16, The helper/fixture
_bypass_input_validation(monkeypatch) is missing a return type annotation (and
the other private helpers/fixtures at the indicated locations); add explicit
return type annotations (e.g., -> None for fixtures that only mutate state or ->
bytes/str/Iterator[...] or the correct specific type for helpers that return
values) to _bypass_input_validation and the other private functions mentioned to
satisfy the linter and keep signatures explicit.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@tests/test_readers_opendataloader_pdf.py`:
- Line 16: The helper/fixture _bypass_input_validation(monkeypatch) is missing a
return type annotation (and the other private helpers/fixtures at the indicated
locations); add explicit return type annotations (e.g., -> None for fixtures
that only mutate state or -> bytes/str/Iterator[...] or the correct specific
type for helpers that return values) to _bypass_input_validation and the other
private functions mentioned to satisfy the linter and keep signatures explicit.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d3dcd02f-b9b1-483d-8922-ac836fd5c72d

📥 Commits

Reviewing files that changed from the base of the PR and between 49b0494 and 572af16.

📒 Files selected for processing (1)

tests/test_readers_opendataloader_pdf.py

hyunhee-jo · 2026-04-10T06:58:42Z

Squash-merged to main as 4fbbd54.

hyunhee-jo and others added 2 commits April 10, 2026 13:52

coderabbitai Bot reviewed Apr 10, 2026

View reviewed changes

Comment thread llama_index/readers/opendataloader_pdf/base.py

Comment thread tests/test_readers_opendataloader_pdf.py

Comment thread tests/test_readers_opendataloader_pdf.py

Comment thread tests/test_readers_opendataloader_pdf.py

coderabbitai Bot reviewed Apr 10, 2026

View reviewed changes

fix: tighten FileNotFoundError assertions to include specific path

572af16

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai Bot reviewed Apr 10, 2026

View reviewed changes

hyunhee-jo closed this Apr 10, 2026

hyunhee-jo deleted the fix/improve-error-messages branch April 10, 2026 06:59

hyunhee-jo restored the fix/improve-error-messages branch April 10, 2026 07:04

hyunhee-jo reopened this Apr 10, 2026

hyunhee-jo merged commit ceb5aaa into main Apr 10, 2026
10 checks passed

hyunhee-jo deleted the fix/improve-error-messages branch April 13, 2026 04:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: add input validation with clear error messages#4

fix: add input validation with clear error messages#4
hyunhee-jo merged 4 commits into
mainfrom
fix/improve-error-messages

hyunhee-jo commented Apr 10, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 10, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

hyunhee-jo commented Apr 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

hyunhee-jo commented Apr 10, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Objective

Approach

Evidence

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

hyunhee-jo commented Apr 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

hyunhee-jo commented Apr 10, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 10, 2026 •

edited

Loading