fix: add input validation with clear error messages#4
Conversation
Raise FileNotFoundError for missing input paths and RuntimeError when Java is not found on the system PATH, instead of letting CalledProcessError propagate from the subprocess. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…w Path.exists mock, catch OSError - Replace lru_cache with manual caching that only caches True (Java found); negative results are re-probed so mid-session Java installs are recognised. - Narrow the autouse Path.exists mock to .pdf suffix only, delegating all other calls to the real implementation. - Catch OSError (covers PermissionError) instead of just FileNotFoundError in _java_available(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
WalkthroughAdds a Java availability probe and input-path existence validation to OpenDataLoaderPDFReader.lazy_load_data, raising clear errors when Java is missing or when resolved input paths do not exist. Changes
Sequence Diagram(s)(Skipped — changes are localized and do not introduce a multi-component control flow that requires visualization.) Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@llama_index/readers/opendataloader_pdf/base.py`:
- Around line 22-43: The current _java_available function wrongly caches any
java with returncode 0 as available; change it to run ["java","-version"],
capture output (stderr/stdout), parse the version string produced by java
-version to extract the major version number (handle formats like "1.8.0_xx" ->
major 8 and "11.0.x" -> major 11), and only set the module-level _java_found
True (and return True) if the major version is >= 11; keep the existing behavior
of not caching negatives (i.e. only cache successful checks), handle
subprocess.TimeoutExpired/OSError as False, and update the assignment to
_java_found in _java_available accordingly so downstream code requiring Java 11+
will not be cached as available when an older Java is present.
In `@tests/test_readers_opendataloader_pdf.py`:
- Around line 609-617: Update the assertion in test_no_java_raises_runtime_error
to verify the full actionable guidance in the RuntimeError from
OpenDataLoaderPDFReader.lazy_load_data: require the error message to include
both the "Java 11+" version requirement and a hint to run "java -version" (e.g.,
match a regex or exact substrings "Java 11+" and "java -version") so the test
fails if those helpful instructions are removed or changed.
- Around line 15-33: Add focused unit tests for the new _java_available()
semantics: stop using the global fixture stub for these tests and instead
directly call llama_index.readers.opendataloader_pdf.base._java_available while
monkeypatching the underlying Java-check implementation to first return True and
verify a subsequent call uses the cached positive result, then test a sequence
where the underlying check returns False and verify that False is not cached (a
later underlying True is observed), and finally simulate the underlying check
raising OSError (or a subclass) to confirm the widened OSError handling occurs
and that the exception behavior is not erroneously cached; place these tests
alongside the existing suite but ensure they do not rely on the module-wide
Path.exists/_java_available monkeypatch.
- Around line 592-600: Update the two tests so they assert the FileNotFoundError
message includes the actual missing path returned by
OpenDataLoaderPDFReader.lazy_load_data; specifically change the pytest.raises
match in test_single_missing_file and test_missing_file_in_list to include
"nonexistent.pdf" (e.g., use a regex matching the filename or full path) so the
exception message is locked to the offending path rather than just the generic
"does not exist".
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6574c428-767d-4796-9f02-f2148d0f245d
📒 Files selected for processing (2)
llama_index/readers/opendataloader_pdf/base.pytests/test_readers_opendataloader_pdf.py
…patch - Add monkeypatch.setattr for _java_found in autouse fixture to prevent global state leakage between tests. - Remove redundant @patch("pathlib.Path.exists") from TestJavaCheck since the autouse fixture already handles .pdf suffix paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
♻️ Duplicate comments (3)
tests/test_readers_opendataloader_pdf.py (3)
15-36:⚠️ Potential issue | 🟡 MinorGlobal
_java_availablestubbing still hides core probe/cache behavior testsThe autouse fixture at Line 15 forces
_java_availabletoTruefor most tests, so regressions in positive-only caching and widenedOSErrorhandling can still slip through. Add focused tests that callllama_index.readers.opendataloader_pdf.base._java_availabledirectly with controlled subprocess outcomes.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_readers_opendataloader_pdf.py` around lines 15 - 36, The autouse fixture _bypass_input_validation currently monkeypatches llama_index.readers.opendataloader_pdf.base._java_available to always return True, which hides tests for its probe/caching behavior; update the test suite by removing or limiting that global stub and add focused tests that call llama_index.readers.opendataloader_pdf.base._java_available directly while monkeypatching subprocess outcomes (simulate success, failure, and OSError) to verify positive-only caching and widened OSError handling; ensure these new tests restore the original _java_available (or avoid the autouse patch for them) so they exercise the real probe logic and cache behavior.
616-619:⚠️ Potential issue | 🟡 MinorVerify full Java remediation guidance in the RuntimeError message
Line 618 currently checks only
"Java is not found". The test should also assert the actionable guidance ("Java 11+"and"java -version"), so those instructions can’t regress silently.Suggested assertion upgrade
- with pytest.raises(RuntimeError, match="Java is not found"): - list(reader.lazy_load_data(file_path="doc.pdf")) + with pytest.raises(RuntimeError) as exc_info: + list(reader.lazy_load_data(file_path="doc.pdf")) + message = str(exc_info.value) + assert "Java is not found" in message + assert "Java 11+" in message + assert "java -version" in message🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_readers_opendataloader_pdf.py` around lines 616 - 619, Update the test_no_java_raises_runtime_error assertion to verify the full remediation guidance in the RuntimeError from OpenDataLoaderPDFReader.lazy_load_data: instead of only matching "Java is not found", assert the error message also contains the actionable phrases "Java 11+" and "java -version" (e.g., update the pytest.raises(match=...) to a regex or combined string that includes "Java is not found", "Java 11+", and "java -version") so the test ensures the full guidance is present.
595-603:⚠️ Potential issue | 🟡 MinorAssert the specific missing path in
FileNotFoundErrorLine 597 and Line 602 only match a generic phrase. These tests should lock in the offending filename/path (
nonexistent.pdf) to preserve the PR’s UX contract.Suggested test assertion tightening
- with pytest.raises(FileNotFoundError, match="does not exist"): + with pytest.raises( + FileNotFoundError, + match=r"Input path does not exist: nonexistent\.pdf", + ): list(reader.lazy_load_data(file_path="nonexistent.pdf")) @@ - with pytest.raises(FileNotFoundError, match="does not exist"): + with pytest.raises( + FileNotFoundError, + match=r"Input path does not exist: nonexistent\.pdf", + ): list(reader.lazy_load_data(file_path=["nonexistent.pdf"]))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_readers_opendataloader_pdf.py` around lines 595 - 603, Update the two tests that assert FileNotFoundError to match the exact missing filename so the error message includes the offending path: in tests test_single_missing_file and test_missing_file_in_list, when calling OpenDataLoaderPDFReader().lazy_load_data use pytest.raises(FileNotFoundError, match="nonexistent.pdf") (or a regex that includes "nonexistent.pdf") so the raised error message explicitly contains the missing file path; ensure both usages referencing OpenDataLoaderPDFReader and lazy_load_data are updated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@tests/test_readers_opendataloader_pdf.py`:
- Around line 15-36: The autouse fixture _bypass_input_validation currently
monkeypatches llama_index.readers.opendataloader_pdf.base._java_available to
always return True, which hides tests for its probe/caching behavior; update the
test suite by removing or limiting that global stub and add focused tests that
call llama_index.readers.opendataloader_pdf.base._java_available directly while
monkeypatching subprocess outcomes (simulate success, failure, and OSError) to
verify positive-only caching and widened OSError handling; ensure these new
tests restore the original _java_available (or avoid the autouse patch for them)
so they exercise the real probe logic and cache behavior.
- Around line 616-619: Update the test_no_java_raises_runtime_error assertion to
verify the full remediation guidance in the RuntimeError from
OpenDataLoaderPDFReader.lazy_load_data: instead of only matching "Java is not
found", assert the error message also contains the actionable phrases "Java 11+"
and "java -version" (e.g., update the pytest.raises(match=...) to a regex or
combined string that includes "Java is not found", "Java 11+", and "java
-version") so the test ensures the full guidance is present.
- Around line 595-603: Update the two tests that assert FileNotFoundError to
match the exact missing filename so the error message includes the offending
path: in tests test_single_missing_file and test_missing_file_in_list, when
calling OpenDataLoaderPDFReader().lazy_load_data use
pytest.raises(FileNotFoundError, match="nonexistent.pdf") (or a regex that
includes "nonexistent.pdf") so the raised error message explicitly contains the
missing file path; ensure both usages referencing OpenDataLoaderPDFReader and
lazy_load_data are updated.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 217d1e98-a3a1-4cb6-a678-a20f3e915826
📒 Files selected for processing (1)
tests/test_readers_opendataloader_pdf.py
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
♻️ Duplicate comments (1)
tests/test_readers_opendataloader_pdf.py (1)
16-16:⚠️ Potential issue | 🟡 MinorAdd missing return type annotations to private fixtures/helpers.
These functions are currently missing return type annotations and should be annotated to keep lint clean.
Suggested fix
`@pytest.fixture`(autouse=True) -def _bypass_input_validation(monkeypatch): +def _bypass_input_validation(monkeypatch) -> None: """Bypass file-existence and Java checks in unit tests. Tests that verify validation behaviour override these explicitly. Only dummy PDF paths are faked; all other Path.exists calls delegate to the real implementation. """ monkeypatch.setattr( "llama_index.readers.opendataloader_pdf.base._java_available", lambda: True, ) monkeypatch.setattr( "llama_index.readers.opendataloader_pdf.base._java_found", None ) - def _fake_exists(self): + def _fake_exists(self) -> bool: if self.suffix == ".pdf": return True return _original_path_exists(self)`@pytest.fixture`(autouse=True) - def _real_path_exists(self, monkeypatch): + def _real_path_exists(self, monkeypatch) -> None: """Restore real Path.exists so file-not-found is testable.""" monkeypatch.setattr("pathlib.Path.exists", _original_path_exists)Also applies to: 31-31, 591-591
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_readers_opendataloader_pdf.py` at line 16, The helper/fixture _bypass_input_validation(monkeypatch) is missing a return type annotation (and the other private helpers/fixtures at the indicated locations); add explicit return type annotations (e.g., -> None for fixtures that only mutate state or -> bytes/str/Iterator[...] or the correct specific type for helpers that return values) to _bypass_input_validation and the other private functions mentioned to satisfy the linter and keep signatures explicit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@tests/test_readers_opendataloader_pdf.py`:
- Line 16: The helper/fixture _bypass_input_validation(monkeypatch) is missing a
return type annotation (and the other private helpers/fixtures at the indicated
locations); add explicit return type annotations (e.g., -> None for fixtures
that only mutate state or -> bytes/str/Iterator[...] or the correct specific
type for helpers that return values) to _bypass_input_validation and the other
private functions mentioned to satisfy the linter and keep signatures explicit.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: d3dcd02f-b9b1-483d-8922-ac836fd5c72d
📒 Files selected for processing (1)
tests/test_readers_opendataloader_pdf.py
|
Squash-merged to main as 4fbbd54. |
Objective
Users get cryptic
CalledProcessErroror Java stack traces when passing non-existent file paths or running without Java installed. This makes debugging difficult, especially for first-time users.Approach
Add early validation in
lazy_load_data()before invoking the Java engine:FileNotFoundErrorwith the specific missing pathRuntimeErrorwith installation instructions (minimum version, download URL, verification command)Design decisions:
tempfile.mkdtempto avoid creating temp dirs that would need cleanup on validation failureOSError(not justFileNotFoundError) to handlePermissionErroron platforms wherejavaexists but isn't executableEvidence
CalledProcessErrorfrom Java subprocessFileNotFoundError: Input path does not exist: missing.pdfFileNotFoundError: [Errno 2] No such file or directory: 'java'RuntimeError: Java is not found on the system PATH. OpenDataLoader PDF requires Java 11+. Install Java from https://adoptium.net/...🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests