feat: add PDF text extraction and image passthrough to get_drive_file_content#557
feat: add PDF text extraction and image passthrough to get_drive_file_content#557chagel wants to merge 6 commits into
Conversation
…_content PDF files are now parsed with pypdf to extract readable text instead of returning a binary placeholder. Image files (png, jpeg, gif, webp, etc.) are base64-encoded with mime type metadata for multimodal consumption. Scanned/image-only PDFs gracefully fall back to a message suggesting get_drive_file_download_url. Closes PIP-1242 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…port-to-workspace-mcp-fork feat: add PDF text extraction and image passthrough
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (2)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds PDF text extraction and image base64-encoding utilities to Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant get_drive_file_content
participant Downloader as MediaIoBaseDownload
participant extract_pdf_text
participant encode_image_content
Client->>get_drive_file_content: request file content (file_path)
get_drive_file_content->>Downloader: download file bytes
Downloader-->>get_drive_file_content: file_content_bytes + mime_type
alt mime_type == "application/pdf"
get_drive_file_content->>extract_pdf_text: extract_pdf_text(file_content_bytes)
alt extraction succeeds
extract_pdf_text-->>get_drive_file_content: extracted_text
else extraction fails
extract_pdf_text-->>get_drive_file_content: None
get_drive_file_content-->>Client: fallback guidance (download URL)
end
else mime_type in IMAGE_MIME_TYPES
get_drive_file_content->>encode_image_content: encode_image_content(file_bytes, mime_type)
encode_image_content-->>get_drive_file_content: "[base64_image:<mime_type>]" + base64_payload
else
get_drive_file_content->>get_drive_file_content: attempt UTF-8 decode or binary placeholder
end
get_drive_file_content-->>Client: body_text (extracted text, base64 payload, or placeholder)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (4)
tests/gdrive/test_drive_file_content.py (3)
101-111: Unusedinit_side_effectfunction.The
init_side_effectfunction defined at lines 104-106 is not used - the patch uses_FakeDownloaderclass directly. Consider removing this dead code.🧹 Proposed cleanup
def _patch_downloader(content_bytes): """Patch MediaIoBaseDownload to write content_bytes into the BytesIO handle.""" - - def init_side_effect(fh, request): - fh.write(content_bytes) - fh.seek(0) - return patch( "gdrive.drive_tools.MediaIoBaseDownload", side_effect=lambda fh, req: _FakeDownloader(fh, content_bytes), )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gdrive/test_drive_file_content.py` around lines 101 - 111, The helper function init_side_effect inside _patch_downloader is dead code because the patch call returns a _FakeDownloader via its lambda; remove the unused init_side_effect function definition from _patch_downloader and keep the patch(...) that constructs _FakeDownloader (or, if you intended to use the side effect, replace the lambda with side_effect=init_side_effect and adjust _FakeDownloader usage accordingly). Update references in _patch_downloader so only the active patch that returns _FakeDownloader remains.
56-70: Unused helper function.The
_mock_downloadfunction appears unused in this test file. The tests use_patch_downloaderand fixtures instead. Consider removing this dead code.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gdrive/test_drive_file_content.py` around lines 56 - 70, The helper function _mock_download is dead code in tests/gdrive/test_drive_file_content.py; remove the entire _mock_download definition (including its inner fake_next_chunk and returned mock_service/content_bytes) and run the test suite to confirm nothing references it (search for _mock_download, fake_next_chunk, or the returned "media_request"); if any test expects the same behavior, replace usages with the existing _patch_downloader fixture or the current mocking utilities instead.
24-53: Consider extracting shared test helper to avoid duplication.The
_make_minimal_pdfhelper is duplicated intests/core/test_pdf_image_utils.py. Consider extracting it to a shared test utilities module (e.g.,tests/conftest.pyortests/helpers.py) to maintain DRY.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gdrive/test_drive_file_content.py` around lines 24 - 53, The _make_minimal_pdf test helper is duplicated; extract it into a shared test utilities module (e.g., conftest or helpers) and import it from there in tests instead of keeping local copies. Locate the function _make_minimal_pdf (which constructs a PdfWriter, page resources, stream, and returns bytes), move its implementation into the shared module, replace the local definition in this test with an import of _make_minimal_pdf, and update the other test that contains the duplicate to import the same shared helper so both tests use a single canonical implementation.tests/core/test_pdf_image_utils.py (1)
14-48: Minor: importioat module level.The
ioimport at line 44 could be moved to the module's top-level imports for consistency. This is a minor nit.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/core/test_pdf_image_utils.py` around lines 14 - 48, Move the local import of io in _make_minimal_pdf to a module-level import: remove the inline "import io" inside the function and add "import io" to the file's top-level imports so _make_minimal_pdf (and other tests) use the standard module import; update references to io.BytesIO() in _make_minimal_pdf accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pyproject.toml`:
- Line 26: The pypdf dependency constraint is too permissive and must be
tightened: in the pyproject toml dependency line referencing pypdf (the "pypdf"
requirement entry currently set to >=4.0.0), update the version specifier to
require a safe minimum (preferably pypdf>=6.8.0, or at minimum pypdf>=6.7.4) so
installs exclude versions affected by multiple DoS CVEs; modify the existing
requirement line to the new constraint and save the file.
---
Nitpick comments:
In `@tests/core/test_pdf_image_utils.py`:
- Around line 14-48: Move the local import of io in _make_minimal_pdf to a
module-level import: remove the inline "import io" inside the function and add
"import io" to the file's top-level imports so _make_minimal_pdf (and other
tests) use the standard module import; update references to io.BytesIO() in
_make_minimal_pdf accordingly.
In `@tests/gdrive/test_drive_file_content.py`:
- Around line 101-111: The helper function init_side_effect inside
_patch_downloader is dead code because the patch call returns a _FakeDownloader
via its lambda; remove the unused init_side_effect function definition from
_patch_downloader and keep the patch(...) that constructs _FakeDownloader (or,
if you intended to use the side effect, replace the lambda with
side_effect=init_side_effect and adjust _FakeDownloader usage accordingly).
Update references in _patch_downloader so only the active patch that returns
_FakeDownloader remains.
- Around line 56-70: The helper function _mock_download is dead code in
tests/gdrive/test_drive_file_content.py; remove the entire _mock_download
definition (including its inner fake_next_chunk and returned
mock_service/content_bytes) and run the test suite to confirm nothing references
it (search for _mock_download, fake_next_chunk, or the returned
"media_request"); if any test expects the same behavior, replace usages with the
existing _patch_downloader fixture or the current mocking utilities instead.
- Around line 24-53: The _make_minimal_pdf test helper is duplicated; extract it
into a shared test utilities module (e.g., conftest or helpers) and import it
from there in tests instead of keeping local copies. Locate the function
_make_minimal_pdf (which constructs a PdfWriter, page resources, stream, and
returns bytes), move its implementation into the shared module, replace the
local definition in this test with an import of _make_minimal_pdf, and update
the other test that contains the duplicate to import the same shared helper so
both tests use a single canonical implementation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8fb20e36-f963-4b17-b15f-ae4db92a5372
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (5)
core/utils.pygdrive/drive_tools.pypyproject.tomltests/core/test_pdf_image_utils.pytests/gdrive/test_drive_file_content.py
- Bump pypdf>=4.0.0 to >=6.8.0 (excludes DoS CVEs) - Remove unused init_side_effect and _mock_download dead code - Extract duplicated _make_minimal_pdf to shared tests/helpers.py - Move io import to module level in test_pdf_image_utils.py Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
🧹 Nitpick comments (3)
tests/helpers.py (1)
20-20: Escape PDF literal text before building the stream.Raw interpolation at Line 20 can generate invalid PDFs when
textincludes(,), or\.Proposed fix
- stream.set_data(f"BT /F1 12 Tf 10 50 Td ({text}) Tj ET".encode()) + safe_text = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)") + stream.set_data(f"BT /F1 12 Tf 10 50 Td ({safe_text}) Tj ET".encode())🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/helpers.py` at line 20, The string interpolation into the PDF literal at stream.set_data(f"BT /F1 12 Tf 10 50 Td ({text}) Tj ET".encode()) must escape PDF literal special chars; implement a small helper (e.g. escape_pdf_literal(text)) that replaces backslash with \\ , left paren with \( and right paren with \) and use escape_pdf_literal(text) inside the formatted string before encoding; update the stream.set_data call to interpolate the escaped text so generated PDFs remain valid.tests/core/test_pdf_image_utils.py (1)
53-57: Strengthen JPEG test to verify payload round-trip, not just prefix.Right now it only checks metadata prefix; adding a decode/compare assertion gives parity with the PNG test and catches encoding regressions.
Proposed enhancement
def test_encode_image_content_jpeg(): raw = b"\xff\xd8\xff" + b"\x00" * 50 result = encode_image_content(raw, "image/jpeg") assert result.startswith("[base64_image:image/jpeg]") + encoded_part = result[len("[base64_image:image/jpeg]") :] + assert base64.b64decode(encoded_part) == raw🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/core/test_pdf_image_utils.py` around lines 53 - 57, The JPEG test only asserts the prefix; update test_encode_image_content_jpeg to also extract and base64-decode the payload produced by encode_image_content("image/jpeg") and assert the decoded bytes equal the original raw input, mirroring the PNG test; locate encode_image_content and the test function name test_encode_image_content_jpeg and use the same payload-splitting and base64 decoding logic used in the PNG test to perform the round-trip comparison.tests/gdrive/test_drive_file_content.py (1)
9-13: Avoid mutatingsys.pathinside the test module.Line 12 changes global import resolution and can make test behavior environment-dependent.
Proposed cleanup
-import sys -import os - -sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "../..")))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gdrive/test_drive_file_content.py` around lines 9 - 13, Remove the sys.path mutation in tests/gdrive/test_drive_file_content.py (the sys.path.insert call) and instead ensure imports are resolved by proper packaging or test configuration: uninstall the inline sys.path change and either import the target module via the package namespace (e.g., from your_project.package import ...), rely on an editable/installable package in the test environment (pip install -e . in CI), or configure PYTHONPATH/pytest's conftest to adjust sys.path globally; do not mutate sys.path inside the test module itself.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tests/core/test_pdf_image_utils.py`:
- Around line 53-57: The JPEG test only asserts the prefix; update
test_encode_image_content_jpeg to also extract and base64-decode the payload
produced by encode_image_content("image/jpeg") and assert the decoded bytes
equal the original raw input, mirroring the PNG test; locate
encode_image_content and the test function name test_encode_image_content_jpeg
and use the same payload-splitting and base64 decoding logic used in the PNG
test to perform the round-trip comparison.
In `@tests/gdrive/test_drive_file_content.py`:
- Around line 9-13: Remove the sys.path mutation in
tests/gdrive/test_drive_file_content.py (the sys.path.insert call) and instead
ensure imports are resolved by proper packaging or test configuration: uninstall
the inline sys.path change and either import the target module via the package
namespace (e.g., from your_project.package import ...), rely on an
editable/installable package in the test environment (pip install -e . in CI),
or configure PYTHONPATH/pytest's conftest to adjust sys.path globally; do not
mutate sys.path inside the test module itself.
In `@tests/helpers.py`:
- Line 20: The string interpolation into the PDF literal at stream.set_data(f"BT
/F1 12 Tf 10 50 Td ({text}) Tj ET".encode()) must escape PDF literal special
chars; implement a small helper (e.g. escape_pdf_literal(text)) that replaces
backslash with \\ , left paren with \( and right paren with \) and use
escape_pdf_literal(text) inside the formatted string before encoding; update the
stream.set_data call to interpolate the escaped text so generated PDFs remain
valid.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 922c6e2a-99a0-4a87-9c95-2a0a9a9d198c
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (4)
pyproject.tomltests/core/test_pdf_image_utils.pytests/gdrive/test_drive_file_content.pytests/helpers.py
🚧 Files skipped from review as they are similar to previous changes (1)
- pyproject.toml
|
Don't have edit rights, can't push my refac until you fix: |
- Bump pypdf>=4.0.0 to >=6.8.0 (excludes DoS CVEs) - Remove unused init_side_effect and _mock_download dead code - Extract duplicated _make_minimal_pdf to shared tests/helpers.py - Move io import to module level in test_pdf_image_utils.py Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Superseded by #659 — re-submitted from personal fork to enable maintainer edit access. Rebased on current main. |
Summary
pypdfdependency andextract_pdf_text()to extract readable text from PDF files instead of returning a binary placeholderencode_image_content()to base64-encode images (png, jpeg, gif, webp, bmp, tiff, svg) with mime type metadata for multimodal consumptionget_drive_file_contentto handle PDF and image mime types before the generic UTF-8 fallback; scanned/image-only PDFs fall back to a message suggestingget_drive_file_download_urlCloses #365
Closes #263
Test plan
extract_pdf_text— valid PDF, corrupted input, empty/blank PDFencode_image_content— PNG and JPEG encodingget_drive_file_contentintegration — PDF with text, empty PDF fallback, image passthrough🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Tests