Skip to content

feat: add PDF text extraction and image passthrough to get_drive_file_content#557

Closed
chagel wants to merge 6 commits into
taylorwilsdon:mainfrom
pipihosting:main
Closed

feat: add PDF text extraction and image passthrough to get_drive_file_content#557
chagel wants to merge 6 commits into
taylorwilsdon:mainfrom
pipihosting:main

Conversation

@chagel
Copy link
Copy Markdown
Contributor

@chagel chagel commented Mar 11, 2026

Summary

  • Add pypdf dependency and extract_pdf_text() to extract readable text from PDF files instead of returning a binary placeholder
  • Add encode_image_content() to base64-encode images (png, jpeg, gif, webp, bmp, tiff, svg) with mime type metadata for multimodal consumption
  • Update get_drive_file_content to handle PDF and image mime types before the generic UTF-8 fallback; scanned/image-only PDFs fall back to a message suggesting get_drive_file_download_url
  • 9 new tests covering unit and integration scenarios

Closes #365
Closes #263

Test plan

  • extract_pdf_text — valid PDF, corrupted input, empty/blank PDF
  • encode_image_content — PNG and JPEG encoding
  • get_drive_file_content integration — PDF with text, empty PDF fallback, image passthrough
  • Full test suite passes (270 tests)
  • Ruff lint and format clean

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added PDF text extraction for Drive files with a fallback message if extraction yields no text.
    • Added image handling by embedding images as portable base64‑prefixed content for display/transfer.
  • Tests

    • Added unit and integration tests for PDF extraction (valid/empty/corrupt), image encoding/decoding, and supported image MIME types.

chagel and others added 3 commits March 10, 2026 10:59
…_content

PDF files are now parsed with pypdf to extract readable text instead of
returning a binary placeholder. Image files (png, jpeg, gif, webp, etc.)
are base64-encoded with mime type metadata for multimodal consumption.
Scanned/image-only PDFs gracefully fall back to a message suggesting
get_drive_file_download_url.

Closes PIP-1242

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…port-to-workspace-mcp-fork

feat: add PDF text extraction and image passthrough
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 11, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0399016f-bb49-4da3-b46a-37d2842fa7f0

📥 Commits

Reviewing files that changed from the base of the PR and between c67eced and 012e133.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
  • core/utils.py
  • pyproject.toml
✅ Files skipped from review due to trivial changes (1)
  • pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (1)
  • core/utils.py

📝 Walkthrough

Walkthrough

Adds PDF text extraction and image base64-encoding utilities to core/utils.py, integrates them into get_drive_file_content() to handle application/pdf and common image MIME types, adds pypdf as a runtime dependency, and includes unit/integration tests and a test helper for minimal PDFs.

Changes

Cohort / File(s) Summary
Core utilities & tests
core/utils.py, tests/core/test_pdf_image_utils.py
Add IMAGE_MIME_TYPES constant, extract_pdf_text(file_bytes), and encode_image_content(file_bytes, mime_type). Tests cover PDF extraction success/failure, empty/corrupt PDFs, image base64 encoding/decoding, and MIME set membership.
Google Drive integration & tests
gdrive/drive_tools.py, tests/gdrive/test_drive_file_content.py
get_drive_file_content now uses extract_pdf_text for application/pdf (falls back to download guidance when extraction yields no text) and encode_image_content for image MIME types; tests validate extraction, fallback guidance, and base64 passthrough.
Dependency
pyproject.toml
Add pypdf>=6.8.0 to project dependencies for PDF parsing.
Test helpers
tests/helpers.py
Add _make_minimal_pdf(text: str) to generate minimal in-memory PDFs used by tests.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant get_drive_file_content
    participant Downloader as MediaIoBaseDownload
    participant extract_pdf_text
    participant encode_image_content

    Client->>get_drive_file_content: request file content (file_path)
    get_drive_file_content->>Downloader: download file bytes
    Downloader-->>get_drive_file_content: file_content_bytes + mime_type

    alt mime_type == "application/pdf"
        get_drive_file_content->>extract_pdf_text: extract_pdf_text(file_content_bytes)
        alt extraction succeeds
            extract_pdf_text-->>get_drive_file_content: extracted_text
        else extraction fails
            extract_pdf_text-->>get_drive_file_content: None
            get_drive_file_content-->>Client: fallback guidance (download URL)
        end
    else mime_type in IMAGE_MIME_TYPES
        get_drive_file_content->>encode_image_content: encode_image_content(file_bytes, mime_type)
        encode_image_content-->>get_drive_file_content: "[base64_image:<mime_type>]" + base64_payload
    else
        get_drive_file_content->>get_drive_file_content: attempt UTF-8 decode or binary placeholder
    end

    get_drive_file_content-->>Client: body_text (extracted text, base64 payload, or placeholder)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰
I hopped through bytes both dark and bright,
Pulled words from PDFs into the light,
Wrapped images snug in base64 fur,
Now Drive's content hums and purrs,
A rabbit's tweak — concise and neat!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main changes: adding PDF text extraction and image encoding to get_drive_file_content function.
Description check ✅ Passed The description covers key aspects (pypdf dependency, extract_pdf_text, encode_image_content, get_drive_file_content updates, tests, issue closures) but is missing some template sections like Type of Change checkboxes and complete Testing/Checklist items.
Linked Issues check ✅ Passed All primary coding requirements from #365 and #263 are met: PDF text extraction via extract_pdf_text, image base64 encoding via encode_image_content, updated get_drive_file_content handling, and fallback for empty PDFs.
Out of Scope Changes check ✅ Passed All changes are directly aligned with the linked issues: PDF handling, image encoding, updated get_drive_file_content, new dependency (pypdf), and comprehensive test coverage. No unrelated modifications detected.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (4)
tests/gdrive/test_drive_file_content.py (3)

101-111: Unused init_side_effect function.

The init_side_effect function defined at lines 104-106 is not used - the patch uses _FakeDownloader class directly. Consider removing this dead code.

🧹 Proposed cleanup
 def _patch_downloader(content_bytes):
     """Patch MediaIoBaseDownload to write content_bytes into the BytesIO handle."""
-
-    def init_side_effect(fh, request):
-        fh.write(content_bytes)
-        fh.seek(0)
-
     return patch(
         "gdrive.drive_tools.MediaIoBaseDownload",
         side_effect=lambda fh, req: _FakeDownloader(fh, content_bytes),
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gdrive/test_drive_file_content.py` around lines 101 - 111, The helper
function init_side_effect inside _patch_downloader is dead code because the
patch call returns a _FakeDownloader via its lambda; remove the unused
init_side_effect function definition from _patch_downloader and keep the
patch(...) that constructs _FakeDownloader (or, if you intended to use the side
effect, replace the lambda with side_effect=init_side_effect and adjust
_FakeDownloader usage accordingly). Update references in _patch_downloader so
only the active patch that returns _FakeDownloader remains.

56-70: Unused helper function.

The _mock_download function appears unused in this test file. The tests use _patch_downloader and fixtures instead. Consider removing this dead code.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gdrive/test_drive_file_content.py` around lines 56 - 70, The helper
function _mock_download is dead code in tests/gdrive/test_drive_file_content.py;
remove the entire _mock_download definition (including its inner fake_next_chunk
and returned mock_service/content_bytes) and run the test suite to confirm
nothing references it (search for _mock_download, fake_next_chunk, or the
returned "media_request"); if any test expects the same behavior, replace usages
with the existing _patch_downloader fixture or the current mocking utilities
instead.

24-53: Consider extracting shared test helper to avoid duplication.

The _make_minimal_pdf helper is duplicated in tests/core/test_pdf_image_utils.py. Consider extracting it to a shared test utilities module (e.g., tests/conftest.py or tests/helpers.py) to maintain DRY.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gdrive/test_drive_file_content.py` around lines 24 - 53, The
_make_minimal_pdf test helper is duplicated; extract it into a shared test
utilities module (e.g., conftest or helpers) and import it from there in tests
instead of keeping local copies. Locate the function _make_minimal_pdf (which
constructs a PdfWriter, page resources, stream, and returns bytes), move its
implementation into the shared module, replace the local definition in this test
with an import of _make_minimal_pdf, and update the other test that contains the
duplicate to import the same shared helper so both tests use a single canonical
implementation.
tests/core/test_pdf_image_utils.py (1)

14-48: Minor: import io at module level.

The io import at line 44 could be moved to the module's top-level imports for consistency. This is a minor nit.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/test_pdf_image_utils.py` around lines 14 - 48, Move the local
import of io in _make_minimal_pdf to a module-level import: remove the inline
"import io" inside the function and add "import io" to the file's top-level
imports so _make_minimal_pdf (and other tests) use the standard module import;
update references to io.BytesIO() in _make_minimal_pdf accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pyproject.toml`:
- Line 26: The pypdf dependency constraint is too permissive and must be
tightened: in the pyproject toml dependency line referencing pypdf (the "pypdf"
requirement entry currently set to >=4.0.0), update the version specifier to
require a safe minimum (preferably pypdf>=6.8.0, or at minimum pypdf>=6.7.4) so
installs exclude versions affected by multiple DoS CVEs; modify the existing
requirement line to the new constraint and save the file.

---

Nitpick comments:
In `@tests/core/test_pdf_image_utils.py`:
- Around line 14-48: Move the local import of io in _make_minimal_pdf to a
module-level import: remove the inline "import io" inside the function and add
"import io" to the file's top-level imports so _make_minimal_pdf (and other
tests) use the standard module import; update references to io.BytesIO() in
_make_minimal_pdf accordingly.

In `@tests/gdrive/test_drive_file_content.py`:
- Around line 101-111: The helper function init_side_effect inside
_patch_downloader is dead code because the patch call returns a _FakeDownloader
via its lambda; remove the unused init_side_effect function definition from
_patch_downloader and keep the patch(...) that constructs _FakeDownloader (or,
if you intended to use the side effect, replace the lambda with
side_effect=init_side_effect and adjust _FakeDownloader usage accordingly).
Update references in _patch_downloader so only the active patch that returns
_FakeDownloader remains.
- Around line 56-70: The helper function _mock_download is dead code in
tests/gdrive/test_drive_file_content.py; remove the entire _mock_download
definition (including its inner fake_next_chunk and returned
mock_service/content_bytes) and run the test suite to confirm nothing references
it (search for _mock_download, fake_next_chunk, or the returned
"media_request"); if any test expects the same behavior, replace usages with the
existing _patch_downloader fixture or the current mocking utilities instead.
- Around line 24-53: The _make_minimal_pdf test helper is duplicated; extract it
into a shared test utilities module (e.g., conftest or helpers) and import it
from there in tests instead of keeping local copies. Locate the function
_make_minimal_pdf (which constructs a PdfWriter, page resources, stream, and
returns bytes), move its implementation into the shared module, replace the
local definition in this test with an import of _make_minimal_pdf, and update
the other test that contains the duplicate to import the same shared helper so
both tests use a single canonical implementation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8fb20e36-f963-4b17-b15f-ae4db92a5372

📥 Commits

Reviewing files that changed from the base of the PR and between 5a2e1f3 and ba2e4dc.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • core/utils.py
  • gdrive/drive_tools.py
  • pyproject.toml
  • tests/core/test_pdf_image_utils.py
  • tests/gdrive/test_drive_file_content.py

Comment thread pyproject.toml Outdated
chagel and others added 2 commits March 14, 2026 22:51
- Bump pypdf>=4.0.0 to >=6.8.0 (excludes DoS CVEs)
- Remove unused init_side_effect and _mock_download dead code
- Extract duplicated _make_minimal_pdf to shared tests/helpers.py
- Move io import to module level in test_pdf_image_utils.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
tests/helpers.py (1)

20-20: Escape PDF literal text before building the stream.

Raw interpolation at Line 20 can generate invalid PDFs when text includes (, ), or \.

Proposed fix
-    stream.set_data(f"BT /F1 12 Tf 10 50 Td ({text}) Tj ET".encode())
+    safe_text = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
+    stream.set_data(f"BT /F1 12 Tf 10 50 Td ({safe_text}) Tj ET".encode())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/helpers.py` at line 20, The string interpolation into the PDF literal
at stream.set_data(f"BT /F1 12 Tf 10 50 Td ({text}) Tj ET".encode()) must escape
PDF literal special chars; implement a small helper (e.g.
escape_pdf_literal(text)) that replaces backslash with \\ , left paren with \(
and right paren with \) and use escape_pdf_literal(text) inside the formatted
string before encoding; update the stream.set_data call to interpolate the
escaped text so generated PDFs remain valid.
tests/core/test_pdf_image_utils.py (1)

53-57: Strengthen JPEG test to verify payload round-trip, not just prefix.

Right now it only checks metadata prefix; adding a decode/compare assertion gives parity with the PNG test and catches encoding regressions.

Proposed enhancement
 def test_encode_image_content_jpeg():
     raw = b"\xff\xd8\xff" + b"\x00" * 50
     result = encode_image_content(raw, "image/jpeg")
     assert result.startswith("[base64_image:image/jpeg]")
+    encoded_part = result[len("[base64_image:image/jpeg]") :]
+    assert base64.b64decode(encoded_part) == raw
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/test_pdf_image_utils.py` around lines 53 - 57, The JPEG test only
asserts the prefix; update test_encode_image_content_jpeg to also extract and
base64-decode the payload produced by encode_image_content("image/jpeg") and
assert the decoded bytes equal the original raw input, mirroring the PNG test;
locate encode_image_content and the test function name
test_encode_image_content_jpeg and use the same payload-splitting and base64
decoding logic used in the PNG test to perform the round-trip comparison.
tests/gdrive/test_drive_file_content.py (1)

9-13: Avoid mutating sys.path inside the test module.

Line 12 changes global import resolution and can make test behavior environment-dependent.

Proposed cleanup
-import sys
-import os
-
-sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "../..")))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gdrive/test_drive_file_content.py` around lines 9 - 13, Remove the
sys.path mutation in tests/gdrive/test_drive_file_content.py (the
sys.path.insert call) and instead ensure imports are resolved by proper
packaging or test configuration: uninstall the inline sys.path change and either
import the target module via the package namespace (e.g., from
your_project.package import ...), rely on an editable/installable package in the
test environment (pip install -e . in CI), or configure PYTHONPATH/pytest's
conftest to adjust sys.path globally; do not mutate sys.path inside the test
module itself.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/core/test_pdf_image_utils.py`:
- Around line 53-57: The JPEG test only asserts the prefix; update
test_encode_image_content_jpeg to also extract and base64-decode the payload
produced by encode_image_content("image/jpeg") and assert the decoded bytes
equal the original raw input, mirroring the PNG test; locate
encode_image_content and the test function name test_encode_image_content_jpeg
and use the same payload-splitting and base64 decoding logic used in the PNG
test to perform the round-trip comparison.

In `@tests/gdrive/test_drive_file_content.py`:
- Around line 9-13: Remove the sys.path mutation in
tests/gdrive/test_drive_file_content.py (the sys.path.insert call) and instead
ensure imports are resolved by proper packaging or test configuration: uninstall
the inline sys.path change and either import the target module via the package
namespace (e.g., from your_project.package import ...), rely on an
editable/installable package in the test environment (pip install -e . in CI),
or configure PYTHONPATH/pytest's conftest to adjust sys.path globally; do not
mutate sys.path inside the test module itself.

In `@tests/helpers.py`:
- Line 20: The string interpolation into the PDF literal at stream.set_data(f"BT
/F1 12 Tf 10 50 Td ({text}) Tj ET".encode()) must escape PDF literal special
chars; implement a small helper (e.g. escape_pdf_literal(text)) that replaces
backslash with \\ , left paren with \( and right paren with \) and use
escape_pdf_literal(text) inside the formatted string before encoding; update the
stream.set_data call to interpolate the escaped text so generated PDFs remain
valid.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 922c6e2a-99a0-4a87-9c95-2a0a9a9d198c

📥 Commits

Reviewing files that changed from the base of the PR and between ba2e4dc and c67eced.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • pyproject.toml
  • tests/core/test_pdf_image_utils.py
  • tests/gdrive/test_drive_file_content.py
  • tests/helpers.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • pyproject.toml

@taylorwilsdon taylorwilsdon self-assigned this Apr 4, 2026
@taylorwilsdon taylorwilsdon added the enhancement New feature or request label Apr 4, 2026
@taylorwilsdon
Copy link
Copy Markdown
Owner

Don't have edit rights, can't push my refac until you fix:

fatal: unable to access 'https://github.com/pipihosting/google_workspace_mcp.git/': The requested URL returned error: 403

@taylorwilsdon taylorwilsdon marked this pull request as draft April 4, 2026 01:14
chagel added a commit to chagel/google_workspace_mcp that referenced this pull request Apr 4, 2026
- Bump pypdf>=4.0.0 to >=6.8.0 (excludes DoS CVEs)
- Remove unused init_side_effect and _mock_download dead code
- Extract duplicated _make_minimal_pdf to shared tests/helpers.py
- Move io import to module level in test_pdf_image_utils.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@chagel
Copy link
Copy Markdown
Contributor Author

chagel commented Apr 4, 2026

Superseded by #659 — re-submitted from personal fork to enable maintainer edit access. Rebased on current main.

@chagel chagel closed this Apr 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

2 participants