feat: add PDF text extraction and image passthrough to get_drive_file_content by chagel · Pull Request #659 · taylorwilsdon/google_workspace_mcp

chagel · 2026-04-04T22:58:26Z

Summary

Add pypdf dependency and extract_pdf_text() to extract readable text from PDF files instead of returning a binary placeholder
Add encode_image_content() to base64-encode images (png, jpeg, gif, webp, bmp, tiff, svg) with mime type metadata for multimodal consumption
Update get_drive_file_content to handle PDF and image mime types before the generic UTF-8 fallback; scanned/image-only PDFs fall back to a message suggesting get_drive_file_download_url
9 new tests covering unit and integration scenarios
Rebased on current main (resolves uv.lock conflicts from feat: add PDF text extraction and image passthrough to get_drive_file_content #557)

Closes #365
Closes #263
Supersedes #557 (re-submitted from personal fork to enable maintainer edits)

Test plan

extract_pdf_text — valid PDF, corrupted input, empty/blank PDF
encode_image_content — PNG and JPEG encoding
get_drive_file_content integration — PDF with text, empty PDF fallback, image passthrough
Full test suite passes (584 tests)
Ruff lint and format clean

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Extract text from PDF files retrieved from Google Drive; returns a clear fallback when no text is found
- Encode and return image file content with MIME metadata (base64-embedded)
Tests
- Added tests covering PDF text extraction, image encoding/validation, and recognized image MIME types for Drive file handling
Documentation
- Updated Drive tool docs to list Office, PDF, and image support

…_content PDF files are now parsed with pypdf to extract readable text instead of returning a binary placeholder. Image files (png, jpeg, gif, webp, etc.) are base64-encoded with mime type metadata for multimodal consumption. Scanned/image-only PDFs gracefully fall back to a message suggesting get_drive_file_download_url. Closes PIP-1242 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Bump pypdf>=4.0.0 to >=6.8.0 (excludes DoS CVEs) - Remove unused init_side_effect and _mock_download dead code - Extract duplicated _make_minimal_pdf to shared tests/helpers.py - Move io import to module level in test_pdf_image_utils.py Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-04-04T22:58:44Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 25ddf293-e2e4-43e8-9e42-c2d2d9e12071

📥 Commits

Reviewing files that changed from the base of the PR and between 052de9b and b5946c9.

📒 Files selected for processing (5)

core/utils.py
gdrive/drive_tools.py
tests/core/test_pdf_image_utils.py
tests/gcontacts/golden/contacts_tool_schemas.json
tests/gdocs/golden/docs_tool_schemas.json

✅ Files skipped from review due to trivial changes (3)

tests/gcontacts/golden/contacts_tool_schemas.json
tests/gdocs/golden/docs_tool_schemas.json
tests/core/test_pdf_image_utils.py

🚧 Files skipped from review as they are similar to previous changes (2)

gdrive/drive_tools.py
core/utils.py

📝 Walkthrough

Walkthrough

Adds PDF text extraction and image encoding: new utilities (extract_pdf_text, encode_image_content, IMAGE_MIME_TYPES) in core/utils.py, integrated into gdrive.get_drive_file_content to return extracted PDF text or base64-encoded images; adds pypdf dependency, tests, and a test PDF helper.

Changes

Cohort / File(s)	Summary
PDF & Image Utilities `core/utils.py`	Added `IMAGE_MIME_TYPES` set, `extract_pdf_text(file_bytes)` using `pypdf.PdfReader` to concatenate per-page text (returns `None` on empty/error), and `encode_image_content(file_bytes, mime_type)` returning `[base64_image:{mime_type}]{base64}`.
Drive Tools Integration `gdrive/drive_tools.py`	Updated `get_drive_file_content()` to offload Office XML extraction to `asyncio.to_thread(...)`, attempt PDF text extraction via `extract_pdf_text` with fallback message pointing to `get_drive_file_download_url` when empty, and return base64-encoded images for MIME types in `IMAGE_MIME_TYPES`.
Dependencies & Test Config `pyproject.toml`	Added runtime dependency `pypdf>=6.8.0` and configured pytest `pythonpath = ["."]`.
Core Utility Tests `tests/core/test_pdf_image_utils.py`	New unit tests for `extract_pdf_text()` (valid PDF -> text, corrupted/blank -> `None`), `encode_image_content()` (PNG/JPEG round-trip, non-image MIME raises `ValueError`), and `IMAGE_MIME_TYPES` membership checks.
Drive Integration Tests `tests/gdrive/test_drive_file_content.py`	Async tests mocking `resolve_drive_item` and `MediaIoBaseDownload` to validate PDF extraction (including blank PDF fallback) and image base64 encoding returned by `get_drive_file_content()`.
Test Helpers `tests/helpers.py`	Added `_make_minimal_pdf(text="Hello World") -> bytes` building a one-page in-memory PDF with `pypdf.PdfWriter` for tests.
Docs `README.md`	Updated Drive capability description to include Office, PDF, and images.
Schema Docs `tests/gcontacts/golden/...`, `tests/gdocs/golden/...`	Added/expanded descriptive `description` fields in JSON schema test fixtures (metadata-only changes).

Sequence Diagram

sequenceDiagram
    participant Client
    participant DriveTool as get_drive_file_content()
    participant Downloader as MediaIoBaseDownload
    participant PDFUtil as extract_pdf_text()
    participant ImgUtil as encode_image_content()

    Client->>DriveTool: request file content (file_id)
    DriveTool->>Downloader: download file bytes
    Downloader-->>DriveTool: file bytes

    alt mime == "application/pdf"
        DriveTool->>PDFUtil: extract_pdf_text(file bytes)
        PDFUtil-->>DriveTool: text or None
        alt text present
            DriveTool-->>Client: return extracted text
        else no text
            DriveTool-->>Client: return fallback message (use download URL)
        end
    else mime in IMAGE_MIME_TYPES
        DriveTool->>ImgUtil: encode_image_content(file bytes, mime)
        ImgUtil-->>DriveTool: "[base64_image:mime]..." string
        DriveTool-->>Client: return base64 image string
    else
        DriveTool-->>Client: existing Office/XML or UTF-8/binary behavior
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇
I nibble bytes in moonlit night,
PDFs whisper words to light,
Images curl in base64 dreams,
I hop and stitch them into streams,
Happy rabbit, files take flight!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Out of Scope Changes check	⚠️ Warning	The PR includes out-of-scope documentation updates to gcontacts and gdocs golden schema files that add descriptions to JSON schema properties unrelated to PDF/image handling objectives.	Remove changes to tests/gcontacts/golden/contacts_tool_schemas.json and tests/gdocs/golden/docs_tool_schemas.json, or clarify why schema documentation updates are necessary for this PDF/image feature.
Docstring Coverage	⚠️ Warning	Docstring coverage is 45.45% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main feature additions: PDF text extraction and image base64 encoding in get_drive_file_content.
Description check	✅ Passed	The description covers all required template sections: summary of changes, testing confirmation, and checklist items; it provides sufficient context and completion details.
Linked Issues check	✅ Passed	The PR fully addresses requirements from `#365` (PDF extraction via extract_pdf_text) and `#263` (image base64 encoding via encode_image_content) with corresponding tests and integration into get_drive_file_content.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (3)

tests/core/test_pdf_image_utils.py (1)

45-56: Good image encoding tests.

The PNG test includes base64 roundtrip verification. Consider adding the same roundtrip assertion to test_encode_image_content_jpeg for consistency:

encoded_part = result[len("[base64_image:image/jpeg]"):]
assert base64.b64decode(encoded_part) == raw

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/core/test_pdf_image_utils.py` around lines 45 - 56, The JPEG test
test_encode_image_content_jpeg should perform the same base64 roundtrip
verification as the PNG test: after calling encode_image_content(raw,
"image/jpeg"), extract the encoded_part by slicing off the
"[base64_image:image/jpeg]" prefix and assert that
base64.b64decode(encoded_part) equals the original raw bytes; update
test_encode_image_content_jpeg to include this assertion to ensure the
encoding/decoding roundtrip for encode_image_content is validated.

tests/gdrive/test_drive_file_content.py (1)

9-12: Consider using pytest's pythonpath configuration.

The sys.path.insert manipulation works but is fragile. Consider adding pythonpath = ["."] to [tool.pytest.ini_options] in pyproject.toml instead.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/gdrive/test_drive_file_content.py` around lines 9 - 12, The test file
tests/gdrive/test_drive_file_content.py currently mutates sys.path via
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__),
"../.."))); remove that fragile path-manipulation and instead configure pytest
to set the import root by adding pythonpath = ["."] under
[tool.pytest.ini_options] in pyproject.toml so tests can import project modules
without runtime sys.path edits; update the test file to delete the
sys.path/os.path import and insertion lines and commit the pyproject.toml
change.

tests/helpers.py (1)

34-34: Private API usage may break on pypdf updates.

writer._add_object is an internal method (underscore prefix). While acceptable for test helpers, be aware this could break if pypdf's internals change.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/helpers.py` at line 34, The test helper uses the private API
writer._add_object (page[NameObject("/Contents")] = writer._add_object(stream));
replace this direct private call with a safe public fallback: prefer calling
writer.add_object(stream) if present, and only fall back to
writer._add_object(stream) when add_object is unavailable (use getattr to
detect), and emit a warning in that fallback path so tests make the dependency
explicit; update the assignment to use the chosen function instead of calling
_add_object directly.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/core/test_pdf_image_utils.py`:
- Around line 45-56: The JPEG test test_encode_image_content_jpeg should perform
the same base64 roundtrip verification as the PNG test: after calling
encode_image_content(raw, "image/jpeg"), extract the encoded_part by slicing off
the "[base64_image:image/jpeg]" prefix and assert that
base64.b64decode(encoded_part) equals the original raw bytes; update
test_encode_image_content_jpeg to include this assertion to ensure the
encoding/decoding roundtrip for encode_image_content is validated.

In `@tests/gdrive/test_drive_file_content.py`:
- Around line 9-12: The test file tests/gdrive/test_drive_file_content.py
currently mutates sys.path via sys.path.insert(0,
os.path.abspath(os.path.join(os.path.dirname(__file__), "../.."))); remove that
fragile path-manipulation and instead configure pytest to set the import root by
adding pythonpath = ["."] under [tool.pytest.ini_options] in pyproject.toml so
tests can import project modules without runtime sys.path edits; update the test
file to delete the sys.path/os.path import and insertion lines and commit the
pyproject.toml change.

In `@tests/helpers.py`:
- Line 34: The test helper uses the private API writer._add_object
(page[NameObject("/Contents")] = writer._add_object(stream)); replace this
direct private call with a safe public fallback: prefer calling
writer.add_object(stream) if present, and only fall back to
writer._add_object(stream) when add_object is unavailable (use getattr to
detect), and emit a warning in that fallback path so tests make the dependency
explicit; update the assignment to use the chosen function instead of calling
_add_object directly.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fd312cdc-d395-4eab-9898-c05c0911d6be

📥 Commits

Reviewing files that changed from the base of the PR and between a7a0152 and c806f00.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (6)

core/utils.py
gdrive/drive_tools.py
pyproject.toml
tests/core/test_pdf_image_utils.py
tests/gdrive/test_drive_file_content.py
tests/helpers.py

…n#659, pending upstream)

taylorwilsdon · 2026-04-15T00:34:13Z

@coderabbitai review

coderabbitai · 2026-04-15T00:34:19Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

taylorwilsdon · 2026-04-15T00:34:24Z

Thanks, will get this merged today!

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

core/utils.py (1)

469-475: Add a defensive MIME-type guard in encode_image_content.

Consider rejecting non-image MIME types here to prevent accidental mislabeled payloads if future callers bypass pre-checks.

Proposed diff

 def encode_image_content(file_bytes: bytes, mime_type: str) -> str:
     """
     Base64-encode image bytes with a mime type metadata prefix.
     """
+    if mime_type not in IMAGE_MIME_TYPES:
+        raise ValueError(f"Unsupported image MIME type: {mime_type}")
     encoded = base64.b64encode(file_bytes).decode("ascii")
     return f"[base64_image:{mime_type}]{encoded}"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@core/utils.py` around lines 469 - 475, The encode_image_content function
currently base64-encodes any bytes with whatever mime_type is passed; add a
defensive guard that rejects non-image MIME types (e.g., ensure
mime_type.startswith("image/")) and raise a clear ValueError if the check fails
so mislabeled payloads are prevented. Locate encode_image_content and before
base64 encoding validate mime_type, returning/raising on invalid values (include
a helpful message like "expected image/* mime type, got {mime_type}"). Keep the
existing return format when the mime type is valid.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@core/utils.py`:
- Around line 447-467: The extract_pdf_text function is being called
synchronously inside an async context (e.g., from the async
get_drive_file_content flow), which can block the event loop; update callers to
offload extract_pdf_text(file_bytes) to a thread by using asyncio.to_thread(...)
or loop.run_in_executor(None, extract_pdf_text, file_bytes). Locate calls to
extract_pdf_text (reference function name) and replace the direct call with
await asyncio.to_thread(extract_pdf_text, file_bytes) (or equivalent
run_in_executor usage), preserving error handling and return semantics.

---

Nitpick comments:
In `@core/utils.py`:
- Around line 469-475: The encode_image_content function currently
base64-encodes any bytes with whatever mime_type is passed; add a defensive
guard that rejects non-image MIME types (e.g., ensure
mime_type.startswith("image/")) and raise a clear ValueError if the check fails
so mislabeled payloads are prevented. Locate encode_image_content and before
base64 encoding validate mime_type, returning/raising on invalid values (include
a helpful message like "expected image/* mime type, got {mime_type}"). Keep the
existing return format when the mime type is valid.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c37c676f-b11f-4ffe-a13c-19742efe010a

📥 Commits

Reviewing files that changed from the base of the PR and between c806f00 and 052de9b.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (7)

README.md
core/utils.py
gdrive/drive_tools.py
pyproject.toml
tests/core/test_pdf_image_utils.py
tests/gdrive/test_drive_file_content.py
tests/helpers.py

✅ Files skipped from review due to trivial changes (3)

README.md
tests/helpers.py
tests/core/test_pdf_image_utils.py

🚧 Files skipped from review as they are similar to previous changes (3)

pyproject.toml
tests/gdrive/test_drive_file_content.py
gdrive/drive_tools.py

taylorwilsdon · 2026-04-21T01:22:09Z

@chagel pypdf has proven to be a security minefield so far - any alternatives we could use for the same purpose?

chagel · 2026-04-21T02:52:56Z

@taylorwilsdon Thanks for the heads up. I looked through the advisory list too. ~17 CVEs since Feb, all medium.

Two paths:

Stay on pypdf, harden the call site. Bump to >=6.10.2 and run extract_pdf_text in a subprocess.
Switch to pypdfium2. Bindings to Chrome's PDFium, Apache-2.0 / BSD-3. It's read-only, so we'd keep pypdf as a test-only dep for fixtures.

I lean (2) + a lighter version of (1). Happy to open a follow-up PR either way. Preference?

chagel and others added 2 commits April 4, 2026 15:44

chagel mentioned this pull request Apr 4, 2026

feat: add PDF text extraction and image passthrough to get_drive_file_content #557

Closed

5 tasks

coderabbitai Bot reviewed Apr 4, 2026

View reviewed changes

chagel added a commit to pipihosting/google_workspace_mcp that referenced this pull request Apr 4, 2026

merge: add PDF text extraction and image passthrough (PR taylorwilsdo…

7501915

…n#659, pending upstream)

taylorwilsdon self-assigned this Apr 6, 2026

taylorwilsdon added the enhancement New feature or request label Apr 6, 2026

taylorwilsdon added 2 commits April 14, 2026 20:15

merge

b790322

merge

052de9b

coderabbitai Bot reviewed Apr 15, 2026

View reviewed changes

Comment thread core/utils.py

refac

b5946c9

taylorwilsdon merged commit 393bc4b into taylorwilsdon:main Apr 15, 2026
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add PDF text extraction and image passthrough to get_drive_file_content#659

feat: add PDF text extraction and image passthrough to get_drive_file_content#659
taylorwilsdon merged 5 commits into
taylorwilsdon:mainfrom
chagel:mgc/pip-1242-add-pdf-image-support

chagel commented Apr 4, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 4, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

taylorwilsdon commented Apr 15, 2026

Uh oh!

coderabbitai Bot commented Apr 15, 2026

Uh oh!

taylorwilsdon commented Apr 15, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

taylorwilsdon commented Apr 21, 2026

Uh oh!

chagel commented Apr 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

chagel commented Apr 4, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

taylorwilsdon commented Apr 15, 2026

Uh oh!

coderabbitai Bot commented Apr 15, 2026

Uh oh!

taylorwilsdon commented Apr 15, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

taylorwilsdon commented Apr 21, 2026

Uh oh!

chagel commented Apr 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

chagel commented Apr 4, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 4, 2026 •

edited

Loading