Skip to content

feat: add PDF text extraction and image passthrough to get_drive_file_content#659

Merged
taylorwilsdon merged 5 commits into
taylorwilsdon:mainfrom
chagel:mgc/pip-1242-add-pdf-image-support
Apr 15, 2026
Merged

feat: add PDF text extraction and image passthrough to get_drive_file_content#659
taylorwilsdon merged 5 commits into
taylorwilsdon:mainfrom
chagel:mgc/pip-1242-add-pdf-image-support

Conversation

@chagel
Copy link
Copy Markdown
Contributor

@chagel chagel commented Apr 4, 2026

Summary

  • Add pypdf dependency and extract_pdf_text() to extract readable text from PDF files instead of returning a binary placeholder
  • Add encode_image_content() to base64-encode images (png, jpeg, gif, webp, bmp, tiff, svg) with mime type metadata for multimodal consumption
  • Update get_drive_file_content to handle PDF and image mime types before the generic UTF-8 fallback; scanned/image-only PDFs fall back to a message suggesting get_drive_file_download_url
  • 9 new tests covering unit and integration scenarios
  • Rebased on current main (resolves uv.lock conflicts from feat: add PDF text extraction and image passthrough to get_drive_file_content #557)

Closes #365
Closes #263
Supersedes #557 (re-submitted from personal fork to enable maintainer edits)

Test plan

  • extract_pdf_text — valid PDF, corrupted input, empty/blank PDF
  • encode_image_content — PNG and JPEG encoding
  • get_drive_file_content integration — PDF with text, empty PDF fallback, image passthrough
  • Full test suite passes (584 tests)
  • Ruff lint and format clean

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Extract text from PDF files retrieved from Google Drive; returns a clear fallback when no text is found
    • Encode and return image file content with MIME metadata (base64-embedded)
  • Tests

    • Added tests covering PDF text extraction, image encoding/validation, and recognized image MIME types for Drive file handling
  • Documentation

    • Updated Drive tool docs to list Office, PDF, and image support

chagel and others added 2 commits April 4, 2026 15:44
…_content

PDF files are now parsed with pypdf to extract readable text instead of
returning a binary placeholder. Image files (png, jpeg, gif, webp, etc.)
are base64-encoded with mime type metadata for multimodal consumption.
Scanned/image-only PDFs gracefully fall back to a message suggesting
get_drive_file_download_url.

Closes PIP-1242

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Bump pypdf>=4.0.0 to >=6.8.0 (excludes DoS CVEs)
- Remove unused init_side_effect and _mock_download dead code
- Extract duplicated _make_minimal_pdf to shared tests/helpers.py
- Move io import to module level in test_pdf_image_utils.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 4, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 25ddf293-e2e4-43e8-9e42-c2d2d9e12071

📥 Commits

Reviewing files that changed from the base of the PR and between 052de9b and b5946c9.

📒 Files selected for processing (5)
  • core/utils.py
  • gdrive/drive_tools.py
  • tests/core/test_pdf_image_utils.py
  • tests/gcontacts/golden/contacts_tool_schemas.json
  • tests/gdocs/golden/docs_tool_schemas.json
✅ Files skipped from review due to trivial changes (3)
  • tests/gcontacts/golden/contacts_tool_schemas.json
  • tests/gdocs/golden/docs_tool_schemas.json
  • tests/core/test_pdf_image_utils.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • gdrive/drive_tools.py
  • core/utils.py

📝 Walkthrough

Walkthrough

Adds PDF text extraction and image encoding: new utilities (extract_pdf_text, encode_image_content, IMAGE_MIME_TYPES) in core/utils.py, integrated into gdrive.get_drive_file_content to return extracted PDF text or base64-encoded images; adds pypdf dependency, tests, and a test PDF helper.

Changes

Cohort / File(s) Summary
PDF & Image Utilities
core/utils.py
Added IMAGE_MIME_TYPES set, extract_pdf_text(file_bytes) using pypdf.PdfReader to concatenate per-page text (returns None on empty/error), and encode_image_content(file_bytes, mime_type) returning [base64_image:{mime_type}]{base64}.
Drive Tools Integration
gdrive/drive_tools.py
Updated get_drive_file_content() to offload Office XML extraction to asyncio.to_thread(...), attempt PDF text extraction via extract_pdf_text with fallback message pointing to get_drive_file_download_url when empty, and return base64-encoded images for MIME types in IMAGE_MIME_TYPES.
Dependencies & Test Config
pyproject.toml
Added runtime dependency pypdf>=6.8.0 and configured pytest pythonpath = ["."].
Core Utility Tests
tests/core/test_pdf_image_utils.py
New unit tests for extract_pdf_text() (valid PDF -> text, corrupted/blank -> None), encode_image_content() (PNG/JPEG round-trip, non-image MIME raises ValueError), and IMAGE_MIME_TYPES membership checks.
Drive Integration Tests
tests/gdrive/test_drive_file_content.py
Async tests mocking resolve_drive_item and MediaIoBaseDownload to validate PDF extraction (including blank PDF fallback) and image base64 encoding returned by get_drive_file_content().
Test Helpers
tests/helpers.py
Added _make_minimal_pdf(text="Hello World") -> bytes building a one-page in-memory PDF with pypdf.PdfWriter for tests.
Docs
README.md
Updated Drive capability description to include Office, PDF, and images.
Schema Docs
tests/gcontacts/golden/..., tests/gdocs/golden/...
Added/expanded descriptive description fields in JSON schema test fixtures (metadata-only changes).

Sequence Diagram

sequenceDiagram
    participant Client
    participant DriveTool as get_drive_file_content()
    participant Downloader as MediaIoBaseDownload
    participant PDFUtil as extract_pdf_text()
    participant ImgUtil as encode_image_content()

    Client->>DriveTool: request file content (file_id)
    DriveTool->>Downloader: download file bytes
    Downloader-->>DriveTool: file bytes

    alt mime == "application/pdf"
        DriveTool->>PDFUtil: extract_pdf_text(file bytes)
        PDFUtil-->>DriveTool: text or None
        alt text present
            DriveTool-->>Client: return extracted text
        else no text
            DriveTool-->>Client: return fallback message (use download URL)
        end
    else mime in IMAGE_MIME_TYPES
        DriveTool->>ImgUtil: encode_image_content(file bytes, mime)
        ImgUtil-->>DriveTool: "[base64_image:mime]..." string
        DriveTool-->>Client: return base64 image string
    else
        DriveTool-->>Client: existing Office/XML or UTF-8/binary behavior
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇
I nibble bytes in moonlit night,
PDFs whisper words to light,
Images curl in base64 dreams,
I hop and stitch them into streams,
Happy rabbit, files take flight!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Out of Scope Changes check ⚠️ Warning The PR includes out-of-scope documentation updates to gcontacts and gdocs golden schema files that add descriptions to JSON schema properties unrelated to PDF/image handling objectives. Remove changes to tests/gcontacts/golden/contacts_tool_schemas.json and tests/gdocs/golden/docs_tool_schemas.json, or clarify why schema documentation updates are necessary for this PDF/image feature.
Docstring Coverage ⚠️ Warning Docstring coverage is 45.45% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main feature additions: PDF text extraction and image base64 encoding in get_drive_file_content.
Description check ✅ Passed The description covers all required template sections: summary of changes, testing confirmation, and checklist items; it provides sufficient context and completion details.
Linked Issues check ✅ Passed The PR fully addresses requirements from #365 (PDF extraction via extract_pdf_text) and #263 (image base64 encoding via encode_image_content) with corresponding tests and integration into get_drive_file_content.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
tests/core/test_pdf_image_utils.py (1)

45-56: Good image encoding tests.

The PNG test includes base64 roundtrip verification. Consider adding the same roundtrip assertion to test_encode_image_content_jpeg for consistency:

encoded_part = result[len("[base64_image:image/jpeg]"):]
assert base64.b64decode(encoded_part) == raw
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/core/test_pdf_image_utils.py` around lines 45 - 56, The JPEG test
test_encode_image_content_jpeg should perform the same base64 roundtrip
verification as the PNG test: after calling encode_image_content(raw,
"image/jpeg"), extract the encoded_part by slicing off the
"[base64_image:image/jpeg]" prefix and assert that
base64.b64decode(encoded_part) equals the original raw bytes; update
test_encode_image_content_jpeg to include this assertion to ensure the
encoding/decoding roundtrip for encode_image_content is validated.
tests/gdrive/test_drive_file_content.py (1)

9-12: Consider using pytest's pythonpath configuration.

The sys.path.insert manipulation works but is fragile. Consider adding pythonpath = ["."] to [tool.pytest.ini_options] in pyproject.toml instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gdrive/test_drive_file_content.py` around lines 9 - 12, The test file
tests/gdrive/test_drive_file_content.py currently mutates sys.path via
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__),
"../.."))); remove that fragile path-manipulation and instead configure pytest
to set the import root by adding pythonpath = ["."] under
[tool.pytest.ini_options] in pyproject.toml so tests can import project modules
without runtime sys.path edits; update the test file to delete the
sys.path/os.path import and insertion lines and commit the pyproject.toml
change.
tests/helpers.py (1)

34-34: Private API usage may break on pypdf updates.

writer._add_object is an internal method (underscore prefix). While acceptable for test helpers, be aware this could break if pypdf's internals change.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/helpers.py` at line 34, The test helper uses the private API
writer._add_object (page[NameObject("/Contents")] = writer._add_object(stream));
replace this direct private call with a safe public fallback: prefer calling
writer.add_object(stream) if present, and only fall back to
writer._add_object(stream) when add_object is unavailable (use getattr to
detect), and emit a warning in that fallback path so tests make the dependency
explicit; update the assignment to use the chosen function instead of calling
_add_object directly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/core/test_pdf_image_utils.py`:
- Around line 45-56: The JPEG test test_encode_image_content_jpeg should perform
the same base64 roundtrip verification as the PNG test: after calling
encode_image_content(raw, "image/jpeg"), extract the encoded_part by slicing off
the "[base64_image:image/jpeg]" prefix and assert that
base64.b64decode(encoded_part) equals the original raw bytes; update
test_encode_image_content_jpeg to include this assertion to ensure the
encoding/decoding roundtrip for encode_image_content is validated.

In `@tests/gdrive/test_drive_file_content.py`:
- Around line 9-12: The test file tests/gdrive/test_drive_file_content.py
currently mutates sys.path via sys.path.insert(0,
os.path.abspath(os.path.join(os.path.dirname(__file__), "../.."))); remove that
fragile path-manipulation and instead configure pytest to set the import root by
adding pythonpath = ["."] under [tool.pytest.ini_options] in pyproject.toml so
tests can import project modules without runtime sys.path edits; update the test
file to delete the sys.path/os.path import and insertion lines and commit the
pyproject.toml change.

In `@tests/helpers.py`:
- Line 34: The test helper uses the private API writer._add_object
(page[NameObject("/Contents")] = writer._add_object(stream)); replace this
direct private call with a safe public fallback: prefer calling
writer.add_object(stream) if present, and only fall back to
writer._add_object(stream) when add_object is unavailable (use getattr to
detect), and emit a warning in that fallback path so tests make the dependency
explicit; update the assignment to use the chosen function instead of calling
_add_object directly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fd312cdc-d395-4eab-9898-c05c0911d6be

📥 Commits

Reviewing files that changed from the base of the PR and between a7a0152 and c806f00.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
  • core/utils.py
  • gdrive/drive_tools.py
  • pyproject.toml
  • tests/core/test_pdf_image_utils.py
  • tests/gdrive/test_drive_file_content.py
  • tests/helpers.py

chagel added a commit to pipihosting/google_workspace_mcp that referenced this pull request Apr 4, 2026
@taylorwilsdon taylorwilsdon self-assigned this Apr 6, 2026
@taylorwilsdon taylorwilsdon added the enhancement New feature or request label Apr 6, 2026
@taylorwilsdon
Copy link
Copy Markdown
Owner

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 15, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@taylorwilsdon
Copy link
Copy Markdown
Owner

Thanks, will get this merged today!

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
core/utils.py (1)

469-475: Add a defensive MIME-type guard in encode_image_content.

Consider rejecting non-image MIME types here to prevent accidental mislabeled payloads if future callers bypass pre-checks.

Proposed diff
 def encode_image_content(file_bytes: bytes, mime_type: str) -> str:
     """
     Base64-encode image bytes with a mime type metadata prefix.
     """
+    if mime_type not in IMAGE_MIME_TYPES:
+        raise ValueError(f"Unsupported image MIME type: {mime_type}")
     encoded = base64.b64encode(file_bytes).decode("ascii")
     return f"[base64_image:{mime_type}]{encoded}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/utils.py` around lines 469 - 475, The encode_image_content function
currently base64-encodes any bytes with whatever mime_type is passed; add a
defensive guard that rejects non-image MIME types (e.g., ensure
mime_type.startswith("image/")) and raise a clear ValueError if the check fails
so mislabeled payloads are prevented. Locate encode_image_content and before
base64 encoding validate mime_type, returning/raising on invalid values (include
a helpful message like "expected image/* mime type, got {mime_type}"). Keep the
existing return format when the mime type is valid.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@core/utils.py`:
- Around line 447-467: The extract_pdf_text function is being called
synchronously inside an async context (e.g., from the async
get_drive_file_content flow), which can block the event loop; update callers to
offload extract_pdf_text(file_bytes) to a thread by using asyncio.to_thread(...)
or loop.run_in_executor(None, extract_pdf_text, file_bytes). Locate calls to
extract_pdf_text (reference function name) and replace the direct call with
await asyncio.to_thread(extract_pdf_text, file_bytes) (or equivalent
run_in_executor usage), preserving error handling and return semantics.

---

Nitpick comments:
In `@core/utils.py`:
- Around line 469-475: The encode_image_content function currently
base64-encodes any bytes with whatever mime_type is passed; add a defensive
guard that rejects non-image MIME types (e.g., ensure
mime_type.startswith("image/")) and raise a clear ValueError if the check fails
so mislabeled payloads are prevented. Locate encode_image_content and before
base64 encoding validate mime_type, returning/raising on invalid values (include
a helpful message like "expected image/* mime type, got {mime_type}"). Keep the
existing return format when the mime type is valid.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c37c676f-b11f-4ffe-a13c-19742efe010a

📥 Commits

Reviewing files that changed from the base of the PR and between c806f00 and 052de9b.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • README.md
  • core/utils.py
  • gdrive/drive_tools.py
  • pyproject.toml
  • tests/core/test_pdf_image_utils.py
  • tests/gdrive/test_drive_file_content.py
  • tests/helpers.py
✅ Files skipped from review due to trivial changes (3)
  • README.md
  • tests/helpers.py
  • tests/core/test_pdf_image_utils.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • pyproject.toml
  • tests/gdrive/test_drive_file_content.py
  • gdrive/drive_tools.py

Comment thread core/utils.py
@taylorwilsdon taylorwilsdon merged commit 393bc4b into taylorwilsdon:main Apr 15, 2026
7 checks passed
@taylorwilsdon
Copy link
Copy Markdown
Owner

@chagel pypdf has proven to be a security minefield so far - any alternatives we could use for the same purpose?

@chagel
Copy link
Copy Markdown
Contributor Author

chagel commented Apr 21, 2026

@taylorwilsdon Thanks for the heads up. I looked through the advisory list too. ~17 CVEs since Feb, all medium.

Two paths:

  1. Stay on pypdf, harden the call site. Bump to >=6.10.2 and run extract_pdf_text in a subprocess.
  2. Switch to pypdfium2. Bindings to Chrome's PDFium, Apache-2.0 / BSD-3. It's read-only, so we'd keep pypdf as a test-only dep for fixtures.

I lean (2) + a lighter version of (1). Happy to open a follow-up PR either way. Preference?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

2 participants