feat: add PDF text extraction and image passthrough to get_drive_file_content#659
Conversation
…_content PDF files are now parsed with pypdf to extract readable text instead of returning a binary placeholder. Image files (png, jpeg, gif, webp, etc.) are base64-encoded with mime type metadata for multimodal consumption. Scanned/image-only PDFs gracefully fall back to a message suggesting get_drive_file_download_url. Closes PIP-1242 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Bump pypdf>=4.0.0 to >=6.8.0 (excludes DoS CVEs) - Remove unused init_side_effect and _mock_download dead code - Extract duplicated _make_minimal_pdf to shared tests/helpers.py - Move io import to module level in test_pdf_image_utils.py Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (5)
✅ Files skipped from review due to trivial changes (3)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughAdds PDF text extraction and image encoding: new utilities ( Changes
Sequence DiagramsequenceDiagram
participant Client
participant DriveTool as get_drive_file_content()
participant Downloader as MediaIoBaseDownload
participant PDFUtil as extract_pdf_text()
participant ImgUtil as encode_image_content()
Client->>DriveTool: request file content (file_id)
DriveTool->>Downloader: download file bytes
Downloader-->>DriveTool: file bytes
alt mime == "application/pdf"
DriveTool->>PDFUtil: extract_pdf_text(file bytes)
PDFUtil-->>DriveTool: text or None
alt text present
DriveTool-->>Client: return extracted text
else no text
DriveTool-->>Client: return fallback message (use download URL)
end
else mime in IMAGE_MIME_TYPES
DriveTool->>ImgUtil: encode_image_content(file bytes, mime)
ImgUtil-->>DriveTool: "[base64_image:mime]..." string
DriveTool-->>Client: return base64 image string
else
DriveTool-->>Client: existing Office/XML or UTF-8/binary behavior
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (3)
tests/core/test_pdf_image_utils.py (1)
45-56: Good image encoding tests.The PNG test includes base64 roundtrip verification. Consider adding the same roundtrip assertion to
test_encode_image_content_jpegfor consistency:encoded_part = result[len("[base64_image:image/jpeg]"):] assert base64.b64decode(encoded_part) == raw🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/core/test_pdf_image_utils.py` around lines 45 - 56, The JPEG test test_encode_image_content_jpeg should perform the same base64 roundtrip verification as the PNG test: after calling encode_image_content(raw, "image/jpeg"), extract the encoded_part by slicing off the "[base64_image:image/jpeg]" prefix and assert that base64.b64decode(encoded_part) equals the original raw bytes; update test_encode_image_content_jpeg to include this assertion to ensure the encoding/decoding roundtrip for encode_image_content is validated.tests/gdrive/test_drive_file_content.py (1)
9-12: Consider using pytest's pythonpath configuration.The
sys.path.insertmanipulation works but is fragile. Consider addingpythonpath = ["."]to[tool.pytest.ini_options]inpyproject.tomlinstead.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gdrive/test_drive_file_content.py` around lines 9 - 12, The test file tests/gdrive/test_drive_file_content.py currently mutates sys.path via sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "../.."))); remove that fragile path-manipulation and instead configure pytest to set the import root by adding pythonpath = ["."] under [tool.pytest.ini_options] in pyproject.toml so tests can import project modules without runtime sys.path edits; update the test file to delete the sys.path/os.path import and insertion lines and commit the pyproject.toml change.tests/helpers.py (1)
34-34: Private API usage may break on pypdf updates.
writer._add_objectis an internal method (underscore prefix). While acceptable for test helpers, be aware this could break if pypdf's internals change.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/helpers.py` at line 34, The test helper uses the private API writer._add_object (page[NameObject("/Contents")] = writer._add_object(stream)); replace this direct private call with a safe public fallback: prefer calling writer.add_object(stream) if present, and only fall back to writer._add_object(stream) when add_object is unavailable (use getattr to detect), and emit a warning in that fallback path so tests make the dependency explicit; update the assignment to use the chosen function instead of calling _add_object directly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tests/core/test_pdf_image_utils.py`:
- Around line 45-56: The JPEG test test_encode_image_content_jpeg should perform
the same base64 roundtrip verification as the PNG test: after calling
encode_image_content(raw, "image/jpeg"), extract the encoded_part by slicing off
the "[base64_image:image/jpeg]" prefix and assert that
base64.b64decode(encoded_part) equals the original raw bytes; update
test_encode_image_content_jpeg to include this assertion to ensure the
encoding/decoding roundtrip for encode_image_content is validated.
In `@tests/gdrive/test_drive_file_content.py`:
- Around line 9-12: The test file tests/gdrive/test_drive_file_content.py
currently mutates sys.path via sys.path.insert(0,
os.path.abspath(os.path.join(os.path.dirname(__file__), "../.."))); remove that
fragile path-manipulation and instead configure pytest to set the import root by
adding pythonpath = ["."] under [tool.pytest.ini_options] in pyproject.toml so
tests can import project modules without runtime sys.path edits; update the test
file to delete the sys.path/os.path import and insertion lines and commit the
pyproject.toml change.
In `@tests/helpers.py`:
- Line 34: The test helper uses the private API writer._add_object
(page[NameObject("/Contents")] = writer._add_object(stream)); replace this
direct private call with a safe public fallback: prefer calling
writer.add_object(stream) if present, and only fall back to
writer._add_object(stream) when add_object is unavailable (use getattr to
detect), and emit a warning in that fallback path so tests make the dependency
explicit; update the assignment to use the chosen function instead of calling
_add_object directly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: fd312cdc-d395-4eab-9898-c05c0911d6be
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (6)
core/utils.pygdrive/drive_tools.pypyproject.tomltests/core/test_pdf_image_utils.pytests/gdrive/test_drive_file_content.pytests/helpers.py
…n#659, pending upstream)
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
|
Thanks, will get this merged today! |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
core/utils.py (1)
469-475: Add a defensive MIME-type guard inencode_image_content.Consider rejecting non-image MIME types here to prevent accidental mislabeled payloads if future callers bypass pre-checks.
Proposed diff
def encode_image_content(file_bytes: bytes, mime_type: str) -> str: """ Base64-encode image bytes with a mime type metadata prefix. """ + if mime_type not in IMAGE_MIME_TYPES: + raise ValueError(f"Unsupported image MIME type: {mime_type}") encoded = base64.b64encode(file_bytes).decode("ascii") return f"[base64_image:{mime_type}]{encoded}"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/utils.py` around lines 469 - 475, The encode_image_content function currently base64-encodes any bytes with whatever mime_type is passed; add a defensive guard that rejects non-image MIME types (e.g., ensure mime_type.startswith("image/")) and raise a clear ValueError if the check fails so mislabeled payloads are prevented. Locate encode_image_content and before base64 encoding validate mime_type, returning/raising on invalid values (include a helpful message like "expected image/* mime type, got {mime_type}"). Keep the existing return format when the mime type is valid.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@core/utils.py`:
- Around line 447-467: The extract_pdf_text function is being called
synchronously inside an async context (e.g., from the async
get_drive_file_content flow), which can block the event loop; update callers to
offload extract_pdf_text(file_bytes) to a thread by using asyncio.to_thread(...)
or loop.run_in_executor(None, extract_pdf_text, file_bytes). Locate calls to
extract_pdf_text (reference function name) and replace the direct call with
await asyncio.to_thread(extract_pdf_text, file_bytes) (or equivalent
run_in_executor usage), preserving error handling and return semantics.
---
Nitpick comments:
In `@core/utils.py`:
- Around line 469-475: The encode_image_content function currently
base64-encodes any bytes with whatever mime_type is passed; add a defensive
guard that rejects non-image MIME types (e.g., ensure
mime_type.startswith("image/")) and raise a clear ValueError if the check fails
so mislabeled payloads are prevented. Locate encode_image_content and before
base64 encoding validate mime_type, returning/raising on invalid values (include
a helpful message like "expected image/* mime type, got {mime_type}"). Keep the
existing return format when the mime type is valid.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c37c676f-b11f-4ffe-a13c-19742efe010a
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (7)
README.mdcore/utils.pygdrive/drive_tools.pypyproject.tomltests/core/test_pdf_image_utils.pytests/gdrive/test_drive_file_content.pytests/helpers.py
✅ Files skipped from review due to trivial changes (3)
- README.md
- tests/helpers.py
- tests/core/test_pdf_image_utils.py
🚧 Files skipped from review as they are similar to previous changes (3)
- pyproject.toml
- tests/gdrive/test_drive_file_content.py
- gdrive/drive_tools.py
|
@chagel pypdf has proven to be a security minefield so far - any alternatives we could use for the same purpose? |
|
@taylorwilsdon Thanks for the heads up. I looked through the advisory list too. ~17 CVEs since Feb, all medium. Two paths:
I lean (2) + a lighter version of (1). Happy to open a follow-up PR either way. Preference? |
Summary
pypdfdependency andextract_pdf_text()to extract readable text from PDF files instead of returning a binary placeholderencode_image_content()to base64-encode images (png, jpeg, gif, webp, bmp, tiff, svg) with mime type metadata for multimodal consumptionget_drive_file_contentto handle PDF and image mime types before the generic UTF-8 fallback; scanned/image-only PDFs fall back to a message suggestingget_drive_file_download_urlCloses #365
Closes #263
Supersedes #557 (re-submitted from personal fork to enable maintainer edits)
Test plan
extract_pdf_text— valid PDF, corrupted input, empty/blank PDFencode_image_content— PNG and JPEG encodingget_drive_file_contentintegration — PDF with text, empty PDF fallback, image passthrough🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Tests
Documentation