fix(cdk): upgrade unstructured from 0.10.27 to 0.18.32 #908
Ryan Waskewich (rwask) wants to merge 8 commits into main from devin/1771425511-bump-unstructured-to-latest
Conversation
Co-Authored-By: Ryan Waskewich <ryan.waskewich@airbyte.io>
🤖 Devin AI Engineer: I'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
Co-Authored-By: Ryan Waskewich <ryan.waskewich@airbyte.io>
👋 Greetings, Airbyte Team Member! Here are some helpful tips and reminders for your convenience.
💡 Show Tips and Tricks
Testing This CDK Version
You can test this version of the CDK using the following:
# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1771425511-bump-unstructured-to-latest#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1771425511-bump-unstructured-to-latest
PR Slash Commands
Airbyte Maintainers can execute the following slash commands on your PR:
…ck behavior Co-Authored-By: Ryan Waskewich <ryan.waskewich@airbyte.io>
📝 Walkthrough
Refactors unstructured file parsing to use per-filetype availability checks and lazy PDF import failure handling, changes filetype detection order (MIME → extension → content), and updates multipart upload MIME usage.
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Would you like a simple Mermaid sequence diagram showing the new detection and per-filetype availability flow? wdyt?
🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Co-Authored-By: Ryan Waskewich <ryan.waskewich@airbyte.io>
🧹 Nitpick comments (1)
airbyte_cdk/sources/file_based/file_types/unstructured_parser.py (1)
420-427: `detect_filetype(file_path=...)` with a remote URI — handled by try/except.
Since `remote_file.uri` is a remote path (e.g., `s3://...`), `detect_filetype` will likely fail trying to access it locally. The broad `except Exception: pass` catches this gracefully and falls through to extension-based detection. This works, but the silent swallowing of all exceptions could hide unexpected failures. Would it be worth narrowing to `except (FileNotFoundError, OSError)` to surface truly unexpected errors, wdyt?
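A self-contained sketch of the narrowed handling suggested above — `detect_filetype` here is a hypothetical stand-in stub for the real unstructured call, and `detect_from_uri` is an illustrative name, not the PR's code:

```python
import os

def detect_filetype(file_path: str):
    # Hypothetical stand-in for unstructured's detect_filetype: accessing a
    # remote URI as a local path fails the same way a missing file would.
    if not os.path.exists(file_path):
        raise FileNotFoundError(file_path)
    return "detected"

def detect_from_uri(uri: str):
    """Only swallow local-path errors; anything unexpected propagates."""
    try:
        return detect_filetype(file_path=uri)
    except (FileNotFoundError, OSError):
        return None  # fall through to extension-based detection

print(detect_from_uri("s3://bucket/file.pdf"))  # → None
```

Because `FileNotFoundError` is already a subclass of `OSError`, `except OSError` alone would also work; the tuple just makes the intent explicit.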
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@airbyte_cdk/sources/file_based/file_types/unstructured_parser.py`:
- Around line 432-435: The current extension extraction using
remote_file.uri.split(".")[-1] is brittle for URIs with dots in directory names
(e.g., "s3://bucket/folder.name/file") and can return incorrect values; update
the logic that computes extension (the lines assigning extension and calling
FileType.from_extension) to parse only the path portion of the URI and then use
os.path.splitext or pathlib.PurePosixPath to get the suffix, e.g., obtain the
path via urllib.parse.urlparse(remote_file.uri).path (or strip any
query/fragment), call os.path.splitext or PurePosixPath(path).suffix to get a
single leading dot extension (lowercased), then pass that to
FileType.from_extension and keep the existing return behavior.
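The URI-parsing fix described above can be sketched with only the standard library; the helper name `extension_from_uri` is illustrative, not from the PR:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def extension_from_uri(uri: str) -> str:
    """Extract a lowercased extension (with leading dot) from a URI,
    ignoring dots in directory names and any query/fragment."""
    path = urlparse(uri).path  # keeps only the path portion of the URI
    return PurePosixPath(path).suffix.lower()

# Dots in directory names no longer leak into the extension:
print(extension_from_uri("s3://bucket/folder.name/file"))     # → ""
print(extension_from_uri("s3://bucket/folder.name/doc.PDF"))  # → ".pdf"
```

The result can then be passed to `FileType.from_extension` as the comment suggests, with the existing return behavior unchanged.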
…t changes Co-Authored-By: Ryan Waskewich <ryan.waskewich@airbyte.io>
…ompatibility Co-Authored-By: Ryan Waskewich <ryan.waskewich@airbyte.io>
Co-Authored-By: Ryan Waskewich <ryan.waskewich@airbyte.io>
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
unit_tests/sources/file_based/scenarios/unstructured_scenarios.py (2)
465-521: ⚠️ Potential issue | 🟡 Minor
`corrupted_file_scenario` no longer exercises corrupted-file handling — could it be reframed or split?
With the new lazy-import guard, PDF parsing now short-circuits to the `unstructured_inference`-missing error before the file bytes are ever read. This means `corrupted_file_scenario` and `simple_unstructured_scenario` both traverse the exact same code path for PDFs. The `"___ corrupted file ___"` bytes are completely irrelevant to the outcome, and this scenario provides zero additional coverage over the PDF case in `simple_unstructured_scenario`. Two options to consider — wdyt about either of these?
- Rename / reframe the scenario to something like `pdf_without_inference_scenario` to accurately describe what it's actually testing now.
- Add a companion scenario (guarded by a check that `unstructured_inference` is available) that validates the truly-corrupted-file error path — otherwise that branch is untested.
13-14: ⚠️ Potential issue | 🟡 Minor
Update NLTK resource names to match NLTK 3.9.1 compatibility.
The test file downloads `"punkt"` and `"averaged_perceptron_tagger"` (lines 13-14), but your production code in `airbyte_cdk/sources/file_based/file_types/unstructured_parser.py` already uses the NLTK 3.9+ resource names: `"punkt_tab"` and `"averaged_perceptron_tagger_eng"`. With NLTK 3.9.1 pinned in `poetry.lock`, consider updating the test file to match:

nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("averaged_perceptron_tagger_eng")

Or, sync the test setup with your production initialization pattern for consistency. The old resource names may download successfully but populate the wrong data directories, potentially causing lookup errors at test runtime. Wdyt?
Security Note: CVE-2025-64712 (CVSS 9.8)
if ext_type is not None:
    return ext_type
🟡 Missing `FileType.UNK` check for `FileType.from_extension` result, inconsistent with `from_mime_type` check
In `_get_filetype`, the `from_mime_type` call at line 410 correctly filters out `FileType.UNK` (`if ft is not None and ft != FileType.UNK`), but the `from_extension` call at lines 433-434 only checks for `None` (`if ext_type is not None`). If `FileType.from_extension` returns `FileType.UNK` for an unrecognized extension, it would be returned directly, bypassing the content-based detection fallback at line 437. This would cause files with uncommon or missing extensions to fail with an unsupported file type error, even though content-based detection would have correctly identified them. The old code (`if extension in EXT_TO_FILETYPE`) could never return `FileType.UNK` since it only returned values actually mapped in the dictionary.
Suggested change:
- if ext_type is not None:
-     return ext_type
+ if ext_type is not None and ext_type != FileType.UNK:
+     return ext_type
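The guard above generalizes: each stage of `_get_filetype` should treat both `None` and `FileType.UNK` as "no result" and fall through. A self-contained sketch using a stand-in enum (since `unstructured` itself may not be installed; the real `FileType` lives in the library, and `filetype_from_extension` is an illustrative name):

```python
from enum import Enum

class FileType(Enum):
    # Stand-in for unstructured's FileType; assumed (as the review comment
    # anticipates) to return UNK for unrecognized extensions.
    UNK = "unk"
    PDF = "pdf"

    @classmethod
    def from_extension(cls, ext: str):
        return {".pdf": cls.PDF}.get(ext.lower(), cls.UNK)

def filetype_from_extension(ext: str):
    ext_type = FileType.from_extension(ext)
    # Guard both None and UNK so unknown extensions fall through to
    # content-based detection instead of being returned directly.
    if ext_type is not None and ext_type != FileType.UNK:
        return ext_type
    return None  # caller proceeds to content sniffing
```

With this guard, `filetype_from_extension(".xyz")` yields `None` so content-based detection still gets a chance, while `filetype_from_extension(".pdf")` short-circuits as before.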
❌ Cannot revive Devin session - the session is too old. Please start a new session instead.
fix(cdk): upgrade unstructured from 0.10.27 to 0.18.32
Summary
Bumps the `unstructured` document parsing library from 0.10.27 to 0.18.32 in the CDK's `file-based` extra. This is a large version jump (8 minor versions) that required migrating several removed/changed APIs in `unstructured_parser.py`:
- Removed lookup dictionaries (`EXT_TO_FILETYPE`, `FILETYPE_TO_MIMETYPE`, `STR_TO_FILETYPE`) → replaced with `FileType.from_extension()`, `filetype.mime_type`, `FileType.from_mime_type()`
- `detect_filetype` parameter renamed: `filename=` → `file_path=`
- `partition_pdf` now requires `unstructured_inference`: wrapped import in try/except so DOCX/PPTX parsing still works without it
- `_get_filetype` detection order changed: extension-based detection now runs before content sniffing (was the opposite)
Updates since last revision
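The API migration in the bullets above can be summarized as a before/after sketch (call sites are illustrative; the actual changes live in `unstructured_parser.py`):

```diff
-# unstructured 0.10.x: module-level lookup tables
-filetype = EXT_TO_FILETYPE.get(extension)
-mime_type = FILETYPE_TO_MIMETYPE[filetype]
-detected = detect_filetype(filename=remote_file.uri)
+# unstructured 0.18.x: FileType class methods and properties
+filetype = FileType.from_extension(extension)
+mime_type = filetype.mime_type
+detected = detect_filetype(file_path=remote_file.uri)
```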
- `FileType.from_mime_type()` fallthrough: `from_mime_type()` returns `None` for unknown types (not `ValueError` as initially assumed). Added a null check and `FileType.UNK` guard so files with ambiguous MIME types (e.g., `application/octet-stream`) correctly fall through to extension/content-based detection instead of returning `None` immediately.
- Test patching: `unstructured.partition.pdf` can no longer be imported without `unstructured_inference`, so test `@patch` decorators now target the global variables in `unstructured_parser` instead of the source modules. Added an `_import_unstructured` mock to prevent the real import from overwriting test mocks.
- `pi-heif` dependency: Per CodeRabbit feedback, removed the `pi-heif` optional dependency as it's not directly imported by the CDK.
- `pdfminer.six` pin: Changed from exact `20221105` to `>=20231228` for compatibility with unstructured 0.18.32. Note: unstructured 0.18.32's PDF module imports from `pdfminer.psexceptions`, which was added in pdfminer.six `20250327`. If PDF parsing is needed, ensure `pdfminer.six>=20250327` is installed (this happens automatically when `unstructured[pdf]` is installed).
Production Impact — Backward Compatibility Scope
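A self-contained sketch of the lazy-import guard described above; the injectable `partition_fn` parameter and `parse_pdf` name are illustrative devices, not the PR's exact code:

```python
partition_pdf = None
try:
    # Fails with ImportError when unstructured_inference is missing;
    # DOCX/PPTX parsing is unaffected because only this symbol is guarded.
    from unstructured.partition.pdf import partition_pdf
except ImportError:
    pass

def parse_pdf(file_obj, partition_fn=partition_pdf):
    """Raise a clear error only when a PDF is actually parsed."""
    if partition_fn is None:
        raise RuntimeError("PDF parsing requires unstructured_inference")
    return partition_fn(file=file_obj)
```

The key property is that the failure is deferred from import time to the first PDF parse, so a deployment without `unstructured_inference` can still serve DOCX/PPTX streams.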
Queried the production database to assess the blast radius. Of the original ~610 source actors flagged by a broad text search for `document_file_type_handler`, only 115 connections across 69 workspaces actually have streams configured with `"filetype": "unstructured"`.
Connections by Connector (total):
Sync Recency:
Only 12 connections are actively syncing today with unstructured parsing. The other 103 connections either:
Breaking Changes for Active Connections
For the ~12 active connections, the following will break:
- PDFs will emit `_ab_source_file_parse_error` instead of `content` unless `unstructured_inference` is installed
- `"# Content"` → `"Content"` (markdown heading removed)
- PDF support now requires the `unstructured[pdf]` extra
- `libGL.so.1` and `libglib2.0-0` are required for PDF inference: `apt-get install libgl1-mesa-glx libglib2.0-0`
For PDF parsing (local mode):
- Install `unstructured[pdf]` instead of just `unstructured[docx,pptx]`
- `apt-get install -y libgl1-mesa-glx libglib2.0-0`
- Ensure `pdfminer.six>=20250327` is installed
- `pip install torch --index-url https://download.pytorch.org/whl/cpu` before installing unstructured (reduces ~10GB)
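Collected from the bullets above, a local-mode upgrade might look like the following (package names and versions are taken from the PR notes; verify them against your own base image before relying on this):

```shell
# System libraries needed for PDF inference
apt-get install -y libgl1-mesa-glx libglib2.0-0

# CPU-only torch first, to avoid the ~10GB CUDA download
pip install torch --index-url https://download.pytorch.org/whl/cpu

# The pdf extra pulls in unstructured_inference; pin pdfminer.six explicitly
# to guarantee the pdfminer.psexceptions module is available
pip install 'unstructured[pdf]' 'pdfminer.six>=20250327'
```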
unstructured_inferenceFor DOCX/PPTX only:
unstructured_inferenceReview & Testing Checklist for Human
- `unstructured_inference`: This is a breaking change. PDFs now emit `_ab_source_file_parse_error` instead of `content` unless `unstructured_inference` is installed. Verify this is acceptable for downstream connectors (source-s3, source-gcs, source-sharepoint-enterprise, etc.). The scenario tests have been updated to expect parse errors for PDFs.
- `pdfminer.six` version compatibility: The pin is `>=20231228`, but `pdfminer.psexceptions` (required by unstructured 0.18.32's PDF module) was added in `20250327`. If someone installs `unstructured[pdf]` with a pdfminer.six version between these, PDF parsing will fail. Consider tightening the pin to `>=20250327`.
- `_get_filetype` detection order change: extension-based detection (`FileType.from_extension`) now runs before content sniffing (`detect_filetype(file=...)`). Confirm this doesn't change behavior for ambiguous files.
- `"# Content"` → `"Content"` (markdown heading removed). Confirm this is expected behavior from the unstructured upgrade.
Notes
- An earlier branch `devin/1771342600-bump-unstructured-0.18.18` contains similar changes targeting 0.18.18. This PR targets the latest (0.18.32) instead and includes additional fixes (correct `from_mime_type` handling, `pdfminer.six` pin update).
- The `partition_pdf` import is now gracefully handled — if `unstructured_inference` isn't installed, PDF parsing will be unavailable but DOCX/PPTX will still work. This is a behavioral change from the old code, which required all three partition functions to be available.
- New transitive dependencies are pulled in (`aiofiles`, `unstructured-client`, `webencodings`, etc.)
- Parser unit tests (`test_unstructured_parser.py`) pass locally with the new version.
Link to Devin run: https://app.devin.ai/sessions/c5bdff87617345b0bdbe574512f84953
Requested by: Ryan Waskewich (@rwask)
Summary by CodeRabbit