fix(cdk): upgrade unstructured from 0.10.27 to 0.18.32 #908

Open
Ryan Waskewich (rwask) wants to merge 8 commits into main from
devin/1771425511-bump-unstructured-to-latest

Conversation


@rwask Ryan Waskewich (rwask) commented Feb 18, 2026

fix(cdk): upgrade unstructured from 0.10.27 to 0.18.32

Summary

Bumps the unstructured document parsing library from 0.10.27 to 0.18.32 in the CDK's file-based extra. This is a large version jump (8 minor versions) that required migrating several removed/changed APIs in unstructured_parser.py:

  • Removed dict-based lookups (EXT_TO_FILETYPE, FILETYPE_TO_MIMETYPE, STR_TO_FILETYPE) → replaced with FileType.from_extension(), filetype.mime_type, FileType.from_mime_type()
  • detect_filetype parameter renamed: filename= → file_path=
  • partition_pdf now requires unstructured_inference: wrapped import in try/except so DOCX/PPTX parsing still works without it
  • _get_filetype detection order changed: extension-based detection now runs before content sniffing (was the opposite)

Updates since last revision

  • Fixed FileType.from_mime_type() fallthrough: from_mime_type() returns None for unknown types (not ValueError as initially assumed). Added null check and FileType.UNK guard so files with ambiguous MIME types (e.g., application/octet-stream) correctly fall through to extension/content-based detection instead of returning None immediately.
  • Updated test mock targets: unstructured.partition.pdf can no longer be imported without unstructured_inference, so test @patch decorators now target the global variables in unstructured_parser instead of the source modules. Added _import_unstructured mock to prevent the real import from overwriting test mocks.
  • Removed pi-heif dependency: Per CodeRabbit feedback, removed the pi-heif optional dependency as it's not directly imported by the CDK.
  • Updated pdfminer.six pin: Changed from exact 20221105 to >=20231228 for compatibility with unstructured 0.18.32. Note: unstructured 0.18.32's PDF module imports from pdfminer.psexceptions which was added in pdfminer.six 20250327. If PDF parsing is needed, ensure pdfminer.six>=20250327 is installed (this happens automatically when unstructured[pdf] is installed).
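The resulting detection fallthrough might look like this sketch, which uses stand-in lookup dictionaries in place of unstructured's FileType.from_mime_type, FileType.from_extension, and content sniffing:

```python
from enum import Enum


class FileType(Enum):  # stand-in for unstructured's FileType enum
    UNK = "unknown"
    PDF = "pdf"
    DOCX = "docx"


MIME_MAP = {"application/pdf": FileType.PDF}
EXT_MAP = {"pdf": FileType.PDF, "docx": FileType.DOCX}


def get_filetype(mime_type, extension, sniff):
    # 1. MIME type first: from_mime_type() returns None for unknown types
    #    (not ValueError), and can also return FileType.UNK, so both cases
    #    must fall through rather than being returned.
    ft = MIME_MAP.get(mime_type)
    if ft is not None and ft != FileType.UNK:
        return ft
    # 2. Extension next (runs before content sniffing as of 0.18.x).
    ext_type = EXT_MAP.get(extension)
    if ext_type is not None and ext_type != FileType.UNK:
        return ext_type
    # 3. Content-based sniffing as the last resort.
    return sniff()
```

With this guard, a file uploaded as application/octet-stream but named `report.docx` is still resolved by its extension instead of being rejected at the MIME step.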

Production Impact — Backward Compatibility Scope

Queried the production database to assess the blast radius. Of the original ~610 source actors flagged by a broad text search for document_file_type_handler, only 115 connections across 69 workspaces actually have streams configured with "filetype": "unstructured".

Connections by Connector (total):

  • Google Drive: 92 (80%)
  • S3: 14 (12%)
  • Azure Blob Storage: 4 (3.5%)
  • SharePoint Enterprise: 3 (2.6%)
  • GCS: 1 (0.9%)
  • SFTP Bulk: 1 (0.9%)

Sync Recency:

  • Active (0–1 days): 12 (10%)
  • Recent (2–7 days): 0 (0%)
  • Last month (8–30 days): 1 (1%)
  • Stale (31–90 days): 6 (5%)
  • Dormant (90+ days): 6 (5%)
  • Never synced successfully: 90 (78%)

⚠️ Real-world blast radius is extremely limited

Only 12 connections are actively syncing today with unstructured parsing. The other 103 connections either:

  • Never successfully synced (90 connections / 78%) — likely test/sandbox setups or abandoned configurations
  • Haven't synced in over a week (13 connections) — stale or dormant

Breaking Changes for Active Connections

For the ~12 active connections, the following will break:

| Change | Impact | Who is affected |
| --- | --- | --- |
| PDF parsing requires `unstructured_inference` | PDFs emit `_ab_source_file_parse_error` instead of content | Any connection parsing PDF files with local processing mode |
| DOCX output format changed | `"# Content"` → `"Content"` (markdown heading removed) | Downstream consumers expecting markdown headings in DOCX output |
| Connector image size +12GB | Images balloon from ~1.4GB to ~13.7GB when PDF support is added | All connectors that add the `unstructured[pdf]` extra |
| System library dependencies | `libGL.so.1` and `libglib2.0-0` required for PDF inference | Connector Dockerfiles need `apt-get install libgl1-mesa-glx libglib2.0-0` |

Upgrade Path for Affected Customers

  1. For PDF parsing (local mode):

    • Connector images must install unstructured[pdf] instead of just unstructured[docx,pptx]
    • Add system deps: apt-get install -y libgl1-mesa-glx libglib2.0-0
    • Ensure pdfminer.six>=20250327 is installed
    • To minimize image size: Use CPU-only PyTorch: pip install torch --index-url https://download.pytorch.org/whl/cpu before installing unstructured (reduces ~10GB)
  2. For PDF parsing (API mode):

    • No changes needed — API mode doesn't require unstructured_inference
    • Recommend customers use API mode if image size is a concern
  3. For DOCX/PPTX only:

    • No changes needed — these work without unstructured_inference

Review & Testing Checklist for Human

  • ⚠️ PDF parsing requires unstructured_inference: This is a breaking change. PDFs now emit _ab_source_file_parse_error instead of content unless unstructured_inference is installed. Verify this is acceptable for downstream connectors (source-s3, source-gcs, source-sharepoint-enterprise, etc.). The scenario tests have been updated to expect parse errors for PDFs.
  • Verify pdfminer.six version compatibility: The pin is >=20231228 but pdfminer.psexceptions (required by unstructured 0.18.32's PDF module) was added in 20250327. If someone installs unstructured[pdf] with a pdfminer.six version between these, PDF parsing will fail. Consider tightening the pin to >=20250327.
  • Test with downstream file-based connectors to verify no regressions in actual document parsing output. No integration testing has been performed — only unit tests pass.
  • Verify _get_filetype detection order change: extension-based detection (FileType.from_extension) now runs before content sniffing (detect_filetype(file=...)). Confirm this doesn't change behavior for ambiguous files.
  • Verify DOCX content format change: Scenario tests show "# Content" → "Content" (markdown heading removed). Confirm this is expected behavior from the unstructured upgrade.

Notes

  • There's an existing branch devin/1771342600-bump-unstructured-0.18.18 with similar changes targeting 0.18.18. This PR targets the latest (0.18.32) instead and includes additional fixes (correct from_mime_type handling, pdfminer.six pin update).
  • The partition_pdf import is now gracefully handled — if unstructured_inference isn't installed, PDF parsing will be unavailable but DOCX/PPTX will still work. This is a behavioral change from the old code which required all three partition functions to be available.
  • The poetry.lock diff is large due to new transitive dependencies (aiofiles, unstructured-client, webencodings, etc.).
  • Unit tests (27 in test_unstructured_parser.py) pass locally with the new version.

Link to Devin run: https://app.devin.ai/sessions/c5bdff87617345b0bdbe574512f84953
Requested by: Ryan Waskewich (@rwask)

Summary by CodeRabbit

  • Improvements
    • Upgraded document parsing libraries for broader file-type support and more robust MIME/extension-based detection.
    • Detection now prefers MIME type and falls back to extension before content-based checks.
    • Per-file-type availability checks surface clearer, user-friendly parse errors when optional parsers are missing.
  • Bug Fixes
    • Remote multipart uploads now send correct MIME types.
  • Tests
    • Updated tests to reflect new parse-error behavior and revised import/mocking approach.

Co-Authored-By: Ryan Waskewich <ryan.waskewich@airbyte.io>
@devin-ai-integration

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

@rwask Ryan Waskewich (rwask) marked this pull request as ready for review February 18, 2026 14:47
@github-actions

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.


Testing This CDK Version

You can test this version of the CDK using the following:

```shell
# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1771425511-bump-unstructured-to-latest#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1771425511-bump-unstructured-to-latest
```

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment

devin-ai-integration[bot]

This comment was marked as resolved.


coderabbitai bot commented Feb 18, 2026

📝 Walkthrough

Walkthrough

Refactors unstructured file parsing to use per-filetype availability checks and lazy PDF import failure handling, changes filetype detection order (MIME → extension → content), updates multipart upload MIME usage to use FileType.mime_type, bumps unstructured and pdfminer.six versions, and adjusts tests and test expectations to mock and reflect the new lazy-import and per-filetype parse-error behavior.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Unstructured Parser Refactoring (`airbyte_cdk/sources/file_based/file_types/unstructured_parser.py`) | Replaced legacy mappings with FileType/detect_filetype; prefer FileType.from_mime_type → extension → content in _get_filetype; removed global unstructured availability checks in favor of per-filetype guards; lazy-import partition_pdf with ImportError handled by disabling PDF parsing and logging; use filetype.mime_type for multipart uploads; added explicit parse errors when partition functions are unavailable. |
| Dependency Update (`pyproject.toml`) | Bumped unstructured from 0.10.27 to 0.18.32 (extras ["docx","pptx"] unchanged) and relaxed pdfminer.six to >=20231228 for optional PDF support. |
| Unit Tests: Parser Patching (`unit_tests/sources/file_based/file_types/test_unstructured_parser.py`) | Updated test patches to target the parser's internal wrappers (unstructured_partition_* and _import_unstructured); added mock_import_unstructured fixture/parameter to tests to stub lazy import behavior. |
| Unit Tests: Scenario Expectations (`unit_tests/sources/file_based/scenarios/unstructured_scenarios.py`) | Updated expected outputs to surface _ab_source_file_parse_error for PDF inputs when the inference package is absent; adjusted content expectations for some DOCX/PDF cases to reflect new per-filetype parse-error propagation or plain-text parsing differences. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Would you like a simple Mermaid sequence diagram showing the new detection and per-filetype availability flow? wdyt?

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 37.50%, below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: upgrading the unstructured package from 0.10.27 to 0.18.32 and fixing related API incompatibilities in the file-based parser. |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


coderabbitai[bot]

This comment was marked as resolved.


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
airbyte_cdk/sources/file_based/file_types/unstructured_parser.py (1)

420-427: detect_filetype(file_path=...) with a remote URI — handled by try/except.

Since remote_file.uri is a remote path (e.g., s3://...), detect_filetype will likely fail trying to access it locally. The broad except Exception: pass catches this gracefully and falls through to extension-based detection. This works, but the silent swallowing of all exceptions could hide unexpected failures. Would it be worth narrowing to except (FileNotFoundError, OSError) to surface truly unexpected errors, wdyt?
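The suggested narrowing might look like this sketch, where sniff is a hypothetical stand-in for detect_filetype(file_path=remote_file.uri); note that FileNotFoundError is already a subclass of OSError, so catching OSError covers both:

```python
from typing import Callable, Optional


def detect_or_none(uri: str, sniff: Callable[[str], Optional[str]]) -> Optional[str]:
    try:
        return sniff(uri)
    except OSError:
        # Remote URIs (e.g. s3://...) are not local paths; fall through to
        # extension-based detection instead of swallowing every exception.
        return None
```

Any exception other than OSError (e.g. a ValueError raised inside detection) now propagates instead of being silently ignored.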

Duplicate comments:
In `@airbyte_cdk/sources/file_based/file_types/unstructured_parser.py`:
- Around line 432-435: The current extension extraction using
remote_file.uri.split(".")[-1] is brittle for URIs with dots in directory names
(e.g., "s3://bucket/folder.name/file") and can return incorrect values; update
the logic that computes extension (the lines assigning extension and calling
FileType.from_extension) to parse only the path portion of the URI and then use
os.path.splitext or pathlib.PurePosixPath to get the suffix, e.g., obtain the
path via urllib.parse.urlparse(remote_file.uri).path (or strip any
query/fragment), call os.path.splitext or PurePosixPath(path).suffix to get a
single leading dot extension (lowercased), then pass that to
FileType.from_extension and keep the existing return behavior.

---

Nitpick comments:
In `@airbyte_cdk/sources/file_based/file_types/unstructured_parser.py`:
- Around line 420-427: The try/except around
detect_filetype(file_path=remote_file.uri) currently swallows all exceptions
which can hide unexpected failures; replace the broad except Exception with a
narrower except (FileNotFoundError, OSError) to only ignore missing/local-path
errors when detect_filetype is called with a remote URI, and let other
exceptions propagate (or re-raise/log them) so unexpected errors in
detect_filetype are visible; locate the block using detect_filetype and
remote_file.uri in unstructured_parser.py and update the exception handling
accordingly.

@github-actions

github-actions bot commented Feb 18, 2026

PyTest Results (Fast)

3 934 tests  ±0   3 922 ✅ ±0   6m 53s ⏱️ -16s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 6ca00a7. ± Comparison against base commit 0e57414.

♻️ This comment has been updated with latest results.

coderabbitai[bot]

This comment was marked as resolved.

@github-actions

github-actions bot commented Feb 18, 2026

PyTest Results (Full)

3 937 tests  ±0   3 925 ✅ ±0   10m 56s ⏱️ -18s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 6ca00a7. ± Comparison against base commit 0e57414.

♻️ This comment has been updated with latest results.

devin-ai-integration bot and others added 3 commits February 18, 2026 15:18

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
unit_tests/sources/file_based/scenarios/unstructured_scenarios.py (2)

465-521: ⚠️ Potential issue | 🟡 Minor

corrupted_file_scenario no longer exercises corrupted-file handling — could it be reframed or split?

With the new lazy-import guard, PDF parsing now short-circuits to the unstructured_inference missing error before the file bytes are ever read. This means corrupted_file_scenario and simple_unstructured_scenario both traverse the exact same code path for PDFs. The "___ corrupted file ___" bytes are completely irrelevant to the outcome, and this scenario provides zero additional coverage over the PDF case in simple_unstructured_scenario.

Two options to consider — wdyt about either of these?

  1. Rename / reframe the scenario to something like pdf_without_inference_scenario to accurately describe what it's actually testing now.
  2. Add a companion scenario (guarded by a check that unstructured_inference is available) that validates the truly-corrupted-file error path — otherwise that branch is untested.

13-14: ⚠️ Potential issue | 🟡 Minor

Update NLTK resource names to match NLTK 3.9.1 compatibility.

The test file downloads "punkt" and "averaged_perceptron_tagger" (lines 13-14), but your production code in airbyte_cdk/sources/file_based/file_types/unstructured_parser.py already uses the NLTK 3.9+ resource names: "punkt_tab" and "averaged_perceptron_tagger_eng". With NLTK 3.9.1 pinned in poetry.lock, consider updating the test file to match:

```python
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("averaged_perceptron_tagger_eng")
```

Or, sync the test setup with your production initialization pattern for consistency. The old resource names may download successfully but populate the wrong data directories, potentially causing lookup errors at test runtime. Wdyt?


@ryanwasko

Security Note: CVE-2025-64712 (CVSS 9.8)
Worth flagging that there's a critical path traversal vulnerability (CVE-2025-64712) in unstructured versions prior to 0.18.18. The bug is in the .msg file parser, where attachment filenames aren't sanitized before being written to a temp directory. A crafted .msg attachment with a traversal path (e.g. ../../root/.ssh/authorized_keys) lets an attacker write arbitrary files to the filesystem, which can escalate to full RCE.
Disclosed by Cyera Research in February 2026, patched in 0.18.18. Our current pinned version (0.10.27) is vulnerable. This upgrade to 0.18.32 resolves it.
Ref: https://www.cyera.com/research/inside-destructured---critical-vulnerability-in-unstructured-io-cve-2025-64712


@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 1 new potential issue.

View 9 additional findings in Devin Review.


Comment on lines +434 to +435:

```python
if ext_type is not None:
    return ext_type
```

🟡 Missing FileType.UNK check for FileType.from_extension result, inconsistent with from_mime_type check

In _get_filetype, the from_mime_type call at line 410 correctly filters out FileType.UNK (if ft is not None and ft != FileType.UNK), but the from_extension call at line 433-434 only checks for None (if ext_type is not None). If FileType.from_extension returns FileType.UNK for an unrecognized extension, it would be returned directly, bypassing the content-based detection fallback at line 437. This would cause files with uncommon or missing extensions to fail with an unsupported file type error, even though content-based detection would have correctly identified them. The old code (if extension in EXT_TO_FILETYPE) could never return FileType.UNK since it only returned values actually mapped in the dictionary.

Suggested change:

```diff
-if ext_type is not None:
-    return ext_type
+if ext_type is not None and ext_type != FileType.UNK:
+    return ext_type
```

@devin-ai-integration

❌ Cannot revive Devin session - the session is too old. Please start a new session instead.
