fix: use PyArrowFileIO for S3 access in get_dbt_model_as_dataframe #2218
Merged
Pull request overview
This PR addresses production Dagster runs hanging during S3 reads when get_dbt_model_as_dataframe loads Iceberg tables via AWS Glue, by switching PyIceberg’s S3 access away from the default fsspec/aiobotocore path to PyArrow’s native S3 client.
Changes:
- Configure GlueCatalog to use pyiceberg.io.pyarrow.PyArrowFileIO for S3 access in get_dbt_model_as_dataframe.
- Update the function’s docstring to reflect pl.LazyFrame semantics and document the aiobotocore-related hang context.
What are the relevant tickets?
NA
These Dagster runs hang in production:
https://pipelines.odl.mit.edu/locations/lakehouse/jobs/instructor_onboarding_daily_job/runs
https://pipelines.odl.mit.edu/assets/reporting/student_risk_probability?view=events
Description (What does it do?)
Problem
After the aiobotocore upgrade from 3.4.0 → 3.5.0 (introduced in #2178), Dagster runs that use get_dbt_model_as_dataframe hang indefinitely.
Root cause: aiobotocore's async event loop thread was populating botocore's lazy-loader cache, which blocked all pending S3 coroutines without raising a timeout or an error.
Fix
Switch GlueCatalog to use PyArrowFileIO (pyiceberg.io.pyarrow.PyArrowFileIO) instead of the default FsspecFileIO. PyArrow uses a native C++ S3 client and bypasses aiobotocore entirely, eliminating the hang.
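For reviewers unfamiliar with how PyIceberg selects its FileIO, here is a minimal sketch of the idea, not the code in this PR. It assumes the catalog is created through pyiceberg.catalog.load_catalog and relies on PyIceberg's py-io-impl catalog property. The helper names glue_catalog_properties and load_glue_catalog and the catalog name "glue" are hypothetical, and the actual change inside get_dbt_model_as_dataframe may construct GlueCatalog differently.

```python
# Catalog property PyIceberg reads to choose a FileIO implementation.
PY_IO_IMPL = "py-io-impl"
# Fully qualified class name of the PyArrow-backed FileIO.
PYARROW_FILE_IO = "pyiceberg.io.pyarrow.PyArrowFileIO"


def glue_catalog_properties():
    """Build catalog properties that pin S3 access to PyArrow.

    Hypothetical helper for illustration only; not part of the PR.
    """
    return {
        "type": "glue",               # use the AWS Glue catalog implementation
        PY_IO_IMPL: PYARROW_FILE_IO,  # request PyArrowFileIO explicitly
    }


def load_glue_catalog(name="glue"):
    """Load a Glue catalog whose table reads go through PyArrowFileIO.

    Hypothetical helper; requires pyiceberg and AWS credentials at call time.
    """
    # Imported lazily so the properties helper works without pyiceberg installed.
    from pyiceberg.catalog import load_catalog

    return load_catalog(name, **glue_catalog_properties())
```

Pinning the implementation through a catalog property keeps the choice explicit instead of depending on whichever FileIO PyIceberg would otherwise pick for the s3 scheme.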
How can this be tested?
The hang only reproduces in a live Dagster environment with aiobotocore 3.5.0+, so it cannot be fully covered by a local docker compose up. To verify manually:
Additional Context